key: cord-0325420-z8ix4kz7
authors: Robert, Alexis; Kucharski, Adam J; Gastañaduy, Paul A; Paul, Prabasaj; Funk, Sebastian
title: Probabilistic reconstruction of measles transmission clusters from routinely collected surveillance data
date: 2020-02-15
journal: nan
DOI: 10.1101/2020.02.13.20020891
sha: 38bab3f17b186bd8ee289e5d135bb7d500ef500b
doc_id: 325420
cord_uid: z8ix4kz7

Pockets of susceptibility resulting from spatial or social heterogeneity in vaccine coverage can drive measles outbreaks, as cases imported into such pockets are likely to cause further transmission and lead to large transmission clusters. Characterising the dynamics of transmission is essential for identifying which individuals and regions might be most at risk. As data from detailed contact tracing investigations are not available in many settings, we combined age, location, genotype, and onset date of cases in order to probabilistically reconstruct the importation status and transmission clusters within a newly developed R package called o2geosocial. We compared our inferred cluster size distributions to 737 transmission clusters identified through detailed contact-tracing in the United States between 2001 and 2016. We were able to reconstruct the importation status of the cases and found good agreement between the inferred and reference clusters. The results were improved when the contact-tracing investigations were used to set the importation status before running the model. Spatial heterogeneity in vaccine coverage is difficult to measure directly. Our approach was able to highlight areas with potential for local transmission using a minimal number of variables and could be applied to assess the intensity of ongoing transmission in a region.

from unrelated cases can be very close genetically and genetic sequences from measles cases are not usually indicative of direct transmission links [27, 28].

As measles is highly infectious, under-immunized communities (also called pockets of susceptibles) resulting from local heterogeneity in vaccine coverage can lead to large, long-lasting outbreaks [30-34].

Detecting these pockets of susceptibles can be challenging, as historical local values of coverage throughout a given country are rarely available. The size distribution of transmission trees resulting from each importation during outbreaks (otherwise known as the cluster size distribution) will depend both on individual factors (e.g. age of the imported case, which might affect contact patterns) and community factors (e.g. the history of coverage in the area) [35, 36]. The size of a cluster can therefore reflect the level of susceptibility of individuals directly and indirectly connected to the index case [37, 38].

Here we introduced a model combining age, location, genotype, and rash onset date of cases to reconstruct probabilistic transmission trees. We chose these features to make the model applicable to a wide range of settings, as they are commonly reported and informative on transmission. We wrote the R package o2geosocial to conduct inference on individual-level data using this model. It is based on the package outbreaker2 and is designed for outbreaks with partial sampling of cases, or uninformative genetic sequences, such as measles outbreaks [9, 39]. We used the likelihood of transmission links between different cases to estimate their importation status. We compared the inferred importation [...]

We allowed for missing generations between cases due to unreported individuals, and $\kappa_{ij}$ corresponds to the number of generations between cases $i$ and $j$. We calculated the temporal probability of transmission between $i$ and $j$ from the number of days between the dates of infection of the two cases, $t_i$ and $t_j$, and the generation time of the disease, $w(t)$. This probability of infection was quantified by $w^{(\kappa_{ij})}(t_i - t_j)$, with $w^{(\kappa)} = \prod_{\kappa} w$, where $\prod$ is the convolution operator. We used an exponential distribution to quantify the probability of observing missing generations between $i$ and $j$ given the conditional report ratio, which quantifies the probability of missing generations between two connected cases in a cluster.
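As an illustration of the temporal component, the following Python sketch convolves a discretised generation time distribution with itself to obtain the delay distribution across several generations. It is a minimal sketch, not the o2geosocial implementation: the function names are illustrative, and the probability mass function is a hypothetical example rather than the measles generation time used in the study.

```python
def convolve(p, q):
    """Discrete convolution of two probability mass functions indexed by day."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


def generation_time_kappa(w, kappa):
    """Delay distribution across `kappa` generations (w convolved with itself)."""
    if kappa < 1:
        raise ValueError("kappa must be at least 1 (direct transmission)")
    result = list(w)
    for _ in range(kappa - 1):
        result = convolve(result, w)
    return result


def temporal_probability(w, t_infectee, t_infector, kappa):
    """Probability of the observed delay between two infection dates (in days)."""
    delay = t_infectee - t_infector
    w_kappa = generation_time_kappa(w, kappa)
    if delay < 0 or delay >= len(w_kappa):
        return 0.0
    return w_kappa[delay]


# Hypothetical generation time: index = delay in days, mass on days 8 to 14.
w = [0.0] * 8 + [0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.05]

p_direct = temporal_probability(w, 11, 0, kappa=1)       # direct transmission
p_one_missing = temporal_probability(w, 22, 0, kappa=2)  # one unreported case
```

With this toy distribution, a delay of 11 days is only compatible with direct transmission, whereas a delay of 22 days requires at least one unreported intermediate case.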

The conditional report ratio does not correspond to the overall report ratio of an outbreak, as entire missing clusters, or unreported cases infected after the last case or before the ancestor of a cluster, are not included in it. The "ancestor" is the earliest identified case of a cluster.

The age component of the likelihood was defined as the probability of transmission between the age groups of the two cases. This probability corresponds to the proportion of contacts to one age group that originated from the other, and can be deduced from studies such as POLYMOD [36]. We defined the genotype component as the probability of observing the pathogen genotype of case i in the tree containing case j. A transmission tree can only contain cases of a single measles virus genotype, or cases with unreported genotype.

Five of the proposals had already been implemented in the outbreaker2 package and were adapted to this setting: i) change the number of generations between two cases; ii) change the conditional report ratio; iii) change the time of infection; iv) change the infector of a case (if the case is not the ancestor of a tree); v) swap infector and infectee (if neither is the ancestor of a tree).

We added two proposals to change a and b, the spatial kernel parameters. For each proposal, the probability of transmission between every pair of geographical units was recalculated with the new values. Depending on the number of geographical units, this calculation considerably slowed down the algorithm. Therefore, when a or b was estimated, we limited the maximal number of missing generations to 1 (a maximum of two generations between connected cases). Finally, the last proposal was designed to change the ancestor of the tree whilst conserving the overall number of trees (Figure 1).

Unrelated measles cases stemming from different importations and different regions can be part of the same dataset. Grouping cases and excluding unrealistic transmission links reduces the number of possible trees and speeds up the MCMC runs. To do so, we listed each case's potential infectors using three criteria: i) the potential infectors must be of the same genotype as the case, or have unreported genotype; ii) the location of potential infectors must be less than a spatial threshold (in km) away from the case; and iii) the potential infectors must have been reported no earlier than a temporal threshold (in days) before the case. The temporal threshold should be determined from the maximum plausible generation time of the disease. The spatial threshold should be defined according to the relevance of long-distance transmissions. Cases with no potential infector were considered as importations. Otherwise, they were grouped together with i) their potential infectors and ii) cases with common potential infectors.
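The pre-clustering step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of o2geosocial: cases are hypothetical records with planar coordinates in km, onset dates are integer days, and potential infectors are additionally required to have an earlier onset than the case, which the text does not state explicitly.

```python
import math


def potential_infectors(cases, max_km, max_days):
    """List, for each case, the ids of cases meeting the three criteria."""
    out = {}
    for case in cases:
        out[case["id"]] = [
            other["id"]
            for other in cases
            if other["id"] != case["id"]
            # i) same genotype, or genotype unreported for either case
            and (case["genotype"] is None
                 or other["genotype"] is None
                 or case["genotype"] == other["genotype"])
            # ii) closer than the spatial threshold
            and math.dist(case["xy"], other["xy"]) < max_km
            # iii) reported before the case, within the temporal threshold
            #      (the "before" requirement is an assumption of this sketch)
            and 0 < case["onset"] - other["onset"] < max_days
        ]
    return out


def group_cases(infectors):
    """Group cases linked through potential infectors (union-find)."""
    parent = {case_id: case_id for case_id in infectors}

    def find(case_id):
        while parent[case_id] != case_id:
            parent[case_id] = parent[parent[case_id]]
            case_id = parent[case_id]
        return case_id

    for case_id, sources in infectors.items():
        for source in sources:
            parent[find(case_id)] = find(source)

    groups = {}
    for case_id in infectors:
        groups.setdefault(find(case_id), set()).add(case_id)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical cases: onset in days, coordinates in km.
cases = [
    {"id": "A", "onset": 0, "genotype": "D8", "xy": (0.0, 0.0)},
    {"id": "B", "onset": 12, "genotype": None, "xy": (10.0, 0.0)},
    {"id": "C", "onset": 25, "genotype": "D8", "xy": (20.0, 5.0)},
    {"id": "D", "onset": 14, "genotype": "B3", "xy": (5.0, 5.0)},
    {"id": "E", "onset": 13, "genotype": "D8", "xy": (500.0, 0.0)},
]

infectors = potential_infectors(cases, max_km=100, max_days=30)
importations = [case_id for case_id, src in infectors.items() if not src]
groups = group_cases(infectors)
```

In this toy example, cases A and E have no potential infector and are treated as importations. Case D (genotype B3) is grouped with the D8 cases only through case B, whose genotype is unreported, which shows why groups are candidate sets for the MCMC rather than final clusters.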

After grouping the cases, we estimated their importation status and the cluster size distribution using two runs of MCMC (Figure 2). The first run was shorter and aimed at removing the most unlikely connections within each group, as they can reflect unrealistic estimates for incubation periods or generation times and corrupt the estimation of the date of infection. We defined a reference threshold, whereby if the individual value of log-likelihood L_i was worse than the threshold, then the connection between case i and its index was considered unlikely. In outbreaker2, the threshold was a relative value, defined from a quantile of the individual log-likelihoods. In o2geosocial, it can be a relative value or an absolute value, chosen from the number of components of the likelihood. For each sample saved from the short run, we computed the number of unlikely connections n. If there was no iteration where all connections were better than the threshold, min(n) new importations were added to the initial tree for the long run (Figure 2).

medRxiv preprint doi: https://doi.org/10.1101/2020.02.13.20020891. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

[...] Figure S1). The importation status, 5-year age group, onset date, county, and state of residence were fully reported for 2,077 cases. The 21 cases with missing data were discarded. 25% of the cases were classified as importations. 39% of the cases had their genotype reported. The dataset of 2,077 cases is referred to as the "reference dataset" in the results section, and was used to evaluate the performance of the inference method.

Among cases with complete data, 737 independent clusters, containing 1 to 380 cases, were reconstructed through contact tracing investigations. Not every identified case could be linked to an importation, and some transmission clusters contained multiple imported cases (e.g. when related individuals travelled together to a foreign country and were infected there). Out of the 737 reference clusters, 38 had several cases classified as importations, and 256 had none identified.

The distributions and priors used in this study are listed in Table 1. As no studies quantifying the probability of age-specific contacts have been carried out in the United States, we used the estimates from the POLYMOD study in the UK [36]. The incubation period and the generation time of measles were taken from previous studies [47-49]. We used the population centroid of each county to compute the distance matrix [50]. We used a beta distribution as the prior of the conditional report ratio [8]. The mean of the prior distribution was calculated using the number of clusters whose first case was not classified as an imported case, meaning the investigations were not able to trace back to the first imported case. As there was no prior information on the possible values of the spatial parameters, we used uniform distributions as priors for a and b.

For pre-clustering of cases, we set the temporal threshold to 30 days, which is above the 97.5% upper quantile of the generation time with a missing generation. We were interested in local transmission to describe the impact of an imported case on a community, but we only had information on the county of residence for each case. Counties are large geographical units: the average county land area is 2,911 km² and the maximum values reach 50,000 km². Therefore, we set the spatial threshold to 100 km to exclude long-distance transmission, while still allowing for cross-county transmission.

Finally, we tested several relative and absolute importation thresholds. Absolute values were calculated from a factor k, multiplied by the number of components in the likelihood, excluding the binary genetic component. Tested values were k = 0.05 (threshold of −15) and k = 0.1 (threshold of −11). Connections were considered unlikely if the log-likelihood was worse than the threshold. Relative values were quantiles of all recorded log-likelihoods in the sampled trees (Table 1).
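One reading of the absolute threshold that is consistent with the reported values is the natural logarithm of k multiplied by the number of non-genetic likelihood components. The Python sketch below assumes five such components (an assumption, as the count is not restated here), which gives 5 × ln(0.05) ≈ −15 and 5 × ln(0.1) ≈ −11.5. The relative threshold is sketched as a nearest-rank quantile of recorded log-likelihoods. All function names are illustrative.

```python
import math


def absolute_threshold(k, n_components):
    """Assumed form: log of the factor k, scaled by the number of components."""
    return n_components * math.log(k)


def relative_threshold(log_likelihoods, quantile):
    """Nearest-rank empirical quantile of the recorded log-likelihoods."""
    ordered = sorted(log_likelihoods)
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1]


def unlikely_connections(case_log_likelihoods, threshold):
    """Cases whose connection to their infector is worse than the threshold."""
    return [case for case, ll in case_log_likelihoods.items() if ll < threshold]


# Assumption: five likelihood components once the genetic one is excluded.
N_COMPONENTS = 5
threshold_low = absolute_threshold(0.05, N_COMPONENTS)
threshold_high = absolute_threshold(0.10, N_COMPONENTS)

# Hypothetical per-case log-likelihoods from one saved MCMC sample.
sample = {"case_1": -3.0, "case_2": -16.2, "case_3": -12.0}
flagged_low = unlikely_connections(sample, threshold_low)
flagged_high = unlikely_connections(sample, threshold_high)
```

Raising k makes the threshold less negative, so more connections are flagged as unlikely and more cases become candidate importations.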

Using the contact tracing investigations, we considered three different initial distributions of the importation status. In scenario 1, there was no inference of the importation status of cases, and the first case of each epidemiological cluster was classified as an importation (ideal importation). In scenario 2, there was no inference of the importation status of cases, and all cases identified as importations in the contact tracing investigations were classified as importations (epidemiological importation). Finally, in scenario 3, the importation status of cases was inferred, using different thresholds, and using either no prior information on the importation status of cases or the importation status from the contact tracing investigations.

In order to compare the inferred and reference clusters, we calculated for each case i) the proportion of the reference cluster correctly inferred (sensitivity) and ii) the proportion of the inferred cluster that was part of the reference cluster (precision). These values were calculated at every iteration, and the median values were used to evaluate the fit obtained with different values of the threshold. We also compared the inferred cluster size distribution to the reference data. The credibility intervals for each case are reported in the Supplement (Supplement Figure S2).
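The two per-case scores can be computed as in the following Python sketch, which assumes that each clustering is available as a mapping from case identifier to cluster label for a single iteration. The function names and the toy data are illustrative.

```python
def cluster_members(assignment):
    """Map each cluster label to the set of cases it contains."""
    members = {}
    for case, cluster in assignment.items():
        members.setdefault(cluster, set()).add(case)
    return members


def sensitivity_precision(inferred, reference):
    """Per-case (sensitivity, precision) of the inferred clustering."""
    inferred_members = cluster_members(inferred)
    reference_members = cluster_members(reference)
    scores = {}
    for case in reference:
        inferred_set = inferred_members[inferred[case]]
        reference_set = reference_members[reference[case]]
        shared = len(inferred_set & reference_set)
        sensitivity = shared / len(reference_set)  # share of reference cluster recovered
        precision = shared / len(inferred_set)     # share of inferred cluster that is correct
        scores[case] = (sensitivity, precision)
    return scores


# Hypothetical example: a reference cluster of four cases split in two.
reference = {"a": "R1", "b": "R1", "c": "R1", "d": "R1", "e": "R2"}
inferred = {"a": "I1", "b": "I1", "c": "I2", "d": "I2", "e": "I2"}
scores = sensitivity_precision(inferred, reference)
```

Here case a is grouped with half of its reference cluster and no outsider (sensitivity 0.5, precision 1), whereas the reference singleton e is wrongly merged into a three-case cluster (sensitivity 1, precision 1/3).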

We clustered 2,077 measles cases reported in the United States between January 2001 and December 2016 using their onset date, age group, location and genotype. Using the contact tracing investigations, we considered three different initial importation status distributions: i) only the ancestors of each epidemiological cluster (first case of each cluster) were importations (ideal importation), ii) all cases classified as importations in the contact tracing investigations were importations (epidemiological importation), iii) no prior information on the importation status of cases. The importation status of the cases was therefore not probabilistically inferred in scenarios 1 and 2. The short preliminary run lasted 30,000 iterations and the long run 70,000 iterations. For each run, the trace of the posterior distribution shows the convergence of the algorithm (Supplement Figure S3).

In scenario 1, we did not infer the importation status of cases. The inferred cluster size distribution matched the contact tracing investigations (Figure 4A); 98% of the reference singletons were also isolated in the inferred clusters. For 94% (95% Credibility Interval: 91-98%) of cases, the inferred cluster had a sensitivity and precision above 75%, meaning more than 75% of the cases in the inferred cluster were in the reference cluster, and more than 75% of the cases in the reference cluster were in the inferred cluster (Figure 4B). For 80% (78-93%) of cases, the inferred clusters were a perfect match with the reference clusters. The cluster size distribution stratified by state was similar to the contact tracing investigations (Supplement Figure S4). Therefore, when each ancestor was considered as an importation, the inferred clusters were very close to the reference ones.

In scenario 2, we used the importation status distribution of cases reported in the contact tracing investigations (539 importations). Pre-clustering highlighted 165 cases with no potential infector, which were also classified as importations. We observed discrepancies between the inferred cluster size distribution and the reference one: among the 704 cases inferred as importations, 61 (9%) were not importations in the reference clusters. Furthermore, 94 cases were the ancestor of a reference cluster and were not classified as importations in the inferred clusters (13%). The overall cluster size distribution matched the reference distribution, but 111 reference singletons were inferred as part of transmission clusters (Figure 4A, Supplement Figure S5). Although the precision of the inferred cluster was above 75% for 93% (88-93%) of the cases, 31% (6-39%) had a sensitivity score below 0.5, meaning they were classified with less than half of their reference clusters (Figure 4C). The discrepancies observed in this scenario are due to inconsistencies between the importation status distribution and the clustering of cases in the contact tracing investigations, as reference clusters that gathered several importations were split into different inferred clusters in scenario 2.

In scenario 3, the importation status of cases was inferred from a threshold. For each case, if the log-likelihood was worse than the threshold, the connection between the case and its index was removed and the case was considered imported. Firstly, using an absolute factor k = 0.05 (threshold of −15), 586 (581-593) cases were classified as importations, and 361 (355-369) of them were singletons. These numbers are much lower than in the reference dataset, which contains 737 clusters and 539 singletons (Figure 5A, Supplement Figure S6). We observed very few misclassifications of importation status and singletons (15 (10-22) misclassified importations, 4 (0-14) misclassified singletons), and the cluster size distribution for clusters including two or more cases was very similar to the reference one. The precision of the reconstructed clusters was very high (above 75% for 88% (85-93%) of cases) (Figure 5B). Overall, the algorithm was not able to accurately identify importations and singletons as the threshold was too low to eliminate some unrealistic connections, but the inferred larger clusters matched their reference counterparts.

We then observed the impact of increasing the threshold on the inferred cluster size distribution. Runs obtained using an absolute threshold with k = 0.10 (threshold of −11) and a 95% relative threshold yielded very similar results. The number of cases inferred as importations was higher than in previous runs, while all remaining links showed good connection between cases. The number of importations was closer to the reference dataset, and the number of singletons was greater than the reference. Nevertheless, 11% (10-12%) of the inferred importations were not classified as importations in the reference clusters. Furthermore, the number of two-case chains was overestimated, and bigger clusters were likely to be split because of the removal of weaker connections. Therefore, increasing the threshold did not improve the cluster size distribution, as many importations in the reference clusters were not identified and the number of mismatches increased (Supplement Figure S7).

Finally, we combined prior information and inference of importation status. Cases considered as importations in the contact tracing investigations were set as importations, and we inferred the importation status of the remaining cases. We used a low threshold to remove the least likely transmission links (k = 0.05). Including prior information led to some misclassification of importation status due to the inconsistencies between the epidemiological importation status and the reference clusters. As in scenario 2, some cases were classified with only part of their reference clusters because clusters with several importations were split into different clusters. Indeed, the sensitivity score of 34% (7-51%) of cases was below 0.5. Nevertheless, the cluster size distribution observed in the simulation was the closest to the reference clusters. There were 725 (719-731) clusters, 89% of importations were also ancestors of reference clusters, and the number of singletons matched the reference clusters (Figure 5A-C). The inferred clusters of 88% (86-94%) of the cases had a precision score of 1, showing they were clustered without any false positives. Despite discrepancies in several states (Massachusetts, Ohio), the cluster size distribution stratified by state showed good agreement with the reference clusters (Supplement Figure S8).

The conditional report ratio in the transmission chains and the spatial parameters a and b were estimated in each scenario. The parameter estimates did not depend on the prior importation status distribution or the value of the threshold. The conditional report ratio was consistently estimated above 90%, showing a low number of missing generations between cases (Supplement Figure S9). This number is not representative of the overall report ratio, which is usually much lower [51], and does not take into account missing importations in singletons and chains. High values of the conditional report ratio show that the reported cases can be connected without missing generations.

There was little variation in the estimates of the spatial parameters between the different scenarios. The population parameter a was estimated between 0.6 and 1 for every scenario, and the distance parameter b was between 0.08 and 0.12. In every scenario, more than 80% of the inferred transmission events were between cases less than 10 km apart, and few long-distance transmissions were recorded (50-100 km). Hence, although most of the reconstructed connections were between cases from the same county, the algorithm was able to identify clusters spreading over several counties or states (Supplement Figure S10).
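To give a sense of what these estimates imply, the Python sketch below evaluates a simple exponential gravity kernel. The functional form is an assumption of this sketch (weight proportional to population raised to the power a, times exp(−b × distance)), since the kernel equation is not restated in this section, and the populations and distances are hypothetical.

```python
import math


def gravity_weights(distances_km, populations, a, b):
    """Normalised connection weights under an assumed exponential gravity form.

    Assumed form (not taken from the text): population ** a * exp(-b * distance).
    """
    raw = [
        (population ** a) * math.exp(-b * distance)
        for distance, population in zip(distances_km, populations)
    ]
    total = sum(raw)
    return [value / total for value in raw]


# Hypothetical candidate origins at 0, 10 and 100 km, with equal populations.
distances = [0.0, 10.0, 100.0]
populations = [100_000, 100_000, 100_000]

# Values within the ranges reported for a (0.6 to 1) and b (0.08 to 0.12).
weights = gravity_weights(distances, populations, a=0.8, b=0.1)
```

Under this assumed form and with b = 0.1, a candidate origin 10 km away carries exp(−1) times the weight of one in the same location, and an origin 100 km away is almost excluded, which is in line with most inferred transmission events being short-distance.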

We highlighted the added value of including the spatial distance between cases in the likelihood by comparing the cluster size distribution inferred by selecting certain components of the likelihood (Supplement Figure S11). The credibility intervals were much wider when the distance between cases was not part of the likelihood, and the number of chains containing 2 to 10 cases was overestimated. The large impact of the spatial component of the likelihood was also due to the large size of the United States, and could be smaller in a different setting.

We used the ratio of the number of importations over the number of subsequent cases per state to evaluate the intensity of transmission in each state between 2001 and 2016 (Figure 6). The maps obtained in scenario 1 (ideal scenario) or in scenario 3 (estimation of importation, with epidemiological importations and k = 0.05) were very similar. We only observed minor differences, for example in South Dakota and in Massachusetts, where the ratios were higher in scenario 3. The highest ratio (31.8 in scenario 1) was observed in Ohio, and is mostly due to a 383-case outbreak in 2014 [32]. We observed major differences between the incidence map (Figure 3A [...]

Similarly, we used the inferred transmission chains to compute the inferred reproduction number in each state. According to the model, about 60% of cases did not cause further transmission, and about 5% caused more than 5 subsequent cases (Supplement Figure S12). These numbers were consistent in each run. The geographical distribution of the reproduction number was very similar to the importation-subsequent cases ratio (Supplement Figure S13).
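Both the number of secondary cases per case and the cluster sizes follow directly from an inferred transmission tree. The Python sketch below assumes a tree stored as a mapping from each case to its infector, with None marking importations; the data and function names are hypothetical.

```python
from collections import Counter


def secondary_case_counts(infector_of):
    """Number of cases directly infected by each case."""
    counts = {case: 0 for case in infector_of}
    for infector in infector_of.values():
        if infector is not None:
            counts[infector] += 1
    return counts


def cluster_sizes(infector_of):
    """Size of each transmission cluster, keyed by its ancestor."""

    def ancestor(case):
        # Walk up the tree until reaching an imported case.
        while infector_of[case] is not None:
            case = infector_of[case]
        return case

    return Counter(ancestor(case) for case in infector_of)


# Hypothetical tree: a and e are importations; a infects b and c; b infects d.
tree = {"a": None, "b": "a", "c": "a", "d": "b", "e": None}

counts = secondary_case_counts(tree)
share_without_onward_transmission = (
    sum(1 for n in counts.values() if n == 0) / len(counts)
)
sizes = cluster_sizes(tree)
```

Applying these functions to every sampled tree, and then stratifying by state, would yield posterior distributions of the quantities summarised above.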

We developed the R package o2geosocial to classify measles cases into transmission clusters and estimate their importation status using routinely collected surveillance data (genotype, age, onset date and location of the cases). As recently observed during the 2018-2019 measles outbreak in New York, delays in childhood vaccination, local susceptibility, and increased contacts can lead to large outbreaks following importations [52, 53]. Therefore, we were interested in highlighting the effect of imported cases on communities and we focused on short-distance transmission to identify areas where they repeatedly caused subsequent transmission chains. Although this is not predictive of future transmission, it highlights communities with potential for large transmission clusters.

We compared the inferred transmission clusters to the contact tracing investigations of 2,077 confirmed measles cases reported in the United States between 2001 and 2016. We were able to produce reliable estimates of known transmission clusters using epidemiological features, with only a few misclassifications. Estimating the importation status of cases without prior knowledge was challenging and caused uncertainty in the results. We tested different thresholds to eliminate unlikely transmissions, and we were able to identify most of the imported cases. Nevertheless, if several cases were imported in the same region at a similar time, we could not find all of them without discarding valid transmission events and increasing the number of false positives. When we used the importation status as defined in the contact tracing investigations without probabilistic inference (scenarios 1 and 2), the reconstructed clusters were similar to the reference ones. Results were also conclusive when we combined prior information and importation inference. The reconstruction of transmission greatly depends on the epidemiological investigations to identify measles importations in a community.

We used the genotype to censor connections between cases when it was reported, as there can be only one reported genotype per transmission cluster. Using a simulated dataset (toy_outbreak_long in o2geosocial), we explored the impact of increasing the proportion of genotyped cases on clustering and observed it could help identify the number of concurrent transmission trees when multiple genotypes are co-circulating. Moreover, we introduced a spatial component to the likelihood of connection between cases using an exponential gravity model. Previous studies showed this model was able to capture short-distance dynamics better than other gravity models, and was easy to parametrise. Introducing the spatial component greatly improved the precision and the sensitivity of the reconstructed clusters (Supplement Figure S11), and the parameter estimates were robust in the different scenarios.

The final results on the clustering of the 2,077 cases using o2geosocial were obtained in 7 hours for each run of 100,000 iterations on a standard desktop computer (Intel Core i7, 3.20 GHz 6 cores), which is much faster than previous implementations of outbreaker and outbreaker2. With the addition of the pre-clustering step, whereby we reduced the number of potential infectors for each case, the algorithm ran faster. For smaller chains (50,000 iterations), 4 hours were needed to estimate the importation status and cluster the cases. The code for the package and the analysis developed in this project is shared on GitHub (https://github.com/alxsrobert/o2geosocial and alxsrobert/datapaperMO), with an illustrative toy dataset, and can be used to analyse recent outbreaks where contact-tracing investigations were not carried out.

Although the results obtained are promising, it should be noted that the dynamics of measles transmission in the United States are likely to be very specific to this location. Indeed, there were fewer than 700 annual cases between 2001 and 2016. These cases were scattered across a large area, which made the pre-clustering of cases very efficient as we focused on short-distance transmission. In smaller or more endemic settings, the number of potential infectors per case after the pre-clustering step might be higher, which would increase the running time.

Furthermore, as the location of each case was deduced from the population centroid of counties, we assumed that the distance between cases from the same county was effectively zero. American counties are large and widespread geographical units that can include more than 1 million individuals. For future use of o2geosocial, more accurate information on the location of cases could improve cluster inference by identifying multiple importations in a given county. Because cases are reported by the state of residency, we had to ignore that cases may have been out of the reported county or state during their incubation and infectious period, which has been seen during some outbreaks, such as the 2015 "Disney outbreak" in California [54].

We did not include prior information on the local susceptibility of the different areas affected in o2geosocial, and these could be estimated using historical values of local coverage. However, protocols to estimate local vaccination coverage can differ in time and space and be difficult to compare, or unavailable at the local level. Furthermore, these estimates are cross-sectional in nature, and might not take into account catch-up vaccination campaigns, or immunity induced by previous outbreaks. Local seroprevalence surveys could identify pockets of susceptibles, but they have not been carried out on a subnational scale in most countries [55].

There has been no national quantitative analysis of age-specific contact patterns carried out in the United [...] observed between European countries, and a previous projection of the social contact matrix in the United States yielded similar results [56]. POLYMOD data were probably the most reliable source of information we could use to deduce an estimate of the contact matrix in the United States.


sensitivity of the clusters for each case in scenario 3, with a 5% relative threshold; cases are classified in a category depending on the proportion of their reference cluster that was inferred in the same cluster. Panel C: Same when importation status is taken from the contact tracing investigations and inferred using a 5% relative threshold.

outbreaks in the UK, is it when and where, rather than if? A database cohort study of childhood

Transmission intensity and impact of control policies on the foot and mouth epidemic in Great Britain

Different Epidemic Curves for Severe Acute Respiratory Syndrome

Superspreading and the effect of individual variation on disease emergence

Chains of transmission and control of Ebola virus disease in Conakry, Guinea, in 2014: an observational study

Relating phylogenetic trees to transmission trees of infectious disease outbreaks

How generation intervals shape the relationship between growth rates and reproductive numbers

Methods to infer transmission risk factors in complex outbreak data

Disease Outbreaks by Combining Epidemiologic and Genomic Data

elucidates Ebola virus origin and transmission during the 2014 outbreak

Temporal and spatial analysis of the 2014-2015 Ebola virus outbreak in West Africa

genome sequence analysis of 14 SARS coronavirus isolates and common mutations associated with putative origins of infection

Evolutionary analysis of the dynamics of viral infectious disease

Unifying the Epidemiological and Evolutionary Dynamics of Pathogens

When are pathogen genome sequences informative of transmission events?

Global distribution of measles genotypes and measles molecular epidemiology

Measles molecular epidemiology: What does it tell us and why is it important?

Whole-genome sequencing of measles virus genotypes H1 and D8 during outbreaks of infection following the

Winter Games reveals viral transmission routes

Measles virus nomenclature Update

Heterogeneity in coverage for measles and varicella vaccination in toddlers - Analysis of factors influencing parental acceptance

The effect of heterogeneity in uptake of the measles, mumps, and rubella vaccine on the potential for outbreaks of measles: A modelling study

A Measles Outbreak in an Underimmunized Amish Community in Ohio

Large measles epidemic in the Netherlands

Measles population susceptibility in Liverpool

Characterizing the Transmission Potential of Zoonotic Infections from Minor Outbreaks

Social contacts and mixing patterns relevant to the spread of infectious diseases

Inference of R0 and Transmission Heterogeneity from the Size Distribution of Stuttering Chains

Identifying postelimination trends for the introduction and transmissibility of measles in the United States

Systematic comparison of trip distribution laws and models

The P1 P2/D hypothesis: On the intercity movement of persons

A Universal Model of Commuting Networks

An introduction to MCMC for machine learning

Centers for Disease Control and Prevention (CDC). National Notifiable Disease Surveillance System: measles/rubeola 2013

Incubation periods of acute respiratory viral infections: a systematic review

The correlation between infectivity and incubation period of measles, estimated from households with two cases

The Interval between Successive Cases of an Infectious Disease

US Census Bureau. Centers of Population for the

The tip of the iceberg: incompleteness of measles reporting during a large outbreak in The Netherlands

Measles outbreak--California

Measles elimination, immunity, serosurveys, and other immunity gap diagnostic tools

Projecting social contact matrices in 152 countries using contact surveys and demographic data

We acknowledge Thibaut Jombart for technical support and feedback on the analysis plan.