key: cord-0256558-3hzrlncl authors: Liu, P.; Song, Y.; Colijn, C.; MacPherson, A. title: The impact of sampling bias on viral phylogeographic reconstruction date: 2022-05-13 journal: nan DOI: 10.1101/2022.05.12.22275024 sha: 73a29e69a344f6b2f922f1122d86c8e030fb2964 doc_id: 256558 cord_uid: 3hzrlncl Genomic epidemiology plays an ever-increasing role in our understanding of and response to the spread of infectious pathogens. Phylogeography, the reconstruction of the historical location and movement of pathogens from the evolutionary relationships among sampled pathogen sequences, can inform policy decisions related to viral movement among jurisdictions. However, phylogeographic reconstruction is impacted by the fact that the sampling and virus sequencing policies differ among jurisdictions, and these differences can cause bias in phylogeographic reconstructions. Here we assess the potential impacts of geographic-based sampling bias on estimated viral locations in the past, and on whether key viral movements can be detected. We quantify the effect of bias using simulated phylogenies with known geographic histories, and determine the impact of the biased sampling and of the underlying migration rate on the accuracy of estimated past viral locations. We then apply these insights to the geographic spread of Ebolavirus in the 2014-2016 West Africa epidemic. This work highlights how sampling policy can both impact geographic inference and be optimized to best ensure the accuracy of specific features of geographic spread. Genomic epidemiology refers to the use of pathogen genomes to understand how 2 infectious diseases are transmitted across a range of scales, from local outbreaks to 3 global geographic spread. Due to improvements in sequencing technology and 4 decreasing cost, genomic epidemiology has increased greatly in scale in many 5 jurisdictions and for many viruses to the extant that sequencing now plays an important 6 role in pathogen surveillance [1] . Sequencing allows us to identify new viruses, 7 understand their origins and natural reservoirs, to characterize their transmission 8 dynamics in human populations, and to understand their evolution [2, 3] . Interest in 9 genomic surveillance for viruses has grown, in the context of the high utility of genomic 10 data for these tasks and the increasing awareness of the severe threat that emerging between a set of taxa. Here we focus on the fact that the geographic locations of the 23 virus' ancestors in the past can be reconstructed by extrapolating the location of the 24 observed sequence back in time from the present day [7] [8] [9] [10] [11] . There are a number of 25 methods that perform this extrapolation. These treat the geographic location as either 26 a discrete or continuous trait evolving on the phylogenetic tree; some methods use 27 Bayesian approaches to simultaneously reconstruct the location and the phylogeny and 28 others reconstruct the geographic location on a fixed phylogeny [12, 13] . 29 Indeed, phylogeography is one of the main applications of large-scale virus sequence 30 datasets. In 2007, phylogeographic analysis revealed Guangdong as a likely source of 31 Avian H5N1 influenza viruses, and Indonesia a likely sink [9] . Phylogeographic analysis 32 was used to understand transmission dynamics of Ebolavirus in space and time in the 33 Democratic Republic of Congo in 2018-2020, and was part of the genomic surveillance 34 system informing the response to Ebolavirus in real time [14] . In human influenza, 35 phylogeographic analysis is used to characterize the emergence and global spread of 36 human seasonal influenza viruses and compare their circulation patterns [15] . In the 37 SARS-CoV-2 pandemic, phylogeography has been used at a range of scales, to uncover 38 the virus' early dissemination from Europe in 2020 [16] , to demonstrate repeated 39 introductions and localized spread in Spain [17] , Peru [18] , Canada [19] and other 40 settings, and to visualize geographic movements (e.g. in nextstrain) [20] . 41 A number of authors have noted that phylogeographic reconstructions depend on the 42 fraction of infections that are sampled, sequenced and shared, and that phylogeographic 43 estimates can be biased as a result [15, [21] [22] [23] [24] . These authors have taken different 44 approaches to compensating for this bias. de Maio et al. [23] propose a method called 45 the BAyesian STructured coalescent Approximation (BASTA) to approximate the 46 structured coalescent, having found that discrete trait analysis (DTA), in which 47 migration among locations is treated in the same manner as mutation, gives biased 48 estimates of the migration rates if the locations are strongly non-representatively 49 sampled. Perhaps more fundamentally for viral phylogeography, they also point out 50 that the relative sampling intensities of the locations in DTA-based phylogeography are 51 treated as data and inform the migration estimates. This makes phylogeographic 52 estimates sensitive to sampling choices and can lead to erroneous inferences and 53 erroneously small apparent uncertainties; the structured coalescent approach does not 54 have this issue, and parameter estimates in simulations were better than those from 55 DTA models [23] . Magee and Scotch characterize the impact of two sampling schemes 56 and a range of generalized linear models on the estimation of the root state (point of 57 origin) and on the variables (in the linear model) that are deemed to be important [25] . 58 Bias can occur both in estimates of the parameters of the process (e.g. migration or 59 dispersal rates) and in the reconstructed ancestral geographic locations themselves. [23] . 63 Guindon and De Maio address non-uniform sampling in a diffusion model of movement 64 in continuous space by using a Bayesian approach to include the sampling process in the 65 posterior distribution, but this requires an exchange algorithm that introduces 66 substantial computational complexity [26] . Their correction leads to differences in the 67 inferred growth rate, effective population size, dispersal parameter and the time of the 68 most recent common ancestor (compared to not correcting for the sampling process). 69 Kalkauskas et al. explore adding "empty" viral sequences in a continuous-space 70 model [24] , finding that this approach can bias in estimates of the root location, and 71 diffusion rate, but not eliminate it. Lemey et al. incorporate both "empty" viral 72 sequences and individual travel history in part to overcome sampling bias [8] in 73 phylogeography. Indeed, the role of travellers is of course important in shaping viral 74 geographic movements themselves, but also potentially important in reconstructing 75 them. In some jurisdictions, travellers' samples are prioritized for sequencing as part of 76 monitoring viral diversity [27] . The information about which sequences are from 77 traveller samples is not typically shared to public databases such as GISAID, which 78 could additionally confound phylogeographic inferences. Here we focus on maximum-likelihood phylogeographic inference with location 80 modeled as a discrete quantity. Discrete space is appropriate for many epidemiological 81 applications where data is both collected and policies implemented at a regional level. While methods have primarily been developed with Bayesian tools, and so have 83 inherited the disadvantage that limited numbers of sequences can be included if the 84 analysis is to be computationally tractable. An additional challenge is that Bayesian 85 phylogenetic reconstruction produces multiple (posterior) phylogenies with distinct 86 internal nodes. Without comparable internal nodes from tree to tree, it would not be a 87 well-posed task to summarize the nodes' locations, and thus to infer the phylogeographic 88 information. In practice, users of phylogeography who have thousands of sequences 89 often cannot use state-of-the-art Bayesian methods due to computational limitations of 90 these methods for large datasets, and instead use a simpler maximum-likelihood 91 approach [19] : construct a phylogeny, and then reconstruct the geographic locations of 92 the past nodes on that fixed phylogeny, both with maximum-likelihood methods [11] . Bayesian or maximum likelihood methods. Yet this has important practical implications 96 for the users of phylogeography, particularly where large datasets are used to 97 reconstruct a virus' global geographic movement. The implications of sampling bias for 98 the design of genomic surveillance and data sharing strategies are not well understood. 99 Here we characterize the impact of sampling bias on error in the maximum-likelihood 100 phylogeographic reconstruction of a virus' location in the past. We begin by using 101 simulations of viral branching, cure (or host death), and migration to quantify the 102 impact of sampling bias on the overall reconstruction accuracy, on the types of errors 103 that result, and on the sources of variability in the quality of reconstruction. This 104 exploration includes the impact of sampling bias on location of individual ancestral 105 nodes, the inference of the root location (origin), and the impact of over/under 106 representing recent travelers. To illustrate the potential impact of sampling in a natural 107 context, we then perform phylogeographic analysis using Ebolavirus sequences from 108 West Africa. Throughout, we offer practical comments on when geographic sampling 109 bias is likely to impact maximum-likelihood phylogeographic reconstructions. Assumptions and tree simulation 112 We simulate pathogen diversification under the binary-state speciation and extinction 113 (BiSSE) model [28] implemented in the R package diversitree [29] . Here we will 114 consider the case where the two states represent distinct geographical locations. Each 115 node of the resulting rooted binary-state phylogenetic tree is in either location A or 116 location B. Without loss of generality, the diversification is simulated with the initial 117 lineage (the root of the tree) in location A. We assume that speciation and extinction 118 rates are independent of a lineage's location (i.e. location is a neutral character), though 119 we relax this assumption in the Supplement. While this assumption is unlikely to be 120 true in natural systems, it allows us to focus on geographical differences in sampling 121 intensity in isolation. As with the speciation and extinction rate, we assume that 122 migration is symmetrical between the two locations. The resulting diversification model 123 has three parameters: the speciation rate λ, the extinction rate µ, and the migration 124 rate α between location A and location B. In the context of a pathogen phylogenetic 125 tree, "extinction" represents the end of the infectious period (which could be death, or 126 recovery, successful treatment, effective quarantine, etc). Similarly, "speciation" is tree consisting of a subset of the extinct tips. To obtain this tree, we drop all extant 135 (continuing past the present day, and not sampled, as sampling occurs through time in 136 the past) lineages of the simulated tree. We call the resulting tree consisting of only 137 "extinct" tips the "true tree", because it has all of the tips that could possibly have been 138 sampled. The geographical location of the internal nodes of this (simulated) true tree 139 are known and will be used as the reference for quantifying the accuracy of ancestral 140 state reconstruction. We select n tips from the true tree, where n reflects sequencing 141 capacity. We call the resulting subtree with only the n selected tips a "downsampled 142 tree". In order to ensure that all n sampled lineages may be in either location, the 143 initial simulation for a true tree is terminated only once it contains at least n tips in To assess the impact of sampling travellers or those with recent travel-related 155 exposure on phylogeographic reconstruction, we implement a downsampling scheme in 156 which lineages with changes in location near the tips are over-or under-represented. and m B recent migrants in location B. We construct a downsampled tree S p with n tips, 160 for which we select tips from T , attempting to have a fraction p of the location-A tips 161 be "recent migrants". For example, if n = 100, k = 0.4 and p = 0.8, we want 80% (or 162 32) of the 40 location-A tips in the downsampled tree to be recent migrant tips. There 163 may be fewer than 32 recent migrant tips in the true tree to begin with. In this case we 164 include all of them. If there are more than 32, we include 32 sampled at random. Mathematically: we uniformly select ⌊kn⌋ location-A tips at random such that 166 min(m A , ⌊kpn⌋) of the tips are recent migrants, and uniformly select n − ⌊kn⌋ 167 location-B tips at random such that min(m B , ⌊kpn⌋) of the tips are recent migrants. In 168 our experiments with this downsampling scheme, we set k to be 25%, 50% or 75%, and 169 p ranges from 5% to 95% in increments of 5%. To reconstruct the ancestral states at the internal nodes of the downsampled trees, we 172 use the maximum likelihood method first described by Pagel [30] and implemented in the ace function in the R package ape [31] . This method reconstructs the states of 174 internal nodes for a fixed unlabelled tree with known tip states, and gives the likelihood 175 that each internal node (including the root) was in each of the two states (locations). This method also estimates the migration rates of the downsampled trees under the 177 assumption of equal migration rates between the two locations. Reconstruction accuracy 179 We use two quantities to assess the quality of our phylogeographic reconstructions: can be high in a trivial way, as the reconstruction can simply estimate that all the nodes 184 were in that location. The relative accuracy accounts for this. It is the improvement in 185 the absolute accuracy, compared to a null model based only on the locations of the tips. 186 Mathematically, we define absolute accuracy and relative accuracy as follows. Consider a true tree T and its downsampled tree S with an internal node i. Let c i To define the relative accuracy of the downsampled tree S, we compare the absolute 194 accuracy a S to a null expectation e S assuming that the probability that an internal 195 node is in a given state is proportional to the frequency of the two locations at the tips 196 and hence agnostic of phylogenetic structure. Suppose S has n A location-A tips and n B 197 location-B tips. We define the probability that an internal node of S is in location A to 198 be p = n A /(n A + n B ) and the expected accuracy of node i as e i = 1 − |c i − p|. The To understand the impact of sampling on the reconstruction of "key migration events", 204 events that ultimately lead to large outbreaks following introduction, here we define a 205 procedure for selecting such events from a true tree and measures for quantifying the 206 accuracy of their reconstruction. Noting that the root of a true tree is always in 207 location A, we define a key migration event (KME) in a true tree T to be any edge 208 connecting a location-A internal (parent) node to a location-B internal (child) node 209 which subsequently has at least 15 location-B tips as its descendants. These location-B 210 tips are called the "tips of the KME". To quantify the different mechanisms resulting in inaccurate reconstruction of KMEs, 212 May 11, 2022 6/32 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted May 13, 2022. ; https://doi.org/10.1101/2022.05.12.22275024 doi: medRxiv preprint we characterize the state of a KME in a downsampled tree with two measures: the 213 presence of the internal nodes (the parent node and the child node of the KME) and the 214 presence of sufficient (at least 5) tips of the KME. The presence of the internal nodes of 215 a KME indicates if we can correctly infer spatial-temporal information about the 216 introduction (migration) event, and the presence of sufficient tips of the KME 217 determines if we consider the there exists a large outbreak led by the introduction event. 218 The reconstructed KME characterized by these two measures is summarized in Table 1 219 and explained below. The presence of internal nodes has three states: if both parent 220 and child internal nodes of a KME appear in a downsampled tree, then we say the KME 221 is "observed via internal nodes" in the downsampled tree; if only the parent internal 222 node of a KME appears in a downsampled tree, then we say the KME is "erred via 223 internal nodes" in the downsampled tree, and we measure the error by the branch 224 length between the original child internal node of the KME and the apparent child 225 internal node in the downsampled tree; if the parent internal node of a KME is not 226 present in a downsampled tree, then we consider the spatial-temporal information of the 227 introduction event can not be correctly reconstructed, and we say the KME is 228 "obscured via internal nodes". The presence of sufficient tips of the KME simply has two 229 states: if at least 5 tips of the KME appear in a downsampled tree, then we say the 230 KME is "observed via tips"; if fewer than 5 tips of the KME appear in a downsampled 231 tree, then we say the KME is "obscured via tips". To assess the KME reconstruction accuracy for a true tree, we construct 100 233 downsampled trees for each fraction k of location-A tips in downsampled trees; we count 234 the number of downsampled trees in which each KME from the true tree is observed, 235 erred and obscured via internal nodes and the number of downsampled trees in which 236 each KME from the true tree is observed and obscured via tips. May 11, 2022 7/32 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted May 13, 2022. ; https://doi.org/10.1101/2022.05.12.22275024 doi: medRxiv preprint We illustrate the impacts of sampling differences using data from an Ebolavirus 239 outbreak in Africa in 2014-2015 [32] . The data consists of 262 taxa collected from 4 240 countries: 152 from Guinea, 84 from Sierra Leone, 22 from Liberia, 2 from Mali, and 2 241 from unknown locations. The time-scaled tree is generated using BEAST with a 242 log-normal distributed relaxed molecular clock and a non-parametric coalescent prior. More details can be found in the original paper [32] . 244 Naturally, the published data only include some of the infections that occurred, but 245 we can explore the impact of sampling by further downsampling, selecting among the 246 tips that are present. We consider the reconstructed phylogeny with 262 tips as the true 247 tree here; we reconstruct the location for the internal nodes of the tree with all 262 tips 248 using ace and consider the reconstructed location of the internal nodes as true ancestral 249 locations. Similarly, if a tip in the true tree has a different location as its parent node, 250 then we say the tip is a "recent migrant". We apply two sampling schemes: randomly 251 downsampling tips in one location at a time and preferentially downsampling recent Guinea and Sierra Leone tips. We downsample by dropping 80% of the tips in each of 256 the two locations randomly. In addition, for the second downsampling scheme we also 257 randomly downsample 80% recent migrants. Once we have a downsampled tree, we use 258 ace to reconstruct the ancestral location for internal nodes included in the 259 downsampled tree and compare reconstructed ancestral location to the true ancestral 260 location (reconstructed with all 262 tips). We consider the impact of the sampling scheme on the inference of two broad categories 263 of phylogeographic inference: geography (the inference of ancestral states/locations) and 264 migration (the inference of migration rates and key migration events). For each of these 265 categories, we explore the effect of geographically biased sampling and sampling of tips 266 for which there is recent migration in the true tree. Absolute and relative reconstruction accuracy 269 We find that the overall absolute accuracy of phylogeographic reconstruction is often 270 high, particularly when the migration rate is low ( Fig. 1 and Fig. S1 ). Under a low . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted May 13, 2022. ; https://doi.org/10.1101/2022.05.12.22275024 doi: medRxiv preprint low-to-moderate migration rate (α = 0.3; Fig. 1 ), halving the number of location-A 276 samples in the tree reduces the average accuracy from above 90% to approximately 80%, 277 more than doubling the number of internal nodes with an incorrect location estimate. 278 Furthermore, the estimates are quite variable. Under higher migration rates, the 279 accuracy is lower, but it is less sensitive to sampling. In contrast to our initial expectation that absolute accuracy would peak under 281 unbiased sampling (i.e. when the proportion of location-A tips in the downsampled tree 282 was the same as the proportion in the true tree), we found that the absolute accuracy 283 increased as the proportion of location-A tips in the downsampled tree increased (Fig. 284 1B,D and Fig. S1 ). This is a result of the starting point that the root node is in location 285 A, so that over half of the internal nodes are expected to be in location A (recall that 286 migration from A to B and back are equal in our simulations). Without directly 287 accounting for sampling bias in the ancestral state reconstruction, the absolute accuracy 288 will therefore be higher when location A is oversampled. The higher absolute accuracy 289 is simply because almost all of the tips and internal nodes of the downsampled tree are 290 in location A, making it easy for the reconstruction to obtain correct locations. This is 291 relevant particularly when the migration rate is relatively low. In this case, there are 292 few migration events, and the true tree consists of larger monomorphic (same location) 293 clades. This greatly improves the ancestral state reconstruction by increasing the 294 probability that internal nodes have the same location as all their immediate 295 descendants, and this raises the absolute accuracy. In contrast to absolute accuracy, the relative accuracy (which takes this into account, 297 and captures the improvement over expected accuracy from a null model; see Methods) 298 peaks at a sampling proportion of 50% location-A tips, and the overall relative accuracy 299 depends on the geographic sampling bias ( Fig. 1 and Fig. S1 ). When the migration rate 300 is higher, the accuracy of the phylogeographic reconstruction is lower, and the 301 dependence on sampling is less pronounced. Both the impact of sampling proportion 302 and migration rate hold across a large number of true trees (Fig. S2 ). 303 Figure S3 illustrates how the phylogenetic reconstructions are inaccurate. Reconstruction of migration events, or the lack there of, between a parent and child 305 node is particularly sensitive to state of the parental node. If the parental state is 306 over(under)-sampled, we are likely to greatly over(under)-represent those types of edges 307 in the tree. In contrast, over-or under-sampling of the child state has a limited impact 308 on geographic inference. This result is most pronounced when the migration rate is high. 309 To interpret these results: suppose we are located in a region with comparatively high 310 sequencing (location B; left side of the panels in Figure S3 ). Sampling bias will impact 311 our understanding of both migration between jurisdictions and the extent of sustained 312 transmission within a jurisdiction. Specifically we will substantially underestimate the 313 migration rate from the under-sampled jurisdiction (A-B edges) but fairly accurately 314 capture migration to that jurisdiction from the highly sampled region (B-A edges). In 315 addition, we are likely to conclude there are long sustained transmission chains in the is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted May 13, 2022. ; https://doi.org/10.1101/2022.05.12.22275024 doi: medRxiv preprint under-sampled jurisdiction (A-A edges). 318 We also explored how sensitive our results are to the assumption of equal 'speciation' 319 rates. For example, transmission may be occurring at a higher rate in location A or B 320 due to different public health measures or population contact rates. When transmission 321 (or speciation, in the language of our model) is location-dependent, the overall patterns 322 of bias are similar to what we have found in the neutral case. However, the maximum 323 relative accuracy is obtained by sampling the location with the lower speciation rate 324 more than would be representative of the locations' frequencies in the true tree; see Oversampling recent migrants 327 We find that both the absolute and the relative accuracy decrease as more recent 328 migrants (recall that recent migrants are tips in a true tree that have a different 329 location from its parent node) are sampled ( Fig. 2A,B and Fig. S6 ). We use recent 330 migrants as models for either travelers or members of their contact networks. This 331 means if we sample more travel-associated infections, mark the corresponding taxa with 332 the location at which they were sampled, and use the taxa to infer ancestral states or 333 migration patterns, then the inference accuracy may be reduced. We note (Fig. 2C) 334 that there exist tips with migration in their very recent ancestry in a downsampled tree 335 that are not recent migrants in the true tree. We call such tips of the downsampled tree 336 the "apparent" recent migrants. Fig. 2D illustrates how the presence of recent migrants 337 affects the both the absolute accuracy of their ancestors, and that the apparent recent 338 migrants tips observed in downsampled trees also contribute to the inaccuracy in The maximum likelihood method that we use for ancestral state reconstruction gives 348 estimates of migration rates assuming that migration is symmetric between the two 349 locations. We find that at the least biased geographic sampling proportions, the 350 estimates of migration rates are close to the migration rates of the true trees, but the 351 best migration rate estimates do not always occur when the geographic sampling is least 352 biased (Fig. 3) . Migration rates are on average overestimated when the true migration 353 rate is low (Fig. 3 top panels) whereas rate estimates are highly variable but on average 354 closer to the truth when the true migration rate is high (Fig 3 bottom panels) . Migration rate estimation is presumably less biased because higher migration provides 356 May 11, 2022 10/32 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted May 13, 2022. ; . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted May 13, 2022. ; https://doi.org/10.1101/2022.05.12.22275024 doi: medRxiv preprint more migration events on which to base the estimates. However, as the number of 357 migration events increases, so too does sensitivity to biases sampling (more migration 358 events are incorrectly inferred). Note that the y-axis in Fig. 3 is on a logarithmic scale, 359 the migration rate is often underestimated under extreme bias and overestimated when 360 sampling is unbiased, especially when the true migration rate is high. When sampling is 361 highly biased, the downsampled tree contains far fewer migration events than the true 362 tree (clades become more monomorphic) which results in an underestimate of the 363 migration rate. This is particularly true when the migration rate is high and there are 364 more migration events that can be excluded by biased sampling. In contrast, the 365 overestimation of migration rate when sampling is unbiased results from the fact that 366 downsampling makes clades less monomorphic than in the true tree. Here we assess how the reconstruction key migration events (KMEs) in downsampled 369 trees is impacted by sampling bias. Recall that a KME is an edge in the true tree with 370 a location-A parent internal node a location-B child internal node, and the subsequent 371 location-B tips of the child node of the KME are the tips of the KME. Whether the 372 edge can be reconstructed in a downsampled tree indicates whether we can learn about 373 the introduction event from the sample and whether we sample sufficient location-B tips 374 shows whether we can accurately infer that the introduction lead to onward . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted May 13, 2022. ; https://doi.org/10.1101/2022.05.12.22275024 doi: medRxiv preprint and Fig. S7 ) because, if we do not sample enough location-B tips, we will not know 378 there is ongoing transmission after an introduction. Note that there are always two 379 child lineages arising from the parental node of a KME, the focal child node which is in 380 location B and a sister clade which is, at least initially, in location A. For a KME to be 381 "observed via internal nodes" requires that the parental node is in the downsampled tree. 382 This in turn, depends on whether tips in the sister clade of the KME are sampled (Fig. 383 4 and Fig. S7 ). The sampling policy most likely to reconstruct KMEs depends on the migration rate. 385 When the migration rate is high sampling more location-B tips helps reconstruct KMEs, 386 since the true tree has few monomorphic clades and the KMEs would rarely be "erred" 387 or "obscured via internal nodes". When the migration rate is low, unbiased sampling 388 gives better results because the true tree has monomorphic clades and heavily biased 389 sampling can drop entire clades and cause the KMEs to be erred or obscured. Biased 390 sampling comes at an additional cost to reconstruction of KMEs. If sampling is biased 391 enough such that both parent and child internal nodes of a KME are reconstructed with 392 the same location, then we may not identify the KME even though both parent and 393 child internal nodes and the sufficient number of tips of the KME appear in the 394 downsampled tree (Fig. S5) . Relative to the simulations, the Ebolavirus tree has a relatively low migration rate 403 compared to the branching rates, resulting in large monomorphic clades. We would 404 therefore expect the absolute accuracy to be high overall, in keeping with Fig. 1 , but 405 that the relative accuracy would depend on the geographic sampling bias. We examine 406 the absolute accuracy of the internal nodes in downsampled trees, especially the internal 407 nodes representing an introduction event in the true tree. Downsampling 80% of Sierra 408 Leone tips does not greatly reduce the absolute accuracy of the downsampled tree (Fig. 409 S9) . This is because the Sierra Leone tips form a monomorphic clade. When 410 downsampling 80% of Guinea tips (less arranged in monomorphic clades), the overall 411 absolute accuracy of the internal nodes is also high (Fig. S10) , but we observe a group of 412 internal nodes (red box in Fig. S10 ) that have inaccurate reconstructed locations. This 413 does not have a large impact on overall absolute accuracy, but it significantly affects our 414 inferences about the introduction event. Specifically, in this case we would infer that 415 Ebolavirus was introduced to Guinea via Liberia, which is not the case in the true tree. 416 When we downsample 80% of the "recent migrant" tips (reflecting more likely 417 recent-travel-related sequences) in addition to dropping 80% of tips independent of 418 May 11, 2022 14/32 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted May 13, 2022. ; Table 2 ) for 100 downsampled trees with low-moderate. Number of downsampled trees in which a KME is erred (coloured bars) or observed via internal nodes (left-hand transparent-grey bars) and weather KMEs are observed via tips (solid-grey bars) or obscured via tips (right-hand transparent-grey bars). Parameters: λ = 4 and µ = 1. location, the ancestral state reconstruction is accurate except at the migration event in 419 the red box (Fig. 5 ). This is because the reconstructed MCC tree has relatively few 420 recent migrants (32/262), and removing most of these tips does not affect the absolute 421 accuracy. However, if we were particularly interested in the information in the red box 422 in Fig. 5 , that is, whether the geographic spread is directly from Sierra Leone to Guinea 423 or from Sierra Leone to Liberia to Guinea, then the impact of sampling travelers (or 424 those with recent travel contact) would be of concern. Table 2 summaries how inference 425 of the likely source (parent location) and destination (child location) of introductions is 426 impacted by down-sampling of recent migrants. After downsampling, the introduction 427 event appears to be most likely from Sierra Leone to Guinea, whereas the introduction 428 event in the true tree is from Liberia to Guinea. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted May 13, 2022 Table 2 . Effect of downsampling on inferred introduction events. Likelihood of the parent node and the child node of a migration event in the red box being at each location before and after downsampling recent migrants. sampling, the reconstructed tree shape can omit nodes and edges related to key 438 migration events, making inference about important introduction events difficult. The 439 eventual accuracy depends on the underlying migration pattern, the migration rate, the 440 rate of sampling bias and the rate of sampling travelers. We found that the relative 441 accuracy is worsened with increased sampling bias and when travel-associated tips are 442 over-represented. If the underlying migration rate is higher , the relative accuracy of 443 phylogeographic reconstructions after sampling can be lower, but inferences about 444 migration rates and about key migration events can be more accurate than if the 445 underlying migration rate is lower. 446 We only simulated two locations in our model, and in the main text we used a 447 birth-death model without location-depending speciation rate or migration rate. We 448 showed in the supplementary material that location-dependant speciation rate can also 449 impact phylogeographic reconstruction, and in this case oversampling the location with 450 a lower speciation rate (compared to that location's representation in the true tree) can 451 help to compensate. We did not vary the size of downsampled trees, and we assumed 452 that a true tree is large enough to contain both at least n location-A tips and at least n 453 location-B tips, which may be an unrealistic assumption. We chose to generate true 454 trees first and simulate geographic sampling by downsampling the true trees after they 455 are generated. However, we note that there are methods to simulate geographic 456 sampling along a birth-death process, for example in [33] . We compared these two 457 approaches to simulate geographic sampling and construct downsampled trees in the 458 supplementary material, and we found the two approaches provide similar results in 459 terms of absolute reconstruction accuracy, hence relative reconstruction accuracy. See . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted May 13, 2022. ; Furthermore, we simulated trees and either downsampled those trees or simulated 471 birth, death and sampling through time; we did not simulate sequences evolving on the 472 trees, nor reconstruct trees from these simulated sequences. Accordingly, our results are 473 for the idealized circumstance where the reconstruction of the tree itself is essentially 474 perfect. In practice, tree uncertainty would add additional noise to phylogeographic 475 reconstruction, and the accuracy of reconstructions would be expected to be lower than 476 we have found here. 477 We have found that even under idealized circumstances, the portion of infections 478 from the different locations that are sequenced and represented in the data, and the 479 extent to which they are representative samples of the circulating infections in their 480 jurisdictions, can strongly shape the reliability of phylogeographic inferences. While this 481 was known in its impact on estimates of migration rates and the originating sequence 482 (root), the impact on the estimated virus locations themselves, or on the ability to 483 detect important viral movements, has not previously been characterized. Methods are 484 being developed to compensate for this bias, but currently, these are Bayesian methods 485 with high computational costs, and they are are not suitable for the analysis of the high 486 volumes of viral sequences that are being generated. Furthermore, while incorporating 487 past locations into Bayesian phylogenetic reconstruction is appealing, the fact that the 488 internal nodes in the posterior sample of phylogenetic trees are not the same from tree 489 to tree hinders interpretation of the reconstructed locations. This analysis is therefore 490 more suited to estimating rates than viral movements. One approach is to proceed as in 491 Lemey et al. [8] , summarizing the posterior trees to capture viral movements without 492 focusing on individuals nodes (though this does not resolve the issue of computational 493 capacity for Bayesian tree reconstruction beyond hundreds of sequences). Another 494 would be to use the BASTA framework of de Maio et al. [23] , but if there are too many 495 sequences for Bayesian phylogenetic estimation to be practical, fix the phylogenetic tree 496 (estimating it first by maximum likelihood), and take the structured coalescent approach 497 to the phylogeographic reconstruction. This may be limited by the assumptions of the 498 structured coalescent (like fixed effective population sizes for the locations, or demes). 499 Adjustment for sampling differences is likely to be feasible only when those 500 differences are known. Sharing data on the fraction of cases that are sampled and 501 sequenced, the fraction of infections that are reported as cases, and the strategy by 502 which samples are prioritized for sequencing, will help in making these adjustments. Ultimately this will aid in correctly inferring pathogens' geographic movements. Meanwhile, we would caution that in real datasets from public databases, covering 513 multiple jurisdictions sequencing viruses at potentially very different rates and in 514 non-representative ways (due to targeting outbreaks, travellers, health-care workers, 515 specific variants and so on for prioritized sequencing but then not making the reason for 516 sequencing available), the results of phylogeographic reconstructions should be taken 517 with caution. Whereas in the main text we downsampled many times from the same fixed true tree, here in Fig. S2 we show the absolute reconstruction accuracy and how it varies with sampling bias using 25 true trees for each choice of the migration rate; we downsample 50 times for each tree. Each edge in the true tree has two endpoints, that is, the parent node of the edge near the root and the child node near the tips. For an edge in a downsampled tree of a true tree, each node has a true location from simulation and a reconstructed location (a likelihood of being location A) from the function ace. The state of the edge with the true locations at the endpoints can be A-A, A-B, B-A or B-B. We count an edge with a location-A parent node and a location-A child node as one edge in the state A-A and analogously for the other states. Whereas the state of the edge with reconstructed locations at the endpoints is computed by multiplying the likelihood of the parent node being at location A (or B) and the the likelihood of the child node being at location A (or B). For example, if the parent node has a likelihood 0.3 of being at location A and the child node has a likelihood 0. tips for the true trees simulated with different migration rates. We also compute the difference of subtracting the number of edges with true locations at endpoints from the number of edges with reconstructed locations at endpoints in each state for the true trees. The results are displayed in Fig. S3 and Fig. S4 Furthermore, in the main text, we used a neutral branching process in which the branching and death rates did not depend on the location. Here, we simulate under with non-neutral speciation rates. Fig. S5 displays the absolute and relative accuracy for two true trees simulated with λ A = 8, λ B = 4 and λ A = 4, λ B = 8 respectively. We find that the relative accuracy is highest when the location with a lower speciation rate is oversampled compared to its representation in the true tree (few of the higher-speciation-rate location-A tips in Fig. S5A , and more of the now lower-speciation-rate location-A tips in Fig. S5B ). In both cases the relative accuracy is highest when the fraction of tips from the lower speciation rate location is 45-65%, whereas due to having a lower speciation rate, that location has fewer tips in the true tree. We demonstrate additional results regarding absolute and relative accuracy of downsampled trees with different proportions or recent migrants, where the downsampled trees have 25% and 75% location-A tips instead of 50%. Fig. S6 shows the results compared to Fig. 2A and Fig. 2B , and we observe the same pattern that the accuracy decreases as the proportion of recent migrants increases. We show complementary results about key migration events (KMEs) to Fig. 4B and Fig.4AD in Fig. S7 , where we count the number of downsampled trees in which a KME of the true tree is obscured, observed or erred for true trees displayed in Fig. 1A and Fig. 1C respectively. We perform additional experiments to investigate the absolute accuracy of the parent and the child internal nodes of KMEs. Fig. S8 shows the absolute and relative accuracy of the internal nodes of KMEs. We observe that if we sample more location-A tips, then the child nodes of KMEs are more likely to be reconstructed with the wrong location, and the KMEs are likely to be obscured via internal nodes. Similarly, if we sample more location-B tips, then the parent nodes of KMEs are more likely to be reconstructed with the wrong location, and the KMEs are also likely to be obscured via internal nodes. Therefore, the unbiased geographic sampling performs better in identifying key migration events. Application to the 2014-2015 Ebola Epidemic Fig. S9 and Fig. S10 shows the results of downsampled 80% of the Sierra Leone and Guiena tips respectively. In both cases, thee overall absolute accuracy is high. We obtained 10 downsampled trees for each location and calculate average absolute May 11, 2022 26/32 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted May 13, 2022. Supp Fig S7. The KME count for the true tree with migration rate 0.1 displayed in Fig. 1A and the true tree with migration rate 0.9 displayed in Fig. 1C respectively, where the left bars show the number of downsampled trees in which a KME of the true tree is observed (zero-distance green bars), erred (non-zero-distance color bars) and obscured (transparent grey bars) via internal nodes, and the right bars show the number of downsampled trees in which a KME of the true tree is observed (solid grey bars) and obscured (transparent grey bars) via tips. accuracy. The average absolute accuracy is 99% when downsampling Sierra Leone tips and 94% when downsampling Guinea tips. We show that the two approaches to simulating geographic sampling, generating true trees then downsampling and downsampling along birth-death processes, give similar results for the absolute accuracy. Fig. S11A shows the results of simulating geographic sampling with the first approach. Specifically, we simulate a single true tree with parameters λ = 4, µ = 1, α = 0.7, and we downsample 50 times from the true tree for each proportion of location-A tips. Fig. S11B shows the results of the second approach, where we run 50 birth-death processes with sampling using parameters λ = 4, µ = 1, α = 0.7 for each proportion of location-A tips. The function sim.bdtypes.stt.taxa in the R package TreeSim is used to realize the simulations, and the function produces a downsampled tree for each birth-death process, so we have in total 50 downsampled trees. We observe that the two approaches to simulating geographic sampling produce similar results regarding absolute accuracy. We also explored how geographic sampling impacts the reconstruction of the location and the time of the root (Fig. S12) . We find that geographic sampling bias can have a dramatic impact on the shows that the reconstructed state of the root depends on the proportion of location-A tips in downsampled trees. Since we begin simulations with a May 11, 2022 27/32 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted May 13, 2022. ; https://doi.org/10.1101/2022.05.12.22275024 doi: medRxiv preprint Supp Fig S8. Reconstruction accuracy of KME. Absolute accuracy of the parent and the child nodes over all observed (via internal nodes) KMEs in the true tree in Fig. 1A with migration rate 0.1 (Panel A), the true tree in Fig. 4A with migration rate 0.3 (Panel B), the true tree in Fig. 3C with migration rate 0.7 (Panel C), and the true tree in Fig. 1C with migration rate 0.9 (Panel D). root at location A, the more location-A tips are sampled, the more accurate (i.e. location A) the root location is. Fig. S13 also shows that the absolute accuracy of the root location is more accurate when the migration rate of the true tree is low, presumably because more location A tips occur in more monomorphic clades. The reconstructed time of the root also depends on whether there are extinction events or small clades near the root. If there are extinction events (hence tips) or small clades near the root and these are unsampled, then the reconstructed time of the root can be inaccurate. May 11, 2022 28/32 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted May 13, 2022. ; https://doi.org/10.1101/2022.05.12.22275024 doi: medRxiv preprint Supp Fig S11. Reconstruction accuracy of trees of extinct lineages trees versus trees simulated from a birth-death-sampling process. Absolute and relative accuracy of trees with different proportions of location-A tips. Panel A: Each data point in represents the absolute or the relative accuracy of a downsampled tree from a single true tree simulated with λ = 4, µ = 1, α = 0.7. The downsampled trees are obtained after the true tree is simulated, and the vertical lines indicate the proportion of location-A tips in the true tree. Panel B: Each data point in represents the absolute or the relative accuracy of tree obtained along a birth-death process with λ = 4, µ = 1, α = 0.7. May 11, 2022 31/32 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted May 13, 2022. ; https://doi.org/10.1101/2022.05.12.22275024 doi: medRxiv preprint Getting to the root of epidemic spread with phylodynamic analysis of genomic data Genomic surveillance to combat COVID-19: challenges and opportunities Alarming COVID variants show vital role of genomic surveillance Global initiative on sharing all influenza datafrom vision to reality Evolutionary analysis of the dynamics of viral infectious disease Bayesian phylogeography finds its roots Accommodating individual travel history and unsampled diversity in Bayesian phylogeographic inference of SARS-CoV-2 A statistical phylogeography of influenza A H5N1 Ancestral Reconstruction Detection of HIV transmission clusters from phylogenetic trees using a multi-state birth-death model Integration of genomic sequencing into the response to the Ebola virus outbreak in Nord Kivu, Democratic Republic of the Congo Global circulation patterns of seasonal influenza viruses vary with antigenic drift Phylogeography of 27,000 SARS-CoV-2 Genomes: Europe as the Major Source of the COVID-19 Phylogeography of SARS-CoV-2 pandemic in Spain: a story of multiple introductions, micro-geographic stratification, founder effects, and super-spreaders Phylogenomics reveals multiple introductions and early spread of SARS-CoV-2 into Peru Early and ongoing importations of SARS-CoV-2 in Canada. medRxiv Nextstrain: real-time tracking of pathogen evolution Unifying viral genetics and human transportation data to predict the global transmission dynamics of human influenza H3N2 Bayesian phylogeography of influenza A/H3N2 for the 2014-15 season in the United States using three frameworks of ancestral state reconstruction New Routes to Phylogeography: A Bayesian Structured Coalescent Approximation Sampling bias and model choice in continuous phylogeography: Getting lost on a random walk The effects of random taxa sampling schemes in Bayesian virus phylogeography Accounting for spatial sampling patterns in Bayesian phylogeography Genome Canada Canadian COVID-19 Genomics Network (CanCOGeN) and the Canadian Public Health Laboratory Network CanCOGeN Working Group. Canadian national COVID-19 genomics surveillance priorities for existing and emerging variants of concern Estimating a Binary Character's Effect on Speciation and Extinction Diversitree: comparative phylogenetic analyses of diversification in R Detecting correlated evolution on phylogenies: a general method for the comparative analysis of discrete characters ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R Temporal and spatial analysis of the 2014-2015 Ebola virus outbreak in West Africa Uncovering epidemiological dynamics in heterogeneous host populations using phylogenetic methods Ancestral state reconstruction of Ebolavirus trees with Sierra Leone downsampled. The Maximum Clade Credibility (MCC) phylogenetic tree the Ebolavirus outbreak generated using BEAST Downsampled tree obtained by dropping 80% Sierra Leone samples with true ancestral location (Panel B). The downsampled tree with reconstructed ancestral location after dropping Sierra Leone samples (Panel C)