key: cord-0998571-gf25r9ip authors: Alcala, Nicolas; Jensen, Jeffrey D.; Telenti, Amalio; Vuilleumier, Séverine title: The Genomic Signature of Population Reconnection Following Isolation: From Theory to HIV date: 2015-11-04 journal: G3 (Bethesda) DOI: 10.1534/g3.115.024208 sha: c8d703462f802321d816c81555d52174ef477674 doc_id: 998571 cord_uid: gf25r9ip Ease of worldwide travel provides increased opportunities for organisms not only to colonize new environments but also to encounter related but diverged populations. Such events of reconnection and secondary contact of previously isolated populations are widely observed at different time scales. For example, during the quaternary glaciation, sea water level fluctuations caused temporal isolation of populations, often to be followed by secondary contact. At shorter time scales, population isolation and reconnection of viruses are commonly observed, and such events are often associated with epidemics and pandemics. Here, using coalescent theory and simulations, we describe the temporal impact of population reconnection after isolation on nucleotide differences and the site frequency spectrum, as well as common summary statistics of DNA variation. We identify robust genomic signatures of population reconnection after isolation. We utilize our development to infer the recent evolutionary history of human immunodeficiency virus 1 (HIV-1) in Asia and South America, successfully retrieving the successive HIV subtype colonization events in these regions. Our analysis reveals that divergent HIV-1 subtype populations are currently admixing in these regions, suggesting that HIV-1 may be undergoing a process of homogenization, contrary to popular belief. admixture migration coalescent site frequency spectrum HIV Past population demographic events leave distinctive signatures in DNA sequence polymorphisms and gene genealogies (Tajima 1989; Simonsen et al. 1995; Schneider and Excoffier 1999; Excoffier 2004; Zeng et al. 2006; Gutenkunst et al. 2009; Naduvilezhath et al. 2011) . For example, a population bottleneck increases the variance of coalescence times and reduces the number of variants at low frequency in the site frequency spectrum (SFS) (Tajima 1989) . Population expansion creates star-shaped genealogies and increases the mean number of variants at low frequency in the SFS (Tajima 1989; Slatkin 1996) . Population sub-division leads to long internal branches in the genealogy and increases the number of fixed variants (Harpending et al. 1998 ). It also results in "structured" genealogies in which the means and variances of coalescence times (e.g., time to the most recent common ancestor) are large, resulting in an excess of intermediate frequency variants in the SFS (Wakeley 1999) . Balancing selection leads to a similar structure, and to an excess of variants at intermediate frequency (Tajima 1989; Fay and Wu 2000; Barton and Etheridge 2004; Zeng et al. 2006) . Directional selection leads to genealogies with both star shapes (leading to the common ancestor of the selected lineages) and long branches (due to remaining ancestral lineages and/or recombination with the selected haplotype), and generates an excess of variants at both low and high frequency (Barton 1998) . Consequently, DNA sequence polymorphisms are increasingly used to infer past demographic events and have successfully reconstructed past histories of many organisms, including humans (Gravel et al. 2011; Reich et al. 2012) . Some of the first methods proposed were developed to infer population growth rates (Slatkin and Hudson 1991) or for specific demographic scenarios (e.g., Nielsen and Wakeley 2001 ; Hey diverged populations after isolation remains difficult, as models of past isolation with recent contact are challenging to distinguish from models of ancient population separation with a continuous exchange of migrants. Also, parameter inference requires a priori knowledge of a population split (Hey 2010) , and the time at which migration event(s) occurred has thus far been difficult to estimate (Sousa et al. 2011; Strasburg and Rieseberg 2011) . Nielsen and Wakeley (2001) first proposed a method to infer time of divergence (IM model), the migration rate between the two populations, as well as the relative sizes of the populations. This approach was extended to account for the divergence of multiple populations (Hey 2010) and temporal reduction of gene flow following an initial population split (Wilkinson-Herbots 2012). Though others have discussed the expected signature of a reconnection event following a past isolation period (Becquet and Przeworski 2009; Sousa and Hey 2013) , detailed theoretical formalization is still lacking. Besides this, the use of the aforementioned methods for general inference is also challenging as they are difficult to apply when the number of sampled populations is large (Beaumont et al. 2002; Drummond et al. 2005; Gutenkunst et al. 2009; Lohse et al. 2011; Excoffier et al. 2013) , and when multiple sampled time-points are available (Excoffier et al. 2013; Foll et al. 2014) . Further, they often rely on a specific evolutionary model, such as the Wright-Fisher model (e.g., Gutenkunst et al. 2009; Lohse et al. 2011; Excoffier et al. 2013 ) which might not be representative of, for example, the skewed offspring distributions found for marine species or viruses (Tellier and Lemaire 2014) Past population isolation and subsequent reconnection or secondary contact are common in natural populations and can have a drastic impact on species' evolutionary histories. Anatomically, modern humans are thought to have admixed with Neanderthal populations after a period of isolation between the African and European continents (Green et al. 2010; Patterson et al. 2012; Sankararaman et al. 2012) . Similarly, African and non-African populations of Drosophila melanogaster also experienced a period of isolation, followed by subsequent admixture (Pool et al. 2012) . Domesticated species of both plants and animals are isolated from their wild relatives, and often experience subsequent genetic exchange (e.g., in crops; Ellstrand et al. 1999) . Importantly, isolation and subsequent reconnection events are also common features in viral histories, including the Influenza A virus, Human Immunodeficiency Virus (HIV) and Human Cytomegalovirus (HCMV; Renzette et al. 2013) , where dependency on the host cell and other features of the life cycle isolate viral populations within hosts, groups of hosts, or host species inbetween infections. For example, reassortment between avian and human influenza viruses caused the pandemic outbreaks in 1957 and 1968, and a reassortment of swine, avian and human influenza viruses was responsible for the 2009 pandemic (Hsieh et al. 2006; Garten et al. 2009; Flahault and Zylberman 2010) . As in influenza, events of HIV isolation and reconnection are common and can be observed at different spatial and temporal scales. Indeed, tracing the origin of HIV has confirmed that cross-species transmission of SIVs from other primates to humans, acting as sources of HIV, occurred several times over the past 100 years (Liégeois et al. 2014) . The virus remained isolated for a very long period and adapted rapidly in humans, which ultimately rose to epidemic levels (Korber et al. 2001; Tebit and Arts 2011) . Circulating HIV populations in humans are strongly divergent. Two major types of HIV exist (HIV-1 and HIV-2) and both have diversified into several groups or subtypes (Robertson et al. 2000) . The HIV pandemic is mainly caused by HIV-1 Group M, which is composed by divergent (up to 35% sequence divergence) subtypes (named A-D, F-H, J, and K, Gaschen et al. 2002) . The HIV epidemic is associated with the worldwide movements of people that allowed HIV-1 subtype colonization events (Tebit and Arts 2011; Hamelaar et al. 2011) . In the course of their evolution in human populations, HIV-1 subtypes experienced several large-scale isolation events within both transmission risk groups (Su et al. 2000; Pang et al. 2012; Han et al. 2013; Li et al. 2013 ) and countries (Bello et al. 2011; Castro-Nallar et al. 2012) . Subsequent reconnection then occurred through coinfection, superinfection, and recombination (Tebit and Arts 2011; Vuilleumier and Bonhoeffer 2015) . Those reconnection events are the source of emergence of many new recombinant forms which constitute a major challenge for vaccine development and are at the origin of new epidemics (Tebit and Arts 2011; Feng et al. 2013) . The multiple colonization events of HIV-1 that occurred in Asia and in South America are representative of this trend (Hamelaar et al. 2011) . Indeed, with an estimated 4.8 million people living with HIV as of 2011, Asia is the second most HIV-affected region worldwide (Zhang et al. 2013) . Historical epidemics in Asia are associated with transmission events to various risk groups; HIV-1 subtypes B and C were first detected in the mid-1980s and early 1990s, respectively, in the Injecting Drug Users (IDUs) risk group (Lu et al. 2008) , and then spread to other populations. Subtype CRF01_AE was first detected in the Men who have Sex with Men (MSM) risk group in the 1990s, then spread to other groups and became the most prevalent HIV-1 subtype in many parts of China. Similarly, the epidemic clade circulating in South America is derived from subtype B viruses that migrated out of Haiti around 1969 Haiti around (1966 Haiti around -1972 and spread through the world (Bello et al. 2011) . Later, a single-founder subtype F1 strain, introduced in Brazil, spread and recombined with local subtype B viruses to form the HIV-1 BF1 and F1 epidemics (Aulicino et al. 2007) . The present study identifies specific temporal signatures of reconnection of diverged populations in two broadly used summaries of DNA polymorphism: the distribution of pairwise nucleotide differences (via coalescent theory), and the Site Frequency Spectrum (SFS, via simulation). We also describe how past isolation can influence commonly used SFSbased test statistics (Tajima's D, Fu and Li's D Ã and F Ã , Fay and Wu's H, Zeng et al.'s E) , and discuss the specificity and the robustness of the signature relative to other demographic and selective processes. Then, using theoretical results and full genome polymorphism data, we reconstruct the successive invasions of HIV-1 subtypes in China and in South America and analyze the subsequent dynamics of genome composition. We find extensive admixture between HIV-1 subtypes, potentially leading in the long term to homogenization of HIV-1 genomes in these regions. Using coalescent theory, we first present methods for detecting the signature of population reconnection after isolation on the distribution of pairwise nucleotide differences. Then, in the second section, we use coalescent simulations to derive the signature of population reconnection after isolation on the SFS. The third section highlights the robustness of these signatures to alternative demographic scenarios. Finally, the fourth section presents a data application for these methods: inference of recent HIV-1 evolution in China and in South America. Theoretical signature In our model, we consider d previously connected populations that became isolated at time T iso and then reconnected at time T reco . The scaled migration rate M between populations (i.e., twice the number of migrant genes per population per generation) is the same before and after the isolation period and mutations follow a Poisson process of rate u=2 (Kimura 1969) . We used an infinite sites model for the analytical investigations, and for the simulations we considered a finite sites model with a number of sites corresponding to HIV-1 genome size (9719bp, from the reference HXB1 sequence). Our developments are based on the Wright-Fisher model (Fisher 1930; Wright 1931) , where each population has a constant size N. For analytical simplicity, we considered that times T iso and T reco are scaled in units of N generations, so T iso ¼ 1 corresponds to an isolation event that occurred N generations ago. We derived an exact theoretical formula for the distribution of pairwise nucleotide differences in nonrecombining genomic regions (Appendix A), and simulated this distribution using coalescent simulations (software ms; Hudson 2002) in the case of recombining regions. We computed the power to sample sequences from each mode of the distribution using equations A.6a-A.6b (Supporting Information, Figure A in File S1). We generated the SFS through time following isolation and reconnection events using coalescent simulations. We studied the total SFS (considering pooled samples from all populations); in the main text, we present results for equal sample sizes in each populations, and we present results for alternative sampling schemes (Figures B and C in File S1) . We also studied the joint SFS between pairs of populations. We assessed the temporal changes of the total SFS using the visual test of Nawa and Tajima (2008) , which is based on the difference between the expected SFS under neutrality in a population with constant size, and the observed SFS. For each frequency i=n, where n is the sample size, the transformation representsû i ¼ ij i , instead of the corresponding number of variants j i . This simple test enables one to easily detect which frequencies present an excess or a deficit of variants. We also derived an optimal test statistic to detect the signature of population reconnection from the SFS (Text A and Figure D in File S1) and compute the power of common neutrality tests to detect such events ( Figure E in File S1). We then analyzed the sensitivity of our results to departures from the infinite sites model ( Figure F in File S1), from the constant population size model ( Figure G in File S1) and from the Wright-Fisher model ( Figure H in File S1). To account for repeated bottlenecks (e.g., during transmission events), extinction-recolonization dynamics, and rapid adaptation, we simulated the distribution of pairwise nucleotide differences under a scenario of reconnection of isolated populations under a beta-coalescent model. Beta-coalescent models have recently been proposed to model the genealogies of many organisms with a large variance in the number of offspring per individual (Birkner and Blath 2008; Tellier and Lemaire 2014) , in particular viruses (Neher and Hallatschek 2013) . We extended the algorithm from Birkner and Blath (Birkner and Blath 2008)which generates samples under a beta-coalescent at equilibriumto include an island model of migration (each lineage migrates to another population at rate M) and an isolation period. We assumed that coalescence processes follow the beta-coalescent at all times: before isolation, during isolation and after reconnection. In addition to mutation and migration rates, beta-coalescents rely on a parameter a (0 # a # 2) that gives the probability of multiple lineages coalescing at the same time. We present the results for four different parameter values: a ¼ 0 (equivalent to the classic Kingman coalescent), a ¼ 0:5, a ¼ 1 [corresponding to the Bolthausen-Sznitman coalescent, used to model strong positive selection (e.g., Neher and Hallatschek 2013)], and a ¼ 1:5. We also compared the signature of population reconnection after isolation with that of a closely related model: the Isolation with Migration model [IM (e.g., Takahata 1995; Wakeley 1996; Rosenberg and Feldman 2002) ], using coalescent simulations ( Figure I in File S1). We used 1646 whole genome HIV-1 samples from subtypes B, C, F1 and CRF01_AE and recombinant forms between these subtypes (Table A in File S1). We used the subset of genomes from China sequenced between 2005 and 2009 (297 sequences; Table B in File S1). For the study in South America (Brazil and Argentina), we used data available between 1999 and 2005 (128 sequences; Table C in File S1). Finally, for the worldwide PCA analysis, we used all 1646 genomes available. For the Chinese and South American data, we identified HIV-1 populations (subtype clusters) by Discriminant Analysis of Principal Components (DAPC) (Jombart et al. 2010 ). Our investigations considered dominant subtypes B, C, CRF01_AE and F1 and their associated recombinants. Each cluster encompassed a dominant subtype and some recombinant forms between subtypes. We found strong support for four groups using the Bayesian Information Criterion (ranging from 2 to 100 PCA axes) that are independent of sampling year (See Figure J in File S1 for a neighbor joining tree of the sequences where the different clusters and sampling years are represented). We then computed the distribution of pairwise nucleotide differences and the total and joint SFS for the major circulating HIV-1 subtypes in China and South America. We computed the distribution of pairwise nucleotide differences, and the total and joint SFS for the major circulating HIV-1 subtypes in China and South America. The distribution of pairwise nucleotide differences within (within-population pairwise differences) and between clusters (betweenpopulation pairwise differences) were generated for each nonrecombinant block between subtypes that we identified with jpHMM (Schultz et al. 2009 ). In total, 19 and 24 blocks were identified in sequences sampled in China and South America, respectively; the 5%, median and 95% quantiles of block length were 111bp, 220bp and 1333bp, respectively. Detection of bimodality in the distribution, the signature of a past isolation event, was performed by fitting a Gaussian Mixture Model using the Expectation-Maximization algorithm from Meng and Rubin (1993) as implemented in the R package mixtools (Benaglia et al. 2009 ). Gaussian distributions are used to estimate positions of the mode as distributions are symmetric (the 5% and 95% quantiles of the skewness of all components are -1.05 and 1.05, respectively). The algorithm also provides posterior probabilities of membership into modes, which is used to identify genomes that have bimodal distribution (i.e., indicating admixture). We analyzed the signatures in the pairwise nucleotide differences by comparing HIV genome signatures with the expected results, assuming either small linked sequences or large recombining sequences. The total and joint SFS were computed using subtype J as an outgroup. Results were robust to the chosen outgroup (Figure K in File S1). Note that the distinction between recombining and nonrecombining sequences does not apply for the SFS, as the expectation (and the maximum likelihood estimate) of the SFS is not affected by recombination: only its variance is affected (Gutenkunst et al. 2009 ). We tested the significance of the temporal changes of the SFS using a bootstrap test (Figure L in File S1). All data used are from the Los Alamos HIV database (http://www.hiv. lanl.gov/) ( Table A in File S1). Bimodality in pairwise nucleotide differences as a signal of population reconnection after isolation Two distinct signatures of population reconnection after isolation were found in the temporal distribution of pairwise nucleotide differences: bimodality and increased variance of the first mode ( Figure 1 ). Bimodality was observed in these distributions (both within and between populations, p w and p b ) only when a population reconnection occurred ( Figure 1A ). This bimodality reflects the different possible origins of the two sequences sampled: the two sequences either have a recent common ancestor (i.e., after the reconnection event; first mode in Figure 1A ) or an ancient common ancestor (before the isolation event; second mode in Figure 1A ). Bimodality was also observed in the distributions of total pairwise nucleotide difference (considering all populations, p) after a reconnection event but, in this case, bimodality was also observed when populations were completely isolated (T reco ¼ 0, Figure 1 , B and C). To differentiate a connection from an isolation event, bimodality needs to be associated with a temporal increase of the variance of the modes in the distribution (which depends on the recombination rate; see Figure 1 ). Although testing for increase Figure 1 Signature of reconnection of previously isolated populations on the distribution of pairwise nucleotide differences as a function of time since the reconnection event T reco (columns), (A) within-p w and between-population p b , (B-C) total p pairwise nucleotide differences. In panel A, we consider a sequence of size G, with G larger than the number of segregating sites without recombination; plots in panel A are computed using equation A.5. In panels B and C, we consider a sequence of 9719 bp (length of the reference sequence for HIV-1, HXB1) with various mean numbers of recombination events per generation, R ¼ 1 and R ¼ 10. Nucleotide differences are considered from a sample of 16 sequences and each plot represents the mean distribution over 2000 replicate simulations. Parameters are d ¼ 3 populations, duration of the isolation period T iso ¼ 6, scaled migration rate M ¼ 5, scaled mutation rate u ¼ 100. in variance requires data from different time-points, it has the advantage of not requiring a priori knowledge of the origin of the samples (i.e., being within or between populations). Population reconnection after isolation can thus be detected by testing for bimodality and a temporal increase of the variance of the modes. Power to detect bimodality, with high probability (95%) ( Figure A in File S1), depends on the sample size, the duration of the reconnection period, and the amount of gene flow among populations (scaled migration rate). As shown in Figure A in File S1, it is easier to detect a reconnection event from the distribution of pairwise nucleotide differences between populations than within populations. Indeed, the second mode in the distribution of pairwise nucleotide differences was larger between population (p b ) than that of within population (p w , Figure 1A ). This difference in mode size increases when migration is weak or intermediate (M # 1) and the reconnection event is not recent ( Figure A in File S1). Ancient reconnection events can be detected even with small sample sizes (n≃10) when migration is moderate or strong (M $ 1; Figure A in File S1). To evaluate the temporal increases in variance of the mode, no further development is required as the Bartlett test (Bartlett 1937) can be used. Thus, we show that distinctive signatures of past isolation and reconnection events can be identified on the distribution of the number of nucleotide differences between pairs of genomes sampled (within, between, or in the total population). We also determine the parameter ranges for which this signature can be detected (sample size, migration rate, and time since reconnection). The signature of population reconnection after isolation on the Site Frequency Spectrum (SFS) When a population is at mutation-drift equilibrium, the distribution of its alleles, or SFS, has a characteristic shape (dotted line Figure 2A ). When populations experience demographic changes, departures from this characteristic shape are expected and the size and the position of those changes are informative as to the demographic change. Consistent with the expectation, we found a large excess of variants at intermediate frequencies on the SFS following reconnection events between previously isolated populations ( Figure 2 ). This excess was due to exchanges of variants among populations that were once fixed in one or several populations during isolation. The size and the position of this excess is indicative of the number of populations that have reconnected and of the timing of the reconnection event (Figure 2A) , and can be identified by the visual test of equilibrium neutrality proposed by Nawa and Tajima (2008) . We predict that in the total SFS (SFS obtained with merged samples from all populations), shortly after a reconnection event one would expect narrow, high peaks of specific variants. These peaks are found at frequencies i=d, where d is the number of sampled populations and 1 # i , d, when populations have equal sample sizes. In Figure 2A , peaks can be seen at frequencies 1=d and 2=d (d ¼ 3). With unbalanced sample sizes, the number of observed peaks can be higher (up to 2 d 2 2 peaks; see Figure B in File S1). The temporal patterns in the joint SFS ( Figure 2B ) are also strongly informative as to the occurrence of reconnection events and as to the amount of gene flow among populations since the reconnection event. When populations are isolated, each has its own segregating variants. This translates, in the joint SFS, into variants being restricted to the axes of the joint SFS [classes (20,i), (i,20), with i ranging from 1 to 20 in Figure 2B ]. Then, following the reconnection event, gene flow distributes the variants among populations to the point that they occur at similar frequencies in each population and the genetic compositions of populations are fully homogenized. Consequently, in the joint SFS, several variants move progressively away from the axes toward the diagonal of the joint SFS, as expected in a structured population at equilibrium (last column Figure 2B ). To robustly detect the excess of variants on the SFS generated by a reconnection event, we derived an optimal test statistic, denoted T V , using developments provided by Ferretti et al. (2010) and Achaz (2009) (File S1). Although use of this test is restricted to recent reconnection events that follow long periods of isolation, it performs better than commonly used statistics such as Tajima's D (Tajima 1989 (Zeng et al. 2006) . Indeed, even with long isolation periods, those statistics have less than 60% power to detect reconnec-tion events ( Figure E in File S1). The results presented in this section apply to both recombining and nonrecombining sequences, as the expected SFS is the same in both cases (Gutenkunst et al. 2009 ). Also, the power estimates from the tests based on nonrecombining sequences are conservative as they correspond to the case with the highest variance (Achaz 2009 ) and thus the lowest power. Thus, we show that distinctive signatures of past isolation and reconnection events can be identified in the SFS as well as in the joint SFS. Namely, the SFS has an excess of variants, the size and position of which are indicative of the number of previously isolated populations and of the timing since the reconnection event. The temporal dynamic of the joint SFS is informative as to the level of gene flow between populations. (Table B in File S1) and in South America (Table C in File S1) (see method section 'HIV-1 genome analysis'). The total (pooled) SFS is presented in the first row following representation of Nawa and Tajima (2008) , the joint SFSs of each pairwise group of HIV subtypes are presented in the second, third, and fourth row and, in the fifth row, the Principal Component Analyses (PCAs) are presented. Subtype J is used as an outgroup (see Figure K in File S1, results for alternative outgroups). Original subtypes (subtypes found early in the epidemics in Africa; at the corners) are used to define a triangle in the PCA space. The position of the HIV-1 genomes on the straight edges of the triangle suggests strong admixture between subtypes. For the total and joint SFS, we consider the mean value of 500 random samples of sequences in each subtypes cluster (see Figure J in File S1 for the different clusters). A sample of 16 sequences per subtype cluster is considered in China, and of eight sequences per subtype cluster in South America. The significance of the trends in the dynamics is tested using the optimal test statistic T V derived in the SI ( Figure L in File S1). The signatures of population reconnection identified here, i.e., bimodality and temporal increase of variance of peaks in the SFS, are robust to population expansion ( Figure G in File S1), alternative evolutionary and population dynamic models such as repeated bottlenecks, rapid extinction-recolonization dynamics, recurrent positive selection or highly skewed offspring distributions ( Figure H in File S1). Moreover, the signature is distinct from the signature of a closely related model: the Isolation with Migration model (IM) ( Figure I in File S1). The observed signatures are robust even when a massive population expansion occurs concomitantly with the reconnection event ( Figure G in File S1). Such events modify the expected signature only by presenting an excess of low frequency variants compared to what is expected with populations of constant size (as expected from the results of Tajima (Tajima 1989 ) in a single panmictic population). Replacing the classic Kingman coalescent model (Wright-Fisher model) with a beta-coalescent model does not alter our results for values of the model parameter a (Figure H in File S1). Beta-coalescent models account for large variance in the number of offspring per individual, as expected under repeated bottlenecks, fast extinction-recolonization dynamics, or recurrent beneficial fixations (Neher and Hallatschek 2013; Tellier and Lemaire 2014) . Such variance leads to an excess of low frequency variants and smaller peaks in the total SFS compared to Kingman's coalescent model; this effect is larger for small a values (light-gray lines Figure H in File S1). However, having an excess of variants at low frequency does not alter the identified signature of reconnection on the total SFS (dynamics of the peaks). Also the temporal dynamics of the signature distinguish this pattern from that of an IM model. Indeed under the IM model, the size of the peaks in the SFS increases with time ( Figure I .B in File S1), whereas the signature of a reconnection of isolated populations is characterized by a decrease in peak size with time ( Figure 2A) . Thus, we show that, even though the signature at specific time-points may be confounded with that of other selective or demographic processes, the temporal signature of past isolation and reconnection identified is unlikely to be confounded, and provides a robust signal. We applied the conceptual and theoretical approaches via an analysis of the colonization history of different HIV-1 subtypes in Asia and in South America. Using data from the Los Alamos HIV database, we detected a strong signal of reconnection events between previously isolated HIV-1 subtypes in these populations. Past isolation between HIV-1 subtypes in China and South America is signaled by the excess of variants, with peaks at intermediate frequencies in the total SFS (Figure 3, A and B and G and H) . These peaks reflect the presence of variants from the sampled population that diverged during isolation and were present at high frequencies in each previously isolated subtype population identified with cluster analysis (see Results). The reconnection of HIV-1 subtype populations in China and South America is indicated by the bimodality in the distribution of pairwise nucleotide differences between-populations (considering sequences of small length Figure 4) . We found that a large proportion of the genome of HIV-1 displays bimodality (Figure 4 ). In China, this proportion was initially low but rapidly increased through time. Between 2006 and 2007, the proportion of bimodality increased fourfold (Figure 4) , and increased a further 40% between 2008 and 2009. This suggests that the time series analyzed here captures a burst of HIV-1 subtype admixture. The dynamics of genomic admixture between HIV-1 subtypes in China is also highlighted by the temporal changes in the distribution of pairwise nucleotide differences and the SFS. The peaks formed by the excess variants (resulting from population admixture in the total SFS) decreased in size, and the values of optimal test statistics T V significantly decreased ( Figure L in File S1). We find a significant increase in the variance of the first mode in the distribution of pairwise nucleotide differences between 2005 and 2009 (p , 0:01 two-sided Bartlett test). The joint SFSs for the two time points also indicate that the reconnection event and genome admixture is either recent or that genome admixture is limited but ongoing. Indeed, we can observe the excess of variants along the axes of the joint SFS shifting to the diagonal (Figure 3, C and D) . In South America, the proportion of bimodality in HIV-1 samples from 1999 is large (60%) and remains approximately Table B in File S1) and South America (see Table B in File S1). As expected under genomic admixture (Figure 1 ), the proportion of bimodality detected in the pairwise nucleotide differences between populations in HIV-1 genomes increased, and the variance of modes significantly increased with time in China (p , 0:01, two-sided Bartlett test) . We also see a (nonsignificant) trend of temporal changes in South America. constant from 1999 to 2004, signaling a prior reconnection event. Patterns observed in the joint SFS suggest that HIV-1 subtype genomes are well homogenized overall; further, this process of homogenization seems to have either slowed or ceased. Indeed, no increase of variance in the modes of the distribution of pairwise nucleotide differences was detected (two-sided Bartlett test) , and the total SFS shows only small changes between 2001 and 2005, with nonsignificant changes in values of optimal test statistics T V ( Figure L in File S1). Genome admixture in China and in South America is also visible when the sampled genomes are projected together in a PCA space defined by original subtypes i.e., which originate from the beginning of the HIV epidemic in Africa (Figure 3 , E and F and K and L). Several HIV-1 genomes are present on the straight edges of the triangles defined by the ones at the origin. Temporal PCA patterns also suggest homogenization between subtypes B and C in China and between B and F1 in South America, as they move away from the pure subtypes in PCA space. To provide insights on the generality and global importance of these results, we performed a PCA analysis of worldwide genomes from HIV-1 subtypes B, C, F1, and CRF01_AE (http://www.hiv.lanl.gov). Again we projected them in the PCA with original subtypes B, C, F1, and CRF01_AE as references ( Figure 5 ). Consistent with the observations for Chinese and South American samples, we observed that HIV-1 genomes are positioned on the straight edges of the triangle formed by original subtypes, positioned at the corners, forming a continuum along the axes rather than discrete clusters within. Very few genomes are localized outside the edges of the triangle defined by the 3 original subtypes. This pattern not only signals that most of the circulating HIV-1 genomes experience constant gene flow from other subtypes but also represents the current spatial distribution of HIV-1 populations. It reflects, for example, the extensive admixture between subtypes B and CRF01_AE in Brazil and China, between B and CRF01_AE in Thailand, and B and F1 in Brazil, Argentina, and Spain . We present an analytical framework to understand the evolutionary histories of organisms that are characterized by periods of isolation. Our theoretical investigations identify specific genomic signatures of past population isolation. We show, for example, that the number, size, and dynamics of modes in the distributions of the pairwise nucleotide differences in populations are informative on the existence of past isolation. Our results extend previous work on the distribution of pairwise coalescence times and pairwise nucleotide differences under a model of population split with or without migration (IM model). We show, for example, that under the reconnection model, the distribution of pairwise nucleotide differences has a temporal evolution which is distinct from the one expected under an IM model (Takahata 1995; Wakeley 1996; Rosenberg and Feldman 2002) : the distribution is first bimodal and then tends to become unimodal with time. We also found that in the Site Frequency Spectrum (SFS), the presence, the position and the value variants in excess are informative as to the numbers of previously isolated populations, and the joint SFS characterizes the amount of genetic overlap between populations. Signatures identified here are robust to the presence of recombination, extinctioncolonization processes, population bottlenecks, and expansion as well as skewed offspring distributions. Our developments add to the current literature in the field as we demonstrate that population reconnection after isolation is not fully captured by commonly used test statistics based on SFS classes of variants at high and/or low frequencies such as D, H, and E (Ferretti et al. 2010) . Taken together, this provides critical information on the evolution of the genomic composition of populations. We apply our theory in order to analyze the genome signature of successive colonization of HIV-1 subtypes in Asia and South America. In both regions, we reconstruct the successive colonization of HIV-1 subtypes, and find that they are followed by strong HIV-1 subtype admixture. We find, in these regions, that HIV diversification is followed by a temporal homogenization of HIV-1 genome composition. Our analysis highlights a large degree of admixture between circulating HIV-1 subtype genomes in China and South America, and most probably worldwide. These results suggest that genomic exchanges may indeed be relatively strong, and that HIV-1 is currently in a process of homogenization subsequent to isolation and diversification. HIV genome homogenization may seem to contradict the idea that HIV is undergoing a diversification process (Hamelaar et al. 2011; Tebit and Arts 2011; Feng et al. 2013) . However, this homogenization is consistent with observed diversity in the HIV-1 population and the increased number of new HIV recombinant forms described worldwide (Hamelaar et al. 2011; Tebit and Arts 2011; Feng et al. 2013) . Indeed, when isolated and diverged populations reconnect, a transient excess of genetic diversity is predicted in the resulting population (Alcala et al. 2013; Alcala and Figure 5 Projection of worldwide distributed HIV-1 sequences from subtypes B, C, CRF01_AE and F1, and URF and CRF recombinant forms between these subtypes. 1646 genome sequences (see Table A Vuilleumier 2014) as all the genetic material that accumulated during the isolation phase is shared among populations. However, this excess of genetic diversity is transient and with time, the diversity decreases (Alcala et al. 2013) . In HIV-1, such processes occur through multiple recombination events among subtypes through coinfection or superinfection and are the source of novel recombinant forms. In the long term, successive recombination events will ultimately lead to the homogenization of the genomic composition of HIV-1. Homogenization might be further enhanced by selective forces imposed by the immune response or treatment acting on the HIV genome (Fellay et al. 2007; Pelak et al. 2011; Chikata et al. 2014) . As many world regions are currently being newly colonized by different HIV-1 subtypes, it is expected that new recombinants may still emerge and HIV diversity will further increase. However, as demonstrated here, this is expected to be a transient state toward global homogenization. We find signatures of reconnection events among populations of HIV-1 subtypes that were previously isolated both in China and South America. The signatures on HIV-1 DNA polymorphism suggest reconnection events of three previously isolated HIV-1 populations in both regions. These results align with the known epidemiologic history of HIV-1 in the two regions (Su et al. 2000; Yang et al. 2002; Aulicino et al. 2007; Lu et al. 2008; Takebe et al. 2010; Bello et al. 2011; An et al. 2012; Pang et al. 2012; Han et al. 2013; Feng et al. 2013) and are supported by recently identified recombinants in China Wang et al. 2014; Wei et al. 2014; Hsi et al. 2014) . To confirm this, we identified the genome sequences that signal the reconnection event. As expected, they reflect first the HIV outbreak in China in 1989: colonization of B and C subtypes (variance of the first peak in the SFS is generated by B and C subtypes). The subsequent colonization of CRF01_AE at the origin of the HIV outbreak in the late 1990s is also signaled in the SFS (a high proportion of CRF01_AE in the highest mode). We obtained similar concordance for the origin of the sequences forming the peaks and the successive subtypes' colonization events in South America. Our methods also appear robust to high mutation rates and the presence of selection. Indeed, we detect past HIV-1 history despite its extraordinary defining features: HIV has a small genome (about 9.8 kb in length), a short generation time (24 hr; Mohammadi et al. 2013) , and a high mutation rate (about 0.002 substitutions/site/year; Korber et al. 2000; Schlub et al. 2014) . It is also under strong selective pressure from the immune system (e.g., Snoeck et al. 2011 ) and undergoes frequent bottlenecks followed by population expansion [during transmission (e.g., Keele et al. 2008) ]. Nevertheless, the signature observed here in the distributions of pairwise nucleotide difference and on SFS only slightly deviates from theoretical expectations. The robustness of these results may have a simple explanation: a separation of time-scales of the processes involved. Selection and population demographic changes drive divergence of populations during isolation. Migration and recombination during population reconnection allows a rapidrelative to divergencegenome admixture. Both impact genome polymorphism, but the latter does so more drastically and at much shorter time-scales. The evolutionary history of an organism is a powerful resource for understanding its current evolutionary trajectory. The analytical framework developed can be applied to understanding the potential evolutionary pathway of many pathogens and other species that are characterized by periods of isolation. These events are common features of pathogens (Karlsson et al. 2014 ), but also occur in numerous other species, including humans and domesticated plants and animals (Ellstrand et al. 1999; Lecis et al. 2006; Caicedo et al. 2007; Green et al. 2010; Patterson et al. 2012 ). Reconnection following a period of isolation is increasingly recognized as an effective mechanism to trigger fast and complex adaptation (Seehausen 2002 (Seehausen , 2004 Aguilée et al. 2013) . Following reconnection, accumulated diversity (from previously isolated populations) is shared, and the resulting high level of genetic diversity may increase the potential for adaptation (Alcala and Vuilleumier 2014) . Admixture among divergent populations may also promote complex adaptation by bringing together new functions acquired in different environments that have evolved independently but are compatible with the genome. Such processes have been observed in influenza and in HIV, for example. Indeed, recombination between strains previously isolated in their nonhuman hosts preceded the HIV epidemic in humans (Bailes et al. 2003; Sharp and Hahn 2010; Vuilleumier and Bonhoeffer 2015) . Such events are expected to be more frequent in the future due to the increase in human mobility (Karlsson et al. 2014) . Finally, our analytical framework could also be applied to evolutionary trajectories of other zoonotic viruses, such as Ebola virus, coronaviruses, hantaviruses, or henipaviruses. It is important to keep in mind that viral population isolation and reconnection events can occur at a much smaller time scale within hosts. There is increasing evidence suggesting that genetic compartmentalization of viral populations occurs in different organs, tissues, or cells, and might play a critical role for virus evolution and more importantly for its control. This has been demonstrated for HIV (Schnell et al. 2011; Buzón et al. 2011) as well as human cytomegalovirus (Renzette et al. 2011 (Renzette et al. , 2013 (Renzette et al. , 2014 (Renzette et al. , 2015 . Appling the analytical framework detailed here could provide a powerful tool to dissect the dynamics and genetic composition of virus across different genetic compartments, in addition to those dynamics between hosts. We thank the editor Stephen Wright, John Wakeley, and three anonymous reviewers, whose comments and suggestions greatly improved the manuscript. We would also like to thank Kristen Irwin and Nicholas Renzette for careful reading of the manuscript. This project was funded by the Swiss National Science Foundation (SNSF) grants PZ00P3_139421 and 31003A-130065 and PMPDP3_158381 (S.V.), 31003A_149724/1 (A.T.) and an interdisciplinary grant from the Faculty of Biology and Medicine (FBM) of the University of Lausanne (S.V. and A.T.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Where A w;1 ¼ Principal component analysis Frequency spectrum neutrality tests: one for all and all for one Adaptive radiation driven by the interplay of eco-evolutionary and landscape dynamics Turnover and accumulation of genetic diversity across large time-scale cycles of isolation and connection of populations Peak and persistent excess of genetic diversity following an abrupt migration increase Reconstituting the epidemic history of HIV strain crf01_ae among men who have sex with men (msm) in liaoning, northeastern China: Implications for the expanding epidemic among msm in China Description of the first full-length HIV type 1 subtype f1 strain in argentina: implications for the origin and dispersion of this subtype in South America Hybrid origin of SIV in chimpanzees Properties of sufficiency and statistical tests The effect of hitch-hiking on neutral genealogies The effect of selection on genealogies Approximate bayesian computation in population genetics Learning about modes of speciation by computational approaches The use of bioinformatics for studying HIV evolutionary and epidemiological history in south america mixtools: An r package for analyzing finite mixture models Computing likelihoods for coalescents with multiple collisions in the infinitely many sites model Deep molecular characterization of HIV-1 dynamics under suppressive haart Genome-wide patterns of nucleotide polymorphism in domesticated rice The evolution of HIV: inferences using phylogenetics Host-specific adaptation of HIV-1 subtype b in the japanese population Bayesian coalescent inference of past population dynamics from molecular sequences Gene flow and introgression from domesticated plants into their wild relatives Patterns of DNA sequence diversity and genetic structure after a range expansion: lessons from the infinite-island model Robust demographic inference from genomic and SNP data Hitchhiking under positive darwinian selection A whole-genome association study of major determinants for host control of HIV-1 The rapidly expanding crf01_ae epidemic in China is driven by multiple lineages of HIV-1 viruses introduced in the 1990s Identification of a novel HIV type 1 circulating recombinant form (crf65_cpx) composed of crf01_ae and subtypes b and c in western Yunnan Optimal neutrality tests based on the frequency spectrum The genetical theory of natural selection Influenza pandemics: past, present and future challenges Wfabc: a Wright-Fisher abc-based approach for inferring effective population sizes and selection coefficients from time-sampled data Statistical tests of neutrality of mutations Antigenic and genetic characteristics of swine-origin 2009 a(h1n1) influenza viruses circulating in humans Diversity considerations in HIV-1 vaccine selection Demographic history and rare allele sharing among human populations A draft sequence of the neandertal genome Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data Global trends in molecular epidemiology of HIV-1 during High prevalence of HIV-1 intersubtype b /c recombinants among injecting drug users in Dehong, China Genetic traces of ancient demography Isolation with migration models for more than two populations Genome sequence of a novel HIV-1 circulating recombinant form (crf64_bc) identified from Yunnan, China Influenza pandemics: past, present and future Generating samples under a Wright-Fisher neutral model of genetic variation Discriminant analysis of principal components: a new method for the analysis of genetically structured populations Natural selection and infectious disease in human populations Identification and characterization of transmitted and early founder virus envelopes in primary HIV-1 infection The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations Timing the ancestor of the HIV-1 pandemic strains Evolutionary and immunological implications of contemporary HIV-1 variation Bayesian analyses of admixture in wild and domestic cats (Felis silvestris) using linked microsatellite loci Near full-length genome sequence of a novel HIV type 1 second-generation recombinant form (crf01_ae/crf07_bc) identified among men who have sex with men in jilin, china Full-length genome analyses of two new simian immunodeficiency virus (SIV) strains from mustached monkeys (c. cephus) in gabon illustrate a complex evolutionary history among the sivmus/mon/gsn lineage A general method for calculating likelihoods under the coalescent process The changing face of HIV in china Principal components analysis of population admixture Maximum likelihood estimation via the ecm algorithm: A general framework 24 hours in the life of HIV-1 in a t cell line Jaatha: a fast compositelikelihood approach to estimate demographic parameters Simple method for analyzing the pattern of DNA polymorphism and its application to SNP data of human Genealogies of rapidly adapting populations Distinguishing migration from isolation: a markov chain monte carlo approach The coalescent and the genealogical process in geographically structured population Extensive and complex HIV-1 recombination between b9, c and crf01_ae among Idus in south-east Asia Ancient admixture in human history Copy number variation of kir genes influences HIV-1 control Population genomics of sub-saharan Drosophila melanogaster: African diversity and non-african admixture Reconstructing native american population history Extensive genome-wide variability of human cytomegalovirus in congenitally infected infants Rapid intrahost evolution of human cytomegalovirus is shaped by demography and positive selection Human cytomegalovirus intrahost evolution-a new avenue for understanding and controlling herpesvirus infections Limits and patterns of cytomegalovirus genomic diversity in humans HIV-1 nomenclature proposal in Modern Developments in Theoretical Population Genetics: The legacy of Gustave Malécot The date of interbreeding between neandertals and modern humans Fifteen to twenty percent of HIV substitution mutations are associated with recombination Estimation of past demographic parameters from the distribution of pairwise differences when the mutation rates vary among sites: application to human mitochondrial DNA HIV-1 replication in the central nervous system occurs in two distinct cell types jphmm: improving the reliability of recombination prediction in HIV-1 Patterns in fish radiation are compatible with pleistocene desiccation of lake victoria and 14,600 year history for its cichlid species flock Hybridization and adaptive radiation The evolution of HIV-1 and the origin of AIDS Properties of statistical tests of neutrality for DNA polymorphism data Gene genealogies within mutant allelic classes Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations Mapping of positive selection sites in the HIV-1 genome in the context of RNA and protein structural constraints Understanding the origin of species with genome-scale data: modelling gene flow On the nonidentifiability of migration time estimates in isolation with migration models Interpreting the estimated timing of migration events between hybridizing species Exploring the demographic history of DNA sequences using the generalized skyline plot Characterization of a virtually full-length human immunodeficiency virus type 1 genome of a prevalent intersubtype (c/b) recombinant strain in China Statistical method for testing the neutral mutation hypothesis by DNA polymorphism A genetic perspective on the origin and history of humans Reconstructing the epidemic history of HIV-1 circulating recombinant forms crf07_bc and crf08_bc in east asia: the relevance of genetic diversity and phylodynamics for vaccine strategies Tracking a century of global expansion and evolution of HIV to drive understanding and to combat disease Coalescence 2.0: a multiple branching of recent theoretical developments and their applications Contribution of recombination to the evolutionary history of HIV Pairwise differences under a general model of population subdivision Nonequilibrium migration in human history Near fulllength genome characterization of a new crf01_ae/crf08_bc recombinant transmitted between a heterosexual couple in Guangxi, China Identification of a novel HIV-1 circulating recombinant form (crf62_bc) in western Yunnan of China The distribution of the coalescence time and the number of pairwise nucleotide differences in a model of population divergence or speciation with an initial period of gene flow Evolution in mendelian populations On-going generation of multiple forms of HIV-1 intersubtype recombinants in the Yunnan province of China Statistical tests for detecting positive selection by utilizing high-frequency variants HIV prevalence in China: integration of surveillance data and a systematic review Derivation of the distribution of pairwise coalescence times: We here derive the distribution of within-and between-population pairwise coalescence times, T w and T b , assuming a long isolation period (T iso 2 T reco . 3) and small sequences, i.e., without recombination. These coalescence times T w and T b correspond to the time to the most recent common ancestor of two lineages sampled in the same and in different populations, respectively, scaled by the number of genes per population N.The coalescence of two genes under the finite island model can be described by a three states continuous time Markov chain (Notohara 1990 ): the two genes can be present (1) in the same population, (2) in different populations or (3) be coalesced. The transition probabilities from one state to another depend on the scaled migration rate M between populations and on the number of populations d. Time T is counted in units of N generations. The corresponding Markov chain can be summarized by the following transition rate matrix (Notohara 1990) :The probability that two genes were in state (1), (2) or (3) T generations ago (row vector P T ), given that they are in state (1), (2) or (3) at the current generation is given by P T ¼ P 0 e QT (P 0 at generation 0), where e QT is a 3 · 3 matrix. The elements of matrix e QT are denoted q i;j ðTÞ and represent the probability that a gene in state i at the current generation was in state j T generations ago (cumulative distribution function). Thus, the probability distribution (or probability density function) of within-population and between-population pairwise coalescence times are given by the derivatives of corresponding q i;j ðTÞ functions: f w;coal ðTÞ ¼ d dT q 1;3 ðTÞ and f b;coal ðTÞ ¼ d dT q 2;3 ðTÞ. Populations are considered isolated when T reco , T , T iso and no migration occurs, with M ¼ 0 in the transition matrix Q. Populations exchange migrants when T , T reco and T . T iso , thus M . 0 in the transition matrix Q. Consequently, the distributions of within-and between-population pairwise coalescence times, f Tw ðTÞ and f Tb ðTÞ, following a past isolation event, are:T , T reco q 2;1 ðT reco Þe 2ðT2TrecoÞ ; T reco , T , T iso q 2;1 ðT reco Þe 2ðTiso2TrecoÞ f w;coal ðT 2 T iso Þ þq 2;2 ðT reco Þf b;coal ðT 2 T iso Þ; T iso , T:These distributions have a simple interpretation. In the most recent period, the reconnection period, when T , T reco , f Tw ðTÞ and f T b ðTÞ correspond to the probability of coalescence of 2 lineages under the finite island model, f w;coal ðTÞ and f b;coal ðTÞ. During the past isolation period, T reco , T , T iso , two lineages can only coalesce if they are in the same population after T reco generations of connection (probabilities q 1;1 ðT reco Þ and q 2;1 ðT reco Þ within-and between-populations, respectively). Their probability of coalescence at generation T in an isolated population is e 2ðT2TrecoÞ . During the older connection period, T . T iso , two genes can coalesce if they were in the same population during isolation (probabilities q 1;1 ðT reco Þ and q 2;1 ðT reco Þ, within-and between-populations, respectively) and if they did not coalesce during isolation (probability e 2ðTiso2TrecoÞ ). Their probability of coalescence at generation T is f w;coal ðT 2 T iso Þ. Alternatively, they could have coalesced if they were in different populations during isolation (probabilities q 1;2 ðT reco Þ and q 2;2 ðT reco Þ). Their probability of coalescence at generation T isFunctions f w;coal ðTÞ, f b;coal ðTÞ, q 1;1 ðTÞ, q 1;2 ðTÞ, q 2;1 ðTÞ and q 2;2 ðTÞ all depend on the eigenvalues of matrix Q,We obtain:(A.3b) l 2 2 l 1 , A w;2 ¼ 2 ð1 þ l 1 Þ l 2 2 l 1 , A b;1 ¼ l 2 l 2 2 l 1 , A b;2 ¼ 2 l 1 l 2 2 l 1 and B ¼ M l 2 2 l 1 .Derivation of the distribution of pairwise nucleotide differences: From equations A.2a and A.2b, we can derive the distributions of within-(p w ) and between-population pairwise nucleotide differences (p b ). Given that mutations follow a Poisson process of mean u, we havewhere k is a positive integer and G is the length of the sequence (under the infinite size model, we assume that G is much larger than the number of segregating site). We can decompose Pðp w ¼ kÞ into a sum of smaller integrals (respectively for Pðp b ¼ kÞ): , we obtain:The resulting distributions of pairwise nucleotide differences (from equation A.5) are presented in Figure 1A . Further assuming that T iso 2 T reco . 3 (thus e 2TisoþTreco ≃0) leads to: (resp. p b ) under the finite island model at equilibrium, while terms 1 2 e ðli2uÞTreco P k j¼0 T l reco ð2liþuÞ l l! are due to the truncation of the distribution between T reco and infinity. The second term corresponds to sequences that coalesce during the isolation period. Indeed, with probability q 1;1 ðT reco Þ (resp. q 2;1 ðT reco Þ), sequences did not coalesce during the reconnection period and were in the same population during isolation; u k ð1þuÞ kþ1 correspond to the distribution of p w in an isolated population, and the term e 2uTreco P k l¼0 T l reco ð1þuÞ l l! is due to the truncation between 0 and T reco . Finally, the third term corresponds to sequences that coalesce during the prior connection period. With probability q 1;2 ðT reco Þ (resp. q 2;2 ðT reco Þ), sequences coalesce during the prior connection period (as they did not coalesce during the reconnection period and were in different populations during isolation). Then,Ákþ1 corresponds to the distribution of p b under the finite island model at equilibrium, while terms e 2uTiso P k j¼0 T l iso ð2liþuÞ l l! are due to the truncation of the distribution between 0 and T iso .Derivation of the power to detect bimodality: We can derive the power to detect bimodality from the distributions of p w and p b by computing the probability that, in a sample of n independent pairwise differences sequences, at least one is from the first mode and at least one is from the second. Given that the probability for a difference p w to be drawn from the second mode of the distribution is q 1;2 ðT reco Þ, the probability that among n differences, at least one is from the first mode and one from the second mode is 1 2 q 1;2 ðT reco Þ n 2 ð12q 1;2 ðT reco ÞÞ n . Similarly, the probability to observe a bimodal distribution of p b from a sample of size n is 1 2 q 2;2 ðT reco Þ n 2 ð12q 2;2 ðT reco ÞÞ n . These results are used to draw Figure A in File S1. Note that with n sequences, we can only compute n 2 1 independent pairwise differences, although the total number of pairwise differences we can compute is nðn 2 1Þ