key: cord-0278539-cs8sqrx1 authors: Menardo, Fabrizio; Rutaihwa, Liliana K.; Zwyer, Michaela; Borrell, Sonia; Comas, Iñaki; Conceição, Emilyn Costa; Coscolla, Mireia; Cox, Helen; Joloba, Moses; Dou, Horng-Yunn; Feldmann, Julia; Fenner, Lukas; Fyfe, Janet; Gao, Qian; de Viedma, Darío García; Garcia-Basteiro, Alberto L.; Gygli, Sebastian M.; Hella, Jerry; Hiza, Hellen; Jugheli, Levan; Kamwela, Lujeko; Kato-Maeda, Midori; Liu, Qingyun; Ley, Serej D.; Loiseau, Chloe; Mahasirimongkol, Surakameth; Malla, Bijaya; Palittapongarnpim, Prasit; Rakotosamimanana, Niaina; Rasolofo, Voahangy; Reinhard, Miriam; Reither, Klaus; Sasamalo, Mohamed; Duarte, Rafael Silva; Sola, Christophe; Suffys, Philip; Lima, Karla Valeria Batista; Yeboah-Manu, Dorothy; Beisel, Christian; Brites, Daniela; Gagneux, Sebastien title: Local adaptation in populations of Mycobacterium tuberculosis endemic to the Indian Ocean Rim date: 2020-10-20 journal: bioRxiv DOI: 10.1101/2020.10.20.346866 sha: d9107d263063ddd6020bb244e198d0b47b80456d doc_id: 278539 cord_uid: cs8sqrx1 Lineage 1 (L1) and 3 (L3) are two lineages of the Mycobacterium tuberculosis complex (MTBC), causing tuberculosis (TB) in humans. L1 and L3 are endemic to the Rim of the Indian Ocean, the region that accounts for most of the world’s new TB cases. Despite their relevance for this region, L1 and L3 remain understudied. Here we analyzed 2,938 L1 and 2,030 L3 whole genome sequences originating from 69 countries. We show that South Asia played a central role in the dispersion of these two lineages to neighboring regions. Moreover, we found that L1 exhibits signatures of local adaptation at the esxH locus, a gene coding for a secreted effector that targets the human endosomal sorting complex, and is included in several vaccine candidates. Our study highlights the importance of genetic diversity in the MTBC, and sheds new light on two of the most important MTBC lineages affecting humans. slave trade from East Africa (Allen 2013 , Conceição et al. 2019 . Interestingly, this is in contrast with L5 and L6, which are endemic to West Africa, but did not establish themselves firmly in South America (De Jong et al. 2010 , Rabahi et al. 2020 . Third, similar to the West African clade of L1.1.1, we found an East African clade embedded within sublineage L1.2.1, which otherwise is found almost exclusively in Southeast Asia. This East African clade is composed of 11 strains from five countries, and its sister clade is found in East Timor and Papua New Guinea. We inferred a direct migration from the islands of Southeast Asia to East Africa that occurred between the 13 th and the 16 th century AD (assuming a clock rate of 1.4x10 -7 ). This would be compatible with early Portuguese expeditions, which reached East Timor and Papua New Guinea in the early 16 th century. An alternative explanation could be the Austronesian expansion. Austronesians are thought to have reached the Comoros islands and Madagascar between the 9 th and the 13 th century AD, possibly through direct navigation from Southeast Asian islands. Malagasy speak an Austronesian language, and Austronesian genetic signatures are found in in human populations in the Comoros, Madagascar, and to a small extent also in the Horn of Africa (Blench 2010 , Boivin et al. 2013 , Pierron et al. 2014 , Brucato et al. 2016 , Brucato et al. 2018 , Brucato et al. 2019 ). L1 and L3 coexist in many regions around the Indian Ocean. Yet, in their evolutionary history these lineages colonized areas occupied by different human populations. Human genetic variation has been shown to influence the susceptibility to TB (Qu et al. 2011 ). Most notably, HLA genes play a crucial role in the activation of the immune responses to the MTBC by exposing bacterial peptides (epitopes) to the surface of an infected cell, where they can be recognized by T cells. HLA genes are extremely polymorphic in human populations, and several alleles of different HLA genes are associated with TB susceptibility in different populations (Brahmajothi et al. 1991 , Vejbaesya et al. 2002 , Yuliwulandari et al. 2010 , Salie et al. 2014 , Sveinbjornsson et al. 2016 . Previous studies have shown that T cell epitopes are hyper-conserved in the MTBC, suggesting that immune escape does not provide an advantage, and that contrary to other pathogens, the MTBC needs to be recognized by the immune system and to cause disease in order to transmit (Comas et al. 2010 , Coscolla et al. 2015 . Our large dataset of L1 and L3 genome sequences from different geographic regions provided an opportunity to scan for lineage-and/or region-specific signatures of selection at T cell epitopes in L1 and L3. We reconstructed the mutational history of T cell epitopes in L1 and L3, and found that 51% of all epitopes were variable (at the amino acid level) in at least one L1 strain, while only 20% were variable in at least one L3 strain (Sup . Table 7) . However, this difference can be explained by the different size and diversity of the two datasets (2,061 genome sequences with 136,023 variable sites for L1; 1,021 genome sequences with 36,316 variable sites for L3). The epitope that accumulated most mutations was located at the N-terminus of esxH (Rv0288). This peptide is a T cell epitope for both classes, MHCI and MHCII; it is also a B cell epitope, and was previously identified as one of the few T cell epitopes that were not hyper-conserved (Comas et al. 2010) . We found 15 derived haplotypes (at the amino acid level), generated by 28 independent replacements at five positions in a peptide of seven amino acids (Figure 3 ). By contrast, the second most variable epitope accumulated only seven amino acid changes (Sup . Table 7) . Interestingly, we did not find this signature in L3, as all strains carried the ancestral haplotype at the N-terminal esxH epitope. Moreover, when we extended the analysis to two large genomic datasets of L2 and L4 strains, we found a much weaker signal: while 21% of L1 strains carried a derived haplotype for this epitope, only 1% of L2 and L4 strains, and no L3 strain had a derived haplotype. Despite analyzing datasets with more strains (6,752 and 10,466 for L2 and L4, respectively), and more polymorphic positions (140,309 and 277,648) compared to L1, we found only three amino acid replacements at the N-terminal epitope of esxH in L2, and seven in L4 (Figure 3 ). The most frequently mutated position was the tenth amino acid, where we found 12 independent replacements in L1, and two in L2 and L4 (Fig. 3) . The amino acid replacement A10V occurred eight times in parallel in L1, once in L2 and twice in L4. The most abundant derived haplotype was caused by a different amino acid replacement at position ten (A10T), which occurred once in L1 and once in L2 (Fig. 3) . Overall, the replacements in all lineages occurred in eight residues in a peptide of 13 amino acids (Fig. 3) . We evaluated the robustness of these results by formally testing for positive selection on the complete sequence of esxH using PAML (Yang 2007) . We found that esxH was indeed under positive selection in L1 (p-value = 0.00004) but not in the other lineages (p-values = 0.39, 1.00 and 0.65 for L2, L3 and L4, respectively). PAML identified four codons that have been under positive selection (posterior probability > 0.95), all of them within the N-terminal epitope (codons 7, 9, 10 and 13; Fig. 3 ). Codon 76, which is part of a different T cell epitope, had a posterior probability > 90%, mutated three times in parallel and was possibly also under positive selection. Our results further revealed that the derived haplotypes of the N-terminal esxH epitope were not distributed randomly across the geographic range of L1. Twenty-two of the 28 (79%) amino acid replacements occurred in sublineages L1.1.1 and L1.2.1, which are almost exclusively present in Southeast Asia (Sup. Fig. 11 ). We constructed a statistical test of association similar to phyC (Farhat et al. 2013) to determine whether replacements in the hypervariable esxH epitope were significantly associated with a particular geographic region (Methods). We found that South African strains were less likely to harbor a derived haplotype in the N-terminal esxH epitope than expected by chance (empirical p-value = 0.0130; Table 1 ). While East African L1 strains were not associated with the derived haplotypes (empirical p-value = 0.276), we noticed important differences between countries: of the 29 East African strains harboring a derived haplotype, 28 were sampled in Madagascar. When we excluded Madagascar, we found that East Africa had a strong negative association with the derived haplotypes (i.e. East African strains harbored less derived haplotypes than expected by chance; empirical p-value = 0.0004). We then tested the most frequently replaced position (position ten) alone, and again found that East African strains were negatively associated with the derived haplotypes, with and without excluding Malagasy strains (empirical pvalues = 0.0176 and 0.034, respectively). Finally, we tested the derived haplotype caused by the most frequent amino acid replacement (A10V; 8 parallel occurrences). Again, we found a negative association with East African strains (empirical p-values = 0.046 and 0.079, respectively including and excluding Malagasy strains; Table 1 ). We hypothesized that the accumulation of missense mutations at the N-terminal epitope of esxH was due to immune escape. Therefore, we performed in silico prediction of the binding affinity of the ancestral haplotype and of the two most frequent derived haplotypes (caused by the amino acid replacement A10V and A10T) with different HLA-A, HLA-B and HLA-DRB1 alleles (Methods). We performed this analysis for: 1) Alleles found at high frequency (> 10%) in South-and Southeast Asia, but not in East Africa. 2) Alleles found at high frequency (> 10%) in East Africa, but not in South-and Southeast Asia. 3) Alleles found at high frequency (> 10%) in both regions. However, we found no differences in the predicted binding affinities between alleles with different geographic distributions (Sup. Table 8 ). While esxH was the most striking example of a selective pressure specific for one lineage, our analysis suggest that it was not an isolated case. We performed a genome-wide scan for selection with PAML, and identified 17 genes under positive selection, of which five in common between L1 and L3 (Sup. Table 9 ). We found two genes coding for transmembrane proteins, members of the Esx-1 secretion system, which were under positive selection in L1 (eccB1 and eccCa1, Bonferroni corrected p-values = 0.02 and 0.03), and several genes involved in antibiotic resistance that were under positive selection in both lineages (Sup. Information). We further characterized the profile of drug resistance mutations of L1 and L3, and found that L1 strains harbored a greater proportion of inhA promoter mutations (conferring resistance to isoniazid) compared to L3 strains, confirming previous findings (Fenner et al. 2012; Sup. Information; Sup. Fig. 12; Sup. the migration rates decreased after the 7 th century AD. Our results indicate that the MRCA of L1 did not exist before the 12 th century AD. This discrepancy is due to different assumptions about the clock rates of L1, but none of the two hypotheses can be excluded with the available data. Nonetheless, our discovery of a West African clade that originated through a direct introduction from Southeast Asia supports a larger clock rate of L1, and therefore a more recent MRCA (Sup. Information). As we already mentioned, tip-dating analyses of MTBC trees with roots that are several hundreds of years old is extremely challenging, and the results of such analyses should be taken with caution ). Moreover, all MTBC molecular clock studies so far assumed that evolutionary rate estimates do not depend on the age of the calibration points (Ho et al. 2005) . There are some indications that the effect of time dependency in MTBC datasets is negligible (Pepperell et al. 2013 ). However, this assumption has not yet been thoroughly tested due to the lack of appropriate samples. We found evidence for a strong selective pressure acting on the N-terminus of esxH in L1. In contrast, this selective pressure seems to be much reduced, if not absent, in the "modern" lineages (L2, L3 and L4). It is known that L1 strains interact differently with the infected hosts compared to other lineages. For example, L1 strains show a lower virulence in animal models (Bottai et al. 2020) , transmit less efficiently in some clinical settings, and infect older patients (Holt et al. 2018) . It has been shown recently that the increased virulence of the so-called "modern" MTBC strains is due to the loss of the genomic region TbD1, which remains present in L1 (Bottai et al. 2020). However, it was also reported that in some populations, L1 was associated with higher patient mortality (Smittipat et al. 2019) . Given these differences with the "modern" lineages, it is likely that L1 is subject to different selective pressures, and it is possible that the greater selective pressure on esxH was caused by some epistatic effect specific to L1. The signature of selection at the N-terminal epitope of esxH were associated with strains sampled in South-and Southeast Asia, and were almost completely absent in East Africa (excluding Madagascar). Region-specific signatures of positive selection are a hallmark of local adaptation; in this case, Although all the codons under positive selection in esxH were contained in one single T cell epitope, the selective pressure acting on esxH could be due to some other factor, and not to the recognition of the epitope by T cells, for which, indeed, we found no evidence in the binding prediction analysis. EsxH is a small effector secreted by the Esx-3 secretion system as dimer with EsxG (Ilghari et al. 2011 Reads with alignment score lower than (0.93*read_length)-(read_length*4*0.07)) were excluded: this corresponds to more than 7 miss-matches per 100 bp. SNPs were called with Samtools v1.2 mpileup (Li et al. 2009 ) and VarScan v2 We applied the following filters: genomes were excluded if they had 1) an average coverage < 15x, 2) more than 50% of their SNPs excluded due to the strand bias filter, 3) more than 50% (or more than 1,000 in absolute number) of their SNPs having a percentage of reads supporting the call between 10% and 90%, 4) contained single nucleotide polymorphisms diagnostic for different MTBC lineages (Steiner et al. 2014) , as this indicated that a mix of genomes was sequenced, 5) had more than 5,000 SNPs of difference compared to the reconstructed ancestral genome of the MTBC (Comas et al. 2013). Additionally, when multiple strains were sampled from the same patient, we retained only one. We further excluded all strains that had less SNPs than (average -(3 * standard deviation)) of the respective lineage (calculated after all previous filtering steps). We built SNP alignments for L1 and L3 separately, including only variable positions with less than 10% of missing data, and finally, we excluded all genomes with more than 10% of missing data in the alignment of the respective lineage. After all filtering steps, we were able to retrieve 4,968 strains with high quality genome sequences for further analyses (2,938 L1, 2,030 L3, Sup. Table 1 ). We We selected all strains for which the year of sampling was known (2,499 strains, 1,672 L1, 827 L3). For both lineages, we built SNP alignments including only variable positions with less than 10% of missing data. We inferred phylogenetic trees with RAxML 8.2.11 (Stamatakis 2014) , using the GTR model (-m GTRCAT -V options). Since the alignments contain only variable positions, we rescaled the branch lengths of the trees: rescaled_branch_length = ((branch_length * alignment_lengths) / (alignment_length + invariant_sites)). We rooted the trees using the genome sequence of a L2 strain as outgroup (accession number SAMEA4441446). We used the least square method implemented in LSD v0.3-beta (To et al. 2015) to estimate the molecular clock rate with the QPD algorithm and calculating the confidence interval (options -f 100 and -s). We also performed a date randomization test by randomly reassigning the sampling dates among taxa 100 times and estimating the clock rate from the randomized and observed datasets. All LSD analyses were performed on the two lineages individually (L1 and L3), and on the five sublineages of L1. Bayesian molecular clock analyses are computationally demanding and they would be impossible to apply onto the complete datasets. Therefore, we sub-sampled the L1 and L3 datasets used for the LSD analysis to 400 genomes with two different strategies: 1) random subsampling 2) random subsampling For all datasets, we ran at least two runs, we used Tracer 1.7.1 (Rambaut et al. 2018) to identify and exclude the burn-in, to evaluate convergence among runs and to calculate the estimated sample size. We stopped the runs when at least two chains reached convergence, and the ESS of the posterior and of all parameters were larger than 200. Since we detected a strong temporal signal only for L3, we performed a set of additional analyses of the subsets of L3. We repeated the Beast analysis with an extended Bayesian Skyline Plot (BSP) prior instead of the exponential growth prior, and performed a nested sampling analysis (Russel et al. 2019) to identify which of these two models (exponential growth and extended BSP) fitted the data best. The nested sampling was run with chainLength = 20000, particleCount= 4, and subChainLength = 10000. All xml input files are available as Supplementary Files. For the biogeography analysis, we considered only genome sequences obtained from strains for which the locality of sampling was known. When the country of sampling did not correspond to the country of origin of the patient (or was unknown), we considered as sampling locality the country of origin of the For both lineages, we built SNP alignments including only variable positions with less than 10% of missing data, and all strains with known sampling locality (excluding strains from North America, Europe and Australia; see above, 2,061 L1 genomes and 1,021 L3 genomes). We inferred phylogenetic trees with raxml 8.2.11 (Stamatakis 2014) , using the GTR model (-m GTRCAT -V options). Since the alignments contain only variable positions, we re-scaled the branch lengths of the trees: rescaled_branch_length = ((branch_length * alignment_lengths) / (alignment_length + invariant_sites)). Since PASTML needs a time tree as input, we calibrated the phylogenies with LSD, assuming a clock rate of 1.4x10 -7 for L1, and 9x10 -8 for L3. In this analysis, genomes for which the sampling date was not known were assumed to have been sampled between 1995 and 2018, which is the period in which all strains with known date of isolation were sampled. Importantly, using different clock rates for this analysis would only change the time scale of the trees, but not the reconstruction of the ancestral characters. We assigned to each strain the subcontinental geographic region of origin as character, and used PASTML (Ishikawa et al. 2019) to reconstruct the ancestral geographical ranges and their changes along the trees of L1 and L3. We used the MPPA as prediction method (standard settings) and added the character predicted by the joint reconstruction even if it was not selected by the Brier score (option -forced_joint). Additionally, we repeated the PASTML analysis for the sublineages of L1 individually. As a complementary method to reconstruct the ancestral range and the migration pattern of different populations, we used the Beast package Mascot (Müller et al. 2018) . We assumed that strains sampled in the different subcontinental regions represent distinct subpopulations, and we considered only populations for which we had at least 75 genome sequences: four populations for L1 (East Africa, South Asia, Southeast Asia (mainland) and Southeast Asia (islands)), and two populations for L3 (East Africa and South Asia). For computational reasons, we subsampled the two datasets (L1 and L3) to ~300 strains. We sampled an equal number of strains from each geographic region (where possible), and within regions, we randomly sampled an equal number of strains from each country (where possible). This sub-sampling scheme resulted in two subsets of 303 and 300 strains for L1 and L3, respectively. We assembled SNP alignments including only variable positions with less than 10% of missing data, and used jModelTest 2.1.10 v20160303 (Darriba et al. 2012) to identify the best fitting nucleotide substitution model as described above. We performed Bayesian inference with Beast2.5 (Bouckaert et al. 2019). We corrected the xml files to specify the number of invariant sites as indicated here: https://groups.google.com/forum/#!topic/beastusers/QfBHMOqImFE. We assumed a lognormal uncorrelated clock and we fixed the mean of the lognormal distribution of the clock rate to 1.4x10 -7 (L1) and 9x10 -8 (L3). We assigned the tip sampling years to the different strains, and when the sampling time was unknown, we assumed a uniform prior from 1995 to 2018 (similarly to what done in the PASTML analysis). We further assumed the best fitting nucleotide substitution model as identified by jModelTest, a gamma prior for the standard deviation of the lognormal distribution of the clock rate [0 -infinity], and a lognormal prior for the population size with standard deviation = 0.2, and mean estimated in real space. Finally, we used an exponential distribution with mean = 10 -4 as prior for the migration rates. For each analysis, we ran at least two runs. We used Tracer 1.7.1 (Rambaut et al. 2018) to identify and exclude the burn-in, to evaluate convergence among runs, and to calculate the estimated sample size. We stopped the runs when at least two chains reached convergence, and the ESS of the posterior, prior and of the parameters of interest (population sizes and migration rates) were larger than 200. The xml input files are available as Supplementary Files. For the analysis of esxH, we wanted to compare the results obtained for L1 and L3 with the other two major human-adapted MTBC lineages L2 and L4. Therefore, we compiled two datasets of publicly available genome sequences for these two lineages. We applied the same bioinformatic pipeline described above: genomes were excluded if they had 1) an average coverage < 15x, 2) more than 50% (or more than 1,000 in absolute number) of their SNPs having a percentage of reads supporting the call between 10% and 90%, or 3) contained single nucleotide polymorphisms diagnostic for different MTBC lineages (Steiner et al. 2014) , Table 1 ). We reconstructed the phylogenetic tree of L2 with raxml 8.2.11 (Stamatakis 2014) as described above. Due to the large size of the dataset, we used FastTree (Price et al. 2010 ) with options -nt -nocat -nosupport to reconstruct the phylogenetic tree of L4. We downloaded the amino acid sequence of all MTBC epitopes described for Homo sapiens from the immune epitope database (https://www.iedb.org/; downloaded on the 03.08.2020). We considered separately MHCI epitopes and MHCII epitopes. We mapped the epitope sequences onto the H37Rv genome (GCF_000195955.2) using tblastn and excluded sequences mapping equally well to multiple loci in the H37Rv genome. Additionally, we retained only epitopes that mapped with two mismatches or less over the whole epitope length. This resulted in a final list of 539 MHCI epitopes, and 1,144 MHCII epitopes. (Sup. Table 7) . We used the datasets obtained for the biogeography analysis (2,061 genome sequences for L1, and 1,021 genome sequences for L3). For each lineage, we independently assembled a multiple sequence alignment for each epitope. We then translated the sequence to amino acids and used PAUP 4.a (Wilgenbusch and Swofford. 2003) to reconstruct the replacement history of all polymorphic positions on the rooted phylogenetic trees. We used two maximum parsimony algorithms (ACCTRAN and DELTRAN) and considered only the events reconstructed by both algorithms. We considered the first 18 amino acids of esxH (MSQIMYNYPAMLGHAGDM), which was by far the epitope with most amino acid replacements in L1. We expanded the analysis of this epitope to the L2 and L4 datasets, so that we could compare the results with L1 and L3. For the PAML analysis, we randomly selected 500 strains from each MTBC lineage. We used the phylogenetic tree reconstructed by RAxML (same settings as above), and the gene alignment to estimate the branch lengths of the tree using the M0 codon model implemented in PAML 4.9e (Yang 2007) . This step was necessary to obtain a tree with the branch length in expected substitutions per codon. We then fitted two alternative codon models (M1a and M2a) to the trees and alignments. M1a allows ω to be variable across sites, and it assumes two different ω (0 < ω0 < 1, and ω1 = 1), modeling nearly neutral evolution; M2a assumes one additional ω (ω2 > 1) compared to M1a, thus modeling positive selection. We performed a likelihood ratio test between the two models as described in Zhang et al. (2005) . Templates for the control files of the codeml analyses (M0, M1a, M2a) are available as Supplementary Files. The codon under positive selection were identified with the Bayes empirical Bayes method ). To test whether the derived haplotypes of this epitope were associated with a specific geographic region, we constructed a statistical test analogous to PhyC (Farhat et al. 2013) , which is normally used to test for association between a variant and phenotypic drug resistance. We simulated mutations on the phylogenetic tree of L1 and counted how many strains from each region resulted to have a derived state according to the simulation. Under this procedure, mutations occur randomly on the tree, and therefore independently from the geographic region where the tips were sampled. For each test, we performed 10,000 simulations to obtain a null distribution, and we compared this distribution to the observed data to obtain empirical p-values. We used the same geographic region used for the biogeography analysis. Additionally, we considered East Africa excluding Madagascar. R code and input files to perform this test are available as Supplementary Files. We considered three HLA loci: HLA-A, HLA-B and HLA-DRB1. We used the allele frequency database (Gonzalez-Galarza et al. 2020) to identify alleles that are prevalent in East Africa and not in South Asia and Southeast Asia, or the other way around. Because the coverage of the allele frequency database is patchy, we focused on the following countries: Kenya and Zimbabwe as representatives of East Africa; India, Thailand and Taiwan as representatives of South-and Southeast Asia. We identified: 1) Alleles that had a frequency of 10% or more in at least one population in South-and Southeast Asia, but had frequencies lower than 10% in all East African populations. 2) Alleles that had a frequency of 10% or more in at least one population in East Africa, but had frequencies lower than 10% in all populations in South-and Southeast Asia. 3) Alleles that had a frequency of 10% or more in at least one population both in South-and Southeast Asia and in East Africa. For all these alleles, we performed in silico binding prediction with three epitopes: the ancestral epitope at the esxH N terminus (MSQIMYNYPAMLGHAGDM), and the two most frequently observed derived epitopes (MSQIMYNYPTMLGHAGDM, and MSQIMYNYPVMLGHAGDM). For HLA-A and HLA-B alleles, we used the NetMHCPan4.1 server (Reynisson et al. 2020) with standard settings. For HLA-DRB1 alleles, we used the prediction tool of the immune epitope database (https://www.iedb.org/). For this analysis, we used the subsets generated for the Mascot analysis. These datasets are representative of the populations of L1 and L3 in their core geographic ranges, and are computationally treatable. We generated sequence alignments for all genes individually, excluding genes in repetitive regions of the genome (see above). Because some genes are deleted in L1 but not in L3, or the other way around, we obtained a slightly different number of gene alignments for the two lineages (L1: 3,623, L3: 3,622). For each gene, we performed a test for positive selection with PAML as described above for esxH. We considered as under positive selection all genes, for which the likelihood ratio test resulted in a Bonferroni-corrected p-value < 0.05. We considered 196 mutations conferring resistance to different antibiotics (Payne et al. 2019) . We extracted the respective genomic positions from the vcf file of the 4,968 genomes of the complete curated data set (2,938 L1, 2,030 L3) and assembled them in phylip format. To determine the number of independent mutations, we reconstructed the nucleotide changes on the phylogenetic tree rooted with the L2 strain (SAMEA4441446). To do this, we used the maximum parsimony ACCTRAN and DELTRAN algorithms implemented in PAUP 4.0a (Wilgenbusch and Swofford. 2003) , and considered only the events reconstructed by both algorithms. (309540-EVODRTB and 883582-ECOEVODRTB). Calculations were performed at sciCORE (http://scicore.unibas.ch/) scientific computing core facility at the University of Basel. Supplementary results and code are available here: https://github.com/fmenardo/MTBC_L1_L3. Genomic analysis identifies targets of convergent positive selection in drug-resistant Mycobacterium tuberculosis Effect of mutation and genetic background on drug resistance in Mycobacterium tuberculosis Strain variation in the Mycobacterium tuberculosis complex: its role in biology Ecology and evolution of Mycobacterium tuberculosis Predicted impact of the COVID-19 pandemic on global tuberculosis deaths in 2020. medRxiv Allele frequency net database (AFND) 2020 update: gold-standard data classification, open access genotype data and new query tools Time dependency of molecular rate estimates and systematic overestimation of recent divergence times EsxA) and TB10. 4 (EsxH) based vaccines for pre-and post-exposure tuberculosis vaccination Potential impact of the COVID-19 pandemic on HIV, tuberculosis, and malaria in low-income and middle-income countries: a modelling study Frequent transmission of the Mycobacterium tuberculosis Beijing lineage and positive selection for the EsxW Beijing variant in Vietnam Solution Structure of the Mycobacterium tuberculosis EsxG· EsxH Complex functional implications and comparisons with other m. tuberculosis esx family complexes A fast likelihood method to reconstruct and visualize ancestral scenarios VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing The sequence alignment/map format and SAMtools Fast and accurate long-read alignment with Burrows-Wheeler transform Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data The potential impact of COVID-19-related disruption on tuberculosis burden Mycobacterium tuberculosis type VII secreted effector EsxH targets host ESCRT to impair trafficking Treemmer: a tool to reduce large phylogenetic datasets with minimal loss of diversity The molecular clock of Mycobacterium tuberculosis Mycobacterium tuberculosis type VII secretion system effectors differentially impact the ESCRT endomembrane damage response MASCOT: parameter and state inference under the marginal structured coalescent approximation Prevention of M. tuberculosis infection with H4: IC31 vaccine or BCG revaccination A sister lineage of the Mycobacterium tuberculosis complex discovered in the African Great Lakes region Lineage specific histories of Mycobacterium tuberculosis dispersal in Africa and Eurasia Transition bias influences the evolution of antibiotic resistance in Mycobacterium tuberculosis The role of selection in shaping diversity of natural M. tuberculosis populations Genetic Diversity in Mycobacterium tuberculosis Clinical Isolates and Resulting Outcomes of Tuberculosis Infection and Disease Genome-wide evidence of Austronesian-Bantu admixture and cultural reversion in a hunter-gatherer group of Madagascar Mycobacterium tuberculosis EsxH inhibits ESCRT-dependent CD4+ T-cell activation FastTree 2-approximately maximum-likelihood trees for large alignments Knowledge gaining by human genetic studies on tuberculosis susceptibility Characterization of Mycobacterium tuberculosis var & van der Poll Protective immune responses to a recombinant adenovirus type 35 tuberculosis vaccine in two mouse strains: CD4 and CD8 T-cell epitope mapping and role of gamma interferon Posterior summarization in Bayesian phylogenetics using Tracer 1.7. Systematic biology NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data Model selection and parameter inference in phylogenetics using Nested Sampling Associations between human leukocyte antigen class I variants and the Mycobacterium tuberculosis subtypes causing disease COVID-19, tuberculosis, and poverty: preventing a perfect storm Indo-Oceanic Mycobacterium tuberculosis strains from Thailand associated with higher mortality RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies KvarQ: targeted and direct variant calling from fastq reads of bacterial genomes HLA class II sequence variants influence tuberculosis risk in populations of European ancestry Fast dating using least-squares criteria and algorithms Associations of HLA class II alleles with pulmonary tuberculosis in Thais Immigrant arrival and tuberculosis among large immigrant-and refugee-receiving countries Global variation in bacterial strains that cause tuberculosis disease: a systematic review and meta-analysis Inferring evolutionary trees with PAUP Global Tuberculosis Report Bayes empirical Bayes inference of amino acid sites under positive selection PAML 4: phylogenetic analysis by maximum likelihood Association of HLA-A,-B, and-DRB1 with pulmonary tuberculosis in western Javanese Indonesia Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level This work was supported by the Swiss National Science Foundation (grants 310030_188888, CRSII5_177163, IZRJZ3_164171 and IZLSZ3_170834) and the European Research Council