key: cord-0326573-b1rz5i9j authors: LaVerriere, E.; Schwabl, P.; Carrasquilla, M.; Taylor, A. R.; Johnson, Z. M.; Shieh, M.; Panchal, R.; Straub, T. J.; Kuzma, R.; Watson, S.; Buckee, C. O.; Andrade, C. M.; Portugal, S.; Crompton, P. D.; Traore, B.; Rayner, J. C.; Corredor, V.; James, K.; Cox, H.; Early, A. M.; MacInnis, B. L.; Neafsey, D. E. title: Design and implementation of multiplexed amplicon sequencing panels to serve genomic epidemiology of infectious disease: a malaria case study date: 2021-09-22 journal: nan DOI: 10.1101/2021.09.15.21263521 sha: 10c7c07a678187f25e852ce072f1b8081f93b100 doc_id: 326573 cord_uid: b1rz5i9j Multiplexed PCR amplicon sequencing (AmpSeq) is an increasingly popular application for cost-effective monitoring of threatened species and managed wildlife populations, and shows strong potential for genomic epidemiology of infectious disease. AmpSeq data for infectious microbes can inform disease control in multiple ways, including measuring drug resistance marker prevalence, distinguishing imported from local cases, and determining the effectiveness of therapeutics. We describe the design and comparative evaluation of two new AmpSeq assays for Plasmodium falciparum malaria parasites: a four-locus panel ('4CAST') composed of highly diverse antigens, and a 129-locus panel ('AMPLseq') composed of drug resistance markers, highly diverse loci for measuring relatedness, and a locus to detect Plasmodium vivax co-infections. We explore the performance of each panel in various public health use cases with in silico simulations as well as empirical experiments. We find that the smaller 4CAST panel performs reliably across a wide range of parasitemia levels without DNA pre-amplification, and could be highly informative for evaluating the number of distinct parasite strains within samples (complexity of infection) and distinguishing recrudescent infections from new infections in therapeutic efficacy studies. The AMPLseq panel performs similarly to two existing panels of comparable size for relatedness measurement, despite differences in the data and approach used for designing each panel. Finally, we describe an R package (paneljudge) that facilitates design and comparative evaluation of AmpSeq panels for relatedness estimation, and we provide general guidance on the design and implementation of AmpSeq panels for genomic epidemiology of infectious disease. 115 oligos can flexibly accommodate most new targets not included in the original design, allowing for panel 116 customization towards detecting locally relevant resistance markers, polymorphic loci, and co-infecting 117 parasite species. We use whole genome sequencing data to explore the degree to which our newly 118 described and previously published genotyping panels can serve studies in diverse geographies, versus 119 the alternative of customizing panels with targets that are locally informative but not globally useful. We 120 suggest there is value in genotyping panels that can be flexibly adapted to incorporate informative 121 targets from pathogen populations of interest. The analyses and resources described in this manuscript 122 clarify the rapidly diversifying options for targeted microbial sequencing ( Fig. 1 ), by providing tools and 123 guidance for the comparative evaluation and refinement of AmpSeq panels. 124 Figure 1 . Amplicon sequencing and other genotyping approaches for genomic epidemiology of 125 infectious diseases. 126 Schematic of three common approaches for molecular surveillance data generation. Genomic DNA can 127 be extracted from clinical samples and then processed using any of the three methods shown: SNP 128 barcoding, amplicon sequencing, or whole genome sequencing (WGS). Our two amplicon panels, 129 AMPLseq and 4CAST, are shown, with representations of their loci and amplification. Pre-amplification 130 (selective whole genome amplification), which increases the ratio of parasite to human DNA in samples, 131 is generally recommended for WGS and some amplicon sequencing panels (AMPLseq, but not 4CAST). 132 SNP barcoding provides data in the form of variant calls at each SNP; amplicon sequencing provides 133 extremely deep coverage at select, small regions of the genome; and WGS generally provides shallower 134 coverage of the entire genome. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint 136 Panel designs 137 We developed a small multiplex of four highly polymorphic antigenic loci, dubbed '4CAST': CSP, 138 AMA1, SERA2, and TRAP ( Fig. 2 ). All four amplicons use previously published primer sequences (Miller 139 et al., 2017; , as no modification was required for successful multiplexing. 140 In designing the larger 'AMPLseq' multiplexed amplicon panel, we first built a large pool of 141 candidate loci, anticipating significant attrition of candidates due to primer incompatibility. We prioritized 142 four classes of loci: loci within antigens of interest (Helb et al., 2015) , loci with high population diversity 143 for relatedness inference , loci included in the SpotMalaria v1 panel (Chang et al., 144 2019; Jacob et al., 2021) , and known drug resistance markers. We contracted the services of GTseek 145 LLC (https://gtseek.com) to design multiplexed oligo panels according to the criteria previously described 146 for the Genotyping-in-Thousands by sequencing (GT-seq) protocol (Campbell, Harmon, & Narum, 2015) 147 ( S1 Supporting information ). We optimized the final primer set and and reaction conditions through 148 several sequencing runs and determined that the primers for the four 4CAST loci ( CSP, AMA1, SERA2, 149 TRAP ) could be added to the panel without impacting amplification of the other loci. We also successfully 150 added primers amplifying known markers of antimalarial drug resistance within the genes dhfr, dhps, 151 mdr1, and kelch13 ( S2 Table ) . Furthermore, we successfully added previously described primers for 152 PvDHFR (Lefterova, Budvytiene, Sandlund, Färnert, & Banaei, 2015) in order to identify P. falciparum / P. 153 vivax co-infections that have gone undetected in preliminary screening by microscopy or rapid diagnostic 154 test (RDT). The final panel, dubbed 'AMPLseq' (short for Assorted Mix of Plasmodium Loci) contains this 155 single P. vivax locus and 128 P. falciparum loci ( Fig. 2 ) , with a median length across all amplicons of 276 156 bp ( S1 Fig. ) . 6 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2021. ; https://doi.org/10.1101/2021.09.15.21263521 doi: medRxiv preprint 171 step at 95 °C (1 min); 10 amplification cycles at 95 °C (15 s), 55 °C (15 s) and 72 °C (30 s); and a final 172 extension step at 72 °C (1 min). We combined PCR2 products in equal volumes and performed 173 double-sided size selection using Agencourt AMPure XP beads (Beckman Coulter): we incubated 100 µl 174 library with 55 µl beads, immobilized beads via magnet rack, and transferred the supernatant to a new 175 tube. We incubated the transferred supernatant with 20 µl beads and washed immobilized beads twice 176 with 80% ethanol. We eluted the library in 25 µl EB buffer (10 mM Tris-Cl, pH 8.5), subsequently adding 177 2.5 µl EB buffer with 1% Tween-20. We verified size selection via Agilent BioAnalyzer 2100 and 178 sequenced the selected library at 6 pM with >10% PhiX in paired-end, 500-cycle format using MiSeq 179 Reagent Kit v2 ( S1 Protocol ). 180 We followed a similar nested PCR and pooled clean-up procedure for AMPLseq library 181 construction. Primer sequences, input volumes and concentrations are listed in S3 Table and PCR 182 conditions and size selection steps are shown in S2 Protocol . As detailed therein, library construction for 183 AMPLseq library construction differs in a few minor aspects. For example, primer input quantities vary 184 slightly (800 pmol +/-33%) to account for amplification rate differences among loci. PCR1 products are 185 diluted 1:12 in NF dH 2 O prior to PCR2 and only single-sided (left-tailed) bead-based size selection is 186 used to enhance yield. Sequencing also occurs via paired-end, 500-cycle MiSeq but with a higher final 187 library loading concentration (12 pM) and a lower fraction of PhiX (8%). 188 Mock samples 189 We generated mock samples from parasite lines 3D7 and Dd2, cultured at 3% hematocrit in 190 commercially obtained red blood cells as previously d escribed (Trager & Jensen, 1976) . We extracted 191 genomic DNA using the Qiagen Blood and Tissue Kit on cells previously lysed with 0.15% saponin. We 192 generated positive control template representing DNA extractions from whole human blood infected with 193 10,000 monoclonal 3D7 parasites/µl by diluting genomic DNA from 3D7 to 0.25 ng/µl in 10 ng/µl human 194 genomic DNA, and storing in 10 mM Tris-HCl (pH 8.0), 1 mM EDTA (Promega, Madison, WI). We 195 generated further control templates representing 1000, 100, and 10 3D7 parasites/µl by serial 1:10 196 dilution of the 10,000 3D7 parasites/µl control, likewise using 10 ng/µl human genomic DNA as diluent. 8 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint 238 Amplicon data analysis 239 We developed an application named AmpSeQC ( S2 Supporting information ) to assess 240 sequence quality and amplicon/sequence run success ( S2 Fig. ) . We also used AmpSeQC for P. vivax 241 detection by applying a concatenation of the P. falciparum 3D7 and P. vivax PvP01 reference genomes 242 during the BWA-MEM alignment step. For in-depth assessment of P. falciparum sequence variation, we 243 processed paired-end Illumina sequencing data in the form of FASTQ files using a custom analysis Fig. ) . We mapped microhaplotypes obtained from DADA2 against a custom-built database of 3D7 and 247 Dd2 reference sequences for each amplicon locus and filtered microhaplotypes based on edit distance, . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2021. ; https://doi.org/10.1101/2021.09.15.21263521 doi: medRxiv preprint 248 length, and chimeric identification, using a custom R script. ( S2 Supporting information ). We 249 summarized observed sequence polymorphism into a concise format by converting individual 250 microhaplotypes into pseudo-CIGAR strings using a custom python script. Microhaplotypes were 251 discarded if supported by fewer than 10 read-pairs or by less than 1% total read-pairs within the locus, or 252 if they exhibited other error features ( S3 Supporting information ). 253 We analyzed native and pre-amplified mock samples to determine precision and sensitivity of the 254 DADA2 pipeline and filters. We defined a true positive (TP) as a microhaplotype with a pseudo-CIGAR 255 string identical to the reference strain (either 3D7 or Dd2). We defined a false positive (FP) as a 256 microhaplotype with a pseudo-CIGAR string not matching 3D7 (in the case of samples containing only 257 3D7) or not matching 3D7 or Dd2 (in the case of the mixes), and we defined a false negative (FN) as a 258 locus without any correct microhaplotype representation. We defined precision as TP/(TP+FP), and 259 sensitivity (recall) as TP/(TP+FN). Forty-two of 128 P. falciparum loci in AMPLseq exhibit identical 3D7 260 and Dd2 reference sequences; we only included these in precision and sensitivity calculations for pure 261 3D7 controls ( i.e. , TP+FN = 128); precision and sensitivity calculations for strain mixtures considered 262 only the 86 loci that differ between 3D7 and Dd2 reference sequences ( i.e ., TP+FN = 86). All amplicon sequencing data were submitted to the NCBI Sequence Read Archive 264 ( http://www.ncbi.nlm.nih.gov/sra ) under accession PRJNA758191. 265 Comparator Panels 266 We compared AMPLseq and 4CAST to two previously published AmpSeq panels for malaria CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2021. ; 273 for bead-based clean-up and CleanPlex digestion of each sample between PCR1 and PCR2. The 274 protocol therefore does not require sWGA prior to PCR1. SpotMalaria v2, designed via Agena BioScience and MPrimer design software, contains 136 276 primer pairs divided into three different pools. These target loci were considered best able to recapitulate 277 pairwise genetic distance, population differentiation, and sample heterozygosity inferences from global 278 WGS data (MalariaGEN Plasmodium falciparum Community Project, 2016) . Primers also target various 279 drug resistance-associated loci (some known to exhibit copy number variation) and mitochondrial loci 280 with conserved primer binding sites among Plasmodium spp. Library construction requires sWGA prior to 281 PCR1 but no special processing between PCR1 and PCR2. 282 We also compared our amplicon panels to a 24-SNP molecular barcode assay (Daniels et al., Table ) . We also used 292 previously published monoclonal genomic data from Guyana (SRA BioProject PRJNA543530) (Mathieu 293 et al., 2020) and French Guiana (SRA BioProject PRJNA242182) (Pelleau et al., 2015) . We used the 294 scikit-allel library (Miles et al., 2020) to process the data and then estimate microhaplotype frequency 295 and diversity. Specifically, we used the read_vcf, is_het, and haploidify_samples functions as described 296 ( S1 Supporting information ), and we estimated haplotype frequencies with the distinct_frequencies 297 function. 298 We assessed the performance of different panels for relatedness inference using simulated data. 299 We generated data on pairs of haploid genotypes (equivalent to pairs of monoclonal malaria samples) 12 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2021. ; https://doi.org/10.1101/2021.09.15.21263521 doi: medRxiv preprint 300 using paneljudge (Taylor & Jacob, 2020) , an R package that we built to simulate data under a hidden 301 Markov model (HMM) (Taylor et al., 2019) ( S2 Supporting information ). For each panel, we calculated 302 inter-locus distances from the median nucleotide position of each locus and set distances as infinite 303 between chromosomes. For each panel and population of interest, we calculated haplotype frequency 304 estimates using scikit-allel , as described above. Given these distances and frequency estimates, we 305 simulated data using relatedness parameter values of 0.01 (unrelated), 0.50 (siblings), and 0.99 (clonal), 306 and switch rate parameter values of 1, 5, 10, and 50. For each combination of panel, population, 307 relatedness parameter, and switch rate parameter, we simulated data on 100 haploid genotype pairs. For 308 each haploid genotype pair, we then generated estimates of the relatedness parameter and the switch 309 rate parameter using paneljudge, with 95% confidence intervals (CIs), under the same model used to 310 simulate the data. We next performed relationship classification from these estimates and CIs. For 311 estimates of unrelated pairs (relatedness parameter of 0.01), we generously classified estimates as 312 correct if the lower limit of the 95% confidence interval (LCI) was below or equal to 0.01 and the upper 313 limit of the 95% confidence interval (UCI) was below 0.99. We classified estimates of sibling-level 314 relatedness (0.50) as correct if the LCI was above 0.01 and the UCI was below 0.99. We classified 315 estimates of clonal pairs (0.99) as correct if the LCI was above 0.01 and the UCI was above or equal to 316 0.99. In all relatedness levels, if the 95% confidence interval spanned both 0.01 and 0.99 ( i.e. , LCI < 0.01 317 and UCI > 0.99), then we denoted the estimate as unclassified. To evaluate panel performance in COI estimation, we combined monoclonal WGS data to 319 engineer in silico polyclonal samples using vcftools (Danecek et al., 2011) . We then estimated 320 microhaplotype frequencies for each locus of a given panel, using scikit-allel as described above, and 321 counted the number of distinct microhaplotypes observed at each locus per sample. We estimated COI 322 as the maximum number of distinct microhaplotypes observed at any locus within a sample. To evaluate panel performance for geographic attribution, we identified microhaplotypes at loci as 324 described above. We used the microhaplotype sequences themselves and visualized these data using 325 the Rtsne package (Krijthe, 2015) , with 5000 iterations, of 0.0, and perplexity parameters of 10 (for 326 4CAST and the 24 SNP barcode) or 30 (for the remaining panels). . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2021. ; 329 We validated assay precision (defined as TP/(TP+FP)), sensitivity (defined as TP/(TP+FN)), and 330 depth of coverage using 3D7 mock clinical samples representing parasitemia levels between 10 and 331 10000 parasites/µl in 10 ng/µl human DNA. Both 4CAST and AMPLseq generated 3D7 microhaplotype 332 calls with 100% precision for all parasitemia levels assessed, both with and without pre-amplification by 333 sWGA. 4CAST achieved high sensitivity and depth without preliminary sWGA, generating a median of 43 334 read-pairs per locus from native templates representing 10 parasites/µl ( Fig. 3A ). Median depth 335 increased to 443 and 1312.5 read-pairs per locus for native templates representing 100 and 1000 336 parasites/µl, respectively. Read-pair counts were also evenly distributed among 4CAST loci using native 337 DNA ( Fig. 3A ). Unlike 4CAST, AMPLseq required sWGA for 3D7 mock samples representing ≤ 100 parasites/µl 339 ( Fig. 3B ). Following sWGA on mock samples representing 10 parasites/µl, the assay generated ≥ 10 340 read-pairs at a median of 126 loci, with a median of 465 read-pairs after excluding loci with fewer than 10 341 reads. Values were statistically similar for pre-amplified samples representing 100 parasites/µl and 342 increased to 692 read-pairs for pre-amplified samples representing 1000 parasites/µl ( S3 Fig. ) . 343 We also validated the sensitivity of 4CAST and AMPLseq for genotyping polyclonal infections by 344 using mock samples containing both 3D7 and Dd2 templates (likewise in 10 ng/µl human DNA). These 345 mixtures featured Dd2 at 50% ( i.e. , 1:1 3D7:Dd2 ratio), 25% (3:1), and 9% (10:1) relative abundance. 346 Total parasitemia levels ranged between 10 and 10000 parasites/µl. Both 4CAST and AMPLseq 347 generated microhaplotype calls with 100% precision at the 86 loci that are dimorphic between the 3D7 348 and Dd2 references (including all four 4CAST loci and an additional 82 loci in AMPLseq). This perfect 349 precision was observed at all parasitemia levels in both native and pre-amplified mock mixtures of the 350 two strains. 4CAST showed high sensitivity for Dd2 without the need for sWGA. At 1000 parasites/µl, the 352 assay detected Dd2-specific microhaplotypes at each of its four loci in all 1:1, 3:1, and 10:1 mixture 353 replicates ( Fig. 3C ). At 100 parasites/µl, median Dd2 sensitivity remained 100% at 1:1 and 3:1 ratios but 14 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2021. ; 354 lowered slightly to 94% at 10:1. Sensitivity for each strain decreased at 10 parasites/µl, with 1:1 ratios 355 yielding a median of 3 target loci for 3D7 and a median of 2 targets for Dd2; median sensitivity in these 356 samples rose to 3.5 and 3 loci (respectively) following pre-amplification with sWGA, but this led to 357 unbalanced read-pair support between the two strains ( S4A Fig. ) , possibly due to differential sWGA 358 success on low-quality Dd2 vs. high-quality 3D7 templates. 4CAST read-pair ratios generated from 359 native templates, by contrast, showed a very high correlation with input ratios at 100 p/µl ( S4A Fig. ) and 360 1000 p/µl ( Fig. 3C ). Ratios became less informative at 10 parasites/µl ( S4A Fig. ) . AMPLseq (with sWGA) was also successful in detecting Dd2-specific microhaplotypes, but only 362 at a maximum of 77 of 86 dimorphic loci (in the 1:1 ratio at 10000 parasites/µl). Dd2-specific sequences 363 were detected at a minimum of two dimorphic loci for all three input ratios (1:1, 3:1, 10:1) and 364 parasitemia levels (≥ 10 p/µl) assessed. Like with 4CAST, however, the use of sWGA decorrelated 365 read-pair ratios from input ratios ( S4B Fig. ) . Dd2 sensitivity was also reduced relative to 3D7 sensitivity 366 (median Dd2 sensitivity / 3D7 sensitivity = 58%) with sWGA. These discrepancies were not observed 367 with native templates ( Fig. 3D ) at ≥1000 parasites/µl for which AMPLseq achieves high read-pair support 368 without the use of sWGA. 369 We also tested both panels on genomic DNA extracted from dried blood spots collected by the 370 Guyana Ministry of Health in 2020 from individuals diagnosed as P. falciparum -positive via a rapid 371 diagnostic test (RDT). Ten Guyanese samples were tested with both panels, and an additional six were 372 tested with AMPLseq. Using 4CAST, we observed coverage across all loci in all samples, with a median 373 read-pair depth per locus of 1162 read-pairs without sWGA ( Fig. 3A ). Using AMPLseq (with sWGA), we 374 observed a median of 122 loci with ≥10 read-pairs and a median read-pair depth of 298 read-pairs per 375 covered locus ( Fig. 3B ) . 376 Additionally, we tested both panels on gDNA extracted from 16 dried blood spot samples 377 collected in Mali in 2011 (Tran et al., 2013) and subsequently stored at room temperature for ten years. 378 Using 4CAST (without sWGA), we observed a median read-depth of 407 read-pairs per locus ( Fig. 3A ) . 379 Using AMPLseq (with sWGA), we observed a median of 75 loci with ≥10 read-pairs and a median 380 read-depth of 112 read-pairs per covered locus ( Fig. 3B ). . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. 401 When data were simulated using microhaplotype frequency estimates of Senegalese parasites, we found 402 that almost all estimates of unrelated or clonal pairs were correctly classified, regardless of the panel 403 ( Fig. 4A) . All three large panels also performed similarly well in accurately identifying partially-related 404 parasite pairs, despite being the product of three distinct design processes. Neither 4CAST nor the 24 405 SNP barcode estimated relatedness for partially-related samples as well as the larger panels. We also 406 evaluated panel performance in less diverse parasite populations (Colombia and Thailand), including a 407 population not used in the panel designs (Colombia). We repeated the simulations using microhaplotype 408 frequencies estimated with these data. Again, we found that all panels performed well for estimating 409 relatedness of clonal pairs, and that the 24 SNP barcode and 4CAST were less likely to have correctly 410 classified estimates of non-clonal pairs. With the data simulated using Colombian microhaplotype 411 frequencies from the Pacific Coast region, all three large panels performed well for all three relatedness 412 values, despite the Colombian data not having informed the design of any of the panels. ( Fig. 4B ). Despite patient travel history metadata suggesting infections to have occurred in 416 various geographic regions of Guyana ( S1 Table ) , AMPLseq relatedness estimates for the Guyanese 417 sample set are significantly higher than those for the Malian sample set (Mann Whitney U, p < 0.001), 418 consistent with anticipated lower transmission levels in Guyana. Nevertheless, the wide range of 419 relatedness estimates (0.007 -1) observed among Guyanese sample comparisons suggests AMPLseq 420 capacity to indicate epidemiologically relevant microstructure even in relatively unstructured parasite 421 populations. For example, the first (lowest) quartile of pairwise relatedness estimates from Guyana was 422 enriched for comparisons involving A2-GUY and C5-GUY, two highly related samples that in whole 423 genome analysis show 50% relatedness with a sample from Venezuela (SPT26229, see S4 Table ) . . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. 442 Geographic attribution 443 We again engineered amplicon data in silico to evaluate the relative signal in genotyping panels 444 for geographic attribution of samples ( Fig. 5 ) . By sub-sampling WGS variant calls, calling 445 microhaplotypes, and visualizing these data using t-SNE plots, we found that results from both the 24 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint 465 We evaluated COI estimation based on 4CAST as opposed to the single locus AMA1 , which is 466 commonly used for this purpose, alone or with another locus (Lerch et al., 2017; Miller et al., 2017; 467 Nelson et al., 2019) . We engineered in silico samples with COI ranging from 2 -10 (100 simulations per 468 COI level) and used the maximum number of unique microhaplotypes present at any locus as a simple 469 objective method to estimate COI. 4CAST provided more accurate estimates of COI than AMA1 alone in 470 these simulated data, especially at simulated COI levels between 5 and 7. (Fig. 6A) . S6 Fig. indicates 471 that estimation improves at simulated COI=8 using AMPLseq, but to reap the full benefit of the larger 472 panel in practice will require a more elaborate probabilistic algorithm that accounts for variable 473 coverage/sensitivity among loci and incorporates a multinomial approach for polyallelic loci. In the absence of such an algorithm, we proceeded with the naive estimation method above to ( Fig. 6B ) , AMPLseq detected at most 4 (median = 2). These results reaffirm 4CAST as a tool of choice 489 for resolving higher COI levels and when parasitemia levels are low. AMPLseq may reach simulated 20 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint ( S6 Fig. ) by reducing sample multiplexing or sequencing on higher output platforms 491 ( e.g. , NovaSeq) for sample sets with low parasitemia. 492 Figure 6 . In silico and empirical complexity of infection (COI) inference. 493 A) Scatter plots of estimated COI for samples simulated from combinations of monoclonal WGS data, 494 subsetted to the loci of interest ( AMA1 locus or 4CAST loci). The x-axis represents the number of 495 monoclonal genomes combined into each simulation, and the y-axis represents the COI estimated using 496 the simulated data. COI was naively estimated as the maximum number of unique microhaplotypes 497 present at any locus per sample (n=100 samples per condition). Each dot represents a sample, jittered 498 for visibility. The black bars represent the median and light grey boxes represent the 25th -75th 499 quantiles. B) Estimated COI for clinical samples sequenced using 4CAST. COI was again estimated as 500 the maximum number of unique microhaplotypes present at any locus in the sample. 501 Longitudinal sampling: distinguishing recrudescence vs. reinfection 502 We used 4CAST to examine longitudinal samples that were likely to be diverse and polyclonal. 503 We sequenced samples from the same asymptomatic individual in the longitudinal Mali cohort over three 504 consecutive visits ( Fig. 7 ) (Tran et al., 2013) . We detected a single microhaplotype at each locus that 505 was present in the first two time points, suggesting a continued infection during the two weeks between 506 samples. At the third time point, we detected a single, distinct microhaplotype at each locus, suggesting 507 that a new infection had occurred and the original infection had disappeared or decreased below our limit 508 of detection (<10 p/µl). In this particular case, the individual was asymptomatic and did not receive 509 anti-malarial treatment between any samples; however, this simple example demonstrates the clarity that 510 4CAST can bring to tracking infection turnover in longitudinal studies, and suggests its utility in 511 distinguishing recrudescence vs. reinfection in therapeutic efficacy studies. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. 539 To test the ability of AMPLseq to detect P. vivax co-infections via co-amplification of PvDHFR , two 540 additional Guyanese blood spot samples that had been diagnosed as P. vivax -only (G4G430) and P. 541 vivax + P. falciparum co-infection (G4G180) via RDT were included in the sample set. These samples did 542 not undergo sWGA. PvDHFR was detected at high depth in both samples (1068 -1822 read-pairs for G4G430 and 544 234 -560 read-pairs for G4G180) ( Fig. 9 ). Only G4G180 also showed read-pair support at P. falciparum 545 panel loci (>10 read-pairs at 100 -115 loci). PvDHFR was not detected in any native or pre-amplified 546 3D7 or mixed-strain (3D7 + Dd2) templates. This demonstrates high specificity of both PvDHFR and P. 23 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2021. ; https://doi.org/10.1101/2021.09.15.21263521 doi: medRxiv preprint 547 falciparum AMPLseq primers to their intended target species without any apparent amplification inhibition 548 by the presence of congeneric DNA. PvDHFR was also detected at low levels (16 -30 read-pairs) in both native template replicates of 550 C7-GUY, one of the sixteen Guyanese samples previously diagnosed as P. falciparum -only via RDT. 551 Surprisingly, two PvDHFR read-pairs were also detected in one of the two sWGA replicates from the 552 sample, despite the expectation that sWGA would primarily amplify P. falciparum sequencines. 568 the performance of two new panels for P. falciparum , designed to serve different use cases and 569 exhibiting different per-sample costs and levels of complexity. Our comparative analyses of these two 570 new panels, AMPLseq and 4CAST, relative to previously published genotyping panels demonstrates that 571 they perform comparably to existing panels of similar composition across use cases, in a diversity of 572 geographic settings, despite different geographic representation in the population genomic data used to 573 inform their designs. This suggests that de novo custom panel design may not be required for accurate 574 COI and relatedness estimation in parasite populations from previously unstudied geographic regions. 575 We therefore suggest that future implementation of these panels should be guided by three criteria: 1) 576 the intended use cases for the data, 2) protocol complexity and compatibility with available instruments 577 and expertise, and 3) protocol customizability for locally relevant genetic loci. Considering the first of these criteria, intended use case, our investigations above suggest a 579 straightforward mapping of panels by size and feature to use case. The small 4CAST panel is well suited 580 to COI estimation ( Fig. 6 ) , and profiles four highly diverse antigens for the same effort and cost 581 traditionally used to profile a single locus. Because of the very high diversity of the loci in the 4CAST 582 panel in most parasite populations, this panel is also well suited to any application requiring genetic 583 delineation of distinct parasite lineages ( Fig. 7 ) . In therapeutic efficacy studies, for example, it is 584 essential to determine whether subjects who become parasitemic following drug treatment are exhibiting 585 a recrudescence of an incompletely-cleared strain from the initial infection (which could indicate 586 treatment failure), or if they have become reinfected with a distinct parasite strain subsequent to 587 treatment. We suggest that the 4CAST panel would be significantly more informative than traditional 588 genotyping approaches used in therapeutic efficacy studies, such as profiling length polymorphisms or ( Fig. 4 ) , despite different design criteria and 25 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. . The AMPLseq panel and its peers are also much better suited to detecting imported infections 599 given their improved capacity to distinguish parasites from distinct geographic locations ( Fig. 5 .) Finally, 600 the larger panels offer the capacity to monitor genetic markers associated with drug resistance ( Fig. 8 ) 601 or, in some panels, detect co-infection with other Plasmodium species ( Fig. 9 ). The second panel selection criterion, protocol complexity and compatibility with available 603 instruments, should be prefaced with a reminder that all of these protocols employ nested PCR reactions 604 as the fundamental mechanism to produce sequencing libraries targeting small genomic regions of 605 interest. Any molecular laboratory with a capacity for PCR and a small sequencing instrument such as an 606 Illumina iSeq100 will be capable of carrying out any of these protocols. The exquisite sensitivity of 607 amplicon sequencing implemented via an Illumina platform means that all of these protocols are also 608 susceptible to contamination. PCR reactions should be conducted in dedicated hoods, ideally in rooms or 609 locations physically removed from settings in which PCR products are manipulated. Finally, all of these 610 protocols share: 1) a requirement for careful cleanup of inappropriately large or small DNA molecules 611 from pooled libraries prior to sequencing, 2) precise quantification of said libraries for optimal loading on 612 the sequencing instrument, and 3) large batch sizes (samples in increments of 96, 192, or larger to 613 accommodate the capacity of the intended sequencing instrument). Though these AmpSeq protocols share many common features, they differ in other aspects that 615 may impact implementation. Whereas the 4CAST and Paragon panels and single nested PCR reactions 616 perform well on native DNA from clinical samples, the AMPLseq and SpotMalaria panels require sWGA 617 pre-amplification prior to the first PCR reaction to ensure adequate performance for samples with 618 parasitemia at or below 100 parasites/µl, which may comprise a significant proportion of samples in 619 some settings. sWGA is an isothermal amplification protocol that is relatively simple to perform but 620 requires an expensive phi29 DNA polymerase and a magnetic bead-based cleanup of individual samples 621 afterward, for an approximate additional cost of $8 USD per sample at the time of writing. Though not 26 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2021. ; 622 large in absolute terms, this cost is comparable to the cost of the AMPLseq or 4CAST protocols 623 themselves, which range from $5 -10 USD per sample, depending on details of implementation such as 624 sequencing instrument and sample indexing per run. The larger panels additionally employ differing 625 numbers of first-round PCR reactions and require a varying number of magnetic bead-based cleanups to 626 tailor the length profile of intermediate products (summarized in S5 Table ) , which means that the local 627 capacity for automating the bead-based cleanups is a relevant implementation consideration. The third criterion for panel selection, customizability, may be most relevant for the drug 629 resistance surveillance use case, given differences in the geographic distribution of important drug 630 markers, and varying coverage of known markers by the existing panels. All of the protocols are 631 amenable to customization through the addition of independent target amplifications in the first round of 632 PCR, which could be combined with other first-round PCR multiplex products prior to the second PCR. A 633 more elegant customization approach would be to add (or subtract) targets from the first-round PCR 634 reaction. While complicated bioinformatic pipelines are useful or essential in the design of large 635 multiplexes, in our experience, small multiplexes like 4CAST, which was made from pre-existing primer 636 pairs designed independently, may simply function without optimization, and could presumably be 637 augmented with a small number of additional loci. Though the AMPLseq multiplex of 129 PCR loci 638 benefited from careful design of the original panel, we added the 4CAST targets to the designed 639 AMPLseq target set with no primer modifications and found it to be functional, suggesting it is likely 640 receptive to further augmentation. As the AMPLseq and 4CAST protocols utilize unmodified, 641 commercially available oligos as primers, further customization should be feasible in any setting. 642 However, we must note that not all targets are amenable to incorporation into the multiplex, as we failed 643 despite multiple attempts to include amplicons targeting the pfcrt gene associated with chloroquine 644 resistance (Fidock et al., 2000) , or the hrp2/3 genes, which can contain deletions that lead to 645 false-negative diagnosis via rapid diagnostic test (Gamboa et al., 2010) . 646 The proliferation of new AmpSeq protocols for molecular surveillance of infectious diseases 647 raises the important question of whether it is valuable for each disease field to converge on a single 648 approach or common panel. Factors precluding a completely homogeneous approach include varying 27 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2021. ; 28 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2021. ; 702 been shared with the provider communities and the broader scientific community (see above), and the 703 research addresses a priority concern, in this case the public health concern of malaria. More broadly, 704 our group is committed to international scientific partnerships, as well as institutional capacity building. 30 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 22, 2021. ; https://doi.org/10.1101/2021.09.15.21263521 doi: medRxiv preprint 000 parasites/µl positive control as described above but using Dd2 instead of 198 3D7 strain genomic DNA. We generated mixed-strain control templates by combining the 10,000 3D7 199 parasites/µl control with this 10 1 ratios to 1000 and 100 parasites/µl concentrations using 10 ng/µl human genomic DNA diluent as 202 before. We also applied selective whole genome amplification (sWGA) to all above control templates 203 representing ≤ 1000 parasites/µl. The 50 µl sWGA reaction followed Oyola et al Agencourt AMPure XP beads (Beckman Coulter) on the KingFisher Flex ( S3 Protocol ) and verified 206 amplification success via NanoDrop We tested the panels on clinical dried blood spot (DBS) samples from Mali and Guyana. Tran et 209 al. collected samples in Kalifabougou The Kalifabougou cohort study was approved by the Ethics Committee of the Faculty of Medicine, 211 Pharmacy and Dentistry at the University of Sciences, Technique and Technology of Bamako, and the 212 Institutional Review Board of the National Institute of Allergy and Infectious Diseases, National Institutes 213 of Health Written informed consent was obtained from participants or parents or guardians of participating children 215 before inclusion in the study. The Guyana Ministry of Health collected samples from Port Kaituma and 216 Georgetown, Guyana between We then extracted gDNA 222 following the DNA purification from buccal swab section of the KingFisher Ready DNA Ultra 2.0 Prefilled 649 instrumentation, expertise, and use cases for the data across settings, in addition to an anticipated 650 onward evolution of genotyping technology and elucidation of new markers of interest for drug resistance 651 or other phenotypes. Factors favoring convergence include opportunities for improved procurement of 652 instruments and reagents at a regional level in malaria-endemic countries, and opportunities to directly 653 compare observations between studies and surveillance efforts led by different groups. This latter factor, 654 which we term portability of analyses, has the potential to provide regional or global insight through 655 syntheses across studies. However, the portability of certain analyses is hampered by ascertainment 656 bias, an inherent limitation of any targeted sequencing approach for analyses based on the genotypic 657 state of select loci in different countries. That is, a panel designed based on observations of genetic 658 diversity through WGS in countries A and B may not provide a fair means of comparing diversity WGS is the ultimate tool for avoiding this bias. However, the problems of comparing parasite populations rather than actual genotypic diversity measures. Overlap of loci among panels would further 663 facilitate direct assessment of relatedness between samples included in different studies The AMPLseq panel we describe here contains a 665 significant number (n=47) of targets from the SpotMalaria panel, and we expect that future P. falciparum 666 panel designs will also tend to exhibit some degree of overlap with other panels As molecular surveillance efforts for malaria and other diseases are more widely adopted and 670 become increasingly diverse, it will be essential for the community to develop standardized approaches 671 for the design, validation, interpretation, and sharing of targeted amplicon sequencing data. The 672 paneljudge R package described here provides an excellent means to comparatively evaluate existing 673 and hypothetical panel performance via data collected from previous population genomic surveys, and 674 the bioinformatic analysis pipelines we have developed are suitable for interpreting Illumina data from This project has been funded in whole or in part with Federal funds from the National Institute of 680 Allergy and Infectious Diseases This project was also supported by 682 an NIH R01 award to DN (R01AI141544), an award from the Bill and Melinda Gates Foundation to DN 683 and COB (OPP1213366), and a Broad Institute NextGen Award to BM. The Mali cohort study was 684 funded by the Division of Intramural Research, National Institute of Allergy and Infectious Diseases, 685 National Institutes of Health. The Colombian cohort study was supported by British Council Fund Institutional Links Award G1854. We thank MalariaGen for use of the Colombian 687 WGS data. We thank Annie Laws for project management. We thank Dr sra) under accession 692 PRJNA758191. This publication uses data from the MalariaGEN Plasmodium falciparum Community 693 Project as described online pending publication and public release of dataset Pf7 Previously published data from Guyana and French Guiana can be found at SRA BioProjects 696 PRJNA543530 and PRJNA242182, respectively. Software and documentation can be A molecular marker of artemisinin-resistant 724 Plasmodium falciparum malaria Drug-Resistance and Population Structure of 728 Plasmodium falciparum Across the Democratic Republic of Congo Using High Molecular Inversion Probes Microhaplotypes 732 provide increased power from short-read DNA sequences for relationship inference. Molecular 733 Ecology Resources Genomic insights into the emergence and 735 spread of antimicrobial-resistant bacterial pathogens Development of a single nucleotide polymorphism barcode to genotype Plasmodium 741 vivax infections High-resolution sample inference from Illumina amplicon data Genotyping-in-Thousands by sequencing 747 (GT-seq): A cost effective SNP genotyping method based on custom amplicon sequencing Longitudinal genomic surveillance of Plasmodium falciparum malaria parasites reveals complex 752 genomic architecture of emerging artemisinin resistance Mapping imported malaria in Bangladesh using parasite genetic and human mobility data. ELife , 758 8 , e43481 THE REAL McCOIL: A method for the concurrent 762 estimation of the complexity of infection and SNP allele frequency for malaria parasites Propeller Domain Mutant Allele C580Y in Guyana Molecular Profile of Malaria Drug 771 Resistance Markers of Plasmodium falciparum in Suriname. Antimicrobial Agents and Longitudinal genomic 775 surveillance of MRSA in the UK reveals transmission patterns in hospitals and the community Use cases for genetic 778 epidemiology in malaria elimination & 1000 Genomes Project Analysis Group Modeling malaria genomics reveals transmission 787 decline and rebound in Senegal A general SNP-based molecular barcode 791 for Plasmodium falciparum identification and tracking A framework for variation 796 discovery and genotyping using next-generation DNA sequencing data Detection of low-density Plasmodium falciparum infections using amplicon 800 deep sequencing Mutations in the P. falciparum digestive vacuole transmembrane protein PfCRT and 804 evidence for their role in chloroquine resistance SNP barcodes provide higher 808 resolution than microsatellite markers to measure Plasmodium vivax population genetics COIL: a methodology for evaluating malarial complexity of infection using likelihood from 813 single nucleotide polymorphism data A Large Proportion of P. falciparum Isolates in the 817 Amazon Region of Peru Lack pfhrp2 and pfhrp3: Implications for Malaria Rapid Diagnostic Tests Amplicon deep sequencing improves 820 Plasmodium falciparum genotyping in clinical trials of antimalarial drugs Mating systems and 823 predictors of relative reproductive success in a Cutthroat Trout subspecies of conservation 824 concern 828 biomarkers provide accurate estimates of recent Plasmodium falciparum exposure for individuals 829 and communities Identity-by-descent analyses for measuring 832 population dynamics and selection in recombining pathogens Genetic surveillance in the Greater Mekong subregion and South Asia to support malaria control 838 and elimination. ELife , 10 , e62997 Should 840 deep-sequenced amplicons become the new gold-standard for analysing malaria drug clinical 841 trials? Spatial and molecular mapping of Pfkelch13 gene 846 polymorphism in Africa in the era of emerging Plasmodium falciparum resistance to artemisinin: a 847 systematic review Spatio-temporal dynamics of Plasmodium falciparum 853 transmission within a spatial unit on the Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut 856 Surveillance of molecular markers of Plasmodium falciparum artemisinin resistance 860 (kelch13 mutations) in Papua New Guinea between Simple real-time PCR and 864 amplicon sequencing method for identification of plasmodium species in human whole blood Development of amplicon deep sequencing markers and data 868 analysis pipeline for genotyping multi-clonal malaria infections Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM Confirmation of the absence 874 of local transmission and geographic assignment of imported falciparum malaria cases to China 875 using microsatellite panel Genomic epidemiology of artemisinin 878 resistant malaria. ELife , 5 , e08714 The Genome Analysis Toolkit: a MapReduce 886 framework for analyzing next-generation DNA sequencing data Indels, structural variation, 891 and recombination drive genomic diversity in Plasmodium falciparum Summer Rae, & Tim Millar. 894 (2020). cggh/scikit-allel: v1.3.2 A deep sequencing approach to estimate 897 Plasmodium falciparum complexity of infection (COI) and explore apical membrane antigen 1 898 diversity Genetic architecture of artemisinin-resistant Plasmodium falciparum Emergence of artemisinin-resistant Plasmodium falciparum with kelch13 C580Y mutations on the island of New Guinea Independent 911 Evolution of Pyrimethamine Resistance in Plasmodium falciparum Isolates in Melanesia Development of a new barcode-based, multiplex-PCR, next-generation-sequencing 917 assay and data processing and analytical pipeline for multiplicity of infection detection of 918 Plasmodium falciparum Describing the current status of Plasmodium falciparum population structure and drug resistance 923 within mainland Tanzania using molecular inversion probes Empowering conservation practice with efficient and economical genotyping from poor 927 quality samples Genetic Diversity and Protective Efficacy of the RTS,S/AS01 Malaria Vaccine Advances and opportunities in malaria population High-resolution micro-epidemiology of parasite spatial and temporal 938 dynamics in a high malaria transmission setting in Kenya The next phase of SARS-CoV-2 surveillance: real-time molecular 942 epidemiology Whole genome sequencing of Plasmodium falciparum 946 from dried blood spots using selective whole genome amplification Adaptive evolution of 950 malaria parasites in French Guiana: Reversal of chloroquine resistance by acquisition of a 951 mutation in pfcrt Real-time, portable genome sequencing for Ebola surveillance A simple method for typing Plasmodium falciparum merozoite 959 surface antigens 1 and 2 (MSA-1 and MSA-2) using a dimorphic-form specific polymerase chain 960 reaction Clinical malaria incidence following an outbreak in Ecuador was 964 predominantly associated with Plasmodium falciparum with recombinant variant antigen gene 965 repertoires hmmIBD: software to infer 968 pairwise identity by descent between haploid genotypes Genotyping-in-Thousands by sequencing (GT-seq) panel development and application to 972 minimally invasive DNA samples to support studies in molecular ecology Culture-free genome-wide locus sequence typing (GLST) provides new perspectives on 977 Genotyping of Plasmodium spp. Nested PCR Independent 984 emergence of artemisinin resistance mutations among Plasmodium falciparum in Southeast Asia paneljudge: Judge the performance of a panel of genetic markers 987 using simulated data Estimating Relatedness Between Quantifying connectivity between 992 local Plasmodium falciparum malaria parasite populations using identity by descent Absence of Putative Artemisinin Resistance Mutations Among Plasmodium falciparum in Africa: A Molecular Epidemiologic Study. The Journal of Infectious Diseases Sensitive, highly multiplexed sequencing of microhaplotypes from the Plasmodium falciparum 1003 heterozygome. The Journal of Infectious Diseases Using parasite genetic and human mobility data to infer local and cross-border malaria 1008 connectivity in Southern Africa Human malaria parasites in continuous culture An Intensive Longitudinal Cohort Study of Malian Children and Adults 1014 Clinical 1015 Infectious Diseases: An Official Publication of the Infectious Diseases Society of America From FastQ data to high confidence variant calls: the Genome Analysis 1020 Toolkit best practices pipeline Globally prevalent PfMDR1 mutations modulate 1024 Plasmodium falciparum susceptibility to artemisinin-based combination therapies World Malaria Report