key: cord-0873043-wfctm3dy
authors: Steinig, Eike; Duchêne, Sebastián; Aglua, Izzard; Greenhill, Andrew; Ford, Rebecca; Yoannes, Mition; Jaworski, Jan; Drekore, Jimmy; Urakoko, Bohu; Poka, Harry; Wurr, Clive; Ebos, Eri; Nangen, David; Manning, Laurens; Laman, Moses; Firth, Cadhla; Smith, Simon; Pomat, William; Tong, Steven Y C; Coin, Lachlan; McBryde, Emma; Horwood, Paul
title: Phylodynamic Inference of Bacterial Outbreak Parameters Using Nanopore Sequencing
date: 2022-02-16
journal: Mol Biol Evol
DOI: 10.1093/molbev/msac040
sha: 9bc424e70a15ca1f2df45962a35257c7d9ada795
doc_id: 873043
cord_uid: wfctm3dy

Nanopore sequencing and phylodynamic modeling have been used to reconstruct the transmission dynamics of viral epidemics, but their application to bacterial pathogens has remained challenging. Cost-effective bacterial genome sequencing and variant calling on nanopore platforms would greatly enhance surveillance and outbreak response in communities without access to sequencing infrastructure. Here, we adapt random forest models for single nucleotide polymorphism (SNP) polishing developed by Sanderson and colleagues (2020. High precision Neisseria gonorrhoeae variant and antimicrobial resistance calling from metagenomic nanopore sequencing. Genome Res. 30(9):1354–1363) to estimate divergence and effective reproduction numbers (R(e)) of two methicillin-resistant Staphylococcus aureus (MRSA) outbreaks from remote communities in Far North Queensland and Papua New Guinea (PNG; n = 159). Successive barcoded panels of S. aureus isolates (2 × 12 per MinION) sequenced at low coverage (>5× to 10×) provided sufficient data to accurately infer genotypes with high recall when compared with Illumina references. Random forest models achieved high resolution on ST93 outbreak sequence types (>90% accuracy and precision) and enabled phylodynamic inference of epidemiological parameters using birth–death skyline models. Our method reproduced phylogenetic topology, origin of the outbreaks, and indications of epidemic growth (R(e) > 1). Nextflow pipelines implement SNP polisher training, evaluation, and outbreak alignments, enabling reconstruction of within-lineage transmission dynamics for infection control of bacterial disease outbreaks on portable nanopore platforms. Our study shows that nanopore technology can be used for bacterial outbreak reconstruction at competitive costs, providing opportunities for infection control in hospitals and communities without access to sequencing infrastructure, such as in remote northern Australia and PNG.

Sequence data from infectious disease outbreaks have provided critical information for infection control and inference of pathogen transmission dynamics, including during the West African Ebola virus epidemic (Quick et al. 2016 ) and the current severe acute respiratory syndrome coronavirus 2 Article ß The Author(s) 2022. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https:// creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com (SARS-CoV-2) pandemic (Bull et al. 2020) . Maximumlikelihood (ML) and Bayesian phylodynamic methods are commonly used to date the emergence of lineages, variants, and outbreaks, and to estimate key epidemiological parameters, such as changes in the effective reproduction number over time (R e ; Hodcroft et al. 2021; Volz et al. 2021) . Oxford Nanopore Technology (ONT) sequencing has emerged as viable technology for real-time genomic epidemiology and has been applied across national-scale SARS-CoV-2 surveillance efforts in the United Kingdom, Denmark, and Australia, amongst other countries (da Silva Filipe et al. 2021; du Plessis et al. 2021; Hammer et al. 2021; Nicholls et al. 2021) . Moreover, nanopore sequencing devices can, in theory, be operated in low-and middle-income countries (LMICs) where local genomics infrastructure may be lacking or is difficult to access (Faria et al. 2017; Giovanetti et al. 2020) , so that a timely outbreak response is not feasible (Gardy and Loman 2018) . Accessible genomics infrastructure is particularly relevant for continuous surveillance at bacterial evolutionary time scales, where outbreak strains may circulate for years, and can persist in human and animal reservoirs or the environment outside their host. Furthermore, viral pathogen genomes, such as Ebola virus or SARS-CoV-2, are often sequenced directly from patient samples using targeted PCR-based enrichment approaches, achieving high-genome coverage and resolution capable of informing phylodynamic models Bull et al. 2020 ). However, nanopore sequencing for bacterial pathogens, coupled to Bayesian phylodynamic models, has so far not been considered for routine epidemiological applications, mainly due the need for sufficiently accurate single nucleotide polymorphism (SNP) calling at bacterial whole-genome scales . SNP calls from high-coverage (>30Â) Illumina data are the current standard for accurate SNP calls used in phylogenetic applications, but current generation nanopore SNP calling has suffered from low raw sequence read accuracy (R9.4.1, < Guppy v5.0) and a focus on variant calling in human genomes, with much of the available callers developed specifically for human variants (Luo et al. 2020 ). This problem is further aggravated when attempting to sequence costeffectively, for example, using low-coverage multiplexed runs (<5-10Â) and simple library preparation with ONT sequencing kits (at least R9.4.1 pore architecture, SQK-RBK004 libraries) that can be used in LMICs with large burdens of bacterial disease.

Phylodynamic inference on nanopore platforms is further complicated because (ideally) an outbreak reference genome is used, that is closely related to the outbreak sequence type, thus providing sufficiently high variant calling resolution for transmission inference, particularly in recent transmission chains or outbreaks (Gorrie et al. 2021 ). In addition, on bacterial time scales (years) little sequence variation will have occurred in newly emergent outbreaks, which places a disproportionate emphasis on correctly inferring the few available outbreak-specific SNPs. As a consequence, there is little room for systematic errors introduced by base and variant callers when using (low coverage) nanopore sequencing data to effectively survey bacterial outbreaks. Neural network-based, nanopore-native variant callers in particular can introduce excessive false-positive (FP) SNP calls, complicating transmission inference from ONT sequence data, where accuracy and precision are required . Within-lineage phylodynamic inference for bacterial outbreaks additionally depends on sufficient temporal signal to ascertain a phylodynamic threshold, at which sufficient molecular evolutionary change has accumulated in the sample to obtain robust phylodynamic estimates (Duchêne, Geoghegan et al. 2016; Duchene, Featherstone, et al. 2020; . Due to slower substitution rates in bacteria compared with viruses (Duchêne, Holt et al. 2016) , longitudinal sample collections are optimal for genomic epidemiology and often require multiple years of data to infer transmission dynamics of the sampled population. Requirements for accurate whole-genome SNP calls across a large number of isolates, sequenced cost-effectively at low-genome coverage and over a sufficient interval of time, represent a significant barrier to the implementation of phylodynamic modeling for bacterial pathogens.

Illumina hybrid-corrected and ONT-native phylogenetic analyses methods have been demonstrated for a small number of distantly related bacterial nanopore genomes and genome assemblies from the same species for example, Neisseria gonorrhoeae (Golparian et al. 2018; Sanderson et al. 2020) or between species from environmental sources (Urban et al. 2021) . Recently, a six-strain multiplex protocol for the MinION with genome assembly and determination of phylogenetic relationships to identify outbreaks has been tested for Staphylococcus aureus lineages sampled in Norway. However, it remains unclear whether full within-lineage phylodynamic modeling is possible at population-level scale, whether estimates from nanopore data match results obtained using SNP calling with Illumina reads and whether sequencing runs can be conducted cost-effectively (at least 24 isolates per run). In this study, we adapt a variant polishing approach first implemented by Sanderson et al. (2020) on metagenomic sequencing of N. gonorrhoeae using random forest classifiers to filter SNP calls from the nanopore-native variant callers Medaka v1.2.3 and Clair v2.1.1 (Luo et al. 2020) . We use Snippy Illumina variant profiles as reference data and investigate caller performance across reference genomes and outbreak data sets. We show that random forest classifiers sufficiently remove incorrect calls from Clair in outbreak isolates with >5Â coverage to allow for sequencing of 24 community-associated S. aureus isolates per MinION flow cell (n ¼ 181) which successfully resolved phylodynamic parameters estimates of two outbreaks of ST93-MRSA-IV in remote Far North Queensland (FNQ) and Papua New Guinea (PNG).

In this study, we adapt the use of random forest classifiers for SNP polishing of nanopore sequence data to reduce excessive FP SNP calls, which have so far prevented accurate phylogenetic reconstruction of bacterial outbreaks. Our approach enables the inference of outbreak source, timing, and effective reproduction numbers using Bayesian phylodynamic models. Steinig et al. . https://doi.org/10.1093/molbev/msac040 MBE We validated the method on new sequence data from two outbreaks of community-associated S. aureus in remote communities of northern Australia and PNG.

We sequenced a total of 181 unique isolates from a pediatric osteomyelitis outbreak (collected between 2012 and 2018) in the Papua New Guinean highland towns Kundiawa (Simbu Province, n ¼ 42) and Goroka (Eastern Highlands Province, n ¼ 45). We additionally sequenced haphazardly collected blood cultures from a hospital in Madang (Madang Province, n ¼ 8) and strains from routine community surveillance across FNQ collected in 2019 (Cairns and Hinterlands, Cape York Peninsula, Torres Strait Islands, processed at Cairns Hospital, n ¼ 86; fig. 1 , supplementary tables, Supplementary Material online). ONT sequencing was conducted using a minimal, dual-panel barcoding scheme, multiplexing 2 Â 12 isolates interspersed with a nuclease flush on a single MinION flow cell (R9.4.1, EXP-WSH-003) for a total of 96 barcodes per outbreak (including isolate re-runs that were merged, n ¼ 12, and external isolates excluded here, n ¼ 3). Rapid barcode sequencing libraries (RBK-004) were prepared omitting magnetic bead clean-ups after enzymatic digestion of cultured strains and simple spin column extraction. Panels produced between 0.506 and 6.47 Gb of sequence data per fig. 2A ). We excluded one infection with S. argenteus (FNQ) and one coinfection with Mammaliococcus sciuri (PNG). Isolates with matching Illumina data were retained to create a high-quality reference data set for further evaluation of genome assembly and variant calling (n ¼ 159, fig. 1 ).

Short-read reference genomes (fragmented but highly accurate), long-read polished nanopore genomes (contiguous assemblies but less accurate), and long-read hybrid genomes corrected with Pilon (Walker et al. 2014) , which first assembles short reads and then scaffolds the assemblies with long reads to generate contiguous wholegenome assemblies. Compared with Illumina reference assemblies, SNP and indels were frequently occurring in low-coverage uncorrected nanopore assemblies ( fig. 2A, right) . Errors were considerably reduced in high-coverage isolates leading to assembly identities ranging between 0.9993 and 0.9999 in the dnadiff metric (Marçais et al. 2018 ; supplementary tables, Supplementary Material online). Recovery of complete chromosomes and S. aureus specific genotypes from uncorrected long-read assemblies was sufficient for high-coverage isolates in our collection ( fig. 2B , > 80-90%). Assembly genotyping for clinically relevant features such as the presence of mecA or the Panton-Valentine leukocidin (PVL), major subtypes of SCCmec elements, resistance genes, and other markers of interest showed high concordance with reference assemblies ( fig. 2B ). In contrast, low-coverage assemblies often failed to call genotypes-recovery was low for mecA and SCCmec types, as well as for PVL and other markers of interest ( fig. 2B , <60%, supplementary tables, Supplementary Material online). Hybrid long-read correction with Pilon did MBE not markedly improve genotype recovery in low-coverage isolate; however, recovery improved in the Unicycler hybrid assemblies ( fig. 2A and B). Lower SCCmec subtyping performance was likely due to remaining insertions or deletions from nanopore data impacting on the large cassette chromosomes (>20 kb). Unicycler produced more accurate hybrid assemblies than correction of long-read assemblies with Pilon alone and performed slightly better in hybrid assemblies of low-coverage nanopore data ( fig. 2B ). For genome assembly and genotyping, our dual-panel sequencing approach recovers nanopore genotypes in high-coverage isolates (>10Â) although some errors remain, particularly in sequence type calling and SCCmec subtyping.

Next, we aimed to accurately reconstruct the PNG and FNQ outbreaks within the ML background phylogeny of ST93. Subsequent phylodynamic analysis is challenging because accurate reconstruction of branch lengths within the nanopore clades is required for reproduction of the Bayesian epidemiological parameters. We first tried a candidate-driven approach, using Illumina core SNP panels from the ST93 background population (Snippy, n ¼ 444, 6,616 SNPs) and

Megalodon which accurately reconstructed the divergence of the PNG clusters from the Australian East Coast (supplementary fig. S1, Supplementary Material online). However, within-outbreak branch lengths were not reconstructed, because novel variation had accumulated since the divergence from the Australian East Coast population in the 1990s. We therefore decided to use a de novo variant calling approach comparing two native nanopore variant callers based on neural network architectures, by default trained on Homo sapiens variant calls (Clair v2.1.1) or a mix of human and microbial data from Escherichia coli, Saccharomyces cerevisiae, and H. sapiens (v1.2.3). Although recall was high, raw basecaller performance was exceedingly low in both Clair and Medaka accuracy and precision, particularly in outbreak isolate calls against the outbreak reference genome (<20%, supplementary fig. S2 , Supplementary Material online). We next adopted SNP polishers using random forest classifiers originally developed by Sanderson and colleagues (2020) 3B-D) ; because there were no considerations of specific training sets used in the original N. gonorrhoeae classifier, we trained S. aureus classifiers on three combinations of isolates including a mixed set of three sequence types (ST93, ST88, ST15; saureus_mixed) and two sets of outbreak sequence type isolates (ST93) from either FNQ (saureus_fnq) or PNG (saureus_png). In combination with the original N. gonorrhoeae classifiers, the different training sets allowed us to evaluate whether SNP polishing was effective using models from a different species entirely (Sanderson) , from the same species but without outbreak-related data (saureus_mixed) or from the same species, but with isolates from the same sequence type or outbreak (saureus_fnq, saureus_png 1B ). Evaluations indicated that all trained SNP polishers increased accuracy and precision with slight reductions in recall ( fig. 4) . However, suboptimal performance was observed in all metrics for the N. gonorrhoeae classifier across outbreak sequence types (<40%) as well as other sequence types (<50%). Performance improved considerably in the mixed S. aureus polisher (saureus_mixed) both among outbreak isolates (69.52% 6 12.48 accuracy, 75.94% 6 14.56 precision) and other sequence types (81.94% 6 14.56 accuracy, 90.11% 6 6.83 precision). However, despite significant baseline improvement, the interspecies and mixed-sequence type models the number of FP SNP calls remained in the range of 100s to 1,000s (right column, fig. 4A and B). Training the models with isolates from the same sequence type (ST93, FNQ) slightly improved performance (ST93: 71.69% 6 13.99 accuracy, 83.33% 6 10.42 precision) but reductions of accuracy and recall in other sequence types were observed ( fig. 4C ). PNG outbreak-derived model (saureus_png) performed best for polishing isolates from the same outbreak across all metrics in the high-coverage isolates (ST93: 69.28% 6 16.78 accuracy, 87.57% 6 9.83 precision) but incurred a steeper cost to accuracy and recall in nonoutbreak isolates ( fig. 4D ). Reductions indicate that the model trained on features specific to the outbreak genotype and became significantly less generalizable to other sequence type applications. We note that the levels of precision and accuracy of the ST93 polishers in absolute numbers translate to 1-10 s of false SNP calls compared with the N. gonorrhoeae and mixed-sequence type model ( fig. 4 ).

We next implemented Snippy's core alignment functionality, calling sites present in all isolates of the sampled population, with a minimum SNP site coverage of 1Â (JKD6159). Hybrid alignments integrated Illumina background SNPs from the ST93 (outbreak) lineage (n ¼ 444) in combination with polished ONT nanopore calls from Clair ( fig. 1 ). The lineage background alignment, as one would use for short-read reference data, therefore served as a backbone for ONT data in the core-site alignment ( fig. 4B ). We retained isolates with at least 5Â coverage (n ¼ 531/562) due to low accuracy and precision of these isolates in the SNP polishing step ( fig. 4, supplementary fig.  S4 , Supplementary Material online). We then used the between-species, within-species, within-lineage (FNQ and PNG) models to apply for variant polishing in our de novo core alignment and phylodynamics pipeline ( fig. 3A) .

NanoPath's core alignment construction reproduced Snippy's core alignment from Illumina data closely (6,319 SNPs with NanoPath vs. 6,580 SNPs with Snippy-core, fig. 5A and B). When we called Clair SNPs on isolates with >5Â coverage from PNG (n ¼ 56) and FNQ (n ¼ 32), we observed a vast excess of SNP calls, particularly in the raw Clair calls, where the hybrid core alignment contained 491,210 SNP sites and was considered unusable (supplementary table 7, Supplementary Material online). All polished SNPs produced reasonable alignments, where FNQ and PNG polishers produced alignments closest to the Illumina reference ( fig. 5 , supplementary table 2, Supplementary Material online). We reconstructed the ML phylogenies from these alignments in RAxML-NG using the GTR þ G model with Lewis' ascertainment bias correction and rooted the trees on SRR115752 for comparison of topological consistency (van Hal et al. 2018) . We also wanted to investigate whether the main introductions into FNQ and PNG could be reconstructed with accurate interpretations of their source divergence on the Australian East Coast. For reference, we used Illumina alignments constructed with NanoPath (Materials and Methods) and Snippy-core with matching isolates (n ¼ 531, fig. 5 ).

All major clades and subpopulations of the background population (North West, East Coast, Northern Territory, and New Zealand) including the outbreaks in FNQ and PNG were accurately reconstructed as referenced by the Illumina trees ( fig. 5 ). Minor topological variations were observed in the position of the PNG-1 and PNG-2 introductions (greens), and the southern East Coast and New Zealand subclade (seagreen) of the East Coast population (turquoise, fig. 5 ). However, there were no major topological inconsistencies that would affect interpretation of the source population. In all topologies, the outbreaks from PNG derived from the East Coast ST93-MRSA-IV clade, and the FNQ outbreak derived from the Northern Territory reintroduction (fig. 5). Regional transmissions into the United Kingdom and Australia within the outbreak clusters remained identifiable (black and red branches in PNG-1 and PNG-2). Introductions into FNQ from other parts of the population are evident from both the reference and the polished alignments (red branches in East Coast, PNG and Northern Territory clades). Branch lengths of the nanopore-sequenced clades were similar to the reference ML tree, but were excessive in the between-species N. gonorrhoeae polished alignments as well as in the mixed The alignment based on SNPs polished using outbreak sequence type (ST93) isolates was most consistent with the Illumina reference phylogeny of ST93. We note that within-lineage polishing did not require within-outbreak polishers, for example, FNQ-trained polishers reproduced PNG outbreak divergence and vice versa.

We next investigated the performance of Bayesian phylodynamic methods to estimate the divergence date and effective reproduction number using birth-death skyline models with serial (PNG) or contemporaneous (FNQ) sampling and lineage-specific prior configurations. We ran BEAST2 Markov chain Monte Carlo (MCMC) chains on the outbreak subsets of the full SNP alignment with sufficient isolates (n PNG-1 ¼ 53; n FNQ ¼ 32) using a fixed substitution rate of the wholelineage median posterior estimate (3:199 Â 10 À4 ). This was necessary as nonrandom sampling (subsetting the alignment to the outbreak clade) removes the temporal signal in the comparatively recent outbreaks, and thus leads to an overestimation of the outbreak MRCA. We note that the models were efficiently run on a standard NVIDIA GTX1080-Ti graphical processing unit (GPU) using BEAST2 with the BEAGLE library at speeds of <3-4 min/million steps in the MCMC (5-7 h per run and GPU) making timely parameter estimation for outbreak responses feasible on low-cost hardware. On an NVIDIA P100 GPU, walltime decreased to <50 s to 1 min/ million steps in the MCMC, around 1-2 h walltime per run and GPU on a distributed system. Estimates were consistent with full lineage-wide analysis (R e > 1.5-2.0) and we observed robust estimates in an exploration of the R e prior (supplementary table S2 and figs. S6 and S7, Supplementary Material online). We therefore demonstrate that SNP polishing enables the use of birth-death skyline models for outbreak parameter estimation, even with low-coverage nanopore sequencing data (5Â to 10Â). Finally, we implemented training, evaluation, and deployment of SNP polishers for within-lineage transmission modeling in Nextflow.

In this study, we adopted a method for variant polishing for phylodynamic modeling of bacterial whole-genome data Phylodynamic Inference of Bacterial Outbreak Parameters . https://doi.org/10.1093/molbev/msac040 MBE using low-coverage nanopore sequencing, and applied it to outbreaks of community-associated S. aureus in remote FNQ and PNG. Previous studies using (high coverage) nanopore data have evaluated phylogenetic reconstructions on few and distantly related isolates of N. gonorrhoeae as well as other bacterial genomes from assembly (Golparian et al. 2018; Sanderson et al. 2020; Urban et al. 2021) . A recent pipeline for cluster identification using six strains per MinION flow cell (42 on 7 flow cells) successfully identified clusters in four distinct lineages, using a whole-genome assembly-based phylogeny (Ferreira et al. 2021 ). However, full outbreak reconstruction within the outbreak lineage-allowing for Bayesian model applications to estimate epidemiological parameters within the phylogeny-has so far not been conducted. Here, we show that the application of random forest SNP polishers developed by Sanderson and colleagues (2020) can sufficiently reduce the number of FP SNP calls from neural network variant caller Clair v.2.1.1 (Luo et al. 2020) . Hybrid lineage alignments of ONT sequence and Illumina background data of the outbreak lineage (ST93) can be constructed, and effective reproduction numbers accurately modeled using birth-death skyline models in BEAST2.

We evaluated genotype reconstruction against previously sequenced Illumina data demonstrating the superior quality of hybrid assembly with Unicycler. Our genotyping analysis showed that for high-coverage isolates (>10Â) genotyping directly from polished nanopore assembly was comparable to hybrid Illumina-ONT approaches ( fig.  2) . We used the most recent models at the time of writing for base calling (Bonito v0.3.6) followed by polished long-read assembly or hybrid assembly. With the imminent release of R10.3 pores and expected increases in raw-read accuracy the remaining misclassifications in genotypes from assemblies (mostly in MLST and SCCmec subtyping) will be eliminated and produce nanopore assemblies comparable to reference assemblies at >10Â coverage. We chose here to implement a rapid and minimal protocol to evaluate its application in reference laboratories without local access to sequencing infrastructure, such as at the Australian Institute of Tropical Health and Medicine or the Papua New Guinea Institute of Medical Research. Our method requires some context from genomic surveillance at the level of full lineages (e.g., ST93 or ST772) in order to situate nanopore-sequenced outbreaks within the wider lineage context and fix the clade birth-death model substitution rate. Given that substitution rates vary between S. aureus lineages , an estimate from the background data is required to fix substitution rates within the outbreak clusters. For optimal polishing results, it appears to be effective to train the random forest polishers on lineage-specific data, noting that effective polishing was still achieved when training isolates derived from a different part of the tree within the lineage (e.g., FNQ-trained polishers were effective on PNG isolates). In higher-coverage isolates effective polishing was also achieved with the mixed S. aureus and N. gonorrhoeae models; we note that only three isolates with matching Illumina and ONT data are required for training the polishers.

Interestingly, the random forest classifiers failed to polish Medaka v1.2.3 reference-specific SNP calls (supplementary fig.  S3 , Supplementary Material online) even though the Medaka-Bonito model is trained explicitly on microbial signal data from E. coli and an experimental version (v0.1.0) was successfully used for polishing by Sanderson and colleagues (2020) . Polishing success of Clair calls suggests that the features selected here-in particular, the proximity and quality features ( fig. 3B-D) -were effective at removing systematic FP SNP calls. SNP calling did not improve considerably using Bonito v0.3.6 R9.4.1 DNA models compared with Guppy high accuracy (supplementary fig. S5 , Supplementary Material online) and methylation-aware models (data not shown). It remains to be seen whether retraining Clair or Medaka neural networks on S. aureus specific signal and sequence data would improve species-specific SNP calls without polishing.

We demonstrate the utility of our method by sequencing novel isolates of community-associated MRSA from a pediatric osteomyelitis outbreak in the highland towns of Kundiawa and Goroka (PNG) and routine surveillance in remote northern Australia (FNQ; fig. 1, n ¼ 181) . A protocol that minimized cost (without optimization) allowed us to sequence 2 consecutive panels of 12 isolates with rapid barcoded libraries on a MinION flow cell (SQK-RBK004), by using an interspersing nuclease flush (ONT, EXP-WSH-003). We note that spin column extractions resulted in several fragmented barcodes that failed assembly (12/96). Overall, phylodynamic models were mostly affected by very low-coverage isolates (<5Â) whereas even low-medium coverage isolate (!5Â) produced consistent estimates of the effective reproduction number for the PNG and FNQ clades, when compared with the Illumina reference ( fig. 6) . Accurate modeling was possible even with interspecies polishers trained on N. gonorrhoeae in higher-coverage isolates in PNG. Estimates were more variable in the low-coverage FNQ outbreak clade and for optimal performance, some protocol optimization will be required, and may include extraction protocols for long reads, inclusion of a magnetic bead cleanup step (obligatory in the latest iteration of the ONT rapid kit protocols, 2021) or short-read elimination kits. Although we were ultimately unable to use a total of 32 isolates (<5Â coverage) in the phylodynamic estimation, the cost per S. aureus genome using the 24Â multiplex protocol ranges between USD $40 (no failures over 181 unique samples) and around USD $50 per genome with two repeat flow cells from already extracted cultures (supplementary material 2, Supplementary Material online). Further optimization would incur small additional cost and can be conducted for bacterial pathogens of interest in sufficiently resourced laboratories. Further improvements in cost for genome selection by first scanning genomes with approximate genomic neighbor typing approaches may also be feasible ). Although we chose S. aureus as a model organism for this work mainly due to our interest in sequencing the outbreaks in PNG and FNQ, as well as because of its relatively small genome (2.8 Mbp), core principles and methods used in this study are immediately applicable to other bacterial pathogens Steinig et al. . https://doi.org/10.1093/molbev/msac040 MBE and all steps of the pipelines are implemented in replicable Nextflow workflows (Materials and Methods).

We did not expect significant rate variation in the outbreak clades, which made computation of clade parameters with a lineage-wide fixed substitution tractable. We note that within-outbreak patterns of divergence vary between phylogenies ( fig. 5) , and considering the number of remaining FP and false-negative (FN) SNPs after polishing ( fig. 4) , we did not expect within-outbreak transmission chains to be reproducible. Optimization of SNP polishing or variant calling, for example, with species-specific neural networks, remains to be investigated. For this study, we accelerated computation using the BEAGLE library (Ayres et al. 2019) in combination with BEAST2. Moderate acceleration on standard hardware (<5-7 h) and increased acceleration on NVIDIA P100 GPUs (<2 h) were achieved. Nanopore-driven outbreak sequencing and GPU acceleration in BEAST2 thus enable the rapid deployment of phylodynamic models and responsive surveillance of bacterial diseases.

Ultimately, our cost-effective protocol for multiplex nanopore sequencing and phylodynamic inference of outbreak parameters lowers the barrier of conducting these analyses in scenarios where access to sequencing infrastructure is difficult or infeasible, including in low-or middle-income countries where the burden of bacterial disease outbreaks remains high. In particular, the effective cost of monitoring disease transmission is considerably lower on nanopore sequencing platforms than with gold-standard Illumina sequencing, which may facilitate the sustainable integration of genomic surveillance in reference laboratories located in these regions, including in remote northern Australia and the highlands of PNG. Improvements to the sequencing protocol, for example, by further reducing cost of nucleic acid extraction, increasing the number of isolates per flow cell or balancing throughput per barcode using in silico adaptive sequencing (Payne et al. 2021 ) will further enable the adoption of phylodynamic surveillance for bacterial outbreaks on nanopore devices.

We collected isolates from outbreaks in two remote populations in northern Australia and PNG ( fig. 1) . Isolates associated with pediatric osteomyelitis cases (mean age of 8 years) were collected from 2012 to 2017 (n ¼ 42) from Kundiawa, Simbu Province (27) , and from 2012 to 2018 (n ¼ 35) from patients in the neighboring Eastern Highlands province town of Goroka. We supplemented the data with methicillin sensitive S. aureus isolates associated with severe hospitalassociated infections and blood cultures in Madang (Madang Province; n ¼ 8) and Goroka (n ¼ 12). Isolates from communities in FNQ, including urban Cairns, the Cape York Peninsula and the Torres Strait Islands (n ¼ 91), were a contemporary sample from routine surveillance at Cairns Hospital in 2019. Isolates were recovered on LB agar from clinical specimens using routine microbiological techniques at Queensland Health and the Papua New Guinea Institute of Medical Research (PNGIMR). Isolates were transported on swabs from monocultures to the Australian Institute of Tropical Health and Medicine (AITHM Townsville) where they were cultured in 10 -ml Luria-Bertani (LB) broth at 37 C overnight and stored at À80 C in glycosol and LB. Illumina short-read data from the ST93 lineage (van Hal et al. 2018) included in this study were collected from the European Nucleotide Archive (supplementary tables, Supplementary Material online).

Two milliliters of LB broth was spun down at 5,000 Â g for 10 min and after removing the supernatant, 50 ml of 0.5 mg/ ml lysostaphin were added to the tube and vortexed. Cell lysis was conducted at 37 C for 2 h with gentle shaking followed by a proteinase K digestion for 30 min at 56 C. DNA was extracted using a simple column protocol from the DNeasy Blood & Tissue kit (QIAGEN) following the manufacturer's instructions. DNA was eluted in 70 ml of nuclease-free water, quantified on Qubit, and DNA was stored at 4 C until library preparation. Library preparation was done using approximately 420 ng of DNA and the rapid barcoding kit with 12 barcodes (ONT, SQK-RBK004) as per manufacturer's instructions. Basecalling was done using the R9.4.1 high accuracy (HAC, supplementary fig. S5 , Supplementary Material online), the HAC methylation model (not shown), and the all context methylation Rerio model (not shown) in Guppy v4.2.3, as well as the final Bonito v0.3.6 R9.4.1 DNA model (used for all analyses), run on a local NVIDIA GTX1080-Ti or a remote cluster of NVIDIA P100 GPUs. Sequence runs were conducted with 2 Â 12 barcoded (SQK-RBK004) isolates per flow cell in two consecutive 18-24 h runs. Libraries were nuclease flushed using the wash kit between consecutive runs (EXP-WSH-003). This is sufficiently effective to remove read carry-over, as demonstrated previously with hybrid assemblies of sequentially sequenced Enterobacteriaceae (Lipworth et al. 2020 ) and our analysis of a single library panel (FNQ-2) sequenced on a previously used flow cell with a human library. After washing with EXP-WSH-003, a total of 2,910/294,461 reads were classified as human in the S. aureus library, about twice as much as human contamination in other runs. Sequencing runs were managed on two MinIONs and monitored in MinKNOW > v20.3.1.

Genome assemblies for genotyping were constructed using our Nextflow assembly pipeline (https://github.com/np-core/ np-assembly), which first randomly subsamples reads to a maximum of 200Â coverage with rasusa v0.3.0 (Hall 2022) and filtered Q > 7 with minimum read length of 100 bp using nanoq v0.8.0 (Steinig and Coin 2022) . Fastp v0.20.1 (Chen et al. 2018 ) was used to trim adapter and low-quality Illumina sequences. We then constructed three types of assemblies: a polished long-read assembly using ONT data only (flye), one with short-read correction of the ONT long-read assembly (pilon) and one that first assembles short reads and scaffolds the assembly with long reads. For the polished longread assembly, Flye v2.8.3 (Kolmogorov et al. 2019 ) was used Phylodynamic Inference of Bacterial Outbreak Parameters . https://doi.org/10.1093/molbev/msac040 MBE in conjunction with four iterations of minimap2 v2.17-r941 (Li 2018) þ Racon 1.4.20 (Vaser et al. 2017 ) and subsequent polishing with Medaka 1.2.3. For the long-read hybrid assembly, corrections were conducted with Illumina paired-end reads for each genome using two iterations of Pilon v1.2.3. For the short-read hybrid assembly, we used Unicycler v0.4.8. Reference Illumina assemblies were generated with the pipeline Shovill v1.1.0 (https://github.com/tseemann/shovill) using Skesa v2.4.0 and genotyped with Mykrobe v0.9.0 (Hunt et al. 2019 ) (from reads) and SCCion v0.4.0 (https://github. com/esteinig/sccion), a wrapper around common assemblybased genotyping tools and databases (Zankari et al. 2012; Chen et al. 2016; Kaya et al. 2018 ) for S. aureus. We called species, resistance genes, virulence factors, PVL, multilocus sequence type, mecA, and major SCCmec cassette subtypes. We assessed differences between the Illumina references and hybrid-or nanopore assemblies using the dnadiff v1.3 to determine assembly-based differences in SNPs and Indels, as well as assess overall identity between genomes ( fig. 2) . Coverage against the reference genome (ST93: JKD6159; Chua et al. 2010) was assessed using CoverM v0.6.0 (https://github. com/wwood/CoverM).

We called SNPs de novo using the neural network callers Medaka v1.2.3 (https://github.com/nanoporetech/medaka) and Clair v2.1.1 (shown in example pipeline executions). Snippy v4.6.0 (https://github.com/tseemann/snippy) was used to generate a core-site alignment of the ST93 background population (n ¼ 444, 6,161 SNPs) and reference Illumina core alignments including the outbreaks in FNQ and PNG isolates (>5Â, n ¼ 531, 6,580 SNPs). Snippy variant calls (SNP type) were used as reference truth for matching ONT and Illumina sequenced isolates. We implemented the feature extraction and random forest design from Sanderson and colleagues (2020) who use the RandomForest classifier from scikit-learn (Pedregosa et al. 2011) with default hyperparameter settings and feature extraction with pysamstats. Like the original implementation, we subsampled isolates to 2, 5, 10, 20, 50, and 100Â coverage with rasusa to account for read coverage in training and evaluating the classifiers. For training, we created three sets of matching Illumina and ONT sequence data, each with three isolates for training: three mixed sequence types (ST88, ST15, and ST93; saureus_mixed), one of FNQ within-lineage isolates (ST93; saureus_fnq), and one of Papua New Guinean within-lineage isolates (ST93; saureus_png). Training and validation sets for the classifiers were split into 60% training and 40% validation data.

Next, we evaluated the classifiers, including the N. gonorrhoeae classifier trained by Sanderson and colleagues, using the remaining isolates from FNQ and PNG as an independent test data set ( fig. 1) . We defined true positive (TP) SNPs as those that were called by both Illumina Snippy and ONT Clair, FP as ONT SNPs that were not called with Snippy, and FN Snippy calls that were missed by ONT calls or later excluded in the random forest filtering step. Since we used the de novo Snippy calls as reference, true negative (TN) calls (sites called as wild type by ONT and Snippy) were not able to be considered. We combined data from both outbreaks (n ST93 ¼ 118, n other ¼ 44) and computed accuracy, precision, recall, and F1 scores for each evaluation against Illumina reference data (supplementary tables, Supplementary Material online, fig. 4 ).

To contextualize polished ONT isolates called with Clair within the wider background of the ST93 lineage, we adopted the core functionality from Snippy's core alignment caller (Snippy-core) into an ONT and Illumina core SNP alignment caller in the NanoPath package (https://github.com/np-core/ nanopath). Core SNP sites were defined by polymorphic SNP sites present in genomes of all isolates included in the alignment, excluding any site that in any one isolate falls into a gap, or any site with less than a predefined minimum coverage (default: 1Â). We first polished ONT SNPs from Clair with the trained random forest models, including the N. gonorrhoeae data set from Sanderson et al. (2020) . We then created reference alignments of the Illumina data (ST93 background and outbreaks, n ¼ 531, >5Â) with Snippy-core, as well as a reference Illumina and polished hybrid alignments with ONT outbreak SNPs in NanoPath ( fig. 5 ).

ML phylogeny of the ST93 lineage was reconstructed from the Illumina and ONT polished alignments, including the outbreaks. We used RAXML-NG with the GTR þ G and Lewi's ascertainment bias correction for SNP alignments. Trees were rooted on SRR115236 (early isolate from 1992), near the root of the phylogeny (van Hal et al. 2018 ) and decorated with metadata of sample origin at state level in ITOL (Letunic and Bork 2019) . Sampling dates in years were provided for each isolate. We next subset the full lineage alignments to the isolates in the large clades of the FNQ (n ¼ 36) and PNG (n ¼ 62) outbreaks. We then configured birth-death skyline models in BEAST2 using a custom Python interface (NanoPath Beastling) that stores model configurations of the serially (PNG) and contemporaneously sampled models (FNQ) in YAML files. Birth-death models consider dynamics of a population forward in time using the (transmission) rate k, the death (become uninfectious) rate d, the sampling probability q, and the time of the start of the population (outbreak; also called origin time) T. The effective reproduction number (R e ), can be directly extracted from these parameters by dividing the birth rate by the death rate (k Ä d). We configured the model priors as outlined in table 1. Importantly, we set a lineage-wide fixed substitution rate prior at 3:199 Â 10 À4 to account for the loss of temporal signal in the outbreak subset alignments. NanoPath constructs the BEAST2 XML model files which can be run with the BEAGLE library on GPU. Results were summarized using the bdskytools package in R, where median higher posterior density intervals were computed in custom plotting scripts that can be found along with all other results from the pipelines and model runs at the data repository. Steinig et al. . https://doi.org/10.1093/molbev/msac040 MBE

BEAGLE 3: improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics

Analytical validity of nanopore sequencing for rapid SARS-CoV-2 genome analysis

VFDB 2016: hierarchical and refined dataset for big data analysis-10 years on

fastp: an ultra-fast all-in-one FASTQ preprocessor

Complete genome sequence of Staphylococcus aureus strain JKD6159, a unique Australian clone of ST93-IV community methicillin-resistant Staphylococcus aureus

COVID-19 Genomics UK (COG-UK) Consortium. 2021. Genomic epidemiology reveals multiple introductions of SARS-CoV-2 from mainland

Nextflow enables reproducible computational workflows

COVID-19 Genomics UK (COG-UK) Consortium. 2021. Establishment and lineage dynamics of the SARS-CoV-2 epidemic in the UK

Temporal signal and the phylodynamic threshold of SARS-CoV-2

Estimating evolutionary rates using time-structured data: a general comparison of phylogenetic methods

Genome-scale rates of evolutionary change in bacteria

Bayesian evaluation of temporal signal in measurably evolving populations

Establishment and cryptic transmission of Zika virus in Brazil and the Americas

Rapid nanopore-based DNA sequencing protocol of antibioticresistant bacteria for use in surveillance and outbreak investigation

Towards a genomics-informed, real-time, global pathogen surveillance system

Genomic and epidemiological surveillance of zika virus in the amazon region

Antimicrobial resistance prediction and phylogenetic analysis of Neisseria gonorrhoeae isolates using the Oxford Nanopore MinION sequencer

Key parameters for genomics-based real-time detection and tracking of multidrug-resistant bacteria: a systematic analysis

Rasusa: randomly subsample sequencing reads to a specified coverage

SARS-CoV-2 transmission between mink (Neovison vison) and humans

Spread of a SARS-CoV-2 variant through Europe in the summer of 2020

Antibiotic resistance prediction for Mycobacterium tuberculosis from genome sequence data with Mykrobe

Development of phylodynamic methods for bacterial pathogens

SCCmecFinder, a web-based tool for typing of Staphylococcal Cassette chromosome mec in Staphylococcus aureus using wholegenome sequence data

Assembly of long, errorprone reads using repeat graphs

Spartan performance and flexibility: an HPC-cloud chimera

Interactive Tree Of Life (iTOL) v4: recent updates and new developments

Minimap2: pairwise alignment for nucleotide sequences

Optimized use of Oxford Nanopore flowcells for hybrid assemblies

Exploring the limit of using a deep neural network on pileup data for germline variant calling

MUMmer4: a fast and versatile genome alignment system

COVID-19 Genomics UK (COG-UK) Consortium. 2021. CLIMB-COVID: continuous integration supporting decentralised sequencing for SARS-CoV-2 genomic surveillance

Readfish enables targeted nanopore sequencing of gigabase-sized genomes

Scikitlearn: machine learning in Python

Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples

Real-time, portable genome sequencing for Ebola surveillance

High precision Neisseria gonorrhoeae variant and antimicrobial resistance calling from metagenomic nanopore sequencing

SKESA: strategic k-mer extension for scrupulous assemblies

Phylodynamic signatures in the emergence of community-associated MRSA. bioRxiv

Nanoq: ultra-fast quality control for nanopore reads

Genomic neighbor typing for bacterial outbreak surveillance

Freshwater monitoring by nanopore sequencing

Fast and accurate de novo genome assembly from long uncorrected reads

Queensland Genomics: an adaptive approach for integrating genomics into a public healthcare system

COVID-19 Genomics UK (COG-UK) consortium. 2021. Assessing transmissibility of SARS-CoV-2 lineage B.1.1.7 in England

Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement

Unicycler: resolving bacterial genome assemblies from short and long sequencing reads

A cross-jurisdictional research collaboration aiming to improve health outcomes in the tropical north of Australia

Identification of acquired antimicrobial resistance genes

Supplementary data are available at Molecular Biology and Evolution online. 

Sequence data (Illumina, ONT) generated in this study have been submitted to the NCBI BioProject database (https:// www.ncbi.nlm.nih.gov/bioproject/) under accession number PRJNA657380. Additional model results and configuration files may be found in our repository (https://github.com/ esteinig/ca-mrsa).