key: cord-0303635-hbkko0pe authors: Kupperman, Michael D.; Leitner, Thomas; Ke, Ruian title: A deep learning approach to real-time HIV outbreak detection using genetic data date: 2021-12-20 journal: bioRxiv DOI: 10.1101/2021.12.17.473204 sha: 5f211565595074e3012c7dcaec9a7df31bcff42d doc_id: 303635 cord_uid: hbkko0pe Pathogen genomic sequence data are increasingly made available for epidemiological monitoring. A main interest is to identify and assess the potential of infectious disease outbreaks. While popular methods to analyze sequence data often involve phylogenetic tree inference, they are vulnerable to errors from recombination and impose a high computational cost, making it difficult to obtain real-time results when the number of sequences is in or above the thousands. Here, we propose an alternative strategy to outbreak detection using genomic data based on deep learning methods developed for image classification. The key idea is to use a pairwise genetic distance matrix calculated from viral sequences as an image, and develop convolutional neutral network (CNN) models to classify areas of the images that show signatures of active outbreak, leading to identification of subsets of sequences taken from an active outbreak. We showed that our method is efficient in finding HIV-1 outbreaks with R0 ≥ 3, and overall a specificity exceeding 85% and sensitivity better than 70%. We validated our approach using data from HIV-1 CRF01 in Europe, containing both endemic sequences and a well-known dual outbreak in intravenous drug users. Our model accurately identified known outbreak sequences in the background of slower spreading HIV. Importantly, we detected both outbreaks early on, before they were over, implying that had this method been applied in real-time as data became available, one would have been able to intervene and possibly prevent the extent of these outbreaks. This approach is scalable to processing hundreds of thousands of sequences, making it useful for current and future real-time epidemiological investigations, including public health monitoring using large databases and especially for rapid outbreak identification. Author summary The analysis of pathogen genomic data to analyze epidemics at scale is constrained by the computational cost associated with phylogenetic tree reconstruction. As a fast and efficient alternative, we employed convolutional neural networks to analyze evolutionary pairwise distance matrices as images to perform classifications of the current epidemiological situation of a growing public health sequence database. We used simulated data to train and test our model, and as validation we accurately mapped the start and end of two linked well-documented HIV-1 outbreaks in the backdrop of ongoing slower HIV spread. Thus, our new approach is efficient, accurate, scalable, and can analyze data in real time. The human population is increasingly exposed to threats of infectious disease outbreaks 2 due to population growth, increased frequency of traveling, changing patterns of land 3 use etc. This is exemplified by the ever-growing HIV epidemic [1] as well as recent 4 emerging outbreaks such as the SARS-CoV-2 pandemic. A key to outbreak control is 5 early detection when the number of infected individuals is small and the disease spread 6 is local. One growing resource for disease control is to utilize pathogen genomic 7 sequence data to assess epidemiological conditions and threats. 8 Given sequence data, state-of-the-art, flexible phylogenetic methods have been 9 developed for analysis of general evolutionary questions [2] [3] [4] , applicable to pathogen 10 evolution, as well as faster but less precise algorithms for large data [5] . More focused 11 phylodynamic methods have also been developed for specific tasks, e.g., taking incidence 12 data into account [6] , including multiple evolutionary scales [7] , inferring underlying 13 transmission networks on several levels [8, 9] , and using large next generation sequencing 14 data [10] . Motivated by pathogen evolution, advanced methods for inference of past 15 demographic history with population size dynamics and migration that can reconstruct 16 outbreaks have also been developed, e.g., the modular framework of BEAST [11] , which 17 takes a Bayesian approach to account for uncertainties in the tree reconstruction. 18 However, phylogenetic tree reconstruction, interpretation, and subsequent outbreak 19 identification requires extensive expert knowledge, and thus typically can be reliably 20 done only by highly trained scientists. Furthermore, despite the advances and promises 21 of genomic epidemiology, current phylogenetic methods to analyze pathogen genetic 22 data suffer from high computational cost that prevents real-time or near real-time 23 analysis, especially when thousands of sequences need to be analyzed, as is common in 24 modern sequencing projects. 25 Here, we propose an alternative to this paradigm. Instead of constructing 26 phylogenetic trees, we based our approach on the pairwise distance matrix of a sample 27 of genetic sequences, and used a deep learning approach to analyze the matrix. Deep 28 learning approaches, such as the convolutional neural network (CNN) [12] , has been well 29 developed and widely used for image identification over the past decade [13] [14] [15] . By 30 using many parameters in a highly nonlinear model, a deep learning model can 31 efficiently learn complex relationships within the data to form highly accurate 32 predictions [16] . Our rationale here is that outbreaks would lead to distinctive 33 signatures in the pairwise distance matrix (as well as impact the topology of the 34 phylogenetic tree [17] ). We leveraged the advance in deep learning models by treating 35 the matrix as an image, and developed deep learning models to identify these signatures 36 from the pairwise distance matrix and thus detect outbreaks from a sequence database. 37 We show that our CNN models made accurate predictions against historical HIV 38 sequence data with known epidemiological history, and that they can handle many 39 thousands of sequences within a very short time frame using a laptop computer. 40 Overview of the framework 42 Here we describe the main workflow of our approach. We first developed a forward 43 stochastic simulator of HIV transmission to generate synthetic datasets for training and 44 testing of our CNN models. In this simulator, the number of infected individuals initially 45 substitutions were then simulated on the transmission tree/genealogy of the n sampled 50 individuals, and a n × n pairwise distance matrix was derived from the simulated 51 substitutions on the transmission tree (Fig. 1B) . We repeated the stochastic simulation 52 many times to derive a rich synthetic dataset (i.e. a collection of matrices with labels). 53 We then developed CNN models, trained and tested these CNN models using the 54 synthetic dataset to identify matrices that are labeled as 'epidemic'. These CNN models 55 handle a small number of sequences (n = 15, 20, 30, 40 or 50) at a time ; however, a 56 pathogen database may contain tens of thousands or even hundreds of thousands of 57 sequences. We thus developed a 'sliding-window' approach to utilize the CNN model to 58 identify subsets of sequences that belong to an active outbreak, i.e. the 'epidemic' label, 59 from a collection of a large number of sequences. More specifically, we first constructed 60 a pairwise distance matrix for all the sequences in the database (m sequences in total), 61 and reordered the matrix using a fast clustering algorithm, such as hierarchical 62 clustering [18] . This ensures that the sequences belonging to an active outbreak are 63 grouped together. We then used a window of size n × n, and slide this window from one 64 diagonal of the m × m matrix towards the other diagonal. At each position of the (ranging between 1.5 and 5 in our simulations). It has been shown that the transmission 76 potential is much higher during the acute infection phase in an infected 77 individual [21, 22] . Thus, we assumed that during the first three months the rate of new 78 infections is 20-fold higher than the remaining infectious period. We assumed that an 79 individual will be non-infectious when on successful antiviral treatment, administered 80 after diagnosis between 13 to 36 months after infection [23, 24] . Once the population size reaches the maximum number of infected individuals, 82 assumed to be log-uniformly distributed between 10 3 and 10 4 across different simulation 83 runs, we set the population size to be constant over time. In the constant population 84 phase, the simulation switched from sampling the number of new infections to only 85 replacing infections that have reached the end of their active phase, maintaining the 86 total number of active infections (Fig 1 A) . The transmission history of all individuals during the simulation was recorded, which 88 allows for the reconstruction of the transmission history/tree for the entire population To construct an evolutionary pairwise distance matrix, D, we first calculated the 100 temporal distance between each pair of samples based on the transmission history/tree. 101 We assumed that the genetic sequence data is approximately 300 nucleotides (nt) to be 102 consistent with the real HIV-1 data we used below. We then calculated the pairwise We developed a deep learning model using a Convolutional Neural Network (CNN) to 110 solve a classification task predicting the label (0, 2, or 10 years) for a given pairwise 111 distance matrix. The pairwise distance matrix is similar to a grayscale image (Fig 2) . 112 Thus, in the context of machine learning models, we also refer to a pairwise distance 113 matrix as an image. 114 We constructed a CNN using Tensorflow [25] with a sequential architecture 115 comprised of 2D convolutions, batch normalization, ReLu activations, and spatial 116 maximum pooling. The layer structure is described in table 1. The inputs to the first 117 and third dense layers were regularized using dropout with probability p = 0.25. We A collection of example images used for training from each training year. Each pictured training example was generated with R 0 > 4. Three unsorted example images are shown for each sampling time ( Fig 1A) . Table 1 . The layer connectivity proceeding from the Input layer (top) down to the output layer. Our representation of image data within the convolutional stack is channels last. The batch size is implicitly 1 for each image and is dropped. If k > 20, an additional max-pool operation is performed before layer 10. No bias weight is used in the final dense layer. Layer index Layer type Output shape (per image) 0 Input Flatten batches of synthetic training data of 60,000 pairwise distance matrices were generated 125 for each number of sampled individuals (for a total of 25 training sets), each with Table) . To handle a large number of sequences, we employed a sliding-window approach. We To validate the performance of the sliding window approach, we generated pairwise 159 distance matrices of dimension 600 × 600 using our stochastic simulator. For each of 160 the pairwise distance matrix, we performed 10 simulations using the same parameters as 161 with the training dataset for the CNN model. In each simulation 60 individuals are 162 sampled at either 0, 2, or 10 years. We then joined these 10 clusters by connecting their 163 MRCAs by assuming that the distance between MRCAs of the clusters were uniformly 164 distributed between 1 and 3 months. The order of individuals in the pairwise distance 165 matrix was then randomized. A total of 100 independent pairwise distance matrices 166 were generated with this process to form a validation set. Accuracy was computed on a 167 per-individual basis using the sliding-window method. For sequences sampled within 2 years of the current year, they were labeled as part 178 of an active epidemic. For sequences sampled more than 2 years before the current year, 179 we categorized them according to whether they were in the same sliding window as (i.e. 180 close to) any sequences sampled in the current year that are labeled as 'epidemic'. If so, 181 we labeled these sequences reactivated epidemic; otherwise, we labeled them inactive 182 epidemic. Reactivated epidemic indicates a situation where the newly identified 183 outbreak may be linked to previously sampled sequences (i.e. individuals) in the 184 database. Inactive epidemic indicates that we predict that the sequence used to belong 185 to an outbreak at the time when the sequence is added to the database; however, it may 186 not be relevant to current outbreaks. The overall procedures of model prediction on an expanding database is represented 188 in Fig. 3 . year. This data contains a well-known dual outbreak of HIV among first Finnish, then 194 Swedish, intravenous drug users [27] . The sequences from that dual outbreak thus 195 become mixed with other sequences from Europe. We arranged this database to mimic 196 the inflow of data from an outbreak that emerges in only part of the whole data as it 197 enters the analysis stream, one new sequence at a time. In total, this data consisted of 198 277 sequences covering approximately 300 nt surrounding the HIV-1 env V3 region. To determine the sensitivity of our method to the choice of evolutionary distance 200 model used to calculate pairwise evolutionary distances, we analyzed HIV-1 subtype A6 201 sequence data from Former Soviet Union, Russia, and Ukraine (417 sequences covering 202 approximately 400 nt surrounding the env V3 region. This dataset was sampled over a 203 20+ year range and features long distances where a correction factor may be significant. 204 We computed pairwise distance matrices using a Hamming model ("Ham" which counts 205 only mutations per site), JC69, K80, F81, and TN93 using Ape [28, 29] . Before 206 evaluating the model, we subset the data by year to obtain pairwise distance matrices 207 containing sequences sampled around the same time. Years with less than the minimum 208 number of sequences needed to make a prediction were joined forward to the next year. 209 The resulting pairwise distance matrices were evaluated using the sliding window 210 method and accuracy values were calculated on a per-person basis. To test scalability, we extracted HIV-1 RT (p51) gene sequences of at least 500 nt of 212 any subtype or recombinant in the LANL HIV database 213 (https://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html). We then removed 214 all sequences labelled as "problematic" in the LANL HIV database and further all that 215 required gaps, as RT typically does not have many gaps, resulting in a 1,320 nt 216 alignment of 271,868 sequences. To estimate the runtime of our pipeline, we 217 subsequently resampled this alignment for 10 2 − 10 5 sequences and constructed the 218 pairwise distance matrix using the TN93 evolutionary model. We also measured the 219 time to sort each matrix using hierarchical clustering and the time to stride the larger 220 matrix into smaller views and pass the data through the model. The convolutional neural network (CNN) model classifies 224 synthetic data with high accuracy 225 We developed a framework employing CNN models to identify active outbreaks from 226 HIV-1 sequence data (see Methods for the workflow and details of the framework). We 227 first generated synthetic datasets for training the CNN models using our forward 228 stochastic simulator of HIV transmission for training and testing of the CNN model. 229 We varied the reproductive number, R 0 , between 1.5 and 5 in the simulator. Samples generate 300,000 pair-wise distance matrices. The memory requirements to store the full 237 training set for the larger sample sizes contiguously in GPU memory exceeds the 238 capacity of many consumer-grade laptops and desktops. Therefore, we split them into 239 five subsets of matrices with equal size, i.e., 60,000 labeled pairwise distance matrices. 240 We then trained a CNN model on each of the five window size subsets of data to 241 correctly predict the labels (i.e. label 0, 1 and 2) of the pair-wise matrices. Similar 242 model performance were achieved across the five subsets (Fig. 4A) . We then joined all 243 the trained networks in a simple bagging classifier with a voting decision method. This 244 led to a classifier with a higher accuracy than the mean accuracy of the five networks 245 used to assemble the larger classifier model (Fig. 4B) . We thus used the voting classifier 246 for predictions below. Overall, the accuracy of the CNN model with the voting decision 247 method ranged between 79% and 87%, and the accuracy improved with increased 248 sample size (i.e., larger windows perform better; Fig. 4 ). From a public health perspective, we are interested in identifying sequences in a correctly predicting label 0). We classified label 0 as epidemic, and label 1 and 2 as 254 'endemic', and found that at all window sizes, the specificity (of identifying 'epidemic' 255 sequences) exceeds 85% and the sensitivity exceeds 70%. Specificity increases with the 256 addition of data, while sensitivity for our method increases initially, but appears to 257 saturate at window sizes of 40 and 50. An important feature of our framework is that it does not rely on phylogenetic tree 259 construction; instead, it used a clustering algorithm to group closely related sequences 260 together. While phylogenetic tree construction is more reliable to specify evolutionary 261 relationships (and thus the genealogical relationships) of the sequences, it takes a longer 262 time to perform than clustering algorithms. Here we show that precise evolutionary 263 relationships is not necessary for our CNN model to make correct predictions as long as 264 closely related sequences are grouped together (through clustering algorithms). To this 265 end, we permuted two randomly chosen sequences in a matrix and calculate the 266 probability that the model gives the correct prediction (S2 Table) . Note that reordering 267 of the sequences could lead to a different image (pairwise-distance matrix) on which the 268 CNN model makes its prediction. Overall, a random reordering of two sequences 269 preserves a correct prediction with a probability between 80% and 89%. This suggests 270 that our method is robust to the specific ordering of the sequences presented to the 271 model. To identify the subsets of sequences that were sampled from an outbreak in a database 289 of many sequences, we used a sliding window approach. In this approach, we first 290 constructed a pairwise distance matrix for all the sequences in the database, then sort 291 the matrix so that closely related sequences are grouped together. We then employed 292 the sliding window to assign predictions to groups of individuals in the matrix (see 293 Methods for details). 294 We considered three different sorting methods: 1) randomly ordered, 2) hierarchical 295 clustering (Ward criterion), and 3) hierarchical clustering (Ward criterion) with optimal 296 leaf ordering refinement. To test the performance of the three methods, we generated 297 100 pairwise matrices (images) of size 600 by 600. Each of the 600 by 600 matrix was 298 generated by joining 10 simulated matrices using the stochastic simulator with outbreak 299 parameters (R 0 and simulation length) randomly sampled (see Methods for details). In 300 Accuracy is calculated over each subinterval, a trend line is fitted using a Savitzky-Golay filter of degree 3 with a window size of 51. A black dashed line indicates the probability of predicting an outbreak using only the occurrence rate in the training set, independent of R 0 . Outbreaks with small R 0 are more difficult to detect than with larger R 0 . total, we evaluated 60,000 predictions to compute the accuracy of these model 301 predictions. Of the three tested sorting methods, applying hierarchical clustering gave 302 the best accuracy (see S3 Fig) . Thus, hereafter, we applied this sorting method to all 303 large matrices. This approach also achieved high sensitivity and specificity in detecting 304 outbreaks ( figure 6 ). 305 We also tested the robustness of model predictions using different evolutionary 306 models to compute pairwise evolutionary distances from real HIV-1 sequence data. We 307 used a meta-dataset of HIV-1 sequences from the former Soviet Union [31] (see 308 Methods). Overall, for models using all window sizes (15, 20, 30, 40, 50) , there is To validate our model against real datasets and illustrate the application of a real 313 outbreak identification in real time, we collected a dataset containing a well-documented 314 dual outbreak among first Finnish and then Swedish IDUs [27] . To add complexity, we 315 added an assortment of other European HIV-1 infections from the time period around 316 the Finnish-Swedish outbreak to test our model's ability to detect an outbreak in an 317 endemic backdrop (Fig 7) . An initial subset of 15 sequences is required to make a first 318 prediction ('circled 1' in Fig 7) . These initial 15 sequences were classified as endemic, 319 which agreed with how they were sampled; coming from different European countries 320 (including Sweden), not as part of any known outbreak. Note that most of these initially 321 classified endemic sequences stayed endemic throughout the growth of the database. After the initial 15 sequences, the database grows one sequence at a time, and we 323 calculated a new distance matrix for all sequences up to that point, applied sorting, and 324 performed predictions using the sliding window as outlined in Figure 1 . The sequences 325 were re-classified in each matrix step, and the result added as a slice along the x-axis. Sensitivity and specificity on the multiple-cluster test set. Model performance is computed using the same 60, 000 infections. Sensitivity is high but decreases as the window size increases, while the specificity of the method increases as the window size increases. sequences were added at 'circled 3', their classification changed to 'active epidemic' as 328 new sequences inform about an ongoing outbreak. This correctly identified the start of 329 the Finnish IDU outbreak in 1999. Sequences that joined the database at 'circled 3' and 330 were classified as active epidemic, later became classified as 'reactivated epidemic' at 331 'circled 4' as they were about to become 'inactive epidemic' but new active epidemic 332 sequences were added, instantly reactivating them. At 'circled 5', these (and many detected both outbreaks early on, before they were over, implying that had this method 350 been applied in real-time as data became available, we would have been able to 351 intervene and possibly prevent the extent of the outbreaks. (see Methods). We measured the real time it took to construct a large pairwise distance 357 matrix from the sample using Ape [28] , the time required to sort the matrix using 358 hierarchical clustering [32] , and the time required slide the window down the sorted In this work, we developed a deep learning approach to effectively and efficiently handle 373 large quantities of viral sequences to identify subsets of sequences that were taken from 374 an HIV-1 active outbreak. In contrast to phylogenetic approaches, our approach 375 constructs and directly analyzes pairwise distance matrices from viral sequences. The 376 key idea is to treat these matrices as images and then train and use convolutional neural 377 network (CNN) models to identify subsets of sequences which form matrices that bear 378 the signature of an outbreak. Using historical HIV-1 genetic sequence data with known 379 epidemiological history, we showed that our approach correctly identified subsets of 380 sequences sampled from outbreaks. Furthermore, we showed that our approach is 381 scalable and can process hundreds of thousands of sequences within hours. The key to the low-computational cost of our approach is analyzing a pairwise 383 distance matrix derived from sequences in a database, instead of constructing 384 phylogenetic trees. While computing large trees from the full sequences is possible using 385 fast, but approximate, methods, subsequent interpretation is ad hoc, often subjective, and requires careful analysis by experts. Other, more advanced methods that partially 387 incorporate interpretation and integrating across many plausible trees, such as elegant 388 Bayesian methods [11] , are instead even slower and unsuitable for very large data 389 analyses. On the other hand, the dimension of the data encoded in a pairwise distance 390 matrix is significantly less than that encoded in the full sequences. Importantly, it has 391 been shown that this dimension reduction does not substantially impact on the 392 information content with respect to evolutionary relationships between sequences [33]. 393 By building the pairwise distance matrix, we concentrated the epidemiological 394 signatures (outbreak or endemic) into a more compact structure, i.e. an image format, 395 that can be analyzed very efficiently using deep learning. Given the trained system, the 396 computational speed and complexity of our approach is currently limited by the 397 quadratic complexity (in the number of sequences considered) in building the pairwise 398 distance matrix and secondarily by the data sorting and model evaluation. The accuracy of the model prediction is ensured by applying a deep learning method 400 based on CNN models for image classification. The model was trained using a large set 401 of synthetic data generated from a previously validated HIV-1 transmission 402 simulator [20] . Our deep learning model performed well when R 0 is high (R 0 > 3). Large R 0 characterizes fast-growing outbreaks that we would like to identify early and 404 with high confidence. When R 0 is small, the outbreak grows significantly slower and is 405 difficult for our model to detect. More sophisticated CNN models than proposed here 406 may be used in future improvements to the ability of the model to detect outbreaks. However, another factor that may lead to poor model performance is lack of signal in 408 the data. For example, it is not clear if a reduction of R 0 from 1.5 − 3 down to 1 for 409 only 2 years, when the mean time to detection is about 2 years, induces strong signals of 410 an outbreak in viral genetic data. Note that, increasing the sample size from 15 411 sequences to 30 or 50 sequences does not give a substantial increase in performance for 412 low R 0 S1 Fig. This further suggests lack of signal for detection. Our approach is limited by the use of synthetic data obtained from a simulator to 414 train the model. Synthetic data may lack certain subtleties present in real data. Thus, 415 we validated our final model using real HIV-1 sequence data. We showed that detecting 416 a known dual outbreak among injecting drug users first in Finland and then in Sweden, 417 intermingled by random data from the same subtype in Europe, accurately identified 418 the outbreak in a data stream where HIV sequences were provided over the time the 419 outbreaks occurred. Encouragingly, we detected the outbreaks early, before they were 420 over, as well as the end of the outbreaks. Accurately identifying a beginning outbreak in 421 the background of other data and knowing when it's over are both crucial pieces of 422 information for a public health authority. Our approach has potential to effectively deal with recombination when analyzing 424 sequences. Standard phylogenetic methods assume no recombination, which HIV clearly 425 violates. Pairwise distances, on the other hand, are accurate whether recombination 426 took place or not. Here, our deep learning model is trained to classify sets of 427 evolutionary distances in very large data collections with no requirement to perform 428 manual or automated cluster separation and examine only regions of high density or low 429 genetic divergence [34] . Our technique can identify sequences likely to be part of an 430 epidemic without the need for a strict evolutionary distance cutoff and avoids the need 431 to a priori give a strict definition of a cluster. Overall, our deep learning approach is capable of processing tens of thousands of 433 sequences, and is not dependent on a specific choice of evolutionary distance metric. While we developed this computational framework for HIV-1, with adjustments in the 435 evolutionary rate and training simulations, it could be applied to analyze other 436 pathogens for online, rapid response to developing outbreaks and rising of novel variants 437 of concerns, e.g. for SARS-CoV-2. The source code used to produce the results and analyses presented in this manuscript 440 are available from: clustering methods applied to the full data set. Method HC denotes hierarchical 461 clustering, OLO denotes HC with optimal leaf ordering refinement, None denotes no 462 sorting is applied. Accuracy is computed using the three labels used during training. 465 Global and regional molecular epidemiology of HIV-1 review, global survey, and trend analysis. The Lancet Infectious Diseases Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies Assessing the Performance of PhyML 3.0. Systematic Biology RAxML version 8: a tool for phylogenetic analysis and 473 IQ-TREE 2: New Models and Efficient Methods for 477 Phylogenetic Inference in the Genomic Era Maximum-Likelihood Trees for Large Alignments Phylodynamic Inference for Structured 483 Epidemiological Models Phylogenetic patterns recover known HIV 486 Genomic infectious disease Molecular Biology and 494 Evolution 496 PHYLOSCANNER: Inferring transmission from within-and between-host Bayesian phylogenetic and phylodynamic data integration using BEAST 1 Backpropagation Applied to Handwritten Zip Code Recognition Convolutional Neural Networks Advances in Neural Information Processing Systems Deep learning A Survey of Convolutional Neural 514 Networks: Analysis, Applications, and Prospects. IEEE Transactions on Neural 515 Networks and Learning Systems Learning representations by 518 back-propagating errors Identification of 520 Hidden Population Structure in Time-Scaled Phylogenies Modern hierarchical, agglomerative clustering algorithms Rich feature hierarchies for accurate 524 object detection and semantic segmentation Computer Society Conference on Computer Vision and Pattern Recognition Agent-based and phylogenetic analyses reveal 528 how HIV-1 moves between risk groups: Injecting drug users sustain the 529 heterosexual epidemic in Latvia HIV-1 Transmission, by Stage of 532 JS. HIV-1 Transmission during Early Infection in Men Who Have Sex with Men: 536 A Phylodynamic Analysis Getting more from heterogeneous HIV-1 surveillance data in a high 540 immigration country: estimation of incidence and undiagnosed population size 541 using multiple biomarkers A multistate approach for estimating the 544 incidence of human immunodeficiency virus by using HIV and AIDS French 545 surveillance data TensorFlow: 548 Large-Scale Machine Learning on Heterogeneous Systems A method for stochastic optimization Dynamics of Two Separate but Linked HIV-1 CRF01 AE Outbreaks among 554 ape 5.0: an environment for modern phylogenetics and 557 evolutionary analyses in R R: A Language and Environment for Statistical Computing MAFFT Multiple Sequence Alignment Software Version 561 7: Improvements in Performance and Usability Toward Extracting All Phylogenetic Information from Matrices of 569 Automated analysis of phylogenetic clusters