key: cord-0678972-6t1hxqwi authors: Blenkinsop, Alexandra; Monod, M'elodie; Sighem, Ard van; Pantazis, Nikos; Bezemer, Daniela; Coul, Eline Op de; Laar, Thijs van de; Fraser, Christophe; Prins, Maria; Reiss, Peter; Bree, Godelieve de; Ratmann, Oliver title: Estimating the potential to prevent locally acquired HIV infections in a UNAIDS Fast-Track City, Amsterdam date: 2022-03-15 journal: nan DOI: nan sha: 5b6b9c0307e064c0d15c9e9f9c8613d82634453b doc_id: 678972 cord_uid: 6t1hxqwi Amsterdam and other UNAIDS Fast-Track cities aim for zero new HIV infections. Utilising molecular and clinical data of the ATHENA observational HIV cohort, our primary aims are to estimate the proportion of undiagnosed HIV infections and the proportion of locally acquired infections in Amsterdam in 2014-2018, both in MSM and heterosexuals and Dutch-born and foreign-born individuals. We located diagnosed HIV infections in Amsterdam using postcode data at time of registration to the cohort, and estimated their date of infection using clinical HIV data. We then inferred the proportion undiagnosed from the estimated times to diagnosis. To determine sources of Amsterdam infections, we used HIV sequences of people living with HIV (PLHIV) within a background of other Dutch and international sequences to phylogenetically reconstruct transmission chains. Frequent late diagnoses indicate that more recent phylogenetically observed chains are increasingly incomplete, and we use a Bayesian model to estimate the actual growth of Amsterdam transmission chains, and the proportion of locally acquired infections. We estimate that 20% [95% CrI 18-22%] of infections acquired among MSM between 2014-2018 were undiagnosed by the start of 2019, and 44% [37-50%] among heterosexuals, with variation by place of birth. The estimated proportion of MSM infections in 2014-2018 that were locally acquired was 68% [61-74%], with no substantial differences by region of birth. In heterosexuals, this was 57% [41-71%] overall, with heterogeneity by place of birth. The data indicate substantial potential to further curb local transmission, in both MSM and heterosexual Amsterdam residents. In 2014-2018 the largest proportion of local transmissions in Amsterdam are estimated to have occurred in foreign-born MSM, who would likely benefit most from intensified interventions. We located diagnosed HIV infections in Amsterdam using postcode data (PC4) at time of registration in the ATHENA observational HIV cohort, and estimated their date of infection using clinical HIV data. We then inferred the proportion of undiagnosed from the estimated times to diagnosis. To determine the sources of Amsterdam infections, we used HIV sequences of Amsterdam people living with HIV (PLHIV) within a background of other Dutch and international sequences to phylogenetically reconstruct transmission chains, and tabulate their growth between 2014 and 2018. Frequent late diagnoses indicate that more recent phylogenetically observed chains are increasingly incomplete, and we use a Bayesian model to estimate the actual growth of Amsterdam transmission chains, and the proportion of locally acquired infections. Human immunodeficiency virus (HIV) is concentrated in metropolitan areas, with the 200 cities with the highest burden of HIV representing 26% of global HIV burden (Joint United Nations Programme on HIV/AIDS (UNAIDS) 2014). In response, as of March 2021 over 300 cities have joined the Fast-Track Cities initiative by signing the Paris Declaration, committing to end the AIDS epidemic by 2030, by addressing disparities in access to basic health and social services, social justice and economic opportunities (UNAIDS 2019) . Several of these fast-track cities have successfully developed strategies which best address the needs of the local epidemic, including London's HIV Prevention Programme and early ART initiation, and New York's Status Neutral Prevention and Treatment Cycle (Public Health England 2018; Myers et al. 2018) . A central milestone in this agenda is to characterise the number of HIV infections that are acquired from sources within cities and are thus preventable through local interventions, as well as to identify the primary risk groups with infections from local sources. In the Netherlands, Amsterdam is the city with the greatest HIV burden nationally, reflecting in part a large MSM community, as well as large communities of at-risk, foreign-born individuals. Amsterdam has a long history of a collaborative HIV approach in combating the epidemic and joined the UNAIDS fast-track Cities initiative on 1 December 2014. City-level HIV responses were galvanised in the HIV Transmission Elimination Amsterdam project (H-Team) that same year (de Bree et al. 2019) . The H-Team fast-track response, amongst others, focussed on outreach activities, encouraging repeat testing every 3-6 months to identify acute and early HIV infection, followed by immediate initiation of combination antiretroviral therapy (c-ART) in newly diagnosed patients, and roll-out of pre-exposure prophylaxis (PreP) in high risk populations at increased risk of HIV infection (den Daas et al. 2018 ; Bartelsman et al. 2017; Hoornenborg et al. 2019; M. Dijkstra et al. 2019) . Prior to the COVID-19 pandemic, the number of annual HIV diagnoses in Amsterdam residents has consistently declined from 300 new city-level HIV diagnoses in 2010 to 120 in 2018, primarily in Dutch-born and foreign-born MSM. Given these impressive achievements, it is particularly unclear how many new infections are locally acquired and could thus still be locally averted. Late diagnoses remain common and are a particular concern in this effort, both for individual health and the risk that unnoticed transmission chains pose to public health. Here, we build on Amsterdam's combined case and genomic surveillance data to reconstruct transmission chains at city level, and the growth and origins of these chains between 2014-2018. We estimate the extent of undiagnosed infections at the forefront of the cities' transmission chains, among infections that are estimated to have occured since 2014. We characterise variation in the extent to which all individuals in city level transmission chains are virally suppressed, and we study the relative impact of newly introduced transmission chains from outside Amsterdam on the city-level epidemic. Finally, we combine our insights to estimate the proportion of locally acquired infections, i.e. those infections that could have been locally averted. Data were obtained from the prospective ATHENA cohort of all people living with HIV (PLHIV) in care in the Netherlands, including patient demographics and longitudinal CD4, HIV viral load, viral sequence, and treatment data (see Supplementary Material, Section 2) ). Sequencing methods are described in (Bezemer et al. 2004) . We geolocated diagnosed infections in Amsterdam based on patients' postcode of residence at time of first registration in ATHENA, or the most recent registration update. MSM were stratified by region of birth (The Netherlands; Western Europe, North America, Oceania; Eastern and Central Europe; South America and the Caribbean; Other), and similarly for heterosexual individuals (The Netherlands; South America and the Caribbean; Sub-Saharan Africa; Other), and so we considered 9 risk groups in total. We here focus on city-level transmission chains growing in the period from January 1, 2014 to December 31, 2018, which for brevity we refer to as 2014-2018. Available demographic, clinical, and viral sequence data were obtained for HIV diagnoses in Amsterdam until December 31, 2018 , from the ATHENA database version closed on May 1, 2019. Using longitudinal viral load and CD4 count data along with other demographic/clinical information in combination with parameters of a bivariate linear mixed model, estimated in a large dataset of individuals with known infection dates from the CASCADE Collaboration, we estimated time from infection to diagnosis for all HIV diagnosed patients using a Bayesian method. . We next reconstructed time-to-diagnosis distributions from the individual-level estimates. To avoid censoring of infection-to-diagnosis times, we focused analyses on the subset of infections in 2010-2012 which were diagnosed by May 2019, since most infections in this window would have been diagnosed by the close of study. We assume time to diagnosis did not change substantially in the years 2010-2019. We then fitted a Bayesian hierarchical model with a Weibull likelihood, borrowing information across individuals stratified by region of birth. The model was implemented with Stan version 2.21 (Carpenter et al. 2017) . Full details are provided in Supplementary Material, Section 3. For each year since 2014, we calculated the probability that infections were not diagnosed by database closure, and used the average of these probabilities to estimate the proportion of infections in 2014-2018 that remained undiagnosed by database closure, which we denote by . The total number of Amsterdam infections in 2014-2018 including the θ undiagnosed, which we denote by ,was calculated by dividing the number of diagnosed Amsterdam infections in 2014 Amsterdam infections in -2018 with the estimated proportion of diagnosed individuals through, . (1) = 1−θ We calculate the proportion of infections in 2014-2018 that remain undiagnosed and (1) for Amsterdam MSM and heterosexuals, and region of birth. To reconstruct distinct transmission chains among city-level infections, we used the first available partial HIV-1 polymerase (pol) sequence from Amsterdam PLHIV, Dutch PLHIV from outside Amsterdam, and >82,000 pol sequences from non-Dutch PLHIV that were at least 1300 nucleotides long from other countries classified into 10 regions: Africa, Western Europe, Eastern Europe and Central Asia, North America, Latin America and the Caribbean, Dutch Caribbean and Suriname, Middle East and North Africa, Asia and Oceania. The non-Dutch viral sequences were retrieved from the Los Alamos HIV-1 sequence database on March 2, 2020. All sequences were subtyped using Comet (Struck et al. 2014) and Rega . Subtype-specific alignments were generated with Virulign (Libin et al. 2019 ) (Supplementary Text Section 4.1). Subtype-specific HIV phylogenetic trees were reconstructed with FastTree v2.1.8 . Then, we attributed to all viral lineages in the phylogenies a 'state' label that included information on the transmission risk group (MSM, heterosexual, other) and place of birth (defined above) with phyloscanner version 1.8.0 (Wymant et al. 2018 ) and as in (Bezemer et al. 2022) . In this analysis, lineages are grouped into phylogenetic subgraphs that have the same, uninterrupted state label based on maximum parsimony. Diagnosed Amsterdam patients in the same subgraph are interpreted as belonging to the same transmission chain, and the estimated state of the root of the subgraph is interpreted as the origin of the transmission chain. We here refer to the subgraphs also as the phylogenetically observed (parts of) transmission chains. To capture phylogenetic uncertainty, phylogenetic analyses were repeated on 100 bootstrap replicates drawn from each subtype alignment, and transmission chains were enumerated across these replicate analyses. See Supplementary Text 4.2 for full details. We classified phylogenetically reconstructed transmission chains by the infection dates that we estimated from each patient's diagnosis date, risk group, age, CD4 trajectory and viral load trajectory. Chains were classified as 'pre-2014' if at least one of its members had a posterior median infection date before 2014, and as 'emerging' if all members had a posterior median infection date after January 1, 2014. For all pre-2014 chains, we determine the number of infectious individuals at the start of 2014 from viral load data. Specifically, we defined patients as suppressed by 2014 if their last viral load measurement before 2014 was below 100 copies/ml, and count for each pre-2014 chain its suppressed and unsuppressed members by 2014. Estimating the growth of city-level transmission chains Because of the large number of late presenters and incomplete sequence coverage in diagnosed patients, the phylogenetically observed transmission chains are incomplete and statistical models were required to estimate the growth and origins of Amsterdam transmission chains. We extended the Bayesian branching process model of (Bezemer et al. 2022) to describe the growth of pre-existing transmission chains and transmission chains introduced since 2014. In the extension, the model likelihood is described by the observed number of new cases in each chain conditional on the number of index cases, and integrates out all possible scenarios of unobserved members of the actual transmission chains (Supplementary Text Section 5). In the model, the index cases are assumed to be infectious and defined by the number of unsuppressed members by 2014, adjusted for the sampling probability of such members. In chains which emerged since 2014, we assume that there is one index case. The likelihood then comprises the growth distributions of emerging chains (since 2014 as defined above), pre-2014 chains that continued to grow, and pre-2014 chains with unsuppressed members that did not grow. Pre-2014 chains for which all members were suppressed by 2014 and which did not grow were not included, because these chains had no unsuppressed index case. We fitted the branching process model with Stan version 2.21 to MSM chains borrowing information across subtypes, and similarly for heterosexual individuals. The primary output of the model are posterior predictive distributions on the growth of the actual transmission chains while accounting for as of yet undiagnosed and unsequenced individuals. This includes emerging chains that were entirely unsampled. Full details are provided in the Supplementary Text, Section 6. Estimates of the proportion of Amsterdam infections between 2014-2018 that originated from an individual living in Amsterdam, can be calculated from the predicted actual transmission chains. We denote this proportion by , and interpret it as the proportion of infections which γ have the potential to be locally preventable through local intervention. This is because all infections originating from an individual living in Amsterdam had a local source, except the index cases in the emerging chains. We have, Through Equation 2, we obtain estimates of for both MSM and heterosexuals, and for each γ subtype. To obtain estimates of transmission chains and of that are stratified by place of γ birth, we used weighted averages across chains and subtypes, with the weight determined as the proportion of observed individuals from a particular area of birth. Full details are provided in the Supplementary Text, Section 6. Figure 1 illustrates how each of the data sources are used to build the model and estimate the proportion, and number, of locally acquired infections. As from 2002 ATHENA is managed by SHM, the institution appointed by the Dutch Ministry of Public health, Welfare and Sport for the monitoring of people living with HIV in the Netherlands. People entering HIV care receive written material about participation in the ATHENA cohort and are informed by their treating physician on the purpose of data collection, thereafter they can consent verbally or elect to opt-out. Data are pseudonymised before being provided to investigators and may be used for scientific purposes. A designated data protection officer safeguards compliance with the European General Data Protection Regulation. ). Figure 1 : Graphic describing approach to analysis. Input data includes patient baseline data at registration, clinical biomarker data and viral sequence data. Biomarker data is used to estimate the proportion of undiagnosed infections, and thus the total population size of PLHIV. Sequence data is used to reconstruct phylogenetic trees. The phylogenetically likely subgraphs are extracted and used to model the growth of transmission chains in Amsterdam, adjusting for the sequence sampling fraction. Between January 1 2014 and May 1 2019, there were 846 HIV diagnoses in Amsterdam residents who self-identified as MSM (79%) or heterosexual (21%). 275 (33%) of these diagnoses presented with a CD4 count below 350, with late presentation being higher among heterosexuals. All diagnosed patients had biomarker data available to estimate time to diagnosis, and 516 of 846 (61%) were estimated to have been infected between 2014-2018 based on the posterior median infection time estimate (Table 1 , and 5.2% of individuals were lost to care van Sighem et al. 2020 ). Overall, we find the individual-level time-to-diagnosis estimates varied substantially within each of the 9 risk groups by transmission group and region of birth (Supplementary Figures 1 and 2) . Regardless, the posterior median time-to-diagnosis estimates among individuals were 14 months longer in heterosexuals than in MSM, and 9 months longer among heterosexuals born in the Netherlands than Dutch-born MSM. Among heterosexuals, median times to diagnosis were 21 months longer among heterosexuals born in Sub-Saharan Africa compared to Dutch-born heterosexuals (Table 1) . While the estimated times to diagnosis are shortening compared to earlier calendar periods, the substantial diagnosis delays continue to undermine the long-term prognosis of infected individuals and transmission prevention efforts. We next adopted viral phylogenetic methods to understand how the diagnosed Amsterdam infections since 2014 are distributed across Amsterdam's HIV transmission networks. 387 of the 516 (75%) individuals had a partial polymerase HIV sequence available, of whom 344 were of the major subtypes or circulating recombinant forms that are circulating in Amsterdam (B, 01AE, 02AG, C, D, G, A1 or 06cpx). 43 individuals were excluded from further analysis as they were associated with other subtypes or recombinant forms, or their subtype identification was inconclusive. Supplementary Table S1 summarises the characteristics of the study population, and those with a sequence available. We reconstructed viral phylogenies using the HIV sequence data from these individuals combined with viral sequences from 3,647 Amsterdam diagnoses with estimated infection prior to 2014, 6,875 diagnosed individuals from the Netherlands outside Amsterdam, and 14,222 viral sequences from outside the Netherlands that were genetically closest to those circulating in the Netherlands. We identified across the major HIV-1 subtypes and circulating recombinant forms 1,829 distinct viral phylogenetic subgraphs that comprised at least one diagnosed Amsterdam infection prior to 2014, which we refer to as the phylogenetically observed pre-2014 chains ( Table 2 ). Considering growth, 89 (7%) of the 1,253 phylogenetically observed pre-2014 chains in Amsterdam MSM had at least one new member diagnosed in 2014-2018, and 114 chains emerged ( Figure 2 and Table 2 ). In Amsterdam heterosexuals, 15 (3%) (Table 3 ). Emerging transmission chains outnumber pre-existing, growing transmission chains We next used a Bayesian branching process growth model to predict the size and growth of the actual transmission chains, and account for the fact that larger proportions of recent infections remain undiagnosed and that approximately half of diagnosed individuals had a sequence sampled (see Materials and Methods, and Supplementary Material, Section 6) . Model fit to the observed growth distributions was very good (Supplementary Figure S16 ). We estimate that there are substantially more emerging chains in Amsterdam since 2014 than phylogenetically observed, 182 in in heterosexuals, reflecting that emergent chains have a high probability to be entirely unobserved when growth is below the epidemic reproduction threshold of one (Table 2) . Thus, the estimated actual, emerging chains outnumber the growing pre-2014 chains in both Amsterdam MSM and heterosexuals more strongly than the phylogenetic data suggest. We Figure 4A ). In Amsterdam heterosexuals, an estimated 57% [41-71%] of infections in 2014-2018 were locally preventable, with more variation by region of birth, though we caution that the underlying sample sizes are small. More than 300 cities have by the end of 2021 signed the Fast-Track Cities Paris Declaration and committed to end the AIDS epidemic by 2030, addressing disparities in access to basic health and social services, social justice and economic opportunities. The city of Amsterdam reached the UNAIDS Fast-track Cities 95-95-95 targets before the onset of the COVID-19 pandemic, and has seen a decade of declines in city-level HIV diagnoses. Here, we quantify the further potential of preventing HIV infection and recent annual spread at city level based on viral phylogenetic analysis of the cities' growing HIV transmission chains. We can structure our insights in four themes. First, when focusing on the denominator of recent infections that are estimated to have occurred in 2014-2018, the proportion of undiagnosed individuals infected with HIV are at or above 20% in (self-identified) Amsterdam MSM risk groups, and at or above 30% in Amsterdam heterosexual risk groups. These results underscore that strategies aimed at raising awareness of HIV infection, providing easy access to checking symptoms of early HIV infection, encouraging frequent testing, PrEP provision, addressing fears of a positive test and reducing sitgma are vital to break the forefront of ongoing HIV transmission chains (H-TEAM Amsterdam n.d.; Maartje Dijkstra et al. 2017; Heijman et al. 2009; Burns, Rodger, and Johnson 2017; Myers et al. 2018) . The estimated times to diagnosis document substantial disparities across risk groups in entering HIV care in Amsterdam, and separate efforts have characterised individuals with late diagnoses (Op de Coul et al. 2016; Bil et al. 2019; Slurink et al. 2021) . One limitation of our approach is that time-to-diagnosis is estimated from clinical biomarkers and so, for individuals who recently arrived in Amsterdam, includes the potential time periods from infection to arrival. This prompted us to investigate potential upwards bias in our estimates of the proportion of undiagnosed Amsterdam infections, by comparing our estimates against those generated by the ECDC model for all the Netherlands. We found that the ECDC model resulted in larger undiagnosed proportions (Supplementary Material, Section 3.5), suggesting that our results are unlikely to be substantially upwards biased. We also validated time to infection estimates by comparing the estimated proportion of recent HIV infections ( 6 months) with those estimated in an independent study in Amsterdam using ≤ avidity assays, and found them to be similar . See Supplementary Figure S4 for details. Second, we documented the growth of Amsterdam HIV transmission chains in which all phylogenetically observed members were virally suppressed by 2014. We find that regardless of risk group, almost all such virally suppressed chains did not grow in the sense that no new infections were phylogenetically observed. These results are unsurprising and mirror the established relationship that treatment for HIV infection, which results in undetectable viral load equals untransmittable virus. To track progress in tackling inequalities between those suppressed, and going the last mile to end the AIDS epidemic, we suggest that sequences are obtained from all newly diagnosed patients and the suppression status of transmission chains is routinely monitored. Third, we initially speculated that with a decade of declining HIV diagnoses in Amsterdam, those infections that still occur might be concentrated in newly seeded, emerging transmission chains. It is challenging to interpret the directly observed data because high proportions of individuals remain undiagnosed and/or are not sequenced, and emerging chains are more likely to be completely undetected. We thus used statistical growth models accounting for unsampled cases, and we estimate in contrast to our initial speculations that 53% of new Amsterdam MSM infections in 2014-2018 grew from chains that existed prior to 2014, and 39% of new Amsterdam heterosexual infections. Following up and tracing back from known transmission chains is easier than discovering emerging chains, and so the many new infections that originate in existing chains have particularly high prevention potential (Oster, France, and Mermin 2018; Little et al. 2021; Dennis et al. 2021 ). Fourth, we characterised the locally preventable Amsterdam infections in 2014-2018 by key population, i.e. MSM and heterosexuals stratified by region of birth. We defined locally preventable infections as the infections in Amsterdam residents in 2014-2018 who are estimated to have a source in another Amsterdam resident. The locally preventable infections thus comprise all new infections in pre-2014 chains and all new infections in emerging chains except their index case. We estimate that 68% of infections in Amsterdam MSM were locally preventable, and 57% in Amsterdam heterosexuals. One limitation of our analyses is that in addition to undiagnosed individuals, viral sequences were missing for 33% of diagnosed MSM and for 37% of diagnosed heterosexuals. We thus found data on fewer of the actual transmission chains than planned, and our sampling frame is not powered to identify statistically meaningful differences in the proportion of locally preventable infections by risk group. More detailed estimates into the locally preventable infections among migrants have recently been obtained through clinic surveys across Europe (Arco et al. 2017) . These data suggest similar estimates of in-country HIV acquisition post migration of 51% in heterosexual women and 58% in heterosexual men, which are consistent with ours, and further highlight important variations by place of birth and other demographic characteristics that we are unable to uncover. In summary, we find considerable potential to prevent Amsterdam HIV infections, which could be targeted through city-level interventions, even in the context of substantial improvements in curbing the number of diagnoses and infections in Amsterdam over the past 10 years. Given the similarity of the demographics, HIV burden, access to care, and prevention approaches across many cities in Western Europe, our conclusions are relevant to many UNAIDS Fast-Track cities, and provide evidence-based support for locally-targeted combination HIV prevention interventions in metropolitan areas. In the meantime COVID-19 is severely disrupting prevention messaging, testing and PrEP services, and early pathways to care, making innovative and targeted prevention approaches all the more important. performed all analyses. AB, OR wrote the draft manuscript. All authors contributed to and approved the final manuscript. HIV physicians can review the data of their own treatment centre and compare these data with the full cohort through an online report builder. Statistical information or data for own research purposes can be requested by submitting a research proposal (https://www.hiv-monitoring.nl/english/research/research-projects/). For correspondence: hiv.monitoring@amc.uva.nl. Code is available on Github, https://github.com/alexblenkinsop/bpm. Supplementary Text Supplementary Tables S1-S5 Supplementary Figures S1-S28 PC4 is the most granular administrative city level in Amsterdam, with 12,000 residents on average per PC4 area and a number of residents ranging from 10 to 26,263. Figure S18 shows a map of the 81 Amsterdam PC4 areas. Amsterdam patients were identified as patients with a first or more recent registration in the Amsterdam GGD region. The ATHENA database version was closed on March 31st 2019 [2] . We obtained data for 19,204 patients from the Netherlands, with 7,773 of these having an Amsterdam postcode at first or last registration. We leverage baseline data recorded at registration on year of birth, country of birth, We also obtained datasets from the ATHENA cohort of partial HIV-1 polymerase (pol) sequences of Amsterdam patients, including date of sample, and of clinical data collected longitudinally of viral load measurements and CD4 counts. In the study, we focus on infections estimated to have been acquired between 2014-2018 (see Section S3.1). We also consider MSM and heterosexual transmission groups only, since less than 2% of infections were in other transmission groups. Since we focus on infections acquired between 2014-2018, we define the study start and end time by, In this section, we first describe how we fit a model to clinical biomarker data to estimate the time from infection to diagnosis, and consequently the date of infection. Next, we describe how we fit a model to the posterior median estimates of the time to diagnosis, to estimate the proportion of Amsterdam infections which remained undiagnosed by the close of the study. We define the complete cohort of patients registered in Amsterdam by N . We first follow methods in [3] to estimate time from infection to diagnosis for individual i ∈ N by w i . We 27 use an indicator R i to denote transmission risk group of each individual, where, We utilise clinical biomarker data for each patient on CD4 counts and viral loads, measured after diagnosis but before onset of AIDS or start of ART. As a caveat, we keep viral load measurements within one week of ART start, and CD4 counts within one month of ART start. This choice is supported by the fact that ART takes time to act. We denote CD4 counts by y c , and viral loads by y r , and encapsulate measurements for all i individuals in a vector, Each measurement is collected at an (unknown) time since infection, We have clinical data prior to AIDS diagnosis or start of ART for 6,879 (88%) of patients. For the remaining 12% we are unable to estimate the time of infection. We then denote the time between diagnosis and each biomarker measurement by, From this, we can then express the time from infection to measurement date in (S5) in terms of the estimated date of infection, w i , and the time between diagnosis and each biomarker measurement as follows, We then use a bivariate linear mixed model for the joint distribution of the two biomarkers over time and denote their distribution by, for the joint distribution of the two biomarkers over time. We place a uniform prior on w i over (0,u i ), where u i is the interval between time at risk for each individual and HIV diagnosis. We take the risk onset date to be the maximum of the time between the last negative and test and diagnosis, and the time between the individual turning 15 years of age and diagnosis. The posterior distribution of w i is as follows: Figure S1 shows the distribution of individual median estimates for time to diagnosis by the risk groups given by (S1a) and (S1b) for MSM and heterosexuals, respectively. Figure S2 plots given typical disease progression [4] . We first define an indicator U i (τ ), which is a function of a given year τ , in which, We then define the synthetic cohort of infections in 2010-2012 by S12. We then consider individuals k ∈ C M SM and l ∈ C HSX . For each transmission group, we defined each strata by place of birth given in equations S1a and S1b. which takes as value the place of birth of individual k, as defined in equation (S1a). We estimate ethnicity-specific shape and scale parameters κ j(k)∈M and λ j(k)∈M which can borrow information from each other through a hierarchical prior distribution. For convenience when choosing priors, we re-parameterised the Weibull distribution in terms of its median and 80% quantile (log χ 50 j(k) , log χ 80 j(k) − log χ 50 j(k) ). The quantile function for the Weibull distribution is given by Equation (S13). We then express the parameters of the Weibull distribution as follows: and then specify the Weibull model and its prior distribution as follows, , exp log(χ 50 j(k) ) − log(log (2)) log(χ 50 j(k) ) ∼ N (µ log χ 50 , σ 2 log χ 50 ) (S15b) where Q(0.5) and Q(0.8) are the empirical quantiles from the estimated times to diagnosis, for each transmission group. The joint posterior distribution of model (S15a) was estimated in rstan with Stan ver- We then estimate, for a given month x ∈ X = {1, . . . , 12} (where x = 1 corresponds to January) and year y ∈ Y = {2014, . . . , 2018} of infection, the probability of an MSM individual not being diagnosed by 1st May 2019, given their place of birth, as follows: and obtain numerical estimates of N The fits were good, with the empirical CDFs generally lying within the 95% posterior intervals of the fitted CDFs for all risk groups. Figure to that obtained when using only midpoint estimates from seroconverters ( Figure S22b ). Estimates are compared by year of infection for each risk group. When using data only from the seroconverters, the estimated proportions of undiagnosed individuals are much smaller. This is likely driven by the fact we excluded patients without a last negative test, who may have typically had longer estimated times to diagnosis. This was also observed by Ratmann et al. [5] . There are also considerably fewer data points, particularly among heterosexuals, resulting in elevated uncertainty in these estimates. Figure S23 shows our estimates for the total number of infected individuals in Amsterdam. Clearly, whilst the estimates are more conservative where we use midpoint estimates than we find using the estimated times to diagnosis (see Figure S3 ), we still find a substantial proportion of individuals to be undiagnosed by May 2019. Second, we considered alternative classification rules for grouping individuals into infected before or after 2014. Specifically, we considered as cutoff point in equation (S10) the 30% (and 40%) quantile of the posterior distributions of individual time to diagnosis times, in place of the posterior median estimate. The model for estimating times to diagnosis from biomarker data has been previously validated with good accuracy on classifying individuals into two groups of those infected before or after a certain cut-off point [3] . Here, we nonetheless performed these sensitivity analyses to quantify the potential impact of hypo- Third, we considered utilising estimates for Amsterdam from the ECDC HIV modelling tool [6] . We used partial HIV pol sequences from Amsterdam and the rest of the Netherlands from the ATHENA cohort and aligned these to the reference genome HXB2 [7] using Virulign [8] . Sequences which failed to align were aligned globally using Mafft version 7 [9] . Nucleotide positions which were missing for most sequences, or not in the reference sequence HXB2 were removed. Known antiretroviral resistant mutations were masked using the R package big.phylo [10] . The final alignment was 1302nt in length. We carried out some manual curation of the alignment, removing all gaps and resolving sequences which did not align correctly. We then classified sequences by subtype using COMET [11] and verified any which were uncertain with REGA v3.0 [12] . We downloaded 82,708 background sequences from the Los Alamos HIV-1 sequence database on 27th February 2020, specifying fragments in the POL region longer than 1300nt. We then used the Basic Local Alignment Search tool (BLAST, https://blast.ncbi.nlm. nih.gov/Blast.cgi) to identify the top 20 closest background sequences to each of the Dutch sequences, which we kept and aligned to the Dutch sequences using the HXB2 reference sequence. We created alignments by subtype, excluding the least common subtypes with fewer than 50 Dutch and background sequences. We used FastTree v2.1.8 to reconstruct phylogenetic trees for each subtype [13] . We then assigned labels to each sequence. Sequences from Amsterdam were labelled according to their risk group, sequences from the rest of the Netherlands (excluding Amsterdam) were labelled as such, and background sequences were labelled according to the country they originated from. The geographic regions for the MSM trees were, We then used phyloscanner v1.8.0 [14] to assign one of the state labels to each viral lineage in the reconstructed phylogenies. Figures S5-S15 show the annotated phylogenetic trees for all major subtypes and circulating recombinant forms that circulate in Amsterdam. From the annotated trees, we extracted the viral phylogenetic subgraphs that were assigned to Amsterdam individuals. We assume that viral phylogenetics correctly assigns individuals into subgraphs, which we interpret as the observed parts of distinct city-level transmission chains. In this section, we describe how we build on existing methods to model the growth of the existing and newly introduced transmission chains. Utilising the phylogenetic subgraph data described in Section S4.2, we show how we can model their growth from a point in time, rather than from the first introduction, by utilising the number of infectious cases in the subgraph at a given point in time, and how many new cases were generated from those infectious cases. We also describe the model likelihood of new transmission chains which emerged. We model the growth of transmission chains using putative infection dates, estimated in S3.1. For individuals with no estimate for date of infection, due to missing clinical biomarker data after diagnosis, we subtracted the posterior median time to diagnosis for an individual estimated using the model described in equation (S15a) in the corresponding migrant group, defined by equations (S1a)-(S1b). We model the spread of HIV transmission chains that are characterised by reproduction numbers below one, through branching processes characterised by Negative Binomial offspring distributions [15] . A central component of branching process theory is the probability generating function Q(s) = ∞ i=0 q i s i , where q i is the probability that one individual generates i new infections in one generation, and q 0 is the probability that one individual generates no further infections. For our purposes, we will use two fundamental formulae. First, the kth derivative of Q is and so the probability q k is recovered through which is the probability that two individuals collectively generate k new infections. Thus, the probability that m index cases collectively generate i new infections is given by the ith coefficient of Q m (s). We denote this probability by We consider a Negative Binomial offspring distribution, parameterised in terms of the mean µ and dispersion parameter φ, so that its variance is given by µ(1 + µ/φ). Thus, as φ tends to zero, µ/φ increases, and so does the variance to mean ratio (1 + µ/φ). This means that the Negative Binomial can simultaneously model average reproduction numbers as well as additional heterogeneity in the number of new infections per generation, that goes beyond the variation described by a Poisson offspring distribution. The probability generating function of the Negative Binomial offspring distribution is Thus, we have that the probability that m index cases generate i new infections is where m = 1, 2, . . . are fixed, and the number of new infections takes on values i = 0, 1, . . . . Negative Binomial with mean µm and dispersion parameter φm, which we denote by where m = 1, 2, . . . are fixed, and the number of new infections takes on values i = 0, 1, . . . . Equivalently, we can express the probability that m index cases result in a total number of n cases through or more simplỹ where m = 1, 2, . . . are fixed, and the number of total cases are n = m, m + 1, . . . . Transmission chains require that infections occur in a particular order, while in contrast Equations ((S28a)-(S28d)) do not impose in what generation how many infections occur. For example, with one index case m = 1 and a total size n, Equation (S30) quantifies the probability that n − 1 new infections occur, but there is no constraint that the index case generates at least one new infection in the next generation. Dwass [16] derived the correction factor, and the probability that a transmission chain with m index cases has i new infections, or equivalently n cases, is where m = 1, 2, . . . , i = 0, 1, . . . , and n = m, m + 1, . . . . In practice, only a subset of new infections are captured in viral phylogenies because only a subset of infections are diagnosed, and of those only a subset have virus sequenced. We make two assumptions. First, infections are missing independently of each other with the same probability 1 − ρ, so ρ is the sampling probability of infections. Second, uncertainty in ρ can be quantified within several percentage points through surveillance data and/or modelling; we use this assumption later to ensure that the remaining parameters are statistically identifiable. Then, the probability of observing i individuals in a subgraph that has m known index cases is where m = 1, 2, . . . , i = 0, 1, . . . , and c(k|m, µ, φ) is from Equation (S32a). It is possible that an observed subgraph has m index cases by a particular study start time ψ start and no new infections between ψ start and ψ end , as defined in equations (S2a)-(S2b)), and the probability of observing one such subgraph is c obs (0|m, µ, φ, ρ). Some observed subgraphs are emergent in the sense that they consist of individuals that were all diagnosed after the study start time T . In this case, Equation (S33) cannot be used because it assumes that subgraphs contain at least one index case prior to the study start time T . We assume that emergent subgraphs are seeded by one index case, which for example ignores the possibility that sexual partners infected each other and then moved to Amsterdam, and seeded a new transmission chain in Amsterdam. The probability of observing an emergent transmission chain of size n is given bỹ where unlike Equation (S33), n = 1, 2, . . . may include in the count the index case (if it is sampled), andc(z|m = 1, µ, φ) is from Equation (S32b). The denominator corrects for the event that the index case and all new infections in an emergent chain are unsampled, which is possible with non-zero probability, but always unobserved. We now describe the likelihood of the growth distribution of viral phylogenetic subgraphs, which throughout we identify as the observed parts of distinct city-level transmission chains. In what follows, for brevity, we only consider one transmission group and omit reference to this transmission group. All equations are analogous for the other transmission group. We start with the viral phylogenetic subgraphs in the viral phylogeny of one subtype, and omit for brevity also any indication of that subtype. The data consist of a two-dimensional array x, where x mi denotes the number of subgraphs that had m index cases at the study start time ψ start and i sampled, new infections by the study end time ψ end . Here, m = 1, . . . , M and i = 0, . . . , I where M denotes the largest number of index cases observed, and I denotes the largest number of new infections observed. In addition, the data consist of a 49 one-dimensional arrayx, wherex n denotes the number of emergent subgraphs that have n sampled cases during the study period. Here, n = 1, ..., N , because at least one case needs to be sampled in order to observe the corresponding subgraph. Then, we associate the following log-likelihood to the growth distributions of pre-existing and emergent subgraphs, x mi log c obs (i|m, µ, φ, ρ) + N n=1x n logc obs (n|m = 1, µ, φ, ρ) . The log-likelihood thus involves infinite sums through equations (S33) where the µ s , φ s , ρ s are specific to the corresponding transmission group and subtype. We estimate city-level transmission dynamics, the growth distribution of transmission chains, and the proportion of locally acquired infections through the log-likelihood (S36) of phylogenetically observed subgraphs. For each individual i in the cohort N , if r i is their last viral load measurement taken before 2014, we define them to be not virally suppressed by 2014 through, Then, for each observed subgraph j where (j = 1, ..., A), m j are the observed index cases, we count the number of individuals infected by, but who were not virally suppressed, by the start of 2014. For example for MSM, if C M SM is the subset of MSM in Amsterdam, the number of observed index cases in subgraph j is, and m j > 0. We count analogously for heterosexuals. The actual number of index cases m * j ∼ NegBinom(m j , ν), where ν is the sampling fraction of individuals who were not virally suppressed by 2014. We estimate the true number of index cases under complete sampling m * j by, When m j = 0, estimate m * j from the mode of the pmf for the distribution Binomial(0; m * j , ν). For the subgraphs in which no individuals were not virally suppressed by 2014 (i.e. no observed index case), and no observed new case between 2014-2018, were not included in the subgraph sizes and assumed to have died out. The parameters of the model (S36) are the subtype-specific mean parameters of the offspring distributions, µ 1 , . . . , µ S , the dispersion parameters φ s and the sampling parameter ρ. To estimate the µ 1 , . . . , µ S , we borrow information across subtypes through a hierarchical prior distribution. We interpret the mean parameters of the offspring distributions as the effective reproduction numbers during the study period for the corresponding subtype. The varianceto-mean ratio of the Negative Binomial offspring distribution is 1 + µ s /φ s and measures the degree of dispersion of the size distribution of the transmission chains. For ease of inference, we re-parameterize the dispersion parameter into the variance-to-mean ratio minus one and also specify a hierarchical prior distribution, The log posterior density is given by log p µ s , υ, ρ|x s ,x s , s = 1, . . . , S where the prior densities are specified as follows. For the effective reproduction numbers, we specified the normal-normal two-level prior The hyperprior of the grand mean was centred on the maximum likelihood estimate logμ MLE = log(1 − 1/x), wherex is the average subgraph size [17] . The hyperprior of the grand standard deviation σ was specified by considering the differences in the log maximum likelihood estimates logμ MLE for each subtype. For the variance-to-mean ratio, we specified where 1 is the rate parameter for the exponential distribution. For the sampling parameter, we specified where N S are the number of sequenced individuals, N D are the number diagnosed and θ are the proportion of undiagnosed individuals. The joint posterior distribution was estimated using Stan version 2.19.3 across three chains, each with 2,000 samples. The models mixed well; Figures S26-S27 shows the pairs plot of parameters for the MSM and heterosexual models, respectively. We note that we did not observe multiplicative nonidentifiabilities (banana shape) between the reproduction rate R0 and the variance-to-mean ratio, as found by Blumberg and Lloyd-Smith [17] . where i * = 0, 1, . . . , and for ease of notation we have dropped the suffixes for different subtypes, transmission groups, or time intervals. We approximate equation (S46) numerically from k = 1, . . . , K Monte Carlo samples µ (k) , φ (k) of the joint posterior distribution by generating samples from i * (k) ∼ c(i * |m, µ (k) , φ (k) ), k = 1, . . . , K. This is easily done since the inference algorithm already tabulates the probabilities c(i * |m, µ (k) , φ (k) ) for i * = 0, 1, . . . . Equation (S47) allows us to generate one Monte Carlo sample of the actual growth of all transmission chains. We denote the number of all pre-existing phylogenetically observed transmission chains with at least one index case by and index each of them through j x = 1, . . . , |x|. Correspondingly we denote the number of emergent subgraphs by A certain proportion of emergent transmission chains remains phylogenetically unobserved owing to incomplete sampling. In our model, the probability that an emergent transmission chain is entirely unobserved is given by as in Equation (S34). Thus, the expected number of emergent transmission chains is |x|/(1− ρ e not-obs ). We obtain a Monte Carlo estimate of (S50) by plugging in our estimates of the joint posterior density, Using Equation (S51), we can predict the number of completely unobserved, emergent subgraphs through where the Negative Binomial is specified in terms of the number of failures and success probabilities, with mean |x|(1 − ρ e(k) not-obs , so that the mean of |x| + N * (k) not-obs equals as desired |x|/(1 − ρ e(k) not-obs ). We index all emergent transmission chains through j e = 1, . . . , |x| + N * (k) not-obs , and note that the number of emergent transmission chains is uncertain. Then, the actual number of new cases since 2014 in the chain corresponding to the observed, pre-existing subgraph j x is predicted by sampling i * (k) m j x is the number of index cases in the corresponding subgraph. Similarly, the actual number of new cases since 2014 of the chain corresponding to the emerging transmission chain j e is predicted by sampling i * (k) j e ∼c(·|1, µ (k) , φ (k) ), and then calculating n * (k) j e = i * (k) j e + 1. For the emergent subgraphs, we add 1 since we assume as before that the index case occurred after 2014. Aggregating these sizes, we predict the size distribution of the number of chains with i new cases since 2014 by where i = 0, 1, . . . and 1 is the indicator function that evaluates to 1 if i * (k) j x is equal to i, and otherwise to zero. The median estimate and 95% credible intervals of x * i are obtained by drawing posterior samples (k), repeating the calculation of (S53) for each k, and then summarising the set of samples. Figure S24 shows the observed growth of subgraphs in red next to the predicted actual growth of subgraphs in blue (with 95% credible intervals) for Amsterdam MSM and heterosexuals. Figure S24 : Growth of pre-existing (left) and emergent (right) phylogenetically observed subgraph sizes using estimated date of infection. * pre-existing prior to 2014. Ifπ r = (π 1 , . . . ,π R ) are the proportion of phylogenetically observed subgraphs since 2014 with geographic origin r, we can predict the origins of the pre-existing and emergent transmission chains, y j , for each Monte Carlo sample through, We then denote the proportion of predicted emergent transmission chains since 2014 with Amsterdam origin (A) by From Section S7.1, we can use the posterior predictive distribution of the number of new cases for a chain of a given index size (S46) to obtain a Monte Carlo prediction of the number of city-level cases since 2014. This is given by (1 + i * (k) A proportion of these index cases also acquired infection locally, from other risk groups in Amsterdam. We denote this proportion by λ ((S55)). This allows us to estimate the proportion of locally acquired infections since 2014 through The median estimate and 95% credible intervals of γ local are obtained by drawing posterior samples (k), repeating the calculation of (S59) for each k, and then summarising the set of samples. We then seek to estimate the proportion of locally acquired infections by ethnicity. Until now, all generated quantities are calculated for each subtype, without specific indexing. To obtain estimates of locally acquired infections by ethnicity, we apply weights to the subtypespecific estimates of locally acquired infections, using the proportion of predicted individuals from each georegion with subtype B or non-B infections. If γ (k) local,s is the proportion of locally acquired infections for subtype s ∈ {B,non-B}, and we introduce indexing for the birthplace by subtype, we then calculate the proportion of locally acquired infections by place of birth r as follows: As before, the median estimate and 95% credible intervals of γ local are obtained by drawing posterior samples (k), repeating the calculation of (S60) for each k, and then summarising the set of samples. To assess model fit, we perform posterior predictive checks against the phylogenetically observed growth distribution of subgraphs for each transmission group and subtype. To keep notations simple, we drop in what follows the suffixes that indicate dependence on transmission group and subtype. We previously described the phylogenetically observed growth distribution through the number of pre-existing subgraphs with m index cases by 2014 and i new cases since 2014, x mi , and the number of emergent subgraphs since 2014 with n new cases,x n . To generate posterior predictions x * mi andx * n , we index the pre-existing, phylogenetically observed subgraphs by j x = 1, . . . , |x|. With regard to emergent transmission chains, for the purpose of assessing model fit, we index only the corresponding phylogenetically observed subgraphs, j e = 1, . . . , |x|. In analogy to Equation (S46), we use the sampling-adjusted size equa- tions (S33) and (S34), which lead to the posterior predictive densities p obs (i * |x,x, m) = c obs (i * |m, µ, φ, ρ)p(µ, φ, ρ|x,x)d(µ, φ, ρ) (S61) p obs (n * |x,x) = c obs (n * |m = 1, µ, φ, ρ)p(µ, φ, ρ|x,x)d(µ, φ, ρ). We use these posterior predictive densities to predict the (observed) growth of the preexisting subgraphs, and the (observed) growth of the emergent subgraphs, and compare these predictions to the observed values. Specifically, given a Monte Carlo sample µ (k) , φ (k) , ρ (k) from the posterior distribution, we predict the growth of the pre-existing, phylogenetically observed subgraph j x through Similarly, we predict the growth of the emergent, phylogenetically observed subgraph j e through n * (k) j e ∼c obs (·|1, µ (k) , φ (k) , ρ (k) ). Finally, we aggregate ((S63)-(S64)) to obtain a posterior prediction of the observed growth distributions, High Levels of Postmigration HIV Acquisition within Nine European Countries HIV Testing Week 2015: Lowering Barriers for HIV Testing among High-Risk Groups in Amsterdam Many but Small HIV-1 Non-B Transmission Chains in the Netherlands Declining Trend in Transmission of Drug-Resistant HIV-1 in Amsterdam Disparities in Access to and Use of HIV-Related Health Services in the Netherlands by Migrant Status and Sexual Orientation: A Cross-Sectional Study among People Recently Diagnosed with HIV Infection AIDS Therapy Evaluation in the Netherlands (ATHENA) National Observational HIV Cohort: Cohort Profile Peter Reiss, and HIV Transmission Elimination AMsterdam (H-TEAM) Initiative. 2019 Evaluation of the London-Wide HIV Prevention Programme (LHPP) Stan: A Probabilistic Programming Language Increasing Awareness and Prompting HIV Testing: Contributions of Amsterdam HIV Testing Week HIV-1 Transmission Linkages among Persons with Incident Infection to Inform Public Health Surveillance Development and Validation of a Risk Score to Assist Screening for Acute HIV-1 Infection among Men Who Have Sex with Men Targeted Screening and Immediate Start of Treatment for Acute HIV Infection Decreases Time between HIV Diagnosis and Viral Suppression among MSM at a Sexual Health Clinic in Amsterdam Opting out Increases HIV Testing in a Large Sexually Transmitted Infections Outpatient Clinic Sexual Behaviour and Incidence of HIV and Sexually Transmitted Infections among Men Who Have Sex with Men Using Daily and Event-Driven Pre-Exposure Prophylaxis in AMPrEP: 2 Year Results from a Demonstration Study Symptom Checker UNAIDS Outlook: The Cities Report VIRULIGN: Fast Codon-Correct Alignment and Annotation of Viral Genomes Effective Human Immunodeficiency Virus Molecular Surveillance Requires Identification of Incident Cases of Infection Redefining Prevention and Care: A Status-Neutral Approach to HIV Factors Associated with Presenting Late or with Advanced HIV Disease in the Netherlands Molecular Epidemiology and the Transformation of HIV Prevention Determining the Likely Place of HIV Acquisition for Migrants in Europe Combining Subject-Specific Information and Biomarkers Data Automated Subtyping of HIV-1 Genetic Sequences for Clinical and Surveillance Purposes: Performance Evaluation of the New REGA Version 3 and Seven Other Tools FastTree 2--Approximately Maximum-Likelihood Trees for Large Alignments Annual Epidemiological Spotlight on HIV in London Monitoring Report 2020. Human Immunodeficiency Virus (HIV) Infection in the Netherlands Using midpoint estimates from seroconverters Monitoring recently acquired hiv infections in amsterdam, the netherlands: The attribution of test locations Aids therapy evaluation in the netherlands (athena) national observational hiv cohort: cohort profile Determining the likely place of HIV acquisition for migrants in Europe combining subject-specific information and biomarkers data The Immunopathogenesis of Human Immunodeficiency Virus Infection Sources of HIV infection among men having sex with men and implications for prevention Ecdc hiv modelling tool Complete nucleotide sequence of the AIDS virus, HTLV-III VIRULIGN: fast codon-correct alignment and annotation of viral genomes MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability COMET: adaptive context-based modeling for ultrafast HIV-1 subtype identification Automated subtyping of hiv-1 genetic sequences for clinical and surveillance purposes: Performance evaluation of the new rega version 3 and seven other tools Fasttree 2 -approximately maximumlikelihood trees for large alignments PHYLOSCANNER: Inferring transmission from within-and between-host pathogen genetic diversity Detecting Differential Transmissibilities That Affect the Size of Self-Limited Outbreaks The total progeny in a branching process and a related random walk Inference of R0 and Transmission Heterogeneity from the Size Distribution of Stuttering Chains We thank the steering committee of the Amsterdam HIV transmission initiative for earlier comments on this work; and Imperial College Research Computing Service, DOI: 10.14469/h-pc/2232, for providing the computational resources to perform this study.