key: cord-1009055-ck27nmhm
authors: Myall, A. C.; Peach, R. L.; Wan, Y.; Mookerjee, S.; Jauneikaite, E.; Bolt, F.; Price, J. R.; Davies, F.; Wiesse, A. Y.; Holmes, A.; Barahona, M.
title: Characterising contact in disease outbreaks via a network model of spatial-temporal proximity
date: 2021-04-09
journal: nan
DOI: 10.1101/2021.04.07.21254497
sha: 4cfb6b169cc2de4b0a3280645c91216c1fb68e2f
doc_id: 1009055
cord_uid: ck27nmhm

Contact tracing is a key tool in epidemiology to identify and control outbreaks of infectious diseases. Existing contact tracing methodologies produce contact maps of individuals based on a binary definition of contact which can be hampered by missing data and indirect contacts. Here, we present our Spatial-temporal Epidemiological Proximity (StEP) model to recover contact maps in disease outbreaks based on movement data. The StEP model accounts for imperfect data by considering probabilistic contacts between individuals based on spatial-temporal proximity of their movement trajectories, creating a robust movement network despite possible missing data and unseen transmission routes. We showcase the potential of StEP for contact tracing with outbreaks of multidrug-resistant bacteria and COVID-19 in a large hospital group in London, UK. In addition to the core structure of contacts that can be recovered using traditional methods of contact tracing, the StEP model reveals missing contacts that connect seemingly separate outbreaks. Comparison with genomic data further confirmed that these additional contacts indeed improve characterisation of disease transmission and so highlights how the StEP framework can inform effective strategies of infection control and prevention.

The spread of infectious diseases, direct or indirect, is predominantly mediated by person-to-person contacts 1 . Establishment of contacts, otherwise known as contact tracing, aims to uncover such transmission events and provide a basis to direct disease control policies 2 . Contact tracing plays a pivotal role in the public health response to many diseases. For example, it guided targeted vaccination that led to the eradication of smallpox in many areas 3 . For HIV, contact tracing provided the first evidence that its spread was primarily transmitted through sexual contact, which subsequently led to targeted safe-sex campaigns [4] [5] [6] [7] . It also proved an effective public health tool when it helped limit the 2003 outbreak of SARS in China 8 . Recently, contact tracing has been a key component of the response to COVID-19 9 , where countries and areas that aggressively traced contacts of confirmed cases, such as Singapore, South Korea, New Zealand, mainland China, and Taiwan, have all maintained low case rates which can be attributed largely to their prompt tracking and isolation of spreaders 10-12 .

Contact tracing establishes routes of disease transmission based on a network of contacts between affected individuals 13 . Its effectiveness highly depends on both the quality of data and how a 'contact' is defined, for example, a contact may be established when individuals share a location over a certain period of time. Such discrete definition, however, can miss certain transmission events [14] [15] [16] [17] [18] , and unidentified individuals as well as contaminated environments can moreover contribute to indirect contacts that lead to transmission 19 but are overlooked by contact tracing. Indirect or unseen pathways of transmission through intermediate contacts result in missing contacts between known infected individuals ( Figure 1 ), which inevitably hamper investigations and can misguide conclusions 20, 21 .

Mobility patterns of individuals within a system are a primary vector for disease spread [22] [23] [24] [25] . From the Black Death sweeping across medieval Europe via inter-town population movements 26 to the 1918 Flu Pandemic which followed troop movements in its early stage 27 , mobility data has proven effective in explaining spatial disease dynamics. More recently, flight data has been used for accurate prediction of global disease spread [28] [29] [30] , and inter-hospital spread of disease has been shown to closely follow patient referrals 31 . Taking movement flows into account, moreover, allows recovery of missing transmission events, as it quantifies how connected two individuals have been over time and space, and thus, how likely transmission occurred.

Direct contact Unidentified carriers Figure 1 . Contact network between individuals in a disease outbreak investigation. Establishing networks of contact is frequently used in disease outbreak investigations. However, it can be hampered by missing data, for example, when contacts between carriers are unobserved or non-observable, or when contact is mediated by unknown carriers and environmental surfaces. Ignoring these routes when investigating outbreaks and transmission routes can lead to a mis-characterisation of disease outbreaks.

Here, we present the Spatial-temporal Epidemiological Proximity (StEP) model, which assumes that the likelihood of transmission between individuals given their spatial-temporal (space and time) proximity. StEP assumes that the likelihood of transmission scales with the spatial-temporal proximity of the individuals' time-stamped trajectories, defined via a parametric kernel function. Two types of data are taken as input: (1) the movement trajectories of confirmed cases; (2) information on the total movement flows of all individuals in the system, irrespective of their infection status. To quantify the likelihood of transmission among positive cases, StEP relies on two parameters that can be trained to reflect epidemiological characteristics of a given pathogen. The output is a contact network where nodes represent identified cases and edges plausible transmission routes between them. We have implemented our methods in an R package StEP (github.com/ashm97/StEP).

We demonstrate the effectiveness of StEP in three case studies covering in total five hospital-acquired disease outbreaks. Four of the data sets comprise patients colonised with carbapenemase-producing Enterobacteriaceae (CPE), a multidrug-resistant pathogen with a high propensity to spread in healthcare settings 32, 33 . In our first case study, we trained and assessed StEP on an outbreak of a specific type of CPE, CPE IMP 34 , identified in 116 patients over three years. We used semi-supervised learning to train parameters that align the resulting contact network with the distribution of transmission markers (approximating overall transmission structure) available for a subset of confirmed cases. To assess transmission alignment, we then compared the contact network to a whole-genome sequencing (WGS) analysis of a plasmid identified in most CPE IMP isolates carrying several acquired resistance genes 35 . In our second case study, we transfer the optimised CPE IMP model to analyse further three CPE outbreaks, demonstrating that once trained, StEP can be deployed for a similar class of pathogen without the need for additional labelled data. The efficacy of this approach was validated using less accurate but routinely collected transmission markers. Finally, in our third case study, we analyse data from a hospital outbreak of COVID-19. At the time, COVID-19 was a new pathogen. Since we lacked labelled data to track transmission, we explored StEP's efficacy for unsupervised learning, by aligning the contact networks to topological signals identified in the unsupervised investigation across previous case studies.

The base assumption of the StEP model, summarised in Figure 2A, is that disease transmission is likely to occur between the most connected physical locations 26 . Connectivity between locations is quantified by the total movement of all individuals in a system and represented by an effective distance network among locations 36 . A pairwise proximity between infected individuals can be computed based on their movement histories across the effective distance network. The resulting contact map quantifies the likelihood of disease transmission between known cases via direct and indirect contacts. 

where k n is the number of location-timings for individual n, and each location-timing l i = (v i ,t i ) is a tuple of the ward location v i ∈ V visited by the individual at time t i . Rather than considering only direct contacts, where trajectories overlap T n ∩ T m = / 0, we consider the likelihood of transmission given the proximity of trajectories in space and time relative to the background disease propagation.

We define the spatial-temporal proximity between location-timings l i and l j via a kernel

where τ i j = t i − t j is the absolute time difference and δ i j the effective distance between wards v i and v j ( Figure 2B ) defined by the shortest-path distance (i.e. the most likely path of disease transmission) between wards v i and v j on the distance network D (see Methods: Effective distance). The parameter β represents the propagation speed of the disease and can be adpated to the pathogen under investigation. Based on the proximity between location-timings (2), we then compute the total proximity between two trajectories T m and T n by summing over the pairwise proximities of locationtimings

The total proximity s scales quadratically with the lengths of trajectories, reflecting the increased likelihood of a direct or indirect transmission when two patients coincide for longer periods 37 .Finally, the trajectory-to-trajectory (N × N) matrix S with elements S nm = s(T n , T m ) summarises the Spatialtemporal proximity among individual movement histories ( Figure 2C ). Using an elbow detection algorithm 38 we identified points of interest (the elbows) across the metrics exhibiting platues ((iii), (iv), (v), (vi), and (vii)), highlighted by red vertical lines. In (viii) we identifyied spiking behaviour and instead highlighted the maximum in median betweenness centrality.

To reveal the most likely transmission routes, we filter out weak connections from the similarity matrix S. We use an extension to k-nearest neighbours called continuous k-nearest neighbours (CkNN) 39 for edge filtering which performs well topological features span several different scales 40 , as is the case for disease outbreaks spanning across different scales in time and space (see Methods: CkNN). The filtered similarity matrix S then includes up to k of the strongest contacts for each individual. We refer to the input parameter k to CkNN as the edge density. Edge density reflects pathogen infectiousness, e.g. pathogens that transmit via minimal spatial-temporal contact require high values of k. The resulting matrix S can be interpreted as the adjacency matrix of a contact network, where each node corresponds to an infected individual (Figure 2D ) and edges to contacts and thus possible transmission events 13 .

The two parameters of StEP, the propagation speed β and the edge density k, reflect pathogen-specific transmission dynamics. In our case studies, we present three different approaches to parameter learning. Firstly, we use graph (semi-)supervised learning to predict node labels on genomic markers and identify the contact network that maximises the classification accuracy (see Methods: Semi-supervised learning of parameters). The semi-supervised approach enables learning from partially labelled data via label diffusion to model transmission across the contact network. Secondly, we transfer the previously trained model to new pathogens that are epidemiologically similar and studied in the same environment, thus avoiding the use of additional labelled data. Thirdly, we use StEP entirely unsupervised to investigate an outbreak of an unrelated pathogen without ground-truth labels available, exploiting structural statistics of the contact networks to tune parameters (see Methods: Unsupervised learning of parameters) and validating outcomes on data where optimal parameters are known. Figure 4 . Comparison of contact networks. Contact networks for (1) G m derived via StEP (which includes indirect contacts through background patient movement), (2) G p via physical contact tracing, capturing contacts between patients shared a ward on the same day, and (3) G w constructed via ward-centric tracing (which defines contacts from cases identified on the same ward where within a +/-7 day period).

(A) The contact networks identified by the three methods. (B) Overlap among network structures. G m with edges and nodes coloured by their presence in G p and/or G w . (C.1) -(C.6) highlight regions in the main component of G m that contained 2 or more contacts from G p . and edge density k = 3, suggesting that the resulting contact network best aligns with transmission of CPE IMP . In this optimal region, we observe substantial heterogeneity across the performance of individual biomarkers, many of which seem unrelated to outbreaks (Table s1) . Among the 181 biomarkers, 51 reached mean F 1 -scores above 0.6, and in particular, nine biomarkers performed particularly well, reaching a mean F 1 -score above 0.9. The latter nine biomarkers were also strongly associated with plasmid clusters independently identified through genomic analysis 35 ( Figure S2 ). Hence, it is likely the diffusion of biomarkers over StEP's recovered contact network is well characterising parts of the CPE IMP outbreak.

We further investigated the effect of edge density, which had a more profound effect on accuracy than propagation speed (cf. Figure 3A) , on the contact structure as an indicator for underlying disease dynamics 43 ( Figure 3B ). We quantified heterogeneity in the number of contacts of each individual by the squared coefficient of variation of the degree distribution (CV 2 ) 44 , which can for example indicate the presence of 'super-spreaders' (a minority of individuals who infect many others 45 ) and approximates how fast a disease will spread directly across a contact network 46 . The CV 2 displayed an elbow at the optimal edge density k = 3, indicating that further increases in edge density do not largely change the pathogens spreading behaviour. Similarly, elbows of other topological network measures coincided with optimal edge density, namely, transitivity (a measure of how tightly connected nodes are, and a well-known proponent of network disease dynamics 47 ), the number of connected nodes (patients with a connection to at least one other) and size of the largest component (both reflective of outbreak size), as well as the number of components (indicating the total number of distinct outbreak sub-clusters). The median betweenness centrality 48 , moreover, peaked close to the optimal density, suggesting that critical bridging nodes which connect communities 49 are most identifiable across the contact network in this parameter region. Altogether, the sensitivity of network structure around optimal parameter values suggests that topological signals may indicate pathogen-specific parameters in an unsupervised setting when genomic ground-truth data is not available.

To assess the trained model we compared its contact network G m to two contact networks obtained via traditional methods: (i) the physical contact network (G p ) with contacts between individuals who shared the same ward on the same day (see Methods: Standard contact model), and (ii) a ward-centric contact network G w frequently used in the clinical setting (see Methods: Locationcentric contact model), where cases are linked if identified on the same ward within a +/-7 day period 50 ( Figure 4A ). As StEP accounts for both direct and indirect contacts, unsurprisingly, it identified most contacts, followed by the physical contact network, and the ward-centric network identified only few contacts (G m : 123, G p : 88, G w : 7). The number of connected cases, consequently, varied substantially between networks (G m : 98/116, G p : 75/116, G w : 14/116). In particular, G m revealed a large sub-network of highly connected cases. In contrast, G w only identified few pairs of connected cases, which were predominately contained within larger connected components in G m and G p ( Figure 4C ). Overall, G m contains only four disconnected components (compared to 13 in G p , and seven in G w ), as well as the largest component comprised of 87 patients (compared to 42 in G p , and two in G w ).

Although StEP can both introduce indirect and omit direct contacts, G m largely contained structures of G p and G w , including identified contacts (G p : 64/88, G w : 5/7) and connected patients (G p : 72/75, G w : 18/18, Figure s5 ), was also supported by an alignment of hierarchical clusterings of the weighted networks ( Figure s4 ). Altogether, this means that StEP recovered the outbreak clusters identified via the traditional contact tracing methods and suggests that the extent of outbreaks may be largely underestimated when ignoring transmission via indirect contacts.

To test for alignment of the suggested outbreak structures with true transmission chains, we compared the contact networks to lineages obtained from sequencing of the IncHI2 plasmid, identified in 72 of the 85 sequenced isolates and found to be a driving force of the outbreak 35 ( Figure S2 ). We used four community detection algorithms [51] [52] [53] [54] to partition patients into contact clusters, which we then compared to isolate clusters identified by the plasmid analysis (Table 1 ).

Due to low connectivity, G w did not significantly align with plasmid lineages, regardless of the algorithm, while both G m and G p displayed better and significant alignment across all community detection algorithms (Table 1) . Overall, G m with its additional 23 connected patients over G p , achieved highest alignment to plasmid lineages (when using Walktrap 51 ), supporting the extent of outbreak clusters via indirect contacts estimated by StEP. We therefore conclude that using a wardcentric contact tracing would wrongly suggest that outbreaks are a series of isolated transmission events. By considering patient trajectories, physical contact tracing improves on the ward-centric approach, however, best alignment with additional outbreak data is achieved without constraints on strictly observable physical contact. . As these outbreaks concern the same family of pathogen, Enterobacteriaceae, we assume that the epidemiological parameters optimised for CPE IMP are applicable and transferable. We validate this hypothesis by (i) evaluating the alignment between the contact networks and bacterial species, and (ii) by analysing topological signals as previously in Case Study 1.

Validation of StEP against bacterial species Bacterial species shared by patients often used in the study's hospital trust to quickly identify outbreaks of CPE between epidemiologically connected patients in the absence of genomic data. As species are a high-level summary and due to cross-

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. ; Table 2 . Summary statistics of CPE OXA-48 , CPE NDM , and CPE VIM contact network. For each pathogen, topological statistics are displayed for the StEP-derived contact network G m and the physical contact network G p . In the first column, the average alignment with bacterial species is shown as computed using the label diffusion via GDR (determined using the F 1 -score from GDR, see Methods: GDR and Methods: Classification metric), where node labels are bacterial species (for breakdown by individual bacterial species see SI 3.2). A larger value indicates the contact network structure is more aligned to the distribution of bacterial species. The second column displays the 'optimal' edge density k found by considering each network measure (unsupervised parameter optimisation using elbow and peak detection averaged over topological measures, only applicable to the StEP-derived contact graph G m ). The network statistics of the resulting contact graphs including number of components (not including isolated patients), transitivity, average degree, and number of nodes and edges are shown as well as the intersection of nodes and edges between both contact models.

Graph Alignment to bacterial species~T species outbreaks driven by both species and plasmids, bacterial species, however, do not align perfectly with transmission events, rendering bacterial markers fuzzy indicators of transmission 56, 57 . Nevertheless, for each CPE-type the StEPderived contact networks G m displayed better alignment with the distribution of bacterial species (and across all individual species labels Figure S9 ) than conventional the physicalcontact graphs G p , again suggesting a substantial role of indirect contacts in transmission (Table 2 ).

Topologically indicated edge density As in Case study 1, we investigated the topological sensitivity of the contact networks derived by StEP to edge density k. All CPE-types indeed displayed similar elbows and maxima in the network summary statistics to those observed for CPE IMP , i.e. around k ≈ 3 (k ∈ [3, 4] for all CPE-types, see Table 2 and Table S2) , and for CPE VIM all metrics suggested an optimal edge density of k = 3, same as for CPE IMP . This implies that network topologies are comparable to CPE IMP , in line with the outbreaks concerning the same family of pathogens with largely the same epidemiological characteristics.

Comparison to traditional contact networks Finally, we compared the contact networks derived from transferring the CPE IMP StEP model (with propagation speed β = 0.6, and edge density k = 3) to the other CPE-types with contact networks derived from physical contact tracing ( Figure S8 ). 

Our final Case Study examines 90 hospital patients who acquired COVID-19 during their hospital stay (patients identified in a previous study by Price et al. 59 ). In the absence of additional genomic data, we previously transferred parameters from a pre-trained StEP model to unseen outbreak data of comparable pathogens. COVID-19, however, this presents a wholly different type of pathogen, that spreads with high dispersion via air droplets, compared to CPE, which typically spreads via shared touch surfaces 60, 61 . We therefore learned pathogen-specific parameters by analysing the topological signals previously identified with CPE (see also Methods: Unsupervised learning of parameters). As in Case study 1, edge density had a more pronounced effect than propagation speed on the topology of the contact network (Figure S14 & S14). Qualitatively, all metrics displayed comparable trends in edge density ( Figure 5A ), however, the locations of elbows and peaks across network metrics suggest a higher pathogen-specific edge density for COVID-19 (k ∈ [4, 6] , average k = 5). Lower edge densities resulted in disjoint clusters that only included highly weighted edges, i.e. the most likely transmission routes, which were embedded in further plausible transmission clusters in the contact networks resulting from higher edge densities ( Figure 5B ). Examining the contact network at the topologically indicated edge density, k = 5, we find a large component connecting 69/90 COVID-19 patients. As apposed to a location-centric method originally used in ref 59 ), suggesting a far more linked and connected outbreak (SI XX). We also found strong alignment between the core contact structures of the StEP contact

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. ; https://doi.org/10.1101/2021.04.07.21254497 doi: medRxiv preprint network G m at k = 5 and the physical contact network G p ( Figure 5C ). Even at granular scales, 70 out of 81 patients aligned within leaves of the hierarchical clustering, suggesting that physical contact structure was indeed preserved. The elevated contact density identified by our analysis is consistent with COVID-19 presenting a highly transmissible pathogen 60 . With more than 75% of cases indicated to be connected, our analysis indicates a much larger extent of hospital transmission than suggested by traditional contact tracing, and thus crucially, highlights the importance of considering missing data and indirect contacts for infection control and prevention.

We introduced a graph-based model to recover pathogentailored contact networks amenable to tracing disease trans-mission. StEP integrates total movement into proximity measures to determine the likelihood of transmission between individuals who may or may not have had direct physical contact. It offers improved characterisation of transmission via more comprehensive contact networks. We used StEP to recover contacts among hospital patients in five disease outbreaks. Initially, we optimised StEP for an outbreak of one type of CPE, using genomic data available for a subset of cases 35 . We then deployed the optimised model on three unseen datasets of other CPE-types, and in the absence of genomic markers, validated its fitness with weaker transmission markers often used in clinical practice. Finally, we demonstrated how the model can be applied in an unsupervised setting to provide insights for outbreaks of novel pathogens, as presented by COVID-19 during the early stages of the pandemic. In all cases, StEP retrieved more extensive contact networks that preserved core

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. ;

physical contact structures and, where available, better aligned with independent outbreak data, altogether implying a larger extent of hospital transmission than identified by traditional methods.

StEP presents promising advances for contact network epidemiology, where proximity sensors have become a goldstandard to collect detailed contact data and understand infectious disease transmission. [62] [63] [64] [65] [66] . The underlying assumption is that proximity increases the likelihood of disease transmission, and we find support for this assumption through the alignment of genomic variation with spatial-temporal proximity (SI 2.3). Proximity-based methods can, however, miss certain transmission events facilitated by unobserved individuals, for example, healthcare staff or visitors, or indirectly, for example, via contaminated instruments and surfaces. StEP addresses missed contacts by considering total movement as the key vector for disease transmission, an assumption that has proven effective for various epidemiological scenarios 22, 23, 25, 26, 28-31 .

The model can be trained to pathogen-specific transmission features, and the learned parameters can give insights into epidemiological differences between pathogens such as the level of contact (space time proximity) leading to transmission 64 . Once tuned for a general class of pathogens, StEP can moreover be deployed in real-time and unsupervised, without the need for additional genomic sequencing, which offers a major advantage over alternative reconstruction models that integrate both genomic and epidemiological data 67, 68 . In addition, the graph-based framework is robust to missing data, as often the case in real-world scenarios with only partially complete datasets.

Our analysis suggests that pathogen-tailored contact networks feature topological signatures associated with plausible parameter regions. Across all networks and epidemiological parameters, we observed some highly connected individuals, likely the result of heterogeneity in infectiousness 45 . For specific parameter values, moreover, we observed bridging individuals, highlighted by their pronounced betweeness centralities 49 . Importantly, optimal parameters coincided with topological changes across a number of network metrics related to disease dynamics. For COVID-19, the topological signals correctly identified an elevated contact density and thus higher transmissibility. Such information may guide investigations of other newly emergent pathogens, when data is typically scarce.

In this work, we considered static background movement, however, in practice, background movement is likely nonstatic, for example, due to time-varying fluctuations in human mobility 69 , or more disruptive changes, such as the ongoing COVID-19 pandemic 23, 70 . One way to incorporate such temporal information is by replacing the static background movement in StEP for a temporal network. In this setting, the shortest path in the proximity kernel becomes the shortest time-respective pathway 71 between two locations. The undirected networks recovered by StEP contain contacts that likely led to transmission, but not its direction 1 . Possible future extensions of StEP may account for directed edges to recover transmission trees and provide further granularity to inform outbreak analyses. Furthermore, application of StEP to broader classes of pathogens and validation with WGS would further demonstrate the model's generalisability.

Overall, StEP tackles several challenges associated with contact tracing, and our analysis highlights the importance of indirect (and/or unobserved) contacts in disease transmission. Our approach takes both direct and indirect contacts into account and thus provides an extensive characterisation of disease outbreaks. Its flexible use of heterogeneous data can significantly enhance insights to inform interventions for infection control and prevention.

Our analysis is based on anonymised electronic health records of patients from a large 1000-bed Trust of teaching hospitals in London. Total hospital movement patterns were constructed from all hospital patients' (1862) inter-ward movements (not just those with an infection) over a month of regular hospital activity (see SI 1 -Background hospital movement network). For Case study 1, which looked at CPE IMP , the known CPE IMP -carrying patients pathways (116) were obtained during the original study duration (March 2016 to December 2019) in Boonyasiri et al. 35 . Case study 2 looking at CPE OXA-48 , CPE NDM , CPE VIM , looked at patient carriage (for each type) between August 2018 and August 2020. For greater overview on background CPE during the study duration, we direct the reader to a local study in the same hospital trust by Otter et al. 58 , or to a national overview by Public Health England 41 . Case study 3, which looked at COVID-19 in the same hospital teaching trust, studied the first wave of COVID-19 in the UK between March and April in 2020 (detailed in Price et al. 59 ).

WGS reads of the 85 CPE IMP isolates were processed and analysed for plasmid detection, phylogenetic reconstruction, and plasmid-lineage identification, as we have described in another study 35 . Briefly, using IQ-Tree 72 , a maximumlikelihood phylogenetic tree of IncHI2 plasmids were reconstructed from whole-genome alignment of the reads against a reference plasmid genome pKA_P10 (RefSeq accession: NZ_CP044215.1) 73 . Then the tree was corrected for recombination using ClonalFrameML and eventually was rooted using BactDating 74, 75 . Plasmid lineages were identified from the rooted tree based on its topology and bootstrap values (≥ 65) in the output maximum-likelihood tree of IQ-Tree ( Figure S2 ). Database ResFinder (updated on 28 Oct 2020) and software ARIBA v2.14.6 were used for detecting acquired antimicrobial resistance genes from the reads of 72 isolates (out of the 85 isolates) carrying IncHI2 plasmids 76, 77 . Alleles of identified acquired antimicrobial resistance genes were determined 9/15 All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. and labelled using GeneMates helper script PAMmaker v0.0.5, which also created a presence-absence matrix of these alleles across the 72 isolates 78 .

To track the spread of CPE IMP , 85 available CPE isolates from 116 confirmed CPE IMP -carrying patients were sampled for WGS and the detection of acquired resistance genes. In total, 106 alleles were identified, forming 181 three-allele combinations that we considered as biomarkers (Table S1 ). For an isolate to have a particular biomarker, all three alleles must have been found present together in their genomic analysis.

Routes of disease spread are often dominated by a set of most probable trajectories in terms of effective distance 36 . To derive such effective distance trajectories, we firstly construct a transition matrix P which captures background system movement. In P, elements P i j confer to the fraction of departing individuals leaving node v i and arriving at node v j , such that 1 < P i j ≤ 1. Given P effective distance d i j is,

In d i j a low proportion of movement from v i → v j corresponds to a large effective distance. Explained in greater detail in ref 36 , the logarithm captures additive effective distances along multi-step pathways. The most probable trajectories of disease spread from v i → v j is given by the directed path δ i j = {v i →, ..., → v j } with the smallest total effective distance λ (Γ):

Typically,δ i j =δ ji , since movement patterns vary depending localities of the underling system. Also note, for paths a shortest pathδ i j may not exist (when no path is observed).

CkNN 39 is a density aware neighbourhood joining methods. Whilst CkNN is predominantly developed for graph construction, we have adapted it here for filtering graph edges since it provides a geometrically consistent manifold representation. In this paper we implement CkNN to construct a contact network in two steps: Firstly, we define pairwise distance between any two nodes m and n as d(n, m). Secondly, we use d(n, m) in CKNN to define the CKNN graph as follows:

Here n k and m k represent the k-th nearest neighbours for samples n and m respectively. Importantly, the two parameters k (ranging from 1 to N-1), and γ regulate the CKNN graph structure. Parameter k regulates the k nearest nearest neighbours to consider and hence network sparsity, whereas γ is a positive parameter regulating local point density. In this work we focused primarily on k by fixing γ = 1.

Graph Diffusion Reclassification (GDR) leverages explicit diffusive dynamics through of the graph through its Laplacian 42 . GDR uses continuous-time graph diffusion of each class label for the training nodes as a means to classify the set of evaluation nodes. Through searching the time evolution of node dynamics we can identify the maximum probability of class assignment which forms the basis to classify the test nodes. Here, only the node classes are defined and node features are not present (node features can be used to define a prior on the class probability, however, without node features the probability is equal for each class). Starting with a prior assignment matrix H , which here is taken to be a flat distribution of equal probability of being in each class, a node is classified according to the maximum probability (here described as node overshoots) of any of the c classes. It is easy to see that the values of all the node overshoots are captured compactly in matrix form as

and the node reclassification is given from the maximum overshoot across classes

where the argmax is the standard row-wise operator that finds the maximum across classes, and we define argmax(0) = 0 so that the indicator vector 1 0 (κ Ω ) marks the set of non-overshooting nodes. The GDR code is available at github.com/barahona-research-group/GDR).

The classification accuracy from GDR is then reported as the F 1 -score across all biomarkers and averaged over predictions for five test sets of Monte-Carlo Cross Validation 79 :

where TP i , FP i , and FN i represent true-positive, false-positive, and false-negative rates, respectively, of the i-th test set, when compared presence-absence of biomarkers predicted from network topology to their actual distributions amongst CPE IMP isolates.

The StEP model features two key parameters, namely the propagation speed β and edge density k. These parameters reflect the transmission dynamics of the pathogen under consideration. We present two approaches to identify StEP parameters: firstly, a supervised approach based on partially labelled data that aligns the contact network to transmission markers using a predictive framework, representing the more sensitive means to obtain optimal parameters; and secondly, an unsupervised approach based on the structural properties of the StEP contact network, which is less sensitive, but aligns the network to edge density .

Semi-supervised learning of parameters To learn parameters, we use a semi-supervised node classification framework, Graph Diffusion Reclassification (GDR) 42 to obtain a network alignment to transmission by predicting node labels (see Methods: GDR). Whilst in this study we use genetic biomarkers as node labels, and as a means to trace pathogen transmission, the StEP methodology is general and amenable to alternative markers that follow transmission. Because transmission between two individuals is likely to have occurred if their pathogens contained similar sets of biomarkers markers, this predictive approach can identify a contact network best aligned to transmission. Importantly, this semi-supervised framework is robust to missing data which is standard for real-world epidemiological investigations under limited resources. For the final classification accuracy of GDR we reported the F 1 -score across all biomarkers, averaged over predictions across five test sets of Monte-Carlo Cross Validation 79 (see Methods: Classification metric).

Unsupervised learning of parameters In many instances, ground truth labels are not directly accessible, precluding the use of supervised methods. Unsupervised techniques exploit the intrinsic structure present in the input data to identify model parameters 38 . In this study, as well as directly learning parameters from labelled data, we demonstrate how network topological metrics, namely plateaus and spikes in network metrics, provide signal aligning to tuned parameters and validated parameters. These signals reflect pathogen-specific contact structure, namely reflecting direct infectiousness, and offer an unsupervised means to explore contact in the early stages of new disease outbreaks. 

Location-centric contact model In operational settings, standard contact definition can be challenging to fully implement in hospital due to the large number of linked patients generated by contact with identified carries. Often it is more feasible to restrict the definition of contact, and link cases in space and time bases on where they have been identified 50 . For some pathogens, these points my fall in locations and time where they were screened, or where they became symptomatic time as was used for COVID-19 previously in 59 . Namely, a location-centric approach is taken, which establishes links between individual if they had been identified close across time t in the same location v i . For each individual a single location-timing tuple, p i = (v i ,t i ), records where (v i ) and when (t i ) they were identified as a positive carrier. Formally, location-centric contact is established between two individuals if they have been in the same location within ∆t amount of time from when either was identified as positive. Thus the adjacency matrix S between N individuals based on location-centric contact is defined as:

An importance different between COVID-19 and CPE is the shorter incubation time of COVID-19. Therefore, when constructing the patient trajectories for COVID-19, we only considered movement 14-days before diagnosis 80 (this is a choice that would also be necessary for the traditional contact model, not a limitation of StEP).

Study ethics were granted and approved through Imperial College London NHS Trust service evaluations (Ref:386,379,473). All patient pathway data was collected from the central business intelligence system and fully pseudanonymised before analysis, in accordance with ethics 15_LO_0746.

The patient pathway datasets generated and analysed specifically for the study are not publicly available to protect anonymity of included hospital patients. However, for datasets incorporated from existing studies, we refer readers to Boonyasiri et al. 35 for Case study 1 data on CPE IMP , and Price et al. 59 for Case study 3 data on hospital acquired COVID-19.

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. ; https://doi.org/10.1101/2021.04.07.21254497 doi: medRxiv preprint

The repository for StEP can be found at github.com/ashm97/StEP and GDR at github.com/barahonaresearch-group/GDR.

The authors declare that they have no competing interests. 

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. ; https://doi.org/10.1101/2021.04.07.21254497 doi: medRxiv preprint 1 StEP model

The total mobility network incorporated by StEP in this study (for all case studies) is constructed from 4902 inter-ward transfers of 1862 hospital patients. This data set constitute a large sample of movement patterns from 1862 from 2018 to 2020 and encompassed 129 hospital wards across 5 geographically separate sites. Naturally, our mobility data is represented by of directed graph G = {V, E}, in which nodes V = v i represent hospital wards and edges E = v i j the directed movement of patient between wards v i → v j . Introducing weighting on edges, as w i j then allows us to capture total volume of patients moving from v i → v j in the sampled data. The resultant G contained 1461 total directed edges ( Figure S1 ). On average nodes in G had an in-degree and out-degree of 11. Previous analysis of the same hospital by Myall et al. 1 found communities conferring to both different hospital departments and physical structures (geographical sites and different buildings). This correspondence to known structures suggests an ability graph captures hospital movement patterns, as well as some amount of staff movement given the alignment to different departments.

Hospital 2 Hospital 3 Hospital 4 Hospital 5 Figure S1 . Directed hospital mobility network. The hospital network is constructed from patient movements from wards v i → v j . Nodes are scaled by network betweenness centrality, and edge widths are sized according to total number of patient transfers. Node colouring refers to which of the five hospital sites the ward belonged too. .

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. ; https://doi.org/10.1101/2021.04.07.21254497 doi: medRxiv preprint 2 Case study 1 2.1 Biomarkers CPE IMP biomarkers were recovered from Whole Genome Sequencing of pathogen isolates. For each sample the presenceabsence of 106 alleles of acquired antimicrobial resistance genes were recorded in a feature matrix 2 . From this allele feature matrix, we engineered a set of 181 biomarkers, based on the co-occurrence of three different alleles in a CPE IMP isolate (Table  S1 ). Whilst a possible 5831820 combinations of three biomarkers existed, we restricted our set of interest such that the class distribution was largely equal and was no worse than 80-20 (a total present class count between 15 and 57). Table S1 . Biomarker table. Each biomarker represents co-occurrence of three different alleles of acquired antimicrobial resistance (AMR) genes in a CPE IMP isolate. Each suffix 'var' of allele labels was assigned by PAMmaker to represent a specific variant of a reference allele in the ResFinder database 3, 4 . In GDR's label diffusion framework, each biomarker is diffused over the network structure, giving a classification accruacy (F 1 -score) which is averaged over predictions across five test sets of Monte-Carlo Cross Validation 5 . All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. 

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. Figure S2 . Alignment between lineages, AMR alleles, and core-genome single-nucleotide polymorphisms (SNPs) of IncHI2 plasmids.

plasmids are labelled (red, with bootstrap values shown aside in gold) and coloured (only for major lineages) in a Bayesian dated phylogenetic tree of the plasmids 2

. The red heat map shows allelic presence-absence of acquired AMR genes that are likely to be carried by these plasmids according to a previous study 6 . The blue heat map shows core-genome SNPs of the plasmids when compared them to a reference plasmid genome pKA_P10 (RefSeq accession: NZ_CP044215.1). Coordinates of the SNPs in the reference genome are used as column names (with a prefix 's') for the heat map. Zoom in to show labels.

We also investigated whether our spatial-temporal proximity aligned with transmission by comparing core-genome SNP distances between IncHI2 plasmids identified in CPE IMP isolates from corresponding patients ( Figure S3 ). We would expect higher patient similarities (indicating a higher probability of transmission) with lower genomic diversity (lower number of SNPs). Conversely, higher genomic diversity (lower number of SNPs) between patients would not support evidence of transmission. We found, the optimised contact network G m recovered edges were skewed towards low genomic diversity, reflecting that our model tends to pick edges representing a higher likelihood of transmission ( Figure S3 )A. Moreover, we found a negative correlation between G m recovered edges and genomic diversity, meaning that as our G m indicates of a lower probability of transmission, the genomic diversity increases between the plasmid of a patients CPE IMP , also less indicative of transmission. Looking specifically at edges selected in the optimised contact network G m , there is a visible heterogeneity in the chosen weights. Most edges from G m have high weightings and are visibly distinct from the total edge space, several (few) lower weighted edges are also selected. Our models' heterogeneous selection is a desirable effect missed by standard network construction approaches given the topological plain, non-uniform, and occurring over several different periods of time. Overall, the total edge space in terms of patient movement similarity is skewed, indicating most are unlikely to result in transmissionlining up with only a single edge needed to capture transmission. Compared to edges resulting from physical contacts in G p ( Figure S3B ) there is considerable overlap with 64 physical edges recovered in G m . However, there is no decreasing relationship between patient similarity weighting and genomic variation, as observed in (Figure S3 )A. This difference suggests that, first, not all physical contacts are conclusive of transmission; and second, in order to optimally characterise the transmission, some physical proximity is redundant (24 not included G m ).

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Figure S4 . Tangle-gram comparing model derived contact network G m (left) to physical contact network G p (right). Both networks were hierarchically clustered using walktrap 7 and aligned with dendextend 8 . Distinct branches (not shared between clustering) are marked with a dashed line, whereas shared branches are solid and colour coded according by corresponding branches. Remove node names, add graph labels, make distinct lines dotted.

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. ; https://doi.org/10.1101/2021.04.07.21254497 doi: medRxiv preprint Figure S5 . Venn diagram for overlap in nodes and edges between CPE IMP contact networks: (1) Model recovered contact graph G m which captures indirect contact through background patient movement, (2) Physical contact graph G p , capturing contact between patients which were on the same ward on the same day, and (3) contacts from the result of a standard outbreak investigation G w which had only looked at contact between patients on the ward where they were identified as positive (+/-7 days). Figure S6 . Measures of network topology as a function of propagation speed β for StEP's CPE IMP contact network. Metrics: (i) average degree, (ii) degree variance, (iii) CV 2 , (iv) transitivity, (v) connected nodes, (vi) largest component size, (vii) number of components, (viii) and median betweenness centrality. As opposed to variation by edge density k, propagation speed β exhibits largely less influence over network metrics, except initially when β = 0.

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. For lower values in propagation speed, varying the parameter has a large impact on changes in edges. However, for propagation speed β > 0.5 varying the parameter has little effect on the edges recovered, with edges remaining greater than 90% consistent.

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. Table S2 ). Connected cases in the physical contact network largely overlapped with those identified by StEP (88.8%, 94.9%, and 100% for CPE OXA-48 , CPE NDM , and CPE VIM respectively).

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Figure S9 . Alignment of CPE type contact network to bacterial species. We computed the alignment between each CPE contact network and bacterial species GDR (determined using the F 1 -score from GDR (see Methods: GDR and Methods: Classification metric)) where node labels are bacterial species. In this framework a higher F 1 -score would suggest that movement of bacterial species between individuals is well captured via the contact structure. Whilst the Enterbactor sp. and the Pseudomonas sp. seem to be little captured by the contact networks, across all CPE types, G m produces a better alignment to the distrubtion of bacterial species (exhibited by the higher F 1 -score). Additionaly, both Escherichia sp. and Klebsiella sp. aligned largely to both contact networks, but most so towards G m .

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. ; https://doi.org/10.1101/2021.04.07.21254497 doi: medRxiv preprint Both networks were hierarchically clustered using walktrap 7 and aligned with dendextend 8 . Distinct branches (not shared between clustering) are marked with a dashed line, whereas shared branches are solid and colour coded according by corresponding branches.

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. Both networks were hierarchically clustered using walktrap 7 and aligned with dendextend 8 . Distinct branches (not shared between clustering) are marked with a dashed line, whereas shared branches are solid and colour coded according by corresponding branches.

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. ; https://doi.org/10.1101/2021.04.07.21254497 doi: medRxiv preprint 50 40 30 20 10 0 0 5 10 15 20 25 Figure S12 . Tangle-gram comparing CPE VIM model derived contact network (left) to physical contact network. Both networks were hierarchically clustered using walktrap 7 and aligned with dendextend 8 . Distinct branches (not shared between clustering) are marked with a dashed line, whereas shared branches are solid and colour coded according by corresponding branches.

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. ; https://doi.org/10.1101/2021.04.07.21254497 doi: medRxiv preprint 

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. For lower values in propagation speed, varying the parameter has a large impact on changes in edges. However, for propagation speed β > 0.5 varying the parameter has little effect on the edges recovered, with edges remaining greater than 90% consistent.

All rights reserved. No reuse allowed without permission.

(which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted April 9, 2021. 

Spread of epidemic disease on networks

Contact tracing and disease control

Selective epidemiologic control in smallpox eradication

Cluster of cases of the acquired immune deficiency syndrome: Patients linked by sexual contact

Social networks and the spread of infectious diseases: The AIDS example

Concurrent partnerships and the spread of HIV

The web of human sexual contacts

Quantifying the alignment of graph and features in deep learning

English surveillance programme for antimicrobial utilisation and resistance (espaur) report: 2019 -2020

Semisupervised classification on graphs using explicit diffusion dynamics

The architecture of complex weighted networks

Infectious diseases of humans: dynamics and control

Superspreading and the effect of individual variation on disease emergence

Network structure and the biology of populations

Properties of highly clustered networks

Centrality in social networks conceptual clarification

Dynamics and Control of Diseases in Networks with Community Structure

Communicable disease outbreak management: operational guidance

Computing Communities in Large Networks Using Random Walks

Finding and evaluating community structure in networks

Near linear time algorithm to detect community structures in large-scale networks

Finding community structure in very large networks

Objective criteria for the evaluation of clustering methods

OXA-48-like carbapenemases in the UK: an analysis of isolates and cases from 2007 to 2014

Pervasive transmission of a carbapenem resistance plasmid in the gut microbiota of hospitalized patients

Detecting carbapenemase-producing Enterobacterales (CPE): an evaluation of an enhanced CPE infection control and screening programme in acute care

Development and Delivery of a Realtime Hospital-onset COVID-19 Surveillance System Using Network Analysis

How far droplets can move in indoor environments ? revisiting the Wells evaporation?falling curve

Acute trust toolkit for the early detection, management and control of carbapenemase-producing enterobacteriaceae

Inferring friendship network structure by using mobile phone data

Close encounters in a pediatric ward: measuring face-to-face proximity and mixing patterns with wearable sensors

A high-resolution human contact network for infectious disease transmission

Detailed contact data and the dissemination of staphylococcus aureus in hospitals

Dynamics of Person-to-Person Interactions from Distributed RFID Sensor Networks

Unravelling transmission trees of infectious diseases by combining genetic and epidemiological data

Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data

Understanding individual human mobility patterns

Modeling the Worldwide Spread of Pandemic Influenza: Baseline Case and Containment Interventions

Connectivity and Inference Problems for Temporal Networks

IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies

IncN3 and IncHI2 plasmids with an In1763 integron carrying blaIMP-1 in carbapenemresistant Enterobacterales clinical isolates from the UK

Efficient Inference of Recombination in Whole Bacterial Genomes

Bayesian inference of ancestral dates on bacterial phylogenetic trees

Identification of acquired antimicrobial resistance genes

ARIBA: rapid antimicrobial resistance genotyping directly from sequencing reads

GeneMates: an R package for detecting horizontal gene co-transfer between bacteria using gene-gene associations controlled for population structure

Monte carlo cross validation

Network memory in the movement of hospital patients carrying drug-resistant bacteria

Network and plasmid analysis of a regional outbreak of cabarpenemase-producing enterobacteraceae carrying both blaimp and mcr-9 genes

GeneMates: an R package for detecting horizontal gene co-transfer between bacteria using gene-gene associations controlled for population structure

Identification of acquired antimicrobial resistance genes

Monte carlo cross validation

IncN3 and IncHI2 plasmids with an In1763 integron carrying blaIMP-1 in carbapenem-resistant Enterobacterales clinical isolates from the UK

Computing Communities in Large Networks Using Random Walks

We wish to thank Eleonora Dyakova for help with accessing data.

4 1 1 6 0 4 1 6 ! "#2 6$ %& 6 0 7 1 9 6 ' 6 4 1 7 6 1 23 8 6 () 6 %699  %6999  %69466  %694  %694  %698  %6989  %69885  %687  %65  %6  %777  %756  %947  %4  %8756  %644  %77  %8  %7847  %98  %7454  %768  %65  %677  %44  %448  %4955  %49  %4974  %6  %685  %66  %9  %4  %78  %4  %8  %5  %7  %78  %744  %68  %  %77  %8  %49  %86  %57  %5  %9  %978  %95  %45  %494  %444  %8  %849