key: cord-0103428-egjwytxq authors: Haldar, Aparajita; Wang, Shuang; Demirci, Gunduz; Oakley, Joe; Ferhatosmanoglu, Hakan title: Temporal Cascade Model for Analyzing Spread in Evolving Networks with Disease Monitoring Applications date: 2022-03-28 journal: nan DOI: nan sha: 062ee3537ade2be81dc78976e24adbcdf34014ff doc_id: 103428 cord_uid: egjwytxq Current approaches for modeling propagation in networks (e.g., spread of disease) are unable to adequately capture temporal properties of the data such as order and duration of evolving connections or dynamic likelihoods of propagation along these connections. Temporal models in evolving networks are crucial in many applications that need to analyze dynamic spread. For example, a disease-spreading virus has varying transmissibility based on interactions between individuals occurring over time with different frequency, proximity, and venue population density. To capture such behaviors, we first develop the Temporal Independent Cascade (T-IC) model and propose a novel spread function, that we prove to be submodular, with a hypergraph-based sampling strategy that efficiently utilizes dynamic propagation probabilities. We then introduce the notion of 'reverse spread' using the proposed T-IC processes, and develop solutions to identify both sentinel/detector nodes and highly susceptible nodes. The proven guarantees of approximation quality enable scalable analysis of highly granular temporal networks. Extensive experimental results on a variety of real-world datasets show that the proposed approach significantly outperforms the alternatives in modeling both if and how spread occurs, by considering evolving network topology as well as granular contact/interaction information. Our approach has numerous applications, including its utility for the vital challenge of monitoring disease spread. 
Utilizing the proposed methods and T-IC, we analyze the impact of various intervention strategies over real spatio-temporal contact networks. Our approach is also shown to be highly effective in quantifying the importance of superspreaders, designing targeted restrictions for controlling spread, and backward contact tracing. We propose Reverse Spread Maximization (RSM) and Expected Spread Maximization (ESM) to address these two tasks, respectively, with provable approximation guarantees for the optimality of these solution sets. We employ RSM to find sentinel nodes, i.e., a set of nodes where at least one is likely to be activated regardless of where the spread begins. RSM is useful to detect if there is spread within a population. An application of this approach is "sentinel surveillance" and the early detection of a disease outbreak by using sensors at the set of sentinels [3, 12, 20]. To understand how the spread is likely to take place, we also study ESM to identify susceptible nodes, i.e., a set of nodes that accumulate the most spread from arbitrary seeds. In a disease contact network, these individuals would be the first choice for forward and backward contact tracing (to identify others who either may have been infected by these individuals or may have passed along infection to these individuals), being at high risk of catching as well as transmitting infection. It is important to note that the traditional approaches of merely ranking the nodes by importance or ability to spread activation (as with influence maximization) are insufficient for our proposed setting. The main technical contributions of the paper are as follows: • A temporal independent cascade model (T-IC) for networks with dynamic propagation rates is introduced, along with a 'reverse spread' function which is submodular and provides a formal approximation guarantee on solutions. • A sampling strategy based on hypergraph construction is proposed to handle evolving connections in the network, and dynamic spread patterns are reflected in 'random reachable sets' that form hyperedges.
• We propose RSM for selecting sentinel nodes (e.g., individuals to test with priority) to detect if there is any spread, and ESM for identifying susceptible nodes (e.g., individuals to contact trace) to capture how the spread can occur. • Experiments using highly granular real-world location, contact, and social network datasets confirm that the proposed solutions are effective in identifying i) a sentinel set with highest 'reverse spread' coverage and success of detecting spread, and ii) a susceptible set with highest likelihood of activated nodes included based on the 'expected spread' pattern. Further results are presented for disease monitoring applications, including mitigation and intervention strategies such as targeted restrictions and backward contact tracing. While there is no prior work on capturing sentinel and susceptible sets in temporal networks as studied in this paper, there is extensive work in the areas of information diffusion and influence maximization, especially for social networks [9, 10, 13, 17, 30, 40, 43] . Under the widely used propagation models, Independent Cascade (IC) and Linear Threshold, finding a subset of users that maximizes the expected spread is shown to be NP-Hard [23] . There are studies using time constraints within IC, which use a constant probability of spread or ignore repeated activations from an active node [8, 25, 31] . In contrast, our approach exposes temporal factors in the continuous activation process and allows customizable formulations of propagation rates. Prior work on evolving networks typically focuses on maximizing influence [15, 18, 34, 39, 46] as opposed to the objectives that we study, for which we also provide approximation algorithms with quality guarantees. Several recent studies on data-driven disease monitoring [5, 7, 42, 44] use the same types of location-based social network datasets, e.g., Foursquare [45] and SafeGraph [38] , that we also cover in our experiments. 
It is worth noting that studies such as these do not attempt to perform individual-level trajectory analysis using contact tracing data. Such location-based data is publicly available only in an aggregated form (e.g., total population changes in sub-districts), or is used in a decentralized fashion to determine infection risk for individuals [22] . Our approach enables a more fine-grained analysis in any specific propagation scenarios (e.g., with varying population density and varying proximity between interacting individuals), as illustrated in our experiments using data that have been collected for analyzing pandemic spread, e.g., the data from the BBC Pandemic app [27] and Italy mobility data captured during the COVID-19 outbreak [35] . We also utilize Foursquare trajectories and generate granular trajectories from the SafeGraph data, in lieu of their aggregated versions, for a more granular approach to spread modeling. There is extensive epidemiological research on disease modeling starting from the SIR framework [24] and its extensions, where differential equations govern the transition rates between Susceptible, Infectious, and Recovered stages. This framework has led to numerous spread models being designed that handle datasets at a low-granularity aggregated level (e.g., district-level infection case counts). One such approach deals with greedy immunization strategies restricted to local behavior such as node degrees [36] . SIR is also used to study the impacts of static and temporal network structures on outbreak size, sentinel surveillance, and vaccination objectives [20, 21] . An extension of the traditional SIR approach, SEIR, has been studied in the context of COVID-19 [7] , with SafeGraph sub-district-level spread modeling performed to aid policy making around the partial easing of lockdowns and other such considerations. Results are calibrated against real New York Times COVID-19 aggregated case counts. 
However, individual-level trajectory analysis is not addressed. All of the aforementioned equation-based compartmental models in the domain use aggregate data and assume homogeneous mixing within the population, without considering temporally ordered meeting events. Instead, our solution offers highly granular and efficient predictive models, with the novel goal of identifying critical subsets (i.e., sentinel and susceptible nodes) in the contact network. A recent Hawkes-process-based variation of SIR performs agent-based infection simulations to assign COVID-19 risk scores to individuals [37]. Our approach takes an orthogonal perspective: we use our assigned transmission risk scores to determine optimal solution sets of either sentinels or susceptible individuals. T-IC is also able to utilize individual-level location sequences to trace the disease spread spatially, resulting in support for more use cases such as "backward contact tracing" (using knowledge of active nodes to trace initial seeds), which has attracted attention in the context of COVID-19 [14]. The need for dynamic network analysis for forward and backward influence tracking has been highlighted in the literature [1]. We now present the Temporal Independent Cascade (T-IC) model, and define our optimization objectives, for tracing the spread over an evolving network with dynamic propagation probabilities. In a standard IC model, information flows/diffuses/propagates through the network via a series of cascades. Nodes may either be active (already influenced by the information propagating through the network) or inactive (either unaware of the information because it has not yet reached them, or not influenced by the propagation that did reach them). The standard IC model assumes a static probability distribution over a static graph structure.
This IC process is simulated over a graph G = (V, E, p) where each edge (u, v) ∈ E is associated with a constant probability function p : E → [0, 1], reflecting the likelihood of activation when nodes u ∈ V and v ∈ V share an edge (e.g., a common meeting point in the location histories of two individuals). Propagation starts from an initial seed set in V (the only nodes active at step 0). Propagation takes place in discrete steps, with each active node during step t given a single chance to activate each of its currently inactive neighbors v with probability p(u, v). That is, at every step t ≥ 1, any node made active in step t−1 has a single chance to activate each of its inactive neighbors. The process continues, with nodes remaining active once activated, until no further propagation is possible. This is therefore a stochastic process that requires a large number of simulations to accurately estimate the spread of information. Since the standard IC model uses a static probability distribution over a static network, it is insufficient to handle evolving graphs with changing propagation rates. In our setting, both the structure of G and the propagation rates can change dynamically. For example, a disease spread model needs to consider the order and duration of interactions within a population, and the varying infectivity of a virus over its lifetime. Therefore, we define a new temporal IC model for a temporal network. Definition 1 (Temporal Network). Given a discrete time domain T = {1, 2, ..., τ}, a temporal network is a graph G = (V, E, P) where, for each time interval i ∈ T, edge set E is associated with a different propagation probability distribution p_i. That is, each edge (u, v) ∈ E has probabilities p_i(u, v), one for each i ∈ T. Since u and v may be linked multiple times in T, the corresponding p_i(u, v) for every interval i needs to be separately maintained. An evolving graph is represented by adding all edges and assigning p_i(u, v) = 0 when there is no (u, v) connection in interval i.
Moreover, a rigorous formulation for p_i(u, v) is needed to describe more complicated cases of spread, which we discuss in Section 5.1.2. To model the spread of activation in the temporal graph G, we introduce the Temporal Independent Cascade (T-IC) model as an enhancement of the widely used standard IC model for this novel setting. Given time intervals a, b ∈ T such that a < b, the T-IC model proceeds from a to b as follows: Let A_a denote the set of initially active nodes at the beginning of time interval a. Within each interval i ∈ [a, b], the standard IC model is executed once under probability distribution p_i on the edges and proceeds until no further spread is possible. The set of all activated nodes at the end of interval i is A_{i+1}, which thus also represents the active nodes at the beginning of the next time interval i+1 (active nodes remain active in subsequent intervals). This process is executed for each interval a, a+1, ..., b and the final set of active nodes A_{b+1} is obtained after interval b. In other words, we do not modify the standard IC process, but instead run it to completion independently within each discrete time interval i ∈ [a, b] under the corresponding probability distribution p_i defined over the edges. For the entire chosen time window [a, b] ⊆ T, stacking several of these IC processes and treating activated nodes as the seeds for the next run allows us to mimic the spread over an evolving graph without sacrificing the approximation guarantees. Furthermore, if a node is activated in a specific time interval, it can continue to activate its neighbors in subsequent time intervals. For example, a person infected with a virus may continue to spread infection during interactions with people for as long as they are contagious (e.g., 14 days for COVID-19 [4]). This continuous activation during the T-IC process, along with the propagation probability formulations, makes it possible to reflect real-world spreading phenomena (e.g., for disease monitoring) with our model.
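The interval-by-interval T-IC process described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the per-interval adjacency layout (`adj[i][u] = [(v, p_i(u, v)), ...]`) and the function name are our assumptions:

```python
import random

def run_tic(adj, intervals, seeds, rng=random.random):
    """Simulate the Temporal Independent Cascade (T-IC) process.

    adj: dict mapping interval i -> dict mapping node u -> list of
         (v, p_i(u, v)) out-edges for that interval
    intervals: ordered list of time intervals [a, ..., b]
    seeds: set of initially active nodes at the start of interval a
    Returns the set of active nodes after the last interval.
    """
    active = set(seeds)
    for i in intervals:
        edges = adj.get(i, {})
        # Run a standard IC cascade to completion within interval i.
        # All currently active nodes get one activation attempt per
        # inactive neighbor in this interval; nodes stay active and
        # may spread again in later intervals.
        frontier = set(active)
        while frontier:
            new = set()
            for u in frontier:
                for v, p in edges.get(u, []):
                    if v not in active and rng() <= p:
                        new.add(v)
            active |= new
            frontier = new
    return active
```

Because each interval's cascade runs to completion before the next begins, an edge whose source only becomes active in a later interval than the edge's own interval can never fire, which reflects the temporal ordering the model is designed to capture.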
Figure 1 illustrates the impact of temporal order and dynamic connections in the spread model. Suppose a node u is initially active. In the first case (Figure 1(a)), the nodes near u have a higher likelihood of activation than nodes that arrive later or nodes that leave early. The greater chance of activation due to prolonged duration of contact with the active node u, as well as the increased risk of activation due to the high density of nodes, is reflected in the thicker edge connections. In the second case (Figure 1(b)), the nearby nodes are again more susceptible due to their proximity to u, while a departing node leaves before a late-arriving node approaches closer. The risk of activation for the late arrival then increases as it approaches the other nodes that may have been activated by u. We develop a propagation probability assignment that can capture all these factors for different application settings, which we describe in Section 5.1.2. Our first objective is to identify the sentinel nodes, i.e., the set of nodes that maximizes the probability of detecting any activation in a network where the set of initially active nodes is not known. A practical application is to detect an outbreak using minimal resources (e.g., medical tests). Testing a targeted set of individuals can be an efficient way to detect the presence of disease within a population before it is widespread, akin to the concept of "sentinel surveillance" in network epidemiology [20]. Therefore, our optimization objectives focus on finding k-element subsets of sentinel nodes. The proposed solution can also be run until termination over an indefinite time window, but we note that when there is a temporal constraint within which to detect outbreaks, a truncated solution set (choosing only k nodes) can sufficiently cover the entire relevant spread.
The objective is to maximize the probability that at least one node in a k-element solution set S becomes active after a random T-IC process (i.e., one that starts with randomly selected seeds) within a given time window [a, b]. We note that the 'temporal reverse spread maximization' objective, as defined below, corresponds to this goal of identifying a k-element set of sentinel nodes. Optimizing the success rate of detecting spread anywhere in the network by using sentinels can be achieved by maximizing the expected amount of 'reverse spread' ρ(·), the expected number of nodes that can spread activation to the nodes in S. Thus the expected reverse spread of set S on G within [a, b], denoted by ρ(S, G), is the expected number of nodes that can activate at least one node in set S during a random T-IC process in [a, b]. We discuss how to compute ρ(·) in Section 4. The problem of maximizing the expected reverse spread ρ(S, G) can be formally defined as follows: Definition 2 (Temporal Reverse Spread Maximization). Find the k-element subset of nodes S* ⊂ V such that S* = arg max_{S ⊂ V, |S| = k} ρ(S, G). Definition 2 addresses the problem of detecting if there is any spread (e.g., a disease outbreak). To understand how the spread takes place within the network (e.g., in order to contact trace), we need to identify high-priority nodes that collect activation, i.e., nodes that are the most susceptible to activation. For example, in disease monitoring applications, the objective is to identify a subset for testing whose members are all highly likely to be infected. This is because backward contact tracing [1, 14] from these nodes may offer important candidates for immunization to prevent further spread of infection. Hence, we next aim to identify the k-element subset containing the maximum expected number of active nodes after a T-IC process in [a, b].
Similarly to the temporal reverse spread maximization objective, the 'temporal expected spread maximization' objective defined below corresponds to the goal of identifying a k-element set of susceptible nodes. This second problem is formally defined as follows: Definition 3 (Temporal Expected Spread Maximization). Let X(v) be an indicator random variable for node v ∈ V such that X(v) = 1 if v is active after executing a random T-IC process. Find a k-element subset of nodes S* ⊂ V that maximizes the expected number of active nodes it contains, i.e., S* = arg max_{S ⊂ V, |S| = k} E[Σ_{v ∈ S} X(v)]. In the next section, we describe our proposed solutions to address the above objectives, namely Reverse Spread Maximization (RSM) and Expected Spread Maximization (ESM). While other models lose their rigorous approximation guarantees when modified to support temporal spread, we introduce a significant novel contribution to the well-established sampling approach that allows us to preserve optimality guarantees in this new dynamic setting. We first show that the defined reverse spread function ρ(·) is submodular in Theorem 4.1. It follows that the standard hill-climbing greedy algorithm achieves a (1 − 1/e)-approximation guarantee, i.e., a solution that uses this function can be approximated to within (1 − 1/e) of optimal [23]. Proof. Let G = (V, E, p) be a directed graph where each edge (u, v) ∈ E is associated with a weight p(u, v) denoting the probability that spread occurs from u to v. Kempe et al. [23] showed that the IC model induces a distribution D over realizations of G, such that a directed graph g = (V, E′) can be generated from G by independently realizing each edge (u, v) ∈ E with probability p(u, v) in E′. In a realized graph g ∼ D, the nodes reachable by a directed path from a node v form its reachable set R(g, v), and correspond to the nodes activated in one instance of the IC process with v as the initially active seed node. They proved that for S ⊂ V, the spread function σ(S, g) = |∪_{v ∈ S} R(g, v)| is submodular.
Similarly, the T-IC model induces a distribution D over G = (V, E, P), where the IC model is executed independently in each discrete time interval i ∈ [a, b] under the corresponding probability distribution p_i defined over the edges. Additionally, an activated node remains active in subsequent intervals, getting multiple chances to activate its neighbors. So, a directed graph g = (V, E′) can be generated as follows: For intervals i = a, a+1, ..., b, each edge (u, v) ∈ E is realized in E′ with probability p_i(u, v), only if node u is active at the beginning of interval i. Hence, the reachable set R(g, v) corresponding to a node v on the generated graph consists of all the nodes that are reachable, and activated by time interval b, from the seed v that was initially active in time interval a. Let g^T denote the transpose of g, obtained by reversing all its directed edges. The reachable set R(g^T, v) corresponds to all seed nodes that, if active in interval a, would have the ability to activate the receiving node v by time interval b. Given a set of nodes S, let the reverse spread ρ(S, g^T) denote the number of nodes that can reach some node in S. That is, ρ(S, g^T) = |∪_{v ∈ S} R(g^T, v)|, so the submodularity of ρ(S, g^T) follows. Therefore, the expected reverse spread ρ(S, G) = E_{g ∼ D}[ρ(S, g^T)] is submodular, being a linear combination of submodular functions. □ Borgs et al. [6] use a state-of-the-art sampling strategy to build a hypergraph representation and estimate the spread of activation. We enhance this technique to handle dynamic propagation rates and to identify solution sets for both our defined tasks (i.e., identifying sentinel nodes and susceptible nodes). Our algorithm and sampling strategy use a novel process of generating the hypergraph to encode the reverse spread of any given subset of nodes via its nets. A hypergraph is a generalization of a graph in which two or more nodes (pins) may be connected by a hyperedge (net).
The two-step sampling strategy is as follows: i) we execute random T-IC processes (that start with random active seeds) on the temporal network, and ii) for each execution of a T-IC process, we construct a net whose pins are the nodes activated during that process. As shown in Theorem 4.1, g can be drawn from the distribution D induced by a T-IC model on G. Edges in the graph are realized by traversing only live edges (i.e., edges whose starting node is already active). This constraint of considering only live edges enforces a dynamic nature on the spread, which takes place by respecting the temporal ordering of connections. If a node is reachable from many different nodes in g, then it is more likely that this node will be activated by time interval b. Since any random seed in time interval a is equally likely to start the spread, the existence of more paths that lead to the node results in a higher likelihood of its activation. This means that the reachable set of nodes R(g, v) (i.e., all the nodes in the realized graph g that are reachable by a directed path of edges from the node v), which depends on the random seed node v, is one among many possible sets of activated nodes at the end of a random T-IC process on the randomly sampled g. Hence, the solution depends on two levels of randomness encountered during the hypergraph construction: i) the sampling strategy for g ∼ D, and ii) the computation of R(g, v) given a random seed v. The former depends on the probability distribution induced by the T-IC model over G, while the latter depends on the seed node v. We refer to such a reachable set generated by two levels of randomness as a 'random reachable set' R(g, v). In disease modeling applications, most outbreaks are thought to start with one infected person [19, 26]. Therefore, we consider a single seed node in our experiments on all datasets.
The main sampling step is repeatedly performed to build a hypergraph H = (V, N) where each net n ∈ N is independently generated by executing a random T-IC process from a random seed v. Each net of the hypergraph corresponds to a random reachable set R(g, v), i.e., pins(n) = R(g, v). The solution quality and concentration bounds thus depend on the number of nets generated to build the hypergraph [6]. (Algorithm listing omitted; during net construction, a value x ∈ [0, 1] is drawn uniformly at random and an edge (u, v) is realized when x ≤ p_i(u, v) and v is not yet active, while the greedy selection step repeatedly removes the chosen node and all of its incident nets from H before returning S.) Note that G and H are composed of the same set of nodes V. For a solution set S, the number of nets shared with at least one node in S (which we refer to as deg(S) henceforth) corresponds to the number of times a node in S gets activated during the random T-IC processes executed to compute the random reachable sets. When selecting S as a collection of sentinel nodes, a higher deg(S) makes S more likely to detect spread in the network, which can be understood as follows: The degree of a node v in the hypergraph is the sum of |N| Bernoulli random variables [6], because the inclusion of node v in a random reachable set and hence in pins(n) can be considered a Bernoulli trial with success probability p_v, where p_v denotes the probability that v gets activated in a random T-IC process. That is, the hypergraph node degrees are binomially distributed with an expected value of p_v × |N|, which implies that p_v = E[deg(v)/|N|]. This node degree therefore yields an estimate of the reverse spread of node v, since the reverse spread can be written as ρ(v) = |V| × p_v. Similar to the node degrees in hypergraph H, the expected value E[deg(S)/|N|] corresponds to the probability that at least one node in S gets activated during a random T-IC process. Therefore, the degree of a set of nodes S in hypergraph H corresponds to the reverse spread ρ(S), which can be estimated well if a sufficient number of nets are built.
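The net-construction step can be sketched as follows, again under our assumed per-interval adjacency layout (each net records the activated set of one random T-IC run started from a uniformly random seed; all names are illustrative):

```python
import random

def sample_nets(nodes, adj, intervals, num_nets, rng=random.Random(0)):
    """Build hypergraph nets: each net is the activated set of one
    random T-IC process started from a uniformly random seed.

    nodes: list of all nodes V
    adj: dict interval i -> dict node u -> list of (v, p_i(u, v))
    Returns a list of frozensets, one net per T-IC execution.
    """
    nets = []
    for _ in range(num_nets):
        seed = rng.choice(nodes)          # first level of randomness
        active = {seed}
        for i in intervals:
            edges = adj.get(i, {})
            frontier = set(active)
            while frontier:
                new = set()
                for u in frontier:
                    for v, p in edges.get(u, []):
                        # realize only "live" edges, i.e. edges whose
                        # source is already active (second level of
                        # randomness: the coin flips per edge)
                        if v not in active and rng.random() <= p:
                            new.add(v)
                active |= new
                frontier = new
        nets.append(frozenset(active))
    return nets
```

The degree of a node in the resulting hypergraph is simply the number of nets containing it, which is the quantity the estimation argument above is built on.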
We next describe two algorithms, RSM and ESM, to efficiently compute solution sets for the two tasks corresponding to Definitions 2 and 3, respectively. In the hypergraph H = (V, N), if a node connects many nets (i.e., its degree is high), then that node has a high probability of being activated during a random T-IC process. Similarly, if a set of nodes S covers many of the nets (random reachable sets), then its expected reverse spread ρ(S, G) is likely to be higher. In other words, there is a larger set of nodes that all have a chance to activate at least one node of S within the time window [a, b]. As in the maximum coverage problem, we want to cover the maximum number of nets (elements) in the hypergraph by choosing a solution set of k nodes (subsets). This step is therefore equivalent to the well-known NP-Hard maximum coverage problem [41]. Borgs et al. [6] show that greedy maximum coverage applied to RSM-style node selection on a transposed graph identifies the most influential nodes for influence maximization, since reversing the edges makes these nodes the best spreaders in the original graph (Figure 2). However, as illustrated in Figure 3, we apply our sampling strategy on G directly, yielding solutions for our novel objectives of finding sentinel/susceptible nodes. The sampling step depends on temporal dynamics, which change upon transposing G; therefore, solutions for our objectives are distinct from influence maximization approaches in the literature. Specifically, when applied to disease modeling, the RSM approach can detect an outbreak efficiently: the sentinel nodes collectively correspond to greater coverage of the network than that offered by the ESM solution. The nodes of the ESM solution appear in the random reachable sets more frequently, and are all highly likely to be infected, making them important for contact tracing efforts.
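The maximum-coverage step over the nets admits the classic greedy sketch below; this is the textbook (1 − 1/e)-approximate strategy rather than the authors' exact implementation:

```python
def greedy_rsm(nodes, nets, k):
    """Greedy maximum coverage over hypergraph nets: repeatedly pick
    the node that covers the most still-uncovered nets.

    nodes: list of candidate nodes
    nets: list of sets of nodes (one set per sampled net)
    Returns a k-element solution list.
    """
    uncovered = list(range(len(nets)))
    solution = []
    for _ in range(k):
        best, best_cov = None, -1
        for v in nodes:
            if v in solution:
                continue
            # marginal gain: nets still uncovered that contain v
            cov = sum(1 for j in uncovered if v in nets[j])
            if cov > best_cov:
                best, best_cov = v, cov
        solution.append(best)
        # remove the nets the chosen node covers
        uncovered = [j for j in uncovered if best not in nets[j]]
    return solution
```

The fraction of nets covered by the returned set estimates its reverse spread, so greedily maximizing coverage greedily maximizes the (submodular) reverse spread objective.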
Identifying susceptible nodes in a social network also has interesting use cases, such as capturing or reducing rumor spread more effectively. We used seven real datasets in our experiments. For the use case of disease monitoring, we use five location-based networks. The first two are spatio-temporal network datasets that we construct using the locations (check-ins) of Foursquare users, in line with other research studies on disease monitoring applications [5, 44]. We refer to them as NYC and Tokyo. Our third location dataset is based on SafeGraph [38], which has been used to analyze mobility patterns for COVID-19 mitigation [7]. SafeGraph contains POIs, category information, opening times, as well as aggregate mobility patterns such as the number of visits by day/hour and the duration of visits. Using these mobility patterns, we generate synthetic trajectories for 2 individuals visiting 100 unique POIs in the NYC area over 25 days. To build an individual's trajectory, for each day of the week for three consecutive weeks, we select and assign sequential visit locations to appropriate timestamps (based on travel time and visit duration) as follows: i) each individual receives a random start timestamp for travel, a random start POI location, and a random trajectory length that determines the number of POIs to visit; ii) SafeGraph dwell time estimates and a random distance-based travel time are used to determine the timestamp for reaching the next location; iii) depending on this timestamp, POIs are filtered out of the candidate list (based on opening time, category information, and distance from the current location) to ensure that the trajectory sequences generated are feasible and realistic; and iv) the next POI is selected from among the remaining candidates, and the process (steps ii-iv) is repeated until the full-length trajectory is complete (where no candidates exist, the trajectory is truncated).
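Steps i)-iv) above can be sketched roughly as follows; the POI record fields, time units (minutes), and candidate filter are simplified assumptions (category and distance filtering are collapsed into an opening-hours check, and travel time is a fixed per-POI value):

```python
import random

def build_trajectory(pois, day_start, max_len, rng=random.Random(0)):
    """Sketch of the trajectory-generation loop (steps i-iv).

    pois: list of dicts with illustrative keys 'id', 'open', 'close',
          'dwell', 'travel' (all times in minutes since midnight)
    Returns a list of (poi_id, arrival_minute) pairs.
    """
    t = day_start + rng.randrange(0, 120)   # i) random start timestamp
    current = rng.choice(pois)              # i) random start POI
    traj = [(current['id'], t)]
    for _ in range(max_len - 1):
        # ii) dwell time + travel time give the next arrival timestamp
        t = t + current['dwell'] + current['travel']
        # iii) keep only POIs that are open at the arrival time
        candidates = [p for p in pois
                      if p['id'] != current['id']
                      and p['open'] <= t <= p['close']]
        if not candidates:                  # iv) truncate if none remain
            break
        current = rng.choice(candidates)    # iv) pick the next POI
        traj.append((current['id'], t))
    return traj
```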
We then construct the corresponding contact network by connecting (bidirectionally) nodes that appear in the same location at the same time, considering 5-minute intervals to determine this overlap. We call this semi-synthetic network SafeGraph-traj. For examining intervention strategies in more detail, we use two location datasets developed for studying pandemics: Haslemere and Italy. The first records meetings between users of the BBC Pandemic Haslemere app over time, including pairwise distances at 5-minute intervals [27]. The second reports temporally aggregated mobility metrics for each day's movement of population between Italian provinces, based on smartphone user locations before and during the COVID-19 outbreak, over 90 days [35]. The Italy dataset also provides transition probabilities between provinces, which we use directly as the propagation probabilities for our T-IC process. Finally, we experiment on two social network datasets: wiki-Vote [28] and cit-HepPh [29]. wiki-Vote consists of user discussions on Wikipedia, with edges between users representing votes. cit-HepPh encodes citation connections between research papers. The statistics of the constructed networks are in Table 1. Each edge is assigned its corresponding probability of propagation (e.g., transmission of disease) from one node to the other, based on the needs of the specific application dataset. • For social network datasets, the propagation probability is assigned following the common practice in influence modeling studies [11] of using a uniform distribution. We randomly assign the edges of the network to a discrete time interval in [0, τ], and sample p ∈ [0, 0.3] for each of the edges. • For location-based contact networks and Haslemere, we utilize a domain-informed probability assignment. The Italy dataset directly uses the provided transmission probability between connected nodes.
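The contact-network construction from co-located visits described above can be sketched as below, assuming visit records of the form (user, location, minute); bucketing minutes into 5-minute windows stands in for the overlap test:

```python
from collections import defaultdict

def contact_edges(visits, interval=5):
    """Connect (bidirectionally) nodes seen at the same location in
    the same `interval`-minute time bucket.

    visits: iterable of (user, location, minute) tuples
    Returns a set of directed (u, v) edges, both directions included.
    """
    buckets = defaultdict(set)
    for user, loc, minute in visits:
        # group co-located users by (location, time bucket)
        buckets[(loc, minute // interval)].add(user)
    edges = set()
    for users in buckets.values():
        users = sorted(users)
        for i in range(len(users)):
            for j in range(i + 1, len(users)):
                edges.add((users[i], users[j]))
                edges.add((users[j], users[i]))  # bidirectional
    return edges
```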
Recent epidemiological studies quantify how transmission rates relate to the distance between individuals as well as the overall population density at the location [2, 16, 27, 32]. Based on these, we calculate the propagation probability p_t(u, v) of a connection from node u to node v at time interval t using Equation 4, to incorporate knowledge of virus spreading characteristics, where F_{u,v} is a factor denoting the "force of infection" (the larger the value of F_{u,v}, the greater the transmission probability between u and v) [27], and t′ ∈ (t − t0, t] indicates the relevant duration of time up to the current time interval t. The latter is governed by t0, the duration for which historic infection force is considered, since the transmission probability is decided by the infection force accumulated over t′. Therefore, a minimal expression for F_{u,v}(t) must consider the distance from u to v and the population density at the venue to determine risk, and is formulated based on the literature in Equation 5, where d_{u,v,t} is the distance between u and v at time interval t (based on their location data), n_t is the number of people located at the same venue, and α, γ, β1, and β2 are hyper-parameters. In line with [27], to realistically simulate spread, we use default values of β1 = β2 = 0.1, γ = 0.05, and choose α = 0.05 (for the NYC and SafeGraph-traj datasets) or α = 0.01 (for the Tokyo dataset, due to its dense connectivity). When d_{u,v} > θ (a distance threshold), or when the dataset has no such proximity information, the contribution to the infection force is assumed to be zero (i.e., F = 0). We illustrate these quantities in Figures 4b-4d. Here, we choose a distance threshold of θ = 5 meters as shown in Figure 4a, so there is no infection force once the distance between a node pair exceeds 5 meters. Figure 4b shows the varying distance between contact node pairs every 5 minutes over three days. The Haslemere data covers 16 hours of each day, and for simplicity of illustration we omit the remaining hours of each day from the x-axis of Figures 4b-4d.
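Since the display forms of Equations 4 and 5 are not reproduced here, the sketch below only illustrates the stated roles of the quantities: an infection force that vanishes beyond the distance threshold θ and grows with venue density, accumulated over the history window (t − t0, t] and mapped to a probability. The exponential functional forms and parameter placements are our assumptions, not the paper's equations:

```python
import math

def infection_force(dist, density, beta1=0.1, beta2=0.1, theta=5.0):
    """Assumed form of the per-contact infection force: decays with
    pairwise distance, grows with venue density, and is zero beyond
    the distance threshold theta (5 m in the paper's setting)."""
    if dist > theta:
        return 0.0
    return math.exp(-beta1 * dist) * (1.0 - math.exp(-beta2 * density))

def propagation_prob(history, gamma=0.05):
    """Map the accumulated infection force over the history window to
    a probability in [0, 1].

    history: list of (dist, density) contacts observed in (t - t0, t]
    """
    total = sum(infection_force(d, n) for d, n in history)
    return 1.0 - math.exp(-gamma * total)
```

Any monotone map from accumulated force to [0, 1] preserves the qualitative behavior described in the text (the probability trend follows the accumulated force), which is why Figures 4c and 4d share the same shape.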
We select t_0 = 1 day, i.e., the transmission probability at time t is determined by the past day's interactions. Figure 4c shows the accumulated infection force corresponding to Figure 4b, computed as the sum of λ_{u,v}(t′) over t′, with each λ obtained from Equation 5. The trend of the propagation probability in Figure 4d is the same as in Figure 4c, because the propagation probability is proportional to the accumulated infection force, as shown in Equation 4. A more detailed exploration of the influence of the various hyper-parameter settings for Equation 5 can be found in Section 5.2.3. While we experiment with various hyper-parameter settings for disease modeling applications, these may easily be customized to incorporate domain findings on transmission risks, e.g., [4, 33] (based on contact duration, venue size and occupancy rates, activity type, ventilation, and other factors), as an orthogonal scope of work.
As there is no prior work examining sentinel and susceptible nodes using IC models, we look for comparable alternatives to our RSM and ESM solution sets. We select baselines in three groups. The first consists of IC model-based methods (Greedy-IM [23], DIA [34]), to demonstrate the superiority of T-IC for analyzing spread on evolving networks. The second is the traditional virus propagation method for finding the critical nodes to immunize to prevent an epidemic (T-Immu [36]). The final group covers simple heuristic-based methods (Max-Deg, Random). Greedy-IM obtains the top-k influential nodes using a greedy hill-climbing algorithm over |T| time windows. Since it is infeasible for larger datasets, we run it only on the two smallest datasets, Haslemere and Italy, and report its results in Table 2.
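For context, the Greedy-IM baseline follows the classic greedy hill-climbing recipe: estimate expected spread by Monte-Carlo simulation of the (static) Independent Cascade process, and repeatedly add the node with the largest estimated gain. A minimal generic sketch (not the authors' exact implementation, and without the temporal machinery of T-IC) looks like this:

```python
import random

def simulate_ic(graph, seeds, rng):
    """One Monte-Carlo run of the static Independent Cascade process.
    `graph` maps node -> list of (neighbor, activation probability)."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v, p in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def greedy_im(graph, k, runs=200, seed=0):
    """Greedy hill-climbing: k times, add the node whose inclusion gives
    the largest estimated expected spread."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        best, best_spread = None, -1.0
        for v in graph:
            if v in chosen:
                continue
            est = sum(simulate_ic(graph, chosen + [v], rng)
                      for _ in range(runs)) / runs
            if est > best_spread:
                best, best_spread = v, est
        chosen.append(best)
    return chosen
```

The nested loop over candidate nodes and Monte-Carlo runs is exactly why this baseline becomes infeasible on larger datasets and longer time windows.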
The value of 0 for DIA in this section at k = 10 means that the reverse spread of DIA with k = 10 is the minimum among all methods from k = 10 to k = 50, while the value of 10 for RSM at k = 50 denotes that its reverse spread in this configuration is the maximum among all methods and solution set sizes. The set S returned by RSM collectively achieves the highest reverse spread coverage in all cases, which increases with increasing solution set size k. Without prior information about the initial seeds from which activation begins to spread, distributing limited resources (e.g., scarce or expensive medical tests) to these sentinel nodes (i.e., the nodes selected in S) increases the probability of detecting the spread at an early stage. By contrast, ESM selects the nodes having the highest probabilities of being activated during a random T-IC process, and thus best captures the largest expected spread out of all methods. ESM outperforms Max-Deg, the strategy most often applied in practice, by up to 82% on NYC. ESM is thus an effective targeted strategy for identifying the most susceptible nodes (e.g., for contact tracing or treatment). The binary success rate using RSM is the best for all datasets and k. Comparisons with T-Immu and DIA show that considering temporal properties while also preserving the overall graph structure is vital to selecting the ideal solution sets. RSM consistently outperforms T-Immu (nearly 2x better on SafeGraph-traj) despite their related objectives, since T-Immu cannot capture time-varying transmission probabilities. RSM also drastically outperforms DIA, which performs worse still.
We evaluate the effect of varying the time window while keeping a constant solution set size k = 50 on NYC, Tokyo, and SafeGraph-traj over one-day intervals up to |T| = 25.
We do the same on the datasets with a pandemic background, i.e., Haslemere and Italy, where the propagation probability for Italy is the real transmission probability of people moving between any two provinces over |T| = 90 days, while for Haslemere it captures infections over |T| = 3 days. Figures 5 and 6 show that reverse spread, expected spread, and binary success all increase with |T|, as a longer window allows more activations to take place. As expected, RSM has the best performance with respect to reverse spread and binary success rate in Figure 5, and ESM outperforms the other methods in terms of expected spread in Figure 6. Considering only node degrees is ineffective, particularly as propagation becomes more complex, e.g., on a large network.
The propagation probability in Equation 4 is proportional to the hyper-parameters θ, β, and δ, while it is inversely proportional to α_1 and α_2. Furthermore, the propagation rates are sensitive to small changes in α_1 and α_2, since these directly control the relative importance of proximity and population density, respectively. The default values of the hyper-parameters for all datasets are summarized in Table 4. We evaluate different hyper-parameter settings on the SafeGraph-traj, Tokyo, NYC, and Haslemere datasets. These hyper-parameters are not needed for Italy or the social network datasets (wiki-Vote and cit-HepPh), since the computation of the propagation probability does not involve them. We experiment with α_1, α_2 ∈ [0.1, 0.5], θ, β ∈ [0.01, 0.1], and δ ∈ [5, 20]. We fix one of the parameters at its default value, then experiment with different values of the others, e.g., setting β = 0.05 for SafeGraph-traj and varying the remaining parameters.
We present the spread resulting from the RSM and ESM solutions with different θ, α_1, and δ on Haslemere in Figure 7. The reverse spread and expected spread increase overall as θ and δ increase, and decrease as α_1 increases.
This is intuitive: the likelihood of propagation increases with larger values of θ, and a longer distance threshold δ implies a high probability that the infection spreads between individuals in proximity, even when they are separated by some distance. Meanwhile, a large choice of α_1 dampens the contribution of proximate contact to the infection force. Figure 8 shows the spread resulting from RSM and ESM with different β and α_2 on the contact-based datasets SafeGraph-traj, Tokyo, and NYC. Performance fluctuates as α_2 increases, while it strictly increases with β. As shown in Equation 5, α_2 is the factor that dictates the importance of the number of people in determining transmission risk. Hence, in real-life applications, it is important to select an appropriate value of α_2 to reflect how the population density at a POI contributes to the spread. It is important to note that we do not select our hyper-parameter values in a way that maximizes the spread or provides any undue advantage to our method; rather, we choose values that reasonably reflect real-world scenarios. While the solution quality improves with the number of hypergraph nets generated (up to a certain point), there is an efficiency trade-off. We measure the running time of generating hypergraph nets while varying the desired number of nets |R| and the number of time windows |T| under consideration, as shown in Table 5 and Table 6, respectively. Specifically, with |R| increasing from 20 to 100 (for a fixed |T| = 5) in Table 5, there is only a slight increase in running time (from 0.43 to 2.18 seconds for Tokyo), demonstrating the efficient and scalable nature of our reachable-set-sampling-based algorithm.
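The hypergraph pipeline discussed above can be sketched in two steps: sample "random reachable sets" (hyperedges) by walking the temporal edges in reverse chronological order, so that every traversed edge precedes the next one on the forward path, and then greedily pick the nodes that cover the most sets. The edge format, the optional fixed target, and the one-hop-per-window simplification are our own assumptions, not the paper's exact algorithm:

```python
import random

def random_reachable_set(temporal_edges, nodes, rng, target=None):
    """Sample one random reachable set: the nodes with a live, temporally
    ordered path to a (randomly chosen) target under a T-IC-style process.
    `temporal_edges` maps window t -> list of (u, v, probability)."""
    if target is None:
        target = rng.choice(nodes)
    reached = {target}
    for t in sorted(temporal_edges, reverse=True):  # walk time backwards
        for u, v, p in temporal_edges[t]:
            # u can reach the target only if v was reached via a later edge
            if v in reached and u not in reached and rng.random() < p:
                reached.add(u)
    return reached

def greedy_cover(hyperedges, k):
    """Pick k nodes covering the most hyperedges: the standard greedy
    (1 - 1/e)-approximation for this submodular coverage objective."""
    covered = [False] * len(hyperedges)
    solution = []
    for _ in range(k):
        counts = {}
        for i, edge in enumerate(hyperedges):
            if not covered[i]:
                for node in edge:
                    counts[node] = counts.get(node, 0) + 1
        if not counts:
            break
        best = max(counts, key=counts.get)
        solution.append(best)
        covered = [c or best in hyperedges[i] for i, c in enumerate(covered)]
    return solution
```

Because the reverse walk only extends a set through edges that occur no later than the edge that admitted the endpoint, an edge arriving after its successor on the forward path is correctly ignored.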
The hypergraph construction time also increases with |T| (for a fixed |R| = 20) in Table 6, due to the prolonged propagation process, which is especially evident in dense networks (e.g., Tokyo). Despite the increase in computation time and memory requirements as |R| grows, we find that stable solution sets of sufficiently high quality are produced without the need for more than 20 hypergraphs. T-IC can model large-scale individual-level contacts efficiently, whereas other solutions [20, 34, 36] are only feasible on small graphs. For example, T-Immu is time-consuming due to repeated eigenvalue computations over the dynamic contact network structure, which makes it infeasible for large-scale datasets (e.g., cit-HepPh), and DIA can efficiently consider only the latest snapshot, not the global structure. We compare RSM with the commonly used IC model-based method, Greedy-IM, in terms of the running time for selecting solution sets of sizes k = 10 to k = 50 in Table 7. The running time of Greedy-IM grows quickly and becomes infeasible for large datasets and long time windows, whereas the running times of our RSM and ESM grow much more slowly.
Applying our solution to disease monitoring, we analyze the effect of several intervention strategies for reducing the spread of infectious diseases. We study targeted lockdown strategies and occupancy restrictions, identify superspreader venues, and examine the need for backward tracing. To simulate (partial) lockdown strategies, we reduce the edge connections in our network construction and analyze how the spread changes as a result of these dropped edges. We randomly select a seed set of 10 nodes from which to simulate the T-IC process. We use two intervention strategies to select the edge connections to drop (we drop 30% of the total edges). The first is to randomly drop edges.
The second is based on the priority of the venues, i.e., deleting connections at the venues visited by more people, so that the number of edges dropped is proportional to the number of connections at the venue. We perform 20 simulations to obtain the average decrease in spread resulting from each of the two strategies. For NYC, random deletion reduces spread by 78%, while venue prioritization (on the top-50 busiest venues) achieves an 83% reduction. Similarly, for SafeGraph-traj, random deletion reduces spread by 26%, while an additional 4% reduction is achieved by prioritizing the top-50 busiest venues. For the less granular Italy dataset, which has no venue information, we prioritize the deletion of the top-50 most densely connected provinces. The spread reduction is 49% with prioritization, while random deletion reduces spread by 38%. Therefore, a targeted approach to lockdowns at specific POIs shows superior performance over random occupancy restrictions across all POIs.
We calculate the contribution of backward traced nodes to the activations in the selected ESM solution set in Table 8. First, we select solution sets of different sizes from k = 10 to k = 50. Considering the reverse reachable set of nodes from a given solution set for backward tracing, we identify the top spread contributors as the nodes that participate most frequently in activations. We find that this backward traced set of superspreader nodes accounts for 67.9% to 95.0% of the activations in S on the Haslemere dataset. For Tokyo, they contribute 77.8% to 96.1%. This skewed over-dispersion further points to the importance of backward contact tracing and the need for suppressing superspreader events. The Tokyo and NYC datasets also include venue categories, which provide further insights for designing effective intervention strategies.
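The two edge-dropping interventions above can be sketched as follows; the edge and venue representation, the stable-sort prioritization, and all names are our own simplifications for illustration:

```python
import random

def edges_to_drop(edges, venue_of, fraction=0.3, busy_venues=None, seed=0):
    """Choose which contact edges to remove. With `busy_venues` given,
    deletion is concentrated on edges formed at those venues (venue
    prioritization); otherwise edges are dropped uniformly at random."""
    rng = random.Random(seed)
    n_drop = int(fraction * len(edges))
    if busy_venues:
        # stable sort: edges at prioritized venues come first, so they
        # are removed before any edge at a quieter venue
        ranked = sorted(edges, key=lambda e: venue_of[e] not in busy_venues)
        return set(ranked[:n_drop])
    return set(rng.sample(list(edges), n_drop))
```

After removing the selected edges, the same seeded T-IC simulation would be rerun and the average spread compared across repeated trials, as in the experiments above.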
For |T| = 25, we observe that only 26 to 83 venues in NYC are visited by persons in the solution sets selected by RSM as k increases from 10 to 50, while ESM, Max-Deg, and Random cover up to 3x as many venues. For Tokyo and NYC, an analysis of the categories of venues visited reveals transportation hubs (including airports, subways, and train stations), restaurants, bars, and coffee shops as superspreaders in the solution sets, with transportation hubs in particular having an outsized impact as the set of infected individuals grows.
We introduced the Temporal Independent Cascade (T-IC) model for the Reverse Spread Maximization (RSM) and Expected Spread Maximization (ESM) tasks and illustrated their application to disease monitoring. We showed that reverse spread under the T-IC model is submodular, and proposed efficient algorithms for RSM and ESM with approximation guarantees that handle large-scale, highly granular data. Our novel objectives are to identify i) a minimal set of sentinel nodes (e.g., to minimally cover the network for detection of outbreaks), and ii) a set of highly susceptible nodes (e.g., for prioritizing tracing and treatment). Through extensive experiments on seven real-world datasets, we showed that RSM significantly outperforms the alternatives for the former task, while the ESM solution sets capture significantly more susceptible individuals for the latter. We observed that the dynamic topology captured by our model plays a much more significant role than local connectivity, and that temporal characteristics alongside the global graph structure are needed for optimal solutions. In particular, ESM is found to dramatically outperform Max-Deg (by up to 82%) as a superior targeted strategy for contact tracing, while the sentinel nodes identified by RSM have significantly higher success rates in detecting outbreaks compared to T-Immu and DIA.
We also presented how the proposed approach can enable further applications by handling individual-level contact information and accounting for temporally ordered events, while remaining scalable to large networks. We further applied T-IC to quantify the significant impacts of superspreader venues and events, targeted interventions, and backward contact tracing on contact networks.
This research is supported in part by The Alan Turing Institute under the EPSRC grant EP/N510129/1. Aparajita and Joe are supported by the Feuer International Scholarship in Artificial Intelligence.
References
[1] On influential node discovery in dynamic social networks
[2] ASTRO: Reducing COVID-19 Exposure through Contact Prediction and Avoidance
[3] Optimizing surveillance for livestock disease spreading through animal movements
[4] A Guideline to Limit Indoor Airborne Transmission of COVID-19
[5] Rationing social contact during the COVID-19 pandemic: Transmission risk and social benefits of US locations
[6] Maximizing social influence in nearly optimal time
[7] Mobility network models of COVID-19 explain inequities and inform reopening
[8] Time-critical Influence Maximization in Social Networks with Time-delayed Diffusion Process
[9] Scalable Influence Maximization for Prevalent Viral Marketing in Large-scale Social Networks
[10] Efficient Influence Maximization in Social Networks
[11] Scalable influence maximization in social networks under the linear threshold model
[12] Social network sensors for early detection of contagious outbreaks
[13] Mining the Network Value of Customers
[14] Implication of backward contact tracing in the presence of overdispersed transmission in COVID-19 outbreak
[15] Diffusion maximization in evolving social networks
[16] Spatio-temporal propagation of COVID-19 pandemics
[17] Information diffusion in online social networks: A survey
[18] Efficient algorithms for adaptive influence maximization
[19] The mathematics of infectious diseases
[20] Three faces of node importance in network epidemiology: Exact results for small graphs
[21] Objective measures for sentinel surveillance in network epidemiology
[22] A survey on contact tracing: the latest advancements and challenges (2022)
[23] Maximizing the Spread of Influence Through a Social Network
[24] A contribution to the mathematical theory of epidemics
[25] CT-IC: Continuously activated and time-restricted independent cascade model for viral marketing
[26] Mathematics of epidemics on networks
[27] Sparking "The BBC Four Pandemic": Leveraging citizen science and mobile phones to model the spread of disease
[28] Predicting positive and negative links in online social networks
[29] Graphs over time: densification laws, shrinking diameters and possible explanations
[30] Cost-effective Outbreak Detection in Networks
[31] Time Constrained Influence Maximization in Social Networks
[32] A mathematical framework for estimating risk of airborne transmission of COVID-19 with application to face mask use and social distancing
[33] Risk assessment for airborne disease transmission by poly-pathogen aerosols
[34] Dynamic influence analysis in evolving networks
[35] COVID-19 outbreak response, a dataset to assess mobility changes in Italy following national lockdown
[36] Virus propagation on time-varying networks: Theory and immunization algorithms
[37] Toward Accurate Spatiotemporal COVID-19 Risk Scores Using High-Resolution Real-World Mobility Data
[38] Safegraph Places Schema
[39] Online processing algorithms for influence maximization
[40] Influence maximization in near-linear time: A martingale approach
[41] Approximation algorithms
[42] Heterogeneous interventions reduce the spread of COVID-19 in simulations on real mobility data
[43] Community-based Greedy Algorithm for Mining top-K Influential Nodes in Mobile Social Networks
[44] Predicting the Spatio-Temporal Evolution of Chronic Diseases in Population with Human Mobility Data
[45] Modeling user activity preference by leveraging user spatial temporal characteristics in LBSNs
[46] Tracking influential individuals in dynamic networks