Detecting Viruses in Contact Networks with Unreliable Detectors

Sudesh K. Agrawal and John J. Hasenbein

June 14, 2021

This paper develops and analyzes optimization models for rapid detection of viruses in large contact networks. In the model, a virus spreads in a stochastic manner over an undirected connected graph, under various assumptions on the spread dynamics. A decision maker must place a limited number of detectors on a subset of the nodes in the graph in order to rapidly detect infection of the nodes by the virus. The objective is to determine the placement of these detectors so as to either maximize the probability of detection within a given time period or minimize the expected time to detection. Previous work in this area assumed that the detectors are perfectly reliable. In this work, it is assumed that the detectors may produce false-negative results. In computational studies, the sample average approximation method is applied to the problem using a mixed-integer program and a greedy heuristic. The heuristic is shown to be highly efficient and to produce high-quality solutions. In addition, it is shown that the false-negative effect can sometimes be ignored, without significant loss of solution quality, in the original optimization formulation.

In globally connected networks, computer viruses can cause significant losses in terms of both time and money. Therefore, it is essential to detect them quickly in order to implement effective mitigation measures. In this paper, we develop optimization models for virus detection in contact networks. A contact network captures information about the "closeness" between individuals [Chen and Lanzas, 2016]. A virus is introduced at a node in the network through some external means and spreads because of this contact, possibly in a stochastic manner. There is some capability of detection, and we would like to know where to place a limited number of detectors to maximize the probability of detection before a given time. In this paper we consider the static detection problem, where detectors are placed in fixed locations. Related research in Ding et al. [2021] and Agrawal [2021], motivated by COVID-19, considers the dynamic detection problem, in which nodes are tested sequentially. While we consider specific settings in this research, because of some common mathematical features, the models we develop for the static detection problem can be applied to the more general setting of detecting an anomaly spreading through a contact network.

In social network research, several centrality measures, such as node degree, average distance, and closeness, have been used when selecting nodes for the spread of influence. Weighted centrality measures, which linearly combine two or more centrality measures, are also used for the selection of nodes [Fan et al., 2019]. Researchers have considered problems similar to the model in Lee et al. (2012, 2015), with different applications. Berman et al. [1995] formulate a facility location problem so as to maximize the proportion of flow through the facilities in a network with probabilistic flows. Gutfraind et al. [2009] develop a network interdiction model to maximize the probability of catching evaders. Both use a greedy heuristic to provide good solutions.
This research extends the basic problem in Lee et al. (2012, 2015) by considering the possibility of unreliable detectors, specifically detectors which produce false-negative results. In particular, we examine how unreliable detectors affect detection probabilities and the performance of mixed-integer programming solutions, along with greedy methods. We are not aware of any research which incorporates the reliability of detectors, other than Krause et al. [2008], who consider sensor placements in water distribution networks from a robustness perspective. We modify the greedy heuristic in the literature and perform computational studies to demonstrate the efficiency of the modified greedy heuristic, in terms of both computational time and solution quality.

In this section we extend the model of Lee et al. (2012, 2015) to model detectors which can produce false negatives. We first introduce some notation and review the model nomenclature. Let $G(V, E)$ be a graph $G$ on a set of vertices (or nodes) $V$ with a set $E$ of undirected edges. A virus is introduced at a node in the network by an external agent. We assume that $w_i$, $i \in V$, is the known probability mass function (pmf) of the initial location of the virus. The virus then spreads from this initial location, possibly in a stochastic manner, as described shortly. The network manager is interested in knowing where to place detectors on a subset of nodes, before the virus enters the network, in order to rapidly detect the virus. We describe the decision maker's metrics in detail in Section 3.2.

For the virus spread dynamics, we follow the classification and terminology outlined in Lee et al. (2012, 2015). They classify the spread dynamics using five characteristics: replication, persistence, propagation, transmissibility, and latency. (Transmissibility refers to the probability that a node receiving the virus will be infected. This is referred to as "Susceptibility" in Lee et al. 2015.) Table 1 reproduces the nomenclature they use to represent spread dynamics. A sequence of five characters from the table defines a model. The models that they develop and the extension that we develop in this section are applicable to any of the spread models listed in Table 1. However, we perform our computational experiments and analysis on only three of them: TN11C, RA1PC, and RAEPC. TN11C is the simplest model, in which a virus moves from one node to another in constant discrete time steps, but does not persist at a node that it visited in past time steps. In contrast, in RA1PC and RAEPC, a virus replicates itself and sends out copies to other nodes, thereby infecting them, which is a more realistic scenario. While in RA1PC a virus attempts to infect only one randomly chosen neighbor, in RAEPC it attempts to infect all of its neighbors.

We now introduce two definitions that are needed in the sequel. Roughly speaking, a set function is submodular if it has a diminishing marginal utility in the size of the set.

Definition 1 (Submodular function). Let $2^V$ denote the power set of a set $V$. A function $f: 2^V \to \mathbb{R}$ is a submodular function if $f(S_1 \cup \{j\}) - f(S_1) \ge f(S_2 \cup \{j\}) - f(S_2)$ for all $S_1 \subseteq S_2 \subseteq V$ and all $j \in V$.

Definition 2 (Supermodular function). Let $2^V$ denote the power set of a set $V$. A function $f: 2^V \to \mathbb{R}$ is a supermodular function if $-f(\cdot)$ is submodular.

In this section we discuss the MPT formulation and the MET formulation developed in Lee [2012], and then extend the MPT formulation to the case where detectors with false-negative results are allowed. In the MPT formulation the objective is to choose a set of detectors $S \subseteq V$ to maximize the probability of detecting the virus by a given time threshold $t_0$, with $|S| = k$. We assume $|V| \ge k$.
Let $T_S$ be the first-passage time for the virus to hit the set $S$.

Table 1: Nomenclature for virus spread dynamics. Replication: the virus moves from node to node without replicating (T), or replicates itself (R). Persistence: the virus does not persist at a node after leaving it (N), or remains at nodes it infects (A). Propagation: the virus propagates to one randomly selected neighbor (1), or to every neighbor (E). Transmissibility: the virus is transmissible with probability 1 (1), or with probability $p < 1$ (P). Latency: transmission occurs in constant unit time steps (C), according to an exponential distribution (M), or according to a general distribution (G).

The MPT optimization formulation is:

$$\max_{S \subseteq V,\, |S| = k} \ \mathbb{P}(T_S \le t_0). \tag{1}$$

For any of the spread models we consider, the virus spread dynamics form a stochastic process. A realization of this process is the set of infected nodes up to some time $t_0$. Notice that different sample paths of virus spreads may correspond to the same realization, since we do not need to record the order in which nodes are infected. If we enumerate all the realizations, then we can reformulate (1) as an integer program (IP):

$$\begin{aligned}
\max \ & \sum_{\omega \in \Omega} \phi_\omega y_\omega && (2a) \\
\text{s.t.} \ & \sum_{j \in V} x_j \le k && (2b) \\
& y_\omega \le \sum_{j \in P_\omega} x_j, \quad \omega \in \Omega && (2c) \\
& x_j \in \{0, 1\}, \quad j \in V && (2d) \\
& y_\omega \in \{0, 1\}, \quad \omega \in \Omega && (2e)
\end{aligned}$$

where

$\Omega$: set of indices $\omega$ of all realizations of the virus spread;
$P_\omega \subseteq V$: set of infected vertices in realization $\omega$;
$\phi_\omega$: probability of realization $\omega$ ($\phi$ is the pmf over $\omega$);
$y_\omega$, $\omega \in \Omega$: binary variables;
$x_j$, $j \in V$: 1 if node $j$ is selected as a detector, 0 otherwise.

Constraint (2b) limits the number of available detectors to $k$, and constraint (2c) ensures that $y_\omega = 1$ only if the realization indexed by $\omega$ contains at least one detector node. Given (2d), we can relax (2e) to $y_\omega \in [0, 1]$, yielding a mixed-integer program (MIP). In our implementation we solve the problem as an IP, since we found the solver took slightly less time when the problem was solved as an IP.

In the MET formulation the objective is to choose a set of detectors $S \subseteq V$, with $|S| = k$, to minimize the expected time until we detect the virus. The MET optimization formulation is:

$$\min_{S \subseteq V,\, |S| = k} \ \mathbb{E}[T_S]. \tag{3}$$

In Lee [2012], it is observed that (2) and (3) apply to all of the spread models introduced earlier since, for purposes of the MPT and the MET, each model is defined only by the set of realizations $P_\omega$ and the pmf on this set. We reproduce their two main results in Theorem 1 and Theorem 2.

Theorem 1. (Lee et al. [2015]) Let $G = (V, E)$ be connected with $|V| < \infty$. Let $S \subset V$ be a set of detectors, let $T_S$ be the hitting time of the stochastic process governing virus spread to set $S$, and let $f(S) = \mathbb{P}(T_S \le t_0)$ denote the probability of detecting the virus by time $t_0 > 0$. Then, $f$ is submodular.

Theorem 2. (Lee et al. [2015]) Under the hypotheses of Theorem 1, let $g(S) = \mathbb{E}[T_S]$ denote the expected time to detect the virus, and assume $g(S) < \infty$. Then, $g$ is supermodular.

While the MPT and the MET formulations can be used for any of the spread dynamics discussed, enumerating all realizations is computationally intractable for big networks because $|\Omega|$ is exponentially large in the number of nodes in the graph. The MIP reformulation lends itself to a sample average approximation (SAA) formulation, which is more tractable. We can simulate $n$ sample paths of the virus spread instead of enumerating all realizations and solve (2) on these samples, leading to the following SAA formulation for completely reliable detectors:

$$\begin{aligned}
z^*_n = \max \ & \frac{1}{n} \sum_{l=1}^{n} y_l \\
\text{s.t.} \ & \sum_{j \in V} x_j \le k \\
& y_l \le \sum_{j \in P_l} x_j, \quad l = 1, \ldots, n \\
& x_j \in \{0, 1\}, \quad y_l \in \{0, 1\}
\end{aligned} \tag{4}$$

where $l$ indexes the $n$ sample paths and $P_l$ is the set of nodes infected by time $t_0$ in sample path $l$. Under some technical conditions, it is well known that the SAA optimal values converge to the true optimal value, i.e., $z^*_n \to z^*$ with probability one. Furthermore, it can be shown that $\mathbb{E}[z^*_n] - z^* \ge 0$ for all $n$ (positive bias), and that $\mathbb{E}[z^*_n] \ge \mathbb{E}[z^*_{n+1}]$ for all $n$, i.e., the bias is smaller for larger samples (see Mak et al. [1999], Norkin et al. [1998] for details).
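To make formulation (2) concrete, here is a small illustrative instance of our own (not taken from the paper). Consider TN11C on a path graph with $V = \{1, 2, 3\}$ and edges $\{1,2\}$ and $\{2,3\}$, with $w_i = 1/3$ for each node, $t_0 = 1$, and $k = 1$. A virus starting at node 1 must hop to node 2, giving the realization $P_\omega = \{1, 2\}$ with probability $1/3$; by symmetry, starting at node 3 gives $\{2, 3\}$ with probability $1/3$; and starting at node 2 it hops to node 1 or node 3 with probability $1/2$ each, contributing $\{1, 2\}$ and $\{2, 3\}$ with probability $1/6$ each. Aggregating sample paths into realizations, $\Omega$ contains two realizations, $P_1 = \{1, 2\}$ and $P_2 = \{2, 3\}$, with $\phi_1 = \phi_2 = 1/2$. Setting $x_2 = 1$ allows $y_1 = y_2 = 1$, so a detector at node 2 detects the virus with probability 1, whereas a detector at either leaf node achieves only $1/2$.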
A greedy heuristic is described in Lee et al. [2015]. Submodularity of the objective function ensures that the greedy heuristic produces a solution whose value is within a constant factor $(e-1)/e \approx 0.632$ of the optimal objective value [Nemhauser et al., 1978].

We now develop an MPT formulation for the false-negative case, where the detector at a given node $j$ fails to detect the virus present at that node with probability $r_j$, with $r_j \in [0, 1)$ for all $j \in V$. To describe the formulation and simulation methods for models with detectors that can yield false negatives, we now introduce some definitions.

Definition 3 (Virtual detection). Let $\tilde{R}_j$ be a Bernoulli random variable associated with a node $j$, with probability of success $1 - r_j$. Let $\tilde{R}_j = 1$ when the success event occurs, and 0 otherwise. A virtual detection is said to occur at node $j$ at a given time when $\tilde{R}_j = 1$ at that time.

Definition 4 (Successful detection). A successful detection is said to occur at a node with a detector at a given time, when there is a virus at that node at that time and the detector signals the presence of that virus.

Note that virtual detection is not concerned with the presence of a detector; it just indicates that if a detector were present at a node, then a successful detection would occur at that node. These definitions will be helpful in writing the MPT formulation as an integer program and in describing our modified greedy heuristic.

Remark 1. In our implementation of the MPT model, we allow multiple independent coin flips at a node (for virtual detection) in the TN11C spread model if the virus revisits the node, but to limit book-keeping we do a coin flip only once at the infected nodes in the RA1PC and the RAEPC spread models. For these latter two models, this implies that a detector only has one "chance" to detect a virus present at a node.

When detectors are not completely reliable, the virus may not trigger an alert at a node with a detector installed. So, we need to redefine our optimization problem. Let $T^d_S$ be the first-detection time after the virus hits the set $S$. (Note that $T^d_S = T_S$ when detectors are completely reliable.) The false-negative version of MPT can be formulated as:

$$z^*_{\mathrm{FN}} = \max_{S \subseteq V,\, |S| = k} \ \mathbb{P}(T^d_S \le t_0), \tag{5}$$

where the subscript 'FN' is used to indicate the false-negative case.

Theorem 3. Let $G = (V, E)$ be connected with $|V| < \infty$. Let $S \subset V$ be a set of detector nodes, and let $T^d_S$ be the first-detection time using set $S$. Furthermore, let $f(S) = \mathbb{P}(T^d_S \le t_0)$ denote the probability of successfully detecting the virus by time $t_0 > 0$. Then, $f(\cdot)$ is submodular.

Proof. With the revised definition of $f(S)$, note that $f(S \cup \{j\}) - f(S)$ sums the values of $\phi_\omega$ over the realizations in which the detector at $j$ successfully detected the virus, but in which none of the detectors in $S$ successfully detected the virus. Next, given $S_1 \subseteq S_2$, we then have

$$f(S_1 \cup \{j\}) - f(S_1) \ \ge \ f(S_2 \cup \{j\}) - f(S_2).$$

The inequality clearly holds if $j \in S_1$ or $j \in S_2$. Suppose then that $j \in S_2^C$. Then, the sum defining the right-hand side consists of realizations in which none of the detectors in $S_2$ (or $S_1$) successfully detected the virus, but the detector at $j$ did. Notice then that all such realizations are also contained in the sum defining the left-hand side, since if no detector in $S_2$ detected the virus, neither did any detector in $S_1$. Thus, the left-hand side contains at least the same summands that the right-hand side does.

Remark 2. The proof holds for any of the spread dynamics outlined in Table 1.

Next, we present a similar result pertaining to the MET formulation, with false negatives.
Theorem 4. Under the hypotheses of Theorem 3, let $g(S) = \mathbb{E}[T^d_S]$ denote the expected time to detect the virus, and assume $g(S) < \infty$. Then, $g(\cdot)$ is supermodular.

Proof. Let $F(S, t) = \mathbb{P}(T^d_S \le t)$ denote the probability of detecting the virus, by installing the detectors at the nodes in the set $S$, by the general threshold time $t$. Then,

$$g(S) = \mathbb{E}[T^d_S] = \int_0^\infty \big(1 - F(S, t)\big)\, dt. \tag{6}$$

From Theorem 3 we know that $F(\cdot, t)$ is submodular for any $t > 0$, i.e., for $S_1 \subseteq S_2$,

$$F(S_1 \cup \{j\}, t) - F(S_1, t) \ \ge \ F(S_2 \cup \{j\}, t) - F(S_2, t). \tag{7}$$

So,

$$g(S_1 \cup \{j\}) - g(S_1) = -\int_0^\infty \big(F(S_1 \cup \{j\}, t) - F(S_1, t)\big)\, dt \ \le \ -\int_0^\infty \big(F(S_2 \cup \{j\}, t) - F(S_2, t)\big)\, dt = g(S_2 \cup \{j\}) - g(S_2),$$

where the inequality follows from (7). Hence, $g(\cdot)$ is supermodular.

Remark 3. We can use a backward greedy algorithm for a constant-factor guarantee in the MET problem. However, we restrict our focus in this paper to the MPT version and the corresponding standard greedy heuristic.

Now, returning to the MPT problem, note that we can reformulate (5) as a mixed-integer program:

$$\begin{aligned}
\max \ & \sum_{\omega \in \Omega} \phi_\omega u_\omega \\
\text{s.t.} \ & \sum_{j \in V} x_j \le k \\
& u_\omega \le \sum_{j \in P_{\mathrm{FN},\omega}} x_j, \quad \omega \in \Omega \\
& x_j \in \{0, 1\}, \quad u_\omega \in \{0, 1\}
\end{aligned} \tag{8}$$

where

$\Omega$: set of indices $\omega$ of all realizations of the virus spread;
$P_{\mathrm{FN},\omega} \subseteq V$: set of vertices that virtually detect the virus up to $t_0$ in realization $\omega$;
$\phi_\omega$: probability of realization $\omega$ occurring ($\phi$ is the pmf over $\omega$);
$u_\omega$, $\omega \in \Omega$: binary variables;
$x_j$, $j \in V$: binary variables that determine the set of detector nodes.

As before, we can formulate an SAA version of the problem, which is used in the next section (the SAA formulation for detectors with false negatives):

$$\begin{aligned}
z^*_{\mathrm{FN},n} = \max \ & \frac{1}{n} \sum_{l=1}^{n} u_l \\
\text{s.t.} \ & \sum_{j \in V} x_j \le k \\
& u_l \le \sum_{j \in P_{\mathrm{FN},l}} x_j, \quad l = 1, \ldots, n \\
& x_j \in \{0, 1\}, \quad u_l \in \{0, 1\}
\end{aligned} \tag{9}$$

In our computational studies in Section 5 we focus our analyses on the TN11C, RA1PC, and RAEPC spread dynamics. In this section we describe how to generate Monte Carlo samples by simulating the stochastic spread given these virus dynamics. We also describe the greedy heuristic, which is used in conjunction with the SAA formulations (4) and (9). The following inputs are required for any of the stochastic spread models:

• the time threshold $t_0$,
• the number of sample paths $n$ to be generated,
• and, for the false-negative case, the false-negative probability $r$. (For simplicity, we assume $r_j = r$ for all $j \in V$.)

Recall that in the TN11C model, the virus hops from one node to another in constant time steps. Let $L$ be the set of sampled virus paths. For each sample path $l \in L$, at $t = 0$, we first choose an initial node for the virus to infect, uniformly at random among all nodes in $V$. At $t = 1, 2, \ldots, t_0$, we uniformly randomly pick a neighbor of the currently infected node for the virus to infect. We keep track of the infected nodes and generate a virus spread matrix (VSM) in which the rows represent sample paths and the columns correspond to time. Hence, the VSM has $|L|$ rows and $t_0 + 1$ columns. A sample VSM for $|L| = 3$ and $t_0 = 2$ for a TN11C model has the form

$$\mathrm{VSM} = \begin{pmatrix} v^{(1)}_0 & v^{(1)}_1 & v^{(1)}_2 \\ v^{(2)}_0 & v^{(2)}_1 & v^{(2)}_2 \\ v^{(3)}_0 & v^{(3)}_1 & v^{(3)}_2 \end{pmatrix}, \tag{10}$$

where $v^{(l)}_t$ denotes the node infected at time $t$ in sample path $l$. When detectors can produce false-negative results, we additionally generate a virtual detection matrix (VDM) whose elements take values in $\{0, 1\}$. An element of this matrix indicates whether a detector placed at the node corresponding to that element would detect the virus (i.e., whether a true positive would occur). The VDM also has $|L|$ rows and $t_0 + 1$ columns. To generate this matrix, we simulate independent Bernoulli random variables $\tilde{R}_j$ for each node and for each point of time in the VSM. Note that this means we check for virtual detection every time the virus visits a node. For example, if the virus visits node $i$ twice in a sample path, then we generate two independent samples of $\tilde{R}_i$. An example VDM for the VSM example in (10) appears below:

$$\mathrm{VDM} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}.$$
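As an illustration of this sampling procedure, the following is a minimal Python sketch of our own, not the authors' implementation: the function name, the adjacency-list input format, and the toy graph are assumptions made for this example. It generates a VSM and the corresponding VDM for the TN11C model, flipping an independent coin on every visit, as described above.

```python
import random

def simulate_tn11c(adj, t0, n_paths, r, rng=random.Random(0)):
    """Illustrative sketch (not the paper's code): generate the VSM and VDM
    for TN11C, with one uniform random initial node per path and one hop
    per unit time step up to t0."""
    nodes = list(adj)
    vsm, vdm = [], []
    for _ in range(n_paths):
        path = [rng.choice(nodes)]                  # initial infection at t = 0
        for _ in range(t0):
            path.append(rng.choice(adj[path[-1]]))  # hop to a uniform neighbor
        vsm.append(path)
        # independent Bernoulli(1 - r) flip per visit; revisited nodes are re-flipped
        vdm.append([1 if rng.random() < 1 - r else 0 for _ in path])
    return vsm, vdm

# toy 4-cycle 0-1-2-3-0, |L| = 3 sample paths, t0 = 2, r = 0.05
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
vsm, vdm = simulate_tn11c(adj, t0=2, n_paths=3, r=0.05)
```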
In the RA1PC model, the virus at an infected node tries to replicate and send a copy of itself to a randomly chosen neighbor at every time step. For each sample path $l \in L$, at $t = 0$, we again choose an initial node for the virus to infect uniformly at random. At $t = 1, 2, \ldots, t_0$, we uniformly randomly pick a neighbor of each of the currently infected nodes for the virus to infect, with successful infection happening independently with probability $p$. We keep track of the infected nodes and generate a list of $|L|$ rows, where each row is a set of infected nodes. The rows in this case need not contain the same number of elements. Hence, the resulting matrix is, in general, what is sometimes referred to as a "ragged matrix." As such, we still refer to this object as a VSM. An example VSM for $|L| = 3$ and $t_0 = 2$ is:

$$\mathrm{VSM} = \begin{pmatrix} 5 & 192 & 3 & 7 \\ 13 & 4 & 3 & 5 \\ 1 & 11 & 13 & 23 \end{pmatrix}. \tag{11}$$

When detectors can produce false-negative results, we additionally generate a VDM by simulating independent Bernoulli random variables $\tilde{R}_j$ for each infected node in each of the $|L|$ rows of the VSM. Note that this means we check for virtual detection only once per node, since we store the infected nodes in a sample path as a set. An example VDM for the VSM example in (11) is:

$$\mathrm{VDM} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 1 \end{pmatrix}. \tag{12}$$

The RAEPC simulation process is analogous to that of RA1PC except for the spread dynamics. In RAEPC, the virus at an infected node tries to replicate and send a copy of itself to all of its neighbors at every time step. As before, we keep track of the infected nodes and generate a list of $|L|$ rows, where each row is a set of infected nodes. When detectors can produce false-negative results, we additionally generate a VDM by simulating a Bernoulli random variable $\tilde{R}_j$ for each infected node in each of the $|L|$ rows.

Greedy heuristic: We assume now that there are a finite number of nodes, labeled $1, 2, \ldots, |V|$. A Hadamard product (element-wise multiplication) of the VSM and the VDM gives us another matrix: the successful detection matrix (SDM). The SDM corresponding to the VSM in (11) and the VDM in (12) is as follows:

$$\mathrm{SDM} = \begin{pmatrix} 5 & 0 & 3 & 0 \\ 13 & 0 & 3 & 5 \\ 0 & 11 & 13 & 23 \end{pmatrix}.$$

When the detectors are completely reliable, i.e., $r = 0$, the VDM contains only 1's, and the SDM is equal to the VSM. Construction of the SDM is useful in describing the greedy heuristic. Each row in an SDM represents the set of nodes in the corresponding sample path where a successful detection could occur, if a detector were indeed placed at the corresponding node. The greedy algorithm for selecting detector nodes for the MPT objective with false negatives works as follows (a code sketch appears below). For any spread model, generate the VSM and VDM with $|L|$ rows, corresponding to the number of sample paths in the simulation. The SDM is then formed as described above. We initialize $S_0$ to be the empty set. The first detector selected is a node $j_0$ which is present in the largest number of rows in the SDM, with ties broken arbitrarily. Set $S_1 = S_0 \cup \{j_0\}$. Now at iteration $i$, a row is a candidate row if none of the nodes in the row is contained in $S_i$. Then in iteration $i$ we select a node $j_i$ which is present in the largest number of candidate rows, with ties broken arbitrarily, and set $S_{i+1} = S_i \cup \{j_i\}$. The algorithm terminates when $S_k$ is determined. This set $S_k$ is then taken to be $S$ for the MPT objective. This greedy heuristic may be preferred over the MIP with SAA if it can give us an "acceptable" solution more quickly, especially for large networks. In Section 5 we undertake computational studies to analyze how the heuristic fares against a solution via the MIP formulation, both in terms of computational time and solution quality.
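The following Python sketch is our own illustration of the greedy selection just described; the function name and the row-list input format are assumptions, not the authors' implementation.

```python
from collections import Counter

def greedy_detectors(sdm_rows, k):
    """Greedily select up to k detector nodes: repeatedly pick the node
    appearing in the most candidate (still-uncovered) SDM rows."""
    candidate = [set(row) - {0} for row in sdm_rows]   # drop missed detections
    chosen = []
    for _ in range(k):
        counts = Counter(node for row in candidate for node in row)
        if not counts:
            break                                      # remaining rows are uncoverable
        best, _ = counts.most_common(1)[0]             # ties broken arbitrarily
        chosen.append(best)
        candidate = [row for row in candidate if best not in row]
    return chosen

# SDM rows from the example above; zeros mark failed virtual detections
sdm = [[5, 0, 3, 0], [13, 0, 3, 5], [0, 11, 13, 23]]
print(greedy_detectors(sdm, k=2))   # prints [5, 11] here, which covers all three rows
```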
Lee [2012] uses real data from an Asian telecommunication company to build contact networks and uses a downsampling scheme, c-core decomposition [Seidman, 1983], to build network samples of different scales. He also discusses how the decomposition scheme keeps the relevant properties of the network intact. A c-core of a connected graph is not necessarily connected. Lee considers only the largest connected component of the c-core decomposed network, because if we do not install a detector in an isolated component, then we cannot detect a virus in that component. We use a network generated using email data from a European research institution [Leskovec and Krevl, 2014, Leskovec et al., 2007], and therefore the context of our computational studies is the spread of email viruses. This network can be viewed as representing a collaboration network of researchers, which has been shown to behave like a "small world" network [Newman, 2001]. In a small-world network the typical path length $M$ between two randomly chosen nodes is small compared to the number of nodes $|V|$ in the graph; more precisely, $M = O(\log |V|)$ [Watts and Strogatz, 1998]. We use the largest connected component of the 6-core of the aforementioned network, removing all self-loops, and refer to this decomposed network as the "6-core EU email network," or just the "EU email network" when it is clear that we mean the 6-core version of the network. This network has 5400 nodes and 73315 edges, with an average degree of about 27. Details of the network can be found in Appendix A.

We use simulation to determine a suitable value for $t_0$ to ensure quick detection of a virus. A reasonable value depends on how quickly the virus spreads over the network and on the decision maker's tolerance. In our computational studies, we use $t_0 = 4$ for TN11C, $t_0 = 3$ for RA1PC, and $t_0 = 1$ for RAEPC, corresponding to infection of approximately 0.1% of the nodes when $p = 1$. Details of the simulation used to determine these values are in Appendix B.1.

We perform the following computational studies:

1. comparing the solutions from the MIP and the greedy heuristic,
2. evaluating the quality of the greedy solution using the multiple replications procedure (MRP),
3. studying the effect of the false-negative rate on the probability of detection, and
4. studying the effect of ignoring detector fallibility.

All the experiments were done on a 64-bit system with an Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz, with 31.8 GB usable RAM and 8 cores (16 logical processors).

For this study we generate $n$ samples for the RA11C spread model and solve (9) using these samples. We use the Gurobi solver [Gurobi Optimization, 2019] via its Java API to solve the MIP, and terminate when it achieves a relative gap of 1% between the best integer objective $z^*_n$ and the best linear-programming relaxation bound $\bar{z}_n$, or when the time expended exceeds 1 hour. The relative gap provides a bound on the accuracy of the MIP optimal solution. The integer feasibility tolerance is set to $1.0 \times 10^{-9}$, and only one processor thread is used. We refer to the solution $S^*_n$ produced by the solver as "MIP." We also use the greedy heuristic on the same samples and obtain the objective value $z^h_n$. We refer to the solution $S^h_n$ from the greedy heuristic as "Greedy." We note the computational times for both.
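For concreteness, the following is a minimal sketch of solving the SAA problem (9) with the solver settings just described. The paper's implementation uses Gurobi's Java API; this sketch instead uses gurobipy, Gurobi's Python API, and the function and variable names are our own assumptions.

```python
import gurobipy as gp
from gurobipy import GRB

def solve_saa_fn(pfn_rows, num_nodes, k):
    """Sketch of the SAA MIP (9): pfn_rows[l] is the set P_FN,l of nodes
    that virtually detect the virus by t0 in sample path l."""
    n = len(pfn_rows)
    m = gp.Model("mpt_fn_saa")
    x = m.addVars(num_nodes, vtype=GRB.BINARY, name="x")   # detector indicators
    u = m.addVars(n, vtype=GRB.BINARY, name="u")           # path-detected indicators
    m.addConstr(x.sum() <= k)                              # detector budget
    for l, row in enumerate(pfn_rows):
        m.addConstr(u[l] <= gp.quicksum(x[j] for j in row))
    m.setObjective(u.sum() * (1.0 / n), GRB.MAXIMIZE)      # fraction of paths detected
    m.Params.MIPGap = 0.01        # terminate at 1% relative gap
    m.Params.TimeLimit = 3600     # or after one hour
    m.Params.IntFeasTol = 1e-9    # integer feasibility tolerance from the paper
    m.Params.Threads = 1          # single solver thread, as in the experiments
    m.optimize()
    return [j for j in range(num_nodes) if x[j].X > 0.5], m.ObjVal
```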
We also examine the difference $\mu = f(S^h_n) - f(S^*_n)$ in the objective values of Greedy and MIP and form the corresponding confidence interval using McNemar's procedure [McNemar, 1947]. A point estimate of the difference between the marginal proportions is given by

$$\hat{\mu} = \frac{n_{12} - n_{21}}{n''}, \tag{13}$$

with sample variance

$$s^2 = \frac{n_{12} + n_{21}}{n''} - \hat{\mu}^2, \tag{14}$$

where $n_{11}$, $n_{12}$, $n_{21}$, and $n_{22}$ are described in Table 2, and $n''$ is the total number of sample paths generated in evaluating the greedy and the MIP solutions. The approximate $(1-\alpha)$-level CI is given by

$$\hat{\mu} \pm z_{1-\alpha/2}\, \frac{s}{\sqrt{n''}}, \tag{15}$$

where $z_\alpha$ is the $\alpha$ quantile of the standard normal distribution. In (13) we subtract the "MIP positive" proportion from the "Greedy positive" proportion, which means that if a CI covers strictly positive values, then we infer that there is strong evidence that the greedy solution is better. If a CI covers strictly negative values, then there is evidence that the MIP solution is better.

Table 2: "Greedy positive" indicates that with the greedy solution $S^h_n$ the sample realization is detected, and "Greedy negative" indicates that the virus is undetected with the greedy solution. Similarly, "MIP positive" and "MIP negative" indicate whether or not the virus is detected with the MIP solution.

                    MIP positive    MIP negative    Total
Greedy positive     n11             n12             n11 + n12
Greedy negative     n21             n22             n21 + n22
Total               n11 + n21       n12 + n22       n''

The results appear in Table 3. In the table, if $z^h_n > z^*_n$, then the heuristic provides a better solution on the relatively small set of sampled paths indicated in the second column of the table. Otherwise, the MIP solution is better. Highlighted cells in the table indicate cases in which the solver terminated due to the time limit of 1 hour. We generate CIs on the difference $\mu = f(S^h_n) - f(S^*_n)$ at the 95% confidence level, using 2 million samples ($n'' = 2000000$).

Remark 4. Unless stated otherwise, all further analyses use MIP solutions attainable within 1 hour or approximately 1% relative gap.

Table 3: Comparison of MIP and Greedy, under RA11C spread on the 6-core EU email network with $t_0 = 3$ and $r = 0.05$: computational time and 95% confidence interval on the difference $\mu = f(S^h_n) - f(S^*_n)$ using McNemar's procedure ($n'' = 2000000$). A positive difference indicates that the greedy solution performed better. Note that the columns $z^*_n$, $\bar{z}_n$, $z^h_n$ and both wall times are not from McNemar's procedure.

From Table 3 we make the following observations:

1. For larger values of $n$ or $k$, the solver is not able to solve the MIP to optimality (within the given tolerance) within an hour.
2. There is a diminishing marginal value of adding detectors.
3. For the majority of cases, the confidence interval covers strictly positive values, indicating that the greedy heuristic is likely better most of the time. Even in cases where the CI does not cover strictly positive values, we see that the MIP and greedy solutions are the same in the first two significant digits.

Overall, this provides compelling evidence that the greedy heuristic can provide good-quality solutions with substantially less computational effort. This study is done to demonstrate the computational efficiency of the greedy heuristic over a solver, and therefore we only tabulate results for the RA11C spread here. Similar results for TN11C and RAEPC can be found in Appendix B.2.
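Returning to the interval (13)-(15), the computation is simple enough to state as a small Python helper of our own; the counts in the usage line are hypothetical placeholders, not results from the paper.

```python
from math import sqrt

def mcnemar_ci(n12, n21, n_total, z=1.96):
    """Approximate 95% CI (13)-(15) for mu = f(S_greedy) - f(S_mip):
    n12 = Greedy positive & MIP negative, n21 = Greedy negative & MIP positive."""
    mu_hat = (n12 - n21) / n_total           # point estimate (13)
    s2 = (n12 + n21) / n_total - mu_hat**2   # sample variance (14)
    half = z * sqrt(s2 / n_total)            # half-width of the interval (15)
    return mu_hat - half, mu_hat + half

# hypothetical counts, for illustration only
print(mcnemar_ci(n12=5200, n21=4100, n_total=2_000_000))
```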
Next, we evaluate the solutions from the greedy heuristic using the multiple replications procedure (MRP) described in Bayraksan and Morton [2006]. The idea is to evaluate these candidate solutions on a larger set of independent sample paths and estimate the optimality gap using the upper bound obtained from solving a linear programming relaxation of the SAA on this larger set. This is then repeated $n_g$ times to form an absolute gap estimate $G_n(n_g)$ and the corresponding one-sided approximate $(1-\alpha)$-level confidence interval $[0, G_n(n_g) + \epsilon_g]$. We refer readers to Bayraksan and Morton [2006] for full details. In our study we use 50,000 virus sample paths to evaluate the candidate solutions in each of $n_g = 20$ replications, and $\alpha = 0.05$ to form the one-sided confidence interval (CI). Due to the computational effort required for this procedure, we perform experiments only for a limited set of parameters. The results for the TN11C and the RA11C models appear in Table 4 and Table 5, respectively. We plot the gap estimate $G_n(n_g)$ for the TN11C and the RA11C spread models in Figure 1. As expected, the gap generally decreases with an increase in the number of samples used to generate the greedy solution, i.e., the greedy solution is expected to improve with larger sample sizes. We also note that in these tests the greedy heuristic performs quite well, generating solutions within about 2% of optimal when 5000 or more samples are used.

Table 4: Assessing the quality of greedy solutions using MRP for the 6-core EU email network under TN11C with $k = 100$, $t_0 = 4$, and $r = 0.05$. For varying values of the sample size $n$, we tabulate the gap estimate $G_n(n_g)$ of the greedy solution and the width $\epsilon_g$ of the corresponding one-sided CI for $n_g = 20$ and $\alpha = 0.05$.

n        G_n(n_g)    eps_g
1000     0.0827      0.0008
5000     0.0234      0.0006
10000    0.0183      0.0005
30000    0.0098      0.0004

Table 5: Assessing the quality of greedy solutions using MRP for the 6-core EU email network under RA11C with $k = 50$, $t_0 = 3$, and $r = 0.05$. For varying values of the sample size $n$, we tabulate the gap estimate $G_n(n_g)$ of the greedy solution and the width $\epsilon_g$ of the corresponding one-sided CI for $n_g = 20$ and $\alpha = 0.05$.

n        G_n(n_g)    eps_g
1000     0.0690      0.0006
5000     0.0305      0.0006
10000    0.0182      0.0005
30000    0.0198      0.0006

Having established the computational efficiency of the greedy heuristic and the high quality of the solutions it produces, we now use greedy solutions to study the effect of varying the false-negative probability $r$ (from 0 to 0.5) on the probability of detection. We use the greedy heuristic to get an "approximately optimal" solution on 50,000 samples for $k = 50$ (about 1% of the total number of nodes), and then evaluate this solution on an independent set of 2 million samples to get a point estimate of the approximate optimal objective value, for different spread dynamics and varying values of $r$. We plot the results for the TN11C, the RA1PC, and the RAEPC models in Figure 2. The probability of detection drops approximately linearly in $r$ in all three cases. For TN11C the probability drops by 0.17 when using a 50% reliable detector, a 45% decrease from the overall probability of detection with a perfectly reliable detector. For RA1PC the probability drops by 0.12, a 47% decrease, and for RAEPC the probability drops by 0.20, a 33% decrease. For a 95% reliable detector (i.e., $r = 0.05$) the drop is around 4% for TN11C and RA1PC, and 3% for RAEPC.

In our final computational study, we examine the effect on solution quality when the decision maker ignores the fallibility of detectors. In particular, we solve a model that assumes that the detectors are completely reliable, and compare its solution to that of the "true" model, with fallible detectors.
The following procedure is used to generate the data in Table 6, Table 7, and Table 8:

1. Generate 50,000 sample paths of virus spread.
2. Generate the VDMs corresponding to the sample paths in Step 1. The 50,000 sample paths generated in Step 1, along with the corresponding VDMs, serve as 50,000 samples for the "true model," i.e., the model with false-negative results.
3. Generate 5 million sample paths of virus spread, independently of the Step 1 paths, along with the corresponding VDMs.
4. Use the greedy heuristic on the sample paths generated in Step 1 to select nodes at which to place detectors (ignoring false negatives).
5. Use the greedy heuristic on the 50,000 samples in Step 1 and the VDMs in Step 2, corresponding to the true model, to select nodes at which to place detectors.
6. Evaluate the solutions from Step 4 and Step 5 on the samples generated in Step 3. The corresponding objective function values are referred to as FN1 and FN2, respectively, in Table 6 through Table 8.

Specifically, we calculate (a) the difference in the probability of detection and (b) half of the Hamming distance (the semi-Hamming distance), a measure of how different the detector node sets are. The semi-Hamming distance between two sets of detector nodes indicates the minimum number of node changes required to make the two sets the same. From Table 6 through Table 8 we make the following observations:

1. There seems to be a negligible absolute difference in the probability of detection.
2. For a fixed $k$, the semi-Hamming distance generally increases in $r$, but not always monotonically.
3. For $r = 0.3$, the semi-Hamming distance is as high as 10% of the number of detectors.

The results above induce at least two questions regarding the modeling of false negatives. First, why does a model that ignores false negatives select a detector set which is close to the "correct" set from the true model, at least when using a greedy algorithm? Second, why do differences in up to two dozen nodes induce very small changes in objective values? The likely answer to the second question is simply that as long as a "core" set of nodes is selected correctly, changing the remaining 5%-10% of nodes in the detector set does not matter, because of the small marginal detection gains as the last nodes are added. As for the first question, it is harder to guess the root cause of this effect.

From the results above, one might infer that properly modeling false negatives is not particularly important in our models. However, we should distinguish two different purposes for building such models. First, these models are useful in design decisions, i.e., deciding where detectors should be placed. Second, once locations are chosen, the models are used for performance modeling, e.g., evaluating the probability of successful detection for a given design. Our somewhat limited results indicate that modeling false negatives may not be crucial in the design stage. However, in our performance evaluation in Section 5.3 there was a significant decrease in detection probability with increases in the false-negative rate. Hence, it is important to incorporate false negatives when making a decision, for example, on how many detectors are needed to ensure a certain probability of detection.

In this section we provide some analytical insight, via stylized examples, into the optimal placement of detectors. Specifically, we develop analytical expressions for detector placement strategies around high-degree nodes, which are generally chosen for influence purposes in social networks. We consider the wheel network, four versions of which are depicted in Figure 3. The wheel network has one central "high-degree" node that connects to many degree-3 edge nodes.
We derive analytical expressions for the probability of detecting a virus under the following assumptions:

• There is only one time step, i.e., $t = 0 \to t = 1$ (two "days" of detection).
• There are two detectors available.
• The spread model is TN11C.
• The number of nodes, $v$, is greater than or equal to 5 (one center node and at least 4 edge nodes).
• The false-negative probability satisfies $0 \le r < 1$ for all detectors.

In Table 9 we give the probability of detecting the virus for the four different configurations shown in Figure 3. Note that one of the configurations places both detectors on the central node. There is a high likelihood of the virus visiting the central node in two detection steps, and one might wish to place two detectors at the same node if the detectors are unreliable. (We assume that they produce false negatives independently.) We also provide the limiting probability of detection for fixed $r$ as $v \to \infty$.

Table 9: Detection probability for each detector placement in Figure 3 (adjacent-edge nodes, diametrically-opposite edge nodes, center-edge nodes, and center-center nodes), along with the limiting detection probability for fixed $r$ as $v \to \infty$.

We have the following results from the limiting values ($v \to \infty$) in Table 9:

1. For any $r > 0$, placing two detectors at the central node is optimal for sufficiently large $v$.
2. When detectors are completely reliable, i.e., $r = 0$, at least one detector is placed at the central node in an optimal solution.

For smaller values of $v$, we observe the following:

1. Placing detectors on diametrically-opposite edge nodes is always better than placing them on adjacent-edge nodes.
2. For $v \ge 6$, the center-edge placement is better than placing the detectors on diametrically-opposite edge nodes. For $v = 5$, if $r > 2/7$, then the center-edge placement is better, and if $r < 2/7$, then the diametrically-opposite placement is better.
3. For a given $r$, let $\lceil v_{\min} \rceil$ be the minimum number of nodes needed for the center-center placement to be better than the center-edge placement of two detectors. Table 10 provides $\lceil v_{\min} \rceil$ for a few reasonable values of $r$. We see that for higher false-negative rates it might be better to place two detectors in a cluster at the central node.
4. For $r \ge 16/21 \approx 0.76$, the center-center placement is at least as good as the center-edge placement, because $\lceil v_{\min} \rceil = 5$ at $r = 16/21$.
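The closed forms in Table 9 can also be sanity-checked by simulation. Below is a small Monte Carlo sketch of our own (not from the paper) under the assumptions listed above; the labeling convention, with the center as node 0 and edge nodes $1, \ldots, v-1$ arranged in a cycle, and all function names are our assumptions.

```python
import random

def detect_prob(v, placement, r, trials=200_000, rng=random.Random(1)):
    """Estimate detection probability on a v-node wheel under TN11C with one
    hop (t = 0 -> t = 1); placement lists detector locations, possibly
    repeated (e.g., [0, 0] puts both detectors on the center)."""
    def neighbors(i):
        if i == 0:                                   # center is adjacent to every edge node
            return list(range(1, v))
        left = 1 + (i - 2) % (v - 1)                 # cycle neighbors among 1..v-1
        right = 1 + i % (v - 1)
        return [0, left, right]
    hits = 0
    for _ in range(trials):
        node = rng.randrange(v)                      # uniform initial infection
        detected = False
        for _ in range(2):                           # detection checks at t = 0 and t = 1
            for d in placement:                      # independent flip per detector visit
                if d == node and rng.random() < 1 - r:
                    detected = True
            if detected:
                break
            node = rng.choice(neighbors(node))       # TN11C hop
        hits += detected
    return hits / trials

v, r = 9, 0.3
for name, placing in [("adjacent-edge", [1, 2]),
                      ("diametrically-opposite", [1, 1 + (v - 1) // 2]),
                      ("center-edge", [0, 1]),
                      ("center-center", [0, 0])]:
    print(name, round(detect_prob(v, placing, r), 3))
```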
A natural extension to this research is to allow for detectors with false-positive results. It is clear that this extension requires a substantially modified optimization model. In the false-positive case the network manager has to make two decisions: a design decision and an operational decision. This is unlike the false-negative case, where a false negative does not trigger an "alarm," and the virus just goes undetected. The design problem is to decide where to place the detectors at time 0. The operational problem is what action to take when a detector signals a virus detection. For example, the network manager can choose to ignore the alarm, in which case a possible virus detection is missed, but a false positive is avoided. Hence, in a model with false positives, we seek an optimal operational policy that incorporates both the cost of taking action when there is a false positive and the cost of failing to detect the virus. It would be interesting to see if the induced Markov decision problem can be formulated as an integer program. It is not apparent whether the submodularity property would still hold; however, it might be possible to develop a simulation model with a "similar" greedy heuristic and gain some insights into the problem through computational results.

One could also explore how to combine all these models and develop a generalized model which accounts for the possibility of detectors being unreliable and producing both false-positive and false-negative results. Another possible extension to our work is to model the choices that malicious players (adversaries) make in order to inflict maximum damage on the system. The adversary attempts to maximize the expected number of infected nodes by the time the virus is detected by introducing the virus at a single node, and the network manager installs detectors to minimize this value. It would be interesting to explore this min-max model extension.

Definition 5 (c-core). Given a graph $G = (V, E)$, $G_S = (V_S, E_S)$ is a subgraph of $G$ that has $V_S \subseteq V$ and $E_S = \{(i, j) \in E : i \in V_S,\ j \in V_S\}$. A subgraph $G_S = (V_S, E_S)$ induced by the set $V_S \subset V$ is a c-core if and only if $G_S$ is a maximum subgraph for which the degree of every vertex $v \in V_S$ in $G_S$ is greater than or equal to $c$. A maximum induced subgraph has as many vertices as possible.

We generate Monte Carlo samples to determine the average time it takes a fixed percentage of nodes to become infected. This analysis provided guidance in selecting the value of $t_0$ in our computational studies. We selected infection of 0.1% of the nodes as the threshold, but a decision maker may of course adjust this threshold. In Table 12 we present the results of such simulations for different levels of infection. These results were obtained by simulating the virus spread using TN11C, RA11C, and RAE1C dynamics, terminating the simulation when the given percentage of infections was achieved, and recording the time. The average time is computed over 10,000 independent replications. Since the initial infection happens at $t = 0$, if a virus is detected at, say, $t = 3$, then the virus has been detected within 4 time slots. In Table 12 we observe that the average time for TN11C increases approximately linearly with the percent infection. In the TN11C model, the infection percentage cannot be superlinear with respect to time, because $|P_l| \le t_0 + 1$ for all $l$.

In Section 5.1 we compare the MIP and greedy solutions for the RA11C model. In Table 13 and Table 14, we provide comparisons for the TN11C and the RAEPC spread models, respectively.

We use the term "node X" (or just X) to mean the node where detector X has been placed. We compute the probabilities of detecting a virus when detectors are placed in the four configurations shown in Figure 3.
Let $A_{\mathrm{adj}}$ be the event that the virus is detected with adjacent detectors, $A_{\mathrm{do}}$ be the event that the virus is detected with diametrically-opposite detectors, $A_{\mathrm{ce}}$ be the event that the virus is detected with center-edge detectors, and $A_{\mathrm{cc}}$ be the event that the virus is detected with center-center detectors, under the network assumptions outlined in Section 6.
Then we have:

$\mathbb{P}(A_{\mathrm{adj}})$ = P(virus hops from D1 or D2 to a node in $\{D1, D2\}^C$) $\cdot (1 - r)$
  + P(virus hops from D1 to D2 or from D2 to D1) $\cdot (1 - r^2)$
  + P(virus hops from the center to either D1 or D2) $\cdot (1 - r)$
  + P(virus hops from a neighboring edge node of D1, other than D2, to D1, or from a neighboring edge node of D2, other than D1, to D2) $\cdot (1 - r)$;

$\mathbb{P}(A_{\mathrm{do}})$ = P(virus is at D1 or D3 at $t = 0$) $\cdot (1 - r)$
  + P(virus hops from the center to either D1 or D3) $\cdot (1 - r)$
  + P(virus hops from a neighboring edge node of D1 to D1, or from a neighboring edge node of D3 to D3) $\cdot (1 - r)$;

$\mathbb{P}(A_{\mathrm{ce}})$ = P(virus hops from the center (D5) to an edge node other than D1) $\cdot (1 - r)$
  + P(virus hops from the center to D1) $\cdot (1 - r^2)$
  + P(virus hops from D1 to an edge node) $\cdot (1 - r)$
  + P(virus hops from D1 to the center) $\cdot (1 - r^2)$
  + P(virus hops from a neighboring edge node of D1 to either D1 or D5) $\cdot (1 - r)$;

and

$\mathbb{P}(A_{\mathrm{cc}})$ = P(virus is at the center at $t = 0$) $\cdot (1 - r^2)$
  + P(virus hops from an edge node to the center) $\cdot (1 - r^2)$.

Diametrically-opposite vs. adjacent detector placements: We now derive a necessary and sufficient condition for the diametrically-opposite placement of detectors to be at least as good as the adjacent placement of detectors in the wheel network. Using Table 9, we compare the corresponding detection probabilities, deriving condition (16).

Result 1. From (16) we see that placing detectors on diametrically-opposite edge nodes is always better than placing them on adjacent-edge nodes.

Diametrically-opposite vs. center-edge detector placements: We next derive conditions to determine when the diametrically-opposite placement is at least as good as the center-edge placement of detectors. We again use Table 9 to make the comparison, yielding (17). When $r = 0$, the left-hand side of the final expression in (17) reduces to $v^2 - 5v - 2$. The roots of $v^2 - 5v - 2$ are $(5 \pm \sqrt{33})/2$. The larger root is smaller than 5.5, since $(5 + \sqrt{33})/2 < (5 + \sqrt{36})/2 = 5.5$. Because the coefficient of the quadratic term is positive, $v^2 - 5v - 2 > 0$ for all $v \ge 6$. Since the left-hand side of the final expression in (17) is increasing in $r$, this holds for $r \in [0, 1)$.

Result 2. For $v \ge 6$ the center-edge placement is better than placing the detectors on diametrically-opposite edge nodes. For $v = 5$, straightforward algebra implies that if $r > 2/7$, then the center-edge placement is better, and if $r < 2/7$, then the diametrically-opposite placement is better.

Center-edge vs. center-center detector placements: We now derive conditions to determine when the center-center placement is at least as good as the center-edge placement. We use expressions from Table 9 to make the comparison, yielding (18). The roots of the expression $rv^2 - 4v - 4r + 4$, viewed as a function of $v$, are

$$\frac{2}{r} \pm 2\sqrt{1 - \frac{1}{r} + \frac{1}{r^2}}.$$

This means that when $r > 0$, the left-hand side of the final expression in (18) is non-negative for

$$v \ \ge \ \frac{2}{r} + 2\sqrt{1 - \frac{1}{r} + \frac{1}{r^2}}.$$

Therefore, when $r = 16/21$, the left-hand side of the final expression in (18) is non-negative for $v \ge 5$. It can be shown that $\frac{2}{r} + 2\sqrt{1 - \frac{1}{r} + \frac{1}{r^2}}$ is decreasing in $r$ for $r \in (0, 1)$, so the claim holds for $r \ge 16/21$.

Result 3. For $r \ge 16/21 \approx 0.76$ and $v \ge 5$, the center-center placement is at least as good as the center-edge placement.

Acknowledgments: The authors would like to acknowledge the feedback of Dave Morton and Grani Hanasusanto, which helped in refining the exposition of this research.

References:
Distinction and connection between contact network, social network, and disease transmission network
Surveillance testing for rapid detection of outbreaks in facilities
Virus detection problem in contact networks
Stochastic optimization models for rapid detection of viruses in cellphone networks
Optimization of stochastic virus detection in contact networks
An analysis of approximations for maximizing submodular set functions - I
An influence maximization approach to enhance or degrade networks
Mining the network value of customers
Mining knowledge-sharing sites for viral marketing
Threshold models of collective behavior
Talk of the network: A complex systems look at the underlying process of word-of-mouth
Maximizing the spread of influence through a social network
On the optimal solution of budgeted influence maximization problem in social networks
An efficient linear programming based method for the influence maximization problem in social networks
Large-scale influence maximization via maximal covering location
Efficient sensor placement optimization for securing large water distribution networks
Information spread link prediction through multi-layer of social network based on trusted central nodes. Peer-to-Peer Networking and Applications
Locating flow-intercepting facilities: New approaches and results
Optimal interdiction of unreactive Markovian evaders
Monte Carlo bounding techniques for determining solution quality in stochastic programs
A branch and bound method for stochastic global optimization
Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection
Graph evolution: Densification and shrinking diameters
The structure of scientific collaboration networks
Gurobi Optimization, LLC. Gurobi optimizer reference manual
Note on the sampling error of the difference between correlated proportions or percentages
Assessing solution quality in stochastic programs