key: cord-0159553-zntc7qz1 authors: Pare, Philip E.; Vrabac, Damir; Sandberg, Henrik; Johansson, Karl H. title: Analysis, Online Estimation, and Validation of a Competing Virus Model date: 2020-01-28 journal: nan DOI: nan sha: 80aba76fc1f8ea7d4668b5d8b91b7b3b6208d650 doc_id: 159553 cord_uid: zntc7qz1 In this paper we introduce a discrete time competing virus model and the assumptions necessary for the model to be well posed. We analyze the system exploring its different equilibria. We provide necessary and sufficient conditions for the estimation of the model parameters from time series data and introduce an online estimation algorithm. We employ a dataset of two competing subsidy programs from the US Department of Agriculture to validate the model by employing the identification techniques. To the best of our knowledge, this work is the first to study competing virus models in discrete-time, online identification of spread parameters from time series data, and validation of said models using real data. These new contributions are important for applications since real data is naturally sampled. two competing US Department of Agriculture (USDA) farm subsidy programs. We employ that dataset here but use a two-competing virus (bi-virus) model. The results show that the model fits the dataset much better than when using a single virus model. In many ways this paper is an extension of [18] , [19] , generalizing from a single virus discrete model to multiple competing viruses. However, the proofs are different in several of the cases. New insights into the discrete time model are presented via simulations. Finally, the data results from [18] , [19] are improved upon when using a two-competing virus model. The paper is organized as follows: in Section II, the competing virus spread model is introduced with accompanying assumptions that ensure the model is well posed, which is proven. In Section III, we analyze the model from Section II. In Section IV, we present necessary and sufficient conditions for learning, or identifying, the spread process parameters of the same model, from data produced by the models. In so doing, we establish several assumptions that need to be met by the USDA data. In Section V, we validate the results from Sections III and IV via simulation and present some exploratory simulations to support the work in Section VI and propose an online spread parameter estimation algorithm. In Section VI, we learn the spread parameters of the USDA subsidy programs using data from different subsets of the country and verify the learned parameters by simulating the spread model over the complete United States and comparing the simulated data with the actual data. Given a vector function of continuous time x,ẋ indicates the time-derivative. Given a vector function of discrete time x[t], t is the time index. Given a vector x ∈ R n , the 2-norm is denoted by x and the transpose by x . The vector of all equal zeros is denoted by 0. Given two vectors x 1 , x 2 ∈ R n , x 1 > x 2 indicates each element of x 1 is greater than or equal to the corresponding element of x 2 and x 1 = x 2 , and x 1 x 2 indicates each element of x 1 is strictly greater than the corresponding element of x 2 . Given a matrix A ∈ R n×n , the spectral radius is ρ(A). Also, a ij indicates the i, j th entry of the matrix A, and A F indicates the Frobenius norm of A. The notation diag(·) refers to a diagonal matrix with the argument(s) on the diagonal; the argument can be a vector x or its elements x i . For n ∈ Z + , [n] := {1, ..., n}. We introduce a discrete-time multi-virus competing model. The model can be derived from the continuous-time model, where, for each virus k ∈ [m], x k i is the infection level of the ith agent (which can be interpreted as the probability of agent i being infected or the proportion of subpopulation i that is infected) and evolves aṡ By rearranging (2), we have , since the term on line (6) is non-positive. Since is a convex combination of (1 − hδ k i ) and h n j=1 β k ij , which are less than or equal to one by Assumption 3, z k , and t ≥ 0. is positively invariant with respect to the system defined by (3) . Since x k i denotes the probability of infection of individual i by virus k, or the fraction of group i infected by virus k, and 1 − x 1 i − · · · − x m i denotes the probability of individual i being healthy, or the fraction of group i that is healthy, it is natural to assume that their initial values are in the interval [0, 1], since otherwise the values will lack any physical meaning for the epidemic model considered here. Therefore, we focus on the analysis of (3) only on the domain D. We need an assumption to ensure non-trivial virus spread. Assumption 4. We have B = 0, h = 0, and n > 1. Definition 1. Consider an autonomous system where f : X → R n is a locally Lipschitz map from a domain X ⊂ R n into R n . Letx be an equilibrium of (8) and E ⊂ X be a domain containingx. If the equilibriumx is asymptotically stable such that for any (7). Proof. We employ a LaSalle's invariance principle argument. To simplify notation, let M k = I + hB k − hD k , where (9) holds by Proposition 1 in [24] , (10) holds by Assumptions 2 and 3, and (11) holds by Lemma 1. Therefore, by Proposition 1, x k converges asymptotically to the origin. Since k ∈ [m] was chosen arbitrarily, the whole system, that is, every virus k ∈ [m], converges to the healthy state for this case. For the case where ρ(I − hD k + hB k ) = 1, we have, by Lemma 3 in [19] , that, for all k ∈ [m], there exists a Since, by Assumptions 2 and 4 and by Lemma 3 in [19] , Therefore, by Proposition 1, x k converges asymptotically to the origin. Since k ∈ [m] was chosen arbitrarily, the whole system, that is, every virus k ∈ [m], converges to the healthy state for this case. Therefore the healthy state is asymptotically stable with domain of attraction D. Proof. Clearly 0 is always an equilibrium of (3). By the Perron Frobenius Theorem for irreducible nonnegative matrices (Theorem 8.4.4 in [25] ), for all k ∈ [m], ρ(I − hD k + hB k ) = s 1 (I − hD k + hB k ) and there exists v k 0 such that Therefore This condition is the same as the condition of Proposition 3 in [9] , which shows the existence and stability of the endemic state in the single virus case. The proof follows similarly, when assuming that x l = 0 for all l = k, that Therefore, if x l = 0 for all l = k,x k is an equilibrium of (3). Consequently, 0, (x 1 , 0, . . . , 0), . . . , (0, . . . , 0,x m ) are equilibria of (3). We have the following corollary. From the proof of Theorem 1, x k [t] will asymptotically converge to 0 as t → ∞ for all initial values ( Thus, we regard the dynamics of x l [t] as an autonomous system with a vanishing perturbation −h k =l X k (t)B l x l (t), which converges to 0 as t → ∞. Therefore, from Theorem 2 in [22] , the autonomous system (12) will locally asymptotically converge to the unique epidemic statex l . Thus (3) will converge to (0, . . . , 0,x l , 0, . . . , 0). From Theorem 1 and Proposition 2, we have the following result. In this section, we clearly lay out the assumptions and the identification techniques for the multi-virus model. For this section we use a slightly different version of (3), where we factor β k ij into β k i a k ij as where B k = diag(β k i ) and A k is the matrix of a k ij . Remark 1. If the system has homogeneous spread parameters, that is, β k i = β k j and δ k i = δ k j for all i, j ∈ [n], the condition in Theorems 1-2 reduces to ρ(A) ≤ δ k β k . We start by assuming that the underlying graph structures A k are known and that we have full-state measurement with no noise on the measurements, which we admit are strong assumptions. However, for the dataset used in Section VI these assumptions are well-founded because we aggregate the data by county and the adjacency of counties is known, i.e., the graph structure is known, and any farmer that received a subsidy payout is in the dataset, i.e., there are no hidden, unmeasured states. We will relax the no-noise assumption in the Simulations Section (see Section V). We present several results on learning the spread parameters of the model in (2) from data. The following result is an improvement of Theorem 3 in [19] . such that where , and A k are known, using (13) we can construct the matrix Φ k , defined as, Therefore, since we also know x k [T ] and h, we can rewrite (3) as Since n > 1, Φ k has at least two rows. By the assumption that there exist i, j ∈ [n] and t 1 , t 2 ∈ [T − 1] ∪ {0} such that (14) holds, Φ k has column rank equal to two, with two unknowns. Therefore there exists a unique solution to (16) using the inverse or pseudoinverse. If there do not exist i, j ∈ [n] and t 1 , (14) holds, then Φ k has a nontrivial nullspace. Therefore (16) does not have a unique solution. Note that t 1 and t 2 from Theorem 3 could both equal zero and the condition in (14) could still hold, that is, recovery of the spread parameters may be possible with only two time series points. Now we present two corollaries where hβ k and hδ k , denoted by β k h and δ k h , respectively, can be recovered. This corollary illustrates that under certain conditions, while the exact behavior of the system may not be recoverable the limiting behavior of the system may be determined, by employing Theorems 1-2 with Remark 1. If the assumption is made that the underlying spread process is heterogeneous, we have a similar condition, an improvement of Theorem 4 in [19] . , and h are known. Then, the spread parameters of virus k for node i can be identified uniquely if and only if T > 1, and there exist t 1 , , and A k are known, we can construct the matrix Φ k i , defined as,  . . . . . . Then, since we also know x k i [T ] and h, we have  Since T > 1, Φ k i has at least two rows. By the assumption that there exist t 1 , t 2 ∈ [T − 1] ∪ {0} such that (17) holds, Φ k i has column rank equal to two, with two unknowns. Therefore there exists a unique solution to (18) using the inverse or pseudoinverse. If there do not exist t 1 , t 2 ∈ [T ] such that (17) holds, then Φ k i has a nontrivial nullspace. Therefore (18) does not have a unique solution. In this section, we present first, a set of simulations that illustrate the results from the previous sections and second, some illuminating simulations of the model that support the validation work with real data. Since the dataset we consider in Section VI only has two competing spread processes we limit ourselves to m = 2 for this section as well, however, the behavior is similar for m > 2. Virus 1 is depicted by the color red (r), virus 2 is depicted by the color green (g), and susceptible, or healthy, is depicted by the color blue (b). For all i ∈ [n], the color at each time t for node i is given by For the second set of simulations we, at times, inspect the case of m = 1. where z i is the position of agent i and A is, therefore, fully connected. Since the edges are weighted, the ones between nodes that are far away from each other are difficult to see in the figures. A simulation, based on this system, is shown in Figure 1 with plots of the initial condition, the epidemic states at time-step 100 and the final condition. Assuming A is known we recover β 1 h , δ 1 h , β 2 h , and δ 2 h exactly, using (16) with only two time-steps, consistent with Corollary 2. Hence, the proportions δ 1 /β 1 and δ 2 /β 2 are also correctly recovered. And clearly, if h is known, we recover the parameters exactly, consistent with Theorem 3. Moreover, ρ(I − hD 1 + hβ 1 A) = 1.1976 > 1, and ρ(I − hD 2 + hβ 2 A) = 0.997 ≤ 1, and consistent with Corollary 1, the endemic state is (x 1 , 0), wherex 0. We also find that this endemic equilibrium is reached for all initial conditions with x 1 [0] > 0, that is, via simulations it appears to be globally stable. We now consider a similar system with 50 agents 24 of which are initially infected by either one of the two viruses and A given by (20) . But the agents have moved and the system is a heterogeneous virus system with In this section, we present two set of simulations that give important insight into the model to assist our work on the USDA dataset in the next section. The first simulation explores how accurately we can capture the behavior of a heterogeneous virus system with additive i.i.d. Gaussian noise by using a homogeneous approximation, i.e. recovering the spread parameters by applying (16) . The second set of simulations illustrates some interesting behavior regarding the sampling parameter h. For the first simulation we consider a heterogeneous system with three agents and two viruses, m = 2. We set h = 1, We generate 40 time-steps of the epidemic states, x, using (2) with additive i.i.d Gaussian noise with the standard deviation set to 0.03. To understand how accurately a heterogeneous system can be approximated by a homogeneous model we use The learned parameters in (21) are used to recover the generated data-samples,x, by using (2) with homogeneous spread parameters. We compare x andx in Figure 3 to illustrate how well a homogeneous model can approximate Fig. 3 : Simulation of the epidemic states of a heterogeneous system with additive i.i.d. Gaussian noise and recovered states using a homogeneous approximation of the system. a heterogeneous system with additive noise. We see that, even with noise in the system, the approximation is quite good. One can see that the errors between the recovered states,x 2 1 andx 1 3 , and the original system, x 2 1 and x 1 3 , are higher than the rest of the errors. The decreased accuracy ofx 1 3 can be explained by the difference in magnitude of β 1 3 from the infection rates of the other agents. The same applies forx 2 1 but for the healing rate, δ 2 1 . We now propose an online algorithm for learning the spread parameters from data, extending the ideas from Section IV. The question becomes, if there is additive system noise and data is obtained in an online manner, how do estimates of the spread parameters improve as more data is added? The algorithm becomes the following: as more data is added, more rows of (16) are added. Then the spread parameters can be obtained by solving (16) at each time step, using least squares or by employing a recursive least squares method [26] , to predict the next time step. A simulation based on the online algorithm for learning is shown in Figure 4 with a single homogeneous virus. We set h = 1, δ = 0.9, β = 1.5, x[0] = [0 0 1], and We generate the epidemic states, x, using (2) with additive i.i.d Gaussian noise with the standard deviation set to 0.03. By Figure 4 we can see that the estimation is quite accurate using this online algorithm for learning, wherē x represents the estimated state. We can see that the new algorithm performs quite well, capturing the behavior of the system. We now apply these ideas to a USDA farm subsidy dataset. In the Food, Conservation and Energy Act of 2008 (2008 Farm Bill) a new subsidy program, ACRE, was introduced. It was an alternative to the exist CCP program. Similar to [18] , [19] we aggregate farms on the county level. This approach allows us to convert the binary decision to enroll in ACRE or in CCP into a continuous measure of the proportion of eligible farms that enroll in ACRE or CCP, in each county. The proportion of farms enrolled in ACRE (and CCP) corresponds exactly to the density of the first virus (second virus), facilitating our investigation of the spread of the competing programs. The number of eligible farms in a county was set to the max number of farms enrolled in both programs in any year. We removed counties where no farms were ever enrolled in either program. We also removed Alaska and Hawaii since they are not in the contiguous United States of America. The data for the four years considered can be found in Figures 5a-5d . Please see [18] , [19] for more detailed information on the programs. We now use the learning techniques presented in Section IV and tested in Section V for the model in (2) on the USDA dataset. The adjacency matrices are calculated using the adjacency of counties, that is, Figure 5a as the initial condition on the model in (2) with parameters calculated using the data from Kentucky, given in (25) . The colors of the nodes follow (19) . 1, if county i and county j share a border, First we identify two sets of homogeneous spread parameters using the whole dataset by applying (16) We then simulate the model in (2) with the spread parameters in (23), with the data from Figure 5a being used as the initial condition. The resulting scaled error between the dataset, F, and the simulated data,F all , using the Note thatδ 1 h for the first virus (the ACRE program) is negative, violating the assumptions of the model, which is not ideal. Nevertheless for completeness, we simulate the spread over the contiguous United States using the model in (2) with the spread parameters calculated using the data from Idaho, given in (24) , with the data from Figure 5a being used as the initial condition. The scaled error between the dataset, F, and this simulated data,F ID , is The scaled error from the analogous simulation in [18] , [19] was 0.2348. Therefore it would appear that, while not a perfect fit, the competitive-virus model seems to capture the behavior of this USDA Farm Subsidy adoption dataset better than the single virus model. After testing every possible state, we found that the data from Kentucky provided the best estimate of the whole US data set when using the homogeneous version of the model in (13) . Applying (16) on the Kentucky dataset gives the following spread parameters: The simulated data can be found in Figures 5e-5h The results were improved upon when implementing the recursive algorithm proposed in Section V-B, reducing the scaled error to 0.0855. However, it must be noted that the first two data points were included in the simulated data, since the recursive algorithm is only used for one step prediction. Using the first set of learned spread parameters for the second and third data points gave an error of 0.1140, still improving upon the previous results. In this work we have proposed a discrete time competing virus model for an arbitrary number of viruses. We have provided conditions for the model to be well defined. We provided necessary and sufficient conditions for uniqueness of the healthy equilibrium. We presented necessary and sufficient conditions for learning spread parameters for competing viruses from data. We presented an interesting set of simulations that illustrate the analytic results and depict some characteristics of the model that warrant further study, and proposed an online spread parameter estimation algorithm. We employed a previously studied USDA dataset to validate the discrete-time two-competing virus, or bi-virus, case by modeling the spread of two alternative farm subsidy programs among farms aggregated by county, improving on previous work. Social media and fake news in the 2016 election The Brexit botnet and user-generated hyperpartisan news The evolution of viruses. competition between horizontal and vertical transmission of mobile genes Competitive epidemic spreading over arbitrary multilayer networks Competing epidemics on complex networks Winner takes all: competing viruses or ideas on fair-play networks Competing memes propagation on networks: A network science perspective Bi-virus SIS epidemics over networks: Qualitative analysis On the analysis of a continuous-time bi-virus model Optimal resource allocation for competitive spreading processes on bilayer networks Analysis and control of a continuous-time bi-virus model A stochastic model of multivirus dynamics Multi-competitive viruses over static and time-varying networks Multi-competitive viruses over time-varying networks with mutations and human awareness," under review for IFAC Automatica Epidemic spreading in real networks: An eigenvalue viewpoint Epidemic thresholds in real networks Global dynamics of epidemic spread over complex networks Discrete-time spread processes: Analysis, identification, and validation Discrete time virus spread processes: Analysis, identification, and validation Knowledge transfer from universities to regions as a network spreading process Network reconstruction and prediction of epidemic outbreaks for NIMFA processes The viral state dynamics of the discrete-time NIMFA epidemic model An introduction to numerical analysis Distributed control of positive systems Matrix analysis