key: cord-0184064-85su22ss authors: Shariatnasab, M.; Shirani, F.; Anwar, Z. title: Privacy Limits in Power-Law Bipartite Networks under Active Fingerprinting Attacks date: 2022-02-11 journal: nan DOI: nan sha: 87f41e5483d13885d24bf709a968a7cef7af7076 doc_id: 184064 cord_uid: 85su22ss This work considers the fundamental privacy limits under active fingerprinting attacks in power-law bipartite networks. The scenario arises naturally in social network analysis, tracking user mobility in wireless networks, and forensics applications, among others. A stochastic growing network generation model -- called the popularity-based model -- is investigated, where the bipartite network is generated iteratively, and in each iteration vertices attract new edges based on their assigned popularity values. It is shown that using the appropriate choice of initial popularity values, the node degree distribution follows a power-law distribution with arbitrary parameter $alpha>2$, i.e. fraction of nodes with degree $d$ is proportional to $d^{-alpha}$. An active fingerprinting deanonymization attack strategy called the augmented information threshold attack strategy (A-ITS) is proposed which uses the attacker's knowledge of the node degree distribution along with the concept of information values for deanonymization. Sufficient conditions for the success of the A-ITS, based on network parameters, are derived. It is shown through simulations that the proposed attack significantly outperforms the state-of-the-art attack strategies. Bipartite networks model a range of application scenarios in social network analysis [1] , [2] , tracking mobility in wireless networks [3] - [6] , pandemic-related contact tracing [7] , and security and forensics [8] . In this work, we consider bipartite networks whose vertices are partitioned into user vertices and group vertices, where the group vertices may represent social network groups, locations visited by users, users' online activities and browsing habits, etc. For instance, in social networks, the users' group memberships are modeled using a bipartite network [9] - [11] , where an edge between a user vertex and a group vertex indicates that the user is a member of that group. Companies use tracking tools to monitor users' online activities at varying level of intrusiveness. Sophisticated technologies such as third-party cookies, web beacons and click streams track internet addresses, order in which pages are viewed, and even the location of users when browsing websites [12] . This data collection can be used to construct a 'digital fingerprint' for network users. In this work, we wish to find out when can user fingerprinting via data collection lead to deanonymization? In particular, we study the privacy limits in bipartite networks under active fingerprinting attacks, where, an anonymous victim is targeted by an attacker (e.g. the victim visits a malicious website), and the attacker queries her group memberships sequentially (e.g. by querying the browser history). The attacker constructs a fingerprint for the victim based on the received query responses, and by comparing this fingerprint to that of the network users, which is acquired through scanning the publicly available network graph, it identifies the victim. The problem was initially introduced and studied by Wondracek et al. [9] , where an attack strategy was proposed and its effectiveness was illustrated in simulations of the attack scenario in real-world networks. The fundamental privacy limits were studied under various assumptions on the graph network in [11] , [13] - [15] . In [15] , we introduced a stochastic graph generation model, called the popularity-based model, proposed the information threshold strategy (ITS), and derived its fundamental performance limits in terms of expected number of queries necessary for successful denaonymization with vanishing probability of error as the graph size grows asymptotically large. The ITS strategy queries the group memberships of the victim starting with the first group in the network, and at each step, finds the information value of each user which captures the likelihood of that user being the victim given the query responses. It identifies a user as the victim if the information value passes a predetermined threshold. The analytical techniques in [15] leverage ideas from data transmission over channels with feedback [16] . The ITS is agnostic to the network degree distribution. That is, it does not choose the groups to be queried based on their sizes. In this work, we improve the ITS and propose the Augmented Information Threshold Strategy (A-ITS) in which the attacker chooses which group to query based on the group sizes. The performance analysis of such strategy is challenging and requires characterizing the group degree distribution as well as the memory structure of the edges in the network. The analysis of the degree distribution and edge memory (Section II) may be of independent interest in graph analysis applications as well. In summary, the contributions of this work are as follows: • We derive the node degree distribution under the proposed popularity-based graph generation models. • We show that with the appropriate choice of initial popularities, the degree distribution follows a power-law with arbitrary parameter α > 2. • We show that under a sparsity condition on the number of graph edges, which requires the number of edges to grow linearly in the number of users, the user fingerprints are almost memoryless. • We propose the A-ITS strategy which leverages the attacker's knowledge of the node degree distribution, derive sufficient conditions for its success, and provide simulation results to illustrate the performance gains of A-ITS as compared with ITS. Notation: The random variable 1 E is the indicator of the event E. The set {n, n+1, · · · , m}, n, m ∈ N is represented by [n, m] , and for the interval [1, m] , we use the shorthand notation [m]. For a given n ∈ N, the n-length vector (x 1 , x 2 , . . . , x n ) is written as x n . For x ∈ R, we have defined This section introduces the stochastic graph model, characterize the resulting degree distribution, and derives several statistical properties which are used in the sequel. A bipartite graph is formally defined as follows. The popularity-based generation process is as follows: Initiation: Fix the model parameter µ ∈ N, and let ∆ µn. The iterative process is initiated by considering a bipartite graph , v 2,n } are fixed through the iteration process, and there are no edges in the initial graph. Each vertex v 2, j , j ∈ [n] is assigned an initial popularity value τ j (0) > 0. The graph is generated in ∆ iterative steps, where at each step, a single edge is added to E as described in the sequel. Step t: At step t ∈ [∆], an edge (v 1,I t , v 2,J t ), I t ∈ [m], J t ∈ [n] is chosen as described next and added to the bipartite graph, i.e. E(t) = E(t−1)∪{(v 1,I t , v 2,J t )}. First, a right-vertex v 2,J t is chosen from the set V 2 by choosing J t randomly according to the probability distribution P(t) = (P 1 (t), P 2 (t), · · · , P n (t)) defined below: Next, a left-vertex v 1,I t is chosen randomly and uniformly from the set [m]−V 2,J t (t −1). The edge The popularity values are updated as τ investigate the degree distribution of bipartite graph under the following asymptotic regime: i) number of left-vertices m is taken to be asymptotically large, i.e. m → ∞, ii) number of rightvertices n grows linearly in m, i.e. m = βn for a fixed β > 0. iii) average value of right-vertex degrees is constant as the network grows. That is, ∆ = µn = µ β m, µ ≥ 1, so that the average degree µ is constant in n. The latter condition is a sparsity condition, which is analogous to the scale-free property in linear and sublinear preferential attachment models [1] , [17] - [19] . A critical feature of scale-free network generation models such as the well-studied Barbási-Albert models, is that the resulting degree distribution follows a power-law. That is, the expected number of vertices with degree d is proportional to d −α for some α > 0. Such power-law behavior is observed empirically in real-world graphs such as social network group memberships, wireless mobility networks, and online shopping habits [1] , [2] . In the following, we show that the degree distribution of the right-vertices under the generation process described in Section II-A, follows the power-law distribution with parameter α > 0, where α can be controlled by the appropriate choice of the initial popularity values. Definition 4 ((µ, n, m, α)-bigraph). Given µ, α > 0 ,and n, m ∈ N, a (µ, n, m, α)-bigraph is a bipartite graph with n left-vertices, m right-vertices, ∆ = nµ edges, and initial popularity values distributed according to: 1 i α , and the initial popularity values are mutually independent. Note that lim m→∞ ζ(m, α) is the well-studied Riemann-Zeta function (e.g. [20] ), evaluated at α. The following theorem shows that given α > 2, the degree distribution of a (µ, n, m, α)-bigraph converges to the power-law distribution with parameter α as m → ∞. Theorem 1 (Power-law in Popularity-based Models). Fix µ, β > 0 and α > 2, and let G m = (V 1 , V 2 , E) be a sequence of (µ, n, βn, α)-bigraphs. Then, Proposition 1 (Concentration of Initial Popularities). Fix α > 2, and let Y n j=1 τ j (0) be the total initial popularities of the right-vertices. Then, where β = max(−1, 2 − α). Proof. Please refer to Appendix B. It should be noted that the proposed popularity-based generation models can be used to generate degree distributions other than power-law distributions by appropriately choosing the parameter α. For instance, the following proposition shows that the degree distribution converges to a geometric distribution as m → ∞ when α → ∞, i.e. when all initial popularity values are set to be equal to one. Remark 1. It can be observed from the proof of Theorem 1 that the degree distribution depends on the total number of edges ∆ through the model parameter µ which determines the constant c in Equation (1). Proof. Please refer to Appendix C. A major obstacle in analyzing the asymptotic properties of bipartite networks and performance limits of attack algorithms is the memory structure in the edges connecting a given left-vertex to the right-vertices. That is, the generation model induces correlation among the edges, and this prohibits the conventional large deviations techniques which have been used in deriving theoretical performance limits in similar scenarios in group testing [21] and communications [16] problems. In [15] we have shown that if all initial popularity values are equal to one (i.e. α → ∞), the memory in the left-vertices' fingerprints is weak, so that the joint distribution of a given fingerprint approaches a product distribution. In this section, we extend the result to α > 2, i.e. power-law bigraphs. Proposition 3 (Group Size Correlation). Let α > 2, µ > 1 and β > 0 . For a (µ, n, βn, α)-bigraph, the following holds: Proof. Please refer to Appendix D. Following the arguments in [15] , for a given bigraph satisfying the conditions in Proposition 3, the left-vertex fingerprints are 'almost' memoryless as stated below. The following holds: Furthermore, assume that n > m µ and n i=1 1(s i = 1) = o(n) for some constant finite number C > 0. Then, there exists c > 0 whose value only depends on µ and β such that: The following sparsity result holds for the fingerprint vector of the left-vertices. Proposition 5 (Sparsity of the Fingerprint Vector). Let α > 2, µ > 1 and β > 0 . For a (µ, n, βn, α)-bigraph. Consider the partial fingerprint R , there exists a constant c > 0 such that: Particularly, let ψ n > 0, n ∈ N such that ψ n = ω(1). Then, Proof. Please refer to Appendix E. In this section, we apply the derivations in Section II to investigate the fundamental privacy limits in bipartite networks under active fingerprinting attacks. The scenario is captured by the following: Ground-Truth: We consider the ground-truth bipartite graph G 0 = (U, R, E) capturing users' group memberships in a given network, where i) the set of left-vertices U represents the set of users in the network, ii) the set of right-vertices R represent the set of groups in the network, e.g. social network groups, locations visited by users, users' online activities and browsing habits, etc., and iii) An edge (u i , r j ) ∈ E, i ∈ [m], j ∈ [n] between a user u i and a group r j indicates the user's membership in the group. The ground-truth is modeled by a (µ, n, βn, α)-bigraph, where n ∈ N is the number of groups, m = βn is the number of users, ∆ = µn is the number of edges, and α > 2 is a network parameter which depends on the network's power-law distribution [18] . Scanned Graph: Prior to the start of the attack, the attacker scans the ground-truth and acquires an observation captured by the scanned graph G s . In this work, for brevity, we assume that the scanning operation is noiseless, i.e. G s = G 0 . However, in general, the operation may be noisy, and the scanning noise depends on the users' individual privacy preferences, e.g. social network privacy settings. The derivations provided in the sequel may be potentially extended to noisy scanned graphs using techniques similar to the ones in [15] which investigated the case when α → ∞. Victim: A victim u M is the user which is targeted by the attacker. For instance, the victim may visit a malicious website, where the attacker uses browser history sniffing techniques to sequentially query its group memberships [22] . We assume that the victim's index is chosen from the set [m] randomly according to P M . The distribution P M may not be uniform as users are not equally likely to fall victim to an attack, with more risk-averse users less likely to be victims in an attack. Query Responses: The attack initiates with the attacker sequentially querying the victim's group memberships. Generally, the query responses are noisy, e.g. browser history sniffing techniques are imperfect and only provide noisy observations of the victim's browsing history. The noise statistics are determined by the users' software (e.g. browser [23] ) and hardware specifications (e.g. CPU and memory specifications [24] ), and depend on the type of history sniffing attack. Definition 5 (Noisy Query Responses). Let ∈ N and P θ Y|R , θ ∈ Θ be a collection of distributions, where Y and R are binary and Θ is a finite set. For sequence j 1 , j 2 , · · · , j ∈ [n], assume that victim's fingerprint is (R j 1 , R j 2 , · · · , R j ) and received query responses are Y 1 , Y 2 , · · · , Y . Then, x t (G s , Y t−1 ) outputs the group whose edge connection with the victim is queried at time t, and Id t (G s , Y t ) either outputs the victim's identity among the user set U or outputs 'e', indicating that a unique victim has not been identified yet, in which case further queries are made and the attack continues. Let Q = min{t ∈ N : Id t (G s , Y t ) ∈ U}. Then, the probability of error P e and expected number of queries Q are defined as: where the probabilities are with respect to M, G 0 , G s and Y t , t ∈ [Q]. The minimum expected number of queries is defined as: This section provides an attack strategy which improves upon the information threshold strategies (ITS) investigated in [14] , [15] , and uses the derivations in Section II to derive its fundamental performance limits in terms of minimum expected number of queries and probability of error. Attack Strategy: The attacker queries the group memberships of the victim starting from the largest group r j 1 , i.e., j 1 = arg max |D j | j ∈ [n] and continues by querying the next largest group, so that x s = r j s , where r j s is the sth largest group. The queries continue until a particular stopping criterion is met. To explain the stopping criterion, let us define the information value I k (t), k ∈ [m], t ∈ [n] of user u k and time t as follows: where P Y t (1) where the parameter > 0 affects the resulting probability of error. If there exists a user whose information value exceeds log 1 , then that user is identified as the victim. Otherwise, the next query is made. So, We call this attack the Augmented-ITS (A-ITS) since it uses both information thresholds and the group degree distribution. Consider the A-ITS described above with parameter > 0. Let Q A-IT S be the resulting expected number of queries and P e,A-IT S the resulting probability of error. Then, where c is from Proposition 4, the variable N d , d ∈ [m] is the number of groups with degree equal to d in the ground-truth graph, the mutual information I d,θ (Y; E 0 ) is evaluated with respect to P d,θ Y,E 0 = P d E 0 P θ Y|E 0 , the variable E 0 is Bernoulli with parameter d m , P θ Y|E 0 is the query noise with parameter θ given in Definition 5, i max max y,r,θ∈{0,1}×Θ log Proof. Please refer to F. The following is a direct consequence of Theorem 2, and the fact that Consider the A-ITS with parameter > 0. Assume that |Θ| = 1, and P Y|E 0 is a binary symmetric distribution with crossover probability n q ≤ 1 2 . Then, where c and c are from Propositions 1 and 4, respectively, is the binary entropy function, and a * b = a(b−1)+b(a−1), a, b ∈ [0, 1]. In this section, we provide a simulation of active fingerprinting attacks on synthesized bigraphs. In order to provide a baseline for comparison, we also investigate the performance of a natural extension of the ITS considered in [15] . We generate the ground-truth with α ∈ {3, 5, 10}. Furthermore, we consider |Θ| = 1 and a single P Y|E 0 which is a binary symmetric distribution with crossover probability n q = 0.05. We take the victim to be chosen equally likely among the users. We have simulated the attack with parameters µ = 100, = 0.01, and β = 0.1 and m = {1000, 2000, 4000, 6000, 8000, 10000}. For each set of parameters, we have simulated the attack 500 times, by generating the ground-truth five times and choosing a victim randomly and uniformly for each generation 100 times. Figure 1 shows the performance of A-ITS and ITS in terms of expected number of queries. We have chosen the parameter such that the empirical observed probability of error is close to 0.05 for each set of simulation parameters. It can be observed A-ITS significantly outperforms the ITS. This suggests that the attacker can make significant improvements by leveraging its knowledge of the group sizes in its choice of queries. The expected number of queries is increasing in α, and it grows linearly in the number of users m. The latter is in agreement with the observations made in [15] for the α → ∞ scenario. The fundamental privacy limits under active fingerprinting attacks in power-law bipartite networks was considered. The popularity-based model was investigated, and it was shown that using the appropriate choice of initial popularity values, its node degree distribution follows a power-law distribution with arbitrary parameter α > 2. An active fingerprinting deanonymization attack strategy called the augmented information threshold attack strategy (A-ITS) was proposed, and sufficient conditions for its success, based on network parameters, were derived. It was shown through simulations that the proposed attack significantly outperforms the state-of-the-art attack strategies. Appendix A Proof of Theorem 1 Fix > 0. Let Y j = j j τ j (0) be the sum of all initial popularity values except the initial popularity value of the jth right-vertex. where the last equality is due to Proposition 1. We let A be the event that |Θ j − E(Y j )| > E(Y j ), and write Note that from the arguments in the proof of Proposition 1 we have P(A) = o(n β ) and from the theorem statement, we have k = o(n −β α ). Consequently, P(A) = o(k −α ) as shown below: So, Next, we investigate P(D j (∆) = k, A c ). Note that Θ j and τ j (0) are independent variables. We have: , and in the last inequality we have used the fact that the maximum is greater that the average. Similarly, where we have used the fact that P(A c ) = υ∈T P(Y j = υ) = 1 − o(k −α ). Furthermore, where σ ∈ S n and S n is the symmetric group of permutations over [∆] . . As a first step, we only consider transpositions. To elaborate, we show that where σ = (κ, κ+1) so that σ is the transposition which swaps κ with κ+1 for a given κ ∈ [∆−1]. which proves Equation (22) . The proof for the case where b κ = 1, b κ+1 = 0 follows similarly. Next, we extend the argument to arbitrary σ ∈ S n . It is well-known that a decomposition σ = σ 1 • σ 2 • · · · • σ r , r ∈ N always exists, where σ j = ( j , j + 1), j ∈ [∆ − 1] is the transposition which swaps j and j + 1 (e.g. [25] ). The proof of Equation (22) for general σ ∈ S n follows by iterative application of the above arguments for each transposition. Consequently, Let ψ υ * ∆ , where υ * achieves the minimum above, and define We first show that g(i) is monotonically decreasing: It suffices to show that k(ψ∆ + i) ≤ (i + 1)∆, which holds if and only if i ∆ ≤ i−ψk+1 k for n large enough. It can be verified that the latter holds since k = o(n −β α ) and ∆ = µn and by noting that β α ≤ 1 3 . Next, we show that g(k+ is bounded as k → ∞. To see this, note that: where the second term is 1 where in the last equality we have used the well-known result that (1 + a n ) n → e a , a > 0 as n → ∞. Consequently, we have: is bounded as n → ∞. Hence, where we have defined c c ψ ψ+1 . Next, we use the fact that n is the binary entropy function measured in nats (e.g. [26] ) as follows: where exp(x) = e x , x ∈ R, and c ∈ R is a constant number. We note that since ∆ = µn grows linearly in n by the sparsity condition in Section II-A. Also, Using the second order Taylor's approximation, we have: where p 1 lies between ψ∆−1 (1+ψ)∆−k−1 and ψ ψ+1 and is hence bounded away from 0 and 1 as n → ∞. As a result 1 p 1 (1−p 1 ) is bounded as n → ∞. Similarly, where p 2 lies between ψ∆+ψk−1 (1+ψ)∆+ψk−1 and 1 ψ+1 and is hence bounded away from zero as n → ∞. As a result, which is bounded as n → ∞ since k 2 ∆ = O(n 1− 2β α ) as explained in the prequel. We have: for a constant c ∈ R and n large enough. A similar argument can be provided to show that For a constant c . This completes the proof. Proof of Proposition 1 The following proves Equation (2): where in (a) we have used linearity of expectation, and in (b) we have used the definition of the Riemann Zeta function. Next, we prove Equation (3): where in (a) we have used independence of initial popularity values. To prove Equation (4), note that by Chebychev's inequality, we have: So, where in (a) we have used the assumption that α > 2, and in (b) we have used the fact that m = ηn for a constant η > 0. Using the second order Taylor's approximation, we have: where in (a) p 1 lies between n−2 (1+µ)n−k−1 and 1 1+µ and is hence bounded away from 0 and 1 as n → ∞. As a result 1 p 1 (1−p 1 ) is bounded as n → ∞. Furthermore, Note that we must have c = 1 1+µ . This completes the proof. Appendix D Proof of Proposition 3 We provide an outline of the proof. Equation (6), we note that E(D 2 j (∆)) = m i=1 P(τ j (0) = i)E(D 2 j (∆)|τ j (0) = i). Next, for a given i ∈ [m] we construct a new bipartite graph by replacing the right-vertex v 2, j by i right-vertices v 2, j,k , k ∈ [i] each with initial popularity values τ j,k (0) = 1, k ∈ [i]. It is straightforward to verify that the degree distribution of the right-vertices in the original graph, other than v 2, j , is the same as the new graph, and the degree distribution of v 2, j in the original graph is the same as the sum of the degrees of the new vertices v 2, j,k , k ∈ [i]. So, E(D 2 j (∆)|τ j (0) = i) = E((D j,1 (∆) + D j,2 (∆) + · · · + D j,i (∆)) 2 |τ j,k (0) = 1, k ∈ [i]). Consequently, E(D 2 j (∆)|τ j (0) = i) ≤ iE(D 2 j,1 (∆)|τ j,1 (0) = 1) + i(i − 1)E(D j,1 (∆)D j,2 (∆)|τ j,1 = τ j,2 = 1). So, E(D 2 j (∆)) ≤ E(τ j (0))E(D 2 j,1 (∆)|τ j,1 (0) = 1) + E(τ 2 j (0))E(D j,1 (∆)D j,2 (∆)|τ j,1 (0) = τ j,2 = 1). The two terms on the right hand side of the last equation are finite as m → ∞ since E(τ 2 j (0)) = ζ(m, α−2), E(τ j (0)) = ζ(m, α − 1), and due to Proposition 1 in [15] which shows the result conditioned on τ j (0) = 1. To prove Equation (7), we have: Equations (8), (9) , and (10) can be shown similarly. The proof is omitted for brevity. Proof of Proposition 5 We provide an outline of the proof. Note that C i n = 1 So, using an extension of Hoeffding's inequality to weakly correlated variables given in [27] , we have: where = n m λ(m, α)(1 + ψ) = 1 β λ(m, α)(1 + ψ) and ψ ∈ (0, λ(m,α) µ − 1). To derive (12), we note that: P(C i ≥ ) ≤ c2 −n( λ(m,α) m (1+ψ) log (1+ψ)+O( 1 n )) → 0, as n → ∞. Note that Q A-IT S = E(t * ). Fix n ∈ N. Let T n = min{t M , n }. Note that: where n ζ(m,α)d α , and we have used Wald's identity [28] and Proposition 5 to upper bound the expectation over the fingerprint distribution with that over a product distribution. Note that Equations (23) and (24) yield the desired bound on Q A-IT S . The proof for the probability of error follows similar steps as that of Theorem 1 in [15] and is provided for completeness as follows: 1(κ j ≤ η)) where in (a) we have used the fact that P (R j,i ) i∈[n] = P (R M,i ) i∈[n] , j ∈ [m]. Preferential attachment in the growth of social networks: The internet encyclopedia wikipedia Clustering and preferential attachment in growing networks Limits of location privacy under anonymization and obfuscation Asymptotic loss in privacy due to dependency in Gaussian traces Unique in the crowd: The privacy bounds of human mobility Drive2friends: Inferring social relationships from individual vehicle mobility data Applicability of mobile contact tracing in fighting pandemic (COVID-19): issues, challenges and solutions Users' fingerprinting techniques from tcp traffic A practical attack to de-anonymize social network users Online social networks: threats and solutions De-anonymizing web browsing data with social networks Verizon might be collecting your browsing history and here's how to stop it. The Verge An information theoretic framework for active de-anonymization in social networks based on group memberships Optimal active social network de-anonymization using information thresholds Fundamental privacy limits in bipartite networks under active attacks Data transmission over a discrete channel with feedback. random transmission time Emergence of a non-scaling degree distribution in bipartite networks: a numerical and analytical study Preferential attachment in online networks: Measurement and explanations A preferential attachment gathering mobility model The theory of the Riemann zeta-function Active sequential hypothesis testing A practical attack to de-anonymize social network users Browser history re: visited Perfweb: How to violate web privacy with hardware performance events Algebra: a graduate course Good lower and upper bounds on binomial coefficients Constructive proofs of concentration bounds On cumulative sums of random variables Proof of proposition 2 Note that P(τ j (0) = 1) → 1 as α → ∞; therefore, the summation of initial popularities is equal to the total number of groups, i.e. Y = n. So,Similar to the steps in Appendix A, we have:where in (a) we use the fact that nSo,We provide an outline of the proof. Let I k (G s , Y κ ) be the information value of user k ∈ [m]given scannd graph G s and query responses Y κ , κ ∈ [n]. Define the following stopping times