title: Efficient Top-k Vulnerable Nodes Detection in Uncertain Graphs
authors: Cheng, Dawei; Chen, Chen; Wang, Xiaoyang; Xiang, Sheng
date: 2019-12-28

Abstract: Uncertain graphs have been widely used to model complex linked data in many real-world applications, such as guaranteed-loan networks and power grids, where a node or edge may be associated with a probability. In these networks, a node usually has a certain chance of default or breakdown due to self-factors or the influence from upstream nodes. For regulatory authorities and companies, it is critical to efficiently identify the vulnerable nodes, i.e., nodes with high default risks, such that they can pay more attention to these nodes for the purpose of risk management. In this paper, we propose and investigate the problem of top-$k$ vulnerable nodes detection in uncertain graphs. We formally define the problem and prove its hardness. To identify the $k$ most vulnerable nodes, a sampling-based approach is proposed. Rigorous theoretical analysis is conducted to bound the quality of the returned results. Novel optimization techniques and a bottom-$k$ sketch based approach are further developed in order to scale to large networks. In the experiments, we demonstrate the performance of the proposed techniques on 3 real financial networks and 5 benchmark networks. The evaluation results show that the proposed methods can achieve up to two orders of magnitude speedup compared with the baseline approach. Moreover, to further verify the advantages of our model in real-life scenarios, we integrate the proposed techniques with our current loan risk control system, which is deployed in a collaborating bank, for additional evaluation. In particular, we show that our proposed new model has superior performance on real-life guaranteed-loan network data and can better predict the default risks of enterprises compared to the state-of-the-art techniques.

Figure 1: Guarantee loan process
Figure 2: A real-world guaranteed-loan network

To obtain loans from banks, groups of small and medium enterprises (SMEs) back each other to enhance their financial security. Figure 1 shows the flow of the guaranteed-loan procedure. When more and more enterprises are involved, they form complex directed network structures [7]. Figure 2 illustrates a guaranteed-loan network with around 3,000 enterprises and 7,000 guarantee relations, where a node represents a small or medium enterprise and a directed edge from node A to node B indicates that enterprise B guarantees enterprise A. Thousands of guaranteed-loan networks of different complexities have coexisted for a long period [8], which calls for efficient strategies to prevent, identify and dismantle systemic crises. Many kinds of risk factors have emerged throughout the guaranteed-loan networks, which may accelerate the transmission and amplification of risk. A guarantee network intended as a "mutual aid group" may thus be alienated into a channel of "breach of contract". An appropriate guarantee union may reduce the default risk, but significant contagion damage throughout the networked enterprises could still occur in practice [9]. A guaranteed loan is a debt obligation promise: if one corporation gets trapped in risk, the contagion can spread to other corporations in the network. When defaults diffuse across the network, a systemic financial crisis may occur.
It is therefore essential to consider the contagion damage in guaranteed-loan networks. We can model a guaranteed-loan network with our uncertain graph model, where each node has a self-risk probability and each edge has a risk diffusion probability. Figure 3(a) is a toy example of a guaranteed-loan network, and Figure 3(e) shows the corresponding relationships between different enterprises together with the risk diffusion probabilities. It is desirable to efficiently identify the k most vulnerable nodes, i.e., enterprises with high default risks, such that the banks or the financial regulatory authorities can pay extra attention to them for the purpose of risk management. This is more urgent than ever given the current worldwide economic slowdown. In the literature, some machine-learning based approaches (e.g., [10]) have been proposed to predict the node default risk for different applications. For instance, a high-order and graph attention based network representation method has been designed in [10] to infer the possibility of loan default events. These approaches indeed consider the structure of the networks. However, they cannot properly capture the uncertain nature of contagious behaviors in the networks. In the experiments, we also compare with these approaches to demonstrate the advantages of our model.

Our approach. In this paper, to identify the top-k vulnerable nodes, we model the problem with an uncertain graph and infer the default probability of a node following the possible world semantics, which has been widely used to capture contagious phenomena in real networks [11, 12, 13, 14]. In particular, we utilize an uncertain graph with two types of probabilities to model the occurrence and propagation of default risks in the network. Note that we focus on identifying vulnerable nodes for a given network, while the self-risk probabilities and diffusion probabilities can be obtained based on existing works (e.g., [15, 10]). Figure 3 illustrates the structure of a toy uncertain graph with 5 nodes and 6 edges, as well as the associated self-risk probabilities and diffusion probabilities. Given the probabilistic graph G, we may derive the default probability of a node following the possible world semantics, where each possible world (i.e., instance graph in this paper) corresponds to a subgraph (i.e., a possible occurrence) of G. Taking node E as an example, it may default because of (i) itself, which is represented by a shaded node as shown in Figure 3(b), or (ii) the contagion damage initiated by other nodes, as shown in Figures 3(c)-(d). In Section 2, we formally introduce how to derive the default probabilities of nodes. In this paper, we show that calculating the default probability of a single node alone is already #P-hard, let alone the top-k vulnerable nodes identification problem. A straightforward solution for the top-k vulnerable nodes computation is to enumerate all possible worlds and then aggregate the results of each possible world. However, this is computationally prohibitive, since the number of possible worlds of an uncertain graph may be up to 2^{n+m}, where n and m are the numbers of nodes and edges in the graph, respectively. In this paper, we first show that we can identify the top-k vulnerable nodes by using a limited number of sampled instance graphs with a tight theoretical guarantee. To reduce the required sample size and speed up the computation, lower/upper bound based pruning strategies and an advanced sampling method are developed.
In addition, to further accelerate the computation, a bottom-k sketch based method is proposed. To verify the performance in real scenarios, we integrate the proposed techniques with our current loan risk control system, which is deployed in a collaborating bank.

Contributions. The principal contributions of this paper are summarized as follows. • We advocate the problem of top-k vulnerable nodes detection in uncertain graphs, which is essential in real-world applications. • Due to the hardness of the problem, a sampling based method is developed with a tight theoretical guarantee. • We develop effective lower and upper bound techniques to prune the search space and reduce the required sample size. An advanced sampling method is designed to speed up the computation, with rigorous theoretical analysis. • A bottom-k sketch based approach is further proposed, which can greatly speed up the computation with competitive results. • We conduct extensive experiments to evaluate the efficiency and effectiveness of our proposed algorithms on 3 real financial networks and 5 benchmark networks. • To further verify the advantages of our models in real scenarios, the proposed techniques are integrated into our current loan risk control system, which is deployed in a collaborating bank. Through case studies in a real-life financial environment, we verify that our proposed model can significantly improve the accuracy of high-risk enterprise prediction.

Roadmap. The rest of the paper is organized as follows. Section 2 describes the studied problem and the related techniques used in the paper. Section 3 presents the basic sampling-based method and our optimized algorithms. We report the experimental results in Section 4. Section 5 introduces the system implementation details and case studies. We present the related work in Section 6 and conclude the paper in Section 7.

In this section, we first present some key concepts and formally define the problem. Then, we introduce the related techniques used. Table 1 summarizes the notations frequently used throughout this paper.

Table 1: Summary of notations
h: a truly random hash function
L(A, k): the k-th smallest hash value of the set A
bk: the parameter k in the bottom-k sketch
N(v): the set of in-neighbors of node v
A: an approximation algorithm for the studied problem
R: the set of k nodes returned by A
P_k: the default probability of the rank-k node
(ε, δ): the parameters of the (ε, δ)-approximation
t: the sample size
p_l(v), p_u(v): the lower and upper bounds of p(v)
B: the set of candidate nodes
G_t: the graph obtained by reversing the direction of the edges in G

We consider an uncertain graph G = (V, E) as a directed graph, where V is the set of nodes and E denotes the set of edges. n = |V| (resp. m = |E|) is the number of nodes (resp. edges) in G. Each node v ∈ V is associated with a self-risk probability p_s(v), which denotes the default probability of v caused by self-factors. Each edge (u, v) ∈ E is associated with a diffusion probability p(v|u), which denotes the probability that v's default is caused by u's default. In this paper, we assume the self-risk probabilities and diffusion probabilities are already available. These probabilities can be derived based on previous studies (e.g., [10]). For simplicity, when there is no ambiguity, we use the terms uncertain graph, graph and network interchangeably.
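To make the model concrete, the following is a minimal sketch (not taken from the paper) of one possible in-memory representation of such an uncertain graph: a self-risk probability per node and a diffusion probability p(v|u) per directed edge. The topology and probability values below are assumed purely for illustration and are not the exact graph of Figure 3.

```python
# Assumed toy representation (not the exact graph of Figure 3):
# self_risk[v] = p_s(v), diffusion[(u, v)] = p(v|u).
self_risk = {"A": 0.2, "B": 0.2, "C": 0.2, "D": 0.2, "E": 0.2}
diffusion = {
    ("A", "B"): 0.2, ("A", "C"): 0.2, ("B", "D"): 0.2,
    ("C", "D"): 0.2, ("C", "E"): 0.2, ("D", "E"): 0.2,
}

def in_neighbors(v, diffusion):
    """Return the in-neighbor set N(v), i.e., all nodes u with a directed edge (u, v)."""
    return [u for (u, w) in diffusion if w == v]

print(in_neighbors("E", diffusion))  # ['C', 'D'] for this assumed topology
```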
In this paper, we derive the default probability of a node by considering both the self-risk probability and the diffusion probability, which is defined as follows.

Definition 1 (Default Probability). Given an uncertain graph G = (V, E), for each node v ∈ V, its default probability, denoted by p(v), is obtained by considering both the self-risk probability and the diffusion probability. p(v) can be computed recursively as follows:

p(v) = 1 − (1 − p_s(v)) · ∏_{u ∈ N(v)} (1 − p(u) · p(v|u)),    (1)

where N(v) is the set of in-neighbors of v. It is easy to verify that the equation above is equivalent to aggregating the probability over all the possible worlds, i.e.,

p(v) = Σ_{W ∈ 𝒲} p(W) · I_W(v),    (2)

where 𝒲 is the set of all possible worlds, p(W) is the probability of a possible world W, and I_W(v) is an indicator function denoting whether v defaults in W or not.

Example 1. Reconsider the graph in Figure 3. Suppose the self-risk probabilities and diffusion probabilities are all 0.2 for each node and edge. Then, we have p(A) = 0.2, and the default probabilities of the remaining nodes (e.g., p(B)) can be computed recursively in the same manner.

Problem statement. Given an uncertain graph G = (V, E), we aim to develop efficient algorithms to identify the set R of top-k vulnerable nodes, i.e., the k nodes with the highest default probabilities.

Theorem 1. It is #P-hard to compute the default probability.

Proof. We show the hardness of the problem by considering a simple case, where the self-risk probability p_s(v) equals 1 for a node v and p_s(u) equals 0 for every u ∈ V \ {v}. Therefore, for a node u ∈ V \ {v}, the default probability p(u) is only caused by the default of node v. Then, the default probability p(u) equals the reliability from v to u, which is #P-hard to compute [16]. Thus, it is #P-hard to compute the default probability.

In this section, we briefly introduce the bottom-k sketch [17], which is used in our BSRBK framework to obtain the statistical information for the early stopping condition. The bottom-k sketch is designed for estimating the number of distinct values in a multiset. Given a multiset A = {v_1, v_2, ..., v_n} and a truly random hash function h, each distinct value v_i in A is hashed into (0, 1), and L(A, k) denotes the k-th smallest hash value. The number of distinct values can then be estimated as (k − 1)/L(A, k). The estimate converges quickly as k increases: the expected relative error is √(2/(π(k − 2))) and the coefficient of variation is no more than 1/√(k − 2). To distinguish it from the k in the top-k problem, hereafter in this paper we use bk to denote the parameter k of the bottom-k sketch.
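As an illustration of the estimator (k − 1)/L(A, k) described above, here is a small, runnable sketch (not the paper's implementation); a value-keyed pseudo-random number stands in for the truly random hash function h.

```python
import random

def value_hash(v, seed=0):
    """Map a value to a pseudo-random point in (0, 1); a stand-in for a truly random hash."""
    return random.Random(hash((seed, v))).random()

def bottom_k_estimate(multiset, bk, seed=0):
    """Estimate the number of distinct values via the bottom-k sketch: (bk - 1) / L(A, bk)."""
    hashes = sorted({value_hash(v, seed) for v in set(multiset)})
    if len(hashes) < bk:            # fewer than bk distinct values: the exact count is known
        return len(hashes)
    return (bk - 1) / hashes[bk - 1]

data = [i % 1000 for i in range(5000)]         # 1000 distinct values with repetitions
print(round(bottom_k_estimate(data, bk=64)))   # close to 1000 in expectation
```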
In this section, we first present a basic sampling method and analyze the required sample size. Then, we introduce the optimized approaches to accelerate the processing.

Due to the hardness of computing the default probability, we propose a sampling based method in this section. In order to bound the worst-case performance, rigorous theoretical analysis is conducted on the required sample size.

Sampling framework. To compute the default probability, we can enumerate all the possible worlds and aggregate the results. However, the possible world space is usually very large. In previous research, sampling based methods are widely adopted in this case. That is, we randomly select a set of possible worlds and take the average value as the estimated default probability. By carefully choosing the sample size, we can return a result with a performance guarantee. Algorithm 1 shows the details of the basic sampling based method (a simplified sketch of this procedure is given at the end of this subsection). The input is a given graph, where each node/edge is associated with a self-risk/diffusion probability. In each iteration, we generate a random number for each node to determine whether it defaults by itself or not (Lines 4-7). Then, we conduct a breadth-first search from these nodes, i.e., the nodes with h_v = 1, to locate the nodes that will be influenced by them in the current simulation. For each encountered edge, we generate a random number to decide whether the propagation continues or not. For each node, the number of times it defaults is accumulated in Line 21. The final default probability is calculated by taking the average of the accumulated value v_c. Finally, the algorithm returns the k results with the largest estimated values.

Sample size analysis. For sampling based methods, a critical problem is to determine the sample size required to bound the quality of the returned result. In this section, we conduct rigorous theoretical analysis of the required sample size. Specifically, we say an algorithm A is an (ε, δ)-approximation if the following conditions hold.

Definition 2 ((ε, δ)-approximation). Given an approximation algorithm A for the studied top-k problem, let R be the set of k nodes returned by A, and let P_k be the default probability of the k-th node in the ground truth order. Given ε, δ ∈ (0, 1), we say A is an (ε, δ)-approximation if R fulfills the following conditions with probability at least 1 − δ: (i) p(v) ≥ P_k − ε for every v ∈ R; (ii) p(v) ≤ P_k + ε for every v ∉ R.

If A is an (ε, δ)-approximation algorithm for the vulnerable node detection problem, it means that with high probability (i) the default probabilities of the returned nodes are at least P_k − ε, and (ii) the default probabilities of the nodes not in R are at most P_k + ε.

Theorem 2 (Hoeffding Inequality). Given a sample size t and ε > 0, let p̂(v) be the unbiased estimate of p(v) obtained from t samples. Then Pr[|p̂(v) − p(v)| ≥ ε] ≤ 2 · exp(−2tε²).

Based on the Hoeffding inequality, we have the following theorem.

Theorem 3. Given a sample size t, ε > 0 and two nodes u, v with p(u) − p(v) ≥ ε, we have Pr[p̂(u) ≤ p̂(v)] ≤ exp(−tε²/2).

Proof (sketch). Consider p̂(u) − p̂(v) as the estimator of p(u) − p(v). The sampled difference in each possible world lies in [−1, 1]. Feeding this into the Hoeffding inequality yields the result.

As shown in Theorem 4, Algorithm 1 is an (ε, δ)-approximation algorithm if the sample size is no less than the value given in Equation 3.

Theorem 4. Algorithm 1 is an (ε, δ)-approximation if the sample size is no less than

t ≥ (2/ε²) · ln(k(n − k)/δ),    (3)

where n is the number of nodes, i.e., |V|.

Proof. Suppose we sort the nodes based on their real default probabilities, i.e., {v_1, v_2, ..., v_n}. We show that the two conditions of Definition 2 hold. • For v ∈ R, if v belongs to the true top-k result, the first condition holds naturally. Otherwise, suppose p(v) < P_k − ε; then, by Theorem 3, with high probability the estimated value of v is smaller than those of the true top-k nodes. Therefore, v would not be selected into R, which contradicts the assumption. Thus, the first condition holds. • For v ∉ R, if v does not belong to the true top-k result, the second condition holds naturally. Otherwise, suppose p(v) > P_k + ε; then there must be a node u that does not belong to the true top-k result but is selected into R in place of v, and by Theorem 3, with high probability v would be ranked above u and therefore also be selected into R, which contradicts the assumption. Thus, the second condition holds. Theorem 3 bounds the order of a single pair of nodes. Since 1 ≤ i ≤ k < j ≤ n, we need to bound the order of k(n − k) pairs of nodes. By applying the union bound together with Theorem 3, the failure probability is at most k(n − k) · exp(−tε²/2) ≤ δ whenever t satisfies Equation 3. Therefore, the theorem holds.

Based on Theorem 4, Algorithm 1 can return a result with a tight theoretical guarantee.
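The following sketch illustrates the basic sampling idea in the spirit of Algorithm 1, together with a helper that evaluates the sample-size bound of Equation 3 (as reconstructed above); it is an illustration under these assumptions, not the paper's code, and the function names are ours.

```python
import math
import random
from collections import deque

def estimate_default_probabilities(self_risk, diffusion, t, seed=0):
    """Monte Carlo estimation in the spirit of Algorithm 1: sample t possible worlds,
    propagate defaults with a breadth-first search, and average the outcomes."""
    rng = random.Random(seed)
    out_edges = {}
    for (u, v), p in diffusion.items():
        out_edges.setdefault(u, []).append((v, p))
    counts = dict.fromkeys(self_risk, 0)
    for _ in range(t):
        defaulted = {v for v, p in self_risk.items() if rng.random() < p}  # self-defaults
        queue = deque(defaulted)
        while queue:                                   # propagate along surviving edges
            u = queue.popleft()
            for v, p in out_edges.get(u, []):
                if v not in defaulted and rng.random() < p:
                    defaulted.add(v)
                    queue.append(v)
        for v in defaulted:
            counts[v] += 1
    return {v: c / t for v, c in counts.items()}       # estimated default probabilities

def basic_sample_size(eps, delta, n, k):
    """Sample size suggested by Theorem 4, using the reconstructed form of Equation 3."""
    return math.ceil(2.0 / eps ** 2 * math.log(k * (n - k) / delta))
```

For example, with the toy dictionaries sketched in Section 2 and ε = 0.3, δ = 0.1, n = 5, k = 2, the helper suggests t = 91 samples.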
However, Algorithm 1 still suffers from some drawbacks, which make it hard to scale to large graphs. Firstly, to bound the quality of the returned results, we need to bound the order of k(n − k) node pairs. The node count n can be treated as the candidate size, which is usually large in real networks. Therefore, if we can reduce n (i.e., reduce the candidate space) and k (i.e., verify some nodes without estimation), then the sample size can be reduced significantly. Secondly, in each sampled possible world, we only need to determine whether a candidate node can be influenced or not, i.e., compute h_v. If the candidate space can be greatly reduced, the previous sampling method may explore a lot of unnecessary space.

According to the intuitions above, in this section, novel methods are developed to derive lower and upper bounds of the default probability, which are used to reduce the candidate space. In addition, a reverse sampling framework is proposed in order to reduce the search cost.

Candidate size reduction. To compute the lower and upper bounds of the default probability, we utilize the equation in the default probability definition, i.e., Equation 1. The idea is that the default probability of each node lies in [p_s(v), 1] if no further information is given. By treating each node's default probability as p_s(v) and 1, respectively, we can aggregate the probability over its in-neighbors based on Equation 1 to shrink this interval. Then, with the newly derived lower and upper bounds of the neighbor nodes, we can further aggregate the information and update the bounds. The bounds are derived by iterating this procedure, and the number of iterations is referred to as the order of the bounds. Note that there is a slight difference in the first iteration of the two bound computations, since using 1 as the default probability contributes nothing to the pruning. It is easy to verify that a larger order leads to tighter bounds; users can make a trade-off between the efficiency and the tightness of the bounds. Given the derived lower and upper bounds, we can filter some unpromising candidates and verify some candidates with large probabilities. Lemma 1 shows the pruning rules used to verify and filter the candidate space (a sketch of this step is given at the end of this subsection).

Lemma 1. Given the derived upper and lower bounds of each node, let T_l and T_u be the k-th largest values of p_l(v) and p_u(v), respectively. Then, we have: 1) for u ∈ V, u must be in the top-k if p_l(u) ≥ T_u; 2) for u ∈ V, u must not be in the top-k if p_u(u) < T_l.

Proof. For the first case, suppose there is a node u with p_l(u) ≥ T_u that is not selected into the top-k result. Then at least k other nodes must have default probabilities larger than p(u) ≥ p_l(u) ≥ T_u, and hence upper bounds larger than T_u, which contradicts the fact that T_u is the k-th largest upper bound. Therefore, the first case holds. For the second case, since T_l is the k-th largest value of p_l(v), P_k must be at least T_l. Note that P_k is the default probability of the rank-k node in the ground truth order. Hence, a node u with p_u(u) < T_l satisfies p(u) < T_l ≤ P_k and cannot be in the top-k. Therefore, the second case holds.

Based on Lemma 1, Algorithm 4 shows the details of reducing the candidate space. The algorithm takes the derived lower and upper bounds as input, and outputs the candidate node set B and the number k′ of verified nodes. The k′ verified nodes are put into the result set directly. Note that, if we can verify k′ nodes based on the first pruning rule, then we only need to find the top-(k − k′) nodes from the candidate set B. In this case, we reduce the values k and n in Equation 3 to k − k′ and |B|, respectively.
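Below is a minimal sketch of the bound propagation and the Lemma 1 pruning described in this subsection (it assumes the reconstructed form of Equation 1 and ignores ties when taking the k-th largest values); it is an illustration rather than the paper's Algorithm 4.

```python
def refine_bounds(self_risk, diffusion, order=2):
    """Iteratively tighten [p_l(v), p_u(v)]: start from [p_s(v), 1] and plug the current
    bounds of the in-neighbors into the recursion of Equation 1 (reconstructed form)."""
    in_edges = {}
    for (u, v), p in diffusion.items():
        in_edges.setdefault(v, []).append((u, p))
    lower = dict(self_risk)                     # p_l(v) initialized to p_s(v)
    upper = dict.fromkeys(self_risk, 1.0)       # p_u(v) initialized to 1
    for _ in range(order):
        new_lower, new_upper = {}, {}
        for v, ps in self_risk.items():
            lo_prod, up_prod = 1.0, 1.0
            for u, p in in_edges.get(v, []):
                lo_prod *= 1.0 - lower[u] * p   # neighbors' lower bounds -> lower bound
                up_prod *= 1.0 - upper[u] * p   # neighbors' upper bounds -> upper bound
            new_lower[v] = 1.0 - (1.0 - ps) * lo_prod
            new_upper[v] = 1.0 - (1.0 - ps) * up_prod
        lower, upper = new_lower, new_upper
    return lower, upper

def reduce_candidates(lower, upper, k):
    """Apply the two rules of Lemma 1: verify nodes with p_l(v) >= T_u and keep as
    candidates only the unverified nodes with p_u(v) >= T_l."""
    T_u = sorted(upper.values(), reverse=True)[k - 1]   # k-th largest upper bound
    T_l = sorted(lower.values(), reverse=True)[k - 1]   # k-th largest lower bound
    verified = {v for v in lower if lower[v] >= T_u}
    candidates = {v for v in lower if upper[v] >= T_l} - verified
    return verified, candidates
```

With k′ = |verified| and B = candidates, the required sample size drops from Equation 3 to Equation 4.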
Reverse sampling approach. Based on Algorithm 4, we can greatly reduce the candidate space, whose performance is verified in our experiments on real-world datasets. The basic sampling method aims to estimate the default probability of every node, whereas here we only need to compute the probabilities of the candidate nodes. Especially when the candidate size is small, the previous sampling method explores a lot of unnecessary space. Intuitively, given a sampled possible world, for each candidate node we only need to verify whether it can be reached by a node with h_v = 1. Therefore, we can conduct a reverse traversal from the candidate nodes to check whether this criterion is met. The details are shown in Algorithm 5, where G_t is the graph obtained by reversing the direction of each edge in G. Note that our reverse sampling method is different from the reverse sampling framework used in the influence maximization problem [18]. The inputs are the graph G_t and the candidate set B. After processing a sample, the algorithm returns h_v for each node v in B. At first, we set h_v = 0 for all the nodes. Then we conduct a breadth-first search from each node in the candidate set. For each encountered node and edge, we mark it as checked and store the corresponding information (e.g., survived and h_v) in order to avoid generating random numbers for the same node/edge multiple times. The BFS terminates if it encounters a node with h_v = 1 or there is no more node to be explored (Lines 6-8). If it encounters a node with h_v = 1, the candidate node is influenced; otherwise, it is not. In this way, we can greatly reduce the computation cost by filtering out unnecessary search space. Given a sample size t, we repeat the process t times and accumulate the h_v values to estimate the default probability.

Reverse sampling based method. By integrating the pruning strategies and the reverse sampling technique, we obtain the reverse sampling based algorithm. According to Theorem 5, the approach is an (ε, δ)-approximation if the sample size fulfills Equation 4 (a sketch of the per-sample reverse check is given at the end of this subsection).

Theorem 5. The reverse sampling based algorithm is an (ε, δ)-approximation if the sample size is no less than

t ≥ (2/ε²) · ln((k − k′)(|B| − k + k′)/δ).    (4)

Note that we use the reverse sampling method here because the candidate space is reduced; it only accelerates the computation of the influenced nodes in a sampled possible world. Based on the developed pruning techniques, we can verify some results and filter some candidates. To bound the quality of the returned results, we only need to bound the order of (k − k′)(|B| − k + k′) pairs. That is, based on the derived lower and upper bounds, we can reduce the candidate space, and by using the reverse sampling technique, we can reduce the cost of exploring samples.
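The following sketch illustrates the per-sample reverse check in the spirit of Algorithm 5 (an illustration only; the function and variable names are ours): coin flips for nodes and edges are memoized so that each node/edge is sampled at most once per possible world, and the traversal from a candidate stops as soon as a self-defaulted node is reached.

```python
import random

def reverse_sample_check(self_risk, diffusion, candidates, seed):
    """One reverse sample: traverse the reversed graph from each candidate and report
    whether it reaches a self-defaulted node through surviving edges."""
    rng = random.Random(seed)
    rev_edges = {}                                   # reversed adjacency: v -> [(u, p(v|u))]
    for (u, v), p in diffusion.items():
        rev_edges.setdefault(v, []).append((u, p))
    node_flip, edge_flip = {}, {}                    # memoized outcomes for this sample

    def self_defaults(v):
        if v not in node_flip:
            node_flip[v] = rng.random() < self_risk[v]
        return node_flip[v]

    def edge_survives(e, p):
        if e not in edge_flip:
            edge_flip[e] = rng.random() < p
        return edge_flip[e]

    result = {}
    for c in candidates:
        stack, visited, hit = [c], {c}, self_defaults(c)
        while stack and not hit:                     # reverse traversal (order is irrelevant)
            v = stack.pop()
            for u, p in rev_edges.get(v, []):
                if edge_survives((u, v), p):
                    if self_defaults(u):
                        hit = True
                        break
                    if u not in visited:
                        visited.add(u)
                        stack.append(u)
        result[c] = 1 if hit else 0                  # h_c for this sample
    return result
```

Repeating this check for the t samples of Equation 4 and averaging the indicators yields estimates for the candidate nodes only.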
The reverse sampling based algorithm can return a result with a tight theoretical guarantee, and it reduces the sample size from Equation 3 to Equation 4. However, in many real cases, the sample size and computation cost are still large. Intuitively, we only need sufficient samples to obtain a competitive result. In this section, we derive a method using the bottom-k technique, which can greatly accelerate the procedure while returning competitive top-k results. We first present the idea of finding the top-1 result and then extend it to the top-k scenario.

Finding the top-1 result. In the reverse sampling approach, we process the samples one by one, and we can terminate the processing early once some node has accumulated sufficient statistics. In this paper, we use the bottom-k sketch to serve this role. The idea is as follows. We first apply the lower and upper bound techniques to obtain k′ and B. Let t be the sample size computed by Equation 4. We assign each sample an id and generate a random hash value in (0, 1) for each of them. Since we do not materialize the samples at this point, the time complexity of generating the hash values is only O(t). We then sort the samples in ascending order of their hash values and materialize them in this order using the reverse sampling framework. For each node v in the candidate set, we record an accumulated value v_c. Based on Theorem 6, the node whose v_c first reaches bk is the top-1 result, where bk is the selected threshold.

Theorem 6. The node selected by the above procedure is the top-1 node.

Proof. Suppose node u is the first node that reaches the criterion, and the hash value of its bk-th encountered sample is h_bk(u). According to the property of the bottom-k sketch, we can estimate the default probability p(u) by (bk − 1)/(h_bk(u) · t). If v is the second node that reaches the criterion, we must have h_bk(v) > h_bk(u). Therefore, the corresponding estimated value of v is smaller than that of u.

Here, we use bk to measure whether the statistics are sufficient or not. Even though the bottom-k based method does not offer a theoretical guarantee as tight as the previous approaches, our experimental evaluation shows that it has a great advantage over the others.

Example 3. Reconsider the graph in Figure 3 with k = 1 and bk = 2. Suppose nodes D and E are the candidates in B. Then, we only need to reverse the graph and conduct the traversal from D and E. For simplicity, we do not present the reversed graph here. Figure 3(b) is the first reverse sample: E defaults by itself, and D does not meet any defaulted node when back-traversing from D. Therefore, the counter of E is set to 1. Figure 3(c) is the second sample, after which the counter of E becomes 2 and meets the bottom-k criterion. Thus, E is the top-1 result returned.

Finding the top-k results. To find the top-k results, we follow a similar manner as in finding the top-1 result. By extending Theorem 6, we can stop exploring the samples when there are k − k′ nodes with sufficient statistics. That is, we continue visiting the samples until the counters of k − k′ nodes reach bk. Note that there may be cases where the stopping condition cannot be met even after all the samples are processed. In that case, the algorithm degenerates to the reverse sampling based method, and we simply return the k − k′ nodes with the largest estimated values. Nevertheless, according to the experiments over real-world datasets, the algorithm converges quickly with bk (a sketch of this early-stopping procedure is given at the end of this section).

Complexity analysis. In this paper, two types of samples are involved, i.e., basic samples (Algorithm 1) and reverse samples (Algorithm 5). We use t_b and t_r to denote the cost of generating a basic sample and a reverse sample, respectively. To obtain a sample, in the worst case, we need to traverse the whole graph once for both sampling approaches, whose cost is O(m + n). Therefore, for the basic sampling approach (i.e., Algorithm 1), the time complexity is O(t_b · (2/ε²) · ln(k(n − k)/δ)), where ε and δ are the input parameters. For the reverse sampling based method, the bound computation and the candidate reduction phase cost O(z(m + n)) and O(n log n), respectively, where z is the order of the bounds. Thus, the complexity is O(t_r · (2/ε²) · ln((k − k′)(|B| − k + k′)/δ) + z(m + n) + 2n log n). The time complexity of the bottom-k based method is the same as that of the reverse sampling based approach, because we may need to explore all the samples in the worst case; however, it is usually much faster than the others in practice.
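Before turning to the experiments, here is a minimal sketch of the bottom-k early-stopping procedure described above (an illustration under our assumptions, not the BSRBK implementation); it reuses the hypothetical reverse_sample_check helper sketched earlier and stops once k − k′ candidate counters reach bk.

```python
import random

def bottom_k_topk(self_risk, diffusion, candidates, k_remaining, bk, t, seed=0):
    """Process reverse samples in ascending order of their hash values and stop early once
    k_remaining candidate counters reach bk; otherwise fall back to the largest counters."""
    rng = random.Random(seed)
    order = sorted((rng.random(), i) for i in range(t))      # (hash value, sample id) pairs
    counters = dict.fromkeys(candidates, 0)
    selected = []                                            # candidates in order of reaching bk
    for _, sample_id in order:
        hits = reverse_sample_check(self_risk, diffusion, candidates,
                                    seed=hash((seed, sample_id)))
        for v, hit in hits.items():
            if hit and v not in selected:
                counters[v] += 1
                if counters[v] >= bk:
                    selected.append(v)                       # v now has sufficient statistics
        if len(selected) >= k_remaining:                     # early stop (Theorem 6 extended)
            return selected[:k_remaining]
    # stopping condition never met: return the candidates with the largest counters
    return sorted(counters, key=counters.get, reverse=True)[:k_remaining]
```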
In this section, we conduct extensive experiments to evaluate the effectiveness and efficiency of our proposed methods.

Datasets. We conduct the experiments on 3 real-world financial datasets, i.e., Interbank, Fraud and Guarantee, and 5 public benchmark datasets with drastically varying sizes and characteristics. Fraud and Guarantee are our contributed datasets. The details of the 3 real-world financial datasets are as follows. • Interbank. The interbank network is generated by the maximum-entropy (ME) approach [19], in which each node represents a bank and each edge corresponds to an interbank loan from the lender bank to the borrower bank. • Fraud. The credit card fraud network, with 19,240 nodes and 34,892 edges, is constructed based on credit card fraud transactions of a major commercial bank. Each edge represents a trade between a consumer and a merchant. • Guarantee. The guaranteed-loan network is from a major commercial bank and spans 4 years. The names of the customers in the records are encrypted and replaced by IDs. We can access the guarantee relationships, each of which corresponds to an edge from the guarantor to the borrower. Besides, for the case studies, we also obtain basic profile information, such as the enterprise scale, and loan information, such as the loan credit. Besides the real-world financial datasets, we also employ 5 benchmark datasets, which are publicly available. We download Citation from the network repository; the others are downloaded from SNAP. The statistics of the datasets are shown in Table 2.

Algorithms. We evaluate the following algorithms to demonstrate the performance of the proposed techniques. • N (Naive). Algorithm 1 with a fixed sample size. • SN (Naive+Sample). Algorithm 1 with the sample size calculated by Equation 3. • SR (Sample+Reverse). The algorithm that uses the reverse sampling method with the candidate set derived by the second rule of Lemma 1. • BSR (Bound+Sample+Reverse). The optimized sampling method integrating reverse sampling and the bound-based filtering techniques, with the sample size calculated by Equation 4. • BSRBK (Bound+Sample+Reverse+Bottom-k). BSR further equipped with the bottom-k sketch based early stopping technique.

Parameters and workload. To evaluate the effectiveness of the proposed techniques, the precision is reported. For the ground truth, we use 20,000 sampled possible worlds to obtain the results, a setting commonly used in related research [14, 6, 5]. For the Fraud and Guarantee datasets, the self-risk and diffusion probabilities are obtained from our previous research [20, 15]. For the other datasets, the probabilities are randomly selected from [0, 1]. For the parameter k, we vary it from 1%|V| to 10%|V|, where |V| is the number of nodes of the corresponding graph. We set ε = 0.3 and δ = 0.1 for computing the sample size. In the experiments, we run all the methods on a server with two Intel E5-2680 CPUs, 512 GB of memory and the CentOS 7.2 operating system. In the system implementation, we develop the web backend server with Spring Cloud and the frontend with the JavaScript D3.js library. In the case studies, CNN-max, crDNN and HGAR are run on a GPU server with two Tesla P100 GPUs and are implemented in TensorFlow.

In this section, we tune the parameter bk and the order of the bounds on 4 datasets, i.e., Citation, Interbank, Fraud and Guarantee.

Tuning the parameter bk. As analyzed in the paper, the precision of BSRBK should converge rapidly with the increase of bk. We vary bk from 4 to 64. The results are shown in Figure 4; note that bk-X means bk is set to X. With the increase of bk, the algorithm converges quickly on all the datasets. When bk reaches 8, the drop in precision is already insignificant. Thus, in the following experiments, we set bk to 16.

Tuning the order of the bounds.
Since the tightness of the lower and upper bounds may greatly affect the sample size and computation cost, we conduct experiments to tune the order of the bounds. We vary the order of the bounds from 1 to 5 and set k to 5% of the number of nodes. The candidate size is reported. Figure 5 visualizes the results with heatmaps: the lighter the color, the smaller the number of candidates. As we can see, the candidate size decreases rapidly at the beginning and reaches a steady state when the order reaches 2 in most cases. Therefore, we set the order of the upper and lower bounds to 2 in the following experiments.

To demonstrate the efficiency of the proposed techniques, we conduct experiments on all the datasets and report the response time. The results are shown in Figure 6. For all methods, the computation time gradually increases with k, except for the naive approach N, because N uses a large fixed sample size; for the other methods, the sample size may change as k increases. As we can observe, algorithm N is the most time-consuming method, and the algorithms run faster as more accelerating techniques are involved. SR is better than SN because the reverse sampling technique and the derived candidate set can greatly reduce the sampling cost. BSR is better than SR, since we can further reduce the candidate space and the sample size by using the derived lower and upper bounds. BSRBK is better than BSR because of the novel stopping condition used. BSRBK always outperforms the others and achieves up to 100x acceleration. These observations strongly prove the advantages of the proposed techniques.

To evaluate the effectiveness of the proposed methods, the precision is reported by varying k from 1%|V| to 10%|V|. The results are shown in Figure 7. Generally, the precision of the 5 methods is very close, and the largest gap between the naive method N and BSRBK is only 3%. Compared with the speedup in efficiency, the precision difference is much less noticeable. The naive method N is slightly better than the other methods because it uses more samples. SN, SR and BSR report almost the same results, because they enjoy the same theoretical guarantee. It should be noted that for the Interbank dataset, 1%|V| = 1 and all methods successfully detect that node; therefore, the precision is 1, as shown in Figure 7(c). As observed, the experimental results prove that BSRBK achieves significant acceleration while keeping the reduction in precision tolerable.

In this section, we present a system, named VulnDS, built by integrating the proposed techniques with our current loan risk control system. The control system is deployed in our collaborating bank, which allows us to evaluate our methods in real scenarios. We first present the overall architecture of VulnDS and then describe the details of the system implementation. Finally, we report the interface, observations and case studies after system deployment.

Architecture Overview. Figure 8 shows the architecture overview of VulnDS in a loan management system. We collect original data from three upstream sources: the loan data warehouse, the data market and external loan data. In the pre-processing layer, raw records are extracted, merged and aggregated for risk control. We employ an in-memory database to store the frequently queried data, a graph database to preserve the networked relationships, and a relational DB for conventional tables. We utilize a monitoring platform for scheduling the tasks submitted from the pre-processing and risk control modules.
The risk assessment results are consumed by the tools and application platform, which is the main scenario for controlling loan risks. Different roles of business users access the system through a unified application interface. The risk control center consists of three main parts: the rule engine, the vulnerable nodes detection system and the evaluation module. The rule engine mainly includes the loan blacklist, whitelist and compliance rules. If a loan passes the rule check, it is then processed by our proposed vulnerable nodes detection system. VulnDS assesses the self-risk of each SME and the risk of the guarantee relationships, and detects the top-k vulnerable nodes by our methods. The evaluation module leverages the output of VulnDS to quantify the loan grant amount, time limit, interest ratio, etc. Once the bank issues a loan, the post-loan process is activated immediately. All three steps in the risk control center are employed to evaluate all issued loans regularly. In our implementation, all loans are examined monthly by the proposed VulnDS in the risk control center.

Implementation Details. Figure 9 shows an overview of the data association, which is extracted by the pre-processing layer. We employ the internal black and white lists from our collaborating bank. The rules are mainly under the compliance of the new Basel protocol [21]. In the vulnerable nodes detection system, we employ HGAR [10] for self-risk assessment and p-wkNN [15] to infer the probabilities of risky guarantee relationships. The proposed methods are utilized for the final vulnerable SME detection. In the implementation, we use Drools [22] on Apache Flink as the rule engine, with the hot data stored in Redis [23]. We employ neo4j as the graph database and visualize the graph with the open-source D3.js library and the ForceAtlas2 layout [24]. The training models and system implementation are written in Python, Java, and Scala.

System Deployment. Our proposed VulnDS is deployed in the loan management system of our collaborating bank. Figure 10 presents the system interface and its main components: the left part of Figure 10 presents the control and metric panel, including the risk statistics of each loan community and the control menus, while the right part displays the loan status monitoring screen. The node size indicates the delinquency probability of each company, which is dynamic and changes periodically according to the time window. Thus, risk managers can focus on risky and dominant companies.

In this section, we conduct case studies based on the deployed system. We directly observe labels from real-world behavior and validate the prediction results against the tagged labels. In previous research, many machine learning based methods have been developed for loan default prediction. To further demonstrate the performance of the proposed methods, we compare them with several baseline approaches designed for the default prediction task in real-world systems. The baseline methods include Wide [25], Wide and Deep [26], CNN-max [27], GBDT [28], crDNN [29], INDDP [15] and HGAR [10]. We also employ betweenness [30], PageRank [31], k-core [32] and influence maximization (InfMax) [14] as baselines. We conduct the experiments over a real-world dataset, i.e., the Guarantee dataset, which spans 4 years, from 2012 to 2016. As observed, most of the loans are repaid monthly.
Hence, we aggregate the behavior features within a one-month time window and mark the delinquent loans as the target labels. For the baseline methods, the training data is used to train the prediction models. For our methods, the training data is used to train the probabilities involved in the networks, whose details are given in our previous research [15]. The results are shown in Table 3.

Uncertain graph. The uncertain (probabilistic) graph, where each node or edge may appear with a certain probability, has been widely used to model graphs with uncertainty in a wide spectrum of graph applications. A large number of classical graph problems have been studied in the context of probabilistic graphs. For instance, [33] investigates the distance-constraint reachability problem in probabilistic graphs. [4] introduces a framework to address k nearest neighbors (kNN) queries on probabilistic graphs. [34] investigates the reliability problem based on conditional probability. In [5], the authors provide a comprehensive comparison between different reliability algorithms. The problem of vulnerable nodes detection has also been investigated in the context of network reliability (e.g., [35, 36, 37]). Nevertheless, their models are inherently different from ours, and the techniques cannot be trivially applied. The problem investigated in this paper is similar to the study of node influence under the independent cascade (IC) model [14, 6] in the sense that the influence of a node can be modeled by possible world semantics. Although a large body of work (e.g., [14, 38, 39, 40]) has been developed for the problem of influence maximization under the IC model, the proposed techniques cannot be applied to our problem due to the inherent differences between the two problems. Firstly, the nodes in the IC model do not carry any probability. Secondly, their focus is to select k nodes such that the spread of influence is maximized, while we aim to find the k nodes with the largest default probabilities.

Credit evaluation. Consumer credit risk evaluation is often technically addressed in a data-driven fashion and has been extensively investigated [41, 42]. Since the seminal "Partial Credit" model [43], numerous statistical approaches have been introduced for credit scoring: logistic regression, k-NN, neural networks, and support vector machines. More recently, [41] presents an in-depth analysis on interpreting the learned knowledge embedded in neural networks by using explanatory rules, and discusses how to visualize these rules. Researchers have also combined the debt-to-income ratio with consumer banking transactions and used a linear regression model over a time-windowed dataset to predict default rates in the near future; they claim an 85% default prediction accuracy and cost savings of between 6% and 25% [44].

Diffusion in finance. The relationship between network structure and financial system risk has been carefully studied and several insights have been drawn. Network structure has little impact on system welfare, but it is important in determining systemic risk and welfare in short-term debt [45]. Network theory has attracted more attention since the 2008 global financial crisis. The crisis triggered by Lehman Brothers infected connected corporations, which is similar to the 2002 Severe Acute Respiratory Syndrome (SARS) epidemic: both started from small damage, but hit a networked society and caused serious consequences [46, 47]. The dynamic network produced by overnight interbank loans may serve as an early alert of a crisis [48].
Moreover, research that aims to understand individual behaviors and interactions in social networks has also attracted extensive attention [49, 50]. Although preliminary efforts have been made to use network theory to understand fundamental problems in financial systems [51], there is little work on systemic risk analysis in networked-guarantee loans [52].

In this paper, we investigate the problem of top-k vulnerable nodes detection over uncertain graphs, which is very important for risk management in real-world applications. We formally define the problem by considering both the self-risk probabilities of the nodes and the propagation probabilities of defaults among graph nodes. Due to the hardness of the problem, a sampling based method is developed with a tight theoretical guarantee. To scale to large networks, effective pruning techniques and an advanced sampling method are proposed with rigorous theoretical analysis. To further accelerate the search, a bottom-k based approach is presented. We conduct extensive experiments over real-world datasets to demonstrate the efficiency and effectiveness of the proposed techniques. Moreover, the proposed techniques are integrated into our loan risk control system, which is deployed in a real financial environment. Through case studies, we further verify the advantages of the proposed model.

References
[1] Improved algorithms for maximal clique search in uncertain networks
[2] Efficient probabilistic supergraph search
[3] Injecting uncertainty in graphs for identity obfuscation
[4] K-nearest neighbors in uncertain graphs
[5] An in-depth comparison of s-t reliability algorithms over uncertain graphs
[6] Influence maximization on social graphs: A survey
[7] Delinquent events prediction in temporal networked-guarantee loans
[8] Determinants of the guarantee circles: The case of Chinese listed firms
[9] Loan 'guarantee chains' in China prove flimsy
[10] Risk assessment for networked-guarantee loans using high-order graph attention representation
[11] On the representation and querying of sets of possible worlds
[12] iConViz: Interactive visual exploration of the default contagion risk of networked-guarantee loans
[13] Risk guarantee prediction in networked-loans
[14] Maximizing the spread of influence through a social network
[15] Prediction defaults for networked-guarantee loans
[16] Fast reliability search in uncertain graphs
[17] Summarizing data using bottom-k sketches
[18] Maximizing social influence in nearly optimal time
[19] Filling in the blanks: Network structure and interbank contagion
[20] Credit card fraud detection using convolutional neural networks
[21] The Basel II risk parameters: Estimation, validation, stress testing - with applications to loan risk management
[22] Transforming model oriented program into Android source code based on Drools rule engine
[23] A forensic analysis method for Redis database based on RDB and AOF file
[24] ForceAtlas2, a continuous graph layout algorithm for handy network visualization
[25] Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization
[26] Wide & deep learning for recommender systems
[27] Joint deep modeling of users and items using reviews for recommendation
[28] LightGBM: A highly efficient gradient boosting decision tree
[29] A deep learning approach to competing risks representation in peer-to-peer lending
[30] Identifying top-k influential nodes in networks
[31] Deeper inside PageRank
[32] Edge manipulation approaches for k-core minimization: Metrics and analytics
[33] Distance-constraint reachability computation in uncertain graphs
[34] Conditional reliability in uncertain graphs
[35] Identifying vulnerable nodes of complex networks in cascading failures induced by node-based attacks
[36] Identification of k most vulnerable nodes in multi-layered network using a new model of interdependency
[37] Nodal vulnerability to targeted attacks in power grids
[38] Bring order into the samples: A novel scalable method for influence maximization
[39] Influence maximization in near-linear time: A martingale approach
[40] Efficient distance-aware influence maximization in geo-social networks
[41] Using neural network rule extraction and decision tables for credit-risk evaluation
[42] Statistical classification methods in consumer credit scoring: A review
[43] A Rasch model for partial credit scoring
[44] Consumer credit-risk models via machine-learning algorithms
[45] Financial connections and systemic risk
[46] Complex financial networks and systemic risk: A review
[47] Contagious chain risk rating for networked-guarantee loans
[48] Using social network knowledge for detecting spider constructions in social security fraud
[49] Complex networks in the study of financial and social systems
[50] The lifecycle and cascade of WeChat social messaging groups
[51] Social network, social trust and shared goals in organizational knowledge sharing
[52] NetRating: Credit risk evaluation for loan guarantee chain in China