title: LoPAD: A Local Prediction Approach to Anomaly Detection
authors: Lu, Sha; Liu, Lin; Li, Jiuyong; Le, Thuc Duy; Liu, Jixue
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47436-2_50

Dependency-based anomaly detection methods detect anomalies by looking at deviations from the normal probabilistic dependency among variables and are able to discover more subtle and meaningful anomalies. However, with high dimensional data, they face two key challenges. One is how to find the right set of relevant variables for a given variable from the large search space to assess dependency deviation. The other is how to use the dependency to estimate the expected value of a variable accurately. In this paper, we propose the Local Prediction approach to Anomaly Detection (LoPAD) framework to deal with the two challenges simultaneously. By introducing the Markov Blanket into dependency-based anomaly detection, LoPAD decomposes the high dimensional unsupervised anomaly detection problem into local feature selection and prediction problems, while achieving better performance and interpretability. The framework enables instantiations with off-the-shelf predictive models for anomaly detection. Comprehensive experiments have been done on both synthetic and real-world data. The results show that LoPAD outperforms state-of-the-art anomaly detection methods.

According to [7], anomalies are patterns in data that do not conform to a well-defined notion of normal behavior. The mainstream methods for anomaly detection, e.g., LOF [5], are based on the proximity between objects. These methods evaluate the anomalousness of an object through its distance or density within its neighborhood. If an object stays far away from other objects or lies in a sparse neighborhood, it is more likely to be an anomaly [1].

Another research direction in anomaly detection is to exploit the dependency among variables, which has shown successful applications in various fields [1]. Dependency-based methods first discover the variable dependency possessed by the majority of objects; the anomalousness of an object is then evaluated by how well it follows this dependency. Objects whose variable dependency significantly deviates from the normal dependency are flagged as anomalies. These methods can detect certain anomalies that cannot be discovered through proximity, because although these anomalies violate the dependency, they may still be located in a dense neighborhood.

A way to measure dependency deviation is to examine the difference between the observed value and the expected value of an object, where the expected value is estimated based on the underlying dependency [1]. Specifically, for an object, the expected value of a given variable is estimated using the values of a set of other variables of the object. Here, we call the given variable the target variable, and the set of other variables the relevant variables.
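To make the idea of dependency deviation concrete, below is a minimal, self-contained Python sketch (not from the paper; all names and numbers are illustrative assumptions) with a single target variable and a single relevant variable. The expected value is estimated with a simple regression model, and the deviation is the gap between observed and expected values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=500)                      # relevant variable
y = 2 * x + rng.normal(scale=0.1, size=500)   # target variable normally follows y ~ 2x

# Inject one object whose marginal values look ordinary but whose
# dependency between x and y is violated.
x = np.append(x, 1.0)
y = np.append(y, -2.0)

# Expected value of the target estimated from the relevant variable.
model = LinearRegression().fit(x[:, None], y)
expected = model.predict(x[:, None])
deviation = np.abs(y - expected)              # gap between observed and expected
print(int(np.argmax(deviation)))              # 500: the injected object ranks first
```

Note that the injected point has unremarkable values on both variables individually; it only stands out once the expected value implied by the dependency is taken into account.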
Relevant variable selection and expected value estimation are the two critical steps of dependency-based anomaly detection, as they play a decisive role in detection performance. However, they have not been well addressed by existing methods. Relevant variable selection faces a dilemma in high dimensional data. On the one hand, it is expected that the complete dependency, i.e., the dependency between a target variable and all the other variables, is utilized to discover anomalies accurately. On the other hand, it is common in real-world data that only some variables are relevant to the data generation mechanism of the target variable. Irrelevant variables contribute little or nothing to the anomaly score, and can even have a negative impact on the effectiveness [18]. How to find the set of most relevant variables that can capture the complete dependency around a target variable is a challenge, especially in high dimensional data, given the large number of possible subsets of variables.

A naive approach is to use all other variables as the relevant variables for a target variable, as the ALSO algorithm [12] does. However, doing so leads to two major problems. Firstly, it is computationally expensive to build prediction models in high dimensional data. Secondly, conditioning on all other variables means irrelevant variables can affect the detection accuracy. Another approach is to select a small set of relevant variables. COMBN [2] is a typical method in this category. COMBN uses the set of all direct cause variables of a target in a Bayesian network as the relevant variables. However, selecting only a small subset of variables may miss some important dependencies, resulting in poor detection performance too.

To deal with these problems, we propose an optimal attribute-wise method, LoPAD (Local Prediction approach to Anomaly Detection), which innovatively introduces the Markov Blanket (MB) and predictive models into anomaly detection to enable the use of off-the-shelf classification methods to solve the high dimensional unsupervised anomaly detection problem. The MB is a fundamental concept in Bayesian network (BN) theory [13]. For any variable X in a BN, the MB of X, denoted as MB(X), comprises its parents (direct causes), children (direct effects) and spouses (the other parents of X's children). Given MB(X), X is conditionally independent of all the other variables, which means MB(X) encodes the complete dependency of X. Therefore, in LoPAD, we propose to use MB(X) as the relevant variables of X. In high dimensional data, MB(X) usually has a much lower dimensionality than the dataset itself, which enables LoPAD to deal with high dimensional data. Moreover, using MB(X), LoPAD can achieve a more accurate estimation of the expected value of X. The study in [9] has shown that MB(X) is the optimal feature set for a prediction model of X in the sense of minimizing the amount of predictive information loss. Therefore, we propose to predict the expected value of X with a prediction model that uses MB(X) as the predictors. Note that LoPAD is not limited to a specific prediction algorithm, which means a variety of off-the-shelf prediction methods can be utilized, and this relaxes the restrictions on data distributions and data types.

In summary, by using the MB of a variable, LoPAD simultaneously solves the two challenges in dependency-based anomaly detection, relevant variable selection and expected value estimation. The main contributions of this work are as below:

- Through introducing the Markov Blanket into dependency-based anomaly detection, we decompose the high dimensional unsupervised anomaly detection problem into local feature selection and prediction problems, which also provides a better interpretation of detected anomalies.
- We develop an anomaly detection framework, LoPAD, to efficiently and effectively discover anomalies in high dimensional data of different types.
- We present an instantiated algorithm based on the LoPAD framework and conduct extensive experiments on a range of synthetic and real-world datasets to demonstrate the effectiveness and efficiency of LoPAD.

In this paper, we use an upper case letter, e.g., X, to denote a variable; a lower case letter, e.g., x, for a value of a variable; a boldfaced upper case letter, e.g., $\mathbf{X} = \{X_1, X_2, \ldots, X_m\}$, for a set of variables; and a boldfaced lower case letter, e.g., $\mathbf{x} = (x_1, x_2, \ldots, x_m)$, for a value vector of a set of variables. We reserve the letter D for a data matrix of n objects and m variables, $\mathbf{x}_i$ for the i-th row vector (data point or object) of D, and $x_{ij}$ for the j-th element in $\mathbf{x}_i$.

In LoPAD, the anomalousness of an object is evaluated based on the deviation of its observed values from its expected values. There are two types of deviations, value-wise deviation and vector-wise deviation, as defined below. Given an object $\mathbf{x}_i$, its value-wise deviation with respect to variable $X_j$ is defined as:

$\delta_{ij} = |x_{ij} - \hat{x}_{ij}|$ (1)

$\hat{x}_{ij} = g(\mathbf{x}_i[\mathbf{X}'])$ (2)

where $x_{ij}$ is the observed value of $X_j$ in $\mathbf{x}_i$, and $\hat{x}_{ij}$ is the expected value of $X_j$ estimated using the function g() based on the values of $\mathbf{x}_i$ on a set of other variables $\mathbf{X}' \subseteq \mathbf{X} \setminus \{X_j\}$. The vector-wise deviation of $\mathbf{x}_i$ is the aggregation of all its value-wise deviations, calculated using a combination function C() as follows:

$\delta_i = C(\delta_{i1}, \delta_{i2}, \ldots, \delta_{im})$ (3)

From the above definitions, we see that value-wise deviation evaluates how well an object follows the dependency around a specific variable, and vector-wise deviation evaluates how well an object collectively follows the dependencies. Based on these definitions, we can now define the research problem of this paper. Given a dataset D with n objects and a user-specified parameter k, our goal is to detect, as anomalies, the top-k objects ranked in descending order of their vector-wise deviations.

To obtain the value-wise deviations of an object, two problems need to be addressed. One is how to find the right set of relevant variables of a target variable, i.e., $\mathbf{X}'$ in Eq. 2, which should completely and accurately represent the dependency of $X_j$ on other variables. For high dimensional data, this is more challenging as the number of subsets of $\mathbf{X} \setminus \{X_j\}$ increases exponentially with the number of variables in a dataset. The other problem is how to use the selected relevant variables to make an accurate estimation of the expected value.
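As a minimal sketch of the two deviations defined in Eqs. 1-3, assuming g() is any fitted regressor with a scikit-learn style predict() method, the computation could look as follows. The summation used here is just one possible combination function C(); the combination adopted by the LoPAD algorithm itself (a sum of positive normalized deviations) is described later.

```python
import numpy as np

def value_wise_deviation(D, j, model, relevant_cols):
    """Eqs. (1)-(2): |x_ij - g(x_i on the relevant variables of X_j)| for all objects i."""
    expected = model.predict(D[:, relevant_cols])   # hat{x}_ij for every object
    return np.abs(D[:, j] - expected)

def vector_wise_deviation(deltas, combine=np.sum):
    """Eq. (3): aggregate the m value-wise deviations of each object into one score."""
    # deltas: n x m matrix of delta_ij values; combine is the combination function C()
    return combine(deltas, axis=1)
```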
The LoPAD framework adapts optimal feature selection and supervised machine learning techniques to detect anomalies in three phases: (1) relevant variable selection for each variable $X_j$ using an optimal feature selection technique; (2) estimation of the expected value of $X_j$ using the selected variables with a predictive model; (3) anomaly score generation.

In the first phase, the goal is to select the optimal relevant variables for a target variable. We first introduce the concept of the MB, then explain why the MB is the set of optimal relevant variables. Markov Blankets are defined in the context of a Bayesian network (BN) [13]. A BN is a type of probabilistic graphical model used to represent and infer the dependency among variables. A BN is denoted as a pair (G, P), where G is a Directed Acyclic Graph (DAG) showing the structure of the BN and P is the joint probability of the nodes in G. Specifically, G = (V, E), where V is the set of nodes representing the random variables in the domain under consideration, and E ⊆ V × V is the set of arcs representing the dependency among the nodes. $X_1 \in \mathbf{V}$ is known as a parent of $X_2 \in \mathbf{V}$ (or $X_2$ is a child of $X_1$) if there exists an arc $X_1 \rightarrow X_2$. In a BN, given all its parents, a node X is conditionally independent of all its non-descendant nodes, which is known as the Markov condition of a BN. Based on this condition, the joint probability distribution of V can be decomposed into the product of conditional probabilities as follows:

$P(\mathbf{V}) = \prod_{X \in \mathbf{V}} P(X \mid Pa(X))$ (4)

where Pa(X) is the set of all parents of X. For any variable $X \in \mathbf{V}$ in a BN, its MB contains all the parents, children, and spouses of X, and is denoted as MB(X). Given MB(X), X is conditionally independent of all other variables in V, i.e.,

$P(X \mid MB(X), \mathbf{Y}) = P(X \mid MB(X))$, for any $\mathbf{Y} \subseteq \mathbf{V} \setminus (MB(X) \cup \{X\})$ (5)

According to Eq. 5, MB(X) carries all the information needed to estimate the probability of X, making X independent of the remaining variables, which makes MB(X) the minimal set of relevant variables for capturing the complete dependency of X.

The second phase aims to estimate the expected value of a variable in an object (defined in Eq. 2) using the selected variables. The function g() in Eq. 2 is implemented with a prediction model. Specifically, for each variable, a prediction model is built to predict the expected value of the variable using the selected relevant variables as predictors. A large number of off-the-shelf prediction models can be chosen to suit the requirements of the data. By doing so, we decompose the anomaly detection problem into individual prediction/classification problems.

In the third phase, the vector-wise deviation, i.e., the anomaly score, is obtained by applying a combination function over the value-wise deviations. Various combination functions can be used in the LoPAD framework, such as the maximum, the average, or a weighted summation. A detailed study on the impact of different combination functions on the performance of anomaly detection can be found in [10].

As shown in Algorithm 1, we present an instantiation of the LoPAD framework, i.e., the LoPAD algorithm. Given an input dataset D, for each variable, its relevant variable selection is done at Line 3, then a prediction model is built at Line 4. In Lines 5 to 8, value-wise deviations are computed for all the objects. In Line 10, the value-wise deviations are normalized. In Lines 11 to 13, vector-wise deviations are obtained by combining the value-wise deviations. At Line 14, the top-k scored objects are output as the identified anomalies. As anomalies are rare in a dataset, although LoPAD uses the dataset with anomalies to discover MBs and train the prediction models, the impact of anomalies on MB learning and model training is limited.

For the LoPAD algorithm, we use the fast-IAMB method [16] to learn MBs. For estimating expected values, we adopt the CART regression tree [4] to enable the LoPAD algorithm to cope with both linear and non-linear dependency. It is noted that regression models are notorious for being affected by outliers in the training set. We adopt Bootstrap aggregating (also known as bagging) [3] to mitigate this problem and achieve better prediction accuracy.
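The following is a condensed sketch of the three phases, written in Python with scikit-learn (version 1.2 or later for the `estimator` argument). It is not the authors' implementation (which is in R): it assumes a user-supplied `discover_mb(D, j)` that returns the column indices of MB(X_j) (the paper uses fast-IAMB, but any MB learner could be plugged in), and it approximates bagged CART with BaggingRegressor over DecisionTreeRegressor, loosely mirroring the tree settings reported in the experiments. The normalization and combination in the final lines are described in the next paragraphs.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

def lopad_scores(D, discover_mb, n_trees=25):
    """Return one anomaly score per row of D (higher = more anomalous)."""
    n, m = D.shape
    deviations = np.zeros((n, m))
    for j in range(m):
        mb = discover_mb(D, j)                       # Phase 1: relevant variables MB(X_j)
        if len(mb) == 0:                             # fallback if no MB is found
            deviations[:, j] = np.abs(D[:, j] - D[:, j].mean())
            continue
        model = BaggingRegressor(                    # Phase 2: bagged CART prediction
            estimator=DecisionTreeRegressor(min_samples_split=20, min_samples_leaf=7),
            n_estimators=n_trees,
            random_state=0,
        ).fit(D[:, mb], D[:, j])
        deviations[:, j] = np.abs(D[:, j] - model.predict(D[:, mb]))

    # Phase 3: z-score normalisation per variable, then sum the positive deviations
    z = (deviations - deviations.mean(axis=0)) / (deviations.std(axis=0) + 1e-12)
    return np.clip(z, 0.0, None).sum(axis=1)

# Top-k anomalies: np.argsort(-lopad_scores(D, discover_mb))[:k]
```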
Before computing the vector-wise deviations, the value-wise deviations need to be normalized. Specifically, for each object $\mathbf{x}_i$ and each target variable $X_j$, $\delta_{ij}$ is normalized as a Z-score using the mean and standard deviation of $\delta_j$. After normalization, negative values represent small deviations. As we are only interested in large deviations, the vector-wise deviation is obtained by summing up the positive normalized value-wise deviations as follows:

$\delta_i = \sum_{j:\, \tilde{\delta}_{ij} > 0} \tilde{\delta}_{ij}$ (6)

where $\tilde{\delta}_{ij}$ denotes the normalized value-wise deviation of $\mathbf{x}_i$ on $X_j$.

The time complexity of the LoPAD algorithm mainly comes from two sources: learning the MBs and building the prediction models. For a dataset with n objects and m variables, the complexity of MB discovery using fast-IAMB is $O(m^2\lambda)$ [15], where $\lambda$ is the average size of the MBs. The complexity of building m prediction models is $O(m\lambda n \log n)$ [4]. Therefore, the overall complexity of the LoPAD algorithm is $O(m^2\lambda) + O(m\lambda n \log n)$.

Data Generation. For synthetic data, 4 benchmark BNs from the bnlearn repository [14] are used to generate linear Gaussian distributed datasets. For each BN, 20 datasets with 5000 objects each are generated. Then the following process is followed to inject anomalies. Firstly, 1% of the objects and 10% of the variables are randomly selected. Then anomalous values are injected into the selected objects on these selected variables. The injected anomalous values are uniformly distributed values within the range between the minimum and maximum values of the selected variables. In this way, the values of anomalies are still in the original range of the selected variables, but their dependency with other variables is violated. For each BN, the average ROC AUC (area under the ROC curve) over the 20 datasets is reported.

For real-world data, we choose 13 datasets (Table 1) that cover diverse domains, e.g., spam detection, molecular bioactivity detection, and image object recognition. AID362, backdoor, mnist and caltech16 are obtained from the Kaggle dataset repository, and the others are retrieved from the UCI repository [8]. These datasets are often used in the anomaly detection literature. We follow the common process to obtain the ground-truth anomaly labels, i.e., using the samples in a majority class as normal objects, and a small class, or down-sampled objects from a class, as anomalies. Categorical features are converted into numeric ones by one-hot (1-of-n) encoding [6]. If the number of objects in the anomaly class is more than 1% of the number of normal objects, we randomly down-sample the anomaly class to that number (1% of the normal objects). Experiments are repeated 20 times, and the average AUC is reported. If the ratio of anomalies is less than 1%, the experiment is conducted once, which is the case for the wine, AID362 and arrhythmia datasets.
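A small sketch of the anomaly-injection procedure described above, assuming D is an n x m NumPy array of normal synthetic data: roughly 1% of the objects and 10% of the variables receive uniform values drawn within each selected variable's observed range, so the injected values stay in range while the dependency is broken.

```python
import numpy as np

def inject_anomalies(D, obj_ratio=0.01, var_ratio=0.10, seed=0):
    """Return a copy of D with injected anomalies and a 0/1 label vector."""
    rng = np.random.default_rng(seed)
    D = D.copy()
    n, m = D.shape
    rows = rng.choice(n, size=max(1, int(n * obj_ratio)), replace=False)
    cols = rng.choice(m, size=max(1, int(m * var_ratio)), replace=False)
    for j in cols:
        lo, hi = D[:, j].min(), D[:, j].max()
        D[rows, j] = rng.uniform(lo, hi, size=len(rows))   # in-range but dependency-violating
    labels = np.zeros(n, dtype=int)
    labels[rows] = 1                                       # 1 marks injected anomalies
    return D, labels
```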
Comparison Methods. The comparison methods include the dependency-based methods ALSO [12] and COMBN [2], and the proximity-based methods MBOM [17], iForest [11] and LOF [5]. The major difference among LoPAD, ALSO and COMBN is the choice of relevant variables: ALSO uses all the remaining variables and COMBN uses the parent variables, while LoPAD utilizes MBs. The effectiveness of using MBs in LoPAD is validated by comparing LoPAD with ALSO. MBOM and iForest are proximity-based methods that detect anomalies based on density in subspaces. LOF is a classic density-based method, which is used as the baseline method. In the experiments (including the sensitivity tests), we adopt the commonly used or recommended parameters from the original papers. For a fair comparison, both LoPAD and ALSO adopt the CART regression tree [4] with bagging. In CART, the minimum number of objects required to attempt a split is set to 20, the minimum number of objects in a leaf (bucket) is 7, and the complexity parameter is set to 0.03. The number of CART trees in bagging is set to 25. In MBOM and LOF, the number of nearest neighbors is set to 10. For iForest, the number of trees is set to 100, without subsampling. All algorithms are implemented in R 3.5.3 and run on a computer with a 3.5 GHz (12 cores) CPU and 32 GB memory.

Performance Evaluation. The experimental results are shown in Table 2. If a method could not produce a result within 2 hours, we terminated the experiment. Such cases occur for COMBN and are shown as '-' in Table 2. LoPAD yields 13 best results (out of 17) and achieves the best average AUC of 0.859 with the smallest standard deviation of 0.027. Overall, the dependency-based methods (LoPAD, ALSO and COMBN) perform better than the proximity-based methods (MBOM, iForest and LOF). Compared with ALSO, LoPAD improves the AUC by 4.2%, which is attributed to the use of MBs. COMBN yields two best results, but its high time complexity makes it unable to produce results for several datasets. Comparing LoPAD with MBOM, LoPAD performs significantly better, with a 15.3% AUC improvement. Although iForest achieves the best results among the proximity-based methods, LoPAD has a 9.1% AUC improvement over it. As for LOF, LoPAD has a 14.6% AUC improvement over it. The average size of the MBs is much smaller than the original dimensionality on all datasets, which means that, compared to ALSO, LoPAD works with much lower dimensionality but still achieves the best results in most cases.

We apply the Wilcoxon rank sum test to the results of the 17 datasets (4 synthetic and 13 real-world datasets) by pairing LoPAD with each of the other methods. The alternative hypothesis is that the results of LoPAD come from a distribution whose mean is greater than that of the compared method. The p-values are 0.0005 with ALSO, 0.0001 with MBOM, 0.0599 with COMBN, 0.0007 with iForest and 0.0002 with LOF. The p-value with COMBN is not reliable because of the small number of results (COMBN is unable to produce results for 5 out of 21 datasets). Except for COMBN, all the p-values are far less than 0.05, indicating that LoPAD performs significantly better than the other methods.

The running time on these datasets is shown in Table 3. Overall, the dependency-based methods are slower because they need extra time to learn the MBs or the BN and the prediction models. COMBN is unable to produce results within 2 hours on 5 datasets.

Sensitivity Test. The results of the sensitivity tests are shown in Fig. 1. In Fig. 1(a), the number of variables with injected anomalous values ranges from 1 to 20, while the ratio of anomalies is fixed to 1%. In Fig. 1(b), the ratio of anomalies ranges from 1% to 10%, while the number of anomalous variables is fixed to 10. In Fig. 1(c), the dimensionality ranges from 100 to 1000, while the number of variables with injected anomalous values is 10 and the ratio of anomalies is fixed to 1%. Overall, all methods follow similar trends in terms of their sensitivity to these parameters, and LoPAD shows consistent results that are better than those of the comparison methods in most cases.
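As an illustration of the rank sum comparison reported under Performance Evaluation above (not a reproduction of the reported p-values), the one-sided test could be run per method pair as follows; the AUC arrays are placeholders, and SciPy's mannwhitneyu is used since the Mann-Whitney U test is the rank sum test.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder per-dataset AUCs for LoPAD and one competitor (17 values in the paper).
auc_lopad = np.array([0.90, 0.85, 0.88, 0.93, 0.81])
auc_other = np.array([0.80, 0.82, 0.79, 0.88, 0.75])

# One-sided alternative: LoPAD's AUCs tend to be greater than the competitor's.
stat, p_value = mannwhitneyu(auc_lopad, auc_other, alternative="greater")
print(p_value)
```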
Anomaly Interpretation. One advantage of LoPAD is the interpretability of detected anomalies. For a detected anomaly, the variables with high deviations can be utilized to explain why it is anomalous. The difference between the expected and the observed values of these variables indicates the strength and direction of the deviation. We use the result on the mnist dataset as an example to show how to interpret an anomaly detected by LoPAD. In mnist, each object is a 28 × 28 grey-scale image of a handwritten digit. Each pixel is a variable whose value ranges from 0 to 255, where 0 corresponds to white and 255 to black. In our experiment, 7 is the normal class and 0 is the anomaly class. Figure 2(b) shows the top-ranked anomaly found by LoPAD (a digit 0) and Fig. 2(c) shows its expected values. In Fig. 2(d) (which also shows the top-ranked anomaly as in Fig. 2(b)), the pixels indicated with a red dot or a green cross are the top-100 deviated pixels (variables). The green pixels have negative deviations, i.e., their observed values are much smaller than their expected values, which means that, according to the underlying dependency, these pixels are expected to be darker. The red pixels have positive deviations, i.e., their observed values are much bigger than their expected values, which means they are expected to be much lighter. We can use these pixels, i.e., the variables with high deviations, to understand why this image is an anomaly, as explained in the following. In Fig. 2(d), the highly deviated pixels concentrate in the three areas marked by the blue ellipses. These areas are visually indeed the areas where the observed object (Fig. 2(b)) and its expected values (Fig. 2(c)) differ the most. Comparing Fig. 2(d) with Fig. 2(a), we can see that the anomalousness is mainly located in the three areas: (1) in area 1, the stroke is not supposed to be totally closed; (2) the little 'tail' in area 2 is not expected; (3) the stroke in area 3 should move a little to the left. In summary, this example shows that the deviations from the normal dependency among variables can be used to explain the causes of anomalies.
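The interpretation step described above can be sketched as follows, assuming `observed` and `expected` are the value vectors of one detected anomaly (e.g., the 784 pixel values of an mnist image) and `z` holds its normalized value-wise deviations; the function returns the most deviated variables together with the direction of each deviation.

```python
import numpy as np

def top_deviated_variables(observed, expected, z, k=100):
    """Return the k most deviated variables of one anomaly, with deviation direction."""
    idx = np.argsort(-z)[:k]                       # largest normalized deviations first
    direction = np.where(observed[idx] > expected[idx],
                         "higher than expected",   # e.g. a pixel darker than expected
                         "lower than expected")
    return list(zip(idx.tolist(), direction.tolist()))
```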
The dependency-based anomaly detection approach works under the assumption that normal objects follow the dependency among variables, while anomalies do not. The key challenge in applying this approach is how to decide the predictors of a target variable, especially in high dimensional data. However, existing research has not paid attention to how to choose an optimal set of relevant variables. Existing methods either use all the other variables, such as ALSO [12], or a small subset of variables, such as COMBN [2]. An inappropriate choice of predictors has a negative impact on the effectiveness and efficiency of anomaly detection, as indicated by the experiments in Sect. 3. In this paper, we tackle this issue by innovatively introducing MBs as the relevant variables.

Apart from the dependency-based approach, the mainstream anomaly detection methods are proximity-based, such as LOF [5]. These methods work under the assumption that normal objects are in a dense neighborhood, while anomalies stay far away from other objects or in a sparse neighborhood [7]. Building upon these different assumptions, the key difference between the dependency-based and proximity-based approaches is that the former considers the relationships among variables, while the latter relies on the relationships among objects. A branch of the proximity-based approach, subspace-based methods, partially utilizes dependency in anomaly detection. In high dimensional data, the distances among objects lose their discriminative power as the dimensionality increases (known as the curse of dimensionality). To address this problem, some subspace-based methods have been proposed [18] to detect anomalies based on the proximity with respect to subsets of variables, i.e., subspaces. However, although subspace-based anomaly detection methods make use of variable dependency, they use the dependency to determine subspaces, instead of measuring anomalousness. Often these methods find a subset of correlated variables as a subspace, and then still use proximity-based methods to detect outliers in each subspace. For example, with MBOM [17], a subspace contains a variable and its MB, and LOF is used to evaluate anomalousness in each such subspace. Another novel subspace-based anomaly detection method, iForest [11], randomly selects subsets of variables as subspaces and shows good performance in both effectiveness and efficiency.

In this paper, we have proposed an anomaly detection method, LoPAD, which divides and conquers the high dimensional anomaly detection problem with Markov Blanket learning and off-the-shelf prediction methods. Through using the MB as the relevant variables of a target variable, LoPAD ensures that the complete dependency is captured and utilized. Moreover, as MBs are the optimal feature selection sets for prediction tasks, LoPAD also ensures a more accurate estimation of the expected values of variables. Introducing the MB into dependency-based anomaly detection provides sound theoretical support for the most critical steps of dependency-based methods. Additionally, the results of the comprehensive experiments conducted in this paper have demonstrated the superior performance and efficiency of LoPAD compared to state-of-the-art anomaly detection methods.

References
[1] Outlier analysis
[2] Mining causal outliers using Gaussian Bayesian networks
[3] Bagging predictors
[4] Classification and Regression Trees
[5] LOF: identifying density-based local outliers
[6] On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study
[7] Anomaly detection: a survey
[8] UCI machine learning repository
[9] Causal feature selection
[10] Interpreting and unifying outlier scores
[11] Isolation forest
[12] A decomposition of the outlier detection problem into a set of supervised learning problems
[13] Causality: Models, Reasoning and Inference
[14] Bayesian network repository
[15] Algorithms for large scale Markov blanket discovery
[16] Speculative Markov blanket discovery for optimal feature selection
[17] Markov boundary-based outlier mining
[18] A survey on unsupervised outlier detection in high-dimensional numerical data

Acknowledgments. We acknowledge the Australian Government Training Program Scholarship, and the Data to Decisions CRC (D2DCRC), Cooperative Research Centres Programme, for funding this research. The work has also been partially supported by ARC Discovery Project DP170101306.