title: Correlation-Aware Deep Generative Model for Unsupervised Anomaly Detection
authors: Fan, Haoyi; Zhang, Fengbin; Wang, Ruidong; Xi, Liang; Li, Zuoyong
date: 2020-04-17
journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47436-2_52

Abstract. Unsupervised anomaly detection aims to identify anomalous samples from highly complex and unstructured data, which is pervasive in both fundamental research and industrial applications. However, most existing methods neglect the complex correlation among data samples, which is important for capturing the normal patterns from which the abnormal ones deviate. In this paper, we propose Correlation-aware unsupervised Anomaly detection via a Deep Gaussian Mixture Model (CADGMM), which captures the complex correlation among data points for high-quality low-dimensional representation learning. More specifically, the relations among data samples are first encoded in the form of a graph structure, in which a node denotes a sample and an edge denotes the correlation between two samples in the feature space. Then, a dual encoder, consisting of a graph encoder and a feature encoder, is employed to jointly encode both the feature and correlation information of the samples into a low-dimensional latent space, followed by a decoder for data reconstruction. Finally, a separate estimation network, in the form of a Gaussian Mixture Model, is utilized to estimate the density of the learned latent vectors, and anomalies can be detected by measuring the energy of the samples. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed method.

Anomaly detection aims at identifying abnormal patterns that deviate significantly from normal behavior, a task that is ubiquitous in a multitude of application domains, such as cyber-security [15], medical care [19], and surveillance video profiling [14]. Formally, the anomaly detection problem can be viewed as density estimation from the data distribution [23]: anomalies tend to reside in low-probability-density areas. Although anomaly detection has been well studied in the machine learning community, how to conduct unsupervised anomaly detection effectively on highly complex and unstructured data is still a challenge.

Unsupervised anomaly detection aims to detect outliers without labeled data, targeting the common real-world scenario in which only a small amount of labeled anomalous data is available alongside plenty of unlabeled data. Existing methods for unsupervised anomaly detection can be divided into three categories: reconstruction-based methods, clustering-based methods, and one-class-classification-based methods. Reconstruction-based methods, such as PCA [5]-based approaches [10, 18] and autoencoder-based approaches [20-23], assume that outliers cannot be effectively reconstructed from compressed low-dimensional projections. Clustering-based methods [6, 17] aim at density estimation of the data points and usually adopt a two-step strategy [3] that first performs dimensionality reduction and then clustering. Different from the previous categories, one-class-classification-based methods [1, 7, 11] learn a discriminative boundary between the normal and abnormal instances. Although the above methods have had their fair share of success in anomaly detection, most of them neglect the complex correlation among data samples.
As shown in Fig. 1, conventional methods attempt to conduct feature learning on the original observed feature space of the data samples, while the correlation among similar samples is ignored. This correlation could be exploited during feature learning by propagating more representative features from a sample's neighbors to generate a high-quality embedding for anomaly detection. However, modeling the correlation among samples is far different from conventional feature learning, because a highly non-linear structure needs to be captured. Therefore, how to effectively incorporate both the original features and the relational structure of samples into an integrated feature-learning framework for anomaly detection is still an open problem.

To alleviate the above-mentioned problems, in this paper we propose Correlation-aware unsupervised Anomaly detection via a Deep Gaussian Mixture Model (CADGMM), which considers both the original features and the complex correlation among data samples for feature learning. Specifically, the relations among data samples are first encoded in the form of a graph structure, in which a node denotes a sample and an edge denotes the correlation between two samples in the feature space. Then, a dual encoder, consisting of a graph encoder and a feature encoder, is employed in CADGMM to jointly encode both the features and the correlations of the samples into a low-dimensional latent space, followed by a decoder for data reconstruction. Finally, a separate estimation network in the form of a Gaussian Mixture Model is utilized to estimate the density of the learned latent embedding. To verify the effectiveness of our algorithm, we conduct experiments on multiple real-world datasets. The experimental results demonstrate that, by considering the correlation among data samples, CADGMM significantly outperforms the state of the art on unsupervised anomaly detection tasks.

In this section, we formally define the frequently used notations and the studied problem.

Problem 1 (Anomaly detection). Given a set of input samples X = {x_i | i = 1, ..., N}, each of which is associated with an F-dimensional feature x_i ∈ R^F, we aim to learn a score function u(x_i): R^F → R and to classify sample x_i based on a threshold λ:

$$
y_i =
\begin{cases}
1, & \text{if } u(x_i) \geq \lambda, \\
0, & \text{otherwise,}
\end{cases}
\qquad (1)
$$

where y_i denotes the label of sample x_i, with 0 being the normal class and 1 the anomalous class. A minimal code sketch of this decision rule is given below.

In this section, we introduce the proposed CADGMM in detail. CADGMM is an end-to-end joint representation learning framework for unsupervised anomaly detection. As shown in Fig. 2, CADGMM consists of three modules: a dual encoder, a feature decoder, and an estimation network. Specifically, the relations among data samples in the original feature space are first encoded in the form of a graph structure, in which a node denotes a sample and an edge denotes the correlation between two samples in the feature space. Then, a dual encoder consisting of a graph encoder and a feature encoder is employed to jointly encode both the feature and correlation information of the samples into a low-dimensional latent space, followed by a feature decoder for sample reconstruction. Finally, a separate estimation network is utilized to estimate the density of the learned latent embedding in the framework of a Gaussian Mixture Model, and anomalies are detected by measuring the energy of the samples with respect to a given threshold.
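As an illustration of the decision rule in Eq. (1), the following minimal Python sketch (our own illustration, not code from the paper; the function name and the quantile-based choice of λ are assumptions) flags the highest-scoring fraction of samples as anomalous, matching the top-k thresholding strategy described later in the paper:

```python
import numpy as np

def label_anomalies(scores: np.ndarray, anomaly_ratio: float) -> np.ndarray:
    """Apply Eq. (1): y_i = 1 if u(x_i) >= lambda, else 0.

    Here lambda is chosen as the (1 - anomaly_ratio)-quantile of the scores,
    so that the expected fraction of flagged samples equals anomaly_ratio.
    """
    lam = np.quantile(scores, 1.0 - anomaly_ratio)  # threshold lambda in Eq. (1)
    return (scores >= lam).astype(int)              # 1 = anomalous, 0 = normal

# Example: flag the top 10% highest-scoring samples as anomalies.
rng = np.random.default_rng(0)
y_pred = label_anomalies(rng.normal(size=1000), anomaly_ratio=0.1)
```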
To exploit the correlation among unstructured data samples for feature learning, we explicitly construct a graph structure that connects similar samples in the feature space. More specifically, given a set of input samples X = {x_i | i = 1, ..., N}, we employ the K-NN algorithm on each sample x_i to determine its K nearest neighbors N_i = {x_i^k | k = 1, ..., K} in the feature space. Then, an undirected edge is assigned between x_i and each of its neighbors x_i^k. Finally, an undirected graph G = {V, E, X} is obtained, with V = {v_i | i = 1, ..., N} being the node set, E being the edge set, and X ∈ R^{N×F} being the feature matrix of the nodes. Based on the constructed graph, the feature affinities among samples are captured explicitly and can be exploited during feature learning by performing message propagation on them (a code sketch of this construction is given at the end of this part).

To obtain a sufficiently representative high-level sample embedding, the dual encoder consists of a feature encoder and a graph encoder, which encode the original features of the samples and the correlation among them, respectively.

To encode the original sample features X, the feature encoder employs an L_X-layer multi-layer perceptron (MLP) to conduct a non-linear feature transform:

$$
Z_X^{(l_X)} = \sigma\!\left(Z_X^{(l_X-1)} W_X^{(l_X-1)} + b_X^{(l_X-1)}\right),
$$

where $Z_X^{(l_X-1)}$, $Z_X^{(l_X)}$, $W_X^{(l_X-1)}$, and $b_X^{(l_X-1)}$ are the input, output, trainable weight matrix, and bias of the $(l_X-1)$-th layer, respectively, $l_X \in \{1, 2, ..., L_X\}$, and $Z_X^{(0)} = X$ is the initial input of the encoder. σ(·) denotes an activation function such as ReLU or Tanh. The final feature embedding $Z_X = Z_X^{(L_X)}$ is taken from the output of the last MLP layer.

To encode the correlation among the samples, a graph attention layer [16] is employed to adaptively aggregate representations from neighboring nodes, by performing a shared attention mechanism on the nodes:

$$
w_{i,j} = \mathrm{attn}(x_i, x_j) = \sigma\!\left(a^{\top}\left[\,W_c x_i \,\|\, W_c x_j\,\right]\right),
$$

where $w_{i,j}$ indicates the importance weight of node $v_i$ to node $v_j$, attn(·) denotes the neural network parametrized by the weights $a \in \mathbb{R}^{D_c}$ and $W_c \in \mathbb{R}^{(D_c/2) \times F}$ shared by all nodes, $D_c$ is the number of hidden neurons in attn(·), and ‖ denotes the concatenation operation. The importance weights are then normalized through the softmax function:

$$
\alpha_{i,j} = \frac{\exp(w_{i,j})}{\sum_{k \in \mathcal{N}_i} \exp(w_{i,k})},
$$

where $\mathcal{N}_i$ denotes the neighbors of node $v_i$, as given by the adjacency matrix A. The final node embedding $Z_V = \{Z_{V_i}\}$ is obtained as the weighted sum of neighbor features based on the learned importance weights:

$$
Z_{V_i} = \sigma\!\Big(\sum_{j \in \mathcal{N}_i} \alpha_{i,j}\, W_c\, x_j\Big).
$$

Given the learned embeddings $Z_X$ and $Z_V$, a fusion module is designed to fuse the embeddings from the heterogeneous sources into a shared latent space, followed by a fully connected layer that yields the final sample embedding $Z_f$:

$$
Z_f = (Z_X \oplus Z_V)\, W + b,
$$

where W and b are the trainable weight matrix and bias, and ⊕ indicates element-wise addition of two matrices.

The feature decoder aims at reconstructing the sample features from the latent embedding $Z_f$:

$$
\tilde{Z}_X^{(\tilde{l}_X)} = \sigma\!\left(\tilde{Z}_X^{(\tilde{l}_X-1)} \tilde{W}_X^{(\tilde{l}_X-1)} + \tilde{b}_X^{(\tilde{l}_X-1)}\right),
$$

where $\tilde{Z}_X^{(\tilde{l}_X-1)}$, $\tilde{Z}_X^{(\tilde{l}_X)}$, $\tilde{W}_X^{(\tilde{l}_X-1)}$, and $\tilde{b}_X^{(\tilde{l}_X-1)}$ are the input, output, trainable weight matrix, and bias of the $(\tilde{l}_X-1)$-th decoder layer, respectively, $\tilde{l}_X \in \{1, 2, ..., \tilde{L}_X\}$, and $\tilde{Z}_X^{(0)} = Z_f$ is the initial input of the decoder. The reconstruction $\hat{X} = \tilde{Z}_X^{(\tilde{L}_X)}$ is obtained from the last layer of the decoder.

To estimate the density of the input samples, a Gaussian Mixture Model is leveraged in CADGMM over the learned latent embedding. Inspired by DAGMM [23], a sub-network consisting of several fully connected layers is utilized, which takes the reconstruction-error-preserving low-dimensional embedding as input to estimate the mixture membership of each sample.
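For the graph construction step above, the following sketch (a minimal version under our own naming, using scikit-learn's K-NN search rather than whatever implementation the authors used) builds the symmetric adjacency matrix A of the K-NN graph that the graph encoder consumes:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_knn_graph(X: np.ndarray, k: int = 5) -> np.ndarray:
    """Connect each sample to its k nearest neighbors with undirected edges.

    X: (N, F) feature matrix. Returns the (N, N) binary adjacency matrix A.
    """
    n = X.shape[0]
    # Query k + 1 neighbors because each point is returned as its own neighbor.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    A = np.zeros((n, n), dtype=np.float32)
    A[np.repeat(np.arange(n), k), idx[:, 1:].ravel()] = 1.0  # drop self-column
    return np.maximum(A, A.T)  # symmetrize: edges are undirected

# Example: a 5-NN graph over 100 random 20-dimensional samples.
A = build_knn_graph(np.random.default_rng(0).normal(size=(100, 20)), k=5)
```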
The reconstruction-error-preserving low-dimensional embedding Z is obtained as follows:

$$
Z = \left[\,Z_f \,\|\, Z_r\,\right], \qquad Z_r = \mathrm{Dist}(X, \hat{X}),
$$

where $Z_r$ is the reconstruction-error embedding and Dist(·) denotes a distance metric such as the Euclidean distance or the cosine distance. Given the final embedding Z as input, the estimation network conducts membership prediction as follows:

$$
\mathcal{M} = \mathrm{softmax}\!\left(\mathrm{MLP}(Z)\right),
$$

where $\mathcal{M} \in \mathbb{R}^{N \times M}$ is the predicted membership over M mixture components for the N samples. With the predicted sample memberships, the parameters of the GMM can be calculated to facilitate the evaluation of the energy/likelihood of the input samples:

$$
\phi_m = \frac{1}{N}\sum_{i=1}^{N} \mathcal{M}_{i,m}, \qquad
\mu_m = \frac{\sum_{i=1}^{N} \mathcal{M}_{i,m}\, Z_i}{\sum_{i=1}^{N} \mathcal{M}_{i,m}}, \qquad
\Sigma_m = \frac{\sum_{i=1}^{N} \mathcal{M}_{i,m}\, (Z_i - \mu_m)(Z_i - \mu_m)^{\top}}{\sum_{i=1}^{N} \mathcal{M}_{i,m}},
$$

where $\phi_m$, $\mu_m$, and $\Sigma_m$ are the mixture weight, mean, and covariance of the m-th component distribution, respectively. The energy of the samples is then

$$
E_Z = -\log\!\left( \sum_{m=1}^{M} \phi_m \,
\frac{\exp\!\left(-\tfrac{1}{2}(Z-\mu_m)^{\top} \Sigma_m^{-1} (Z-\mu_m)\right)}{\sqrt{\left|\,2\pi \Sigma_m\,\right|}} \right),
$$

where |·| indicates the determinant of a matrix. A worked code sketch of this estimation step is given at the end of this section.

The training objective of CADGMM is defined as follows:

$$
\mathcal{L} = \|X - \hat{X}\|_2^2 + \lambda_1 E_Z
+ \lambda_2 \sum_{m=1}^{M} \sum_{j=1}^{d} \frac{1}{(\Sigma_m)_{jj}}
+ \lambda_3 \|Z\|_2^2,
$$

where the first term is the reconstruction error used for feature reconstruction; the second is the sample energy, which aims to maximize the likelihood of the observed samples; the third is a covariance penalization, used to address the singularity problem in GMMs [23] by penalizing small values on the diagonal entries of the covariance matrices; and the last is an embedding penalization, which serves as a regularizer that keeps the magnitude of normal samples as small as possible in the latent space, so as to separate the normal samples from the abnormal ones. $\lambda_1$, $\lambda_2$, and $\lambda_3$ are three hyper-parameters that control the trade-off between the terms.

The anomaly score is the sample energy $E_Z$. Based on the measured anomaly scores, the threshold λ in Eq. (1) can be determined according to the distribution of the scores; e.g., the samples with the top-k scores are classified as anomalous.

In this section, we describe the experimental details, including the datasets, baseline methods, and parameter settings.

- One-Class Support Vector Machines (OC-SVM) [4]: a classic kernel method for anomaly detection, which learns a decision boundary between inliers and outliers.
- Isolation Forest (IF) [8]: conducts anomaly detection by building trees using randomly selected split values across sample features, and defines the anomaly score as the average path length from a specific sample to the root.
- Deep Structured Energy Based Models (DSEBM) [21]: a deep energy-based model that accumulates the energy across layers. DSEBM-r and DSEBM-e [21] take the reconstruction error and the energy as the anomaly score, respectively.
- Deep Autoencoding Gaussian Mixture Model (DAGMM) [23]: an autoencoder-based method for anomaly detection, which consists of a compression network for dimensionality reduction and an estimation network that performs density estimation under a Gaussian Mixture Model.
- AnoGAN [13]: a GAN-based anomaly detection algorithm, which trains a DCGAN [12] and recovers the latent representation of each data sample at prediction time.
- ALAD [20]: a bi-directional-GAN-based method that derives adversarially learned features and uses reconstruction errors based on those features to determine whether a data sample is anomalous.

The parameter settings in the experiments for the different datasets are as follows:

In this section, we demonstrate the effectiveness of the proposed method by presenting the results of our model on the anomaly detection task and comparing it with the state-of-the-art methods. As in previous literature [20, 21, 23], Precision, Recall, and F1 score are employed as the evaluation metrics.
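Returning to the estimation step above, the following minimal NumPy sketch (our illustration; the regularization constant eps and all names are our assumptions, and a real implementation would run on training batches inside the learning loop) shows how soft memberships turn into GMM parameters and per-sample energies:

```python
import numpy as np

def gmm_energy(Z: np.ndarray, M: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Compute per-sample energies E_Z from embeddings and memberships.

    Z: (N, d) latent embeddings; M: (N, K) softmax membership predictions.
    """
    N, d = Z.shape
    Nk = M.sum(axis=0)            # soft sample count per component
    phi = Nk / N                  # mixture weights phi_m
    mu = (M.T @ Z) / Nk[:, None]  # component means mu_m
    log_probs = []
    for m in range(M.shape[1]):
        diff = Z - mu[m]          # (N, d) deviations from the component mean
        # Membership-weighted covariance Sigma_m, regularized for invertibility
        # (the diagonal penalty in the loss serves the same purpose in training).
        Sigma = (M[:, m, None, None] * (diff[:, :, None] * diff[:, None, :])).sum(0) / Nk[m]
        Sigma += eps * np.eye(d)
        maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
        _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)  # log |2*pi*Sigma|
        log_probs.append(np.log(phi[m] + eps) - 0.5 * (maha + logdet))
    # E_Z = -log sum_m exp(log_prob_m), computed with the log-sum-exp trick.
    lp = np.stack(log_probs, axis=1)
    mmax = lp.max(axis=1, keepdims=True)
    return -(mmax[:, 0] + np.log(np.exp(lp - mmax).sum(axis=1)))
```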
Generally, higher values of these evaluation metrics indicate better performance. Samples with high energy are classified as abnormal, and the threshold is determined based on the ratio of anomalies in the dataset. Following the settings in [21, 23], the training and test sets are split 1:1, and only normal samples are used for training the model.

The experimental results shown in Table 3 demonstrate that the proposed CADGMM significantly outperforms all baselines on the various datasets. The performance of CADGMM is much higher than that of traditional anomaly detection methods such as OC-SVM and IF, owing to their limited feature-learning capability and the curse of dimensionality. Moreover, CADGMM also significantly outperforms all the deep-learning-based methods, namely DSEBM, DAGMM, AnoGAN, and ALAD, which demonstrates that the additional correlation among data samples facilitates feature learning for anomaly detection. For small datasets such as Arrhythmia, traditional methods such as IF are competitive with the conventional deep-learning-based methods (DSEBM, DAGMM, AnoGAN, and ALAD), which might be because the lack of sufficient training data degrades the performance of data-hungry deep learning methods; CADGMM, in contrast, is capable of extracting more from the limited data by considering the correlation among data samples.

Next, we study the impact of noisy data on the training of CADGMM. Specifically, a randomly selected 50% of the data samples are used for testing, while the remaining 50%, mixed with 1% to 5% anomalies, are used for training. As shown in Table 4, as the noise increases, the performance of all baselines degrades significantly, especially that of OC-SVM, which tends to be more sensitive to noisy data because of its poor ability to learn features from high-dimensional data. In contrast, CADGMM performs stably under different noise ratios and achieves state-of-the-art results even when 5% anomalies are injected into the training data, which demonstrates the robustness of the proposed method.

We also evaluate the impact of the value of K used during graph construction on CADGMM. More specifically, we conduct experiments on all three datasets, varying K from 5 to 19; the results are illustrated in Fig. 3. During training, the batch sizes are set to 1024, 128, and 512 for KDD99, Arrhythmia, and Satellite, respectively. The experimental results show that changing the value of K causes only small fluctuations in performance on all datasets, which demonstrates that CADGMM is less sensitive to the choice of K and easy to use.

Finally, to explore the quality of the learned embedding, we compare visualizations of the sample representations of different methods in Fig. 4. Specifically, we take the low-dimensional embeddings of samples learned by DAGMM and CADGMM as the inputs to the t-SNE tool [9]. We randomly choose 40,000 data samples from the test set of KDD99 for visualization and project the sample embeddings onto a two-dimensional space, in which blue corresponds to the normal class and orange to the abnormal class. We find that CADGMM produces more compact and better-separated clusters than DAGMM, which also explains why our approach achieves better performance on the anomaly detection task.
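For reference, a visualization like that of Fig. 4 can be reproduced along the following lines (a hedged sketch with our own function name and plotting choices; the paper does not specify its exact t-SNE parameters):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_embedding(Z: np.ndarray, y: np.ndarray, title: str = "latent embedding (t-SNE)"):
    """Project latent embeddings Z (N, d) to 2-D and color by label y (N,)."""
    Z2 = TSNE(n_components=2, random_state=0).fit_transform(Z)
    plt.scatter(Z2[y == 0, 0], Z2[y == 0, 1], s=2, c="tab:blue", label="normal")
    plt.scatter(Z2[y == 1, 0], Z2[y == 1, 1], s=2, c="tab:orange", label="abnormal")
    plt.title(title)
    plt.legend()
    plt.show()
```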
In this paper, we study the problem of correlation-aware unsupervised anomaly detection, which considers the correlation among data samples in the feature space. To cope with this problem, we propose a method named CADGMM, which models the complex correlation among data points to generate high-quality low-dimensional embeddings for anomaly detection. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed method.

References

[1] Enhancing one-class support vector machines for unsupervised anomaly detection.
[2] UCI machine learning repository.
[3] Anomaly detection: a survey.
[4] One-class SVM for learning in image retrieval.
[5] Principal component analysis.
[6] Robust kernel density estimation.
[7] Improving one-class SVM for anomaly detection.
[8] Isolation forest. In: ICDM.
[9] Visualizing data using t-SNE.
[10] Robust feature selection and robust PCA for internet traffic anomaly detection.
[11] Using an ensemble of one-class SVM classifiers to harden payload-based anomaly detection systems.
[12] Unsupervised representation learning with deep convolutional generative adversarial networks.
[13] Unsupervised anomaly detection with generative adversarial networks to guide marker discovery.
[14] Real-world anomaly detection in surveillance videos.
[15] Fast anomaly detection for streaming data.
[16] Graph attention networks.
[17] Group anomaly detection using Flexible Genre Models.
[18] Robust PCA via outlier pursuit.
[19] CXNet-m1: anomaly detection on chest X-rays with image-based deep learning.
[20] Adversarially learned anomaly detection.
[21] Deep structured energy based models for anomaly detection.
[22] Anomaly detection with robust deep autoencoders.
[23] Deep autoencoding Gaussian mixture model for unsupervised anomaly detection.

Acknowledgments. This work was supported in part by the National Natural Science Foundation of China (No. 61172168, 61972187).