Title: L0-norm Constrained Autoencoders for Unsupervised Outlier Detection
Authors: Ishii, Yoshinao; Koide, Satoshi; Hayakawa, Keiichiro
Date: 2020-04-17
Journal: Advances in Knowledge Discovery and Data Mining
DOI: 10.1007/978-3-030-47436-2_51

Abstract: Unsupervised outlier detection is commonly performed using reconstruction-based methods such as Principal Component Analysis. A recent challenge in this field is learning low-dimensional nonlinear manifolds under an l0-norm constraint on the error term. Despite significant efforts, no existing method treats both of these features consistently. We propose a novel unsupervised outlier detection method, L0-norm Constrained Autoencoders (L0-AE), an autoencoder-based detector with an l0-norm constraint on the error term. Unlike existing methods, the proposed optimization procedure of L0-AE provably guarantees convergence of the objective function under a mild condition, while neither relaxation of the l0-norm constraint nor linearity of the latent manifold is enforced. Experimental results show that the proposed L0-AE is more robust and accurate than other reconstruction-based methods, as well as conventional methods such as Isolation Forest.

Unsupervised outlier detection has attracted much attention because it does not require time-consuming manual annotation. Reconstruction-based methods, such as Robust Principal Component Analysis (RPCA), are popular approaches for unsupervised outlier detection [5, 12]. A recent trend is the use of nonlinear models, particularly neural network models [1, 20, 24]. For example, the Robust Deep Autoencoder (RDA) [24] uses an autoencoder (AE) [10] to learn a low-dimensional nonlinear manifold on which normal samples are located. These reconstruction-based methods assume that the feature vector of each sample may include outlier elements; the low-dimensional nonlinear manifold must therefore be learned while avoiding the impact of those outliers. For robustness to outliers, the l0-norm is often used in the optimization; however, due to its combinatorial nature, the optimization is difficult [4]. To avoid this difficulty, relaxation methods, i.e., the l1-norm or other convex regularization terms, are used. This is problematic, however, because the learned low-dimensional nonlinear manifold is then affected by the values of outlier elements, especially in corrupted data. We describe these methods in detail in Sect. 2.

In this paper, we propose L0-norm Constrained Autoencoders (L0-AE), a novel reconstruction-based unsupervised outlier detection method that can learn low-dimensional manifolds under an l0-norm constraint on the error term using an AE. Table 1 compares the features of different reconstruction-based methods. Compared with the other reconstruction-based methods, L0-AE provably guarantees the convergence of its optimization under the l0-norm constraint and can treat nonlinear features. The key contributions of this work are as follows:

1. We propose a new alternating optimization algorithm that can decompose data nonlinearly under an l0-norm constraint on the error term (Sect. 3.1).
2. We prove that our alternating optimization algorithm converges under a mild condition, which demonstrates the stability of our algorithm (Sect. 3.2).
3. Through extensive experiments, we show that L0-AE achieves not only high detection accuracy but also stable convergence (Sect. 5).
In this section, we describe related reconstruction-based methods. Throughout the paper, we denote a given data matrix by X ∈ R^{N×D}, where N and D denote the number of samples and the number of feature dimensions of X, respectively.

Robust PCA: RPCA [5], a robustified version of PCA [12], decomposes X into a low-rank matrix L and a sparse error matrix S such that X = L + S by solving the optimization problem

    min_{L,S} rank(L) + λ||S||_0   s.t.   ||X − L − S||_2 = 0,    (1)

where ||·||_0 is the l0-norm, which represents the number of non-zero elements, λ is a parameter that controls the sparsity of S, and ||·||_2 is the l2-norm. The use of the l0-norm cancels out the outliers in X, making the estimation more robust against outliers. However, optimization (1) is NP-hard. To mitigate this issue, a convex relaxation has been proposed:

    min_{L,S} ||L||_* + λ||S||_1   s.t.   ||X − L − S||_2 = 0,    (2)

where ||·||_* is the nuclear norm and ||·||_1 is the l1-norm. In general, the outlierness of each sample is obtained by adding S ∘ S along the feature dimension, where ∘ is the element-wise product.

Robust Deep Autoencoder: RDA [24] is a method that relaxes the linearity assumption of RPCA. RDA uses an AE instead of a linear mapping. We denote the model parameters of an AE by θ and the output of the AE for a given input under parameters θ by f_AE(·; θ). Concretely, RDA aims to decompose X as X = L_D + S, where S is a sparse error matrix whose non-zero elements indicate reconstruction difficulty, and L_D is data that the AE can easily reconstruct. This is defined as the following l1-relaxed optimization problem:

    min_{θ,S} ||L_D − f_AE(L_D; θ)||_2 + λ||S||_1   s.t.   X − L_D − S = 0.    (3)

An alternating optimization method for θ and S was proposed for this problem (see [24] for details). Note that RDA is equivalent to AE when λ = ∞. In real applications, outliers are often structured [24], i.e., outliers are concentrated on specific samples. For such cases, the use of grouped-norm regularization instead of the l1-norm in Eq. (3) has been proposed:

    min_{θ,S} ||L_D − f_AE(L_D; θ)||_2 + λ||S^T||_{2,1}   s.t.   X − L_D − S = 0,    (4)

where ||S^T||_{2,1} = Σ_{i=1}^{N} (Σ_{j=1}^{D} S_ij^2)^{1/2} groups the error term by sample. Although RDA can detect outliers even in nonlinear data, there are several concerns with RDA. First, owing to the NP-hardness, RDA uses the l1-norm instead of the l0-norm, which causes sensitivity to outliers. Second, the alternating optimization method of RDA is not accompanied by a theoretical analysis of convergence. In practice, it has been experimentally confirmed that the progress of training the RDA model may be unstable.

To address these issues, we propose an unsupervised outlier detection method that can decompose data nonlinearly using an AE under an l0-norm constraint on the sparse matrix S. We prove that our algorithm always converges under a certain condition. For clarity, we first describe L0-AE for unstructured outliers and then extend it to structured outliers. Considering that all elements may contain some errors in real datasets, we decompose X into L̂ = f_AE(X; θ), a sparse error matrix S, and a small error matrix E, as in Stable Principal Component Pursuit [25]:

    X = f_AE(X; θ) + S + E.    (5)

To train an AE that successfully captures the features of X, ||E||_2^2 = ||X − f_AE(X; θ) − S||_2^2 must be as small as possible. For optimization, we minimize E while adjusting the sparsity of S using a parameter k ≥ 0:

    min_{θ,S} ||X − f_AE(X; θ) − S||_2^2   s.t.   ||S||_0 ≤ k.    (6)

By solving (6), we can obtain a low-dimensional manifold that captures the nonlinear features of X while completely avoiding the influence of outliers. In the following, we propose an alternating optimization algorithm for θ and S for the l0-norm constrained optimization problem (6). We denote X − f_AE(X; θ) by Z(θ); the objective function can then be expressed as ||Z(θ) − S||_2^2.
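To make the contrast between the l1 relaxation used by RPCA/RDA and the l0-norm constraint used by L0-AE concrete, the following NumPy sketch shows the two ways of updating the sparse error matrix S for a fixed residual Z = X − L, together with the sample-wise outlierness obtained from S. This is an illustrative sketch only; the function names are hypothetical and are not taken from any of the cited implementations.

```python
# Illustrative sketch (not from the paper's code): l1 soft-thresholding shrinks
# every element, so large outliers still leak into L, while the l0-constrained
# update keeps only the top-k residuals intact and zeroes out the rest.
import numpy as np

def soft_threshold(Z, lam):
    """l1 proximal step: S = sign(Z) * max(|Z| - lam, 0)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)

def top_k_projection(Z, k):
    """l0 projection: keep the k elements of Z with largest |Z|, zero the rest."""
    S = np.zeros_like(Z)
    if k > 0:
        idx = np.unravel_index(np.argsort(-np.abs(Z), axis=None)[:k], Z.shape)
        S[idx] = Z[idx]
    return S

def sample_outlierness(S):
    """Sample-wise outlierness: sum of (S o S) along the feature dimension."""
    return (S ** 2).sum(axis=1)
```

Soft thresholding biases every retained residual toward zero by lam, whereas the top-k projection leaves the selected residuals untouched and forces only the remaining (small) residuals to be explained by the low-dimensional part.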
In the optimization phase of θ with S fixed, we employ a gradient-based method. With θ fixed, the optimal S is obtained in closed form: it is the matrix obtained from Z(θ) by keeping the elements with the top-k largest absolute values and setting all other elements to zero, which can be written as

    S_ij = Z(θ)_ij  if |Z(θ)_ij| ≥ c,   S_ij = 0  otherwise,    (7)

where c is the k-th largest value in {|Z(θ)_ij|}. We rewrite our proposed formulation (6) and the alternating optimization method in an algorithmically concise form as

    min_{θ,A} ||A ∘ Z(θ)||_2^2   s.t.   Σ_{i,j} (1 − A_ij) ≤ k,    (8)

where A ∈ {0, 1}^{N×D} is a binary-valued matrix. In the alternating optimization of Eq. (8), θ is optimized by gradient-based optimization and A is optimized by

    A_ij = 0  if |Z(θ)_ij| ≥ c,   A_ij = 1  otherwise,    (9)

with c defined as above. The procedure of our proposed optimization algorithm is as follows:

Input: X ∈ R^{N×D}, k, and Epoch_max ∈ N.
Initialize A ∈ R^{N×D} as a zero matrix, the epoch counter Epoch = 0, and an autoencoder f_AE(·; θ) with randomly initialized parameters.
Repeat the following Epoch_max times:
1. Obtain the reconstruction error matrix Z = X − f_AE(X; θ).
2. Update A by Eq. (9) with θ fixed.
3. Update θ by gradient-based optimization of ||A ∘ Z(θ)||_2^2 with A fixed.
Return the element-wise outlierness R ∈ R^{N×D} computed as

    R = Z(θ) ∘ Z(θ).    (10)

In step 3, the number of iterations of each gradient-based optimization process affects the performance of L0-AE. In practice, L0-AE shows sufficient detection accuracy and convergence without inner iterations (see Sect. 5). In this case, the total computational cost of L0-AE is the sum of the cost of a normal AE and that of the sorting required to obtain the top-k error values.

In this section, we prove that our alternating optimization algorithm always converges under the assumption that the AE is trained appropriately by gradient-based optimization. Here, we denote the objective function ||A ∘ Z(θ)||_2^2 by K(A, θ), and the variables A and θ at the t-th step of each alternating optimization phase by A_t and θ_t, respectively. The assumption is that each gradient-based update of θ does not increase the objective, i.e., K(A_{t+1}, θ_{t+1}) ≤ K(A_{t+1}, θ_t). Under this assumption, the convergence of the proposed alternating optimization method can be shown as follows:

Theorem 1. Under the above assumption, the sequence of objective values {K(A_t, θ_t)} converges.

Proof. By updating with Eq. (9), the obtained A* minimizes the objective of Eq. (8) for any Z(θ), so K(A_{t+1}, θ_t) ≤ K(A_t, θ_t). Together with the assumption, this implies that the sequence {K(A_t, θ_t)} is monotonically non-increasing and non-negative. Therefore, by applying the monotone convergence theorem, {K(A_t, θ_t)} converges.

The assumption holds when the learning rate of the AE model is sufficiently small. Although this assumption might not hold for a fixed learning rate in practice, L0-AE shows better convergence than RDA (see Sect. 5.5).

In what follows, we describe an alternating optimization algorithm for data with structured outliers. To detect structured outliers, Eqs. (8) and (9) are reformulated, respectively, as

    min_{θ,a} Σ_{i=1}^{N} a_i Σ_{j=1}^{D} (Z(θ)_ij)^2   s.t.   ||1_N − a||_0 ≤ k,  a ∈ {0, 1}^N,    (11)

    a_i = 0  if Σ_{j=1}^{D} (Z(θ)_ij)^2 ≥ c,   a_i = 1  otherwise,    (12)

where 1_N denotes the all-ones vector (the subscripts N and D denote the numbers of elements) and c is the k-th largest value of the vector whose i-th element is Σ_{j=1}^{D} (Z(θ)_ij)^2. The sample-wise outlierness r is calculated from the R defined by Eq. (10) as

    r_i = Σ_{j=1}^{D} R_ij.    (13)

L0-AE uses this version of the formulation and alternating optimization method for outlier detection. As with the update of A by Eq. (9), the update of a by Eq. (12) always minimizes the objective function (11) with θ fixed. The convergence of the algorithm using Eq. (12) can be proved in a manner similar to Theorem 1.

Remark. The concept of our optimization methodology for structured outliers can be regarded as Least Trimmed Squares (LTS) [17], in which the sum of all squared residuals except the largest k squared residuals is minimized.
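The following PyTorch sketch illustrates one possible implementation of the structured-outlier variant above (full-batch training, no inner iterations). It is a minimal sketch under stated assumptions, not the authors' reference implementation (which used Chainer); the function name l0_ae_fit, the layer widths, the learning rate, and the use of PyTorch are illustrative. The layer pattern mirrors the linear/relu connection described later in Sect. 5.2, and C_p = k/N plays the role of k.

```python
# Minimal sketch of the L0-AE alternating optimization (structured-outlier
# version, Sect. 3.3). Illustrative only; names and architecture are assumed.
import torch
import torch.nn as nn

def l0_ae_fit(X, Cp=0.3, epochs=400, hidden=16, code=4, lr=1e-3):
    """X: (N, D) float tensor scaled to [-1, 1]. Returns sample-wise outlierness r."""
    N, D = X.shape
    k = int(Cp * N)                      # number of samples excluded from the loss
    model = nn.Sequential(               # small AE; layer pattern follows Sect. 5.2
        nn.Linear(D, hidden), nn.ReLU(),
        nn.Linear(hidden, code),
        nn.Linear(code, hidden), nn.ReLU(),
        nn.Linear(hidden, D),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        # Step 1: reconstruction error matrix Z(theta)
        Z = X - model(X)
        row_err = (Z.detach() ** 2).sum(dim=1)
        # Step 2: closed-form update of the mask a (Eq. (12)):
        # drop the k samples with the largest squared error
        a = torch.ones(N)
        if k > 0:
            a[torch.topk(row_err, k).indices] = 0.0
        # Step 3: one gradient step on the masked objective (Eq. (11))
        loss = (a.unsqueeze(1) * Z ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Outlierness: R = Z o Z (Eq. (10)); sample-wise r_i = sum_j R_ij (Eq. (13))
    with torch.no_grad():
        R = (X - model(X)) ** 2
    return R.sum(dim=1)
```

The closed-form mask update in Step 2 is what distinguishes this loop from training a plain AE; with C_p = 0 the mask is all ones and the procedure reduces to an ordinary autoencoder.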
Recently, highly accurate neural network-based anomaly detection methods, such as AE-, Variational Autoencoder (VAE)-, or Generative Adversarial Network-based methods [1, 18, 26], have been proposed; however, they assume a problem setting different from ours, i.e., the training data do not include anomalous data, and finding anomalies in test datasets is the target task. Therefore, these methods have no mechanism that excludes outliers during training. In [8], the equivalence of the global optimum of the VAE and RPCA is shown under the condition that the decoder has a certain kind of affinity; however, connections between VAE and RPCA are not shown for general nonlinear activation functions. The Discriminative Reconstruction Autoencoder (DRAE) [20] has been proposed for unsupervised outlier removal. DRAE labels samples whose reconstruction errors exceed a threshold as "outliers" and omits such samples from learning. To determine the threshold appropriately, the loss function of DRAE has an additional term that separates the reconstruction errors of inliers and outliers. Because of this additional term, DRAE does not solve an l0-norm constrained optimization problem, i.e., the learned manifold is affected by outlier values, which degrades detection performance (see Sect. 5). RandNet [7] has been proposed as a method that increases robustness through an ensemble scheme. Although this ensemble may improve robustness, it does not completely avoid the adverse effects of outliers, because each AE uses a non-regularized objective function. Deep Structured Energy-Based Models [21], a robust AE-based method that combines energy-based models and a non-regularized AE, has the same drawback. In [11], a method that simply combines AE and LTS was proposed; however, no theoretical analysis of the combined effects of AE and LTS was presented.

Datasets. We employed both artificial and real datasets. Figure 1 illustrates the artificial data. We sampled 9,000 inlier samples (x, 2x^2 − 1) ∈ R^2, where x ∈ [−1, 1] was sampled uniformly. Further, we sampled 1,000 outliers uniformly from [−1, 1] × [−1, 1]. As real datasets, we used 11 datasets from Outlier Detection DataSets (ODDS) [16], which are commonly used as benchmarks for outlier detection methods. In addition, we used the "kdd99 rev" dataset introduced in [26]. Table 2 summarizes the 12 datasets. Before the experiments, we normalized the values of the datasets, dimension by dimension, into the range of −1 to 1.

Evaluation Method. Following the evaluation methods in [7, 14, 15], we compared the AUCs of outlier detection accuracy. Evaluation was performed as follows: (1) all samples (whether inlier or outlier) were used for training; (2) the outlierness of each sample was calculated after training; and (3) AUCs were calculated using the outlierness values and the inlier/outlier labels. Note that we need not specify a detection threshold in this evaluation scheme.

Robust PCA (RPCA). We utilize the RPCA implemented in [9] as a baseline linear method. We set tol = 1E-05, which is used to determine convergence.

Non-regularized AE (N-AE). We implemented N-AE, an AE with the loss function ||X − f_AE(X; θ)||_2^2, as a baseline non-regularized detection method. For every AE-based method below, we used common network settings. We used three fully connected hidden layers (five layers in total), in which the number of neurons was ceil([D, D , D]) from the input to the output unless otherwise noted.
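As a concrete illustration of this evaluation protocol on the artificial data, the sketch below generates the 9,000 inliers and 1,000 outliers described above, trains on all samples, and computes the AUC from the per-sample outlierness with scikit-learn. The random seed and the detector call (the illustrative l0_ae_fit sketch from Sect. 3, with C_p = 0.2 and the [2, 100, 1, 100, 2] layer sizes mentioned for the artificial data) are assumptions for illustration, not the exact experimental configuration; any detector returning a per-sample outlierness score would fit the same pipeline.

```python
# Sketch of the evaluation protocol: train on all samples (inliers and
# outliers together), compute per-sample outlierness, then score with ROC AUC
# against the ground-truth labels. Numbers follow the text; the detector call
# uses the illustrative l0_ae_fit sketch and is not the reference code.
import numpy as np
import torch
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=9000)
inliers = np.stack([x, 2.0 * x ** 2 - 1.0], axis=1)   # on the manifold y = 2x^2 - 1
outliers = rng.uniform(-1.0, 1.0, size=(1000, 2))      # uniform noise in [-1, 1]^2
X = np.vstack([inliers, outliers]).astype(np.float32)
y = np.concatenate([np.zeros(9000), np.ones(1000)])    # 1 = outlier

# hidden=100, code=1 roughly mirrors the [2, 100, 1, 100, 2] sizes in Sect. 5.3
scores = l0_ae_fit(torch.from_numpy(X), Cp=0.2, hidden=100, code=1).numpy()
print("AUC:", roc_auc_score(y, scores))
```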
These were connected as (input layer) -linear -relu -(hidden layer 1) -linear -(hidden layer 2) -linear -relu -(hidden layer 3) -linear -(output layer). We set the mini-batch size to N/50 and applied Adam [13] (α = 0.001) for optimization with Epoch_max = 400. To prevent undue advantages to our method (L0-AE) and the other AE-based methods, we searched for this architecture by maximizing the average AUC of N-AE.

Robust Deep Autoencoder (RDA). We implemented the RDA [23] with the grouped norm version of Eq. (3). We use Eq. (13) to calculate the sample-wise outlierness of RDA. To make the number of loops equal to those of the other AE-based methods, the parameter inner iteration, which is the number of iterations used to optimize the AE during one execution of the l1-norm optimization, is set to 1. We set λ as 0.00005.

RDA-Stbl. This baseline is used to confirm the effect of the l0-norm constraint of L0-AE. RDA-Stbl minimizes the objective function ||L_D − f_AE(X; θ)||_2 + λ||S^T||_{2,1} subject to X − L_D − S = 0 with respect to S and θ. This model can be regarded as a relaxed version of L0-AE. We set λ as 0.0005.

L0-AE. We use L0-AE for structured outliers (described in Sect. 3.3). The sample-wise outlierness of L0-AE is calculated using Eq. (13). We do not perform inner iterations when updating the AE parameters at each gradient-based optimization step. Instead of k, we use C_p = k/N (0 ≤ C_p ≤ 1), which is normalized by the number of samples, and set C_p to 0.3. This type of normalized parameter is often used in other methods such as the One-Class Support Vector Machine (OC-SVM) [19] and Isolation Forest (IForest) [15]. Note that L0-AE is equivalent to N-AE when C_p = 0.

Variational Autoencoder (VAE). We adopted VAE for our problem setting. The outlierness is computed using the reconstruction probability [1]. Note that the number of output dimensions of hidden layers 1 and 3 is twice that of the other AE-based methods.

Discriminative Reconstruction Autoencoder (DRAE). We set λ as 0.1, which determines the weight of the term in the objective function that separates the inlier and outlier reconstruction errors (see Sect. 4).

We used Chainer (ver. 1.21.0) [6] for the implementation of the AE-based methods above. In addition, we applied the following three conventional methods for comparison of detection accuracy on the real benchmark datasets.

One-Class SVM (OC-SVM). We use the OC-SVM implemented in scikit-learn and set kernel = 'rbf'.

Local Outlier Factor (LOF) [3]. We use the LOF implemented in scikit-learn and set "k" for the k-nearest neighbors to 100.

Isolation Forest. We used the IForest from the Python library pyod [22] with "n_estimators" (the number of trees) set to 100.

We tuned the above-mentioned parameters that control the robustness against noise to achieve a high AUC on average over all real datasets; for the other parameters, we used the recommended (default) values unless otherwise noted.

We evaluate the robustness against outliers of L0-AE and the baseline reconstruction-based methods using the artificial data. We compare the average AUC, as well as the average outlierness of inlier samples O^i_avg, the average outlierness of outlier samples O^o_avg, and the ratio O^o_avg / O^i_avg (a higher value implies that fewer outliers are close to the low-dimensional manifold). In this experiment, because D = 2, we could not set the numbers of neurons and parameters as described in Sect. 5.2; instead, for N-AE to achieve a high AUC, we used the empirically obtained layer sizes [2, 100, 1, 100, 2]. For RDA and RDA-Stbl, we used λ = 0.00001; for DRAE, λ = 0.1; and for L0-AE, C_p = 0.2. These values were chosen based on the AUC values.
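The three conventional baselines above can be instantiated directly from the named libraries with the stated parameters. The sketch below is an assumption-laden illustration (library versions, score sign conventions, and all unstated parameters left at library defaults), not the authors' exact configuration.

```python
# Sketch of the three conventional baselines with the parameters stated in the
# text; everything else is left at library defaults and may differ from the
# authors' setup.
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from pyod.models.iforest import IForest

ocsvm = OneClassSVM(kernel="rbf")            # OC-SVM with RBF kernel
lof = LocalOutlierFactor(n_neighbors=100)    # LOF with k = 100
iforest = IForest(n_estimators=100)          # Isolation Forest with 100 trees

def outlier_scores(X):
    """Return per-sample scores where higher means more outlying,
    matching the AUC-based evaluation protocol."""
    s_svm = -ocsvm.fit(X).decision_function(X)   # negate: higher = more abnormal
    lof.fit(X)
    s_lof = -lof.negative_outlier_factor_        # negate: higher = more abnormal
    iforest.fit(X)
    s_if = iforest.decision_scores_              # pyod: higher = more abnormal
    return s_svm, s_lof, s_if
```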
Table 3 reports the averages of these measurements over 20 trials with different initial network weights, and Fig. 2 shows the distributions of outlierness for each method except RPCA in the first run. As VAE uses a probability as the outlierness, only its AUC is included in Table 3. These results show that L0-AE outperforms the other methods in terms of both AUC and the separation of outlierness between inliers and outliers (i.e., the sparsity of the error matrix). RPCA performs significantly worse than the other methods because of its linearity. Among the AE-based methods, L0-AE shows the best performance. For L0-AE, we can see that the learned manifold is almost entirely composed of inliers. Therefore, it can be confirmed that the l0-norm constraint of L0-AE functions as intended, and L0-AE learns while almost completely eliminating the influence of the corrupted samples and capturing nonlinear features. In contrast, the performance of the other AE-based methods is inferior to that of L0-AE because they cannot completely exclude the influence of outliers. VAE is less accurate than the other AE-based methods; we consider that VAE cannot demonstrate robustness owing to the non-affinity of the decoder. For DRAE, the reconstruction errors of inliers and outliers are relatively well separated, but DRAE is more strongly affected by outliers than L0-AE because the DRAE objective function depends on the magnitudes of the outliers, while the L0-AE objective function does not.

We compare the detection accuracy on the real datasets. The AUC values are averaged over 50 trials with different random seeds. Table 4 presents the average AUCs for each dataset; Avg. AUC, Avg. rank, and Avg. time refer to the average AUC, the average rank over the datasets, and the average run-time, respectively. L0-AE demonstrates the highest average AUC and average rank. Among the reconstruction-based methods, L0-AE showed the highest AUCs for 8 out of 12 datasets. Especially on kdd99 rev, the AUC of L0-AE is considerably higher than those of the other AE-based methods. Because kdd99 rev has a high rate of outliers and they are distributed close to each other, the methods with l1-norm regularization and no regularization cannot avoid reconstructing the outliers, whereas L0-AE almost completely avoids reconstructing them because of its l0-norm constraint. Furthermore, we observed that the AUCs of RDA and RDA-Stbl are nearly equal. This shows the importance of the l0-norm constraint. L0-AE outperforms DRAE on average; we consider that L0-AE selectively reconstructs only the inliers, while DRAE reconstructs inliers and reduces the variance of each label, allowing outliers to affect the manifold. In addition, the computational cost of DRAE is higher than that of L0-AE, owing to the calculation of the threshold. For VAE, training was unstable for some datasets. One possible reason is that VAE involves random sampling in the reparametrization trick, which increases the randomness of the results under these experimental settings. In contrast, among the AE-based methods, L0-AE showed stable results. The RPCA results are relatively good on some datasets, suggesting that these datasets have linear features and the l0-norm regularization works; L0-AE shows good performance even on the other datasets by capturing nonlinear features.
The reason why RPCA outperforms some AE-based methods on average is that RPCA can automatically detect the rank of the inlier structure, while the AE-based methods use a fixed latent dimension (there is no known method for obtaining an appropriate latent dimension in an unsupervised setting).

Next, we evaluate the parameter sensitivity of L0-AE using the real datasets. Fig. 3 shows the AUCs of L0-AE for different C_p values (averaged over 50 trials). Overall, the maximum AUC values occur at C_p values moderately greater than the true outlier rates. If C_p is greater than the true outlier rate, outliers are safely detected as outliers; on the other hand, some inliers are then excluded from reconstruction training at a given epoch. However, such inliers still come to be reconstructed well because inliers are likely to be distributed close to each other. If C_p is less than the true outlier rate, the detection accuracy is still generally better than that of N-AE (C_p = 0) because some outliers are not reconstructed. For kdd99 rev, owing to the distribution of outliers mentioned above, the outliers are unexpectedly detected as inliers when C_p is small; for large C_p values, such outliers are safely detected as outliers. Therefore, the change in AUC with C_p is large for this dataset. The development of an automatic search method for the optimal C_p under the l0-norm constraint, without ground-truth labels, is important future work.

We compare the convergence of L0-AE with that of RDA. Here, we do not use mini-batch training, to remove the effect of randomness. Table 5 presents the total number of epochs in which the value of the objective function increased over the previous epoch during 20 trials (96,000 epochs in total). The results of N-AE are also included for reference. In addition, Fig. 4 shows two transition examples of the values of the objective functions of RDA and L0-AE. Among them, Fig. 4(d) shows the result of the only trial in which the objective function increased for L0-AE; the epochs in which the objective function increased were 294 to 302 when C_p = 0.3, with an average increase of 0.23, which is considerably smaller than the value of the objective function itself. In Table 5 and Fig. 4, we observe that L0-AE shows good convergence regardless of the parameter C_p, unlike RDA. This empirically demonstrates the validity of Theorem 1, which states that our alternating optimization algorithm converges when the gradient-based optimization behaves ideally. For RDA, when λ is small, the value of the objective function is unstable; when λ is large, the behavior of RDA approaches that of N-AE, and therefore the stability improves. We observe that, with N-AE, the values of the objective function do not increase, which implies that our gradient-based optimization essentially satisfies the assumption in Theorem 1.

In this paper, we proposed L0-norm Constrained Autoencoders (L0-AE) for unsupervised outlier detection. L0-AE decomposes data nonlinearly into a low-rank matrix and a sparse error matrix under the l0-norm constraint. We proposed an efficient alternating optimization algorithm for training L0-AE and proved that this algorithm converges under a mild condition. We conducted extensive experiments with real and artificial data and confirmed that L0-AE is highly robust to outliers. We also confirmed that this high robustness leads to higher outlier detection accuracy than that of existing outlier detection methods.
References:

Variational autoencoder based anomaly detection using reconstruction probability
Axiomatisations of the average and a further generalisation of monotonic sequences
LOF: identifying density-based local outliers
LP norm relaxation approach for large scale data analysis: a review
Robust principal component analysis?
Chainer: a flexible framework for neural networks
Outlier detection with autoencoder ensembles
Connections with robust PCA and the role of emergent sparsity in variational autoencoder models
GitHub - dganguli/robust-PCA: a simple Python implementation of R-PCA
Reducing the dimensionality of data with neural networks
Low-cost unsupervised outlier detection by autoencoders with robust estimation
Principal Component Analysis
Adam: a method for stochastic optimization
LoOP: local outlier probabilities
Isolation forest
ODDS library
Trimmed least squares estimation in the linear model
Unsupervised anomaly detection with generative adversarial networks to guide marker discovery
Estimating the support of a high-dimensional distribution
Learning discriminative reconstructions for unsupervised outlier removal
Deep structured energy based models for anomaly detection
GitHub - zc8340311/robustautoencoder: a combination of autoencoder and robust PCA
Anomaly detection with robust deep autoencoders
Stable principal component pursuit
Deep autoencoding Gaussian mixture model for unsupervised anomaly detection