Novelty Detection via Robust Variational Autoencoding
Chieh-Hsin Lai, Dongmian Zou, Gilad Lerman
2020-06-09

We propose a new method for novelty detection that can tolerate nontrivial corruption of the training points. Previous works assumed either no or very low corruption. Our method trains a robust variational autoencoder (VAE), which aims to generate a model for the uncorrupted training points. To gain robustness to corruption, we incorporate three changes to the common VAE: 1. Modeling the latent distribution as a mixture of Gaussian inliers and outliers, while using only the inlier component when testing; 2. Applying the Wasserstein-1 metric for regularization, instead of the Kullback-Leibler divergence; and 3. Using a least absolute deviation error for reconstruction, which is equivalent to assuming a heavy-tailed likelihood. We illustrate state-of-the-art results on standard benchmark datasets for novelty detection.

Novelty detection, also known as semi-supervised anomaly detection, refers to the task of detecting testing data points that deviate from the underlying structure of a given training dataset [1-3]. It finds crucial applications in areas such as insurance and credit fraud [4], mobile robots [5] and medical diagnosis [6]. Ideally, novelty detection requires learning the underlying distribution of the training data, although it is sometimes sufficient to learn a significant feature, geometric structure or another property of that data. One can then apply the learned distribution (or property) to detect deviating points in the test data. This is different from unsupervised anomaly detection, or outlier detection [1], in which one does not have training data and has to determine the deviating points in a sufficiently large dataset assuming that the majority of points share the same structure or properties. We note that novelty detection is equivalent to the well-known one-class classification problem [7]. In this problem, one needs to identify members of a class in test data, and consequently distinguish them from "novel" data points, given training points from this class. The points of the main class are commonly referred to as inliers and the novel ones as outliers. There are myriad solutions to one-class classification and, equivalently, to novelty detection. Nevertheless, such solutions often assume that the training set is purely sampled from a single class or that it has a very low fraction of corrupted samples. In some practical scenarios, it is hard to guarantee this assumption. For example, a recent study [8] shows that false positives and false negatives are common in COVID-19 tests. Therefore, one cannot design a pure set of one-class training points using such tests. We thus study a robust version of novelty detection that allows a nontrivial fraction of corrupted samples, namely outliers, within the training set. We solve this problem by using a special variational autoencoder (VAE) [9]. Our VAE is able to model the underlying distribution of the data, despite nontrivial corruption. We refer to our new method as "Mixture Autoencoding with Wasserstein penalty", or "MAW". In order to clarify it, we first review previous works and then explain our contributions in view of these works.
Solutions to one-class classification and novelty detection either estimate the density of the inlier distribution [10, 11] or determine a geometric property of the inliers, such as their boundary set [12-15]. When the inlier distribution is nicely approximated by a low-dimensional linear subspace, [16] proposes to distinguish between inliers and outliers via Principal Component Analysis (PCA). In order to consider more general cases of nonlinear low-dimensional structures, one may use autoencoders (or restricted Boltzmann machines), which nonlinearly generalize PCA [17, Ch. 2] and whose reconstruction error naturally provides a score for membership in the inlier class. Instances of this strategy with various architectures include [18-22]. In all of these works except [19], the training set is assumed to solely represent the inlier class. In fact, Perera et al. [21] observed that interpolation of a latent space, which was trained using digit images of a complex shape, can lead to digit representations of a simple shape. If there are also outliers (with a simple shape) among the inliers (with a complex shape), encoding the inlier distribution becomes even more difficult. Nevertheless, some previous works already explored the possibility of a corrupted training set [14, 15, 19]. In particular, [14, 19] test artificial instances with at most 5% corruption of the training set and [15] considers ratios of 10%, but with very small numbers of training points. In this work we consider corruption ratios up to 30%, with a method that tries to estimate the distribution of the training set, and not just a geometric property. VAEs [9] have been commonly used for generating distributions with reconstruction scores and are thus natural for novelty detection without corruption. They determine the latent code of an autoencoder via variational inference [23, 24]. Alternatively, they can be viewed as autoencoders for distributions that penalize the Kullback-Leibler (KL) divergence of the latent distribution from the prior distribution. The first VAE-based method for novelty detection was suggested by An and Cho [25]. It was recently extended by Daniel et al. [26], who modified the training objective. A variety of VAE models were also proposed for special anomaly detection problems, which are different from novelty detection [27-29]. Current VAE-based methods for novelty detection do not perform well when the training data is corrupted. Indeed, the learned distribution of any such method also represents the corruption, that is, the outlier component. To the best of our knowledge, no effective solutions were proposed for collapsing the outlier mode so that the trained VAE would only represent the inlier distribution. We remark that there are two other interesting variational-type models, the adversarial autoencoder (AAE) [30] and the Wasserstein autoencoder (WAE) [31]. The penalty term of the AAE takes the form of a generative adversarial network (GAN) [17], where its generator is the encoder. The WAE generalizes the AAE with a framework that minimizes the Wasserstein distance between the sample distribution and the inference distribution. It reformulates the Wasserstein distance so that it can be implemented in the form of an AAE. There are two relevant lines of work on robustness in linear modeling that can be used in nonlinear settings via autoencoders or VAEs.
Robust PCA aims to deal with sparse elementwise corruption of a data matrix [32-35]. Robust subspace recovery (RSR) aims to address general corruption of selected data points and thus better fits the framework of outliers [36, 33, 37-47]. Autoencoders that use robust PCA for anomaly detection tasks were proposed in [48, 49]. Dai et al. [50] show that a VAE can be interpreted as a nonlinear robust PCA problem. Nevertheless, explicit regularization is often required to improve robustness to sparse corruption in VAEs [51, 52]. RSR was successfully applied to outlier detection by Lai et al. [53]. One can apply their work to the different setting of novelty detection; however, our proposed VAE formulation seems to work better. Nevertheless, our dimension reduction component is inspired by their RSR layer and our use of the Wasserstein-1 loss is motivated by their Proposition 5.1.

We propose a robust novelty detection procedure, MAW, that aims to model the distribution of the training data in the presence of a nontrivial fraction of outliers. It has the following new features:
• MAW uses a Gaussian mixture to model the co-existence of inliers and outliers in the latent distribution. It then only applies the inlier distribution for testing. Our variational setting accommodates this mixture model and does not require previously used tools such as the construction of another network [19] or the application of clustering algorithms [54]. Another difference from [54, 19] is that they use Gaussian mixture models for different modes of inliers, whereas our setting assumes a single inlier mode. Also, previous VAE-based methods for novelty detection [25, 26] used diagonal covariances in their models, whereas we use general covariances.
• For reconstruction, MAW replaces the common least squares formulation with a least absolute deviations formulation (a short code sketch of this loss appears after this overview). This can be justified either by the use of a robust loss that yields a more robust estimator [55] or by the use of a likelihood function with a heavier tail.
• For the latent code penalty, MAW uses the Wasserstein-1 metric. It is hard to guarantee robustness of this metric in our setting, though Proposition 5.1 of [53] with p = 1 guarantees such robustness for a different setting. We emphasize that VAE, AAE and WAE use instead the KL-divergence for the penalty, and that the Wasserstein metric in WAE is used to measure the distance between the data distribution and the generated distribution and does not appear in the latent code. Nevertheless, one may view this contribution as a variant of AAE, which replaces the GAN with a Wasserstein GAN (WGAN) [56].
• At last, MAW is attractive for practitioners. It is simple to implement in any standard deep learning library, and is easily adaptable to other choices of network architecture, energy functions and similarity scores. Moreover, it achieves state-of-the-art results on popular anomaly detection datasets.
We explain the MAW paradigm and its various details of implementation in §2. We carefully test it in comparison with state-of-the-art methods in §3. At last, we conclude this work in §4. We motivate and overview the underlying model and assumptions of MAW in §2.1. We describe the simple implementation details of its components in §2.2. Fig. 1 illustrates the general idea of MAW and can assist in reading this section.
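To make the least absolute deviations choice in the second bullet above concrete, here is a minimal TensorFlow sketch (our own illustration, not the authors' code) contrasting the per-sample, un-squared ℓ2-norm reconstruction loss with the usual mean squared error; the function names and the batch-averaging convention are our assumptions.

```python
import tensorflow as tf

def lad_reconstruction_loss(x, x_hat):
    """Least absolute deviations in the sense used conceptually by MAW:
    the (un-squared) l2 norm of each reconstruction error, averaged over
    the batch."""
    residual = tf.reshape(x - x_hat, (tf.shape(x)[0], -1))
    return tf.reduce_mean(tf.norm(residual, ord=2, axis=1))

def mse_reconstruction_loss(x, x_hat):
    """Standard squared-error loss, corresponding to a Gaussian likelihood."""
    residual = tf.reshape(x - x_hat, (tf.shape(x)[0], -1))
    return tf.reduce_mean(tf.reduce_sum(residual ** 2, axis=1))
```

The un-squared norm grows only linearly with the residual, so a few badly reconstructed (outlying) training points influence the loss less than under the squared version, which is the robustness argument made above.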
MAW aims to robustly estimate a mixture inlier-outlier distribution for the training data and then use its inlier component to detect outliers in the testing data. For this purpose, it designs a novel variational autoencoder with an underlying mixture model and a robust loss function in the latent space. We find the variational framework natural for novelty detection. Indeed, it learns a distribution that describes the inlier training examples and generalizes to the inlier test data. Moreover, the variational formulation allows direct modeling of a Gaussian mixture model in the latent space, unlike a standard autoencoder. We assume L training points in R^D, which we designate by {x^(i)}_{i=1}^L. Let x be a random variable on R^D with the unknown training data distribution, which we estimate by the empirical distribution of the training points. We assume a latent random variable z of low and even dimension 2 ≤ d ≤ D, where our default choice is d = 2. We further assume a standardized Gaussian prior, p(z), so that z ~ N(0, I_{d×d}). The posterior distribution p(z|x) is unknown. However, we assume an approximation to it, which we denote by q(z|x), such that z|x is a mixture of two Gaussian distributions representing the inlier and outlier components. More specifically, z|x ~ η N(µ_1, Σ_1) + (1 − η) N(µ_2, Σ_2), where we explain its parameters next. We assume that η > 0.5, where our default value is η = 5/6, so that the first mode of z represents the inliers and the second one represents the outliers. The other parameters are generated by the encoder network and a following dimension reduction component. The dimension reduction component involves a mapping from a higher-dimensional space onto the latent space. It is analogous to the RSR layer in [53] that projects encoded points onto the latent space. Due to this reduction, we assume that the mapped covariance matrices of z|x are full, unlike common single-mode VAE models that assume a diagonal covariance [9, 25]. Our underlying assumption is that the inliers lie on a low-dimensional structure and we thus enforce the lower rank d/2 for Σ_1, but allow Σ_2 to have full rank d. Nevertheless, we later describe a necessary regularization of both matrices by the identity. Following the VAE framework, we approximate the unknown posterior distribution p(z|x) within the variational family Q = {q(z|x)}, which is indexed by µ_1, Σ_1, µ_2 and Σ_2. A standard VAE framework would minimize the expected KL-divergence from p(z|x) to q(z|x) in Q, where the expectation is taken over p(x). By Bayes' rule this is equivalent to maximizing the evidence lower bound (ELBO):
ELBO(q) = E_{p(x)} [ E_{q(z|x)} log p(x|z) − KL(q(z|x) || p(z)) ].
The first term of the ELBO is the reconstruction likelihood. Its second term restricts the deviation of q(z|x) from p(z) and can be viewed as a regularization term. We use a more robust version of the ELBO with a different regularization. That is, we replace E_{p(x)} KL(q(z|x) || p(z)) with W_1(q(z), p(z)), where W_1 denotes the Wasserstein-1 distance. We remark that the W_1 distance cannot be computed between q(z|x) and p(z) and we thus replace q(z|x) with its expected distribution, q(z) = E_{p(x)} q(z|x). We refer to our new function as ELBO-Wasserstein, or in short, ELBOW, and summarize it as follows:
ELBOW(q) = E_{p(x)} E_{q(z|x)} log p(x|z) − W_1(q(z), p(z)).   (1)
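The following NumPy sketch illustrates, under our own simplifying assumptions, what sampling from the assumed posterior mixture q(z|x) = ηN(µ_1, Σ_1) + (1 − η)N(µ_2, Σ_2) can look like, with a rank-d/2 factor for the inlier covariance and the identity regularization mentioned above. The factor parameterization Σ_j = A_j A_j^T + I and all names are illustrative, not the paper's exact construction.

```python
import numpy as np

def sample_latent_mixture(mu1, mu2, A1, A2, eta=5/6, T=5, rng=None):
    """Draw T samples from eta * N(mu1, Sigma1) + (1 - eta) * N(mu2, Sigma2),
    where Sigma_j = A_j @ A_j.T + I (identity added for positive definiteness).
    A1 has d/2 columns (low-rank inlier covariance) and A2 has d columns."""
    rng = np.random.default_rng() if rng is None else rng
    d = mu1.shape[0]
    sigma1 = A1 @ A1.T + np.eye(d)
    sigma2 = A2 @ A2.T + np.eye(d)
    samples = np.empty((T, d))
    for t in range(T):
        if rng.random() < eta:   # inlier component
            samples[t] = rng.multivariate_normal(mu1, sigma1)
        else:                    # outlier component
            samples[t] = rng.multivariate_normal(mu2, sigma2)
    return samples

# example with d = 2 (MAW's default latent dimension)
z = sample_latent_mixture(np.zeros(2), np.ones(2),
                          A1=np.array([[1.0], [0.5]]),            # rank d/2 = 1 factor
                          A2=np.array([[1.0, 0.0], [0.0, 1.0]]))  # full-rank factor
```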
Following the VAE framework, we use a Monte-Carlo approximation to estimate E_{q(z|x)} log p(x|z) with i.i.d. samples {z^(t)}_{t=1}^T from q(z|x) as follows:
E_{q(z|x)} log p(x|z) ≈ (1/T) Σ_{t=1}^T log p(x|z^(t)).   (2)
To improve the robustness of our model, we choose the log-likelihood function log p(x|z^(t)) to be a negative constant multiple of the ℓ2 norm of the difference between the random variable x and a mapping of the sample z^(t) from R^d to R^D by the decoder D, that is,
log p(x|z^(t)) ∝ −||x − D(z^(t))||_2.   (3)
Note that we deviate from the common choice of the squared ℓ2 norm, which corresponds to an underlying Gaussian likelihood, and assume instead a likelihood with a heavier tail. For each training point x^(i), 1 ≤ i ≤ L, we draw samples {z^(i,t)}_{t=1}^T from q(z|x^(i)), where all samples are independent. Using the aggregation formula q(z) = (1/L) Σ_{i=1}^L q(z|x^(i)), which is also used by an AAE, the approximation of p(x) by the empirical distribution of the training data, and (1)-(3), MAW applies the following approximation of −ELBOW(q):
−ELBOW(q) ≈ (1/(LT)) Σ_{i=1}^L Σ_{t=1}^T ||x^(i) − D(z^(i,t))||_2 + W_1(q(z), p(z)).   (4)
Details of minimizing (4) are described in §2.2. We remark that the procedure described in §2.2 is independent of the multiplicative constant in (3) and therefore this constant is ignored in (4). During testing, MAW records a similarity score between each given test point and points generated from the learned inlier component of z|x. Inliers and outliers in the testing set are identified according to high and low values of this score, respectively. The fact that this reconstruction score is solely based on the inlier distribution N(µ_1, Σ_1) should be advantageous for anomaly detection.

MAW has a VAE-type structure, but it also uses a WGAN for minimizing the W_1 loss in (4). We provide here details of implementing these structures. Some specific choices of the networks are described in §3 since they may depend on the type of dataset. The VAE-type structure of MAW contains three ingredients: an encoder, a dimension reduction component and a decoder. The encoder is a neural network E that maps each training sample x^(i) to intermediate parameters of the two mixture components, namely mean vectors and covariance factors M^(i)_1 and M^(i)_2 in a higher-dimensional space. The dimension reduction component, which we represent by a matrix A, maps these parameters onto the d-dimensional latent space and yields the means µ^(i)_1, µ^(i)_2 and covariance matrices Σ^(i)_1, Σ^(i)_2 according to (5) and (6). For j = 1, we first need to reduce the rank of M^(i)_1 to d/2. Since the TensorFlow package requires numerically significant positive definiteness of covariance matrices, we add an identity matrix to both Σ^(i)_1 and Σ^(i)_2. The decoder, D : R^d → R^D, maps latent samples into the reconstructed data space. The loss function associated with the VAE structure is the first term in (4). We can write it over a batch {x^(i)}_{i∈I} as
L_VAE(θ, A, ϕ) = (1/(IT)) Σ_{i∈I} Σ_{t=1}^T ||x^(i) − D(z^(i,t)_gen)||_2,   (7)
where {z^(i,t)_gen}_{t=1}^T are sampled from q(z|x^(i)). The dependence of this loss on E and A is implicit, but follows from the fact that the parameters of the sampling distribution of each z_gen were obtained by E and A. The WGAN seeks to minimize the second term in (4). The generator of this WGAN is composed of the encoder E and the dimension reduction component, which we represent by A. It generates the samples {z^(i,t)_gen}, which the discriminator Dis aims to distinguish from samples {z^(i,t)_prior} drawn from the prior p(z). The discriminator minimizes the loss
L_W1(δ) = (1/(IT)) Σ_{i∈I} Σ_{t=1}^T Dis(z^(i,t)_gen) − (1/(IT)) Σ_{i∈I} Σ_{t=1}^T Dis(z^(i,t)_prior),   (8)
whereas the generator (E and A) maximizes (8). Since the second term in (8) does not contain parameters of the generator, its maximization is equivalent to minimization of the loss function
−(1/(IT)) Σ_{i∈I} Σ_{t=1}^T Dis(z^(i,t)_gen).   (9)
During the training phase, MAW alternately minimizes the losses (7)-(9) instead of minimizing a weighted sum. Therefore, any multiplicative constant in front of either term of (4) will not affect the optimization. In particular, it was legitimate to omit the multiplicative constant of (3) when deriving (4). For each testing data point y^(j), we sample {z^(j,t)_in}_{t=1}^T from the learned inlier mode of the corresponding latent Gaussian mixture and decode them as ỹ^(j,t) = D(z^(j,t)_in). Using a similarity measure S(·,·), where our default choice is the cosine similarity, we compute S^(j) = Σ_{t=1}^T S(y^(j), ỹ^(j,t)). If S^(j) is larger than a chosen threshold, then y^(j) is classified as normal, and otherwise as novel. We further detail the procedures for training MAW and applying it for novelty detection in Algorithms 1 and 2, respectively.
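As a small illustration of the test-time scoring just described, the sketch below (our own code, with hypothetical names) sums cosine similarities between a test point and its T decodings from the inlier mode and thresholds the result.

```python
import numpy as np

def novelty_score(y, decoded_samples):
    """S^(j) = sum_t cos_sim(y^(j), y_tilde^(j,t)), where decoded_samples
    holds the T decodings of latent draws from the learned inlier mode.
    Larger scores indicate normal points."""
    y = y.ravel()
    score = 0.0
    for y_tilde in decoded_samples:
        y_tilde = y_tilde.ravel()
        score += float(y @ y_tilde) / (np.linalg.norm(y) * np.linalg.norm(y_tilde) + 1e-12)
    return score

def classify(y, decoded_samples, threshold):
    """Declare a test point normal if its score exceeds the chosen threshold."""
    return "normal" if novelty_score(y, decoded_samples) >= threshold else "novelty"
```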
In these descriptions, we denote by θ, ϕ and δ the trainable parameters of the encoder E, decoder D and discriminator Dis, respectively. Recall that A includes the trained parameters of the dimension reduction component. Our default input parameters are specified in §3.

Algorithm 1 (training MAW). Input: training data {x^(i)}_{i=1}^L; initialized parameters θ, ϕ and δ of E, D and Dis, respectively; initialized A; weight η; number of epochs; batch size I; sampling number T; learning rate α. Output: trained parameters θ, A and ϕ. For each epoch and each batch {x^(i)}_{i∈I}: compute the mixture parameters according to (5) and (6); for t = 1, ..., T, sample z^(i,t)_gen from q(z|x^(i)) and z^(i,t)_prior from p(z); update (θ, A, ϕ) ← (θ, A, ϕ) − α∇_(θ,A,ϕ) L_VAE(θ, A, ϕ) according to (7); update δ ← δ − α∇_δ L_W1(δ) according to (8); update (θ, A) by a gradient step on the generator loss (9).

Algorithm 2 (novelty detection with MAW). Input: test data {y^(j)}_{j=1}^N; trained parameters θ, A and ϕ; sampling number T; a threshold. For each y^(j): compute the inlier parameters µ^(j)_1 and Σ^(j)_1 according to (5) and (6); for t = 1, ..., T, sample z^(j,t)_in from N(µ^(j)_1, Σ^(j)_1), decode ỹ^(j,t) = D(z^(j,t)_in) and compute S(y^(j), ỹ^(j,t)); if S^(j) = Σ_{t=1}^T S(y^(j), ỹ^(j,t)) is at least the threshold, then y^(j) is a normal example; otherwise it is a novelty. Return the normality labels for j = 1, ..., N.

We describe the competing state-of-the-art methods and the experimental choices for all methods, including MAW, in §3.1. We report the experimental setting, which involves four representative datasets, and results in §3.2. We compared MAW with the following methods: Deep Autoencoding Gaussian Mixture Model (DAGMM) [19], Deep Structured Energy-Based Models (DSEBMs) [18], Isolation Forest (IF) [57], Local Outlier Factor (LOF) [12], One-class Novelty Detection Using GANs (OCGAN) [21], One-Class SVM (OCSVM) [58] and RSR Autoencoder (RSRAE) [53]. DAGMM, DSEBMs, OCGAN and OCSVM were proposed for novelty detection. IF, LOF and RSRAE were originally proposed for outlier detection, but we apply their trained models for detecting novelties in the test data. For DSEBMs and DAGMM we used the codes of [59]. For LOF, OCSVM and IF we used the scikit-learn [60] packages for novelty detection. For OCGAN we used its TensorFlow implementation from https://pypi.org/project/ocgan. For RSRAE, we adapted the code of [53] to novelty detection. For MAW and the above four reconstruction-based methods, that is, DAGMM, DSEBMs, OCGAN and RSRAE, we use the following structure of encoders and decoders, which varies with the type of data (images or non-images). For non-images, which are mapped to feature vectors of dimension D, the encoder is a fully connected network with output channels (32, 64, 128, 128 × 4). The decoder is a fully connected network with output channels (128, 64, 32, D), followed by a normalization layer at the end. For image datasets, the encoder has three convolutional layers with output channels (32, 64, 128), kernel sizes (5 × 5, 5 × 5, 3 × 3) and strides (2, 2, 2). Its output is flattened to lie in R^128 and then mapped into a 128 × 4 dimensional vector using a dense layer (with output channels 128 × 4). The decoder for image datasets first applies a dense layer from R^2 to R^128 and then three deconvolutional layers with output channels (64, 32, 3), kernel sizes (3 × 3, 5 × 5, 5 × 5) and strides (2, 2, 2). Some specific parameters of MAW are set as follows. Intrinsic dimension: d = 2; mixture parameter: η = 5/6; sampling number: T = 5; and size of A (used for dimension reduction): 128 × 2. The matrix A and the network parameters θ, ϕ and δ are initialized by the Glorot uniform initializer [61]. For all experiments, the discriminator is a fully connected network with sizes (32, 64, 128, 1). All neural networks were implemented with TensorFlow [62] and trained for 100 epochs with batch size 128.
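Below is a minimal Keras sketch of the image encoder and the discriminator with the stated output channels, kernel sizes and strides; the input size, the ReLU activations and the omission of batch normalization are our assumptions and simplifications, not the exact networks used in the paper.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Image encoder: three convolutional layers with output channels (32, 64, 128),
# kernel sizes (5, 5, 3) and strides (2, 2, 2), then flattening and a dense
# layer with 128 * 4 output channels.  The 64 x 64 x 3 input is assumed for
# concreteness (the COVID-19 images); Caltech101 images would be 32 x 32 x 3.
def build_image_encoder(input_shape=(64, 64, 3)):
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 5, strides=2, padding="same", activation="relu"),
        layers.Conv2D(64, 5, strides=2, padding="same", activation="relu"),
        layers.Conv2D(128, 3, strides=2, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(128 * 4),
    ])

# WGAN discriminator: a fully connected network with sizes (32, 64, 128, 1),
# acting on latent samples of dimension d = 2.
def build_discriminator(d=2):
    return keras.Sequential([
        keras.Input(shape=(d,)),
        layers.Dense(32, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(1),
    ])
```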
We apply batch normalization to each layer of any neural network. The competing neural-network-based methods were optimized by Adam [63] with learning rate 0.00005. For the VAE structure of MAW, we use Adam with learning rate 0.00005. For the WGAN discriminator of MAW, we perform RMSprop [10] with learning rate 0.0005, following the recommendation of [56] for WGAN. All experiments were executed on a Linux machine with 64GB RAM and four GTX1080Ti GPUs.

We test our method on four datasets for novelty detection: KDDCUP-99 [64], the COVID-19 Radiography database [65], Caltech101 [66] and Reuters-21578 [67]. We distinguish between image datasets (COVID-19 and Caltech101) and non-image datasets (KDDCUP-99 and Reuters-21578). We carefully describe each dataset, common preprocessing procedures and the choices of their largest clusters in the supplementary material. Each dataset contains several clusters (2 for KDDCUP-99, 3 for COVID-19, the 11 largest ones for Caltech101 and the 5 largest single-labelled ones for Reuters-21578). We arbitrarily fix a class and uniformly sample N training inliers and N_test testing inliers from that class. We let N = 6000, 160, 100 and 350 and N_test = 1200, 60, 100 and 140 for KDDCUP-99, COVID-19, Caltech101 and Reuters-21578, respectively. We then fix a value c among 10%, 20% or 30%, and uniformly sample a fraction c of outliers from the rest of the clusters for the training data. We also fix c_test in {0.1, 0.3, 0.5, 0.7, 0.9} and uniformly sample a fraction c_test of outliers from the rest of the clusters for the testing data. Using all possible thresholds for the finite datasets, we compute the AUC (area under curve) and AP (average precision) scores, while considering the outliers as "positive". For each fixed c = 0.1, 0.2 and 0.3, we average these results over the values of c_test, the different choices of inlier clusters (among all possible clusters), and three runs with different random initializations for each of these choices. We also compute the corresponding standard deviations. We report these results in Fig. 2 and specify the numerical values in the supplementary material. We observe state-of-the-art performance of MAW on all of these datasets. On Reuters-21578, DSEBMs performs slightly better than MAW and OCSVM has comparable performance; however, these two methods are not competitive on the rest of the datasets.
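For concreteness, here is a short scikit-learn sketch (our own, with an assumed sign convention) of how the AUC and AP scores can be computed with outliers treated as the positive class and then averaged over the test corruption ratios and repeated runs.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(similarity_scores, is_outlier):
    """AUC and AP with outliers as the positive class.  MAW's similarity score
    is high for inliers, so we negate it so that larger values indicate
    outliers (a sign convention we assume here)."""
    y_true = np.asarray(is_outlier, dtype=int)
    y_score = -np.asarray(similarity_scores)
    return roc_auc_score(y_true, y_score), average_precision_score(y_true, y_score)

# Averaging over c_test values, inlier-class choices and random runs:
results = []   # one (auc, ap) pair per experimental repetition
# for each repetition: results.append(evaluate(scores, labels))
if results:
    mean_auc, mean_ap = np.mean(results, axis=0)
    std_auc, std_ap = np.std(results, axis=0)
```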
We introduced MAW, a robust VAE-type framework for novelty detection that can tolerate corruption of the training data. We demonstrated state-of-the-art performance of MAW on a variety of datasets. As with other deep learning frameworks, it is hard to theoretically guarantee the performance of MAW. Nevertheless, our supplementary material provides careful experimental validation that omitting any of the new ideas results in a significant decrease of accuracy. These new ideas include least absolute deviations for reconstruction, the W_1 metric for regularization, the Gaussian mixture model and the lower rank for the inlier covariance matrix. These experiments further demonstrate the advantage over a traditional VAE. We hope to further understand and extend our proposal in the following ways. First, we plan to extend and test some of our ideas for the different problem of robust generation, in particular, for building generative networks that are robust against adversarial training data. Second, we would like to carefully study the virtue of our idea of modeling the most significant mode in the training data. In particular, when extending the work to generation, one has to verify that this idea does not lead to mode collapse. Furthermore, we would like to explore any tradeoff of this idea, as well as of our setting of robust novelty detection, with fairness. At last, we plan to study the robustness of the W_1 metric in our setting. Proposition 5.1 of [53] with p = 1 implies robustness under a simpler and different setting, and we hope to extend it to Gaussian mixtures that represent our scenario. Even if we cannot prove what we aim for, we plan to formulate some interesting mathematical conjectures.

Novelty detection is an essential problem in machine learning with various important applications discussed in our manuscript. This problem and its equivalent versions of one-class classification and semi-supervised anomaly detection have been studied by broad research communities. Academic work on novelty detection has mainly been pursued under the assumption that the training set is pure, that is, contains no outliers. However, as we clarified in the introduction, in some applications it may be hard to guarantee this assumption. We thus expect our proposed method for novelty detection in the presence of corruption to impact practitioners. For example, it can assist in medical diagnosis, where the training data contains unavoidable corruptions, as in the case of the COVID-19 tests discussed in [8]. Our method tries to generate a model for the corrupted training set. We believe that our ideas may open the door for robust generation in the presence of corrupted samples and we leave this direction for future work. Nevertheless, we have focused here on quantitative evaluation of novelty detection using both the AP and AUC scores, unlike the current subjective evaluation of generation. We followed the extensive past experience of testing and evaluation in this well-established area, while comparing with seven established methods. Our proposed model and algorithm are generic and can be easily adapted for various practical applications. We plan to post our code online, so practitioners can easily use it. We tested the method on several datasets of different nature. One may need to adapt our method for other types of datasets, but we do not expect this to be difficult, especially if standard neural networks have already been formed for such sets. We do not provide any theoretical guarantees for our work, since guarantees for deep learning, and especially for generative tasks, are hard to obtain. Nevertheless, we mention a future theoretical direction that we find interesting. Even if one may not establish this direction, one can formulate some interesting mathematical conjectures that might later be addressed by experts. Since we try to find a specific model for a main behavior of the training data, it is important to explore in future work whether this is fair to underrepresented communities in the sampled data, and to quantify the tradeoff with fairness. It is also interesting to explore the tradeoff with fairness of the theoretical setting of novelty detection in the presence of corruption. One should then try to distinguish between the effect on fairness of the theoretical setting and the effect of our particular algorithm and its choices. This study should also aim to include necessary requirements on the training and testing procedures that may guarantee sufficient fairness for both the mathematical problem and the proposed method. Nevertheless, we did not notice any bias within the reported experiments.
In §A we experimentally validate the contribution of each novel feature of MAW. In §B we study the sensitivity of MAW with respect to different intrinsic dimensions. In §C we report the numerical values used to create Fig. 2 of the main manuscript. In §D we summarize the descriptions of the benchmark methods. At last, in §E we provide details about the four datasets we used.

We experimentally validate the effect of the following four new features of MAW: the least absolute deviation for reconstruction, the W_1 metric for the GAN regularization, the Gaussian mixture model assumption and the lower rank constraint for the inlier mode. We also compare with a standard variational autoencoder (VAE) [9]. We thus compare MAW with the following methods for novelty detection, where the first four methods each change one essential component of MAW to a traditional one and the fifth method is a traditional VAE.
• MAW-MSE replaces the least absolute deviation loss L_VAE with the common mean squared error.
• MAW-KL divergence replaces the Wasserstein regularization L_W1 with the KL-divergence. This is implemented by replacing the WGAN discriminator with a standard GAN.
• MAW-same rank uses the same rank d for both covariance matrices Σ_1 and Σ_2.
• A fourth variant replaces the latent Gaussian mixture with a single Gaussian distribution with a full covariance matrix.
• VAE has the same encoder and decoder structures as MAW. Instead of a dimension reduction component, it uses a dense layer which maps the output of the encoder to a 4-dimensional vector composed of a 2-dimensional mean and a 2-dimensional diagonal covariance. This is common for a traditional VAE.
For convenience, we overview the benchmark methods compared with MAW, presented in alphabetical order of their names. We will include all tested codes in a supplemental webpage. Deep Autoencoding Gaussian Mixture Model (DAGMM) [19] is a deep autoencoder model. It optimizes an end-to-end structure that contains both an autoencoder and an estimator for a Gaussian mixture model. Anomalies are detected using this Gaussian mixture model. We remark that this mixture model is proposed for the inliers. Deep Structured Energy-Based Models (DSEBMs) [18] makes decisions based on an energy function, which is the negative log probability that a sample follows the data distribution. The energy-based model is connected to an autoencoder in order to avoid the need for complex sampling methods. Isolation Forest (IF) [57] iteratively constructs special binary trees for the training dataset and identifies anomalies in the testing set as the points with short average path lengths in the trees. Local Outlier Factor (LOF) [12] measures how isolated a data point is from its surrounding neighborhood. This measure is based on an estimation of the local density of a data point using its k nearest neighbors. In the novelty detection setting, it identifies novelties according to low density regions learned from the training data. One-class Novelty Detection Using GANs (OCGAN) [21] is composed of four neural networks: a denoising autoencoder, two adversarial discriminators and a classifier. It aims to adversarially push the autoencoder to learn only the inlier features. One-Class SVM (OCSVM) [58] estimates the margin of the training set, which is used as the decision boundary for the testing set. Usually it utilizes a radial basis function kernel to obtain flexibility. Robust Subspace Recovery Autoencoder (RSRAE) [53] uses an autoencoder structure together with a linear RSR layer imposed with a penalty based on the ℓ_{2,1} energy. The RSR layer extracts features of the inliers in the latent code while helping to reject outliers. The instances with higher reconstruction errors are viewed as outliers. RSRAE trains a model using the training data; we then apply this model for detecting novelties in the test data.
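The classical baselines above are available in scikit-learn; the snippet below is an illustrative (not the paper's) way to run them in a novelty-detection mode, with placeholder data and default-ish hyperparameters.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Fit on the (possibly corrupted) training set, then score the test set.
# The data below are stand-ins for preprocessed feature vectors.
X_train = np.random.randn(200, 32)
X_test = np.random.randn(50, 32)

models = {
    "IF": IsolationForest(n_estimators=100, random_state=0),
    "LOF": LocalOutlierFactor(n_neighbors=20, novelty=True),
    "OCSVM": OneClassSVM(kernel="rbf", gamma="scale"),
}
for name, model in models.items():
    model.fit(X_train)
    scores = model.decision_function(X_test)   # higher = more normal
    print(name, scores[:3])
```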
We provide additional details on the different datasets used in the tests and further summarize the number of inliers and outliers per dataset (for both training and testing) in Table 10. KDDCUP-99 is a classic dataset for intrusion detection. It contains feature vectors of connections between internet protocols and a binary label for each feature vector identifying normal vs. abnormal ones. The abnormal ones are associated with an "attack" or "intrusion". COVID-19 (Radiography) contains chest X-ray RGB images, which are labeled according to the following three categories: COVID-19 positive, normal and viral pneumonia cases. We resize the images to size 64 × 64 and rescale the pixel intensities to lie in [−1, 1]. Caltech101 contains RGB images of objects from 101 categories with identifying labels. Following [53], we use the largest 11 classes, preprocess their images to have size 32 × 32 and rescale the pixel intensities to lie in [−1, 1]. Reuters-21578 contains 21,578 documents with 90 text categories and multiple labels. Following [53], we consider the five largest classes with single labels. We utilize the scikit-learn TFIDF and Hashing Vectorizer packages [68] to preprocess the documents into 26,147-dimensional vectors. We remark that COVID-19, Caltech101 and Reuters-21578 separate between training and testing data points. For KDDCUP-99, we randomly split it into training and testing datasets of equal sizes.
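A rough sketch of the preprocessing steps just described follows (image resizing and rescaling to [−1, 1], and hashing plus TF-IDF for the documents). The specific functions, the assumption of 8-bit pixel inputs and the number of hashing features are ours and need not match the authors' pipeline that produced 26,147-dimensional vectors.

```python
import tensorflow as tf
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Image preprocessing: resize to 64 x 64 (COVID-19) or 32 x 32 (Caltech101)
# and rescale pixel intensities to [-1, 1], assuming inputs in [0, 255].
def preprocess_images(images, size=(64, 64)):
    resized = tf.image.resize(tf.convert_to_tensor(images, dtype=tf.float32), size)
    return (resized / 127.5) - 1.0

# Text preprocessing for Reuters-21578: hashing followed by TF-IDF weighting.
# The number of hashing features is a placeholder.
def preprocess_documents(raw_documents, n_features=2**15):
    counts = HashingVectorizer(n_features=n_features, alternate_sign=False).transform(raw_documents)
    return TfidfTransformer().fit_transform(counts)
```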
References
[1] Anomaly detection: A survey.
[2] A review of novelty detection.
[3] Deep learning for anomaly detection: A survey.
[4] A state of the art survey of data mining-based fraud detection and credit scoring.
[5] Real-time automated visual inspection using mobile robots.
[6] Anomaly detection for medical images based on a one-class classification.
[7] Network constraints and multi-objective optimization for one-class classification.
[8] False-negative of RT-PCR and prolonged nucleic acid conversion in COVID-19: Rather than recurrence.
[9] Auto-encoding variational Bayes.
[10] Non-local manifold tangent learning.
[11] Gaussian mixture pdf in one-class classification: computing and utilizing confidence values.
[12] LOF: identifying density-based local outliers.
[13] Support vector method for novelty detection.
[14] Robust one-class SVM for fault detection.
[15] Robust support vector data description for novelty detection with contaminated data.
[16] A novel anomaly detection scheme based on principal component classifier.
[17] Deep Learning.
[18] Deep structured energy based models for anomaly detection.
[19] Deep autoencoding gaussian mixture model for unsupervised anomaly detection.
[20] Adversarially learned one-class classifier for novelty detection.
[21] OCGAN: One-class novelty detection using gans with constrained latent representations.
[22] Generative probabilistic novelty detection with adversarial autoencoders.
[23] An introduction to variational methods for graphical models.
[24] Variational inference: A review for statisticians.
[25] Variational autoencoder based anomaly detection using reconstruction probability.
[26] Deep variational semi-supervised novelty detection.
[27] Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications.
[28] VELC: A new variational autoencoder based model for time series anomaly detection.
[29] Anomaly detection with conditional variational autoencoders.
[30] Adversarial autoencoders.
[31] Wasserstein auto-encoders.
[32] Robust principal component analysis?
[33] A framework for robust subspace learning.
[34] Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization.
[35] Static and dynamic robust PCA and matrix completion: A review.
[36] Some Problems in Orthogonal Distance and Non-Orthogonal Distance Regression. Defense Technical Information Center.
[37] R1-PCA: rotational invariant l1-norm principal component analysis for robust subspace factorization.
[38] Median K-flats for hybrid linear modeling with many outliers.
[39] Two proposals for robust PCA using semidefinite programming.
[40] Robust PCA via outlier pursuit.
[41] lp-recovery of the most significant subspace among multiple subspaces with outliers.
[42] A novel M-estimator for robust PCA.
[43] Robust computation of linear models by convex relaxation.
[44] Fast, robust and non-convex subspace recovery.
[45] A well-tempered landscape for non-convex robust subspace recovery.
[46] An overview of robust subspace recovery.
[47] Robust subspace recovery with adversarial outliers.
[48] Robust, deep and inductive anomaly detection.
[49] Anomaly detection with robust deep autoencoders.
[50] Connections with robust pca and the role of emergent sparsity in variational autoencoder models.
[51] Robust variational autoencoder.
[52] Robust variational autoencoders for outlier detection and repair of mixed-type data.
[53] Robust subspace recovery layer for unsupervised anomaly detection.
[54] Clustering and unsupervised anomaly detection with l2 normalized deep auto-encoder representations.
[55] Breakdown points of affine equivariant estimators of multivariate location and covariance matrices.
[56] Wasserstein generative adversarial networks.
[57] Isolation forest.
[58] One class support vector machines for detecting anomalous windows registry accesses.
[59] Deep anomaly detection using geometric transformations.
[60] API design for machine learning software: experiences from the scikit-learn project.
[61] Understanding the difficulty of training deep feedforward neural networks.
[62] TensorFlow: Large-scale machine learning on heterogeneous systems.
[63] Adam: A method for stochastic optimization.
[64] UCI machine learning repository.
[65] Can AI help in screening viral and COVID-19 pneumonia?
[66] Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories.
[67] Reuters-21578 text categorization test collection.
[68] Mining of massive datasets.

This research has been supported by NSF award DMS 18-30418.