title: Anomaly Detection for High-Dimensional Data Using Large Deviations Principle
authors: Guggilam, Sreelekha; Chandola, Varun; Patra, Abani
date: 2021-09-28

Abstract: Most current anomaly detection methods suffer from the curse of dimensionality when dealing with high-dimensional data. We propose an anomaly detection algorithm that can scale to high-dimensional data using concepts from the theory of large deviations. The proposed Large Deviations Anomaly Detection (LAD) algorithm is shown to outperform state-of-the-art anomaly detection methods on a variety of large and high-dimensional benchmark data sets. Exploiting the ability of the algorithm to scale to high-dimensional data, we propose an online anomaly detection method to identify anomalies in a collection of multivariate time series. We demonstrate the applicability of the online algorithm by identifying counties in the United States with anomalous trends in terms of COVID-19 related cases and deaths. Several of the identified anomalous counties correlate with counties that had a documented poor response to the COVID pandemic.

1 Introduction

Anomaly detection has been extensively studied over many decades across many domains [9, 18]. Among the most useful applications of anomaly detection is simultaneously monitoring the behavior of multiple systems and identifying the system that exhibits anomalous behavior due to external or internal stress factors. Consider, for instance, COVID-19 infection data: studying the confirmed case and death trends across countries, states, or counties could highlight and identify the most (or least) effective public policies. One possible approach is to monitor each time series individually [8, 20, 30] and identify sudden outbreaks or significant causal events. However, such methods study each time series in isolation and cannot detect a gradual divergence from normal trends, or the initial signs of such a drift. An alternate approach is to analyze each time series in the context of a collection of time series, which can reveal anomalies beyond sudden and significant events, such as anomalous trends and gradual drifts. Such methods typically require an appropriate similarity measure [16]. Combined with state-of-the-art similarity-based models, these methods can identify potentially anomalous time series and cluster similar trends. Implementing such methods in a time-varying setting could even help detect change points or anomalous events within individual time series, in addition to identifying anomalous time series [5, 31]. However, these methods are typically unable to scale to long time series [4, 31].

In this paper, we propose a new anomaly detection algorithm, Large Deviations Anomaly Detection (LAD), for large/high-dimensional data and multivariate time series data. LAD uses the rate function from the large deviations principle (LDP) [14, 27, 28] to derive anomaly scores for the underlying data. The core ideas of the algorithm are inspired by the projection theorem of large deviations theory, which allows better handling of high-dimensional data. Unlike most high-dimensional anomaly detection models, LAD does not rely on feature selection or dimensionality reduction, which makes it well suited to studying multiple time series in an online mode.
The intuition behind the LAD model allows it to naturally segregate the anomalous observations at each time step while comparing multiple multivariate time series simultaneously. The key contributions of this paper are the following:
(1) We propose the Large Deviations Anomaly Detection (LAD) algorithm, a novel and highly scalable LDP-based methodology for score-based anomaly detection.
(2) The proposed LAD model is capable of analyzing large and high-dimensional datasets without additional dimensionality reduction procedures, thereby allowing more accurate and cost-effective anomaly detection.
(3) An online extension of the LAD model is presented to detect anomalies in a multivariate time series database using an evolving anomaly score for each time series. The anomaly score varies with time and can be used to track developing anomalous behavior.
(4) We perform an empirical study on publicly available anomaly detection benchmark datasets to analyze the robustness and performance of the proposed method on high-dimensional and large datasets.
(5) We present a detailed analysis of COVID-19 trends for US counties, in which we identify counties with anomalous behavior (see Figure 1 for an illustration).

The rest of this document is organized as follows. Section 2 provides an overview of relevant existing methods for anomaly detection. Section 3 is a short background on the large deviations theory motivating LAD. Section 4 details our LAD model for detecting unsupervised anomalies in multivariate time series. Section 5 describes the experiments and demonstrates the state-of-the-art performance of our method. Section 6 concludes the paper and sketches directions for possible future work.

Figure 1: At any time instance, the algorithm analyzes the bivariate time series for all the counties to identify anomalies. The time series for the non-anomalous counties are plotted in light gray in the background for reference. For the counties in North Dakota (Burleigh and Grand Forks), the number of confirmed cases (top), and in particular the sharp rise in November 2020, is the primary cause of the anomaly. On the other hand, Wayne County in Michigan was identified as anomalous primarily because of its abnormally high death rate, especially when compared to its relatively moderate confirmed infection rate.

2 Related Work

In this section, we provide a brief overview of relevant anomaly detection methods that have been proposed for high-dimensional data and for multivariate time series data. We also discuss other works that have used the large deviations principle for detecting anomalies.

A large body of research exists on studying anomalies in high-dimensional data [1, 3], but challenges remain. Many anomaly detection algorithms use dimensionality reduction techniques as a pre-processing step. However, many high-dimensional anomalies can only be detected in the full high-dimensional problem setting, and dimensionality reduction in such settings can lead to false negatives. Several methods identify anomalies in high-dimensional data without dimensionality reduction or feature selection, e.g., by using distance metrics. Elliptic Envelope (EE) [25] fits an ellipse around the data centers using a robust covariance estimate. Isolation Forest (I-Forest) [19] uses recursive partitioning with randomly selected features to isolate outlier observations. Nearest-neighbor outlier detection (kNN) [23] uses the distance to the k-th nearest neighbor as the anomaly score. Local outlier factor (LOF) [7] uses the deviation of a point's local density with respect to that of its neighbors to detect anomalies. The k-means-- method [12] uses the distance from the nearest cluster center to jointly perform clustering and anomaly detection. Concentration Free Outlier Factor (CFOF) [2] uses a "reverse nearest neighbor-based score" which measures the number of nearest neighbors required for a point to have a set proportion of the data within its envelope. In particular, methods like I-Forest and CFOF are targeted towards anomaly detection in high-dimensional datasets.
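To make the comparison concrete, the following minimal sketch shows how several of these baseline detectors can be instantiated; it assumes scikit-learn and NumPy are available and uses illustrative parameter values, not the exact experimental setup of this paper. CFOF is omitted, as it has no scikit-learn implementation.

```python
# Minimal sketch: scoring a toy dataset with the baseline detectors above.
# Parameter values are illustrative only.
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 10)),   # nominal points
               rng.normal(6, 1, (10, 10))])   # injected anomalies

# Elliptic Envelope: robust covariance fit; label -1 marks outliers.
ee_labels = EllipticEnvelope(contamination=0.02).fit_predict(X)

# Isolation Forest: recursive random partitioning isolates outliers quickly.
if_labels = IsolationForest(contamination=0.02, random_state=0).fit_predict(X)

# LOF: deviation of local density relative to that of the neighbors.
lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)

# kNN outlier score: distance to the k-th nearest neighbor (k = 5 here;
# the first returned neighbor of each point is the point itself).
nn = NearestNeighbors(n_neighbors=6).fit(X)
dists, _ = nn.kneighbors(X)
knn_scores = dists[:, -1]   # larger distance => more anomalous
```

The `contamination` parameter plays the role described in Section 5: it is set to the expected proportion of anomalies in the data.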
In most settings, real-time detection of anomalies is needed so that preventive measures for damage control can be dispatched promptly. This problem formulation requires collectively monitoring a high-dimensional time series database to identify anomalies in real time.

Recently, large deviations theory has been widely applied in the fields of climate models [13], statistical mechanics [26], networks [22], etc. Especially for the analysis of time series, the theory of large deviations has proven to be of great interest over recent decades [6, 21]. However, these methods are data specific, often study individual time series, and are difficult to generalize to other areas of research. Anomaly detection for time series has been extensively explored in the literature [17], though most of the focus has been on identifying anomalous events within a single time series. While the task of detecting anomalous time series in a collection of time series has been studied in the past [10, 11, 29], most of these works have focused on univariate time series and have not been shown to scale to long time series data. Our proposed method addresses this issue by using the large deviations principle.

3 Background

Large deviations theory provides techniques to derive the probability of rare events that admit an asymptotically exact exponential approximation [14, 27, 28]. In this section, we briefly review the theory and different ways to generate the rate functions required by the large deviations principle. The key concept of this theory is the Large Deviations Principle (LDP). The principle describes the exponential decay of the probabilities of the mean of random variables; the rate of decay is characterized by the rate function $I$.

Definition 3.1. A family of probability measures $\{\mu_\epsilon\}_{\epsilon > 0}$ on a Polish space $\mathcal{X}$ is said to satisfy the large deviations principle (LDP) with rate function $I : \mathcal{X} \to [0, \infty]$ if: (1) $I$ has compact level sets and is not identically infinite; (2) $\limsup_{\epsilon \to 0} \epsilon \log \mu_\epsilon(F) \le -\inf_{x \in F} I(x)$ for every closed set $F \subseteq \mathcal{X}$; and (3) $\liminf_{\epsilon \to 0} \epsilon \log \mu_\epsilon(G) \ge -\inf_{x \in G} I(x)$ for every open set $G \subseteq \mathcal{X}$.

To apply the LDP to data with known distributions, it is important to identify the rate function $I$. Cramér's theorem provides the relation between the rate function $I$ and the logarithmic moment generating function $\Lambda$.

Definition 3.2. The logarithmic moment generating function of a random variable $X$ is defined as $\Lambda(\lambda) = \log \mathbb{E}\left[e^{\lambda X}\right]$.

Theorem 3.3 (Cramér). Let $\{X_i\}$ be a sequence of i.i.d. real random variables with finite logarithmic moment generating function, i.e., $\Lambda(\lambda) < \infty$ for all $\lambda \in \mathbb{R}$. Then the law of the empirical average $\bar{S}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ satisfies the large deviations principle with rate $1/n$ and rate function given by the Legendre transform $I(x) = \sup_{\lambda \in \mathbb{R}} \left(\lambda x - \Lambda(\lambda)\right)$. Thus, we get $P(\bar{S}_n = x) \approx e^{-n I(x)}$.
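As a quick numerical illustration of Cramér's theorem (an aside, not part of the LAD algorithm itself), the sketch below recovers the standard Gaussian rate function $I(x) = x^2/2$ by numerically computing the Legendre transform of $\Lambda(\lambda) = \lambda^2/2$; SciPy is assumed to be available.

```python
# Sketch: for N(0, 1) data, Lambda(lam) = lam^2 / 2, and the Legendre
# transform I(x) = sup_lam { lam * x - Lambda(lam) } recovers x^2 / 2.
from scipy.optimize import minimize_scalar

def log_mgf(lam: float) -> float:
    # Logarithmic moment generating function of a standard Gaussian.
    return 0.5 * lam ** 2

def rate_function(x: float) -> float:
    # Numerically maximize lam * x - Lambda(lam) over lam.
    res = minimize_scalar(lambda lam: -(lam * x - log_mgf(lam)))
    return -res.fun

for x in (0.5, 1.0, 2.0):
    print(x, rate_function(x), x ** 2 / 2)  # numeric value matches closed form
```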
For more complex distributions, identifying the rate function via the logarithmic moment generating function can be challenging. Many techniques, such as the contraction principle and exponential tilting, extend rate functions from one topological space that satisfies the LDP to other topological spaces of interest [14]. For our work, we are interested in the Dawson-Gärtner projective LDP, which generates the rate function using a nested family of projections.

Theorem 3.4 (Dawson-Gärtner projective LDP). If, for every $j \in \mathbb{N}$, the family of marginals $\{\mu_\epsilon \circ p_j^{-1}\}_{\epsilon > 0}$ satisfies the LDP on $\mathcal{Y}_j$ with rate function $I_j$, where $p_j$ denotes the projection onto $\mathcal{Y}_j$, then $\{\mu_\epsilon\}_{\epsilon > 0}$ satisfies the LDP on the projective limit $\mathcal{Y}$ with rate function $I(y) = \sup_{j \in \mathbb{N}} I_j(p_j(y))$, $y \in \mathcal{Y}$. The supremum defining $I$ is monotone in $j$ because the projections are nested.

The theorem allows extending the rate function from a lower-dimensional projection to a higher-dimensional projection space. The implementation of this theorem in the LAD model is discussed in Section 4.

4 The LAD Model

Consider the case of multivariate time series data. Let $\{\mathbf{t}_n\}_{n=1}^{N}$ be a set of multivariate time series, where $\mathbf{t}_n = (\mathbf{t}_{n,1}, \ldots, \mathbf{t}_{n,T})$ is a time series of length $T$ and each $\mathbf{t}_{n,t}$ has $d$ attributes. The motivation is to identify anomalous $\mathbf{t}_n$ that diverge significantly from their non-anomalous counterparts at one or more time steps. The main challenge is to design a score for each individual time series that evolves in a temporal setting and enables tracking both the initial time of deviation and the scale of deviation from the normal trend. As shown in the following sections, our model addresses this problem through rate functions derived from the large deviations principle. We use the Dawson-Gärtner projective LDP (see Section 4.2) to project the rate function to a low-dimensional setting while preserving anomalous instances. The extension to temporal data (see Section 4.3) is done by collectively studying each time series as one observation.

Our approach uses a direct implementation of the LDP to derive the rate function values for each observation. As the theory focuses on extremely rare events, the raw probabilities associated with them are usually very small [14, 27, 28]; however, the LDP provides a rate function that serves as a useful scoring metric for our LAD model. Consider a dataset of size $n$. Let $a = \{a_1, \ldots, a_n\}$ and $l = \{l_1, \ldots, l_n\}$ be the anomaly score and anomaly label vectors for the observations, respectively, such that $a_i$ is the anomaly score of observation $i$ and $l_i = 1$ if observation $i$ is anomalous and $0$ otherwise. By the large deviations principle, we know that for a given dataset of size $n$, $P(\bar{S}_n = x) \approx e^{-n I(x)}$. Assuming that the underlying data follow a standard Gaussian distribution with mean 0 and variance 1, we can use the rate function for Gaussian data, $I(x) = \frac{x^2}{2}$. The resulting probability that the sample mean equals $x$ is then $P(\bar{S}_n = x) \approx e^{-n x^2 / 2}$. Now, in the presence of an anomalous observation $x_i$, the sample mean is shifted by approximately $x_i / n$ for large $n$. Thus, the probability of the shifted mean being the true mean is given by $P(\bar{S}_n = x_i / n) \approx e^{-n (x_i / n)^2 / 2} = e^{-x_i^2 / 2n}$. However, for large $n$ these raw probabilities decay exponentially, which significantly reduces their effectiveness for anomaly detection. Thus, we use the rate function value $\frac{x_i^2}{2}$ as the anomaly score in our model. Generalizing, the anomaly score of each individual observation $x_i$ is given by $a_i = I(x_i)$.

High-dimensional data pose significant challenges for anomaly detection. The presence of redundant or irrelevant features acts as noise, making anomaly detection difficult; at the same time, dimensionality reduction can suppress anomalies that arise from less significant features of the data. To address this, LAD uses the Dawson-Gärtner projective theorem to compute the rate function for high-dimensional data. The theorem takes the maximum value across all projections, which preserves the anomaly score and makes the approach well suited to detecting anomalies in high-dimensional data. The model algorithm is presented in Algorithm 1.
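A minimal sketch of this scoring scheme is given below, assuming the Gaussian rate function $I(x) = x^2/2$ on standardized features and taking the maximum rate across coordinate projections in the spirit of the Dawson-Gärtner theorem. The fixed-threshold iteration is our simplification of the textual description, not a verbatim reproduction of Algorithm 1.

```python
# Sketch of LAD-style scoring (simplified from Algorithm 1 as described above).
import numpy as np

def lad_scores(X, threshold=0.95, n_iter=10):
    """Return anomaly scores normalized to [0, 1] and boolean anomaly labels."""
    X = np.asarray(X, dtype=float)
    labels = np.zeros(len(X), dtype=bool)
    for _ in range(n_iter):
        normal = X[~labels]
        # Standardize using statistics of the currently non-anomalous points.
        Z = (X - normal.mean(axis=0)) / (normal.std(axis=0) + 1e-12)
        # Gaussian rate function per feature; Dawson-Gartner style projection:
        # keep the maximum rate over all coordinate projections.
        rates = (Z ** 2 / 2).max(axis=1)
        scores = rates / (rates.max() + 1e-12)
        new_labels = scores > threshold
        if np.array_equal(new_labels, labels) or new_labels.all():
            break
        labels = new_labels
    return scores, labels
```

Re-estimating the mean and standard deviation from only the currently non-anomalous points keeps detected anomalies from contaminating the baseline, mirroring the iterative label updates described above.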
The definition of an anomaly is often contingent on the data and the problem statement. Broadly, time series anomalies can be categorized into two groups [10]:
(1) Divergent trends / process anomalies: time series with divergent trends that last for significant periods of time fall into this group. Here, one can argue that the generative process of such a time series differs from that of its non-anomalous counterparts.
(2) Subsequence anomalies: such time series have temporally sudden fluctuations or deviations from expected behavior, occurring as a subsequence of sudden spikes or dips within an otherwise non-anomalous trend.

The online extension of the LAD model is designed to capture anomalous behavior at each time step. Depending on how the temporal anomaly scores are analyzed, one can identify both divergent trends and subsequence anomalies. In this paper, we focus on divergent trends (process anomalies); in particular, we examine anomalous trends in COVID-19 cases and deaths in US counties. Collectively identifying divergent trends and subsequence anomalies is left as prospective future work.

In this section, we present the extension of the LAD model to multivariate time series data. Here, we wish to preserve the temporal dependency as well as the dependency across different features of the time series. Thus, as shown in Algorithm 2, a horizontal stacking of the data is performed, which allows the temporal and non-temporal features to be studied collectively. To preserve temporal dependency, the anomaly scores and labels are carried over to the next time step, where the labels are then re-evaluated. As long-term anomalies are of interest, time series with temporally longer anomalous behaviors are ranked as more anomalous. The overall time series anomaly score for each time series $\mathbf{t}_n$ can be computed by accumulating its per-time-step scores, $A(\mathbf{t}_n) = \sum_{t=1}^{T} a_{n,t}$. For a database of time series with varying lengths, the time series anomaly score is computed by normalizing with the respective lengths.
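The sketch below illustrates this online extension, reusing the hypothetical `lad_scores` helper from the previous sketch. The horizontal stacking and score accumulation follow the description of Algorithm 2 above, while the details (label carryover, windowing) are simplified.

```python
# Sketch of online LAD over N multivariate time series (simplified Algorithm 2).
# series: array of shape (N, T, d) -- N series, T time steps, d attributes.
import numpy as np

def online_lad(series, threshold=0.95):
    N, T, d = series.shape
    cumulative = np.zeros(N)   # evolving per-series anomaly score
    history = []
    for t in range(1, T + 1):
        # Horizontally stack all features seen up to time t: shape (N, t*d).
        stacked = series[:, :t, :].reshape(N, t * d)
        scores, _ = lad_scores(stacked, threshold=threshold)
        cumulative += scores   # longer-lasting deviations rank higher
        history.append(scores)
    # Length-normalized overall score plus the full score trajectory.
    return cumulative / T, np.array(history)
```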
5 Experiments

In this section, we evaluate the performance of the LAD algorithm on multi-aspect datasets. First, we evaluate anomaly detection performance: LAD's ability to detect real-world anomalies, compared against state-of-the-art anomaly detection models, is assessed using ground-truth labels. We consider a variety of publicly available benchmark data sets from the Outlier Detection DataSets (ODDS) repository [24] (see Table 1). As described in Section 4, LAD falls under the unsupervised learning regime targeted at high-dimensional data, so we do not compare against supervised algorithms. We consider Elliptic Envelope (EE) [25], Isolation Forest (I-Forest) [19], local outlier factor (LOF) [7], and Concentration Free Outlier Factor (CFOF) [2]. The CFOF and LOF models assign an anomaly score to each data instance, while the other methods provide an anomaly label. As the above-mentioned methods have one or more user-defined parameters, we investigated a range of values for each parameter and report the best results. For Isolation Forest, Elliptic Envelope, and CFOF, the contamination value is set to the true proportion of anomalies in the dataset. The LAD model relies on a threshold value to classify observations whose scores exceed it as strictly anomalous. Though this value is iteratively updated, an initial value is required by the algorithm; in this paper, the initial threshold is set to 0.95 for all datasets.

All methods for the anomaly detection benchmarks are implemented in Python, and all experiments were conducted on a 2.7 GHz Quad-Core Intel Core i7 processor with 16 GB of RAM. As LAD is a score-based algorithm, we study ROC curves by comparing the true positive rate (TPR) and false positive rate (FPR) across various thresholds; the final ROC-AUC (area under the ROC curve) is reported for evaluation. For time series anomaly detection, we present the final outliers and study their deviations from normal baselines under different model settings.

Table 2 shows the performance of LOF, I-Forest, EE, CFOF, and LAD on the anomaly detection benchmark datasets. Because the CFOF model is computationally expensive relative to the other algorithms, CFOF results are shown only for datasets with fewer than 10k samples, even though it is aimed at high-dimensional data. For all listed algorithms, results for the best parameter settings are reported. The proposed LAD model outperforms the other methods on most data sets; for larger and high-dimensional datasets in particular, Table 2 shows that LAD outperforms all the models in most settings. (The lowest AUC values for LAD are observed for the Speech and Optdigits data, where multiple true clusters are present.)

To study the LAD model's computational effectiveness, we examine its computation time and scaling on large and high-dimensional datasets, considering datasets with more than 10k observations or over 100 features. Figures 2 and 3 show the computation time in seconds for the benchmark datasets. LAD's computation time is relatively low, second only to Isolation Forest on most datasets; in fact, its computation time is more stable than that of the other methods on high-dimensional datasets. Figure 4 shows the scalability of LAD with respect to the number of records: we plot the time needed to run on the first k records of the KDD-99 dataset, where each record has 29 dimensions. Figure 5 shows the scalability of LAD with respect to the number of dimensions (linear scale): we plot the time needed to run on the first 1, 2, ..., 29 dimensions of the KDD-99 dataset. The results confirm the linear scalability of LAD in both the number of records and the number of dimensions.

This section also presents the results of the LAD model on COVID-19 time series data at the US county level. Multiple settings were used to understand the data: (1) both death and confirmed-case trends were considered for analysis; (2) both total counts and daily new counts were analyzed for anomaly detection. To bring all counts to a common baseline, the counts in each time series were scaled by the respective county population. Missing information was replaced with zeros, and counties with populations below 50k were eliminated from the study.
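A sketch of this preprocessing is shown below, assuming a pandas DataFrame with hypothetical columns `county`, `date`, `cases`, `deaths`, and `population`; the actual schema of the data used in the paper may differ.

```python
# Sketch of the COVID-19 preprocessing described above; the column names are
# hypothetical and the real data schema may differ.
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df[df["population"] >= 50_000].copy()                    # drop small counties
    df[["cases", "deaths"]] = df[["cases", "deaths"]].fillna(0)   # missing -> 0
    # Scale counts to the county population (per-capita baseline).
    df["cases_pc"] = df["cases"] / df["population"]
    df["deaths_pc"] = df["deaths"] / df["population"]
    # One bivariate time series per county: index = date, columns = features.
    return df.pivot_table(index="date", columns="county",
                          values=["cases_pc", "deaths_pc"])
```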
Complete history vs. one time step. The full-history setting considers the complete history of the time series and aims to capture the most deviant trends over time. The one-time-step (or any smaller window) setting is better suited to studying deviations within a specific window. As we target long-term deviating trends, the one-time-step setting returns trends that have stayed most deviant throughout the entire time range. This can be seen in Figures 6 and 7, where the one-time-step setting returns trends that stayed deviant almost throughout the duration, while the full-history setting captures significantly wider deviations. For instance, counties like Grand Forks (ND), Burleigh (ND), and Miami-Dade (FL), which had massive outbreaks at later stages, were not captured as anomalous by the one-time-step model, as seen in Figures 7a and 7b. Similarly, Hall, Nebraska, which saw a deviation in trend due to an outbreak at a meat packing facility in late April 2020 (see https://www.omaha.com/news/state_and_regional/237-coronavirus-cases-tied-tojbs-beef-plant-in-grand-island-disease-specialists-are-touring/article_2894db56-913a-5c61-a065-6860a8ae50ad.html), was captured as an anomalous trend by the full-history model in Figures 6a and 6b.

Univariate vs. multivariate time series. In Figures 6, 7, 8, and 9, we see the anomalous trends in multivariate time series, where total confirmed cases and deaths were collectively evaluated for anomaly detection. For instance, some counties were flagged as anomalous despite near-normal trends in deaths. Similarly, in Figures 6c and 6d, Wayne, Michigan, along with Rockland, Richmond, Queens, and Bronx in NY, have been identified as anomalous. In particular, Michigan had the third-highest death count after NY and NJ in the early stages of the pandemic, with the Detroit metro area contributing most of the cases (see https://www.npr.org/sections/coronavirus-live-updates/2020/03/31/824738996/after-surge-in-cases-michigan-now-3rd-in-country-for-coronavirus-deaths). Wayne County has a near-normal trend in total confirmed cases, whereas its total-deaths trend has deviated significantly.

Daily new vs. total counts. Figures 7 and 9 show anomalous trends in multivariate time series for total and daily new counts, respectively. The anomaly score is erratic for multivariate time series on new-case counts: the data for new case and death counts is more erratic, leading to a fluctuating normal average as well as non-smooth anomaly scores. On the daily new counts data, the LAD model was able to capture the escalation in the Greater Boston area (Essex, Massachusetts) during March 2020, as seen in Figures 9a and 9b. Though the total trends appear normal, the repeated anomalous daily trends led to high anomaly scores. Similar patterns led to the identification of Lincoln.

6 Conclusion

In this paper, we propose LAD, a novel scoring algorithm for anomaly detection in large/high-dimensional data. The algorithm successfully handles high dimensions by building on large deviations theory. Our contributions include re-establishing the advantages of large deviations theory for large and high-dimensional datasets. We also present an online extension of the model aimed at identifying anomalous time series within a multivariate time series database. The model shows strong scalability and performance against baseline methods. The online LAD returns a temporally evolving score for each time series that allows us to study deviations in trends relative to the complete time series database. A potential extension of the model is anomalous event detection within each individual time series. Another possible direction for future work is extending the model to enable anomaly detection in multi-modal datasets. Additionally, the online LAD model could be enhanced with temporally weighted scores that prioritize recent events.
References

[1] Outlier detection for high dimensional data.
[2] CFOF: A concentration free measure for anomaly detection.
[3] Fast outlier detection in high dimensional spaces.
[4] Time series anomaly detection based on shapelet learning.
[5] Unsupervised outlier detection for time series by entropy and dynamic time warping.
[6] No early warning signals for stochastic transitions: Insights from large deviation theory.
[7] LOF: Identifying density-based local outliers.
[8] Estimation of COVID-19 prevalence in Italy.
[9] Anomaly detection: A survey.
[10] Detecting anomalies in a time series database.
[11] A comparative evaluation of anomaly detection techniques for sequence data.
[12] Sanjay Chawla and Aristides Gionis. 2013. k-means--: A unified approach to clustering and outlier detection.
[13] Rogue waves and large deviations in deep sea.
[14] Large deviations.
[15] An interactive web-based dashboard to track COVID-19 in real time.
[16] A review on time series data mining.
[17] Outlier detection for temporal data: A survey.
[18] A survey of outlier detection methodologies.
[19] Isolation-based anomaly detection.
[20] Time series modelling to forecast the confirmed and recovered cases of COVID-19.
[21] A large deviations approach to limit theory for heavy-tailed time series.
[22] Spatio-temporal network anomaly detection by assessing deviations of empirical measures.
[23] Efficient algorithms for mining outliers from large data sets.
[24] ODDS Library.
[25] A fast algorithm for the minimum covariance determinant estimator.
[26] The large deviation approach to statistical mechanics.
[27] Large deviations and applications.
[28] Large deviations.
[29] Disk aware discord discovery: Finding unusual time series in terabyte sized datasets.
[30] Deep learning methods for forecasting COVID-19 time-series data: A comparative study.
[31] A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data.