key: cord-0057844-nhli615s
title: Fine-Tuning a Pre-trained CAE for Deep One Class Anomaly Detection in Video Footage
authors: Hamdi, Slim; Snoussi, Hichem; Abid, Mohamed
date: 2021-02-22
journal: Pattern Recognition and Artificial Intelligence
DOI: 10.1007/978-3-030-71804-6_1
sha: 85a153e41b6023c6369cfcae48108b38b9a72ff0
doc_id: 57844
cord_uid: nhli615s

In recent years, abnormal event detection in video surveillance has become a very important task, mainly treated by deep learning methods that must take many challenges into account. However, these methods are still not trained with an anomaly-detection objective, which limits their effectiveness on this problem. In this paper, we propose an unsupervised method based on a new deep one-class architecture of convolutional auto-encoders (CAEs) that represents a compact spatio-temporal feature for anomaly detection. Our CAEs are constructed by adding deconvolution layers to the VGG16 CNN. We then train our CAEs with a one-class objective, fine-tuning the model to properly exploit the richness of the dataset on which the CNN was originally trained. The first CAE is trained on the original frames to extract a good descriptor of shapes, and the second CAE is trained on optical flow representations to provide a strong description of the motion between frames. For this purpose, we define two loss functions, a compactness loss and a representativeness loss, that train our CAEs not only to maximize the inter-class distance and minimize the intra-class distance, but also to ensure the tightness and representativeness of the features of normal images. We reduce the feature dimension by applying PCA (Principal Component Analysis) and combine our two descriptors with a Gaussian classifier for abnormal spatio-temporal event detection. Our method has high performance in terms of reliability and accuracy: it detects abnormal events efficiently on challenging datasets compared to state-of-the-art methods.
Security is a founding value of any modern society; it contributes strongly to creating the climate of peace necessary for healthy social development. Currently, the conditions and the various mechanisms for its implementation are major concerns, at both the individual and the collective level. In recent decades, cameras have been deployed throughout public spaces for security purposes. Video surveillance is a system composed of cameras and signal transmission equipment, and it is an essential tool for fighting crime and strengthening security: it allows controlling the conditions necessary for security and identifying risk elements in the scene. In the current context, one operator is in charge of several scenes at the same time, often on the same screen. In [1], the author shows that an operator can miss 60% of target events when in charge of viewing 9 or more video streams. A possible solution to this problem is the use of intelligent video surveillance systems. Such systems must be able to learn the normal behavior of a monitored scene and detect any abnormal behavior that may represent a safety risk.

The auto-encoder (AE) is a fully connected neural network widely used in unsupervised learning. It consists of an input layer, an output layer, and one or more hidden layers. The hidden layers are distributed between the encoder and the decoder: the encoder encodes the input data into a more compact representation, and the decoder reconstructs the data from the representation generated by the encoder. A minimal convolutional variant of this structure is sketched below.
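To make the encoder/decoder structure concrete, the following is a minimal sketch of a convolutional auto-encoder trained with a reconstruction objective, written in Keras. It is purely illustrative: the layer sizes and the 64x64 grayscale input are assumptions for the example, not the VGG16-based architecture proposed later in this paper.

```python
# Minimal convolutional auto-encoder sketch (illustrative only).
# Layer sizes and the 64x64 grayscale input are assumptions for the
# example, not the VGG16-based architecture proposed in this paper.
from tensorflow.keras import layers, models

def build_cae(input_shape=(64, 64, 1)):
    inp = layers.Input(shape=input_shape)
    # Encoder: compress the input into a compact bottleneck representation.
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inp)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    bottleneck = layers.Conv2D(128, 3, strides=2, padding="same",
                               activation="relu", name="bottleneck")(x)
    # Decoder: reconstruct the input from the bottleneck.
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(bottleneck)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="linear")(x)
    return models.Model(inp, out)

cae = build_cae()
cae.compile(optimizer="adam", loss="mse")   # reconstruction objective
# cae.fit(normal_frames, normal_frames)     # trained on normal data only
```

Training on normal data only is what makes the AE usable for anomaly detection: it reconstructs data resembling the training distribution well and unfamiliar data poorly.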
To exploit this unsupervised learning capacity, the AE has been widely explored for the detection of abnormal events. The authors of [2] propose AMDN (Appearance and Motion DeepNet), a network consisting of three stacked denoising auto-encoders (SDAEs): the first trained to reconstruct patches extracted from normal images, the second trained on the optical flow representations corresponding to those patches, and the third trained on the concatenation of the patches and their optical flow representations. Based on CAEs, the authors of [3] propose to train a CAE to reconstruct 3D input volumes together with the optical flow extracted from each image and the previous one. In [4], two methods also based on CAEs are compared: the first trains a CAE to reconstruct low-level features (HOG and HOF) extracted from samples of the normal class, while the second uses a spatio-temporal CAE trained on video volumes. In both approaches, anomalies are captured using a regularity score computed from the reconstruction error.

In recent years, many works have exploited the progress made in deep learning (DL) and computer vision (CV) to automate surveillance for abnormal event detection. Deep learning automates feature extraction from raw data for many purposes, such as image classification [5], facial recognition [6], automatic generation of computer code [7], natural language processing [8], and automatic speech recognition [9]. Unsupervised deep learning is often used in anomaly detection, not only because of the subjective nature of anomalies, but also because usually only normal data are available for training. Developing learning methods that do not require a labeled database has always been a primary objective in machine learning, and in this perspective several deep one-class networks have recently been proposed [35]. However, these methods use an extra dataset to ensure the compactness of normal features with a deep CNN. To remedy these drawbacks, we propose in this paper a new deep architecture for abnormal event detection. It consists of two convolutional auto-encoders, one trained on images and the other on optical flow representations, to obtain compact and descriptive features. This combination extracts a high-level compact representation able to describe complex behaviors and to dissociate normal from abnormal events. The aim is to extract tight and representative spatio-temporal features of normal frames, which are subsequently much easier to separate from abnormal frames. The originality of this work is to extract deep one-class spatio-temporal features without using any external database.

Anomaly detection in video footage is a very important task in computer vision. Usually, state-of-the-art methods train a model to represent normal events and, at testing time, label as abnormal any new event that had a small occurrence during training. Earlier methods extracted low-level features to train a model: in [10], the authors use the Histogram of Oriented Social Force (HOSF) to represent events, and in [11], the authors extract multiple features such as size, color, and edges from small regions of each input frame obtained by a foreground segmentation technique, with a separate classifier for each feature deciding whether a region contains an anomaly. The authors of [12] use Histograms of Optical Flow (HOF) to represent the motion information of each frame, combined with a one-class Support Vector Machine (SVM) classifier to pick up abnormal motion. In [13], the authors propose to train a model from the frames available at training time using sparse coding, based on the assumption that usual events in video footage are more reconstructible from a normal event dictionary than unusual events; the learned dictionary yields a model capable of computing a normality score for each new event in order to dissociate normal from abnormal events. Other trajectory-based methods have been applied to recognize unusual trajectories in a monitored scene: [14] represents trajectories with the Kanade-Lucas-Tomasi feature tracker (KLT) and uses a Multi-Observation Hidden Markov Model (MOHMM) to determine whether a trajectory is normal or abnormal; [15] trains a one-class SVM model to recognize normal trajectories and pick up any abnormal events that may occur; and [16] combines two models, vector quantization and neural networks, to extract a robust representation.

In the last few years, many researchers have based their work on deep learning and obtained great results in various applications such as object detection [17], action recognition [18], and face recognition [19]. This success comes from the capability of deep networks to learn non-linear and complex representations from raw images, which is important because real-world applications contain many non-linear relationships. These methods also generalize well: they can be applied to data unseen during the learning process. The authors of [20] apply optical flow to extract spatio-temporal volumes of interest (SVOI) and use them to train a 3D-CNN to classify events as normal or abnormal. [21] combines a pre-trained CNN completed with a Binary Quantization Layer (BQL) with optical flow to detect local anomalies. [22] proposes a method called AVID (Adversarial Visual Irregularity Detection) to detect and locate abnormalities in video footage: a GAN composed of a generator, trained to remove abnormalities from the input images and replace them with the dominant patterns of the same images, and a discriminator in the form of an FCN, which predicts the probability that the different regions (patches) of the input images are abnormal. The two networks are trained in an adversarial manner, with abnormalities simulated using Gaussian noise; after training, each of the two networks is capable of detecting abnormalities on its own.

One-class classification is a machine learning problem that has received considerable attention from researchers in different fields such as novelty detection, anomaly detection, and medical imaging.
Nevertheless, the lack of data in the training phase limits the possible depth of the network architecture, which in turn reduces the representativeness of the features. To overcome this weakness, we propose to fine-tune, with a one-class training objective, a pre-trained CAE constructed from the VGG16 CNN, which achieves 92.7% top-5 test accuracy. VGG16 was trained on ImageNet, a dataset of over 14 million high-resolution images belonging to 1000 classes, collected from the web and labeled by humans using Amazon's Mechanical Turk crowd-sourcing tool. We freeze the first convolution layers to properly exploit the richness of the database on which the CNN was trained (Fig. 1).

The objective of the convolution operation is to extract high-level features from the input image, and the architecture need not be limited to a single convolution layer. Conventionally, the first convolution layer captures low-level features such as edges, color, and gradient orientation; with added layers, the architecture adapts to high-level features as well, giving a network with a wholesome understanding of the images in the dataset, similar to our own. We therefore construct the encoder part of our CAE from the convolution layers of the pre-trained VGG16: we freeze the first convolutional block and keep the other convolutional blocks trainable (Fig. 2). The decoder part, on the other hand, is a plain network made up of four 2D-deconvolution layers able to reconstruct the original frames; its hyper-parameters are given in Table 1.

Like a traditional auto-encoder, the CAE is composed of two parts: the encoder, a sequence of convolutional layers that extracts compressed data from the input image at the bottleneck layer, and the decoder, a succession of deconvolutional layers that reconstructs the input data from the compressed bottleneck representation. A CAE reconstructs the data it was trained on better than data it has never seen, so the bottleneck layer must be as reduced and as representative as possible, which in practice is a compromise; many tests were run to select the bottleneck dimension properly (Table 1). A non-linear activation function is used in the convolutional and deconvolutional layers to obtain more useful and robust representations, except for the last deconvolution layer, where we use a linear activation function because our input data lie in the range [-255, 255].

Our architecture consists of two such parallel CAEs. The first CAE is trained on original images to detect any abnormality in shapes, and the second CAE is trained on optical flow representations to detect any motion that is abnormal relative to the training data (Fig. 3). A sketch of how such an encoder can be derived from VGG16 is given below.
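As an illustration of this construction, here is a hedged Keras sketch that takes the convolutional layers of a pre-trained VGG16 as the encoder, freezes the first convolutional block, and appends four 2D-deconvolution layers as the decoder. The paper's exact filter counts, strides, and bottleneck size come from its Table 1, which is not reproduced here, so the values below are illustrative assumptions.

```python
# Sketch: encoder taken from pre-trained VGG16 with the first block
# frozen, plus a plain 4-layer deconvolution decoder. Filter counts,
# strides, and the bottleneck are illustrative assumptions (the paper's
# exact hyper-parameters are listed in its Table 1).
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_vgg16_cae(input_shape=(224, 224, 3)):
    vgg = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    # Freeze the first convolutional block; keep the other blocks trainable.
    for layer in vgg.layers:
        layer.trainable = not layer.name.startswith("block1")
    bottleneck = vgg.output  # (7, 7, 512) feature map for a 224x224 input

    # Decoder: four 2D-deconvolution layers; the last one is linear
    # because the input range is [-255, 255].
    x = layers.Conv2DTranspose(256, 3, strides=2, padding="same", activation="relu")(bottleneck)
    x = layers.Conv2DTranspose(128, 3, strides=4, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2DTranspose(3, 3, strides=2, padding="same", activation="linear")(x)
    return models.Model(vgg.input, out)

cae = build_vgg16_cae()  # one such CAE for frames, one for optical flow
```

Two instances of this model would be trained in parallel, one on original frames and one on optical flow representations, as described above.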
The training phase aims to obtain a model capable of producing representative and compact features of normal images for easy classification. We can ensure this in one of two ways. The first method (Fig. 4) trains with cascaded objectives: training begins with the reconstruction objective only and, after a few epochs, we extract at the bottleneck layer a representative point, denoted c, computed as the mean of the features of the dataset on which the model is being trained. Training then continues with the compactness objective only, fixing the point c as the target of the new features. The disadvantage of this training method is that the representativeness of the images is not robust, although it yields very compact features. To remedy this flaw, a second training method with pseudo-parallel objectives is proposed (Fig. 3): training starts with the reconstruction objective only; then, as in the first method, a fixed point c is extracted as the target of the features, and training continues with both the compactness and reconstruction objectives to obtain a robust model. During the training phase (Fig. 5), both 2D-CAEs are trained, one on a stream of sequences of original images and the other on a stream of sequences of optical flow representations. Optical flow is the pattern of apparent motion of image objects between two consecutive frames caused by the movement of the objects; we use a color code for better visualization. Figure 6 shows some samples of images and their optical flow representations.

Representativeness loss $L_r$: the aim of the representativeness loss is to evaluate the capacity of the learned features to generalize the normal class. It increases the capacity of our model to raise the inter-class distance.

Compactness loss $L_c$: the objective of the compactness loss is to tighten all the features of the normal class used during the training phase. It evaluates the similarity between each feature vector and the fixed point c, and is used to decrease the intra-class variance of the normal class. To perform back-propagation with this loss, it is necessary to assess the contribution of each element of the input to the final loss. Writing $f_i = (F^v_{i1}, \ldots, F^v_{ik}) \in \mathbb{R}^k$ for the bottleneck feature of the $i$-th of $N$ samples, the loss and its gradient with respect to the input $F^v_{ij}$ are

$$L_c = \frac{1}{Nk} \sum_{i=1}^{N} \lVert f_i - c \rVert^2, \qquad \frac{\partial L_c}{\partial F^v_{ij}} = \frac{2}{Nk}\,\bigl(F^v_{ij} - c_j\bigr).$$

A sketch of how the two objectives can be combined in one training step is given below.
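The following TensorFlow sketch shows one way the two objectives of the pseudo-parallel method could be combined in a single training step. It is a minimal sketch under stated assumptions: the representativeness loss is realized here as the frame reconstruction error (consistent with its description, though its exact closed form is not given above), the weighting factor `lam` is hypothetical, and `encoder` denotes a sub-model that shares weights with the CAE and maps inputs to the bottleneck.

```python
# Sketch of one pseudo-parallel training step. Assumptions: the
# representativeness loss is realized as the reconstruction error, the
# weight lam is hypothetical, and encoder is a sub-model of cae that
# outputs the bottleneck (sharing the same weights).
import tensorflow as tf

def compactness_loss(features, c):
    # features: (batch, k) bottleneck vectors; c: (k,) fixed target point,
    # computed beforehand as the mean bottleneck feature of normal data.
    return tf.reduce_mean(tf.reduce_sum(tf.square(features - c), axis=-1))

def representativeness_loss(x, x_rec):
    # Reconstruction error between input frames and the CAE output.
    return tf.reduce_mean(tf.square(x - x_rec))

@tf.function
def train_step(cae, encoder, optimizer, x, c, lam=1.0):
    with tf.GradientTape() as tape:
        x_rec = cae(x, training=True)
        f = encoder(x, training=True)
        f = tf.reshape(f, (tf.shape(f)[0], -1))  # flatten bottleneck maps
        loss = representativeness_loss(x, x_rec) + lam * compactness_loss(f, c)
    grads = tape.gradient(loss, cae.trainable_variables)
    optimizer.apply_gradients(zip(grads, cae.trainable_variables))
    return loss
```

In the cascaded variant, the same two losses would instead be applied one after the other rather than summed.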
The proposed testing procedure classifies the features of test images as normal or abnormal by thresholding the Mahalanobis distance. The motion and shape feature vectors, denoted $F(V)$ and $F'(V')$ respectively, are extracted from the trained encoder parts and concatenated into one vector. We then apply PCA to this vector to reduce its dimension and extract the important information, giving $X = \mathrm{PCA}([F(V); F'(V')]) = \{x_{i1}, x_{i2}, \ldots, x_{ip}\} \in \mathbb{R}^p$ with $p < 2k$ (Fig. 7). Using PCA makes the computation of the covariance matrix $Q$ faster and simpler. For each new feature vector $x_{\text{test}}$, we compute the Mahalanobis distance to the mean $\bar{X}$ of $X \in \mathbb{R}^p$, with $Q \in \mathbb{R}^{p \times p}$ its covariance:

$$d(x_{\text{test}}) = \sqrt{(x_{\text{test}} - \bar{X})^{T}\, Q^{-1}\, (x_{\text{test}} - \bar{X})}.$$

The classification process is carried out as follows. In the first step, we extract the feature vectors $X = \{x_i\}$, $x_i \in \mathbb{R}^{512}$, from the normal training examples, then compute their mean $M$ and the inverse of the covariance matrix $Q$ of $X$. In the second step, we evaluate each feature vector $x_j$ of the testing frames with the Mahalanobis distance $d_j$ using $M$ and $Q$:

$$d_j = \sqrt{(x_j - M)^{T}\, Q^{-1}\, (x_j - M)}.$$

The outlier vectors, which actually represent abnormal frames, are then picked out by thresholding this distance: if $d_j$ exceeds a threshold $\alpha$, the vector $x_j$ is considered an outlier and the frame $p_j$ is labeled abnormal,

$$\text{label}(p_j) = \begin{cases} \text{abnormal}, & d_j > \alpha \\ \text{normal}, & \text{otherwise.} \end{cases} \qquad (6)$$

A sketch of this scoring pipeline is given below.
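The scoring pipeline just described can be sketched with NumPy and scikit-learn as follows. The PCA dimension, the threshold value, and the variable names are illustrative assumptions; in the paper the threshold α is set empirically.

```python
# Sketch of the test-time scoring pipeline: concatenate shape and motion
# features, reduce with PCA, then threshold the Mahalanobis distance to
# the normal training distribution. Dimensions and alpha are assumptions.
import numpy as np
from sklearn.decomposition import PCA

def fit_normal_model(F_train, Fp_train, p=512):
    # F_train / Fp_train: (n, k) shape and motion features of normal frames.
    X_full = np.concatenate([F_train, Fp_train], axis=1)   # (n, 2k)
    pca = PCA(n_components=p).fit(X_full)
    X = pca.transform(X_full)                              # (n, p)
    M = X.mean(axis=0)
    Q_inv = np.linalg.inv(np.cov(X, rowvar=False))         # inverse covariance
    return pca, M, Q_inv

def mahalanobis(x, M, Q_inv):
    d = x - M
    return np.sqrt(d @ Q_inv @ d)

def is_abnormal(f_shape, f_motion, pca, M, Q_inv, alpha):
    x = pca.transform(np.concatenate([f_shape, f_motion])[None, :])[0]
    return mahalanobis(x, M, Q_inv) > alpha   # Eq. (6): outlier if d_j > alpha
```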
The UMN dataset consists of 3 scenes, lawn (1450 frames), indoor (4415 frames), and plaza (2145 frames), and the ground truth is provided in the video frames, from which it must be extracted to evaluate performance. We evaluate our different methods using the Equal Error Rate (EER) and the Area Under the ROC Curve (AUC) as evaluation criteria: a smaller EER corresponds to better performance, while a bigger AUC corresponds to better performance. Our two methods obtain nearly the same results, with a slight advantage for the pseudo-parallel objectives method, which proves robustness to occlusion and high performance in anomaly detection compared with state-of-the-art methods.

To visualize the important effect of the compactness loss function, we apply PCA to extract two components from each feature produced by our architecture; these components are referred to below as features for visualization. Figure 8 illustrates the results. To better understand the effects, we categorize our database into three classes:

- Normal images, containing only normal events as given in the ground truth; this class is represented by green points in Fig. 8.
- Confused images, in which a portion of an anomaly starts to appear but the whole anomaly has not yet entered the scene; this class is represented by blue points in Fig. 8.
- Abnormal images, in which more than half of the anomaly has entered the scene; this class is represented by red points in Fig. 8.

Figure 8.1.a shows the features for visualization of our architecture trained with the representativeness loss only: each of the three classes occupies its own region of the space, which means that the representativeness loss has increased the inter-class distance between the three classes in an unsupervised way, using only the class of normal images (class one). To decrease the intra-class distance of the normal images, we use the compactness loss. Figure 8.1.b shows the features for visualization of our architecture trained with both the representativeness loss and the compactness loss: in this case, the normal images not only occupy a reserved region of the space but are also very tight and easy to separate from the abnormal images.

Combining the two CAEs decreases the EER from 17% to 11%, which shows the importance of using optical flow images to represent the motion in each frame. Table 2 shows our results on the Ped2 dataset and proves the robustness of our method compared to other state-of-the-art methods. Our results for each scene of UMN are presented in Table 3; even though our model is trained on different scenes, it shows that our method achieves good efficiency for anomaly detection. Table 4 shows our results on the whole UMN dataset using a single threshold that is independent of the scenes, proving that our method is efficient and robust to scene variation. The ROC curve, plotted with the roc_curve tool from the Python library sklearn.metrics, shows that our architecture achieves more than 99% AUC.

In this paper, new unsupervised methods were proposed to train CAEs with a deep one-class objective. We used these methods to learn a new architecture composed of two CAEs, one trained on video volumes and the second on optical flow representations. Our two networks extract high-level spatio-temporal features taking into account the movements and shapes present in each small region of the video. This robust representation makes it possible, with a simple classifier, to differentiate between normal and abnormal events. We tested our network on challenging datasets containing crowded scenes (UCSD Ped2 and UMN), and our method obtained high results competing with the best state-of-the-art methods in the detection of abnormal events (Fig. 9). Our future work will investigate strengthening our learning process and applying our model to drone video for anomaly detection.

References

[1] Generating code from a graphical user interface screenshot
[2] Learning deep representations of appearance and motion for anomalous event detection
[3] Detection of video anomalies using convolutional autoencoders and one-class support vector machines
[4] Learning temporal regularity in video sequences
[5] Deep residual learning for image recognition
[6] DeepFace: closing the gap to human-level performance in face verification
[7] Generating code from a graphical user interface screenshot
[8] Very deep convolutional networks for natural language processing
[9] Deep Speech 2: end-to-end speech recognition in English and Mandarin
[10] Abnormal event detection using HOSF
[11] Improved anomaly detection in crowded scenes via cell-based analysis of foreground speed, size and texture
[12] Detection of abnormal visual events via global optical flow orientation histogram
[13] Online detection of unusual events in videos via dynamic sparse coding
[14] Unusual event detection in crowded scenes by trajectory analysis
[15] Trajectory-based anomalous event detection
[16] Learning the distribution of object trajectories for event recognition
[17] YOLOv3: an incremental improvement
[18] Learning spatiotemporal features with 3D convolutional networks
[19] FaceNet: a unified embedding for face recognition and clustering
[20] Spatial-temporal convolutional neural networks for anomaly detection and localization in crowded scenes
[21] Plug-and-play CNN for crowd motion analysis: an application in abnormal event detection
[22] AVID: adversarial visual irregularity detection
[23] Abnormal crowd behavior detection using social force model
[24] Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates
[25] Multi-scale and real-time non-parametric approach for anomaly detection and localization. Computer Vis
[26] Spatial-temporal convolutional neural networks for anomaly detection and localization in crowded scenes. Signal Process
[27] Abnormal event detection using convolutional neural networks and 1-class SVM classifier
[28] Hybrid deep learning and HOF for anomaly detection
[29] Anomaly detection and localization in crowded scenes
[30] Abnormal event detection in videos using spatiotemporal autoencoder
[31] Learning to detect anomalies in surveillance video
[32] Chaotic invariants of Lagrangian particle trajectories for anomaly detection in crowded scenes
[33] Chaotic invariants based on local statistical aggregates
[34] Sparse reconstruction cost for abnormal event detection
[35] Learning deep features for one-class classification