key: cord-0056429-fr19b0gj
authors: Bakalos, Nikolaos; Voulodimos, Athanasios; Doulamis, Nikolaos; Doulamis, Anastasios; Papasotiriou, Kassiani; Bimpas, Matthaios
title: Fusing RGB and Thermal Imagery with Channel State Information for Abnormal Activity Detection Using Multimodal Bidirectional LSTM
date: 2021-01-28
journal: Cyber-Physical Security for Critical Infrastructures Protection
DOI: 10.1007/978-3-030-69781-5_6
sha: 5671d4ed5a0aeac2ff916bfc58eb11e69107b37f
doc_id: 56429
cord_uid: fr19b0gj

In this paper, we present a multimodal deep model for detection of abnormal activity, based on bidirectional Long Short-Term Memory neural networks (LSTM). The proposed model exploits three different input modalities: RGB imagery, thermographic imagery and Channel State Information from Wi-Fi signal reflectance to estimate human intrusion and suspicious activity. The fused multimodal information is used as input in a Bidirectional LSTM, which has the benefit of being able to capture temporal interdependencies in both past and future time instances, a significant aspect in the discussed unusual activity detection scenario. We also present a Bayesian optimization framework that fine-tunes the Bidirectional LSTM parameters in an optimal manner. The proposed framework is evaluated on real-world data from a critical water infrastructure protection and monitoring scenario and the results indicate a superior performance compared to other unimodal and multimodal approaches and classification models.

Abnormal activity detection is a research problem that attracts significant interest in the image and video analysis research community (e.g. [9, 10]). Many different techniques have been proposed in the field of computer vision and video analysis, including methods based on trajectory analysis [12], pixel-level processing [11], combined trajectory and low-level analysis [1], background modelling [14], object detection [13] and tracking [15], activity recognition [16], and crowd behavior analysis [17]. Despite the efficacy of such techniques, their dependence on strictly visual information makes them susceptible to occlusions, difficult fields of view and poor illumination conditions. This limitation has motivated the exploration of vision techniques beyond the visible spectrum. Thermographic data can provide a useful alternative stream of information. Thermal camera sensors are not sensitive to illumination changes [4]; on the other hand, thermal imagery carries no texture or color information. Since both RGB and thermal sensing are based on visual cues, an interesting idea is to supplement them with additional data that are not subject to the restrictions of visual information (such as occlusions).

Recent studies have indicated that wireless signal reflection can be effectively leveraged to sense human presence. Different kinds of techniques have been described in the literature, including device-free Software Defined Radio (SDR) methods, which process the Received Signal Strength of a transmitted signal. However, the accuracy of such techniques is often not sufficiently high [18]. In contrast, it has been shown that techniques based on commercial off-the-shelf (COTS) equipment [5] can yield good performance rates in human presence detection by making use of Channel State Information (CSI) [7].

Moving on from the input modalities to the machine learning models used for abnormal activity detection, deep learning techniques, and especially Convolutional Neural Networks (CNN), have been shown to outperform traditional classifiers [1, 6, 16], which is explained by their high representational capabilities. However, one limitation of CNNs is that they cannot inherently capture temporal interdependencies in a bidirectional manner, i.e. from both past and future time instances, which is an important aspect in time series modeling problems.

In this work, we propose a model based on a Bayesian optimized multimodal bidirectional LSTM neural network for abnormal activity detection. Our model harnesses the power of LSTM networks to capture long- and short-term dependencies, while the backward and forward passes of the bidirectional version of LSTM ensure the consideration of both past and future time instances. Our proposal also includes a Bayesian optimization framework that tunes the parameters of the utilized bidirectional LSTM. Finally, the combination of heterogeneous input modalities, namely RGB and thermal imagery with Channel State Information (CSI) from wireless signal reflection, leads to a significantly improved detection performance compared to cases that rely on a single information modality.

Contrary to traditional abnormal activity detection systems, which are usually based on RGB video sequence input, the work at hand considers an additional modality, that of thermographic imagery. Visual streams from RGB cameras are initially processed using the object detection module YOLO (You Only Look Once) [13]. YOLO locates spatial bounding areas on the frame and assigns each area a probability for an object class. A Convolutional Neural Network is used for object detection, comprising 24 convolutional layers and 2 fully connected layers. Each image frame is described as a class image $CL_{RGB}$, having the same size as the initial RGB image, where the $(x, y)$ pixel of the RGB image $I(x, y)$ is denoted as $o_{k,RGB}(x, y)$ in the class image in the following way:

where k denotes the object with identity k in the object detection module employed.
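The labelling step described above can be illustrated with a minimal Python sketch. It assumes the detector returns axis-aligned bounding boxes tagged with an integer object ID; the `Box` layout, the function name and the rule that a later box overwrites an earlier one where they overlap are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

# Hypothetical detection record: (object_id, x_min, y_min, x_max, y_max).
Box = Tuple[int, int, int, int, int]


def class_label_image(width: int, height: int, boxes: List[Box]) -> List[List[int]]:
    """Return a height x width grid in which each pixel holds the ID of the
    detected object covering it, or 0 where no object was detected."""
    labels = [[0] * width for _ in range(height)]
    for obj_id, x0, y0, x1, y1 in boxes:
        # Clip the box to the frame, then paint its pixels with the object ID.
        for y in range(max(0, y0), min(height, y1 + 1)):
            for x in range(max(0, x0), min(width, x1 + 1)):
                labels[y][x] = obj_id
    return labels


# Two toy detections on a 6 x 4 frame.
cl_rgb = class_label_image(6, 4, [(1, 1, 1, 2, 2), (2, 4, 0, 5, 1)])
for row in cl_rgb:
    print(row)
```

The result has the same size as the input frame, which is the property the class image needs for the later fusion step.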

Data acquired by thermographic sensors undergo background subtraction [14]. A class label image $CL_T$ is extracted, having the same size as the input thermal frame $T$, where the $(x, y)$ pixel of $T$ is denoted in the class label image as:
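As a rough illustration of how a thermal class label image could be derived, the sketch below uses a per-pixel median background model and a fixed threshold. Both choices, along with the function names and toy temperatures, are assumptions made for the example; the paper only states that background subtraction [14] is applied.

```python
from statistics import median
from typing import List

Frame = List[List[float]]


def median_background(frames: List[Frame]) -> Frame:
    """Estimate the static background as the per-pixel median of past frames."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[median(f[r][c] for f in frames) for c in range(cols)] for r in range(rows)]


def thermal_label_image(frame: Frame, background: Frame, threshold: float) -> List[List[int]]:
    """Mark a pixel as foreground (1) when it deviates from the background
    by more than `threshold`, and as background (0) otherwise."""
    return [
        [1 if abs(t - b) > threshold else 0 for t, b in zip(t_row, b_row)]
        for t_row, b_row in zip(frame, background)
    ]


# Toy 2 x 3 thermal frames (values in degrees Celsius, purely illustrative).
history = [
    [[20.0, 20.5, 20.0], [19.5, 20.0, 20.0]],
    [[20.2, 20.0, 19.8], [20.0, 20.1, 20.0]],
    [[19.9, 20.1, 20.0], [20.0, 19.9, 20.2]],
]
current = [[20.1, 35.0, 20.0], [20.0, 20.2, 34.0]]
cl_t = thermal_label_image(current, median_background(history), threshold=5.0)
print(cl_t)
```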

In order to facilitate the subsequent processing steps, the RGB and thermal image frames are resized to an identical size, $N \times M$. In other words, $x_{RGB}(n) \in \mathbb{R}^{N \times M}$ stands for an image in which each pixel indicates the ID of the object that pixel belongs to. In a similar manner, the tensor $x_{thermal}(n) \in \mathbb{R}^{N \times M}$ denotes the class label image of the thermographic modality.
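Because the pixels of a class label image are IDs rather than intensities, interpolating between them would create meaningless labels. One way to resize such an image, sketched below under the assumption that nearest-neighbour sampling is acceptable (the paper does not name the resizing method), is:

```python
from typing import List


def resize_labels(labels: List[List[int]], new_h: int, new_w: int) -> List[List[int]]:
    """Nearest-neighbour resize that copies existing IDs and never blends them."""
    old_h, old_w = len(labels), len(labels[0])
    return [
        [labels[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


small = [[1, 2], [3, 4]]
large = resize_labels(small, 4, 4)
for row in large:
    print(row)
```

Every value in the output is one of the original IDs, so the resized RGB and thermal label images can be stacked pixel by pixel.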

Channel State Information (CSI) can be leveraged for human movement detection using WiFi devices, based on propagation modeling of a signal from the transmitter to the receiver, supporting many subcarriers due to the Orthogonal Frequency Division Multiplexing (OFDM) principle. CSI includes physical attributes of the wireless channel, such as scattering, power decay per distance, fading, shadowing and effects of interference [7] , which are measured by the amplitude/phase over all K available subcarriers:

where $H(n, f_i)$ refers to the amplitude and the phase of the $i$-th subcarrier with central frequency $f_i$. Therefore, we have that:
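In the notation commonly used in the CSI literature, the two relations described above are usually written as follows. This is a sketch under that assumption and may differ in detail from the form used by the authors.

```latex
\mathbf{H}(n) = \big[\, H(n, f_1),\; H(n, f_2),\; \dots,\; H(n, f_K) \,\big]^{T},
\qquad
H(n, f_i) = \lvert H(n, f_i) \rvert \, e^{\, j \angle H(n, f_i)}
```

Here $\lvert H(n, f_i) \rvert$ is the amplitude and $\angle H(n, f_i)$ the phase of the $i$-th subcarrier at time instance $n$.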

Usually, the $H(n)$ input data contain noise and are distorted by outliers. For this reason, the CSI signals $H(n)$ need to undergo a pre-processing stage. First, outliers are removed using a Hampel identifier [8] or a density-based clustering algorithm such as DBSCAN [23]. Subsequently, noise is removed with wavelet denoising, followed by normalization, correlation of subcarriers and, finally, eigenvector processing of the signals. After preprocessing, the CSI data are used as input to a linear SVM for human intrusion detection. The SVM's output classification IDs, say $C_{CSI}(n)$, are used as input to our proposed multimodal bidirectional LSTM framework. The CSI-related input $x_{CSI}(n)$ is given by:
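The first pre-processing step can be sketched as a sliding-window Hampel identifier applied to one subcarrier's amplitude series. The window size, the threshold of three scaled median absolute deviations and the toy amplitudes are illustrative assumptions; the paper does not report the settings it used.

```python
from statistics import median
from typing import List


def hampel_filter(values: List[float], half_window: int = 3, n_sigmas: float = 3.0) -> List[float]:
    """Replace each sample that lies too far from its local median by that median."""
    scale = 1.4826  # makes the MAD a consistent estimate of sigma for Gaussian data
    cleaned = list(values)
    for i in range(len(values)):
        lo, hi = max(0, i - half_window), min(len(values), i + half_window + 1)
        window = values[lo:hi]
        med = median(window)
        mad = median(abs(v - med) for v in window)
        if abs(values[i] - med) > n_sigmas * scale * mad:
            cleaned[i] = med
    return cleaned


# Toy amplitude series with a single spike at index 4.
amplitudes = [1.0, 1.1, 0.9, 1.0, 9.0, 1.05, 0.95, 1.0]
cleaned = hampel_filter(amplitudes)
print(cleaned)
```

Only the spike is replaced; the ordinary fluctuations around 1.0 stay untouched because they fall within the local threshold.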

For spatial coherency with the visual input data, the tensor $x_{CSI}(n)$ is expanded over the $\mathbb{R}^{N \times M}$ grid, forming an additional input channel.

Approaches based solely on one of the above types of information are unavoidably limited by the weaknesses of that modality (e.g. occlusions, noise, etc.). We therefore propose fusing the information channels described above into a multimodal input tensor $x(n)$:

where $x_{RGB}(n)$ is the data tensor pertaining to the RGB visual signals, $x_{thermal}(n)$ the respective data tensor of the thermal component, and $x_{CSI}(n)$ the data tensor pertaining to the WiFi reflection signal.
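A minimal sketch of the fusion step follows, assuming the three modalities are stacked as channels of one $3 \times N \times M$ tensor and that the scalar SVM decision is simply replicated over the grid. The function names and toy values are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def expand_scalar(value: float, n_rows: int, n_cols: int) -> Grid:
    """Replicate a scalar (e.g. the CSI classifier output) over an N x M grid."""
    return [[value] * n_cols for _ in range(n_rows)]


def fuse(x_rgb: Grid, x_thermal: Grid, c_csi: float) -> List[Grid]:
    """Stack the RGB label image, the thermal label image and the expanded
    CSI decision into one tensor of shape 3 x N x M."""
    n_rows, n_cols = len(x_rgb), len(x_rgb[0])
    if len(x_thermal) != n_rows or any(len(row) != n_cols for row in x_thermal):
        raise ValueError("RGB and thermal label images must share the same N x M size")
    return [x_rgb, x_thermal, expand_scalar(c_csi, n_rows, n_cols)]


x_rgb = [[0, 1, 1], [0, 1, 1]]      # toy object-ID image
x_thermal = [[0, 1, 0], [0, 1, 0]]  # toy thermal foreground mask
x = fuse(x_rgb, x_thermal, c_csi=1.0)
print(len(x), len(x[0]), len(x[0][0]))
```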

The LSTM is a type of Recurrent Neural Network (RNN) designed to address the exploding and vanishing gradient problems that can arise when training traditional RNNs. LSTM networks are well suited to classifying, processing and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series [25, 26, 27]. In LSTMs, each node in the hidden layer is a memory cell rather than a single neuron [25]. The structure of a memory cell is illustrated in Fig. 1.

The LSTM memory cell is composed of the following: the forget gate, the input node, the input gate, and the output gate. The input gate controls the extent to which a new value flows into the cell, the forget gate controls the extent to which a value remains in the cell, and the output gate controls the extent to which the value in the cell is used to compute the output activation of the LSTM unit. The activation function of the LSTM gates is often the logistic sigmoid function. The goal of the forget gate is to decide what information should be discarded from the memory cell [24]. Its output, denoted as $f(n)$, ranges between 0 and 1, according to the sigmoid activation function. The forget gate learns whether a previous or future vector state is necessary for the estimation of the current value state. The input node performs the same operation as a hidden neuron of a typical recurrent regression model. We denote the output of this node as $I_n(n)$. The goal of this node is to estimate the way in which each latent state variable contributes to the final model.

As far as the input gate is concerned, its role is to regulate whether the respective hidden state is sufficiently important. The output of this gate is denoted as $I_g(n)$. It uses the sigmoid function, so its response ranges between 0 and 1. This gate addresses problems related to the vanishing of the gradient slope of the $\tanh(\cdot)$ operator. Finally, the output gate regulates whether the response of the current memory cell is sufficiently significant to contribute to the next cell. Therefore, this gate models the long-range dependency together with the forget gate. The output of this gate is denoted as $O(n)$.
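The interplay of the four components can be made concrete with a single forward step of a memory cell. The sketch below follows the widely used LSTM formulation with a forget gate; the parameter layout, the dictionary keys and the toy all-zero weights are illustrative assumptions rather than the authors' implementation.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
GateParams = Tuple[Matrix, Matrix, Vector]  # (input weights, recurrent weights, bias)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, x: Vector, U: Matrix, h: Vector, b: Vector) -> Vector:
    """Compute W x + U h + b, one output unit per row."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, GateParams]) -> Tuple[Vector, Vector]:
    """One time step of a memory cell: returns the new output h and cell state c."""
    def unit(name: str, activation: Callable[[float], float]) -> Vector:
        W, U, b = params[name]
        return [activation(z) for z in affine(W, x, U, h_prev, b)]

    f = unit("forget_gate", sigmoid)     # f(n): how much of the old state to keep
    i_g = unit("input_gate", sigmoid)    # I_g(n): how much of the candidate to admit
    i_n = unit("input_node", math.tanh)  # I_n(n): candidate value
    o = unit("output_gate", sigmoid)     # O(n): how much of the state to expose
    c = [fk * ck + gk * nk for fk, ck, gk, nk in zip(f, c_prev, i_g, i_n)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Toy example: 2 inputs, 1 memory cell, all weights and biases set to zero.
zero: GateParams = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {name: zero for name in ("forget_gate", "input_gate", "input_node", "output_gate")}
h, c = lstm_step([0.3, -0.2], [0.0], [1.0], params)
print(h, c)
```

With zero weights every gate outputs 0.5 and the input node outputs 0, so the cell state is simply halved; trained weights would instead let the gates open or close depending on the input.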

One of the disadvantages of the memory cell of Fig. 1 is that it considers only past state information. In contrast, bidirectional forms of LSTM can process data in both directions and therefore include, apart from the forward pass, an additional backward operation. The structure of a bidirectional LSTM, unfolded in time, is presented in Fig. 1(b). Detection of abnormalities in video and CSI time series is an application that can inherently benefit from this additional backward operation of the bidirectional LSTM, which is the base model adopted in this work.
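The effect of the backward operation can be seen in a small sketch that runs the same sequence through two independent recurrences, one in each direction, and pairs their states per time step. To keep the example short, a simple leaky accumulator stands in for the LSTM cell; this substitution and all names are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Step = Callable[[float, float], float]  # (input, previous state) -> new state


def run_direction(inputs: Sequence[float], step: Step, h0: float = 0.0) -> List[float]:
    """Unroll a recurrence over the inputs and collect the state at every time step."""
    h, states = h0, []
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(inputs: Sequence[float], step_fwd: Step, step_bwd: Step) -> List[Tuple[float, float]]:
    """Pair, for every time step, the forward state (past context)
    with the backward state (future context)."""
    forward = run_direction(inputs, step_fwd)
    backward = run_direction(list(reversed(inputs)), step_bwd)[::-1]
    return list(zip(forward, backward))


def leaky(x: float, h: float) -> float:
    """Stand-in recurrent unit: remembers half of its previous state."""
    return 0.5 * h + x


# An event occurs only at the last time step.
states = bidirectional([0.0, 0.0, 1.0], leaky, leaky)
print(states)
```

At the first time step the forward state is still 0, while the backward state is already non-zero because it has seen the later event. This is the additional information a bidirectional model can exploit.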

We hereby present a Bayesian optimization method for the selection of the bidirectional LSTM model parameters. Instead of tuning the model parameters manually, we use a probabilistic Bayesian approach through which they are tuned automatically.

As in all models, let us denote by $\pi_i$ a set of configurable parameters, e.g. in our case the number of memory cells, the learning rate, etc. Given a set of $Q$ different configurations, i.e., $D_{1:Q} = \{\pi_1, \dots, \pi_Q\}$, we can evaluate the error $E(x, d, \pi)$ yielded when (i) the model receives input data $x$, (ii) its output is compared against the target outputs $d$ and (iii) we consider a specific model configuration $\pi$. Let $E_{min}$ be the minimum Mean Square Error (MSE) across all $Q$ configurations. The following can then serve as an improvement function:
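A form of the improvement function that is common in the Bayesian optimization literature and consistent with the description above is given below. It is offered as a sketch under that assumption, not as the authors' exact equation.

```latex
I(x, d, \pi) = \max\big\{\, 0,\; E_{min} - E(x, d, \pi) \,\big\}
```

Under this definition a configuration contributes a positive improvement only when its error falls below the best error observed so far, $E_{min}$.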

Subsequently, the expectations of Eq. (6) can be computed in a probabilistic context. The probability distribution of the error function for a given set of configurations, $P(E \mid D_{1:Q})$, is written in a Bayesian context as:

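Usually $P(E)$ follows a Gaussian distribution and $P(D_{1:Q} \mid E)$ is then expressed as a Gaussian process of mean $\mu(\pi)$ and covariance matrix [28]:

$$K = \begin{bmatrix} k(\pi_1, \pi_1) & \cdots & k(\pi_1, \pi_Q) \\ \vdots & \ddots & \vdots \\ k(\pi_Q, \pi_1) & \cdots & k(\pi_Q, \pi_Q) \end{bmatrix}$$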

where $k(\cdot)$ is a kernel function. The target of our optimization is to find a new configuration $\pi^* \equiv \pi_{Q+1}$ that further reduces the MSE or, equivalently, increases the improvement $I(x, d, \pi^*)$. Then, for the new augmented set $D_{1:Q+1}$, which includes $\pi^* \equiv \pi_{Q+1}$, $P(D_{1:Q+1} \mid E)$ will again be a Gaussian process, with covariance matrix

$$\begin{bmatrix} K & b^{T} \\ b & k(\pi_{Q+1}, \pi_{Q+1}) \end{bmatrix}$$

where $K$ is the $Q \times Q$ matrix of kernel values $k(\pi_i, \pi_j)$ over the existing configurations and $b = [k(\pi_{Q+1}, \pi_1) \; \cdots \; k(\pi_{Q+1}, \pi_Q)]$. Then, according to [28], it can be proven that $P(E_{Q+1} \mid D_{1:Q}, \pi_{Q+1})$ is also Gaussian, with a mean value and standard deviation determined by the previous variables. Therefore, the new configuration $\pi^*$ is estimated through the integral of $I(\cdot)$ and $P(E_{Q+1} \mid D_{1:Q}, \pi_{Q+1})$, i.e. the expected improvement, where $P(E_{Q+1} \mid D_{1:Q}, \pi_{Q+1})$ is the probability distribution that $I(\cdot)$ follows.
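When the predictive distribution of the error at a candidate configuration is Gaussian with mean $\mu$ and standard deviation $\sigma$, the expected value of the improvement $\max\{0, E_{min} - E\}$ has a well-known closed form. The sketch below evaluates it for a few hypothetical candidates; the candidate names and their predicted values are invented for illustration, and the Gaussian process that would supply $\mu$ and $\sigma$ is not implemented here.

```python
import math


def normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu: float, sigma: float, e_min: float) -> float:
    """Expected value of max(0, e_min - E) when E ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(0.0, e_min - mu)  # no uncertainty: improvement is deterministic
    z = (e_min - mu) / sigma
    return (e_min - mu) * normal_cdf(z) + sigma * normal_pdf(z)


# Hypothetical candidates: name -> (predicted mean error, predictive std).
candidates = {
    "64 cells": (0.10, 0.01),
    "128 cells": (0.12, 0.08),
    "256 cells": (0.20, 0.02),
}
e_min = 0.11  # best error observed so far (illustrative value)
best = max(candidates, key=lambda name: expected_improvement(*candidates[name], e_min))
print(best)
```

In this toy setting the criterion prefers the uncertain candidate over the one with the slightly lower predicted error, which illustrates how expected improvement trades off exploitation against exploration.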

To evaluate the effectiveness of the proposed model, we have used a dataset that was created in the context of the European Horizon 2020 STOP-IT Project (https://stopit-project.eu/). STOP-IT aims at tackling the protection of critical water infrastructure using novel methods. The dataset includes RGB and thermal video sequences as well as Channel State Information. The RGB data were captured using an OB-500Ae camera with 1280 × 720 pixel resolution at 30 fps. The thermal data were obtained by means of a Workswell InfraRed Camera 640 (WIC) with a 640 × 512 pixel resolution at 30 fps. WiFi data were acquired using a transmitter-receiver pair comprising a WiFi router (TP-Link N300 TL-WR841N) and an Intel 5300 NIC receiver, with a 0.1 Hz capturing frequency. Data annotation was performed by end users on the basis of pre-determined scenarios that prescribed whether the captured activity over all data modalities should be considered irregular/abnormal.

The data of all modalities were normalized to the same range (0-1). The computer used for all training and testing had an Intel® Core™ i7-6700 CPU @ 4.00 GHz with 16 GB RAM and an NVIDIA GeForce GTX 1070 with 8 GB GDDR5 memory. The CUDA 9.2 Toolkit was also used for the deep learning classifiers.
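The normalization mentioned above can be sketched as a min-max rescaling. Whether the authors rescaled per modality, per frame or over the whole dataset is not stated, so the per-list scope used here is an assumption.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Linearly rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([10.0, 20.0, 30.0])
print(normalized)
```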

The first round of experiments focuses on the impact of using fused multimodal data as input, instead of considering a single modality alone. We initially experimented with the following popular machine learning models: (i) a linear kernel SVM, (ii) a non-linear Radial Basis Function SVM (RBF-SVM), (iii) a traditional feedforward neural network with 1 hidden layer of 10 neurons, (iv) a feedforward neural network with 2 hidden layers of 10 neurons each, (v) a CNN and (vi) a plain LSTM (without bidirectionality or optimization). Fig. 2 depicts the accuracy rates attained by the above classifiers with (a) only RGB and thermal input, (b) only CSI (WiFi) input and (c) multimodal input. The results show that the proposed data fusion scheme significantly increases detection performance regardless of the classification scheme.

In the second round of experiments, we validate the effectiveness of the proposed multimodal Bayesian optimized bidirectional LSTM. Focusing on the multimodal case, we compare the performance of the proposed model with the six models mentioned above (SVM-linear, SVM-RBF, the two FNNs, CNN, LSTM). The results of the experiments in terms of precision, recall, F1-score and accuracy are depicted in Table 1. We observe that the deep learning models (CNN, LSTM) clearly outperform the shallow classifiers, which is explained by the greater representational power of deep models in complex scenarios such as the discussed abnormal activity detection application. Moreover, the proposed approach based on the optimized bidirectional LSTM attains higher performance rates than the other examined deep learning models, revealing the contribution of both the bidirectionality and the proposed framework for Bayesian optimization of the network parameters.

Finally, we have experimented with providing as input to the classifiers a "window" of past frames of different sizes, in other words feeding the model with "memory". We have explored three cases for window length: no window, brief window (50 frames) and longer window (100 frames). The results for the multimodal case are depicted in Fig. 3. The presence of a time window in the input increases the performance in the examined cases of CNN, LSTM and the proposed optimized bidirectional LSTM, but the improvement ratio decreases as the window length increases. Furthermore, the improvement attained by the window is less significant in the proposed model than in the CNN and plain LSTM, where there is more room for improvement. In any case, the proposed model consistently outperforms the remaining examined approaches by a considerable margin.
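The window of past frames can be built as sketched below. The sketch assumes that each sample consists of the current frame preceded by a fixed number of past frames and that time steps without enough history are skipped; the paper does not specify either detail.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(frames: Sequence[T], past: int) -> List[List[T]]:
    """Return, for every time step with enough history, the `past`
    preceding frames followed by the current frame."""
    if past < 0:
        raise ValueError("past must be non-negative")
    return [list(frames[n - past : n + 1]) for n in range(past, len(frames))]


frames = list(range(5))                       # stand-ins for fused input tensors x(n)
no_window = sliding_windows(frames, past=0)   # the "no window" case
short_window = sliding_windows(frames, past=2)
print(short_window)
```

With the settings reported above, `past` would be 50 or 100; the toy value of 2 is used only to keep the printed output readable.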

In this paper, we proposed a multimodal bidirectional Long Short-Term Memory neural network (LSTM) model for detection of abnormal activity in critical infrastructures. Three input modalities are considered: RGB, thermal and Channel State Information, the fusion of which is shown to provide significant added value in the unusual activity detection scenario. The multimodal input is fed into a bidirectional LSTM, which allows for effective capturing of both forward and backward temporal dependencies. Moreover, a Bayesian optimization method is used to select the parameters of the employed model. The presented methods have been experimentally evaluated on a real-world critical water infrastructure monitoring and protection dataset, and have been shown to achieve very promising detection rates.

Funding. The research leading to these results has received funding from the EU H2020 research and innovation programme under grant agreement No. 740610, STOP-IT project.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

[1] Toward abnormal trajectory and event detection in video surveillance

[2] Bayesian filter based behavior recognition in workflows allowing for user feedback

[3] Hardware for machine learning: challenges and opportunities

[4] Data-driven background subtraction algorithm for in-camera acceleration in thermal imagery

[5] Tool release: gathering 802.11n traces with channel state information

[6] Gradient-based learning applied to document recognition

[7] R-TTWD: robust device-free through-the-wall detection of moving human with WiFi

[8] The identification of multiple outliers

[9] Video-based abnormal human behavior recognition - a review

[10] Crowded scene analysis: a survey

[11] Anomaly detection in crowded scenes

[12] Incremental behavior modeling and suspicious activity detection

[13] You only look once: unified, real-time object detection

[14] Background subtraction techniques: systematic evaluation and comparative analysis

[15] Superpixel-based tracking-by-segmentation using Markov chains

[16] A system for multicamera task recognition and summarization for structured environments

[17] Analyzing tracklets for the detection of abnormal crowd behavior

[18] FILA: fine-grained indoor localization

[19] Survey of multispectral image fusion techniques in remote sensing applications

[20] A comparative analysis of visual and thermal face image fusion based on different wavelet family

[21] Recurrent neural networks and robust time series prediction

[22] An adaptable neural-network model for recursive nonlinear traffic prediction and modeling of MPEG video sources

[23] A density-based algorithm for discovering clusters in large spatial databases with noise

[24] Learning to forget: continual prediction with LSTM

[25] Long short-term memory

[26] On the difficulty of training recurrent neural networks. In: ICML

[27] Sequence to sequence learning with neural networks

[28] Surrogating the surrogate: accelerating Gaussian-process-based global optimization with a mixture cross-entropy algorithm