Recovery of Missing Sensor Data by Reconstructing Time-varying Graph Signals

Authors: Mondal, Anindya; Das, Mayukhmali; Chatterjee, Aditi; Venkateswaran, Palaniandavar
Date: 2022-03-01

Wireless sensor networks are among the most promising technologies of the current era because of their small size, low cost, and ease of deployment. As the number of wireless sensors increases, so does the probability of generating missing data. This incomplete data could lead to disastrous consequences if used for decision-making. There is a rich literature dealing with this problem. However, most approaches show performance degradation when a sizable amount of data is lost. Inspired by the emerging field of graph signal processing, this paper studies a Sobolev reconstruction algorithm in the context of wireless sensor networks. Experimental comparisons on several publicly available datasets demonstrate that the algorithm surpasses multiple state-of-the-art techniques by a margin of up to 54%. We further show that this algorithm consistently retrieves the missing data even during massive data loss situations.

Wireless sensor networks (WSNs) are networks of linked sensor nodes that interact wirelessly to collect information of interest. These wireless nodes are tiny battery-powered devices that consume little power and are easy to deploy. For these reasons, WSNs are widely used for object detection, weather monitoring, pollution monitoring, security surveillance [1], and many other tasks. However, because the nodes are inexpensive and often deployed in remote locations, repairing nodes that malfunction (due to power depletion or damage) is often challenging. Also, since the number of nodes often runs into the thousands [1], manually replacing or repairing faulty sensors becomes a tedious task.
Since missing data severely hamper decision-making, it is necessary to recover the missing values. There are several approaches in the literature to deal with this problem. One widely used method is k-nearest neighbors (kNN) interpolation [2], [3]. kNN recovers the missing data by computing a weighted average of the readings of the k nearest neighboring nodes. Statistical methods based on the Expectation-Maximization (EM) algorithm [4] have also been proposed. Here the missing data is recovered by alternately computing the conditional expectation in the E-step and updating the estimate by maximizing this conditional expectation in the M-step. Even though these approaches are easy to implement and have beaten several contemporary state-of-the-art methods, they fail to consider the spatio-temporal relationships in the data, which restricts their use in real-life situations. To address these limitations, a few studies have treated spatio-temporal data recovery as a low-rank matrix completion (LRMC) [5], [6] problem, estimating missing values by finding an appropriate low-rank approximation of the original incomplete data matrix. Nonetheless, these methods capture only superficial features and fail to infer the embedded spatio-temporal dependencies in the sensor data. More recently, some promising data-recovery methods have been proposed, such as Probabilistic Matrix Factorization (PMF) [7]. In this method, the sensors are divided into groups based on their similarity using k-means clustering, and a PMF algorithm is then applied within each group to recover the missing data. Even though this algorithm gives promising results, its performance depends heavily on the choice of the hyperparameter k in k-means clustering, and it is prone to overfitting [8].
In recent years, due to the increasing demand for signal and information processing in irregular domains, the use of graph-based methods has increased in various fields, e.g. computer vision [9], [10], biological networks, point cloud processing, data science [11], prediction of infectious diseases [12], and others. Sensor networks are one such domain where graph signal processing (GSP) finds its most natural applications [11], [13]. GSP extends the concepts of classical digital signal processing to graphs. When GSP is applied to sensor networks, the graph vertices represent the positions of the sensor nodes, whereas the sensor attributes (readings like temperature, pressure, and humidity) are represented by a vertex-indexed signal, known as a graph signal [13]. In this paper, we show that GSP-based methods can be useful for interpolating missing sensor data. We formulate the problem of missing sensor data as a reconstruction of time-varying graph signals [12], [14]. Our method builds upon the work of Giraldo et al. [12], which predicts the number of new COVID-19 cases by extending the Sobolev norm [15] defined in GSP to time-varying graph signals [12]. The Sobolev algorithm successfully interpolates missing sensor values even in situations involving massive data loss. Our method also significantly improves upon the baselines when tested on several publicly available datasets.

Our contributions can be summarized as follows:
• We introduce the concepts of reconstruction of graph signals from GSP for recovering missing data in wireless sensor networks.
• The Sobolev algorithm shows significant performance improvement over state-of-the-art (SoTA) algorithms dealing with the problem of missing sensor data.
• We test the studied method on several publicly available datasets from diverse environments (indoor and outdoor) to demonstrate its versatility.
We organize the rest of the paper as follows. In Section II, we discuss the basics of GSP and then gradually move into the details of the studied method. The experimental framework (including datasets, evaluation metrics, experiments, and results) is discussed in Section III. Finally, in Section IV, we conclude the paper and outline directions for future work.

This section presents the mathematical notation used in this paper, along with the basics of graph signal processing. We then discuss the framework for recovering missing data in sensor networks, which is inspired by the theory of reconstruction of time-varying graph signals [14], [16] by minimizing the Sobolev norm [12].

For the convenience of readers, we first introduce the notation used in this paper. Uppercase boldface letters such as $\mathbf{X}$ denote matrices, and lowercase boldface letters such as $\mathbf{x}$ denote vectors. Calligraphic letters such as $\mathcal{L}$ denote sets. The Hadamard and Kronecker products are denoted by $\circ$ and $\otimes$, respectively. The vectorization of $\mathbf{X}$ is denoted by $\mathrm{vec}(\mathbf{X})$, and $\mathbf{X}^T$ denotes the transpose of the matrix $\mathbf{X}$. The diagonal matrix with entries $x_1, x_2, x_3, \ldots, x_n$ on its diagonal is denoted by $\mathrm{diag}(\mathbf{x})$. The $\ell_2$ norm of a vector $\mathbf{x}$ and the Frobenius norm of a matrix $\mathbf{X}$ are denoted by $\|\mathbf{x}\|_2$ and $\|\mathbf{X}\|_F$, respectively. The trace of a matrix $\mathbf{X}$ is given by $\mathrm{tr}(\mathbf{X})$. All other mathematical symbols used in this paper carry their usual meaning.

Let us consider an undirected weighted graph $G = (\nu, E, \mathbf{W})$. Here $\nu$ is the set of nodes (vertices) with $|\nu| = N$, $E = \{(i, j)\}$ is the set of edges, where $(i, j)$ is an edge between the nodes $i$ and $j$, and $\mathbf{W}$ is the weighted adjacency matrix. Following spectral graph theory [17] and graph signal processing [11], we define the unnormalized Laplacian of $G$ as

$$\mathbf{L} = \mathbf{D} - \mathbf{W},$$

where $\mathbf{D}$ is the diagonal degree matrix with entries $\mathbf{D}(i, i) = \sum_{j} \mathbf{W}(i, j)$. Finally, we define a graph signal as a function on the nodes, $x : \nu \to \mathbb{R}$. The graph signal can be represented as a vector $\mathbf{x} \in \mathbb{R}^N$, where $\mathbf{x}(i)$ is the value of the function at node $i \in \nu$.
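To make the Laplacian concrete, the following is a minimal NumPy sketch. It assumes the standard definition of the unnormalized Laplacian, L = D − W with D the diagonal degree matrix; the helper name `unnormalized_laplacian` and the three-node toy graph are illustrative and do not come from the paper.

```python
import numpy as np

def unnormalized_laplacian(W: np.ndarray) -> np.ndarray:
    """Return L = D - W for a symmetric weighted adjacency matrix W."""
    W = np.asarray(W, dtype=float)
    if W.ndim != 2 or W.shape[0] != W.shape[1] or not np.allclose(W, W.T):
        raise ValueError("W must be square and symmetric (undirected graph)")
    D = np.diag(W.sum(axis=1))  # diagonal degree matrix
    return D - W

# Toy undirected graph on 3 vertices (weights are illustrative only).
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
L = unnormalized_laplacian(W)

# A graph signal assigns one real value to each vertex.
x = np.array([1.0, 2.0, 4.0])
```

Each row of `L` sums to zero and `L` is symmetric positive semi-definite, which is what the smoothness measures used later in the paper rely on.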
In the theory of graph signal processing, sampling and reconstructing graph signals play an essential role [11]. For a graph signal to be reconstructed from its samples, it needs to be bandlimited [18]. In the GSP literature, most of the proposed recovery algorithms make the prior assumption that graph signals are smooth on the graph. In our work involving sensor networks, we also consider graph signals to be spatio-temporally smooth. This is because sensors at nearby positions report similar readings and the readings evolve gradually over time. For static graph signals on $G$, the graph Laplacian quadratic form $S_2(\mathbf{x}) = \mathbf{x}^T \mathbf{L} \mathbf{x}$ is widely used as a measure of smoothness [19]. However, in sensor networks, missing data may arise at any point in space-time; hence we need to consider both the spatial and the temporal distribution of missing data when reconstructing the original values. The concept of time-varying graph signals is therefore well suited to this form of missing data. Qiu et al. [14] extended the definition of $S_2(\mathbf{x})$ to time-varying graph signals. Let a time-varying graph signal be expressed as a matrix

$$\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_M] \in \mathbb{R}^{N \times M}.$$

Here each row of $\mathbf{X}$ represents a time series on the corresponding vertex. We define the smoothness of $\mathbf{X}$ as [14]:

$$S_2(\mathbf{X}) = \sum_{t=1}^{M} \mathbf{x}_t^T \mathbf{L} \mathbf{x}_t = \mathrm{tr}(\mathbf{X}^T \mathbf{L} \mathbf{X}).$$

As per Qiu et al., the temporal difference operator matrix $\mathbf{D}_h \in \mathbb{R}^{M \times (M-1)}$ is defined as:

$$\mathbf{D}_h = \begin{bmatrix} -1 & & & \\ 1 & -1 & & \\ & 1 & \ddots & \\ & & \ddots & -1 \\ & & & 1 \end{bmatrix} \tag{2}$$

This is done so as to include temporal information in the problem of time-varying graph signal reconstruction. The temporal difference signal is then defined as:

$$\mathbf{X}\mathbf{D}_h = [\mathbf{x}_2 - \mathbf{x}_1, \mathbf{x}_3 - \mathbf{x}_2, \ldots, \mathbf{x}_M - \mathbf{x}_{M-1}].$$

Qiu et al. proposed two different reconstruction methods, one for the noisy case and the other for the noiseless case. Since sensor networks accumulate considerable noise during their operational period, we consider the noisy case in our work.
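The temporal difference operator can be illustrated with a short NumPy sketch. It assumes the first-order convention in which column t of XD_h equals x_{t+1} − x_t; the function names and the toy data are hypothetical and are used only for illustration.

```python
import numpy as np

def temporal_difference_operator(M: int) -> np.ndarray:
    """Return D_h with shape (M, M-1); column t has -1 in row t and +1 in row t+1."""
    D_h = np.zeros((M, M - 1))
    cols = np.arange(M - 1)
    D_h[cols, cols] = -1.0
    D_h[cols + 1, cols] = 1.0
    return D_h

def temporal_difference_smoothness(X: np.ndarray, L: np.ndarray) -> float:
    """Evaluate tr((X D_h)^T L (X D_h)) for an N x M signal X and N x N Laplacian L."""
    XD = X @ temporal_difference_operator(X.shape[1])
    return float(np.trace(XD.T @ L @ XD))

# Toy data (illustrative only): 3 sensors on a path graph, 4 time steps.
L = np.array([[ 1.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  1.0]])
X = np.array([[1.0, 2.0, 4.0, 7.0],
              [0.0, 0.0, 1.0, 1.0],
              [5.0, 4.0, 3.0, 2.0]])
D_h = temporal_difference_operator(X.shape[1])
differences = X @ D_h  # same values as np.diff(X, axis=1)
```

A signal that does not change over time has zero temporal differences, so its temporal difference smoothness is exactly zero regardless of the graph.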
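As per Qiu et al., the noisy case is defined as follows:

$$\min_{\mathbf{X}} \; \frac{1}{2}\|\mathbf{J} \circ \mathbf{X} - \mathbf{Y}\|_F^2 + \frac{\gamma}{2}\,\mathrm{tr}\left((\mathbf{X}\mathbf{D}_h)^T \mathbf{L} \mathbf{X}\mathbf{D}_h\right) \tag{4}$$

Here $\mathbf{Y}$ is the sampled matrix, and the sampling matrix for the whole time-varying graph signal is $\mathbf{J} \in \{0, 1\}^{N \times M}$, defined as:

$$\mathbf{J}(i, t) = \begin{cases} 1 & \text{if } v_i \in S_t, \\ 0 & \text{otherwise,} \end{cases}$$

with $S_t$ being the sampled set of vertices at time $t$. In a nutshell, Eqn. 4 reconstructs a time-varying graph signal $\mathbf{X}$ with a small error $\|\mathbf{J} \circ \mathbf{X} - \mathbf{Y}\|_F^2$ while minimizing the smoothness of the temporal difference graph signal, $\mathrm{tr}\left((\mathbf{X}\mathbf{D}_h)^T \mathbf{L} \mathbf{X}\mathbf{D}_h\right)$. The regularization parameter $\gamma$ in Eqn. 4 weights the relative importance of the error and smoothness terms.

In 2020, Giraldo and Bouwmans [12] proposed a new algorithm for the reconstruction of time-varying graph signals, inspired by the minimization of the Sobolev norm. The norm was first defined by Pesenson et al. [15] to introduce the variational problem on graphs.

Sobolev norm: Let $\mathbf{L}$ and $\mathbf{x}$ be the Laplacian matrix and a graph signal, respectively. Let $\epsilon \geq 0$ and $\beta \in \mathbb{R}^+$ be two constant parameters. The Sobolev norm is defined as follows:

$$\|\mathbf{x}\|_{\beta,\epsilon} = \|(\mathbf{L} + \epsilon\mathbf{I})^{\beta/2}\mathbf{x}\|_2$$

When $\mathbf{L}$ is symmetric, we have:

$$\|\mathbf{x}\|_{\beta,\epsilon}^2 = \mathbf{x}^T(\mathbf{L} + \epsilon\mathbf{I})^{\beta}\mathbf{x} \tag{7}$$

Giraldo and Bouwmans found that the term $(\mathbf{L} + \epsilon\mathbf{I})$ in Eqn. 7 has a better condition number than $\mathbf{L}$ when $\epsilon > 0$. By extending the concept of the Sobolev norm to time-varying graph signals (considering $\mathbf{L}$ to be a symmetric matrix), we get:

$$\|\mathbf{X}\|_{\beta,\epsilon}^2 = \sum_{t=1}^{M}\|\mathbf{x}_t\|_{\beta,\epsilon}^2 = \mathrm{tr}\left(\mathbf{X}^T(\mathbf{L} + \epsilon\mathbf{I})^{\beta}\mathbf{X}\right) \tag{8}$$

Finally, the Sobolev reconstruction problem for time-varying graph signals is formulated as:

$$\min_{\mathbf{X}} \; \frac{1}{2}\|\mathbf{J} \circ \mathbf{X} - \mathbf{Y}\|_F^2 + \frac{\gamma}{2}\,\mathrm{tr}\left((\mathbf{X}\mathbf{D}_h)^T(\mathbf{L} + \epsilon\mathbf{I})^{\beta}\mathbf{X}\mathbf{D}_h\right)$$

Here we use the temporal difference operator of Eqn. 2 along with the Sobolev norm of time-varying graph signals of Eqn. 8. The schematic diagram of our method is shown in Fig. 1.

This section briefly discusses the implementation of our proposed method and the datasets used to assess it. We also discuss the evaluation metrics and compare our method with SoTA baselines [2]-[7] on those metrics. All the experiments were performed in MATLAB® 2018a on an Intel® Core™ i5 (8th Gen.) laptop running Linux Mint, with 4 CPUs clocked at 1.6 GHz each. We used Python 3.10.0 to generate the figures and graphs in the paper.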
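The Sobolev reconstruction problem described above can be sketched in NumPy for small networks. This is not the authors' MATLAB implementation. It assumes the objective ½‖J∘X − Y‖_F² + (γ/2) tr((XD_h)ᵀ(L + εI)^β XD_h), sets its gradient to zero, and solves the resulting linear system in vectorized form with a dense solver. The function name `sobolev_reconstruct` and the default parameter values are hypothetical.

```python
import numpy as np

def sobolev_reconstruct(Y, J, L, gamma=1e-2, eps=0.1, beta=1.0):
    """Solve min_X 0.5*||J o X - Y||_F^2 + 0.5*gamma*tr((X D_h)^T (L+eps I)^beta X D_h).

    Y : (N, M) observed values (entries where J == 0 are ignored)
    J : (N, M) binary sampling matrix
    L : (N, N) symmetric graph Laplacian
    """
    N, M = Y.shape
    # (L + eps*I)^beta via the eigendecomposition of the symmetric Laplacian.
    lam, U = np.linalg.eigh(L)
    lam = np.clip(lam, 0.0, None)  # remove tiny negative round-off
    S = (U * (lam + eps) ** beta) @ U.T
    # Temporal difference operator D_h of shape (M, M-1).
    D_h = np.zeros((M, M - 1))
    cols = np.arange(M - 1)
    D_h[cols, cols] = -1.0
    D_h[cols + 1, cols] = 1.0
    T = D_h @ D_h.T
    # Zero-gradient condition: J o X + gamma * S X T = J o Y.
    # With column-major vec: (diag(vec(J)) + gamma * kron(T, S)) vec(X) = vec(J o Y).
    A = np.diag(J.flatten(order="F").astype(float)) + gamma * np.kron(T, S)
    b = (J * Y).flatten(order="F")
    x = np.linalg.solve(A, b)
    return x.reshape((N, M), order="F")

# Toy example (illustrative only): 4 sensors on a path graph, 5 time steps.
L = np.array([[ 1.0, -1.0,  0.0,  0.0],
              [-1.0,  2.0, -1.0,  0.0],
              [ 0.0, -1.0,  2.0, -1.0],
              [ 0.0,  0.0, -1.0,  1.0]])
X_true = np.outer([1.0, 2.0, 3.0, 4.0], np.ones(5))  # constant over time
J = np.array([[1, 0, 1, 0, 1],
              [0, 1, 0, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]])
Y = J * X_true          # only sampled entries are observed
X_hat = sobolev_reconstruct(Y, J, L)
```

The linear system has a unique solution when ε > 0 and every sensor is sampled at least once. The dense Kronecker matrix has (NM)² entries, so this direct solve is only practical for small N and M; an iterative solver would be the natural replacement for larger problems.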
To evaluate the performance of our proposed method, we test it on two publicly available, widely used sensor-network datasets. To test the flexibility of the proposed method, we use datasets collected in diverse environments (indoor and outdoor).

Molène Dataset [20]: The French national meteorological service¹ published an open-access dataset of hourly weather observations in Brittany, France, for the month of January 2014. In addition to the graph of ground weather stations, the dataset contains hourly readings from those stations. Readings include temperature, wind characteristics, rain, and other information.

Intel Lab Dataset [21]: This dataset contains data collected from 54 sensors deployed in the Intel Berkeley Research lab between February 28th and April 5th, 2004². The sensors measured parameters such as temperature, humidity, light intensity, and voltage at an interval of 30 seconds in an indoor environment. For convenience in processing and representing the data, we consider the temperature readings only.

To compare our method against the baselines, we use two of the most widely used metrics in sensor networks, namely the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE). The two metrics are defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - x_i^*)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|x_i - x_i^*|$$

Here $n$ is the number of unavailable (missing) readings, $x_i$ is the ground-truth (original) value, and $x_i^*$ is the value reconstructed by the proposed method.

Let $\mathbf{M} \in \mathbb{R}^{N \times 2}$ be the matrix of positions of all sensor nodes in $\nu$ such that $\mathbf{M} = [\mathbf{m}_1, \ldots, \mathbf{m}_N]^T$, where $\mathbf{m}_i \in \mathbb{R}^2$ is the vector with the latitude and longitude (for the Molène dataset) or the x and y coordinates (for the Intel dataset) of vertex $i$. To connect the vertices of the graph, we use a k-nearest neighbors (kNN) strategy. The value of k is determined experimentally.
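The kNN connection strategy can be sketched as follows. This is a minimal illustration under two assumptions not stated in the text: distances are plain Euclidean distances between rows of the position matrix, and the graph is made undirected by keeping an edge whenever either endpoint lists the other among its k nearest neighbors. The function name `knn_adjacency` and the toy coordinates are hypothetical.

```python
import numpy as np

def knn_adjacency(positions, k: int) -> np.ndarray:
    """Return a symmetric boolean adjacency matrix for a kNN graph."""
    P = np.asarray(positions, dtype=float)
    N = P.shape[0]
    if not 0 < k < N:
        raise ValueError("k must satisfy 0 < k < number of nodes")
    # Pairwise Euclidean distances between sensor positions.
    dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a node is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]  # k closest nodes per row
    A = np.zeros((N, N), dtype=bool)
    rows = np.repeat(np.arange(N), k)
    A[rows, nearest.ravel()] = True
    return A | A.T  # union symmetrization gives an undirected graph

# Toy positions (illustrative only): four sensors along a line.
positions = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]])
A = knn_adjacency(positions, k=1)
```

With union symmetrization a node can end up with more than k neighbors; requiring mutual neighbors instead would give a sparser graph.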
The weight of each edge $(i, j)$ connecting the nodes $i$ and $j$ is given by $\mathbf{W}(i, j) = \exp\left(-\frac{\delta(i,j)^2}{\sigma^2}\right)$, where $\delta(i, j) = \|\mathbf{m}_i - \mathbf{m}_j\|_2$ is the Euclidean distance between the nodes $i$ and $j$, and $\sigma$ is the standard deviation of the Gaussian kernel, given by $\sigma = \frac{1}{|E|}\sum_{(i,j) \in E}\delta(i, j)$.

We randomly remove some data points from the dataset and then recover them to evaluate the performance of the Sobolev reconstruction model. For a fair comparison, we test the baseline methods on the same datasets on which we test our method. We also scale the data points in the two datasets to the range [0, 1]. The reconstructed values are scaled to this range as well. For the sampling matrix $\mathbf{J}$, we adopt a random sampling strategy in which each graph signal $\mathbf{x}_t$ has the same number of sampled nodes for all $t$ ($1 \leq t \leq M$). We empirically tune the constants $\epsilon$, $\gamma$, and $\beta$ to find the combination of values for which the proposed method performs best. Likewise, to compare the studied algorithm fairly with the baselines (kNN, Expectation-Maximization, LRMC, and PMF), we search for the best values of the parameters that affect their performance.

² http://db.csail.mit.edu/labdata/labdata.html

For the Molène dataset, the number of nearest neighbors is determined experimentally and is set to k = 5. Since not all weather stations have readings over the whole period of 31 days, we consider the temperature readings of 10 stations whose readings are consistently available. From the 744 hours of readings, we randomly remove some readings and use the remaining data as the available samples. The sampling densities are set to {0.1, 0.3, 0.5, 0.7}, where the sampling density is the fraction of readings that remain available. For example, at a sampling density of 0.1, 10% of the readings are available and the remaining 90% must be reconstructed.

For the Intel dataset, the value of k is set to 3. Here we consider the first $10^4$ readings out of about $10^5$ readings (38 days × 24 hours × 60 minutes × 2 readings per minute).
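The Gaussian edge weighting, the per-time-step random sampling, and the error computation over the removed entries can be sketched together in NumPy. This is an illustrative sketch rather than the authors' code: it assumes σ is the mean length of the undirected edges, rounds density × N to the nearest integer, and evaluates RMSE and MAE only where J is zero. All function names are hypothetical.

```python
import numpy as np

def gaussian_weights(positions, A):
    """Return (W, sigma) with W(i,j) = exp(-d(i,j)^2 / sigma^2) on the edges of A."""
    P = np.asarray(positions, dtype=float)
    A = np.asarray(A, dtype=bool)
    dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    upper = np.triu_indices_from(A, k=1)      # count each undirected edge once
    sigma = dist[upper][A[upper]].mean()      # mean edge length
    W = np.where(A, np.exp(-dist ** 2 / sigma ** 2), 0.0)
    return W, sigma

def random_sampling_matrix(N, M, density, rng):
    """Binary N x M matrix with the same number of sampled nodes in every column."""
    n_sampled = int(round(density * N))
    J = np.zeros((N, M), dtype=int)
    for t in range(M):
        J[rng.choice(N, size=n_sampled, replace=False), t] = 1
    return J

def rmse_mae_on_missing(X_true, X_hat, J):
    """RMSE and MAE computed over the entries that were removed (J == 0)."""
    missing = (J == 0)
    err = X_true[missing] - X_hat[missing]
    return float(np.sqrt(np.mean(err ** 2))), float(np.mean(np.abs(err)))

# Toy usage (illustrative only).
positions = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=bool)
W, sigma = gaussian_weights(positions, A)

rng = np.random.default_rng(0)
J = random_sampling_matrix(N=10, M=6, density=0.3, rng=rng)
```

Drawing the same number of nodes without replacement in every column is one simple way to satisfy the requirement that each time step has an equal number of sampled nodes.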
In this dataset too, we randomly remove some values with sampling densities {0.1, 0.3, 0.5, 0.7} and reconstruct those values using our method. For both datasets, we perform Monte-Carlo cross-validation [22] with 20 repetitions. Finally, we compare the reconstructed values with the ground-truth data and evaluate the performance of our algorithm.

Table I shows the performance of our algorithm under different sampling densities. We observe that our method performs better at higher sampling densities than at lower ones. Because the number of available readings increases with the sampling density, the algorithm has richer information about the underlying spatio-temporal structure of the dataset, which leads to better recovery of the missing values. In Table II, we compare the proposed method against the baselines [2]-[7] at different sampling densities. Fig. 2 shows that even though other methods fail to accurately recover the missing values at lower sampling densities, our method significantly outperforms them by providing consistent performance under those conditions. One explanation for our model's success may be its ability to effectively capture the spatial as well as temporal correlations.

In this paper, we introduce a time-varying graph signal reconstruction approach to solve the problem of missing sensor data recovery in wireless sensor networks. We show that the proposed method performs well under massive data loss, even without prior knowledge of the factors affecting the sensor readings. We also validate the proposed method on several publicly available datasets from diverse environmental conditions to demonstrate its flexibility. We believe that our method will help future researchers apply the promising field of GSP to various related problems such as intelligent transportation systems, weather forecasting, disaster management, and others.
In the future, we plan to extend the studied algorithm to non-smooth situations, for example, where nearby localities have dissimilar readings (e.g. in mountainous and plateau regions) and where weather conditions change abruptly with time. Using graph learning can also be a way forward.

V. ACKNOWLEDGEMENT

Anindya Mondal would like to thank Jhony H. Giraldo of MIA Lab, La Rochelle Université, France, who proposed the Sobolev norm minimization method for reconstructing time-varying graph signals [12], for sharing his valuable insights on various sections of this paper. We also thank the Texas Instruments Innovation Laboratory, Jadavpur University, for allowing us to use the laboratory resources to carry out the experiments.

REFERENCES

Simultaneous State Estimation of Cluster-Based Wireless Sensor Networks
K-nearest Temperature Trends: A Method for Weather Temperature Data Imputation
Expectation-Maximization Approach to Fault Diagnosis with Missing Data
Matrix and Tensor Based Methods for Missing Data Estimation in Large Traffic Networks
Low-Rank Data Matrix Recovery with Missing Values and Faulty Sensors
Probabilistic Recovery of Incomplete Sensed Data in IoT
Missing Data Problem in the Monitoring System: A Review
Moving Object Detection for Event-based Vision using Graph Spectral Clustering
Moving Object Detection for Event-based Vision using k-means Clustering
Graph Signal Processing: Overview, Challenges, and Applications. Proc. of the IEEE
On the Minimization of Sobolev Norms of Time-varying Graph Signals: Estimation of New Coronavirus Disease 2019 Cases
Graph Signal Processing in Applications to Sensor Networks, Smart Grids, and Smart Cities
Time-varying Graph Signal Reconstruction
Variational Splines and Paley-Wiener Spaces on Combinatorial Graphs
Introduction to Graph Signal Processing
Spectral Graph Theory. No. 92
Discrete Signal Processing on Graphs: Sampling Theory
The Emerging Field of Signal Processing on Graphs: Extending High-dimensional Data Analysis to Networks and Other Irregular Domains
Stationary Graph Signals Using an Isometric Graph Translation
Intel Berkeley research lab data, USA: Intel Corporation
Monte Carlo cross validation