key: cord-0044185-c03pysjc
authors: Khan, Nadia Masood; Khan, Gul Muhammad; Matthews, Peter
title: AI Based Real-Time Signal Reconstruction for Wind Farm with SCADA Sensor Failure
date: 2020-05-06
journal: Artificial Intelligence Applications and Innovations
DOI: 10.1007/978-3-030-49186-4_18
sha: a48ee0c8f0e4780c92016a6efc70bdd21bd03090
doc_id: 44185
cord_uid: c03pysjc

Supervisory Control and Data Acquisition (SCADA) systems used in wind turbines for monitoring the health and performance of a wind farm can suffer from data loss due to sensor failure, transmission link breakdown or network congestion. Sensor data are used for important control decisions, and such data loss can make failures harder to detect. This work proposes various solutions to reconstruct the lost information of important SCADA parameters using linear and non-linear Artificial Intelligence (AI) algorithms. It makes three major contributions: (1) signal reconstruction from other available SCADA parameters, (2) comparison of linear and non-linear AI models, and (3) generalization of the AI algorithms between turbines. Experimental results demonstrate the effectiveness of the developed methodologies for reconstructing the lost information for valuable planning decisions.

Wind energy generation is an ideal source of green energy, and the capacity of wind farms has consequently increased 30-fold, with 17% cumulative growth, in the last few years. Wind energy is expected to supply 12% of worldwide energy demand by 2020 [22]. As the wind industry grows, wind turbines are increasingly likely to be installed in diverse climatic conditions, both onshore and offshore, which calls for continuous monitoring. The operation and performance of these systems are monitored using a Supervisory Control and Data Acquisition (SCADA) system. Unexpected failures of wind turbine components increase machine downtime, repair cost and, subsequently, the cost of energy.
Condition monitoring is often used to track health parameters, e.g. temperature or vibration, that reflect the condition of the machinery; any significant change in their pattern indicates a developing failure [4]. One of the most significant problems arises when a communication link or a sensor fails, so that faulty data or no data at all are received for timely control decisions [13]. This research develops a system for reconstructing the lost signal from weakly correlated parameters when one of the SCADA sensors fails to send data. Linear and non-linear AI algorithms have been analysed to find a generalized model that is robust and performs well for all wind turbines in a wind farm rather than for a single turbine. The wind power curve defines the relationship between wind speed and power. It is frequently used to monitor the health of a wind turbine using SCADA data received from other system parameters. This study assumes that wind power, being the most important parameter for monitoring the performance and health of a wind turbine, is lost or corrupted in transmission.

Artificial intelligence (AI) based models are extensively used for detecting failures and predicting wind power from the SCADA system of a wind farm [17]. Research in the literature focuses mostly on signal reconstruction from historical data; signal reconstruction from other available parameters has not been considered as an option. The motivation of this research is to reconstruct a signal when one of the SCADA sensors fails to send data, due either to a sensor failure or to a communication breakdown that lasts longer than expected. When the highly correlated variables and the historical data are unavailable for a very long time, signal reconstruction becomes vital for optimum operation of the plant.
Electrical power generated by a wind turbine is taken to be the corrupted or lost signal in this case, since it is the most important parameter describing the normal operation of a wind turbine and is hard to predict because of its high degree of fluctuation and randomness. AI algorithms have the ability to learn and model non-stationary behaviour. We have explored two AI methods, random forest and the Cartesian Genetic Programming evolved Artificial Neural Network (CGPANN), and compared their results with a linear regression model to find the best performing model for accurate estimation of the failed sensor data. Training and test results on the same turbine show random forest performing much better than its counterparts, but its performance degrades when it is tested on data from other wind turbines in the same wind farm. The CGPANN model, which has a multilayered feed-forward architecture arranged in Cartesian format, shows remarkable generalization and continues to perform better in diverse data conditions [6].

SCADA collects data from a machine and sends it to a central processing unit for proactive measures. The data collected and stored by SCADA comprise information on every aspect of a wind farm, which can be used to infer the overall health of a wind turbine in real time. SCADA systems are often at risk from factors such as sensor failure, network congestion, limited power or equipment abnormality, all of which result in data loss [12]. A failed sensor may either send abnormal data or send no data at all; this work focuses on the latter case. Missing or lost data is a challenge in engineering and industry, specifically in applications that employ sensing technologies for intelligent real-time monitoring and control, such as offshore wind farm SCADA systems and wireless sensor networks [12].
A framework for effective data management that deals with missing and corrupted samples in the acquired data is developed in [15] for fatigue assessment of offshore wind turbines. Yang et al. [23] proposed a machine learning based reconstruction model for real-time condition monitoring and fault detection. They performed correlation analysis to select input parameters and then used Support Vector Regression (SVR) to build a reconstruction model. Their focus was on failures caused by high temperature, so signals relevant to temperature faults were selected as input features to estimate the generator drive end temperature. Singh [19] used the wind power curve to identify abnormal operation of a wind turbine. Wind power is considered the vulnerable parameter, since deviation of power from its normal operational values helps in identifying probable failures in advance. Establishing a generalised model to reconstruct the lost or corrupted SCADA signal of wind power is the focus of this work.

Lind et al. [11] explored a stochastic approach to reconstruct tower top acceleration signals from a single external variable, i.e. wind speed, and previous values of the tower top acceleration. Their finding was that signal reconstruction can be used to monitor and detect abnormal behaviour. A de-noising auto-encoder (DAE) is proposed in [1] for reconstructing the original sensor measurements of a SCADA system corrupted by a covert cyber-deception attack (CCDA). An ad-hoc method is presented in [14] to reconstruct long bursts of data lost by SCADA due to sensor or communication failure. Lamrini et al. [10] applied self-organizing map (SOM) based methods to reconstruct data from a water treatment system, dealing with data loss due to sensor failure or corrupted input data.
A number of statistical and artificial intelligence models have been proposed for wind power estimation to prevent damage to wind turbines and ensure stability of the power system. Sun et al. [21] proposed a hybrid model of deep belief networks (DBN) and random forest for short-term wind power forecasting.

SCADA is widely used in different areas for monitoring and control in real time. The SCADA data used in this study were acquired from the La Haute Borne wind farm (ENGIE Green), which consists of four Senvion MM82 wind turbines located in the Grand Est region of north-eastern France. The SCADA system collected 31 parameters along with statistics of each parameter: average, maximum, minimum and standard deviation. Since the average value of these parameters captures most of the information, only the average value of each parameter has been used for the experiments. Each data point is sampled at a 10 min interval. The nominal power of each turbine at the La Haute Borne wind farm is 2050 kW, with a rotor diameter of 82 m and a hub height of 80 m. The cut-in wind speed is 3.5 m/s, the nominal wind speed is 14.5 m/s and the cut-out wind speed is 25 m/s. The parameters in the dataset include active power, reactive power, vane position, wind speed, nacelle angle, gearbox bearing temperature, generator bearing temperature, pitch angle, torque and converter torque. The layout of the wind farm, with latitude/longitude and inter-turbine distances in meters, is shown in Fig. 1. The wind farm comprises four wind turbines: R80711, R80780, R80721 and R80736.

The objective of this work is to develop a reliable signal reconstruction model for wind power prediction from other SCADA parameters. This is necessary because the power produced by a wind turbine reflects its health and therefore helps in monitoring the condition of the turbine.
Accurate wind power prediction has a significant economic and technical impact on reliable large-scale wind power integration and on important energy management planning decisions. Ample work has been carried out to estimate generated power from wind speed, and acceptable accuracy is reported in the literature [2]. The challenging part is to estimate wind power when closely correlated parameters such as wind speed, torque, apparent power and rotor speed are not available, and the system has to predict highly fluctuating wind power from very weakly correlated parameters with non-linear and non-stationary characteristics. This work focuses on restoring the lost information from the weakly correlated parameters, assuming that all the highly correlated parameters are missing. Because of the non-linear nature of the problem, an adequately accurate algorithm needs to be developed to model this complex relationship.

Wind power is estimated in the absence of highly correlated variables. Any deviation of a power curve from its normal operational values is indicative of a number of faults such as blade pitch angle failure, blade damage, pitch control failure and blades affected by ice or dirt [20]. The idea is to develop a model in which the output has low correlation with the input signals and the input signals have low cross-correlation with one another [18]. As part of the methodology, a correlation matrix has been generated, and input signals having an absolute cross-correlation coefficient greater than 0.8 are removed from the input when developing the wind power signal reconstruction model in this study.

One year of data (2014) is divided into three segments based on wind speed. Figure 2 shows the performance curve for each region. The first segment contains the data recorded when the wind speed is below cut-in and the turbine blades are trying to overcome friction. No power is produced in this region.
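The correlation-based input selection described in this section can be illustrated with a small sketch. It is one possible reading of the rule, not the authors' implementation: a candidate input is dropped if its absolute Pearson correlation with the target exceeds 0.8, or if it exceeds 0.8 with an input that has already been kept. The greedy ordering, the handling of constant signals, and the signal names (`power`, `wind_speed`, `temp_a`, `temp_b`) are assumptions made for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: treat a constant signal as uncorrelated
    return cov / math.sqrt(var_a * var_b)

def select_inputs(signals, target_name, threshold=0.8):
    """Keep inputs whose |r| with the target and with kept inputs is <= threshold."""
    target = signals[target_name]
    kept = []
    for name, values in signals.items():
        if name == target_name:
            continue
        if abs(pearson(values, target)) > threshold:
            continue  # too closely tied to the lost signal
        if any(abs(pearson(values, signals[k])) > threshold for k in kept):
            continue  # redundant with an input that is already kept
        kept.append(name)
    return kept

# Hypothetical toy signals, not taken from the La Haute Borne dataset.
signals = {
    "power": [0, 10, 20, 30, 40, 50],
    "wind_speed": [3.0, 4.0, 5.0, 6.0, 7.0, 8.0],  # perfectly correlated with power
    "temp_a": [20, 25, 19, 27, 21, 24],            # weakly correlated with power
    "temp_b": [21, 26, 20, 28, 22, 25],            # temp_a shifted by one degree
}
selected = select_inputs(signals, "power")
print(selected)
```

In this toy example `wind_speed` is rejected for tracking the target too closely and `temp_b` is rejected as redundant with `temp_a`, so only `temp_a` remains as an input.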
Below cut-in speed, the wind turbine instead draws power from the grid to keep the blades moving and so prevent damage to the blades from ice and dust. The second segment is the most important to model, as it contains data from cut-in speed up to rated wind speed, where the power produced grows rapidly. In segment 3, constant power is produced until the wind speed reaches the cut-out speed and the turbine is turned off to prevent damage from excessive wind speed. These three segments represent the key modes of wind turbine operation. Linear regression, random forest and CGPANN are implemented on each data segment and then on the data without segmentation.

The data from wind turbine R80721 from 01/01/2014 to 31/12/2014 are selected for training and testing, following a split of 75% training and 25% testing [9]. Testing has been performed on the same turbine and on data from the other three wind turbines in the wind farm. The performance of the reconstruction algorithms is tested on all three segments as well as on the whole dataset. Similarly, data from the other three wind turbines are divided into segments and evaluated to find the best signal reconstruction algorithm.

Linear regression is a statistical method used for predicting a dependent variable from a single independent variable. It is termed multiple linear regression (MLR) [5] when the dependent variable $Y$ is predicted from a set of independent variables $X_1, \dots, X_k$, where $k$ is the number of predictor variables:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$ (1)

The model parameters $\beta_0$ (intercept) and $\beta_1$ to $\beta_k$ (regression coefficients) are learned from the data, and $\varepsilon$ is the residual error in $Y$. Python's Scikit-learn library [16] implements a number of algorithms, and linear regression is implemented with Scikit-learn in this study on the transformed dataset.

Random forest [3] is a supervised machine learning algorithm which consists of an ensemble of decision trees. Using more trees generally gives more robust performance.
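The wind-speed segmentation and the 75%/25% split described above could be sketched as follows. The thresholds are the cut-in, nominal and cut-out speeds quoted for the turbines; everything else is an assumption for illustration: the record layout (`wind_speed`, `power` keys), which segment a boundary value belongs to, the exclusion of samples above cut-out, and a chronological split without shuffling (the text does not say how the split was drawn).

```python
CUT_IN, RATED, CUT_OUT = 3.5, 14.5, 25.0  # m/s, as quoted for the turbines

def segment_of(wind_speed):
    """Return 1, 2 or 3 for the operating segment, or None above cut-out."""
    if wind_speed < CUT_IN:
        return 1  # below cut-in: no power produced
    if wind_speed < RATED:
        return 2  # cut-in up to rated speed: power grows rapidly
    if wind_speed <= CUT_OUT:
        return 3  # rated up to cut-out: constant power
    return None   # assumption: samples above cut-out are excluded

def split_by_segment(records):
    """Group records (dicts with a 'wind_speed' key) by operating segment."""
    segments = {1: [], 2: [], 3: []}
    for rec in records:
        seg = segment_of(rec["wind_speed"])
        if seg is not None:
            segments[seg].append(rec)
    return segments

def split_75_25(records):
    """Chronological 75% / 25% train/test split (assumption: no shuffling)."""
    cut = int(len(records) * 0.75)
    return records[:cut], records[cut:]

# Hypothetical 10-minute samples; the values are made up for illustration.
records = [
    {"wind_speed": 2.1, "power": 0.0},
    {"wind_speed": 5.0, "power": 210.0},
    {"wind_speed": 9.3, "power": 980.0},
    {"wind_speed": 12.0, "power": 1800.0},
    {"wind_speed": 13.9, "power": 2010.0},
    {"wind_speed": 16.2, "power": 2050.0},
    {"wind_speed": 21.0, "power": 2050.0},
    {"wind_speed": 26.5, "power": 0.0},
]
segments = split_by_segment(records)
train, test = split_75_25(segments[2])
print({seg: len(rows) for seg, rows in segments.items()}, len(train), len(test))
```

Each segment, and the unsegmented data, would then be passed separately to the three reconstruction models.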
The ensemble of decision trees is trained using a bagging method. Bagging trains a number of learning models whose combination improves the overall result. RF can be represented as an ensemble of $C$ trees $T_1(X), T_2(X), \dots, T_C(X)$ having $m$ inputs given by $X = \{x_1, x_2, \dots, x_m\}$. The resulting ensemble produces $C$ outputs $\hat{Y}_1 = T_1(X), \hat{Y}_2 = T_2(X), \dots, \hat{Y}_C = T_C(X)$, and the mean of the outputs of these randomly generated trees gives the final prediction $\hat{Y}$ of the random forest regression:

$\hat{Y} = \frac{1}{C} \sum_{c=1}^{C} \hat{Y}_c$ (2)

Random forest falls under the umbrella of artificial intelligence models; it develops decision trees in which the root node holds the most important feature. Random forest is implemented using the Scikit-learn library [16] with the default parameter settings. Hyper-parameters are not optimized, since the model performed well with the defaults.

The Cartesian Genetic Programming evolved Artificial Neural Network (CGPANN) was first proposed in 2010 [8]. The study conducted in [7] shows that CGPANN performed best when compared with the Hidden Markov Model (HMM), Auto Regressive Integrated Moving Average (ARIMA), a regression model, Classification and Regression Tree (CART), and a neural network model for time series prediction. The essence of CGPANN is to tune the hyper-parameters using an evolutionary method called Cartesian Genetic Programming (CGP) instead of traditional gradient-based methods. All neurons in CGPANN are arranged in Cartesian format. Recently, a single-row format has been used, since it provides more flexibility for generating infinite graphs, as shown in Fig. 3. Figure 3 shows the CGPANN phenotype with the corresponding genotype in parts A and B. Not all neurons are connected in CGPANN when graphs are created from outputs to inputs. This property gives CGPANN the flexibility to generate arbitrary graphs.
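To make the averaging step of the random forest description concrete, here is a toy sketch of bagging written in plain Python. It is not the Scikit-learn `RandomForestRegressor` used in the study: each "tree" is a depth-1 regression stump on a single feature, there is no random feature selection, and the function names, tree count and seed are hypothetical. It only shows bootstrap sampling followed by taking the mean of the individual tree outputs.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature by minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x in the sample is identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def fit_bagged_trees(xs, ys, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict(trees, x):
    """Final prediction is the mean of the individual tree outputs."""
    outputs = [tree(x) for tree in trees]
    return sum(outputs) / len(outputs)

# Hypothetical step-like data, loosely resembling a jump in power.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 10, 10, 10, 10]
trees = fit_bagged_trees(xs, ys)
low, high = predict(trees, 1), predict(trees, 8)
print(low, high)
```

Because every tree sees a different bootstrap sample, the individual stumps place their split at slightly different points, and the averaged prediction is smoother than that of any single tree.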
The numbers in the genotype can change during evolutionary training through the mutation operator, causing the network topology to change accordingly and producing novel solutions to the problem. A direct encoding scheme is used to encode the connection weights, the connection type and the topology of the network. The $(1 + \lambda)$ evolutionary strategy with $\lambda = 9$ is followed to produce a population of probable solutions. In this work, the log-sigmoid function is used as the activation function, as represented by Eq. 3:

$f(x) = \frac{1}{1 + e^{-x}}$ (3)

Two performance metrics are applied to evaluate the ability of the three models to reconstruct wind power from loosely correlated parameters. One of them, the mean absolute error (MAE) reported in the results, is

$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$

where $y_i$ is the actual value, $\hat{y}_i$ is the estimated value and $N$ is the total number of observations.

The proposed wind power reconstruction model is trained on the parameters available in the SCADA dataset under consideration. The framework for the proposed model is shown in Fig. 4. Condition monitoring data from the SCADA system are segmented by wind speed, resulting in four sub-datasets: the three segments and the data without segmentation. Each segment is analysed separately to select inputs for training. Figure 5 shows the correlation heat map for each segment. The heat map serves two purposes in this research. First, it verifies the different modes of operation of a wind turbine through the differences between the heat maps of the segments, which demonstrate that the correlation between variables varies from segment to segment. Second, it locates the variables having an absolute cross-correlation greater than 0.8 so that they can be removed from the input. The main aim of this research is to reconstruct an important parameter, wind power, which reflects the health of a wind turbine, from poorly correlated input parameters. Pitch angle (Ba) has very low correlation with power in segment 1, but a noticeable negative correlation in segments 2 and 3. The blade pitch angle keeps changing to capture most of the wind energy.
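The $(1 + \lambda)$ strategy with $\lambda = 9$ and the log-sigmoid activation mentioned above can be illustrated with a deliberately simplified sketch. Here the "genotype" is just the weight and bias of a single log-sigmoid neuron, whereas a real CGPANN genotype also encodes connections and topology; the mutation rate, mutation scale, number of generations, toy data and the rule of accepting offspring that are equal or better than the parent are all assumptions for illustration.

```python
import math
import random

def log_sigmoid(x):
    """Log-sigmoid activation: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_mae(genotype, xs, ys):
    """MAE of one neuron y = log_sigmoid(w * x + b), with genotype = [w, b]."""
    w, b = genotype
    return sum(abs(y - log_sigmoid(w * x + b)) for x, y in zip(xs, ys)) / len(xs)

def mutate(genotype, rng, rate=0.5, scale=0.3):
    """Perturb each gene with probability `rate` (hypothetical settings)."""
    return [g + rng.gauss(0.0, scale) if rng.random() < rate else g
            for g in genotype]

def one_plus_lambda(fitness, parent, lam=9, generations=200, seed=0):
    """Keep one parent; each generation create `lam` mutants from it and
    promote the best mutant if it is at least as fit as the parent."""
    rng = random.Random(seed)
    parent_fit = fitness(parent)
    for _ in range(generations):
        offspring = [mutate(parent, rng) for _ in range(lam)]
        scored = [(fitness(child), child) for child in offspring]
        best_fit, best_child = min(scored, key=lambda pair: pair[0])
        if best_fit <= parent_fit:
            parent, parent_fit = best_child, best_fit
    return parent, parent_fit

# Hypothetical target: outputs of a neuron with w = 3 and b = -1.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [log_sigmoid(3.0 * x - 1.0) for x in xs]
start = [0.0, 0.0]
start_fit = neuron_mae(start, xs, ys)
best, best_fit = one_plus_lambda(lambda g: neuron_mae(g, xs, ys), start)
print(start_fit, best_fit, best)
```

Since the parent is replaced only by an offspring that is at least as fit, the error can never increase from one generation to the next, which is the elitist property of a $(1 + \lambda)$ scheme.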
When power production drops because of a change in wind direction in segment 2, the pitch angle is changed to increase power production. In segment 3, when power reaches its rated maximum value, the pitch angle is adjusted to limit production and protect the turbine blades from damage. Similarly, wind speed is recorded by three sensors (Ws, Ws1, Ws2) at different locations on a wind turbine. Wind speed has the highest correlation with power in segment 2, since the power produced depends on wind speed in this region. In segments 1 and 3, the correlation between wind speed and power is low, as the power produced is constant in these two regions. The heat map thus demonstrates the operational modes in each segment and exploits wind turbine physics.

After input selection, each sub-dataset is split into training and testing data. The training data are then used to develop three reconstruction models using linear regression, random forest and CGPANN. The trained reconstruction models are then tested on two types of testing dataset: testing data from the same wind turbine the model was trained on, and data from the other three wind turbines in the same wind farm.

To validate the performance of linear and non-linear methods for wind power reconstruction, experiments have been carried out on data from four wind turbines for the year 2014. Table 1 shows the results obtained by testing the trained algorithms on the different segments and on the full-year (2014) data of turbine R80721. Random forest (RF) performs exceptionally well in all three segments, as well as on the overall data without segmentation, when tested on data from the wind turbine it was trained on (R80721). It gives a mean absolute error (MAE) of 0.02 on segments 1 and 2, 0.06 on segment 3, and 0.05 on the data without segmentation.
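For reference, the MAE figures quoted above follow the usual definition: the mean of the absolute differences between actual and estimated values. A minimal sketch, with a hypothetical function name and made-up numbers rather than values from the study:

```python
def mean_absolute_error(actual, estimated):
    """Mean of |y_i - y_hat_i| over N paired observations."""
    if len(actual) != len(estimated) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, estimated)) / len(actual)

# Made-up values for illustration only.
actual = [1.0, 2.0, 3.0]
estimated = [1.5, 2.0, 2.0]
print(mean_absolute_error(actual, estimated))  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```

MAE is expressed in the same units as the signal being reconstructed, so errors from different segments are directly comparable only when the power signal is scaled in the same way in each.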
The higher error on the complete data shows that data segmentation can help in accurately reconstructing the power production of a wind turbine from other SCADA parameters. The trained models have also been evaluated on other wind turbines from the same wind farm, to check whether they learned the power variation and fluctuation pattern of the wind farm in general rather than the operation of a single wind turbine. Table 2 shows the performance of the three trained reconstruction models on different wind turbines in the same wind farm. Random forest (RF) still performs well in segment 1, because segment 1 does not have many variations and the power produced in this region is constant. Ideally, a wind turbine does not produce any power below cut-in wind speed, but this is not the case in practice, as the turbine takes power from the national grid to keep the blades rotating slowly in cold weather to prevent icing. Overall, CGPANN outperforms random forest and linear regression on all three wind turbines.

Segment 2 is the most important operational region, as the power produced varies widely with wind speed. The relationship between wind power and the other SCADA parameters is not linear in this region, which can be seen in Table 2, where linear regression has its highest error in segment 2. CGPANN has the lowest MAE, approximately 0.14, while random forest has an MAE of about 0.6, which indicates that a neural network has the property of transferability and is able to learn the power patterns of a wind farm. In Table 2, the error on all data without segmentation is low when tested on different wind turbines because segments 1 and 3 have low errors compared with segment 2. Table 3 depicts the size of data in each segment.
This emphasizes the importance of segmentation, as a model might show good performance on all data but perform poorly on segment 2, with its high variations, as shown in Table 2.

Table 3. Size of data in each segment (only the "All data" row is recoverable; column headers are missing). All data: 47387, 15797, 53560.

This paper presents a methodology to deal with the challenge of wind power estimation from low-correlation data and proposes a signal reconstruction model for wind power in case of SCADA sensor failure in a wind farm. The proposed model can be used to monitor the wind power signal continuously even when the power sensor fails and the signals of parameters that are highly correlated with wind power are not received either. Linear regression, random forest and CGP evolved ANN are used for real-time prediction of the electric power produced by a wind turbine. Data segmentation based on wind speed helps in the accurate estimation of wind power and emphasizes the importance of segment 2. Although random forest gives better test results for a specific wind turbine, CGPANN generalizes better when exposed to wind turbines other than the one it was trained on. Accurate and timely prediction of these important parameters can support important decisions for a wind farm. This work can be extended to cases where more than one sensor fails. In future, the reconstructed signal will be used to identify faults in the wind turbine so that an alarm can be generated before the actual failure.
[1] Mitigating the impacts of covert cyber attacks in smart grids via reconstruction of measurement data utilizing deep denoising autoencoders
[2] Different models for forecasting wind power generation: Case study
[3] Random forests
[4] Fault diagnosis in rotating machine using full spectrum of vibration and fuzzy logic
[5] Statistical Models: Theory and Practice
[6] Transferability of artificial neural networks for clinical document classification across hospitals: a case study on abnormality detection from radiology reports
[7] Breaking the stereotypical dogma of artificial neural networks with cartesian genetic programming
[8] Evolution of neural networks using cartesian genetic programming
[9] Short-horizon prediction of wind power: a data-driven approach
[10] Data validation and missing data reconstruction using self-organizing map for water treatment
[11] Normal behaviour models for wind turbine vibrations: comparison of neural networks and a stochastic approach
[12] A statistical learning framework for the intelligent imputation of offshore wind farm missing SCADA data
[13] Different approaches to SCADA data completion in water networks
[14] Double tensor-decomposition for SCADA data completion in water networks
[15] Data management for structural integrity assessment of offshore wind turbine support structures: data cleansing and missing data imputation
[16] Scikit-learn: machine learning in Python
[17] A novel condition monitoring method of wind turbines based on long short-term memory neural network
[18] Comparative analysis of neural network and regression based condition monitoring approaches for wind turbine fault detection
[19] Analytical techniques of SCADA data to assess operational wind turbine performance
[20] Machine learning methods for wind turbine condition monitoring: a review
[21] Multistep wind speed and wind power prediction based on a predictive deep belief network and an optimized random forest
[22] Direct quantile regression for nonparametric probabilistic forecasting of wind power generation
[23] Real-time condition monitoring and fault detection of components based on machine-learning reconstruction model