title: Long-term time-series pollution forecast using statistical and deep learning methods
authors: Nath, Pritthijit; Saha, Pratik; Middya, Asif Iqbal; Roy, Sarbani
date: 2021-04-03
journal: Neural Comput Appl
DOI: 10.1007/s00521-021-05901-2

Tackling air pollution has become of utmost importance over the last few decades. Different statistical as well as deep learning methods have been proposed till now, but these have seldom been used to forecast future long-term pollution trends. Forecasting long-term pollution trends is highly important for government bodies around the globe, as such forecasts help in the framing of efficient environmental policies. This paper presents a comparative study of various statistical and deep learning methods to forecast long-term pollution trends for the two most important categories of particulate matter (PM), namely PM2.5 and PM10. The study is based on Kolkata, a major city on the eastern side of India. The historical pollution data collected from government set-up monitoring stations in Kolkata are used to analyse the underlying patterns with the help of various time-series analysis techniques, which are then used to produce a forecast for the next two years using different statistical and deep learning methods. The findings reflect that statistical methods such as auto-regressive (AR), seasonal auto-regressive integrated moving average (SARIMA) and Holt-Winters outperform deep learning methods such as stacked, bi-directional, auto-encoder and convolution long short-term memory networks on the limited data available.

The problem of urban air pollution has become more and more serious due to rapid industrialization in recent times, badly affecting not only our physical health but also the environment around us. Research on pollution forecasting has thus become a key issue in environmental protection, helping to evaluate the steps necessary to curb pollution's long-term effects. Major cities around the world have set up automatic air quality monitoring stations that detect the levels of particulate matter (PM) such as PM2.5 [11] and PM10 [11] in specific areas spread throughout the city. Air quality forecasting methods proposed till now can be broadly classified into two main categories, namely statistical methods and deep learning methods. The performance of each method depends on multiple factors such as trend, seasonality and noise in the data, as well as meteorological and socio-economic trends [12], which play an equally important role in contributing to pollution in a specific region. Ong et al. [13] proposed a two-stage approach where a recurrent neural network (RNN) is pre-trained on hourly data using an auto-encoder-based model, followed by fine-tuning on the sensor data. The resulting network was then used to predict PM2.5 concentrations. Bashir Shaban et al. [14] studied the forecasting performance of support vector machines (SVM), artificial neural networks (ANN) and model trees (M5P), using both univariate and multivariate modelling to produce hourly forecasts. Tao et al. [15] worked on air pollution forecasting using one-dimensional convolution neural networks (CNN) and bi-directional gated recurrent unit (GRU) networks based on the Beijing PM2.5 dataset [16], which consists of hourly data extracted from different air quality monitoring stations in Beijing. Mlakar et al.
[17] explored the important task of feature selection in a model and put forward several algorithms for feature reduction, all based on the case of forecasting SO2 half an hour in advance. Li et al. [18] proposed a multivariate CNN-LSTM model. The authors performed a thorough comparison with other hybrid long short-term memory (LSTM) models based on root mean squared error (RMSE) and mean absolute error (MAE) along with training times, and concluded that the hybrid CNN-LSTM model is more effective than the others. Wang et al. [19] explored the use of the seasonal auto-regressive integrated moving average (SARIMA) model, along with studies on the periodicity of monthly PM2.5 data and procedures for parameter estimation and diagnostic checking, to predict and forecast air pollutants effectively. Other related studies in this area are summarized in Table 1. All of these works are important and have been found to be extremely effective in carrying out short-term predictions of pollution levels in a city. However, these works do not consider the use of such methods over prediction horizons spanning more than a year. To tackle the problem of long-term forecasting, a different approach had to be adopted, involving the use of monthly data for predicting long-term trends. Daily data from monitoring stations are resampled into monthly data in our study, as it was observed that long-term yearly forecasts performed on the daily data converged to the statistical mean, making the results ineffective. This paper presents a comparative study of long-term pollution forecasts using four of the best statistical methods, namely auto-regressive (AR), seasonal auto-regressive integrated moving average (SARIMA), Holt-Winters and Prophet, along with four of the best deep learning methods, namely stacked, bi-directional, auto-encoder and convolution LSTMs. The study is based on historical pollution data extracted from government set-up monitoring station(s) of the city of Kolkata (India). Here, the overall end-to-end approach for long-term forecasting of pollution levels can be viewed as a combination of three main stages, namely data pre-processing, time-series analysis (based upon the pre-processed historical data) and data modelling (using various statistical and deep learning models to predict PM2.5 and PM10 values in the future). Unlike previous studies [20-22], this study aims at finding the optimal combination of techniques for pre-processing, time-series analysis and finally forecasting, so that statutory bodies focussing on producing similar projections for their city can take advantage of the proposed approach to construct their forecasting infrastructure. After pre-processing (e.g. missing value imputation), an in-depth time-series analysis of both PM2.5 and PM10 is conducted to find the major trends in the data. The results thus obtained are used to ascertain the hyper-parameters of the predictive models, which are further tuned using a popular hyper-parameter search process called Grid-Search. Different performance metrics (namely RMSE and MAE) are utilized to analyse the performance of the various models. The two-year forecast produced by the different predictive models is then studied in detail, and domain-specific discussions are presented based on the projections made.
The main aims/objectives of this comparative study are:

- To evaluate a set of methods and find the optimal ones for all the stages ranging from data pre-processing to data modelling, in performing long-term forecasting of PM2.5 and PM10 time-series data.
- To perform the imputation of missing values in the raw data by using different imputation techniques like multivariate imputation and mean before after [23].
- To conduct a comprehensive time-series analysis of both PM2.5 and PM10 for analysing underlying trends.
- To carry out the evaluation and forecasts of the various models using the walk forward approach (WFA), allowing the results to be more accurate and closer to real-world scenarios.

The rest of the paper is organized as follows: In Sect. 2, a brief background of the city and the pollutants that are part of the study is presented, along with a description and summary statistics of the data obtained from multiple sources. Section 3 consists of a detailed description of the techniques used in missing value imputation and time-series analysis, and of the models used in the study. Section 4 covers the approaches used in data preparation, time-series analysis, model training, evaluation and future forecasts. The results are laid out in detail in Sect. 5, along with a discussion of the quality of the forecasts and the efficiency of the models in learning the underlying trends. The conclusions drawn from the findings are presented in Sect. 6.

Located on the eastern side of India, Kolkata is the capital city of the state of West Bengal. As per the 2011 Census, around 14 million people reside in the city, making it one of the major cities in the world. Owing to high socio-economic activity, the air quality of Kolkata is sub-par, with significantly elevated levels of particulate matter and toxic gases in the city atmosphere. Besides the usual contribution of industries, transportation is also one of the major air-polluting sectors due to ineffective control measures and the high abundance of poorly maintained vehicles plying in the city [24]. This section deals mainly with the major pollutants and the statistical description of the pollution data used in the study. Particulate matter (PM) is a mixture of coarse, fine and ultra-fine solid and liquid particles suspended in the air. PM2.5 refers to particulate matter with a diameter less than 2.5 μm; as a result, it remains suspended in the air for longer periods. It is mostly produced from burning fuels, forest fires, volcanic eruptions, etc. Exposure to PM2.5 can lead to multiple short-term and long-term health issues. Prolonged exposure may result in permanent respiratory problems such as asthma, chronic bronchitis and heart disease. PM10 comprises those solid and liquid particles with a diameter greater than 2.5 μm and less than 10 μm; hence, they persist in the air for less time than PM2.5. These particles consist of smoke and dust from industries, roads and other places. Soil and rocks, when crushed, create such particles, which then get blown away by the wind. Being heavier than PM2.5, they cannot penetrate as deep into the lungs and are hence less risky than PM2.5; however, they are responsible for lung injury and can cause ailments like chronic obstructive pulmonary disease (COPD) [25]. The pollution data of Kolkata were provided by the Central Pollution Control Board (CPCB) [26], which is responsible for providing field information regarding pollution at various places throughout the country.
In this paper, the pollution data which form the basis of the study were extracted from the station positioned at Victoria Memorial Hall (22.5448° N, 88.3426° E), supplemented with data procured from other nearby stations. Preliminary analysis found the data to be daily in nature, spanning four years from 10th January 2016 to 18th February 2020. The air quality monitoring station at Victoria Memorial Hall supplied values for temperature, relative humidity, PM2.5 and PM10. Of the values supplied, large chunks of the raw data were found to be missing due to external factors such as hardware failure, maintenance operations, etc. Hence, these values needed to be either obtained from other external sources or internally imputed using various techniques. Temperature values originally absent were extracted from the University of Dayton's temperature archive [27]. Relative humidity values were web scraped from Weather Underground [28], followed by further re-sampling to get the required daily values. Missing daily PM2.5 values were extracted from the US Department of State's AirNow [29] web portal. The descriptive statistics for PM2.5, PM10, temperature and relative humidity can be seen in Table 2. They show that during winter, the PM2.5 levels rise significantly: low wind speeds, together with lower temperatures, create conditions for temperature inversion. On the other hand, during summer and the monsoon, comparatively lower levels of pollution are observed. This can be attributed to the increased circulation of air in the troposphere as well as the squalls from the north-west which the city experiences during the months just before the onset of the monsoon. As the study concerns a general approach for finding an optimal combination of techniques for pre-processing, missing value imputation and finally forecasting, only the best methods (as shown in Fig. 1) in statistical and deep learning-based modelling are studied in depth. The time-series analysis methods mentioned in Fig. 1 are specifically curated to help in investigating the underlying patterns and trends present in the data. In this section, all those techniques are presented in detail along with a brief discussion of the theory behind them. Here, two widely used missing value imputation methods, namely mean before after and multivariate imputation, are discussed. The mean before after method replaces the missing value at time instant $i$ by the mean of the value at the next time instant $i+1$ and the value at the previous time instant $i-1$. Norazian et al. [23] showed that the mean before after method of imputation gave the least error when compared to other univariate imputation methods. However, it must be noted that the mean before after method works best only when there are non-null values present in the window being considered; if a large number of null values is present in the window, its accuracy degrades. The method of multivariate imputation on electronic computing devices was proposed by Buck [30]. If $m$ out of $n$ rows have the complete set of $k$ observations for all the features, we can consider a matrix $X$ containing all the $n$ rows, with those $m$ complete rows placed first. From the matrix $X$, we can obtain $k$ equations of the form

$\hat{x}_{rj} = f_j(x_{r1}, \ldots, x_{r,j-1}, x_{r,j+1}, \ldots, x_{rk}), \quad r = 1, 2, 3, \ldots, m$

where $f_j$ denotes the fitted regression function and $\hat{x}_{rj}$ is the expected value obtained by forming a multiple regression of variable $j$ on the other $k-1$ variables for the $r$-th row. By replacing $r$ with the row $i$ containing the missing entry, we can estimate the value of the missing variable. By extending this idea from univariate imputation, multivariate imputation can be performed by calculating the multiple regression formula for each missing variate on the $k-v$ other variates. For any combination of $v$ missing variates, the corresponding regression equations have to be calculated; the missing value can then be estimated by selecting the appropriate equation and solving it.
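To make the two imputation steps concrete, the sketch below shows one possible Python implementation, assuming a pandas DataFrame `df` holding the daily PM2.5, PM10, temperature and relative humidity series; the column names and the use of scikit-learn's IterativeImputer as a stand-in for Buck's regression-based scheme are illustrative choices of ours, not taken from the paper's code.

```python
# Hedged sketch of the two imputation techniques; column names and the
# regression-based imputer are illustrative assumptions.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


def mean_before_after(series: pd.Series) -> pd.Series:
    """Replace the missing value at time i with the mean of the values
    at i - 1 and i + 1 (works only when both neighbours are present)."""
    neighbours = (series.shift(1) + series.shift(-1)) / 2
    return series.fillna(neighbours)


def multivariate_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Estimate each missing entry by regressing its variable on the
    other columns, in the spirit of Buck's method."""
    imputer = IterativeImputer(max_iter=10, random_state=0)
    return pd.DataFrame(imputer.fit_transform(df),
                        index=df.index, columns=df.columns)


# One possible sequencing, mirroring the paper's description: the
# regression-based step first, then mean before after as a mop-up pass
# (harmless if the first step leaves nothing behind).
df_imp = multivariate_impute(df[["PM2.5", "PM10", "Temp", "RH"]])
df_imp["PM10"] = mean_before_after(df_imp["PM10"])
```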
In this subsection, different methods for time-series analysis are discussed. The Hodrick-Prescott [31] (HP) filter is a mathematical tool used to remove the cyclical component of a time series from raw data. It is used to obtain a smoothed-curve representation of a time series, from which the long-term trend can be better observed compared to short-term variations. The sensitivity of the trend to short-term variations can be adjusted by modifying a multiplier $\lambda$; the greater the value of $\lambda$, the closer the trend path is to a straight line. Given a time series $y_t = s_t + c_t + \epsilon_t$, where $s_t$ is the trend component, $c_t$ is the cyclical component and $\epsilon_t$ is the error component, for an adequately chosen $\lambda$ there exists a trend component which solves

$\min_{s_t} \left[ \sum_{t=1}^{T} (y_t - s_t)^2 + \lambda \sum_{t=2}^{T-1} \big((s_{t+1} - s_t) - (s_t - s_{t-1})\big)^2 \right]$

In time-series analysis, the simple moving average (SMA) is an arithmetic method which involves finding the unweighted mean of the last $n$ periods of data:

$p_{SM} = \frac{p_M + p_{M-1} + \cdots + p_{M-(n-1)}}{n}$

Upon calculation of successive values, the oldest value can be left out and the newest included, so that the average can be updated as

$p_{SM,\text{new}} = p_{SM,\text{old}} + \frac{p_{M+1} - p_{M-n+1}}{n}$

where $p_{SM}$ denotes the simple moving average at that instant, while $p_M, \ldots, p_{M-(n-1)}$ are the observations over the previous $n$ periods of time. Decomposition is the statistical approach of breaking down a time series into its trend, seasonal, cyclical and irregular components. A time series following an additive model can be written as

$y_t = T_t + C_t + S_t + \epsilon_t$

whereas a multiplicative model is expressed as

$y_t = T_t \times C_t \times S_t \times \epsilon_t$

where $T_t$, $C_t$, $S_t$ and $\epsilon_t$ are the trend, cyclical, seasonal and irregular (noise) components, respectively. To find the trend component $T_t$ of a time series with frequency $f$, a convolution filter with a linear kernel whose elements are all equal to $1/f$ is applied. After removing the $T_t$ component, the seasonal component is found by averaging the de-trended series over each period. The autocorrelation function proposed by Box and Jenkins [32] can be used to detect non-randomness in the data and also to identify the parameters of appropriate time-series models. Given measurements $y_1, y_2, \ldots, y_N$ at times $t_1, t_2, \ldots, t_N$, the lag-$k$ autocorrelation function is defined as

$r_k = \frac{\sum_{t=1}^{N-k} (y_t - \bar{y})(y_{t+k} - \bar{y})}{\sum_{t=1}^{N} (y_t - \bar{y})^2}$

An augmented Dickey-Fuller (ADF) test [33] uses the null hypothesis that a unit root is present in a time-series sample; the Dickey-Fuller test checks whether a time-series sample is a random walk or not. Stationarity [34] refers to time-series data being devoid of any trend or seasonal effects, which makes the data easier to model, as summary statistics such as mean and variance tend to stay the same with respect to time. The test is based on a regression of the form

$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \cdots + \delta_{p-1} \Delta y_{t-p+1} + \epsilon_t$

If $\gamma = 0$, we have a random walk process; if not, the data follow a stationary process. The augmented Dickey-Fuller test is an extension of the Dickey-Fuller test, allowing for higher-order autoregressive terms $\Delta y_{t-p}$, where $1 \le p$. The null hypothesis is that the data are non-stationary. We intend to reject the null hypothesis for this test, so we want a $p$ value $\le 0.05$.
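These analysis tools map directly onto statsmodels; the sketch below is a minimal illustration, assuming `daily` is a date-indexed pandas Series of PM2.5 values, with the resampling frequencies and the ADF-based differencing loop reflecting the procedure described later in the paper.

```python
# Minimal statsmodels sketch of the analysis toolkit; `daily` is assumed
# to be a date-indexed pandas Series of PM2.5 values.
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller

quarterly = daily.resample("Q").mean()
monthly = daily.resample("M").mean()

# Hodrick-Prescott filter on the quarterly resample (lambda = 1600).
cycle, trend = sm.tsa.filters.hpfilter(quarterly, lamb=1600)

# Simple moving averages over 1-week and 1-month windows.
sma_week, sma_month = daily.rolling(7).mean(), daily.rolling(30).mean()

# Additive decomposition of the monthly series (seasonality of 12).
decomp = seasonal_decompose(monthly, model="additive", period=12)

# Repeated ADF tests: difference until the p value drops below 0.05.
series, d = monthly.dropna(), 0
while adfuller(series)[1] > 0.05:
    series, d = series.diff().dropna(), d + 1
print(f"order of differencing needed: d = {d}")
```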
In this subsection, different statistical models for time-series forecasting are discussed. The Holt-Winters [35] method comprises four equations, namely the forecast equation and three smoothing equations. The additive component form of the method is shown in Eqs. (12)-(15):

$\hat{y}_{t+h|t} = \ell_t + h b_t + s_{t+h-m(k+1)}$ (12)
$\ell_t = \alpha (y_t - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1})$ (13)
$b_t = \beta^* (\ell_t - \ell_{t-1}) + (1 - \beta^*) b_{t-1}$ (14)
$s_t = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1 - \gamma) s_{t-m}$ (15)

where $\ell_t$, $b_t$ and $s_t$ stand for the level, trend and seasonal components, respectively, with corresponding smoothing factors $\alpha$, $\beta^*$ and $\gamma$. The seasonality is denoted by $m$, while $k$ is the integer part of the fraction $(h-1)/m$, which ensures that the estimates of the seasonal indices used for forecasting come from the last part of the sample. The multiplicative component form of Holt-Winters is:

$\hat{y}_{t+h|t} = (\ell_t + h b_t)\, s_{t+h-m(k+1)}$
$\ell_t = \alpha \frac{y_t}{s_{t-m}} + (1 - \alpha)(\ell_{t-1} + b_{t-1})$
$b_t = \beta^* (\ell_t - \ell_{t-1}) + (1 - \beta^*) b_{t-1}$
$s_t = \gamma \frac{y_t}{\ell_{t-1} + b_{t-1}} + (1 - \gamma) s_{t-m}$

Additive methods are used when the magnitude of the seasonal fluctuations does not vary with the level of the time series. On the other hand, multiplicative methods are used when the variation in the seasonality appears to be proportional to the level of the time series. An auto-regressive model takes the previous observations as input and predicts the value at the next time step. The model can be described in the form

$X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \epsilon_t$

where $\varphi_1, \varphi_2, \ldots, \varphi_p$ are the parameters of the model, $c$ is the constant term and $\epsilon_t$ is the noise term. $p$ is referred to as the order of the model, denoted by AR($p$). The coefficients of the AR model can be solved by the ordinary least-squares (OLS) method or by using the Yule-Walker [36] equations. An ARIMA model [32] consists of an auto-regressive (AR), an integrated (I) and a moving average (MA) component to better understand time-series data or to predict future values, and is denoted by ARIMA($p$, $d$, $q$), where $p$, $d$ and $q$ denote the time lags of the AR component, the degree of differencing and the order of the MA component, respectively. The seasonal ARIMA model is an extension of the ARIMA model, with additional seasonal AR, I and MA terms as well as a periodicity term denoted by $m$. Prophet [37] is a forecasting procedure developed recently by Facebook. The model focuses on providing fast and accurate forecasts that can later be tuned manually. It is based on an additive model in which nonlinear trends are fit with yearly, weekly and daily seasonality, along with holiday effects. It works best with time series where seasonal effects are pronounced and the historical data span several seasons. Here, different deep learning-based forecasting methods, namely stacked LSTM, LSTM auto-encoder, bi-directional LSTM and convolution LSTM, are presented. Long short-term memory (LSTM) [38] networks are a special kind of recurrent neural network (RNN) designed to remember information over longer periods. They are explicitly designed to counter the vanishing and exploding gradient problems, which heavily affect plain RNNs. LSTMs have four interacting layers in their repeating module, compared to one in RNNs. The layers of an LSTM network can be expressed mathematically as shown in Eqs. (22)-(27):

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$ (22)
$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$ (23)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$ (24)
$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$ (25)
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ (26)
$h_t = o_t \odot \tanh(c_t)$ (27)

where $f_t$, $i_t$, $o_t$ and $\tilde{c}_t$ are the activation vectors of the forget gate, input gate, output gate and cell input gate, respectively. $c_t$ and $h_t$ are the cell state and hidden state vectors, and $x_t$ is the input vector. Matrices of the form $W_q$ and $U_q$ contain the weights of the input and recurrent connections, respectively, and $b_q$ are the bias vectors; $\sigma$ denotes the sigmoid function, $\tanh$ the hyperbolic tangent function and $\odot$ element-wise multiplication. Stacked LSTM is an extension of a vanilla LSTM network in which LSTM layers are stacked on top of each other. This helps to increase model complexity: if the input to a layer is already the output of an LSTM layer, the current LSTM layer can create a more complex feature representation of its input.
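As an illustration, stacking LSTM layers in Keras only requires that every layer except the last return its full output sequence; the unit counts and window length below are placeholders of ours, not the paper's tuned architecture (which is described in Sect. 4).

```python
# Minimal Keras sketch of a stacked LSTM; units and window length are
# illustrative placeholders, not the paper's tuned configuration.
import tensorflow as tf

window = 12  # months of history fed to the network per sample (assumed)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, 1)),
    tf.keras.layers.LSTM(50, return_sequences=True),  # emit full sequence
    tf.keras.layers.LSTM(50),                         # last layer: vector only
    tf.keras.layers.Dense(1),                         # one-step-ahead forecast
])
model.compile(optimizer="adam", loss="mse")
```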
An auto-encoder [39] is a type of artificial neural network used to learn the features of input data in an unsupervised manner. An auto-encoder aims to learn an encoding for a set of data by training the model to ignore noise. After the reduction is completed, reconstruction is undertaken, in which the model learns to generate an output as close as possible to the original input from the encoding produced by the reduction side. It consists of an encoder and a decoder, which can be expressed mathematically as shown in Eqs. (28)-(30):

$\phi : X \rightarrow F$ (28)
$\psi : F \rightarrow X$ (29)
$(\phi, \psi) = \arg\min_{\phi, \psi} \| X - (\psi \circ \phi) X \|^2$ (30)

where $\phi$ and $\psi$ denote the encoder and decoder components, respectively, $X$ is the input and $F$ denotes the feature space generated by the mapping. The LSTM auto-encoder is a type of neural network in which an LSTM architecture is used in the encoder and decoder components to work on data arranged in sequences. Bi-directional LSTM is an extension of the vanilla LSTM network and is the LSTM implementation of bi-directional recurrent neural networks [40], in which two hidden layers of opposite directions are connected to the same output. Due to the added connection, the output layer can benefit from both the past (backward) and future (forward) states simultaneously. In a bi-directional layer, the neurons are split into positive and negative directions, corresponding to the forward and backward states, respectively. However, it must be noted that the output of the two states is not connected to the input of the opposite direction's state. Convolution neural networks (CNN) [41] are a type of neural network whose layers employ a special mathematical operation called convolution instead of general matrix multiplication. Mainly used for analysing visual imagery, CNNs also have wide application in time-series analysis. Unlike multi-layer perceptrons (MLP) [42], which are prone to overfitting due to the presence of fully connected layers, CNNs are regularized by taking advantage of the hierarchical pattern in data, assembling more complex patterns from smaller and simpler ones. As a convolution layer serves well for capturing spatial features, LSTM layers are used to detect correlations over time. However, by simply stacking these kinds of layers, the correlation between space and time features may not be captured properly. Shi et al. [43] proposed a network structure able to capture spatio-temporal correlations. In the convolution LSTM approach, convolutions are used directly as part of reading the input into the LSTM units. It is to be noted that air quality data (in our case, PM2.5 and PM10) in different locations vary depending on the degree of industrialization, population density, traffic density, topographical characteristics, etc. [44-46], and all these factors play an important role in the performance of any time-series forecasting method. The existing literature [46-49] confirms that there is no single best method that performs well in every forecasting situation. Hence, a model built on the historical PM2.5 or PM10 data of a particular location may not provide similar accuracy for other locations. For this reason, selecting a single forecasting method as the proposed approach may not be realistic; thus, a set of methods has to be considered, so that the best method can be selected based on performance on location-specific data. The general overview of the approach undertaken in this study is shown in Fig. 2.
Missing value imputation is done on the raw data to prepare it for further processing. Time-series analysis is then performed on the imputed data to understand and extract the underlying patterns. The data are then modelled using the various statistical and deep learning methods mentioned in Fig. 1. In order to apply the statistical and deep learning models for long-term forecasting of PM2.5 and PM10 values of Kolkata, instead of directly following any existing implementation, a problem-specific version of each model is developed. The models thus created are then trained on the entire dataset to produce the next two-year forecast, which forms the basis for the discussion presented in the latter part of this paper. After extraction of PM2.5, temperature and relative humidity from external sources, the PM10 data are found to contain missing values, which are imputed internally using the multivariate imputation and mean before after methods. In contrast to the existing methods [44, 50, 51], where univariate imputations are popularly used for missing value imputation in univariate time-series forecasting, in this work a combination of univariate and multivariate approaches is utilized to improve the imputation. Pearson correlation is used to measure the degree of association between two series, and thereby to decide which variables should drive the multivariate imputation:

$\rho(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$ (31)

where $x_i$ and $y_i$ refer to the $i$-th samples of time series $X$ and $Y$, while $\bar{x}$ and $\bar{y}$ refer to the means of all the samples in $X$ and $Y$, respectively.
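Eq. (31) translates directly into a few lines of NumPy; the helper below is an illustrative check of ours and gives the same value as numpy.corrcoef.

```python
# Direct implementation of Eq. (31); equivalent to np.corrcoef(x, y)[0, 1].
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    xd, yd = x - x.mean(), y - y.mean()
    return float((xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum()))
```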
On completion of missing value imputation, time-series analysis is performed on the data to understand the underlying patterns present. The HP filter [31] is applied to the daily data to bring out the long-term trends. Although different multiplier values ($\lambda$) of the HP filter corresponding to different frequencies have been suggested by Ravn and Uhlig [52], due to inadequate data in the case of an annual resample and the controversy regarding the $\lambda$ value for a monthly resample [52, 53], in this work the data are resampled quarterly and the suggested value of $\lambda = 1600$ is used as the multiplier. Next, simple moving averages are plotted for windows spanning one week and one month. The daily data are then resampled into monthly data and time-series decomposition is performed, to get a better understanding of the trend and seasonal components present. The ADF test is performed to determine the stationarity of the time-series data. Many important statistical models require the data to be stationary for complexity reduction and effective analysis [54]. In this study, both the PM2.5 and PM10 time series are non-stationary in nature, as they show both trend and seasonal patterns; hence, in order to model them effectively, the time series need to be made stationary. Using repeated ADF tests, the number of lags or difference components is found based on the $p$ value, which is then used to turn the non-stationary time series into stationary ones for further modelling and analysis. The autocorrelation [32] values are computed on the time-series data and plotted to mathematically determine the seasonality based on the statistically significant spikes present. The monthly data are then trained using four statistical methods and four deep learning methods. In the statistical approach, we use AR, Holt-Winters [35], SARIMA [32] and Prophet [37] to carry out the model fitting and forecasting. Figure 3 shows the model architecture diagrams for all four variations of LSTM. In the case of the LSTM auto-encoder-based model architecture described in Fig. 3a, two LSTM layers l1 and l2 consisting of 100 and 50 units, respectively, serve as the encoder, while layers l3 and l4 consisting of 50 and 100 units, respectively, serve as the decoder, with a repeat vector layer in between. A dense layer d1 consisting of 1 unit is attached to the encoder-decoder architecture to produce the desired output. As shown in Fig. 3b, the bi-directional LSTM-based model architecture consists of two bi-directional LSTM layers b1 and b2 consisting of 200 units each, followed by a 100-unit LSTM layer l1 and a dense layer d1 to model the time series. For the convolution LSTM [43]-based model architecture described in Fig. 3c, a 1 × 10 × 64 ConvLSTM2D layer is used, whose output is flattened and fed to an LSTM layer l1 consisting of 100 units. The output of the LSTM layer is then provided as input to a 1-unit dense layer d1 to get the final prediction value as the required output. In the case of the stacked LSTM-based model architecture shown in Fig. 3d, a 1-unit dense layer d1 following $n = 8$ pairs of LSTM and dropout regularization layers $ld_i$ ($1 \le i \le n$), consisting of 50 units and a 0.5 dropout rate, respectively, is utilized for modelling purposes. Since the data used in this study do not possess a spatial component, the input shape is adjusted accordingly when the data are passed to the convolution LSTM model for training. All deep learning-based models are made to undergo 50 runs, to get a better understanding of the variance introduced by the random initialization of weights in the training process. The training procedure on the monthly data is the same for all models, except for the use of min-max scaling to normalize the data before passing it into any deep learning-based model. A train/test split is performed in which the last year is made the test set (≈ 25%) and the remaining part the training set (≈ 75%). Hyper-parameter optimization is an important part of model building. As finding the set of optimal parameters is a tedious process, manually trying random combinations takes a lot of time. To counter this, a parameter sweep (aka Grid-Search) can be run in parallel over the different parameter sets, reducing the time required in comparison with simple manual searching. Not only is this process faster, it is also more thorough, as all parameter sets are tested, compared to the few random sets that would be tried manually. However, it must be noted that within each of the parallel runs corresponding to a parameter set chosen from the search space, the required steps are executed sequentially. In this study, a detailed model-specific search space is taken and Grid-Search is performed on it. In the case of statistical models, different combinations of lags and $p$, $d$, $q$ (wherever applicable) are considered, whereas in the case of neural networks, various combinations of the number of epochs, batch size, learning rate and optimizer are considered. The hyper-parameters are assessed on a validation set carved out of the training set. The best combination thus found is finally evaluated on the test set created earlier.
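A hedged sketch of such a Grid-Search for the SARIMA family is shown below; the search ranges, the 12-month validation split and the RMSE criterion are illustrative assumptions, not the paper's exact search space.

```python
# Illustrative Grid-Search over SARIMA orders; ranges and the validation
# split are assumptions, not the paper's exact configuration.
import itertools
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

train, valid = monthly.iloc[:-12], monthly.iloc[-12:]
best_order, best_rmse = None, np.inf
for p, d, q, P, D, Q in itertools.product(range(2), repeat=6):
    try:
        fit = SARIMAX(train, order=(p, d, q),
                      seasonal_order=(P, D, Q, 12)).fit(disp=False)
    except Exception:
        continue  # skip parameter sets that fail to converge
    pred = fit.forecast(steps=len(valid))
    rmse = float(np.sqrt(np.mean((pred.values - valid.values) ** 2)))
    if rmse < best_rmse:
        best_order, best_rmse = (p, d, q, P, D, Q), rmse
print("best SARIMA order:", best_order, "validation RMSE:", best_rmse)
```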
After the evaluation is completed, the entire monthly dataset is used to train the model, so as to produce the pollution forecast for the next two years. The walk forward approach (WFA) [55, 56] is used in both evaluation and forecasting. In WFA, a window spanning a particular time period at the beginning of the series is first taken and used to train and optimize the model. A segment consisting of the data immediately after the end of the window is used to validate the model. The window is then rolled forward and the process repeated until the end of the training data is reached. The model is thus constantly retrained as new data become available, unlike other common approaches in which training happens only once on the historical data already present. As the real-world performance of the models is one of the key points of this study, WFA produces a more realistic outcome, especially for time-series data, where information is constantly added over time. In this study, a period of 12 months is taken as the window length for WFA. As the window is rolled forward, the immediately following sample is added while the oldest sample is dropped, and the rolling continues until the end is reached. The performance on the test set periods denotes the out-of-sample performance of the models and is discussed in detail in the results section of this paper.
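The evaluation loop below sketches this rolling scheme, assuming `monthly` is the series and `fit_and_forecast` is a hypothetical wrapper that refits any of the models above on a window and returns a one-step-ahead prediction.

```python
# Walk-forward evaluation with a rolling 12-month window; fit_and_forecast
# is a hypothetical helper wrapping any of the models described above.
window = 12
predictions, actuals = [], []
for i in range(window, len(monthly)):
    history = monthly.iloc[i - window:i]           # roll the window forward
    predictions.append(fit_and_forecast(history))  # one-step-ahead forecast
    actuals.append(monthly.iloc[i])
```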
In this section, the findings of this comparative study involving PM2.5 and PM10 are presented, along with a brief discussion of future trends as projected by the models. The test bench used to carry out the study involves a 6C/12T Ryzen 5 3600 CPU clocked at 3.6 GHz, coupled with 16 GB of 3000 MHz DDR4 RAM and a 1 TB NVMe SSD, for carrying out the mathematical computations. For deep learning purposes, an Nvidia RTX 2070 Super GPU is also used as a hardware accelerator to speed up matrix-related calculations. The code for this study was written in Python [57], owing to the availability of high-quality libraries like NumPy [58], TensorFlow [59], statsmodels [60] and scikit-learn [61], which help decrease the overall complexity of the code without compromising efficiency and performance. From the Pearson correlation heatmap shown in Fig. 4a, PM10 shows a very strong correlation of 0.82 with PM2.5, compared to temperature and relative humidity. This allowed the imputation of PM10 to be based upon PM2.5 when using the multivariate imputation method, as mentioned before. As observed in Fig. 4, the change in the correlation values of PM2.5 and PM10 between the two heatmaps is within 0.1, indicating that the underlying patterns of the data were kept intact and preserved. The few missing values which remained in the data were imputed using the mean before after method. From the descriptive statistics presented in Table 2 and the daily time-series plot in Fig. 5a, it can be seen that the PM2.5 values are higher in the winter months of December, January and February than in the monsoon months of June and July. It can also be seen visually, in the simple moving average and monthly plots presented in Fig. 6b, c, respectively, that the peak levels of PM2.5 are on a decreasing trend. This is mathematically confirmed by the dotted line in the Hodrick-Prescott [31] plot in Fig. 6a and the decomposition trend plot of the monthly data in Fig. 7a. Just like PM2.5, the PM10 values are higher in the winter months compared to the monsoon months. This is clearly evident in the descriptive statistics presented in Table 2 as well as in the daily time-series, simple moving average and monthly plots in Fig. 5b-f. However, unlike PM2.5, the PM10 values show an increasing overall trend, as can be seen from the HP filter [31] plot in Fig. 6d and the decomposition trend plot in Fig. 7b. In Fig. 7a, b, the direction of the trend line after 2019 is decreasing, indicating that the pollution levels of both PM2.5 and PM10 declined in 2019 compared to previous years. One interesting observation from Figs. 6a, d and 4 is that the trends of PM2.5 and PM10 are opposite in nature even though the Pearson correlation coefficient $\rho = 0.88$ is highly positive. Although the results of the trend and correlation plots seem to contradict each other, there is a common misconception [62] regarding the interpretation of correlation and trends: a high positive correlation is possible between two time series even when their trends [calculated using Eq. (5)] are opposite in nature [62]. In Fig. 8, a step-by-step calculation of the Pearson correlation between the PM2.5 and PM10 time series is provided, using the same data as Fig. 6a, d, to validate the claim that there is indeed a strong positive correlation between PM2.5 and PM10 even though their trends are opposite. The augmented Dickey-Fuller test [33], when performed on the monthly data, gave $p$ values of 0.995 and 0.647 for PM2.5 and PM10, respectively. As these $p$ values are well above 0.05, the null hypothesis cannot be rejected, and the data are considered to be non-stationary in nature and to possess a unit root. The ADF test performed on the non-stationary time-series data differenced by one period gave a $p$ value less than 0.05, thereby rejecting the null hypothesis and making it clear that a difference component needs to be present in the statistical models trained on the data. The seasonal plots of the monthly data of both PM2.5 and PM10 in Fig. 7c, d indicate a seasonality of 12 time periods, as the underlying pattern of the line plots repeats every year (12 months). This is also confirmed by the lag having the highest positive correlation (i.e. lag 12) in the set of positive lags following the first set of negatively correlated lags in Fig. 9. The autocorrelation plots also show a gradual decrease to zero rather than a sharp decline, visually confirming that the time-series data are non-stationary. It is also to be noted that a seasonality of 12 months is not unusual in Kolkata [63, 64]. For instance, every year the concentration of particulate matter during winter (Nov-Feb) is higher compared to other seasons because of the longer residence time of particulate matter in the atmosphere during winter, due to low winds and low mixing height [64]. The hyper-parameters are obtained using Grid-Search on a search space defined based on the time-series analysis performed before. Lags of 2 and 9 are found to give the least RMSE for AR in the case of PM2.5 and PM10, respectively. SARIMA (1, 0, 0)(1, 0, 1, 12) for PM2.5 and SARIMA (1, 0, 0)(0, 1, 1, 12) for PM10 are found to show the best performance in terms of RMSE out of all the SARIMA models. A multiplicative trend with a seasonality of 12 is used for Holt-Winters, whereas default parameters are used for Prophet in the case of both PM2.5 and PM10.
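As an example of the final forecasting step, the selected Holt-Winters configuration can be fitted on the full monthly series with statsmodels as sketched below; treating the seasonal component as multiplicative alongside the multiplicative trend is an assumption of this sketch.

```python
# Fitting the selected Holt-Winters configuration on the full monthly
# series and producing the 24-month (two-year) forecast; the multiplicative
# seasonal component is an assumption of this sketch.
from statsmodels.tsa.holtwinters import ExponentialSmoothing

hw = ExponentialSmoothing(monthly, trend="mul", seasonal="mul",
                          seasonal_periods=12).fit()
forecast = hw.forecast(24)  # one value per month for the next two years
```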
In the case of the deep learning models, all models are trained for a minimum of 100 epochs with batch sizes ranging from 1 to 64, based on the hyper-parameters found by Grid-Search, using a constant seed for reproducibility. Measurement of model performance is based on the root mean squared error (RMSE) and mean absolute error (MAE), in comparison with the test set mean:

$\text{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} (\hat{y}_t - y_t)^2}$
$\text{MAE} = \frac{1}{T} \sum_{t=1}^{T} |\hat{y}_t - y_t|$

where $\hat{y}_t$ is the prediction made by the model and $y_t$ is the actual value at instant $t$; $T$ denotes the number of time-series samples. The effect of each error on the RMSE is proportional to the size of the squared error; thus, larger errors have a disproportionately large effect on the RMSE. Due to the comparatively longer training time of the deep learning models, two epochs were used to retrain the models during each WFA cycle, in both evaluation and forecasting. As can be seen in Tables 3 and 4 (where bold values indicate the best performing models with respect to the metrics mentioned), out of the four statistical models and four deep learning models used to fit the data, Holt-Winters gave the overall best RMSE and MAE score combination, while among the deep learning models convolution LSTM [43] and stacked LSTM gave the best results in terms of RMSE and MAE, with respect to test set means of 54.36 and 101.41 for PM2.5 and PM10, respectively. The actual-vs-predicted correlation plots in Figs. 10 and 11 show that Holt-Winters (Fig. 10b, f) has the best model performance compared to the others. One interesting observation from Fig. 11 is that the values predicted by the deep learning models are relatively more scattered than those of their statistical counterparts in Fig. 10, which is further reflected in their RMSE and MAE values. The forecast plots of PM2.5 and PM10 are shown in Figs. 12 and 13, respectively, for the different statistical and deep learning models, with the shaded portion representing the forecast region. On analysing the nature of the forecasts produced by the different models in Figs. 12 and 13, AR, stacked LSTM, bi-directional LSTM and LSTM auto-encoder (Fig. 12a, c-g) showed a tendency of converging to the mean in the long term for PM2.5. Although Prophet (Fig. 12d) was able to pick up the trend and the seasonal components clearly, the forecast produced became negative in the period between 2021 and 2022. The decrease in PM2.5 levels over the years as forecast by SARIMA (Fig. 12c) was relatively small compared to the Holt-Winters and convolution LSTM models (Fig. 12b, h). The behaviour, however, was somewhat different for PM10, where none of the models showed any explicit tendency of converging to the mean. As with PM2.5, the Holt-Winters, SARIMA and convolution LSTM models (Fig. 13b, c, h) were able to accurately extract the trend. In the case of AR, Holt-Winters, bi-directional and auto-encoder LSTM (Fig. 13a, b, f, g), a decreasing trend could be seen in the future years. SARIMA (Fig. 13c) showed a constant forecast, while Prophet and stacked LSTM (Fig. 13d, e) produced forecasts following an increasing overall trend. Out of all the deep learning models that were part of the study, all except convolution LSTM showed a forecast that was decreasing in nature.
Convolution LSTM, as can be seen in Fig. 13h, produced a forecast with an increasing trend, just like Prophet. From the performance metrics in Tables 3 and 4, the statistical methods performed better than the deep learning methods. This performance difference can be attributed to the quantity of data available: as monthly data are used in this approach, the quantity of data will be limited in all practical situations, and hence statistical methods can be expected to give better results. The decrease in PM2.5 pollution levels and the concave downward trend in PM10 levels indicated by the forecasts can be a good indication of the recent measures taken by the Government of West Bengal and the Central Government to bring down pollution levels in Kolkata. However, some forecasts showing a positive upward trend are still cause for alarm, as the present PM10 levels are already significantly higher than the safe limit of 20 μg/m³ prescribed by the WHO [65]. Even the PM2.5 levels are significantly higher than the global safe limit of 10 μg/m³. The government should continue adopting strict policies regarding environmental pollution, especially focussing on the large-scale industries that are the main contributors to PM10 levels. A complete ban on dumping sand, stone chips and other construction raw materials openly on roadsides should also be part of the action plan to curb pollution. Measures such as promoting the usage of electric vehicles or vehicles based on CNG or LPG, and a complete ban on the incineration of garbage in public places, can be part of an action plan set up by the government to curb PM2.5 levels in the city. This study undertook a quantitative approach to understanding the future trends of PM2.5 and PM10 based on historical pollution data extracted from various sources. The most widely used time-series modelling methods were put to the test to carry out long-term forecasts, and their efficiency was compared. Based on the limited data available, statistical methods, especially Holt-Winters, were able to outperform the deep learning methods. If the quantity of data available had been higher, or if the proposed approach were used to forecast the next few months using weekly resampled data, the deep learning models could be expected to perform relatively better. However, a certain shortcoming of this study is the absence of exogenous variables. Although methods like Holt-Winters and AR can model time-series data efficiently, they do not have the flexibility to account for exogenous variables. If exogenous variables were made part of this study, models like SARIMAX and LSTMs could be expected to give more accurate results. Even though the city considered in this study was Kolkata, the approach used can be applied to any major city in the world. Based on the forecasts, the concerned policy-making organizations can implement new measures and regulations to curb the pollution levels in their cities and make the environment healthier for the inhabitants. Curbing pollution levels will also have a major positive impact on the environment: PM particles adversely affect ecosystems including plants, soil and water; water quality gets degraded, and plant growth and yield are also strongly affected. It is hoped that this study will help policy makers judge the gravity of the pollution scenario in their cities and aid them in implementing better pollution measures.
This study was performed on data collected before the COVID-19 pandemic-related lockdown was enforced in Kolkata. Due to the huge reduction in socio-economic activity, the pollution forecasts produced here may deviate drastically from the actual values. Based on the changes in the nature of the pollution data during and after the lockdown, a study analysing those changes can be presented in the future.

References

[1] An empirical study of PM2.5 forecasting using neural network
[2] Forecasting air pollution PM2.5 in Beijing using weather data and multiple kernel learning
[3] Deep neural network for PM2.5 pollution forecasting based on manifold learning
[4] A fast PM2.5 forecast approach based on time-series data analysis, regression and regularization
[5] Artificial neural networks forecasting of PM2.5 pollution using air mass trajectory based geographic model and wavelet transformation
[6] Study on prediction of atmospheric PM2.5 based on RBF neural network
[7] Encoder-decoder model for forecast of PM2.5 concentration per hour
[8] Development of a model for forecasting of PM10 concentrations in Salamanca
[9] Prediction of PM10 and TSP air pollution parameters using artificial neural network autoregressive, external input models: a case study in Salt, Jordan. Middle-East
[10] Prediction of ambient PM10 concentration with artificial neural network. In: Computational methods in engineering and science
[11] Distribution of PM2.5 and PM10-2.5 in PM10 fraction in ambient air due to vehicular pollution in Kolkata megacity
[12] Health status and air pollution related socioeconomic concerns in urban China
[13] Dynamically pre-trained deep recurrent neural networks using environmental monitoring data for predicting
[14] Urban air pollution monitoring system with forecasting models
[15] Air pollution forecasting using a deep learning model based on 1D convnets and bidirectional GRU
[16] Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating
[17] Determination of features for air pollution forecasting models
[18] A hybrid CNN-LSTM model for forecasting particulate matter (PM2.5)
[19] Air pollution PM2.5 data analysis in Los Angeles Long Beach with seasonal ARIMA model
[20] Air pollution forecasts: an overview
[21] An online air pollution forecasting system using neural networks
[22] Forecasting air pollution PM2.5 in Beijing using weather data and multiple kernel learning
[23] Estimation of missing values in air pollution data using single imputation techniques
[24] Banned vehicles found plying in Kolkata in November
[25] Mechanism of lung injury caused by PM10 and ultrafine particles with special reference to COPD
[26] Central Pollution Control Board
[27] University of Dayton: temperature archive
[28] The Weather Company (IBM): Weather Underground
[29] AirNow International: US Embassies and Consulates
[30] A method of estimation of missing values in multivariate data suitable for use with an electronic computer
[31] Postwar U.S. business cycles: an empirical investigation
[32] Time series analysis, forecasting and control. New York
[34] Manuca R, Savit R (1996) Stationarity and nonstationarity in time series analysis
[35] Forecasting sales by exponentially weighted moving averages
[36] On periodicity in series of related terms
[37] Forecasting at scale
[38] Long short-term memory
[39] Nonlinear principal component analysis using autoassociative neural networks
[40] Bidirectional recurrent neural networks
[41] Gradient-based learning applied to document recognition
[42] Learning internal representations by error propagation
[43] Convolutional LSTM network: a machine learning approach for precipitation nowcasting
[44] Development of a model for forecasting of PM10 concentrations in Salamanca
[45] Jusense: a unified framework for participatory-based urban sensing system
[46] Towards smart city: sensing air quality in city based on opportunistic crowd-sensing
[47] Rule induction for forecasting method selection: meta-learning the characteristics of univariate time series
[48] Principles of forecasting: a handbook for researchers and practitioners
[49] Evidence for the selection of forecasting methods
[50] Time series analysis: forecasting and control
[51] imputeTS: time series missing value imputation in R
[52] On adjusting the Hodrick-Prescott filter for the frequency of observations
[53] The financial cycle and macroeconomics: what have we learnt?
[54] Introduction to modern time series analysis. Springer
[55] Żbikowski K (2015) Using volume weighted support vector machines with walk forward testing and feature selection for the purpose of creating stock trading strategy
[57] Python tutorial. Centrum voor Wiskunde en Informatica, Amsterdam, The Netherlands
[58] van der Walt S
[59] TensorFlow: large-scale machine learning on heterogeneous systems
[60] Statsmodels: econometric and statistical modeling with Python
[61] Scikit-learn: machine learning in Python
[62] Correlation vs. trends: a common misinterpretation
[63] Influence of temperature, relative humidity and seasonal variability on ambient air quality in a coastal urban area
[64] Seasonal variations of PM10 and TSP in residential and industrial sites in an urban area of Kolkata
[65] World Health Organization: Ambient (outdoor) air quality and health

Acknowledgements: This comparative study was supported by the project entitled "Participatory and Realtime Pollution Monitoring System For Smart City", funded by the Department of Science and Technology, Government of West Bengal, India.

Conflicts of interest: The authors declare that they have no conflicts of interest.