key: cord-0228988-7udn7ggn
authors: Huang, Huang; Castruccio, Stefano; Genton, Marc G.
title: Forecasting High-Frequency Spatio-Temporal Wind Power with Dimensionally Reduced Echo State Networks
date: 2021-02-01
journal: nan
DOI: nan
sha: 4498f57bdabf69dd9f1f19ec7d9737843004f020
doc_id: 228988
cord_uid: 7udn7ggn

Fast and accurate hourly forecasts of wind speed and power are crucial in quantifying and planning the energy budget in the electric grid. Modeling wind at a high resolution brings forth considerable challenges given its turbulent and highly nonlinear dynamics. In developing countries, where wind farms over a large domain are currently under construction or consideration, this is even more challenging given the necessity of modeling wind over space as well. In this work, we propose a machine learning approach to model the nonlinear hourly wind dynamics in Saudi Arabia with a domain-specific choice of knots to reduce the spatial dimensionality. Our results show that for locations highlighted as wind abundant by a previous work, our approach results in an 11% improvement in the two-hour-ahead forecasted power against operational standards in the wind energy sector, yielding a saving of nearly one million US dollars over a year under current market prices in Saudi Arabia.

The wide consensus of the increasingly negative effects of a warming climate caused by greenhouse gas emissions (IPCC, 2014) has prompted the global community to seek alternative, carbon-free energy sources. While heterogeneous across countries, the increase in the usage of renewable resources worldwide has been steady over the past decades, and the aftermath of the current COVID-19 pandemic is foreseen to further accelerate this trend (Organisation for Economic Co-operation and Development, 2020) .

Saudi Arabia has the second-largest oil reserve in the world, and its internal energy portfolio is almost exclusively focused on fossil fuels. Worldwide, the country currently ranks sixth in oil consumption and has one of the highest per capita energy consumptions (British Petroleum, 2020) . In an effort to align the country's targets with the Paris Agreement (Kinley, 2017) and reduce its carbon footprint, the Saudi Arabian government recently outlined "Vision 2030", a blueprint plan to promote an economically viable transition to renewable energies to reduce its dependence on oil. While solar energy has the greatest potential given the country's geographical position, the plan aims to also install 16 GW of wind capacity by 2030. Wind investment is strategic for the diversification and increased penetration of renewable energies given its high abundance, especially during night hours when solar power is not available. For a country with little to no installed capacity, the development of "Vision 2030" requires a preliminary understanding on where the turbines should be built and what factors are more influential in determining their sites.

Very recent work provided some initial evidence. Giani et al. (2020) investigated various types of installation and maintenance costs and identified the optimal wind farm buildout under the present wind regimes; Zhang et al. (2021) studied the resilience of the plan under a changing climate over the next thirty years; and Chen et al. (2021) investigated the probability of disruption of wind operations due to extreme wind. One fundamental question must be addressed to provide a pathway for its practical implementation: What is the benefit of a reliable forecasting system, and what would be its economic advantage against standard operational forecasting systems? Such information would be instrumental to policymakers in their final decision regarding the optimal construction sites.

Modeling hourly wind fields (the resolution of interest in the energy market) for a country of one-fourth the size of the United States, with nonlinear dynamics implied by diverse geography comprising two coastal areas, mountain ranges, and sandy and rocky deserts, is a considerable challenge. Standard statistical approaches such as the Autoregressive Integrated Moving Average (ARIMA) models are effective in capturing a short-range (approximately) linear dependence; however, more flexible alternatives are necessary for nonlinear dynamics, and simple modifications such as fractional integration (Taqqu et al., 1995) have been shown to be largely insufficient in capturing a more structured dependence in time.

Machine learning approaches to time series, such as artificial neural networks, have been more successful in capturing nonlinear dynamics, owing to their flexibility and the possibility of defining a recursive structure (Recurrent Neural Networks, RNNs) suitable for temporally resolved data. However, the inference for RNN is very challenging given the high dimensionality of the parameter space and the impossibility to directly apply the iterative gradient computations of traditional neural networks, which would lead to numerical instabilities (Doya, 1992) . To achieve more stable and computationally affordable inference, Jaeger (2001) proposed a new approach to RNN, i.e., the Echo State Networks (ESN), which is predicated on the use of sparse, random matrices instead of dense unknown ones (Lukoševičius and Jaeger, 2009 ). Maass et al. (2002) independently developed a similar approach with the name Liquid State Machines but with more sophisticated topological constraints motivated by neuroscience. This paradigm is now commonly referred to as Reservoir Computing when inputs of the neural networks are mapped to high-dimensional hidden reservoir states through fixed nonlinear dynamics. These hidden reservoir states consider sequential linkages and thus allow for a nonlinear transformation of the input history. Despite the use of reservoir states being ideally suited as a baseline for statistical modeling, the machine learning community has mostly focused on this class of methods as a means to ease the computational burden.

Recently, a series of seminal works Wikle, 2017, 2019a,b) on the ESN framework was proposed in the statistical literature to forecast Pacific sea surface temperature and the United States soil moisture. The Empirical Orthogonal Functions (EOF) method, or equivalently, the Principal Component Analysis (PCA), is used to reduce the spatial dimensionality. The leading orthogonal basis functions, which are the leading eigenvectors of the spatial covariance matrix of the space-time data, capture the main modes of spatial variability. Therefore, they reduce the dimensionality and preserve the primary spatial variability structure when using the leading orthogonal basis functions with a much smaller number than the original number of locations to represent the entire spatial field.

In this work, however, we propose a new ESN framework that describes the dynamics of a spatial field through a collection of knots, sampled at both a grid and fixed points over areas of complex patterns such as rugged terrains (see Figure 3 ). The entire field is then reconstructed with a spatial interpolation approach, thereby allowing for a full spatio-temporal forecast without a computationally prohibitive ESN. Comparison with the EOF-based ESN, ARIMA, a spatio-temporal random process model, and the persistence forecasting (i.e., assuming the variable of interest does not change from the last observed value) shows that the prediction under our proposed ESN framework is more accurate.

Furthermore, we offer a comprehensive simulation study investigating how and to what extent the ESN can capture the data dynamics and yield predictive improvements against other methods when the underlying simulated process has a controlled level of nonlinearity.

The manuscript proceeds as follows: Section 2 presents the wind data; Section 3 in-troduces the ESN and spatial model; Section 4 discusses the inferential and forecasting approach; Section 5 compares the forecasting from our model to several operational alternatives in the wind energy market; Section 6 presents the computation of the wind power forecast and discusses and quantifies the economic advantages of our method; and Section 7 provides the conclusions of the study. Implementation of our proposed ESN for the wind speed forecasts in Saudi Arabia can be found at https://github.com/hhuang90/KSA-wind-forecast.

While developed countries can perform forecasting with publicly available, continuously updated high-resolution simulations routinely used for trading energy options (e.g., the High-Resolution Rapid Refresh (Benjamin et al., 2016) in North America), developing countries lack such data products. To investigate the wind speed and subsequent wind power potentials in Saudi Arabia, Giani et al. (2020) (Mellor and Yamada, 1982; Janjić, 2001; Gómez-Navarro et al., 2015) and had a six-kilometer resolution, and we consider the hourly near-surface wind speed. For practical purposes, we exclude islands over the Red Sea (Tiran Island, Sanafir Island in the Red Sea Project Lagoon, and Farasan Island), for a total of 53, 333 locations. higher values in the west due to mountain ranges. While abundant in wind, these regions are generally unsuitable for building wind farms due to the high installation costs implied by the rough terrain. The area in northwest Saudi Arabia near the Gulf of Aqaba is the construction site of NEOM City (Farag, 2019) , and data indicate its suitability for wind energy harvesting, as also discussed in Giani et al. (2020) . We select two locations with a high or low average wind to display the diurnal temporal variability both in terms of mean and standard deviation in Figure 1 (B) and (C). We observe that the wind speed annual variability is low in the Arabian Peninsula, as also highlighted in Chen et al. (2018) , hence lending support for the extension of the results in this work beyond the simulation years. WRF was validated by the observations at ten sites within the King Abdullah City for Atomic and Renewable Energy monitoring network through various assessment metrics in Giani et al. (2020) .

Throughout this work, we denote by Z t (s i ) the near-surface wind speed at location s i and time t, where t takes values from {0, . . . , T } indicating the number of hours after 00:00 Jan. 1, 2013 (Feb. 29 in the leap year of 2016 is removed for simplicity, but could be easily incorporated with more general harmonics in Equation (1)), and s i , i = 1, . . . , n = 53, 333, is the data location in Saudi Arabia.

We observe the periodic inter-daily and multi-daily patterns in the wind speed, which are explainable by both the daily land-ocean heat fluxes and mesoscale patterns of recurring winds in the Arabian Peninsula; see Figure 1 for the daily patterns. For each location s i , i = 1, . . . , n, we assume that the dynamics of Z t (s i ) depend on harmonics:

where K is the number of harmonics, β 0 (s i ) is the location-wise intercept, β k,1 (s i ) and

β k,2 (s i ), k = 1, . . . , K, are the location-wise coefficients for each harmonic with period T k , and γ(s i ) is the location-wise scaling parameter such that the residual Y t (s i ) has unit variance at each location. According to the discrete Fourier transform diagnostics illustrated in Figure S2 in the Supplementary Material, there are K = 5 significant harmonics at periods of one year, half a year, one day, twelve hours, and eight hours. We consider the square root because the wind speed is generally right-skewed due to occasional wind gusts.

The choice of this particular transformation to normality is justified by well-established literature on wind modeling (e.g., Gneiting, 2002) , as well as our diagnostics in Figure S3 in the Supplementary Material.

The parameters β 0 (s i ), β k,1 (s i ), and β k,2 (s i ), i = 1, . . . , n, k = 1, . . . , K, are estimated by linear regression independently across sites using the data from 2013 to 2015 (training data), a task that can be performed in parallel across locations. The scaling parameter γ(s i ), i = 1, . . . , n, is also estimated site by site so that the resulting Y t (s i ) is a zero-mean spatio-temporal residual process with a structure specified in the following sections.

The hourly residual wind Y t (s i ), i = 1, . . . , n, is bound to show a highly nonlinear behavior that traditional time series approaches, such as ARIMA or more general linear space-time models, may fail to capture. The inability of traditional methods to capture the nonlinearity of spatio-temporal dynamics (McDermott and Wikle, 2017) necessarily translates into sub-optimal forecasts, as will be illustrated in Section 5. We propose an ESN model with nonlinear dynamics generated by a state space with neural network dynamics. We will introduce the model with the standard statistical terminology, but we also put in parenthesis the equivalent machine learning terms and use them henceforth.

Quadratic interactions have been shown to be effective in characterizing the nonlinear dynamical evolution in many physical processes (Wikle and Hooten, 2010; Gladish and Wikle, 2014) . In this work, and consistently with McDermott and Wikle (2017), we propose to model the relationship between the data (output) of the wind speed residual at time t and all locations, y t := Y t (s 1 ), . . . , Y t (s n ) , and the n h -dimensional latent variables (hidden states or reservoir states), h t , by a quadratic ESN:

where is the Hadamard (element-wise) product, V 1 and V 2 are weight matrices whose entries are unknown, and t is the error at time t, assumed to follow a zero-mean multivariate normal distribution. Here, V 1 and V 2 are estimated by least squares, as will be shown later in Section 4.2.1, and are independent from the spatial covariance matrix of t . The uncertainty of the estimated y t relies on the covariance of t , but we will show in Section 4.1 that the prediction uncertainty can be calibrated with the ESN mean, by developing an approach similar to Bonas and Castruccio (2021) .

The dynamics of the reservoir state vector h t is formulated as

The vector x t represents all the covariates (input), and here x t := (1, y t−1 , . . . , y t−m ) , i.e., we consider the intercept and the lagged wind speed residuals up to m time steps before;

f is the element-wise activation function (here we choose the commonly used hyperbolic tangent function, but other choices are possible such as the rectified linear unit (Goodfellow et al., 2016) ); λ W is the maximal absolute eigenvalue of W; δ is a tuning parameter to scale W; and φ is the leaking rate to control the speed of the dynamics. In summary, the temporal evolution of h t is controlled by two parts. The first one is a nonlinear evaluation of a linear combination of h t−1 and x t , and the second is a memory term that allows for long-range dependence.

The n h × n h -dimensional weight matrix W = [w ij ] controls the interaction among reservoir states with time evolution, and the n h × (mn + 1)-dimensional weight matrix U = [u ij ] determines the mapping from the inputs to the reservoir states. The key difference between the ESN and the standard RNN lies in the structure of W and U: an ESN assumes that these matrices are random with a controlled level of sparsity, while an RNN assumes full matrices with unknown fixed entries. Therefore, in the ESN, the interaction of each pair of reservoir states, as well as each pair of inputs and reservoir states, is assumed to exist with a given probability. The sparsity in both matrices prevents overfitting and dramatically reduces the model dimensionality. Specifically, the entries of W and U are generated as follows:

where a w and a u control the magnitude of nonzero entries in W and U, respectively, and π w and π u determine the expected sparsity.

An extensive literature in neural networks has proposed stochastic optimization methods, the most common being the random initialization of the weight matrices in the learning algorithm (Yehudai and Shamir, 2019) . ESNs are fundamentally different, as the weight matrices W and U are instead randomly generated from a given distribution; once drawn, they are fixed and do not need to be estimated. The randomness of W and U results in an ensemble of predictions, and its mean can be used as point prediction given its reduced variance with respect to a single ensemble member.

The proposed ESN model could in principle be used to directly produce forecasts of y t ; however, this would imply the use of input comprising a spatially referenced vector of n = 53, 333 points at each time t, and consequently a high-dimensional vector of reservoir states h t , matrices V 1 , V 2 in Equation (2) and matrices W, U in Equation (3) to capture the dynamics. A brute-force approach disregarding the spatial information would therefore be computationally intractable.

One commonly used approach to reduce the size of spatially referenced data in geosciences is using the EOF (or equivalently PCA) method. This approach allows to capture the main modes of spatial variability; for example, Gladish and Wikle (2014) used the first ten EOFs to represent the sea surface temperature on a region in the Pacific Ocean comprising of more than two thousand locations. In this work, we propose a stochastic approach with a Gaussian random field to model the spatial data. After fitting the Gaussian random field with all the available data, we consider a set of knots and use them to represent the entire process. Results in Section 5.1 show that this modeling choice results in better predictive performances than EOFs.

We assume a zero-mean Gaussian random field to model Y t (s) at each time t, i.e., Y t (s) ∼ GP 0, C(·, ·) . To reflect the spatially varying dependence structure dictated, among others, by the different topographical features in Saudi Arabia, we use a nonstationary Matérn covariance function (Paciorek and Schervish, 2006) for Y t (s). More specifically, for any pair of locations (s, s ), the covariance is

where σ 2 (s), ν(s), and τ 2 (s) are the spatially varying partial sill, smoothness, and nugget parameters, respectively, Σ(s) is the spatially varying covariance matrix of a Gaussian kernel centered at s,

is the stationary Matérn correlation function (Stein, 1999) , K ν (·) is the modified Bessel function of the second kind with order ν, and 1 {} is the indicator function.

We use the R (R Core Team, 2019) package convoSPAT (Risser and Calder, 2017) to fit the nonstationary random field with Y t (s i ) i = 1, . . . , n, in 2015.

The package convoSPAT uses a mixture component model for σ 2 (·), ν(·), Σ(·) and

τ 2 (·) and locally estimates the mixture parameters. We place 42 mixture components spanning the entire spatial domain as illustrated in Figure 2 (A). Our results suggest that the nugget estimateτ 2 (·) is negligible, and the spatially varying partial sill estimateσ 2 (·) and smoothness estimateν(·) are shown in Figure 2 (B) and (C). The covariance of the Gaussian kernel characterizes anisotropy and the dependence scale, and we show in Figure 2 (A) the isocurves of the estimateΣ(·) at the mixture components. We adopt a moderately large set of knots across the spatial domain, and then model their space-time dynamics with the ESN. Once forecasts on the knots have been provided, spatial interpolation (kriging) is used to produce spatial predictions. A formal comparison between the approximate process using knots and the true process has been studied, for example, in Titsias (2009) and Matthews et al. (2016) using the Kullback-Leibler divergence.

Our chosen knots consist of two sets. The first set comprises 2,756 data locations closest to a 0.25 • -resolution regular grid across the entire domain, while the second set is chosen at locations with a high wind power potential. We calculate the average wind speed from 2013 to 2015 and focus on the locations with values greater than 6 m/s, which is approximately the 98% quantile. A set of 417 locations are then selected, such that no pairs have differences less than 0.005 • in longitude or latitude. Figure 3 shows a total number of n * = 3, 173 chosen knots, which we denote by {s * 1 , . . . , s * n * }. The output y t is then replaced by

, and the reservoir states h t in Equation (3) are updated by

This section discusses both the inference and the forecasting approach. In Section 4.1 we show how forecasting is performed given all parameter estimates and then in Section 4.2 we elaborate on the inference procedure, where the tuning parameters are estimated via cross-validation based on the forecasting performance.

Let the training period be {0, . . . , T } and the forecasting period be {T +1, . . . , T max }. In this work, we are not only interested in forecasts for one hour ahead, but also for two and three hours ahead to conform our predictions with the majority of wind energy markets (Hering and Genton, 2010) . When we forecast y * t+2 at time t, the observation y * t+1 is required but unavailable, so we substitute the one-time-step forecastŷ * t+1 for y * t+1 to forecast y * t+2 . Section S1 in the Supplementary Material formalizes the forecasts for multiple time-steps ahead.

At time t, once the forecastŷ * t+h = Ŷ t+h (s * 1 ), . . . ,Ŷ t+h (s * n * ) , h = 1, 2, 3, at the knots is obtained, the wind speed residuals at all locationsŷ t+h = Ŷ t+h (s 1 ), . . . ,Ŷ t+h (s n ) can be obtained via spatial interpolation, i.e.,ŷ t+h

. As pointed out by Efron and Hastie (2016) , averaging a set of unbiased estimators is an effective way to reduce estimator variance. Rougier (2016) also investigated why the ensemble mean often yields a smaller Mean Squared Error (MSE) than the majority of the ensemble members. Throughout this work, we use 100 ensemble members to forecast, each consisting of an independent random realization of W and U. The mean of the 100 ensemble forecasts is then used as the final forecast. In order to quantify the uncertainty, an empirical calibration similar to that proposed in Bonas and Castruccio (2021) 

Estimating V 1 and V 2 can be performed via least square regression. The large size of the output vector would potentially result in a high dimension of the reservoir state space.

Therefore, it is necessary to penalize the coefficient estimation to reduce its variance and prevent overfitting. If we denote

then, we can express Equation (2) in a matrix linear regression form as Y = HB + E. We use ridge regression or Tikhonov regularization to estimate B such that

where · F is the Frobenius norm and λ is the ridge penalty. The ridge estimatorB has the following closed-form:B

OnceB is obtained, the forecast for one time-step ahead (and iteratively for any future time step) on the knots can be computed as follows:

for t ∈ {T + 1, . . . , T max }.

Except for V 1 and V 2 , several other parameters need to be estimated via cross-validation:

the dimension of the reservoir states n h , the number of lags m in x t , the leaking rate φ and the scaling parameter δ in Equation (3), the magnitude and sparsity parameters a w , π w , a u , and π u in Equation (4), and the ridge penalty λ in Equation (6). We consider y * t from 2013 to 2014, for a total of 17,520 hourly observations, as the training data to obtain the ridge estimatorB in Equation (6). Subsequently, y * t in 2015, for a total of 8,760 hourly observations, is used as the validation data. More specifically, at any time t in 2015, y * 0 , . . . , y * t are available; we forecast y * t+1 , we compare it with the true validation data, and we then calculate the MSE at all knots and time points for the mean of 100 ensemble forecasts (see Section S2 in the Supplementary Material for more details). We search over a large grid (shown in Table S1 ) to obtain the optimal values of the parameters, see Table 1 .

The large number of reservoir states indicates that a complex model is necessary to capture the nonlinearity of the multivariate time series with n * = 3, 173 dimensions. The estimated ridge penalty avoids overfitting using the large reservoir space. The obtained optimal leaking rate is one (i.e., the reservoir states are fully updated by new activations at each time). Since we removed the low-frequency periodicities to the original wind speed (see Equation (1)), the optimal leaking rate of one suggests a short-term dynamic in the residual field Y t (s). In our ESN, we use the lagged wind speed residuals as the input, and we observe that the optimal model only requires one lagged observation as the input at each time. The parameter δ determines the spectral radius of the matrix W after scaling and influences the fading speed of the input in the hidden neurons. In practice, δ < 1 always guarantees the echo state property (i.e., the reservoir states are independent on the initial conditions after a sufficiently long time). Our estimated value of δ indicates that even though only one lagged observation is necessary at each time, the hidden neurons preserve a long memory of the input (lagged wind speed residuals) by employing a moderately large spectral radius of W, 0.9. The values of a w , π w , a u , and π u are all comparatively small, implying a sparse collection of weak connections among reservoir states and between the reservoir states and the input.

Once the parameters have been estimated in the training step from 2013 to 2015, we proceed to forecast y * t in 2016 and compare it with other common strategies for short-term wind forecasts. The calibration algorithm explained in Section 4.1 is used to obtain the probabilistic forecast distribution in 2016. Table 2 shows the proportion of the true wind speed residuals y * t in 2016 falling into the associated prediction intervals, and we observe that overall they match the nominal levels closely. We compare the ESNs with persistence forecasting and the ARIMA model. The persistence forecasting method uses the last observed values as estimates for the near future (i.e.,ŷ t+3 =ŷ t+2 =ŷ t+1 = y t ). This method is very simple, but frequently used in practice because of its good performance for very short-term forecasting. Additionally, we propose an ARIMA approach, which assumes that at each knot s * i , i = 1, . . . , n * , the time series of wind residual Y t (s * i ) has the following model: 

where p(s * i ), d(s * i ), and q(s * i ) are the order of the autoregressive, differencing, and movingaverage terms, respectively. Here, L is the lag operator (e.g.,

are unknown parameters at s * i . The optimal orders for the ARIMA model at each knot are independently optimized through cross-validation using the data from 2013 to 2015. For each time in 2016, the ARIMA model with the optimal parameters uses all the data before that time and computes the forecasts at each knot up to three hours ahead. Table 3 summarizes the MSE results for y * t at all the n * knots and time points in 2016. The ESN forecasts outperform these from persistence and ARIMA, with better results as the lead time increases from one to three hours ahead. After forecasting Y t (s) at the n * knots, we produce a map at the n locations by spatial interpolation from the inferred spatial Gaussian random field described in Section 3.3, and we denote this approach by S-ESN. We compare this method with ARIMA or persistence approaches trained on all the n locations, a choice which puts the S-ESN at a competitive disadvantage, as it can learn the dynamics only on the knots. Table 4 shows the outper-formance of ESN over the ARIMA and persistence in terms of the average MSE, and in Figure 4 we show the maps of the MSE difference between S-ESN and persistence for one to three hours ahead. We observe smaller errors for our forecasts than the persistence forecasts at the majority of locations. We also provide probabilistic forecasts of Y t (s) at all locations in 2016 based on the calibrated quantiles from 2015. Table S2 in the Supplementary Material shows the prediction interval coverage, and we observe that the uncertainty is well captured. As mentioned in Section 3.3, the EOF approach is also commonly used to reduce dimensions (McDermott and Wikle, 2017) . We denote this method by EOF-ESN, and we compare it with our method. We use y t at all locations from 2013 to 2015 to calculate its covariance matrix and extract the first n EOF = 3, 173 EOFs (principal components). We then reduce the dimension of y t to n EOF , apply the ESN to the projected vector, compute forecasts up to three hours ahead, and recover the forecasted map at all locations by projecting the EOFs back into the original space. Results in Table 4 indicate better forecast skills from the S-ESN, which relies on spatial information to reduce dimensionality, as opposed to EOF-ESN, a data-driven, non spatially informed method for dimension-reduction.

We reiterate that in Table 4 , the persistence forecasting method relies on past values at all locations, whereas S-ESN and EOF-ESN only model the data in a reduced space. Despite the use of a more limited set of data, even the EOF-ESN is competitive against persistence:

for two and three hours ahead, it produces better forecasts.

Traditional spatio-temporal random process models can also be used to provide wind forecasts. However, given the high dimension in space and time, approximation techniques need to be applied for inference and prediction. We use the Fixed Rank Kriging (FRK;

Cressie and Johannesson, 2008) approach as one example for comparison. FRK proposes a spatio-temporal random effect model, where the spatio-temporal random process is decomposed by a series of basis functions, thus achieving dimensionality reduction and naturally accounting for nonstationarity. We use the recently developed R package FRK (Zammit-Mangion and Cressie, 2021) to model y t , and the results are shown in Table 4 . FRK yields more accurate forecasts than persistence and EOF-ESN for one and two hours ahead. We notice that the outperformance of FRK over EOF-ESN reduces as the lead time increases (FRK becomes worse than EOF-ESN for three hours ahead), indicating that the temporal dynamic may not be well characterized by the spatio-temporal process. For all three forecast horizons, S-ESN and ARIMA have consistently more accurate results than FRK. One possible reason for the outperformance of ARIMA over FRK is that ARIMA computes the location-wise forecast using all the available data directly but FRK uses basis functions for the spatio-temporal data, which implies some level of approximation.

In conclusion, we stress the remarkable result of our proposed S-ESN approach: by only relying on data at the knots (about 6% of all the locations) S-ESN is able to outperform alternative methods, even for very short-term forecasts.

Once the wind speed forecasts are produced, we proceed to evaluate the associated predictions for wind power. While it may be interesting to perform a comparison over the entire spatial domain, the practical interest lies in the locations suitable for wind farm construction. In Section 6, we investigate the generated wind power for locations highlighted as suitable for wind farms by Giani et al. (2020) . It is also noteworthy that MSE may not be a perfect metric to assess wind speed or associated wind power due to the possibly asymmetric economic cost between under-and over-forecasts. The evaluation can be performed differently by replacing the MSE with a case-specific commercial loss function.

To better understand the ability of ESN to capture nonlinear dynamics, we also perform a simulation study from a model with a controlled level of nonlinearity in the temporal dynamics. We use one of the most popular models in numerical weather prediction, the Lorenz 96 model (Lorenz, 1996) , and we modify it by adding a factor to control the nonlinearity in the dynamics. We have observed that ESN leads to the most accurate forecasts compared to the ARIMA, vector autoregression, and persistence methods. When the nonlinearity of the dynamics is stronger, the outperformance of the ESN is more significant.

Details of the simulation study for the modified Lorenz 96 model is given in Section S3 in the Supplementary Material.

Once our forecasting method has been validated against existing alternatives, we proceed to produce the forecasts of wind power. From an operational perspective, the electricity produced by wind power cannot be stored and must be consumed once it is injected into the power grid. An accurate forecast of the wind power helps to plan the correct amount of additional electricity to be dispatched from other sources. Penalties or fines are applied in utility markets in case of unfulfilled commitment to provide an agreed amount of power.

The typical forecast horizon for scheduling electricity transmission, allocating resources, and dispatching generated power is two to four hours (Gneiting et al., 2006; Hering and Genton, 2010) . Herein, we focus on two-hour-ahead wind forecasts. We study the wind power forecasts based on the optimal wind farm build-out in Saudi Arabia as shown in Giani et al. (2020) and assess how much power would be saved with our forecasts if all the identified wind farms were installed and in operation.

The two-hour-ahead forecast of Y t (s) is transformed into near-surface wind speed Z t (s) according to Equation (1). Since the wind speed is different at different heights, we convert the wind speed near the surface (10 meters above the ground) to the wind turbine hub height. The wind power law is commonly used to perform this conversion:

where h is the hub height and α(s, t) is the location-and time-specific factor. Traditionally, the factor α(s, t) is fixed as a constant for all locations and times, and the value 1/7 is usually used over open land surfaces (Pryor and Barthelmie, 2011) and specifically over Saudi Arabia (Rehman et al., 2007; Tagle et al., 2019) . However, Crippa et al. (2021) showed that if the location-and time-specific factor α(s, t) is independently estimated at each location from different profile heights, the recovered wind speed at the hub height is on average 36% more accurate, so we use their model to convert the wind speed vertically.

It is noteworthy that this vertical extrapolation model by the wind power law would lead to some errors. In this study, we do not take into account the uncertainty of α(s, t) for the probabilistic forecasts of wind power.

There are two turbine models chosen for the 75 identified optimal wind farms (Giani et al., 2020) : Nordex N131/3300 and GE Energy 2.75-100, with hub heights 84 and 75 meters above the ground, respectively. We compute the wind speed for each wind farm at the turbine hub heights and then use the power curve provided by the wind turbine manufacturers to convert the wind speed to electric power (see Figure S4 ). Each curve has three critical points that divide the power curve into four zones. When the wind speed is less than the cut-in speed, the turbine motor does not rotate. When the wind speed is greater than the cut-in but less than the rated speed, the turbine motor produces more power with faster wind. Once the wind speed reaches the rated speed but not greater than the cut-out speed, the turbine keeps producing the maximal power output. If the wind speed goes beyond the cut-out speed, the turbine motor stops rotating to avoid damage and produces no power. Therefore, better wind speed forecasts do not necessarily yield better wind power forecasts, as their relationship is not strictly monotonic.

We also convert the forecasted near-surface wind speed by the ARIMA and persistence methods as well as the true wind speed at the 75 wind farms, and we compare their error in wind power forecasts against our approach. Note that the economic cost may be asymmetric for under-and over-estimation of wind power. Under-forecast occurrence means that the power grid has ordered more electricity than needed, which results in a power surplus.

Then, the power generation must be reduced, and it is usually more expensive than the opposite case. This asymmetry is case-specific and can vary in different countries, and for simplicity, we do not distinguish the direction of the wind power forecast error in this work. We calculate the absolute difference between the S-ESN (similarly, the ARIMA or persistence) wind power forecasts and the true power for each hour in 2016. The annual sum of the absolute differences in wind energy for all the 75 wind turbines is 2.77 × 10 8 kW·h for S-ESN, 3.05 × 10 8 kW·h for ARIMA, and 3.12 × 10 8 kW·h for persistence. Thus, we obtain a 9% improvement against the ARIMA and an 11% improvement against the persistence forecasts. The bid price in the electricity market varies based on many factors (demand and supply, providers, etc.) . Modeling the imbalance price is generally difficult and out of the scope of our work; see Zhu and Genton (2012) and Pinson (2013) for some discussion on the operational management in an electricity market. To gain some insights, using a medium bid price of 0.025 US dollars per kW·h nowadays in Saudi Arabia as a reference, the forecasts from our approach could yield a saving of nearly one million US dollars over a year against the persistence forecasts. When probabilistic forecasts are considered, Figure 5 shows the whole-year difference between the S-ESN forecast quantiles and the truth in wind energy. The difference curve around center quantiles is quite flat.

Specifically, the differences from 40% to 70% quantiles vary from 2.77 × 10 8 kW·h to 3 × 10 8 kW·h, and the maximum difference from 20% to 80% quantiles reaches 3.95 × 10 8 kW·h.

It is also observed that the difference is inflated drastically for extreme quantiles. Finally, we notice that the wind energy forecast error is asymmetric in quantiles, owing to the characteristics of the power curve illustrated in Figure S4 .

In this work, we have introduced a novel ESN framework for spatio-temporal wind data.

The approach is shown to have superior predictive performances than the operational standards in the wind energy market, contributes to the spatio-temporal model literature, and has practical implications for the wind energy sector in Saudi Arabia. This work presents a fast and efficient approach to hourly wind forecasting over large areas. A more detailed quantification of the economic gains from an improved forecast is reliant upon a market model for trading and bidding energy. The energy sector of Saudi Arabia is currently under full governmental control; hence, the integration of our forecasting approach would require considerable speculation on the market model that will be employed in the future. Therefore, we choose not to pursue this route but rather opt for a simpler comparison against operational standards, such as persistence, under the current market prices in Saudi Arabia. Some of our current studies focus on engaging with local leaders in the energy sectors to detail the range of possible future market models that could be integrated with our approach. In more immediate terms, this work will provide additional information on where to build turbines aside from the wind abundance and construction and operation costs in Giani et al. (2020) . Finally, while this work is reliant upon (validated) high-resolution WRF simulations, it represents a template to assess wind predictions over a large spatial area for developing countries in the Arab Gulf, and more broadly worldwide, with an emerging wind energy portfolio.

•ŷ * t|3 : t ∈ {T + 3, . . . , T max }.

For example, we do not considerŷ * T +1|2 because the needed forecast,ŷ * T |1 , is not available. At time T max −1, we only forecast for one time-step, and the forecastsŷ * Tmax+1|2 andŷ *

are not required because we do not have data y * Tmax+1 and y * Tmax+2 for assessment.

We modify the Lorenz 96 model by adding a factor η to control the nonlinearity in the dynamics as follows:

where y −1 (t) := y N −1 (t), y 0 (t) := y N (t), y N +1 (t) := y 1 (t), N is the number of sites, and F is the external forcing.

We choose N = 5 and F = 8 and simulate data on t ∈ [−199.9, 100] by numerical integration with ∆t = 0.1. The initial values are drawn from an independent standard normal distribution (i.e., y i (−199.9)

i.i.d.

∼ N (0, 1), i = 1, . . . , N = 5). The realizations at the first 2,000 time points are treated as burn-in and discarded. To account for the measurement error, we add white noise to the generated realizations (i.e.,ỹ i (t) = y i (t)+ i (t)

∼ N (0, 1), i = 1, . . . , N = 5, t = 0.1, 0.2, . . . , 100). We proceed to model the simulated dataỹ i (t), i = 1, . . . , N = 5, t = 0.1, 0.2, . . . , 100.

We choose seven values, 0.2, 0.4, . . . , 1.4, for η to control different levels of nonlinearity levels (η is 1 in the original Lorenz 96 model). Nonlinearity is more significant when η is greater. For each η, we generate 50 independent sets of realizations. We apply the ESN, Vector Autoregression (VAR), ARIMA, and persistence methods to the simulated data.

Since we have a small number of sites (N = 5), it is feasible to apply the VAR for the 5-variate time series. The VAR model is described as follows:

where p is the order of the vector autoregression term, ε 1 (t), . . . , ε N (t) are error terms, and c 1 , . . . , c N and the matrices A 1 , . . . , A p are unknown parameters. It is noteworthy that the ESN and VAR model the realizations at the five sites altogether and make a 5variate forecast at each time point, whereas the ARIMA models the time series at each site independently and performs univariate forecasting. All the parameters in these models are selected via the cross-validation by the MSE for the one-time-step forecast. In the cross-validation, the realizations at t ∈ {0.1, . . . , 50} are used for the training, while those at t ∈ {50.1, . . . , 75} are used for the validation.

Once the optimal parameter values are obtained, we compute forecasts up to three timesteps ahead for t ∈ {75.1, . . . , 100} by each method. Figure S1 illustrates the MSE results.

Short-term forecasting is more difficult when the nonlinearity of the dynamic becomes more significant as η increases. It is no surprise that forecasts for more time-steps ahead are less accurate. However, in all cases, the ESN leads to the most accurate forecasts. When η is less, the outperformance of the ESN is not obvious. When η departs from zero, the ESN is generally more capable of capturing the nonlinear dynamics compared to other linear methods. 

A North American hourly assimilation and model forecast cycle: The rapid refresh

Calibration of spatial forecasts from citizen science urban air pollution data with sparse recurrent neural networks

BP statistical review of world energy

Assessing the risk of disruption of wind turbine operations in Saudi Arabia using Bayesian spatial extremes

Current and future estimates of wind energy potential over Saudi Arabia

Fixed rank kriging for very large spatial data sets

A temporal model for vertical extrapolation of wind speed and wind energy assessment

Bifurcations in the learning of recurrent neural networks

Computer age statistical inference

The story of NEOM city: Opportunities and challenges

Closing the gap between wind energy targets and implementation for emerging countries

Physically motivated scale interaction parameterization in reduced rank quadratic nonlinear dynamic spatio-temporal models

Nonseparable, stationary covariance functions for space-time data

Calibrated probabilistic forecasting at the stateline wind energy center: The regime-switching spacetime method

Sensitivity of the WRF model to PBL parametrisations and nesting techniques: Evaluation of wind storms over complex terrain

Deep Learning

Powering up with space-time wind forecasting

Part a: Global and sectoral aspects

The "echo state" approach to analysing and training recurrent neural networks-with an erratum note

Nonsingular implementation of the Mellor-Yamada level 2.5 scheme in the NCEP Meso model

Climate change after Paris: From turning point to transformation

Predictability: A problem partly solved

Reservoir computing approaches to recurrent neural network training

Real-time computing without stable states: A new framework for neural computation based on perturbations

On sparse variational methods and the Kullback-Leibler divergence between stochastic processes

An ensemble quadratic echo state network for non-linear spatio-temporal forecasting

Bayesian recurrent neural network models for forecasting and quantifying uncertainty in spatial-temporal data

Deep echo state networks with uncertainty quantification for spatio-temporal forecasting

Development of a turbulence closure model for geophysical fluid problems

The impact of coronavirus (COVID-19) and the global oil price shock on the fiscal position of oil-exporting developing countries

Spatial modelling using a new class of nonstationary covariance functions

Wind energy: Forecasting challenges for its operational management

Assessing climate change impacts on the near-term stability of the wind energy resource over the United States

R: A Language and Environment for Statistical Computing

Wind power resource assessment for Rafha, Saudi Arabia

Local likelihood estimation for covariance functions with spatially-varying parameters: The convoSPAT package for R

Ensemble averaging and mean squared error

A description of the advanced research WRF version 3. NCAR Technical Note TN-475+STR

Interpolation of Spatial Data: Some Theory for Kriging

A non-Gaussian spatiotemporal model for daily wind speeds based on a multi-variate skew-t distribution

Estimators for long-range dependence

Variational learning of inducing variables in sparse gaussian processes

A general science-based framework for dynamical spatio-temporal models

On the power and limitations of random features for understanding neural networks

FRK: An R package for spatial and spatiotemporal prediction with large datasets

Assessing the reliability of wind power operations under a changing climate with a non-Gaussian bias correction

Short-term wind speed forecasting for power system operations

We write h t (x * t ) more explicitly as a function of x * t in the forecastedŷ * t ,; the forecastedŷ * t+2 uses; and the forecastedŷ * t+3 uses

Let the training period be {0, . . . , T } and the forecasting period be {T + 1, . . . , T max }. We writeŷ * t|1 as the one-hour-ahead forecast at time t − 1,ŷ * t|2 as the two-hour-ahead forecast at time t − 2, andŷ * t|3 as the three-hour-ahead forecast at time t − 3, respectively. The used time points for the forecasts are as follows:•ŷ * t|1 : t ∈ {T + 1, . . . , T max },•ŷ * t|2 : t ∈ {T + 2, . . . , T max },