A Two-Step Framework for Arbitrage-Free Prediction of the Implied Volatility Surface

Wenyong Zhang, Lingfei Li, Gongqiu Zhang

June 14, 2021

Abstract. We propose a two-step framework for predicting the implied volatility surface over time without static arbitrage. In the first step, we select features to represent the surface and predict them over time. In the second step, we use the predicted features to construct the implied volatility surface with a deep neural network (DNN) model that incorporates constraints preventing static arbitrage. We consider three methods to extract features from the implied volatility data: principal component analysis, the variational autoencoder, and sampling the surface, and we predict these features using LSTM. Using a long time series of implied volatility data for S&P 500 index options to train our models, we find that two feature construction methods, sampling the surface and the variational autoencoder, combined with DNN for surface construction, are the best performers in out-of-sample prediction. In particular, they outperform a classical method substantially. Furthermore, the DNN model for surface construction not only removes static arbitrage, but also significantly reduces the prediction error compared with a standard interpolation method. Our framework can also be used to simulate the dynamics of the implied volatility surface without static arbitrage.

The implied volatility for an option with given strike K and time to maturity τ is the volatility level that makes the option price from the Black-Scholes formula equal to the observed option price. In practice, traders quote implied volatility for the price of call and put options. The implied volatility surface (IVS) is the collection of implied volatilities as a function of K and τ, and it is a fundamental input for various tasks, such as derivatives pricing and hedging, volatility trading, and risk management.

There exist many studies on the implied volatility surface. Since only a finite set of options is traded on each day, an important task in practice is to interpolate and extrapolate the implied volatilities of these options to obtain the entire surface. Methods for this task mainly fall into three categories: (a) Parametric models: Gatheral (2004) proposes the well-known stochastic volatility inspired (SVI) model for single-maturity implied volatility skews, and Gatheral and Jacquier (2014) further generalize the model to obtain the surface without static arbitrage. (b) Splines: Fengler (2009), Orosi (2015) and Fengler and Hin (2015) develop spline-based models that respect the no-static-arbitrage conditions. (c) Machine learning models: Neural networks have been utilized to construct the IVS. See Zheng et al. (2019) for a neural network model that combines several single-layer neural networks and Ackerer et al. (2020) for a neural network model with multiple layers (hence deep). Both works formulate the static no-arbitrage constraints as penalties in their loss functions to obtain a surface free of static arbitrage. Bergeron et al. (2021) employ the variational autoencoder for constructing the IVS.
This method first extracts latent factors for observed implied volatilities through a neural network called the encoder and then obtains the surface through another neural network known as the decoder, using the latent factors together with moneyness and time to maturity as inputs. Almeida et al. (2021) boost the performance of classical parametric option pricing models by fitting a neural network to the residuals of these parametric models. Horvath et al. (2021) use a deep neural network to approximate the pricing function of a sophisticated rough stochastic volatility model to facilitate its calibration to implied volatility data.

In this paper, we focus on the dynamic prediction problem of the IVS, i.e., at time t we predict the IVS at time t′ with t′ > t. While interpolation and extrapolation of the IVS for a single day have been well studied with satisfactory solutions, the dynamic prediction problem requires more research in our opinion. It can also be more challenging than the former problem as we need to capture the behavior of the IVS over time. Unlike predicting financial quantities like stock prices, interest rates and exchange rates, predicting the IVS is a more challenging problem as we need to predict a whole surface, which is a bivariate function, as opposed to a single value such as the price of a stock. Furthermore, the predicted surface must satisfy certain restrictions so that it is free of arbitrage, whereas there is no such concern when predicting fundamental financial variables.

The dynamics of the IVS have been analyzed in various papers. Skiadopoulos et al. (2000), Fengler et al. (2003) and Cont and Da Fonseca (2002) apply principal component analysis (PCA) to IVS data. The first two papers perform PCA on smiles of different maturities while the third performs PCA directly on the surface. It is found in Cont and Da Fonseca (2002) that the first three eigenmodes already explain most of the variation and that they are associated with the level, skewness and convexity of the IVS. The authors approximate the IVS on each day by a linear combination of the first three eigensurfaces, and the dynamics of each coefficient is modeled by a first-order autoregressive model. Fengler et al. (2007) develop a semiparametric factor model for the IVS, where the IVS is assumed to be a sum of basis functions, each of which takes moneyness and time to maturity as inputs. The factor loadings are modeled by a vector autoregressive process. The basis functions and the loadings are estimated by weighted least squares with the weights given by a kernel function. Bloch (2019) proposes a general modeling framework. He uses several risk factors to represent the surface and models each parameter (corresponding to a risk factor) using a neural network. In particular, he provides three ways of obtaining the risk factors: the first three eigenmodes from the PCA of Cont and Da Fonseca (2002), the parameters of a polynomial model for the IVS, or the parameters of a stochastic-volatility-type model (such as the SVI model). Explanatory variables such as the spot price, volume, VIX, etc. can be fed into the neural networks for these parameters. Cao et al. (2020) model the expected daily change in the implied volatility of S&P 500 index options by a multilayer feedforward neural network with the daily index return, VIX, moneyness and time to maturity as inputs. They find that with the index return and VIX as features, their model improves on an analytic model significantly.
One can use all of these models for predicting the IVS on a future date, but their predictive performance is not assessed in these papers. Moreover, none of these papers shows how to obtain the dynamics of the IVS without arbitrage.

Several other papers directly address the dynamic prediction problem. Dellaportas and Mijatović (2014) predict the implied volatilities of a single maturity by forecasting the parameters of the SABR model with a time series model. As the SABR model is arbitrage free, the predicted implied volatility smile has no arbitrage. However, their approach does not apply to the whole surface because the SABR model cannot fit the surface well. Goncalves and Guidolin (2006) and Bernales and Guidolin (2014) model the IVS as a polynomial of time-adjusted moneyness and time to maturity. To predict the IVS on a future date, they use the forecasted coefficients of the polynomial from a time series model. Audrino and Colangelo (2010) develop a regression tree model through boosting to predict the IVS. Chen and Zhang (2019) apply the long short-term memory (LSTM) model with an attention mechanism to predict the IVS. Bloch and Böök (2020) predict the IVS by predicting the risk factors driving its dynamics using temporal difference backpropagation (TDBP) models. Zeng and Klabjan (2019) use tick data on options to construct the implied volatility surface at high frequency through a support vector regression model. A common issue with all these papers is that the predicted IVS from their models is not necessarily arbitrage free. Recently, Ning et al. (2021) introduce an interesting approach to simulate an arbitrage-free IVS over time. They first calibrate an arbitrage-free stochastic model to the IVS data and then generate the model parameters of a future date from a variational autoencoder. The future IVS is then obtained under the stochastic model using the generated parameters. However, they do not use their approach to predict the future IVS.

The contributions of this paper are twofold. First, we provide a general framework for dynamic prediction and simulation of the IVS free of static arbitrage. This is a new feature of our approach compared with existing methods for dynamic prediction of the IVS. Second, we show how to construct features to represent the IVS and develop some successful models for predicting the IVS in this framework. Our framework consists of two steps.

• Step 1: We select features to represent the IVS and predict them. The predicted features are mapped to a discrete set of implied volatilities. If the task is simulation, we simulate the features in this step.
• Step 2: We construct the entire IVS from the discrete set of implied volatilities in Step 1 through a deep neural network (DNN) model, taking into account the constraints that rule out static arbitrage.

This framework is completely flexible as users can construct features and predict them in their own ways. Furthermore, any result obtained in Step 1 can be converted to a static-arbitrage-free surface through the DNN model in Step 2. However, the predicted surface from our framework may admit dynamic arbitrage opportunities as we do not enforce constraints that prevent dynamic arbitrage. It would be difficult to do so in our framework as it is data-driven and assumes no stochastic model for the underlying asset. The accuracy of predicting the IVS obviously hinges on the selected features and the model for predicting them.
In general, one can use features extracted from the IVS data together with exogenous variables to represent the surface. The focus of this paper is on how to extract features, and we consider three approaches: using the eigenmodes from the PCA of Cont and Da Fonseca (2002), applying the variational autoencoder (VAE) to extract latent factors for the IVS, and directly sampling the IVS on a discrete grid of moneyness and time to maturity. PCA can be viewed as a parametric approach that approximates the change of the log-IVS by a linear combination of eigensurfaces. The VAE approach is more general than PCA as it offers a flexible nonlinear factor representation of the surface. The sampling approach is completely nonparametric. To predict the extracted features, we utilize the long short-term memory (LSTM) model, a popular deep learning model for sequential data. Our choice is motivated by the success of deep learning in a range of prediction problems in finance; see, e.g., Borovykh et al. (2017) and Sezer et al. (2020) for various financial time series, Sirignano (2019) and Sirignano and Cont (2019) for limit order books, Sirignano et al. (2018) for mortgage risk, and Yan et al. (2018) for tail risk in asset returns.

By training our models on 9.5 years of data and testing them on a period of 2.5 years, we find that both the sampling and the VAE approaches are quite successful in predicting the IVS. The error of the PCA approach is almost three times that of the other two, indicating that prediction based on a linear combination of eigensurfaces is not accurate enough. Another important finding is that the DNN model in Step 2 not only serves the purpose of constructing an arbitrage-free surface but is also crucial for prediction accuracy. Compared with a standard interpolation method for the IVS, using the DNN model reduces the out-of-sample prediction error substantially for the sampling and VAE approaches.

Our paper is related to Bloch (2019), which also provides a general and appealing framework, but the two differ in a number of ways. First, Bloch (2019) does not consider how to obtain arbitrage-free surfaces. Second, Bloch (2019) constructs features for the IVS using PCA and parametric models (such as the polynomial or SVI model). We offer another two approaches for feature construction in this paper. Our empirical results suggest that the prediction model based on features from PCA is not good enough, and we also point out potential issues with predicting features from parametric models (see Remark 1). Third, Bloch (2019) proposes to model each feature by a separate neural network. We model the features jointly by one model (LSTM in our implementation). Having separate neural network models for the features may fail to capture potential dependence among them unless they are independent. Finally, Bloch (2019) does not provide an empirical study to validate his models.

The rest of the paper is organized as follows. Section 2 provides background information on the implied volatility surface, including the definition of implied volatility, conditions that ensure no static arbitrage, an interpolation method and our data. Section 3 presents the two-step framework for prediction and simulation. Section 4 shows empirical results and compares different models. Section 5 concludes with remarks for future research.

We provide some background knowledge on the implied volatility surface (IVS) in this section.
Consider a European call option on a dividend-paying asset S_t with maturity date T and strike price K. Set τ = T − t, which is the time to maturity. Denote the risk-free rate by r and the dividend yield by q. Let F_t(τ) be the forward price at t for time to maturity τ. It is given by F_t(τ) = S_t e^{(r−q)τ}. We will write F_t below to simplify the notation. Under the Black-Scholes model, the call option price at time t is given by

C_BS(m, τ; σ) = e^{−rτ} (F_t N(d_1) − K N(d_2)), d_1 = (−m + σ²τ/2)/(σ√τ), d_2 = d_1 − σ√τ, m = log(K/F_t), (1)

and N(·) is the cumulative distribution function of the standard normal distribution. The variable m defined in (1) is known as the log forward moneyness.

It is well known that the Black-Scholes model is unrealistic. To use it in practice, one looks for the level of volatility that matches an observed option price, i.e., we solve σ from the equation

C_BS(m, τ; σ) = C_mkt,

where C_mkt is the observed market price for the call option. The solution is called the implied volatility. The implied volatility surface at a time point is the collection of implied volatilities for options with different K and τ. We prefer to consider the IVS as a function of m and τ, because m is a relative coordinate. Hereafter, the IVS at time t is denoted by σ_t(m, τ). Practitioners also like to quote implied volatilities using the Black-Scholes delta (denoted by δ) and τ, where δ = e^{−qτ} N(d_1).

Conditions that ensure the implied volatility surface is free of static arbitrage have been well studied in the literature (see, e.g., Roper (2010), Gulisashvili (2012)). We summarize them in the proposition below, where w(m, τ) = σ²(m, τ)τ denotes the total implied variance.

Proposition 1. Consider an IVS σ(m, τ) and suppose the following conditions are satisfied:
1. σ(m, τ) ≥ 0 for every (m, τ);
2. σ(m, τ) is twice differentiable in m and once differentiable in τ;
3. for every (m, τ),
   ℓ_cal(m, τ) := ∂_τ w(m, τ) ≥ 0; (2)
4. for every (m, τ),
   ℓ_but(m, τ) := (1 − m ∂_m w/(2w))² − ((∂_m w)²/4)(1/w + 1/4) + ∂²_mm w/2 ≥ 0; (3)
5. for every τ, σ²(m, τ) is linear in m as |m| → +∞.
Then σ(m, τ) is free of static arbitrage.

Condition 3 implies that σ(m, τ) is free of calendar spread arbitrage, and Condition 4 guarantees that there is no butterfly arbitrage. In Section 3.4, we will implement these conditions to get an arbitrage-free surface.

On a given day, implied volatilities can only be calculated for a discrete set of (m, τ) pairs, which correspond to options that are listed on that day. Suppose we are given {σ(m_i, τ_i) : i = 1, ..., n} on a day. There are various approaches to interpolate these given points to obtain the entire IVS. Here, we consider a simple and popular parametric model proposed in Dumas et al. (1998), hereafter simply called DFW. This model assumes

σ(m, τ) = max(0.01, a_0 + a_1 m + a_2 τ + a_3 m² + a_4 τ² + a_5 mτ), (4)

where a floor of 0.01 is set to prevent the implied volatility of the model from becoming too small or even negative. The coefficients a_0, ..., a_5 can be estimated by regression.

Another popular nonparametric approach to estimate the entire implied volatility surface uses the Nadaraya-Watson (NW) estimator (Härdle (1990)), which is given by

σ̄(m, τ) = [Σ_{i=1}^n σ(m_i, τ_i) g(m − m_i, τ − τ_i)] / [Σ_{i=1}^n g(m − m_i, τ − τ_i)],

where g(x, y) = exp(−x²/(2h_1) − y²/(2h_2)) is a Gaussian kernel. The estimator involves two hyperparameters h_1 and h_2, which are bandwidths that determine the degree of smoothing. If the bandwidths are too small, a bumpy surface is generated. If they are too big, important details in the observed data can be lost after smoothing. When implementing this approach, on each day we apply five-fold cross-validation to the implied volatility data of that day to select the best pair (h_1, h_2) from a grid of candidate values.

Table 1 presents the average root mean squared error (RMSE) of interpolation for these two methods, where the average is taken over days in our training period. The DFW model is more accurate and it will be our choice for further study.

Table 1: Average RMSE of interpolation over days in the training period.

        DFW     NW
RMSE    0.018   0.026

The larger error of the NW estimator is very likely caused by applying the same bandwidths (h_1, h_2) to all points, whereas our implied volatility data are non-uniformly distributed in the (m, τ) space.
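To make the DFW interpolation concrete, below is a minimal sketch of fitting (4) by ordinary least squares and evaluating the fitted quadratic with the 0.01 floor. The function names are our own illustrative choices, not the authors' code; the floor is applied at evaluation time, with the coefficients estimated by a plain linear regression on the observed points.

```python
import numpy as np

def fit_dfw(m, tau, iv):
    """Fit sigma(m, tau) = a0 + a1*m + a2*tau + a3*m^2 + a4*tau^2 + a5*m*tau
    to observed implied vols (1-d arrays m, tau, iv) by least squares."""
    X = np.column_stack([np.ones_like(m), m, tau, m**2, tau**2, m * tau])
    coef, *_ = np.linalg.lstsq(X, iv, rcond=None)
    return coef

def eval_dfw(coef, m, tau):
    """Evaluate the fitted quadratic and apply the 0.01 floor from (4)."""
    a0, a1, a2, a3, a4, a5 = coef
    return np.maximum(0.01, a0 + a1*m + a2*tau + a3*m**2 + a4*tau**2 + a5*m*tau)
```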
The dataset used for this paper contains implied volatility surfaces for S&P 500 index options from January 1, 2009 to December 31, 2020. We obtained the data from OptionMetrics through the Wharton Research Data Services. On each day, we have implied volatilities for a set of (δ, τ) pairs with δ going from 0.1 to 0.9 with an increment of 0.05 and τ = 10, 30, 60, 91, 122, 152, 182, 273, 365, 547, 730 calendar days. Since we consider the IVS as a function of m and τ, we convert δ to m by inverting δ = e^{−qτ} N(d_1), which gives

m = σ²τ/2 − σ√τ N^{−1}(e^{qτ} δ).

This results in implied volatilities on different days for different grids of moneyness but the same grid of τ. We denote by I_{d,t} the set of (m, τ) pairs for observed implied volatilities at time t. In total, we have data on 3021 days and on each day 374 points are observed from the implied volatility surface (i.e., the size of I_{d,t} is 374).

To demonstrate salient features of the implied volatility surface for index options, we calculate the average of σ_t(δ, τ) over t for all observed (δ, τ) pairs, and plot the average values as a surface in Figure 1(a) in terms of δ and τ. In Figure 1(b), we show the implied volatility curves for different maturities as functions of log forward moneyness. A volatility skew is clearly observed for each τ and it remains quite steep even for large maturities.

As the methods we will apply later cannot be used if the (m, τ)-grid changes from day to day, we have to fix it. We use the following grid for (m, τ):

I_0 = {(m, τ) : m ∈ G_m, τ ∈ {i/365 : i ∈ {10, 30, 60, 91, 122, 152, 182, 273, 365, 547, 730}}}, (5)

where G_m is a non-uniform grid of 14 moneyness points on [log(0.6), log(2)]. Since the set of τ is fixed over time in the data, we simply adopt it as the grid for τ but change the time unit to years (τ was quoted in days initially). For the grid of moneyness, we first obtain the minimum and maximum values of m in our dataset and set [log(0.6), log(2)] as the range, which is slightly wider than the observed range. Then we create a non-uniform grid on this range so that the grid is denser near the money. As I_0 is different from the observed grid for (m, τ) on a day, we use the DFW model given in (4) to obtain the implied volatilities on I_0. Eventually, at every t, we have a set of 154 implied volatilities which can be deemed a sample of the implied volatility surface; we denote it by

Σ̄_t = {σ̄_t(m, τ) : (m, τ) ∈ I_0}. (6)

Consider the implied volatility surface process {σ_t(m, τ), t ≥ 0}. We would like to predict σ_{T+1}(m, τ) (the entire surface) given the information available at T. In reality, we do not observe the entire IVS on a day, but only the implied volatilities for a finite number of (m, τ) pairs. Furthermore, the observed (m, τ) pairs can vary from day to day. Another important problem is how we can ensure the predicted surface is free of static arbitrage. We propose a two-step framework to deal with these problems.

• Step 1: We select a feature vector Z_t to represent σ_t(m, τ) for every t. Given {Z_0, ..., Z_T}, we predict Z_{T+1} using some model, and the predictor is denoted by Ẑ_{T+1}.
• Step 2: We map Ẑ_{T+1} to F_{T+1}, a discrete set of implied volatilities for T + 1, using some function h, i.e., F_{T+1} = h(Ẑ_{T+1}). We predict the implied volatility surface at T + 1 as

σ̂_{T+1}(m, τ) = f(m, τ, F_{T+1}),

where f is a deep neural network (DNN) that outputs an implied volatility surface free of static arbitrage.

A flowchart is provided in Figure 2 to illustrate the framework, and a toy end-to-end sketch is given below.
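The sketch wires the two steps together with deliberately naive stand-ins: a last-value forecast in place of the LSTM of Step 1 and a nearest-neighbour lookup in place of the arbitrage-free DNN of Step 2. All names, the random toy data and the one-maturity grid are our own assumptions; the real components are described in the following sections.

```python
import numpy as np

def predict_features(Z_hist):
    """Step 1 (toy stand-in for the LSTM): forecast tomorrow's feature
    vector as today's value, i.e., a random-walk forecast."""
    return Z_hist[-1]

def construct_surface(F_next, grid, m, tau):
    """Step 2 (toy stand-in for the arbitrage-free DNN): return the
    predicted discrete vol at the grid point nearest to (m, tau)."""
    d = (grid[:, 0] - m) ** 2 + (grid[:, 1] - tau) ** 2
    return F_next[np.argmin(d)]

# Toy history of SAM features (sampled vols), so h is the identity.
Z_hist = np.abs(np.random.default_rng(0).normal(0.2, 0.02, size=(100, 154)))
grid = np.column_stack([np.linspace(np.log(0.6), np.log(2), 154),
                        np.full(154, 30 / 365)])   # stand-in (m, tau) grid
F_next = predict_features(Z_hist)                  # Step 1
iv_hat = construct_surface(F_next, grid, m=0.0, tau=30 / 365)  # Step 2
```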
In this paper, we focus on day-ahead prediction, but our framework can obviously be applied to predict the IVS over any horizon by replacing T + 1 with T + k, where k is the length of the prediction horizon. The framework is flexible enough to accommodate various features and different ways of predicting them. We will explore some choices in this paper. The function h is determined by the selected features. We consider several methods to extract features from the implied volatility data.

Method 1 (SAM): We directly use the sampled implied volatility set Σ̄_t (see (6)) to represent the entire implied volatility surface. Thus, the feature vector Z_t = Σ̄_t, which is a 154-dimensional vector in our data. We call this method the sampling approach, or SAM for short. Here the function h is the identity function, i.e., F_{T+1} = Ẑ_{T+1}, because the predicted Ẑ_{T+1} is already a set of implied volatilities. While a high-dimensional feature vector can better approximate the surface, predicting it may be more difficult. Thus, we naturally consider dimension reduction techniques to extract features, which leads us to the following two methods.

Method 2 (PCA): Cont and Da Fonseca (2002) applied surface principal component analysis (PCA) to dissect the dynamics of implied volatility surfaces. We follow their approach here. As a fixed (m, τ)-grid is needed, we consider {Σ̄_t, t ≥ 0}. Define X_t(m, τ) = ln σ̄_t(m, τ), where σ̄_t(m, τ) ∈ Σ̄_t, and U_t(m, τ) = ln σ̄_t(m, τ) − ln σ̄_{t−1}(m, τ) for (m, τ) ∈ I_0. Then we perform principal component analysis on {U_t(m, τ), (m, τ) ∈ I_0}, which is a 154-dimensional random vector in our data. Let K be the covariance matrix with K(x_1, x_2) = cov(U(x_1), U(x_2)), x_1, x_2 ∈ I_0. We solve the eigenvalue problem

K f_k = v_k f_k,

where v_k ≥ 0 is the k-th eigenvalue and f_k is the associated normalized eigenvector. We sort the eigenvalues in descending order and use a linear combination of the first K eigenvectors to approximate X_t(m, τ) − X_0(m, τ):

X_t(m, τ) ≈ X_0(m, τ) + Σ_{k=1}^K x_k(t) f_k(m, τ), (7)

where the coefficient x_k(t) is the inner product between the vectors X_t − X_0 and f_k. Thus, we have Z_t = (x_1(t), ..., x_K(t)) as the feature vector for σ_t(m, τ). Typically, a small K already explains most of the variation in the data, so the feature vector is low dimensional. Let x̂_k(T + 1) be the predicted k-th coefficient at T + 1. In this approach, we have

F_{T+1} = {exp(X_0(m, τ) + Σ_{k=1}^K x̂_k(T + 1) f_k(m, τ)) : (m, τ) ∈ I_0}.

Method 3 (VAE): The variational autoencoder (VAE) is proposed in Kingma and Welling (2013). This approach extracts latent factors to represent given data through an encoder and then tries to generate synthetic data through a decoder to resemble the given data. Specifically, the method works as follows (see Figure 3 for a graphical illustration; a minimal sketch of the architecture is given after the list).

• Let Y be the input data vector and H be the vector of d latent variables. The components of H are independent and H follows a multivariate normal distribution with mean vector µ(Y) and standard deviation vector σ(Y).
• The encoder is modeled by a feedforward neural network (FNN), denoted by N_E. µ(Y) and σ(Y) are outputs of N_E.
• The latent vector is sampled as H = µ(Y) + σ(Y) ⊙ ε, where ε ∼ N(0, I_d) with I_d the d-by-d identity matrix and ⊙ the Hadamard product.
• The decoder is modeled by another feedforward neural network (FNN), denoted by N_D, with H as the input. The output is Ŷ = N_D(H).

Figure 3: The structure of the VAE.
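The following is a minimal PyTorch sketch of the encoder-decoder structure just described, including the reparameterization H = µ(Y) + σ(Y) ⊙ ε. The layer sizes (three hidden layers of 128 nodes) follow the setup reported later in the paper, but the class and variable names are our own illustrative choices.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=154, latent_dim=10, hidden=128):
        super().__init__()
        # Encoder N_E: maps Y to the mean and log-variance of H
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)
        self.logvar_head = nn.Linear(hidden, latent_dim)
        # Decoder N_D: maps H back to a reconstruction of Y
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, input_dim))

    def forward(self, y):
        e = self.encoder(y)
        mu, logvar = self.mu_head(e), self.logvar_head(e)
        # Reparameterization trick: H = mu + sigma * eps, eps ~ N(0, I_d)
        eps = torch.randn_like(mu)
        h = mu + torch.exp(0.5 * logvar) * eps
        return self.decoder(h), mu, logvar
```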
The loss function for training the VAE has two parts. The first part is the mean squared (reconstruction) loss between the synthetic data and the original data:

RE = (1/M) Σ_{i=1}^M ||Y_i − Ŷ_i||²,

where M is the batch size. The second part is the Kullback-Leibler (KL) divergence between the parameterized normal distribution and N(0, I_d):

KL = (1/(2M)) Σ_{i=1}^M Σ_{k=1}^d (µ_k(Y_i)² + σ_k(Y_i)² − 1 − ln σ_k(Y_i)²),

where µ_k and σ_k are the mean and standard deviation of the k-th latent variable. The loss function is defined as L(θ_VAE, F) = RE + βKL. Adding the KL divergence term encourages the model to encode a distribution that is as close to normal as possible, and the hyperparameter β controls the extent of regularization. In our problem, Y_t = Σ̄_t and we set Z_t = µ(Y_t). With the predicted Ẑ_{T+1}, F_{T+1} = h(Ẑ_{T+1}) = N_D(Ẑ_{T+1}), i.e., the h function is given by the decoder FNN.

Remark 1. A natural way to extract features from the IVS data is to use a model for single-day interpolation and extrapolation surveyed at the beginning of Section 1. For example, one can treat the parameters of parametric models like the surface SVI model in Gatheral and Jacquier (2014) as features for the IVS. One advantage of using such a parametric model is that its parameters can have intuitive meanings that are easily understood by traders (Bloch (2019)). In our study, we calibrate the surface SVI model to our training data. However, there are two issues with predicting the calibrated parameters. First, they seem to be too volatile to be predicted well over our long training period. Second, certain constraints ensuring absence of arbitrage are not satisfied after prediction. The second one is probably a lesser issue, as no arbitrage can be restored using the Step 2 DNN model in our framework. However, one cannot resolve the first issue easily. There may also be catches in using other single-day models. For instance, the B-spline model of Fengler and Hin (2015) is accurate for interpolating and extrapolating the surface on a single day. One can use the control net of the B-spline model as features. However, it may vary considerably from day to day, making it difficult to predict. For the reasons above, we do not pursue these ideas for extracting features further in this paper.

To predict Z_{T+1} from {Z_0, ..., Z_T}, one can consider all kinds of models. In our experiment, we use the long short-term memory (LSTM) model (Hochreiter and Schmidhuber (1997)), a popular deep learning model for sequential data prediction whose success has been demonstrated in many problems. We use the model in the following way (see the sketch after this list):

• For any T, consider

Z¹_T = (1/21) Σ_{i=0}^{20} Z_{T−i},  Z²_T = (1/5) Σ_{i=0}^{4} Z_{T−i},  Z³_T = Z_T.

The first two represent monthly (21 trading days) and weekly (5 trading days) moving averages at T, respectively. We predict Z_{T+1} using Z¹_T, Z²_T, Z³_T, which can be viewed as long-term, medium-term and short-term features. A similar approach is taken by Corsi (2009) and Chen and Zhang (2019).
• Let h_j be a hidden state that represents a summary of information from {Z¹_T, ..., Z^j_T}. Set h_0 = 0 and c_0 = 0. For j = 1, 2, 3, calculate

i_j = σ_g(W_i Z^j_T + U_i h_{j−1} + b_i),
r_j = σ_g(W_r Z^j_T + U_r h_{j−1} + b_r),
o_j = σ_g(W_o Z^j_T + U_o h_{j−1} + b_o),
c_j = r_j ⊙ c_{j−1} + i_j ⊙ σ_h(W_c Z^j_T + U_c h_{j−1} + b_c),
h_j = o_j ⊙ σ_h(c_j).

All W, U, b are parameters, and σ_g, σ_h are the sigmoid function and the tanh function, respectively, used for activation. At step j, i_j, r_j and o_j represent the input gate, forget gate and output gate.
• Finally, we predict Z_{T+1} as Ẑ_{T+1} = σ_out(W_out h_3 + b_out).

The range of Z_{T+1} varies in our framework depending on the feature extraction method. For SAM, we use ReLU for σ_out as Z_{T+1} is positive. For VAE and PCA, since Z_{T+1} can take any real value, we do not use any nonlinear activation function and simply set σ_out to the identity function. In the following, we write the feature prediction model as

Ẑ_{T+1} = p(Z¹_T, Z²_T, Z³_T; θ_P),

where θ_P is the vector of parameters involved.
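Below is a minimal PyTorch sketch of this feature predictor: the three summary vectors Z¹_T, Z²_T, Z³_T are fed as a length-3 sequence into an LSTM, and the last hidden state h_3 is mapped to Ẑ_{T+1}. The class name, hidden size and the ReLU output (appropriate for SAM features) are our illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class FeaturePredictor(nn.Module):
    """Predict Z_{T+1} from the sequence (Z^1_T, Z^2_T, Z^3_T)."""
    def __init__(self, feat_dim=154, hidden=64, positive=True):
        super().__init__()
        self.lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden,
                            batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)
        # ReLU for SAM (positive vols); identity for PCA/VAE features
        self.act = nn.ReLU() if positive else nn.Identity()

    def forward(self, z_seq):          # z_seq: (batch, 3, feat_dim)
        h_all, _ = self.lstm(z_seq)    # hidden states h_1, h_2, h_3
        return self.act(self.out(h_all[:, -1]))  # map h_3 to Z_hat_{T+1}

def make_inputs(Z):
    """Stack monthly (21-day), weekly (5-day) and daily summaries of the
    feature history Z, a torch tensor of shape (T+1, feat_dim)."""
    z1 = Z[-21:].mean(0)   # long-term
    z2 = Z[-5:].mean(0)    # medium-term
    z3 = Z[-1]             # short-term
    return torch.stack([z1, z2, z3]).unsqueeze(0)  # (1, 3, feat_dim)
```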
With F_{T+1} = h(Ẑ_{T+1}), we construct the entire implied volatility surface from F_{T+1} using a DNN illustrated in Figure 4. The neural network is a feedforward one with inputs F_{T+1}, m, τ. The output is an implied volatility for the input (m, τ) pair. We use the softplus function ln(1 + exp(x)) as the activation function of the output layer, because it makes the output nonnegative and twice differentiable, so that the first two no-arbitrage conditions in Proposition 1 are fulfilled.

Suppose the time horizon in our data is given by T = {1, 2, ..., T} and let q_t be the number of observed implied volatilities on the surface at t (in our data q_t = 374 for all t, but in general it could change over time). The loss function of the feature prediction part is given by

L_P(θ_P) = (1/(T−1)) Σ_{t=1}^{T−1} ||p(Z¹_t, Z²_t, Z³_t; θ_P) − Z_{t+1}||².

We minimize L_P(θ_P) to train the LSTM model. For the construction of the implied volatility surface, one can set the loss function as

L_S(θ_S) = (1/T) Σ_{t=1}^{T} (1/q_t) Σ_{(m_i, τ_i) ∈ I_{d,t}} (f(m_i, τ_i, F_t; θ_S) − σ_t(m_i, τ_i))².

However, minimizing L_S(θ_S) to train the surface construction model cannot guarantee that the output surface is arbitrage free. By design, the output of the DNN model satisfies the first two conditions in Proposition 1, but it does not necessarily fulfil the other three. Inspired by Zheng et al. (2019) and Ackerer et al. (2020), we incorporate Conditions 3, 4 and 5 into our training by formulating them as penalties in the loss function.

First, we create synthetic grids to facilitate the calculation of the penalty functions: I_C34 = [m_min, m_max]_40 × [0, τ_max]_40, while I_C5 is built in the same way but with moneyness levels far to the left of m_min and far to the right of m_max. Here m_min = log(0.6), m_max = log(2), [a, b]_40 means a uniform grid over the interval [a, b] with it divided into 40 equal parts, and τ_max = 730/365. The grid I_C34 is used for the penalty calculation associated with Conditions 3 and 4, while I_C5 is used for Condition 5. These grids are different from the (m, τ)-grid in (5) used for sampling. In particular, I_C34 has 1600 points, a lot more than the 154 points of the sampling grid, and it also covers a much wider range for both m and τ. We use such a dense grid on a wide region to reduce the chance of missing points on the surface at which there is significant violation of the no-arbitrage conditions. As Condition 5 concerns the large-moneyness behavior, we evaluate it at moneyness levels that are extremely negative or positive.

We denote by L_Cj(θ_S) the penalty function for the j-th condition (j = 3, 4, 5). For Conditions 3 and 4, they are given by

L_C3(θ_S) = Σ_{t=1}^T Σ_{(m_i, τ_i) ∈ I_C34} max(−ℓ_cal(m_i, τ_i, F_t; θ_S), 0),
L_C4(θ_S) = Σ_{t=1}^T Σ_{(m_i, τ_i) ∈ I_C34} max(−ℓ_but(m_i, τ_i, F_t; θ_S), 0),

where ℓ_cal(m_i, τ_i, F_t; θ_S) and ℓ_but(m_i, τ_i, F_t; θ_S) are defined as in (2) and (3) with σ replaced by f. Condition 5 is equivalent to requiring that the second-order derivative of σ²(m, τ) with respect to m goes to zero as |m| → ∞, where

∂²_mm σ²(m, τ) = 2(σ(m, τ) ∂²_mm σ(m, τ) + (∂_m σ(m, τ))²).

Hence, the penalty is

L_C5(θ_S) = Σ_{t=1}^T Σ_{(m_i, τ_i) ∈ I_C5} (∂²_mm f²(m_i, τ_i, F_t; θ_S))².

Finally, we obtain our loss function for training the DNN model as

L_C(θ_S) = L_S(θ_S) + λ (L_C3(θ_S) + L_C4(θ_S) + L_C5(θ_S))

for some λ > 0. One could use a separate penalization parameter for each penalty, but for simplicity we assume they are the same. We minimize L_C(θ_S) to train the DNN model. In our implementation, we choose λ = 1, the value used by Ackerer et al. (2020) in their penalized loss function. We also tried other values for λ and found that λ = 1 results in the smallest error for the IVS on the training data, with the penalties converging to zero quickly.
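To illustrate how the calendar and butterfly quantities in (2) and (3) can be evaluated for a neural-network surface, the sketch below uses PyTorch autograd to differentiate the total variance w = f²τ. The helper names are ours, and f stands for any differentiable function mapping tensors (m, τ) elementwise to implied volatilities; this is a sketch of the penalty evaluation, not the authors' implementation.

```python
import torch

def arbitrage_terms(f, m, tau):
    """Compute l_cal = dw/dtau and the butterfly quantity l_but of (3),
    where w(m, tau) = f(m, tau)^2 * tau, via automatic differentiation."""
    m = m.clone().requires_grad_(True)
    tau = tau.clone().requires_grad_(True)
    w = f(m, tau) ** 2 * tau
    w_m, w_tau = torch.autograd.grad(w.sum(), (m, tau), create_graph=True)
    w_mm, = torch.autograd.grad(w_m.sum(), m, create_graph=True)
    l_cal = w_tau
    l_but = (1 - m * w_m / (2 * w)) ** 2 \
            - (w_m ** 2 / 4) * (1 / w + 1 / 4) + w_mm / 2
    return l_cal, l_but

def hinge_penalty(l):
    """Penalize negative values, pushing l >= 0 on the synthetic grid."""
    return torch.clamp(-l, min=0).mean()
```

Because `create_graph=True` is used, the penalties remain differentiable in the network parameters and can be added to the data loss during training.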
Remark 2. To rule out static arbitrage, Conditions 3 and 4 in Proposition 1 must hold for every pair (m, τ). However, in the implementation we cannot check them at every point in the (m, τ) space, so we consider a dense grid over a wide region (see I_C34). Condition 5 specifies the limiting behavior of σ²(m, τ) as |m| → ∞. In our implementation, we can only check this condition for very large values of |m| (see I_C5). It is very unlikely that the surface from our DNN model violates these constraints at points not in I_C34 or I_C5 (see Figure 6 for the values of these penalties on the test data, which are zero once the DNN model has been trained for a sufficient number of epochs). To be strict, however, one can say our DNN model yields an IVS almost free of static arbitrage.

Our framework can also be used to simulate the IVS over time. We can write the feature transition equation as

Z_{T+1} = p(Z¹_T, Z²_T, Z³_T; θ_P) + ε_{T+1} (8)

or

Z_{T+1} = p(Z¹_T, Z²_T, Z³_T; θ_P) ⊙ (1 + ε_{T+1}), (9)

where ε_{T+1} is the error vector at T + 1. In (8) we assume additive error and in (9) we assume multiplicative error. The multiplicative formulation is more convenient than the additive one when positivity of Z_{T+1} is required. We assume the error process ε_1, ε_2, ... is an i.i.d. white noise with mean zero and covariance matrix Σ_ε. After obtaining the estimate of θ_P by minimizing the loss function L_P(θ_P), one can calculate the error vector on each day and hence obtain a sample of the errors. We can assume the error vector follows a multivariate parametric distribution F_η with parameter vector η (e.g., Gaussian) and estimate η from the error sample. The simulation of Z_{T+1} given the available information at T consists of the following steps.

• Step 1: Calculate p(Z¹_T, Z²_T, Z³_T; θ_P).
• Step 2: Simulate ε_{T+1} from F_η or by bootstrapping from the error sample.
• Step 3: Calculate Z_{T+1} by (8) or (9).
• Step 4: Calculate σ_{T+1}(m, τ) = f(m, τ, h(Z_{T+1})).

The DNN model f ensures that the output IVS is free of static arbitrage.
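A minimal sketch of one simulation draw, using bootstrap residuals and the additive form (8); `predictor` stands for the fitted Step 1 model, and all names are our illustrative choices under those assumptions.

```python
import numpy as np

def simulate_next_features(predictor, Z_hist, residuals, rng):
    """One draw of Z_{T+1}: Steps 1-3 of the simulation procedure with
    additive errors (8) and bootstrap resampling of fitted residuals."""
    z_mean = predictor(Z_hist)                     # Step 1: p(Z^1_T, Z^2_T, Z^3_T)
    eps = residuals[rng.integers(len(residuals))]  # Step 2: bootstrap eps_{T+1}
    return z_mean + eps                            # Step 3: additive transition (8)

# Step 4 then evaluates the arbitrage-free DNN surface, e.g.
# sigma_next = construct_surface(h(Z_next), m, tau)
```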
Recall that our dataset consists of daily implied volatilities for S&P 500 index options from January 1, 2009 to December 31, 2020, with a total of 3021 trading days. We split the data into training and test sets. The training set is from January 1, 2009 to June 27, 2018 (about 9.5 years) while the test set is from June 28, 2018 to December 31, 2020 (about 2.5 years). In particular, the US stock market crash in 2020 due to the COVID-19 pandemic is included in the test period. On each day, we observe implied volatilities for 374 pairs of (m, τ). As we only have a limited amount of training data (about 2390 days), we do not further partition it to create a validation set for hyperparameter tuning.

The details of the three feature extraction methods can be found in Section 3.1. For each day in the dataset, we extract features using these three methods as follows:

• For SAM, we use Σ̄_t as the feature vector (see (6)), which is a set of implied volatilities on a (m, τ)-grid with 154 points, to represent the entire surface at t.
• For PCA, we follow Cont and Da Fonseca (2002) and use the first three eigenmodes (i.e., K = 3 in (7)), which already explain over 96% of the variation in our data.
• For VAE, the FNNs for the encoder and the decoder both have three hidden layers with 128 nodes per layer. We try five values for the latent dimension d: 2, 5, 10, 15, 20. Their performance on the test data is shown in Table 4 and the differences are small, indicating that the performance of the VAE model is quite robust to the choice of d. The VAE model with d = 10 achieves the smallest out-of-sample prediction error.

We use the LSTM model to predict the extracted features as discussed in Section 3.2. For the DNN model for surface construction, we use three hidden layers with fifty neurons on each layer. In the training of both models, we do the following:

• We initialize the parameters using Xavier initialization (Glorot and Bengio (2010)), which prevents the initial weights in a deep network from being either too large or too small. This method sets the weights of the j-th layer to follow a uniform distribution given by

W^(j) ∼ U[−√6/√(n_j + n_{j+1}), √6/√(n_j + n_{j+1})],

where n_j is the number of neurons on the j-th layer.
• We use the Adam optimizer with minibatches (Kingma and Ba (2014)) to minimize the loss function. Calculating the gradient of the loss function using all the samples can be computationally expensive, so in each iteration we only use a minibatch (i.e., a subset) of samples for the gradient evaluation. The Adam optimizer is a popular gradient-descent algorithm, which utilizes exponentially weighted averages of the gradients to accelerate convergence to the minimum.
• We apply batch normalization to the inputs of the neural network (Ioffe and Szegedy (2015)). For all the samples in a minibatch, we first estimate the mean and standard deviation of each input in this minibatch and then normalize the input by subtracting its estimated mean and dividing by its estimated standard deviation.

Values of the hyperparameters associated with training and the hidden size of the models (i.e., the number of neurons on a hidden layer) are displayed in Table 2.

Figures 5 and 6 show the loss on the training data and test data as the number of epochs increases. In Figure 6, we only plot the DNN results for the model with features extracted from the sampling approach; results for the other two feature extraction approaches are similar. From Figure 5, we can see that there is no overfitting for the LSTM model as the test loss is close to the training loss. Similarly, there is no overfitting for the DNN model, as shown by Figure 6(a). The values of the penalties for the three no-arbitrage conditions also become zero on the test data eventually, so there is no violation of these conditions on the synthetic grids. It should be noted that although there are some spikes in the calendar spread penalty on the training data, the largest value is still very small, so the violation is insignificant.

Let θ̂_P and θ̂_S be the estimated parameters from the training data for the LSTM model and the DNN model, respectively. Suppose the time index of the last day in the training period is T_train and that of the last day in the whole dataset is T_total. Set T_test = T_total − T_train, which is the number of days in the test period. We do the out-of-sample test as follows: for every t > T_train,

σ̂_t(m, τ) = f(m, τ, h(Ẑ_t); θ̂_S), (m, τ) ∈ I_{d,t}.

Here, I_{d,t} is the set of (m, τ)-pairs in the observed implied volatility data at t, which contains 374 points. It is important to note that it is different from I_0, the set of (m, τ)-pairs used for sampling the surface, which has only 154 points. The error for a pair (m, τ) is given by

ê_t(m, τ) = σ̂_t(m, τ) − σ_t(m, τ),

where σ_t(m, τ) is the observed implied volatility at t for this pair (the ground truth). The error not only reflects the prediction error of the LSTM model for the features, but also the interpolation error of the DNN model. To evaluate the overall out-of-sample prediction performance, we consider two commonly used error measures: the root mean squared error (RMSE) and the mean absolute percentage error (MAPE). They are defined as

RMSE = √[(1/T_test) Σ_{t > T_train} (1/|I_{d,t}|) Σ_{(m,τ) ∈ I_{d,t}} ê_t(m, τ)²],
MAPE = (1/T_test) Σ_{t > T_train} (1/|I_{d,t}|) Σ_{(m,τ) ∈ I_{d,t}} |ê_t(m, τ)|/σ_t(m, τ).

In our data, I_{d,t} contains different (m, τ) pairs for different t, but |I_{d,t}| = 374 for all t.
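For concreteness, a small sketch of the two error measures, assuming per-day arrays of predictions and ground-truth vols are already assembled (function and argument names are ours):

```python
import numpy as np

def rmse_mape(pred_by_day, true_by_day):
    """pred_by_day, true_by_day: lists of equal-length arrays, one per test day."""
    sq, ape = [], []
    for p, y in zip(pred_by_day, true_by_day):
        e = p - y
        sq.append(np.mean(e ** 2))           # daily mean squared error
        ape.append(np.mean(np.abs(e) / y))   # daily mean absolute pct error
    return np.sqrt(np.mean(sq)), np.mean(ape)
```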
We examine various models. In the first step, there are three feature extraction approaches: SAM, PCA and VAE. In the second step, we consider two methods: the DNN model and the DFW model in (4) applied to F_t to predict σ_t(m, τ) at time t. The DNN model yields an arbitrage-free surface whereas the DFW interpolation model cannot guarantee one. This leads to six models for comparison: SAM-DNN, SAM-DFW, PCA-DNN, PCA-DFW, VAE-DNN, VAE-DFW. We also consider a classical benchmark given by the DFW model: at time T, we simply forecast the IVS at T + 1 from the DFW model with its coefficients given by their estimates at T.

The performance of these models on the test dataset is shown in Table 3. For any pair of models, we also perform the Diebold-Mariano (DM) test (Diebold and Mariano (2002)) to assess the statistical significance of the difference in forecast performance as measured by RMSE, and the p-value is shown in Table 5. Consider models 1 and 2. In the DM test, the null hypothesis is that their forecast errors are equal, while the alternative hypothesis is that the forecast error of model 1 is less than that of model 2. Table 5 should be read as follows: for any entry of the table, the model on its row is model 1 and the model on its column is model 2. We use 1% as the significance level.

Several observations can be made from Tables 3 and 5.

(1) The best performers in out-of-sample prediction are SAM-DNN and VAE-DNN. The DM test shows their difference is statistically insignificant, so they can be considered equally good. Both of them outperform the other models with overwhelmingly strong statistical evidence. In particular, these two models constructed in the proposed two-step framework beat the classical DFW model by a large margin.

(2) The out-of-sample error of SAM-DNN and VAE-DNN is only about one third of the error of PCA-DNN. This result highlights the importance of feature selection in step 1 for predicting the IVS. While the PCA approach is good for understanding the main factors that drive the IVS movements, the approximation based on a linear combination of eigensurfaces is not sufficiently accurate for predicting the IVS. In contrast, the VAE approach is more flexible as it combines the latent factors in a nonlinear way. Thus, its improvement over PCA can be expected, and this is confirmed by the results. The sampling approach can be deemed a nonparametric way to represent the surface, which leads to a high-dimensional feature vector (in our data its dimension is 154). Thanks to the power of LSTM, we are able to predict it quite accurately. Using a powerful model like LSTM for feature prediction is key to the success of the sampling approach.

(3) The DNN model for IVS construction in step 2 also makes significant contributions to improving prediction accuracy. For each feature extraction method, using the DNN model outperforms using the DFW model for surface construction in step 2 with statistical significance. The improvement in prediction accuracy is already considerable for SAM and even more substantial for VAE.

(4) It is worth noting that the SAM-DFW model also performs quite well in prediction. It is simpler than the SAM-DNN model as it uses the simple DFW model instead of the complex DNN model for surface construction in step 2.

We further plot the RMSE and MAPE of each day in the test period for four models in Figure 7. Both SAM-DNN and VAE-DNN are better than PCA-DNN throughout the period, and they also outperform DFW on most days.
However, the error of all four models spikes up in March 2020, during which the US stock market suffered a meltdown due to the pandemic. The relatively big error signals a shift in the market regime in that period. The features we use in all the models are extracted from the implied volatility data, which do not provide a direct representation of the market regime. To improve performance, one can further augment the feature vector with exogenous variables like index return and VIX, which are proxies of the market regime.

To check whether the predicted surfaces admit static arbitrage, we compute min(ℓ_cal(m, τ), 0) and min(ℓ_but(m, τ), 0). The no-arbitrage constraints require ℓ_cal(m, τ) and ℓ_but(m, τ) to be nonnegative for any (m, τ). We check ℓ_cal(m, τ) and ℓ_but(m, τ) on the (m, τ) grid of the observed implied volatilities for each test day. A negative value at a pair (m, τ) indicates a violation at this point, and we calculate the average of the negative values over all test days. The results are reported for four models in Table 6. While no violation is detected for the three models that use DNN in step 2, prediction under the DFW model yields implied volatility surfaces with arbitrage opportunities, calendar spread arbitrage in particular.

Table 6: Violation of no-arbitrage conditions for calendar spread and butterfly arbitrage.

Remark 3. For the PCA approach, we have also tried predicting each expansion coefficient using a separate AR(1) model (autoregressive model of order one), which is used in Cont and Da Fonseca (2002) for modeling the dynamics of these coefficients. One can view PCA(AR)+DFW as a simple model constructed from classical statistical techniques without machine learning. Table 7 shows that predicting the PCA coefficients using AR and using LSTM are very close in prediction performance for the IVS.

We have developed a flexible two-step framework for predicting and simulating the implied volatility surface dynamically without static arbitrage. The first step constructs features to represent the IVS and predicts/simulates them. The second step constructs an IVS without static arbitrage through a deep neural network model from the predicted/simulated features. Using this framework, we develop two models that are quite successful in predicting the IVS. One of them extracts features by directly sampling the IVS data on a grid, and the other extracts latent factors through the encoder neural network of the variational autoencoder. Both models significantly outperform a classical parametric model.

The prediction accuracy of our models can be further improved in the following ways. First, we can add exogenous variables such as index return and VIX to the feature vector to represent the market regime. We expect these features to boost the prediction power of our models when the market is under stress. Second, we can use more sophisticated deep learning models for predicting the features; they might be particularly helpful if the feature vector is high dimensional and exhibits complex behavior. In future research, we also plan to apply our models to financial applications such as hedging, risk management and options trading.

References
Ackerer et al. (2020). Deep smoothing of the implied volatility surface.
Almeida et al. (2021). Can a machine correct option pricing models? Available at SSRN 3835108.
Audrino and Colangelo (2010). Semi-parametric forecasts of the implied volatility surface using regression trees.
Bergeron et al. (2021). Variational autoencoders: A hands-off approach to volatility.
Bernales and Guidolin (2014). Can we forecast the implied volatility surface dynamics of equity options? Predictability and economic value tests.
Bloch (2019). Neural networks based dynamic implied volatility surface.
Bloch and Böök (2020). Predicting future implied volatility surface using TDBP-learning.
Borovykh et al. (2017). Conditional time series forecasting with convolutional neural networks.
Cao et al. (2020). A neural network approach to understanding implied volatility movements.
Chen and Zhang (2019). Forecasting implied volatility smile surface via deep learning and attention mechanism.
Cont and Da Fonseca (2002). Dynamics of implied volatility surfaces.
Corsi (2009). A simple approximate long-memory model of realized volatility.
Dellaportas and Mijatović (2014). Arbitrage-free prediction of the implied volatility smile.
Diebold and Mariano (2002). Comparing predictive accuracy.
Dumas et al. (1998). Implied volatility functions: Empirical tests.
Fengler (2009). Arbitrage-free smoothing of the implied volatility surface.
Fengler et al. (2007). A semiparametric factor model for implied volatility surface dynamics.
Fengler et al. (2003). The dynamics of implied volatilities: A common principal components approach.
Fengler and Hin (2015). Semi-nonparametric estimation of the call-option price surface under strike and time-to-expiry no-arbitrage constraints.
Gatheral (2004). A parsimonious arbitrage-free implied volatility parameterization with application to the valuation of volatility derivatives. Presentation at Global Derivatives & Risk Management.
Gatheral and Jacquier (2014). Arbitrage-free SVI volatility surfaces.
Glorot and Bengio (2010). Understanding the difficulty of training deep feedforward neural networks.
Goncalves and Guidolin (2006). Predictable dynamics in the S&P 500 index options implied volatility surface.
Gulisashvili (2012). Analytically Tractable Stochastic Stock Price Models.
Härdle (1990). Applied Nonparametric Regression.
Hochreiter and Schmidhuber (1997). Long short-term memory.
Horvath et al. (2021). Deep learning volatility: A deep neural network perspective on pricing and calibration in (rough) volatility models.
Ioffe and Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift.
Kingma and Ba (2014). Adam: A method for stochastic optimization.
Kingma and Welling (2013). Auto-encoding variational Bayes.
Ning et al. (2021). Arbitrage-free implied volatility surface generation with variational autoencoders.
Orosi (2015). Arbitrage-free call option surface construction using regression splines.
Roper (2010). Arbitrage free implied volatility surfaces.
Sezer et al. (2020). Financial time series forecasting with deep learning: A systematic literature review.
Sirignano and Cont (2019). Universal features of price formation in financial markets: Perspectives from deep learning.
Sirignano et al. (2018). Deep learning for mortgage risk.
Sirignano (2019). Deep learning for limit order books.
Skiadopoulos et al. (2000). The dynamics of the S&P 500 implied volatility surface.
Yan et al. (2018). Parsimonious quantile regression of financial asset tail dynamics via sequential learning.
Zeng and Klabjan (2019). Online adaptive machine learning based algorithm for implied volatility surface modeling. Knowledge-Based Systems.
Zheng et al. (2019). Gated deep neural networks for implied volatility surfaces.

Acknowledgments. The first two authors were supported by Hong Kong Research Grants Council General Research Fund Grant 14206020. The third author was supported by National Science Foundation of China Grant 11801423 and by Shenzhen Basic Research Program Project JCYJ20190813165407555.