key: cord-0575110-f3wcj72u
authors: Lutkebohmert, Eva; Schmidt, Thorsten; Sester, Julian
title: Robust deep hedging
date: 2021-06-18
journal: nan
DOI: nan
sha: 98aeb753d2ca2129c75d3f1ae23830fd22dd0655
doc_id: 575110
cord_uid: f3wcj72u

We study pricing and hedging under parameter uncertainty for a class of Markov processes which we call generalized affine processes and which includes the Black-Scholes model as well as the constant elasticity of variance (CEV) model as special cases. Based on a general dynamic programming principle, we are able to link the associated nonlinear expectation to a variational form of the Kolmogorov equation which opens the door for fast numerical pricing in the robust framework. The main novelty of the paper is that we propose a deep hedging approach which efficiently solves the hedging problem under parameter uncertainty. We numerically evaluate this method on simulated and real data and show that the robust deep hedging outperforms existing hedging approaches, in particular in highly volatile periods.

Uncertainty, as coined by Frank Knight, refers to the case where a number of models (technically: probability measures) are available and one is not able to distinguish between them. This applies for example to the prediction of the evolution of a stock in the future. Even if we have a reliable and rich source of historic information, predicting the future evolution, the future variance or even the whole future distribution is highly complicated. On the one hand, this is due to the classical estimation problem: estimated parameters come with confidence intervals which need to be taken into account for the prediction. On the other hand, in particular in financial markets, changes in the underlying dynamics are the rule rather than the exception, and additional uncertainty and model risk come into effect, resulting in a widening of confidence intervals. For pricing, one can efficiently rely on the calibration to option surfaces with all its difficulties. For hedging, when one wants to incorporate the performance under the objective measure and not under the risk-neutral one, this becomes much more challenging.

Our paper addresses exactly this setting and suggests a deep learning approach for hedging under parameter uncertainty. The basis for our work is the recently developed class of affine processes under parameter uncertainty, see Fadina et al. (2019), which we simply call nonlinear affine processes, referring to the associated nonlinear expectation arising from the pricing problem under uncertainty in this class. We extend this approach to those Markovian processes which satisfy

$$dX_t = (b_0 + b_1 X_t)\,dt + (a_0 + a_1 X_t)^{\gamma}\,dW_t, \tag{1.1}$$

where we allow for parameter uncertainty in all the parameters b 0 , b 1 , a 0 , a 1 , and γ. We develop the theory for this class of processes which we call nonlinear generalized affine (NGA) processes. The robust pricing problem is solved by utilizing a general dynamic programming principle and establishing the nonlinear Kolmogorov equation, which opens the door for fast (and well-known) numerical approaches. In order to solve the hedging problem under parameter uncertainty, we rely on a deep learning approach. To the best of our knowledge, this is the first attempt of this kind. We numerically evaluate this method first on simulated data and show that the robust deep hedging outperforms existing hedging approaches when parameter uncertainty is present. For a realistic data application, we consider the COVID-19 period. In this period, stock markets experienced unexpectedly high volatility and variation in the price paths, which poses a huge challenge to classical hedging approaches. When applying robust methods, the first challenge is to find reliable estimates for the parameter intervals specifying the uncertainty in the considered model class. We propose a sliding-window maximum-likelihood estimation approach for this whose maximal and minimal parameter estimates lead to the targeted intervals. With this uncertainty specification at hand, we are able to show that in the considered data examples the robust deep hedging approach leads to a remarkably smaller hedging error in comparison to classical hedging strategies.

Our paper relates to a rich stream of literature motivated by parameter uncertainty, dating back to Avellaneda et al. (1995), Wilmott and Oztukel (1998), and Fouque and Ren (2014). More recent contributions are Cohen and Tegnér (2017), Barnett et al. (2020), Aksamit et al. (2020), Cheridito et al. (2017), Akthari et al. (2020). In the context of option pricing and efficient hedging, Bouchard et al. (2015), Hou and Obłój (2018) developed approaches respecting an ambiguity set of possible underlying probability measures and Acciaio et al. (2016), Beiglböck et al. (2013), Cox and Obłój (2011), Dolinsky and Soner (2014), Hobson (1998), Lütkebohmert and Sester (2019), Nadtochiy and Obłój (2017), Neufeld and Sester (2021b) introduced approaches to entirely model-free option pricing and to model-free super-replication.

Further, our paper contributes to the recent literature on deep learning approaches in hedging, starting from the seminal work Buehler et al. (2019) and followed by Gümbel and Schmidt (2020), Cuchiero et al. (2020), Cao et al. (2021), Carbonneau (2021), Carbonneau and Godin (2021), Chen and Wan (2021), Eckstein et al. (2021), Gierjatowicz et al. (2020), Horváth et al. (2021), Neufeld and Sester (2021a), amongst many others (see also Ruf and Wang (2020) for a review).

The remainder of the paper is organized as follows: In Section 2 we introduce the theoretical basis for NGA processes. In Section 3 we introduce the robust hedging approach, illustrated with simulated examples, while in Section 4 we apply robust hedging to real data. Section 5 concludes and the appendix contains some proofs.

In this section we extend the notion of affine diffusions under parameter uncertainty to a more general setting. To this end, consider a state space $E$ which is either $\mathbb{R}$ or $\mathbb{R}_{>0}$. We start with the setting without parameter uncertainty.

A generalized affine diffusion is a continuous semimartingale $X$ which is a unique strong solution of the stochastic differential equation (SDE)

$$dX_t = (b_0 + b_1 X_t)\,dt + (a_0 + a_1 X_t)^{\gamma}\,dW_t, \tag{2.1}$$

with suitably chosen $b_i, a_i \in \mathbb{R}$, $i = 0, 1$, $\gamma \in [1/2, 1]$ and initial value $X_0 = x \in E$. Here, $W$ denotes a standard Brownian motion. If we choose $\gamma = 1/2$ we obtain the well-known special case of a (continuous) affine process. In Proposition A.2 in the appendix we utilize the classical existence result of Engelbert and Schmidt (1985a,b) to show that a generalized affine diffusion exists on a proper state space.

Fix a time horizon $T > 0$ and consider $\Omega = C([0, T])$ as the canonical space of continuous one-dimensional paths. Denote by $\mathcal{F}$ the Borel-σ-algebra on $\Omega$. Let $X$ be the canonical process $X_t(\omega) = \omega_t$ for $\omega \in \Omega$ and $t \in [0, T]$ and denote by $\mathbb{F} = (\mathcal{F}_t)_{t \in [0,T]}$ the filtration generated by $X$.

Let $\mathcal{P}(\Omega)$ be the set of all probability measures on $(\Omega, \mathcal{F})$. A probability measure $P \in \mathcal{P}(\Omega)$ is called a semimartingale law for the process $X$ if there exists a process $B^P$ with continuous paths of (locally) finite variation $P$-a.s. and a continuous local $P$-martingale $M^P$ with $B^P_0 = M^P_0 = 0$ such that $X = X_0 + B^P + M^P$. Intuitively, this describes the setting when $X$ is a continuous semimartingale under $P$ which is given as a sum of the integrated drift process $B^P$ and a local martingale $M^P$.

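A continuous semimartingale $X = X_0 + B^P + M^P$ is said to admit absolutely continuous characteristics $(B^P, C)$ with $C = \langle M^P \rangle$ if there exist predictable processes $\beta^P$ and $\alpha > 0$ such that

$$B^P_t = \int_0^t \beta^P_s\,ds, \qquad C_t = \int_0^t \alpha_s\,ds.$$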

In the case that a continuous semimartingale possesses absolutely continuous characteristics, we can directly consider the drift (instead of the integrated drift).

Next, we introduce parameter uncertainty in the spirit of Frank Knight. Recall that a generalized affine diffusion is characterized by the five parameters $b_0$, $b_1$, $a_0$, $a_1$ and $\gamma$. The targeted uncertainty we are interested in can be described as follows: instead of assuming the parameter $\theta$ to be known exactly, we introduce an interval $[\underline{\theta}, \bar{\theta}]$ and consider each value in the interval equally likely. Taking into account this parameter uncertainty leads to a nonlinear setting, which we introduce now.

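Denote the considered parameter intervals by $[\underline{b}_i, \bar{b}_i]$ and $[\underline{a}_i, \bar{a}_i]$ with $i = 0, 1$, and by $[\underline{\gamma}, \bar{\gamma}]$. Denote by

$$\Theta := [\underline{b}_0, \bar{b}_0] \times [\underline{b}_1, \bar{b}_1] \times [\underline{a}_0, \bar{a}_0] \times [\underline{a}_1, \bar{a}_1] \times [\underline{\gamma}, \bar{\gamma}]$$

the corresponding parameter set.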

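To transport this parameter uncertainty to stochastic processes we have to be more careful. In the Markovian setting we consider here, the evolution of the process may depend on the current state $x$ of the process. In this regard, we introduce the associated intervals

$$b(x) := \{ b_0 + b_1 x : b_i \in [\underline{b}_i, \bar{b}_i],\ i = 0, 1 \}, \qquad a(x) := \{ (a_0 + a_1 x^+)^{2\gamma} : a_i \in [\underline{a}_i, \bar{a}_i],\ i = 0, 1,\ \gamma \in [\underline{\gamma}, \bar{\gamma}] \} \tag{2.3}$$

for $x \in \mathbb{R}$, where $(\cdot)^+ := \max\{\cdot, 0\}$, which describe the possible drift and diffusive behaviour of the process $X$ given it is in state $x$.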

Remark 2.1 (On the role of the state space) In the classical one-dimensional affine setting, the state space already defines if the affine process is of Cox-Ingersoll-Ross type (when the state space is R >0 or R ≥0 ) or of Vasiček-type. This is no longer the case when parameter uncertainty is introduced. Indeed, in Fadina et al. (2019) , the non-linear Vasiček-CIR model was introduced which has as state space R and intuitively is able to capture both sorts of dynamics. In interest rate markets, where negative rates have been neglected for a long time, such an approach efficiently avoids the model risk by the necessity to choose between the state space R and R ≥0 . In the non-linear setting considered here, this leads to the use of x + in the definition of a(x). This ensures non-negativity of the quadratic variation when E = R but does not restrict unnecessarily the dynamics of the generalized affine process.

The description of the generalized affine process under parameter uncertainty is now intuitively given by all those probability laws describing a diffusion where the drift and the volatility always stay in the intervals $b(x)$ and $a(x)$ considered at $x = X_s(\omega)$. This means that we consider all continuous semimartingales whose characteristics stay in the parameter uncertainty bounds.

More precisely, we introduce the following notion: a nonlinear generalized affine process (NGA) starting in $x \in E$ at time $t \in [0, T]$ is the family of all absolutely continuous semimartingale laws $\mathcal{A}(t, x, \Theta)$, such that for each $P \in \mathcal{A}(t, x, \Theta)$ giving rise to the differential characteristics $(\beta^P, \alpha)$ we have

$$\beta^P_s \in b(X_s), \qquad \alpha_s \in a(X_s)$$

$dt \otimes dP$-almost surely on $(t, T] \times \Omega$ and $P(X_t = x) = 1$. We call $P$ generalized affine dominated by $\Theta$ on $(t, T]$ or simply GA-dominated by $\Theta$. Note that non-negativity of the quadratic variation is ensured by using $(\cdot)^+$ in the definition of $a$ in Equation (2.3).

In order to study derivative prices under parameter uncertainty, we consider a European claim with maturity $T$ and payoff $\psi(X_T)$. Here, $\psi$ is an integrable function $\psi : E \to \mathbb{R}$. Since in our robust setting we do not have a single probability measure at hand, but a family of measures which we treat as equally likely, a natural candidate for pricing is the worst-case price: the price which dominates all prices computed under the probability measures we consider.

In this regard, define the value function $v : [0, T] \times E \to \mathbb{R}$ by

$$v(t, x) := \sup_{P \in \mathcal{A}(t,x,\Theta)} E_P[\psi(X_T)].$$

Analogously one can define a lower bound of possible prices $\inf_{P \in \mathcal{A}(t,x,\Theta)} E_P[\psi(X_T)]$, which, due to the relation

$$\inf_{P \in \mathcal{A}(t,x,\Theta)} E_P[\psi(X_T)] = -\sup_{P \in \mathcal{A}(t,x,\Theta)} E_P[-\psi(X_T)],$$

can be studied with identical methods. We therefore focus on the upper bound. A central tool for establishing a nonlinear Kolmogorov equation and therefore the tractability of the setting is the dynamic programming principle. Intuitively it states that if we consider a stopping time between $t$ and $T$, compute the price (the value function) at that stopping time and take expectations of this random quantity, we obtain the value function at time $t$. Thus, the value cannot be improved by stopping, however skilled.

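Proposition 2.2 Consider a nonlinear generalized affine process with state space $E$ and a stopping time $\tau$ on $[t, T]$. For any $(t, x) \in [0, T] \times E$,

$$v(t, x) = \sup_{P \in \mathcal{A}(t,x,\Theta)} E_P[v(\tau, X_\tau)]. \tag{2.5}$$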

The proof of this result follows similarly to the proof of the dynamic programming principle in Fadina et al. (2019) which is based on Theorem 2.1 in El Karoui and Tan (2013) . The necessary measurability and stability conditions which are proved for the affine case in Lemma 1 and 2 of Fadina et al. (2019) can be shown similarly for the generalized affine case.

Note that the computation of the value function according to Equation (2.5), or the robust upper price of a derivative, is not as easily accessible by Monte-Carlo estimation as in the classical case. Indeed, as is clear from Equation (2.5), it is not sufficient to simulate different paths under different distributions; instead, we need to obtain Monte-Carlo estimates of expectations under a fixed probability measure P and then need to find the supremum of these expectations. If no monotonicity can be exploited, this will be difficult to compute.
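The structure of this computation (one Monte-Carlo estimate per fixed measure, followed by a maximization) can be illustrated with a minimal Python sketch. It is not the method of the paper: it only considers measures with constant parameters located at the corners of Θ, so it yields at best a lower bound for the supremum over all of A(t, x, Θ). The function names, the intervals and the use of the positive part in the volatility are assumptions made for illustration.

```python
import itertools
import math
import random


def mc_price(theta, payoff, x0, T, n_steps, n_paths, seed=0):
    """Monte-Carlo estimate of E[payoff(X_T)] for one fixed parameter vector."""
    b0, b1, a0, a1, gamma = theta
    rng = random.Random(seed)  # same seed for every theta: common random numbers
    dt = T / n_steps
    total = 0.0
    for _ in range(n_paths):
        x = x0
        for _ in range(n_steps):
            vol = (a0 + a1 * max(x, 0.0)) ** gamma
            x += (b0 + b1 * x) * dt + vol * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        total += payoff(x)
    return total / n_paths


def corner_upper_estimate(bounds, payoff, x0, T, n_steps=30, n_paths=500):
    """Largest fixed-parameter estimate over the 2^5 corners of Theta."""
    estimates = {
        theta: mc_price(theta, payoff, x0, T, n_steps, n_paths)
        for theta in itertools.product(*bounds)
    }
    best = max(estimates, key=estimates.get)
    return best, estimates[best]


# Illustrative intervals for (b0, b1, a0, a1, gamma); not taken from the paper.
bounds = [(-0.1, 0.1), (-0.1, 0.1), (0.3, 0.7), (0.4, 0.6), (0.5, 1.0)]
call = lambda x: max(x - 10.0, 0.0)
best_theta, upper = corner_upper_estimate(bounds, call, x0=10.0, T=30 / 365)
```

Even this restricted search requires 32 separate simulations, and measures whose parameters change over time are not covered at all, which is the reason for turning to a PDE characterization.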

However, a very efficient tool can be developed which is a nonlinear version of the Kolmogorov equations. By relying on numerical methods for nonlinear partial differential equations, we will be able to compute the value function within seconds.

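A central tool for describing Markov processes is the infinitesimal generator. For the generalized affine process $X$ (under no parameter uncertainty, see (2.1)), the infinitesimal generator is given by

$$\mathcal{L}^{\theta} f(x) = (b_0 + b_1 x)\, f'(x) + \tfrac{1}{2} (a_0 + a_1 x)^{2\gamma} f''(x), \qquad \theta = (b_0, b_1, a_0, a_1, \gamma).$$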

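For the nonlinear version of the Kolmogorov equation we will take the worst-case generator, i.e. the supremum over all generators with $\theta \in \Theta$. More precisely, for some integrable $\psi : E \to \mathbb{R}$ consider the nonlinear partial differential equation

$$-\partial_t v(t, x) - \sup_{(b_0, b_1, a_0, a_1, \gamma) \in \Theta} \Big\{ (b_0 + b_1 x)\, \partial_x v(t, x) + \tfrac{1}{2} (a_0 + a_1 x^+)^{2\gamma}\, \partial_{xx} v(t, x) \Big\} = 0 \quad \text{on } [0, T) \times E, \qquad v(T, x) = \psi(x). \tag{2.6}$$

We then obtain the value function as viscosity solution of the nonlinear Kolmogorov equation.

Theorem 2.3 The value function $v(t, x) = \sup_{P \in \mathcal{A}(t,x,\Theta)} E_P[\psi(X_T)]$ is a viscosity solution to the PDE (2.6).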

The result follows by similar arguments as in the proof of Theorem 1 in Fadina et al. (2019). We relegate the proof to the appendix.
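To indicate how such a nonlinear Kolmogorov equation may be treated numerically, the following Python sketch applies an explicit finite-difference scheme backwards in time. It is a sketch under stated assumptions rather than the implementation used for the paper: it assumes the worst-case generator takes the form sup over Θ of (b0 + b1 x) p + ½ (a0 + a1 x⁺)^(2γ) q, it freezes the boundary values at the payoff, and it uses central differences without upwinding. The names `sup_generator` and `robust_upper_price` and the parameter intervals are hypothetical.

```python
import math


def sup_generator(x, p, q, bounds):
    """sup over Theta of (b0 + b1 x) p + 0.5 (a0 + a1 x^+)^(2 gamma) q."""
    (b0_lo, b0_hi), (b1_lo, b1_hi), (a0_lo, a0_hi), (a1_lo, a1_hi), (g_lo, g_hi) = bounds
    # The drift term is linear in b0 and b1, so each is optimal at an endpoint.
    drift = max(b0_lo * p, b0_hi * p) + max(b1_lo * x * p, b1_hi * x * p)
    # The squared volatility is monotone in each of a0, a1, gamma: check corners.
    xp = max(x, 0.0)
    sig2 = [(a0 + a1 * xp) ** (2.0 * g)
            for a0 in (a0_lo, a0_hi) for a1 in (a1_lo, a1_hi) for g in (g_lo, g_hi)]
    return drift + 0.5 * max(s * q for s in sig2)


def robust_upper_price(payoff, bounds, T, x_min, x_max, n_x=100):
    """Explicit scheme: v(t - dt) = v(t) + dt * sup_theta L^theta v(t)."""
    dx = (x_max - x_min) / n_x
    xs = [x_min + i * dx for i in range(n_x + 1)]
    # Largest squared volatility on the grid, used to pick a stable time step.
    sig2_max = max(sup_generator(x, 0.0, 2.0, bounds) for x in xs)
    n_t = max(1, math.ceil(T * sig2_max / (0.9 * dx * dx)))
    dt = T / n_t
    v = [payoff(x) for x in xs]
    for _ in range(n_t):
        new = v[:]  # boundary values stay frozen at the payoff (crude assumption)
        for i in range(1, n_x):
            p = (v[i + 1] - v[i - 1]) / (2.0 * dx)
            q = (v[i + 1] - 2.0 * v[i] + v[i - 1]) / (dx * dx)
            new[i] = v[i] + dt * sup_generator(xs[i], p, q, bounds)
        v = new
    return xs, v


# Illustrative intervals for (b0, b1, a0, a1, gamma); not taken from the paper.
bounds = [(-0.1, 0.1), (-0.1, 0.1), (0.3, 0.7), (0.4, 0.6), (0.5, 1.0)]
call = lambda x: max(x - 10.0, 0.0)
xs, v = robust_upper_price(call, bounds, T=30 / 365, x_min=0.0, x_max=20.0)
upper_at_10 = v[50]  # value at the grid node x = 10
```

The time step is chosen from the largest squared volatility on the grid so that the explicit scheme stays stable; a lower price bound could be obtained in the same way by replacing the supremum with an infimum.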

Now that the setting for NGA processes has been detailed and pricing has been discussed, we come to the main novelty of the paper: the efficient computation of hedging strategies. Our goal is to find a numerical procedure which replaces the classical Monte-Carlo estimation in a robust setting. Motivated by Theorem 2.3, we proceed as follows: first, we discretize in time and utilize the Euler-Maruyama approximation of the generalized affine process as given in Equation (2.1). Second, at each time step, we select a new parameter set θ ∈ Θ by sampling from a uniform distribution. Note that sampling from a uniform distribution corresponds to assigning equal weight to all probability measures under consideration. This seems adequate for a robust hedging approach that can also be applied in situations which are underestimated by methods solely relying on historical data, see also Remark 4.2.1, where we discuss possible extensions.

These processes serve as an approximation of the class A(t, x, Θ). In our robust deep hedging approach we train our network on these samples and determine the hedging function which minimizes the hedging error over all these samples.
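The sampling procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the companion code: the function name, the intervals in `bounds`, the initial value and the use of the positive part inside the volatility term are assumptions.

```python
import math
import random


def sample_nga_path(bounds, x0, times, rng):
    """One Euler-Maruyama path; parameters are redrawn uniformly from Theta at each step."""
    path = [x0]
    for t0, t1 in zip(times[:-1], times[1:]):
        b0, b1, a0, a1, gamma = (rng.uniform(lo, hi) for lo, hi in bounds)
        dt = t1 - t0
        x = path[-1]
        vol = (a0 + a1 * max(x, 0.0)) ** gamma
        path.append(x + (b0 + b1 * x) * dt + vol * math.sqrt(dt) * rng.gauss(0.0, 1.0))
    return path


rng = random.Random(42)
# Illustrative intervals for (b0, b1, a0, a1, gamma); not taken from the paper.
bounds = [(-0.1, 0.1), (-0.1, 0.1), (0.3, 0.7), (0.4, 0.6), (0.5, 1.0)]
T, n = 30 / 365, 30
times = [T * i / n for i in range(n + 1)]
batch = [sample_nga_path(bounds, 10.0, times, rng) for _ in range(256)]
```

Each call produces one path under a different, randomly selected member of the family of measures, so a batch of such paths mixes many parameter scenarios.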

Since this section is mainly on numerics, we take the freedom to generalize the setup of Section 2 slightly by also allowing for path-dependent derivatives. The theoretical subtleties which build the basis for this step can be found in Geuchen and Schmidt (2021) .

A path-dependent derivative allows that the payoff at maturity $T$ depends on the full path of the process $X$ up to time $T$, $(X_t)_{0 \le t \le T}$. We denote the square-integrable payoff by $\Phi_T = \Phi((X_t)_{0 \le t \le T})$.

Our aim is to determine hedging strategies $(h_t)_{0 \le t \le T}$ and cash positions $d \in \mathbb{R}$ such that the quadratic error

$$E_P\Big[\Big(d + \int_0^T h_t\,dX_t - \Phi((X_t)_{0 \le t \le T})\Big)^2\Big] \tag{3.1}$$

is minimized for all $P \in \mathcal{A}(0, x_0, \Theta)$. This formulation is a consequence of the considered model ambiguity, under which every measure from $\mathcal{A}(0, x_0, \Theta)$ is taken into account. In the following we develop a deep learning approach to compute the hedging strategy $h$.

First, we discretize the interval $[0, T]$ through $0 = t_0 \le t_1 \le \cdots \le t_n = T$. Next, we approximate the hedging strategy $h_{t_i}$ at grid point $t_i$ through neural networks. We start with a precise definition of neural networks, referring to Petersen (2020) for a detailed mathematical study on this topic.

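Let $\varphi : \mathbb{R} \to \mathbb{R}$ be a non-constant function, called activation function. A (feed-forward) neural network with input dimension $d_{in} \in \mathbb{N}$, output dimension $d_{out} \in \mathbb{N}$, $l \in \mathbb{N}$ layers, and activation function $\varphi$ is a function of the form

$$\mathbb{R}^{d_{in}} \to \mathbb{R}^{d_{out}}, \qquad x \mapsto A_l \circ \varphi \circ A_{l-1} \circ \cdots \circ \varphi \circ A_0(x),$$

where $(A_i)_{i=0,\dots,l}$ are affine functions $A_i : \mathbb{R}^{h_i} \to \mathbb{R}^{h_{i+1}}$ with $h_0 = d_{in}$ and $h_{l+1} = d_{out}$, and where the activation function is applied component-wise. The number $h_i \in \mathbb{N}$ is called the number of neurons of layer $i$. We say a neural network is deep if $l \ge 2$, and we denote the class of all neural networks with input dimension $d_{in}$, output dimension $d_{out}$, $l$ layers and activation function $\varphi$ by $\mathcal{NN}^{l,\varphi}_{d_{in},d_{out}}$.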
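For concreteness, a feed-forward network of this type can be evaluated with a few lines of plain Python. The sketch assumes that the network is an alternating composition of affine maps and the component-wise activation, with a final affine map and no activation on the output; the helper names, the random initialization and the small layer widths are illustrative choices, not those of the paper.

```python
import random


def relu(z):
    """Component-wise ReLU activation."""
    return [max(v, 0.0) for v in z]


def affine(M, b, x):
    """Affine map x -> M x + b for a matrix M given as a list of rows."""
    return [sum(m * v for m, v in zip(row, x)) + bi for row, bi in zip(M, b)]


def init_network(d_in, d_out, hidden, rng):
    """Random affine maps for the given hidden-layer widths."""
    sizes = [d_in] + list(hidden) + [d_out]
    return [
        ([[rng.gauss(0.0, 1.0) / n_in ** 0.5 for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(network, x):
    """Apply affine map, activation, affine map, ..., ending with an affine map."""
    for M, b in network[:-1]:
        x = relu(affine(M, b, x))
    M, b = network[-1]
    return affine(M, b, x)


rng = random.Random(0)
h = init_network(d_in=2, d_out=1, hidden=[8, 8, 8, 8], rng=rng)
position = forward(h, [0.0, 10.0])[0]  # inputs: time t and state X_t
```

With input dimension 2 and output dimension 1, such a network maps a pair (t, X_t) to a single hedge position.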

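To solve the minimization problem stated in (3.1) for arbitrary measures $P \in \mathcal{A}(0, x_0, \Theta)$ we sample paths of $(X_{t_i})_{0 \le i \le n}$, where we sample each path under a newly randomly picked measure $P \in \mathcal{A}(0, x_0, \Theta)$, where the parameters are uniformly chosen from $\Theta$ in each time step. We then compute the quadratic hedging error on a batch of samples and optimize the neural network to minimize the quadratic hedging error. This procedure is summarized in Algorithm 1 and builds on the findings from Buehler et al. (2019), where however no ambiguity w.r.t. the choice of the correct underlying probability measure is taken into account.

Algorithm 1

Input: parameter set $\Theta$; hyperparameters of the neural network such as number of layers $l \in \mathbb{N}$, number of neurons, activation function $\varphi$, learning rate of the optimizer; number of iterations $N_{iter}$; batch size $B$; payoff function $\Phi((X_t)_{0 \le t \le T})$; discretization $0 = t_0 \le t_1 \le \cdots \le t_n = T$; initial value $x_0$.

Output: parameter $d$: cash position of the hedging strategy; neural network $h \in \mathcal{NN}^{l,\varphi}_{2,1}$: self-financing strategy, inputs $t$ and $X_t$.

Initialize the parameters of the neural network $h \in \mathcal{NN}^{l,\varphi}_{2,1}$ randomly;
for iteration $= 1, \dots, N_{iter}$ do
    for $j = 1, \dots, B$ do
        Generate paths of the generalized affine process using the Euler-Maruyama method: set $X^j_{t_0} = x_0$ and, for $i = 0, \dots, n-1$, sample $(b_0, b_1, a_0, a_1, \gamma)$ uniformly from $\Theta$ and $\Delta W_i \sim \mathcal{N}(0, t_{i+1} - t_i)$, and set
        $X^j_{t_{i+1}} = X^j_{t_i} + (b_0 + b_1 X^j_{t_i})(t_{i+1} - t_i) + (a_0 + a_1 (X^j_{t_i})^+)^{\gamma}\, \Delta W_i$;
    end
    Minimize $\sum_{j=1}^{B} \big( d + \sum_{i=0}^{n-1} h(t_i, X^j_{t_i}) (X^j_{t_{i+1}} - X^j_{t_i}) - \Phi((X^j_{t_i})_{0 \le i \le n}) \big)^2$ w.r.t. the parameters of $h$ and w.r.t. $d$;
end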
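The objective that is minimized in each training iteration can be written down compactly. The Python sketch below only evaluates the batch-averaged quadratic hedging error for a given strategy and cash position; it does not perform the gradient-based training of the network. The function names, the toy paths and the constant strategy are hypothetical. For a fixed strategy the loss is quadratic in d, so the optimal cash position is the batch mean of payoff minus trading gains, which the last helper computes.

```python
def hedge_gains(path, times, strategy):
    """Discrete trading gains sum_i h(t_i, X_{t_i}) (X_{t_{i+1}} - X_{t_i})."""
    return sum(
        strategy(t, x) * (x_next - x)
        for t, x, x_next in zip(times[:-1], path[:-1], path[1:])
    )


def quadratic_hedging_loss(paths, times, strategy, d, payoff):
    """Batch average of (d + gains - payoff)^2."""
    errors = [d + hedge_gains(p, times, strategy) - payoff(p) for p in paths]
    return sum(e * e for e in errors) / len(errors)


def optimal_cash(paths, times, strategy, payoff):
    """Minimizer of the loss in d for a fixed strategy: the mean residual."""
    residuals = [payoff(p) - hedge_gains(p, times, strategy) for p in paths]
    return sum(residuals) / len(residuals)


# Toy batch of three hand-written paths (illustration only).
times = [0.0, 0.5, 1.0]
paths = [[10.0, 11.0, 12.0], [10.0, 9.0, 8.5], [10.0, 10.5, 9.5]]
call = lambda p: max(p[-1] - 10.0, 0.0)
half = lambda t, x: 0.5  # hypothetical constant strategy
d_star = optimal_cash(paths, times, half, call)
loss = quadratic_hedging_loss(paths, times, half, d_star, call)
```

In the algorithm itself, d and the network parameters are optimized jointly, and the strategy passed in would be the neural network evaluated at (t, X_t).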

The main difference in computational time between Algorithm 1 and the deep hedging approach from Buehler et al. (2019) turns out to be the random sampling of the 5 parameters while creating the paths on which we train our neural networks. This sampling step, which reflects the parameter uncertainty in our approach, is not necessary when applying the approach from Buehler et al. (2019). We found however that the speed difference in practice is not very pronounced. In the setting of Section 3.2.1 and with the neural network architecture as specified in the beginning of Section 3.2, we found that the approach of Buehler et al. (2019) runs approximately 1.24 times faster on a standard computer (318 seconds vs. 396 seconds for the robust approach, for 1,000 iterations of training).

Remark 3.2 (Uniform distribution of parameters) For training the hedging strategy in Algorithm 1 we simulate samples from a nonlinear generalized affine process in the spirit of an Euler-Maruyama scheme: in each time step i we simulate the Euler-Maruyama discretization and choose parameters according to Equation (2.4), i.e. we draw the parameters γ, a 0 , a 1 , b 0 , b 1 uniformly from Θ. This seems to be the natural choice for a robust setting since one is interested in putting equal weights on all possible scenarios. An alternative, but typically more costly strategy, would be to choose a certain discretization of Θ and to consider all gridpoints in each step. Another alternative would be a Bayesian a posteriori distribution, as noted in Remark 4.2.1.

Remark 3.3 (The quadratic loss function) In Algorithm 1 we chose a quadratic loss function which balances gains and losses of the seller and buyer symmetrically and therefore leads to a fair hedging price, which seems reasonable in many practical applications. If, however, one is rather interested in a classical robust hedging which dominates all hedging strategies for each P ∈ A(0, x 0 , Θ), one would choose a loss function which penalizes losses but not gains, for example a risk measure. This procedure was also suggested in Buehler et al. (2019), but is not studied further here.

We apply the presented numerical routine from Algorithm 1 in several examples. For all of the examples in this section we consider a nonlinear generalized affine process with parameters specified through

(3.2)

To train neural networks $h \in \mathcal{NN}^{l,\varphi}_{2,1}$ according to Algorithm 1 we specify their architecture as follows. We apply the ReLU activation function $\varphi(x) = \max\{x, 0\}$ to all layers of the neural network, and the neural networks possess $l = 4$ layers with 256 neurons each. Moreover, Algorithm 1 is implemented in TensorFlow (Abadi et al. (2016)), in which we execute Algorithm 1 with a batch size of 256 and by employing the Adam optimizer (Kingma and Ba (2014)) with standard parameters and a learning rate of 0.005 for the backpropagation step. The Python code used is provided for convenience and can be found at https://github.com/juliansester/nga.

In the following we will evaluate the performance based on the relative hedging error, defined as hedge minus payoff of the derivative divided by the respective price of the hedging strategy. Compared to the (relative) quadratic hedging error this has the advantage that we also observe the direction of the error.
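As a small illustration of this performance measure, the following Python sketch computes relative hedging errors and their mean and standard deviation. It assumes that the hedge is the terminal value d plus trading gains and that the price of the hedging strategy is its initial cash position d; the sample numbers are made up.

```python
import statistics


def relative_hedging_error(d, gains, payoff_value):
    """(hedge - payoff) / price, with hedge = d + gains and price taken to be d."""
    return (d + gains - payoff_value) / d


# Hypothetical (gains, payoff) pairs for three paths and a cash position d.
d = 2.0
samples = [(1.0, 2.5), (-0.5, 0.0), (0.25, 3.0)]
errors = [relative_hedging_error(d, g, p) for g, p in samples]
mean_error = statistics.mean(errors)
std_error = statistics.stdev(errors)
```

A positive value means the hedge ended above the payoff, a negative value that it fell short, which is the directional information the quadratic error does not reveal.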

First, we compute, by applying Algorithm 1, an optimal hedging strategy for fixed parameters $a_0 = 0.5$, $a_1 = 0.5$, $b_0 = 0$, $b_1 = 0$, $\gamma = 1$ (the mean of each of the respective intervals in (3.2)) for a call option with payoff $\Phi_T = (X_T - x_0)^+$ for $T = 30/365$.

Then, for the same derivative, we take into account uncertainty w.r.t. the model parameters as specified in (3.2). By applying Algorithm 1 with n = 30, we determine the optimal hedging strategy under parameter uncertainty. In the left panel of Figure 3.1, we depict the optimal hedging strategy with fixed parameters, obtained by Algorithm 1 after 10,000 iterations, whereas the right panel shows the optimal hedging strategy under parameter uncertainty.

Even though both strategies look very similar at first sight, they perform very differently on random paths generated under parameter uncertainty. For an illustration of this effect, we compute the relative hedging error of both strategies on 50,000 paths that are generated according to the parameters from (3.2), i.e., under parameter uncertainty. The results are displayed in Figure 3.2 and reveal that the hedging strategy which was trained on paths that take into account parameter uncertainty possesses a remarkably smaller hedging error than the strategy which was trained on paths with fixed parameters and which is optimal for these. In Table 3.1 we provide mean and standard deviation of the relative hedging errors, verifying the observation that the robust hedging strategy outperforms the non-robust hedging strategy in this scenario.

While a call option has a high degree of monotonicity, we now explore a more complicated option, a butterfly option. Note that in classical linear pricing, one can obtain the price of a butterfly as the sum of the prices of calls and puts. This is no longer true in the nonlinear case, the case with parameter uncertainty, since the supremum destroys the linearity. Thus, nonlinear pricing in this setting is substantially more involved.

We consider a NGA process with parameters as in Equation (3.2) and a butterfly payoff function given by

We depict the optimal hedging strategy which was computed according to Algorithm 1 with n = 30 in the left panel of Figure 3.3. The relative hedging error evaluated on 50,000 samples, created under uncertainty, is illustrated in the middle panel, while in the right panel we provide a histogram of the difference between the absolute value of the relative hedging error of a hedging strategy trained on paths with fixed parameters $a_0 = 0.5$, $a_1 = 0.5$, $b_0 = 0$, $b_1 = 0$, $\gamma = 1$ and the absolute value of the relative hedging error of the robust strategy. The histogram shows that in most of the samples the relative hedging error of the non-robust hedge is larger than the relative hedging error of the robust hedge, an observation which is also verified through Table 3.1.

Next, we consider an NGA process with parameters as in (3.2) and a lookback call option with payoff function $\Phi_T = \big(\max_{0 \le t \le T} X_t - 12\big)^+$.

We depict the optimal hedging strategy computed with Algorithm 1 with n = 30 in the left panel of Figure 3.4 and the hedging error evaluated on 50,000 samples, created according to uncertainty as in (3.2), in the middle panel of Figure 3.4. In the right panel of Figure 3.4 we compare the hedging error of a trained non-robust strategy with the hedging error of the trained robust strategy. The trained robust strategy clearly outperforms the non-robust strategy on scenarios that were created under uncertainty according to (3.2), which can also be seen in Table 3.1.

Table 3.1: Fixed hedging strategies were trained for the fixed parameters $a_0 = 0.5$, $a_1 = 0.5$, $b_0 = 0$, $b_1 = 0$, $\gamma = 1$ and robust strategies were trained with parameter uncertainty as in (3.2). For the lookback option we also consider a hedging strategy depending on the running maximum (run max), which outperforms the Markovian strategy in the path-dependent case.

As the payoff function is path-dependent, we could improve the hedging performance further by allowing the self-financing hedging strategy $h_t(X_t, \max_{0 \le s \le t} X_s)$ to depend also on the running maximum. We observe that this approach indeed additionally improves the hedging performance to some degree. The results are displayed in the rightmost column of Table 3.1.

The presented robust hedging approach addresses the problem that in many situations robust price bounds such as $\sup_{P \in \mathcal{A}(0,x,\Theta)} E_P[\Phi_T(X_T)]$ are too expensive to have practical relevance (compare e.g. Frey and Sin (1999), Biagini and Frittelli (2004) and Neufeld (2018)).

Robust price bounds may of course still be of interest, for instance to check the market for mispriced derivatives or to compute price bounds when the parameter set Θ is chosen sufficiently small such that this approach leads to meaningful prices. To compute the price bound $\sup_{P \in \mathcal{A}(0,x,\Theta)} E_P[\Phi_T(X_T)]$ one may then solve the corresponding PDE (2.6) by using an explicit finite-difference method, compare also Fadina et al. (2019), where a similar approach in a nonlinear affine setting is pursued, and see the companion code on https://github.com/juliansester/nga for more details.

Figure 3.5: Comparison of the price bounds, computed by using a finite-difference algorithm, with the price of the optimal hedge computed according to Algorithm 1. We assume the parameters from (3.2) and show prices as functions of the initial value x of the stock price X.

In Figure 3.5 we provide the prices of hedging strategies for a call option with strike K = 10 (as in Section 3.2.1) and of a butterfly option as in Section 3.2.2, where we consider the parameters as specified in (3.2). For comparison, we also show price bounds $\sup_{P \in \mathcal{A}(0,x,\Theta)} E_P[\Phi_T(X_T)]$ and $\inf_{P \in \mathcal{A}(0,x,\Theta)} E_P[\Phi_T(X_T)]$ computed with the mentioned explicit finite-difference method. We display prices and price bounds for different initial values of the underlying process.

The results indicate that the prices of the hedging strategies computed according to Algorithm 1, which lie well between the lower and upper price bounds, are of great practical relevance for two reasons. First, the associated prices are neither too low nor too high to be tradable. Second, and in contrast to the prices computed as maximal expectations, the prices come with a trading strategy that allows one to hedge the associated financial derivative under model uncertainty.

In this section, we evaluate the performance of the robust deep hedging strategy on financial data. To this end, we extracted daily closing prices of 20 of the largest constituents of the US stock market index S&P 500 from 26 September 2008 until 09 April 2020 from Thomson Reuters Eikon. This time period shows a high level of uncertainty during the beginning of the COVID-19 pandemic and thus poses a challenging environment for hedging strategies.

To analyze and illustrate the uncertainty present in parameter estimates, we consider parameter estimations on rolling windows. The obtained results allow us to specify the uncertainty set Θ. These results also underline the high degree of uncertainty present in the considered data.

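More precisely, we estimated the parameters under the assumption that the price observations follow generalized affine processes based on data from 26 September 2008 until 03 March 2020 as follows: consider the discretization of the generalized affine process $(X_t)_{t \ge 0}$ according to the Euler-Maruyama scheme,

$$X_{i+1} = X_i + (b_0 + b_1 X_i)\,\Delta t_i + (a_0 + a_1 X_i)^{\gamma}\,\Delta W_i$$

for all values of the process $(X_i)_{1 \le i \le N}$ on $N$ observation dates (daily observations), with time difference $\Delta t_i = 1/250$, and normally distributed $\Delta W_i \sim \mathcal{N}(0, \Delta t_i)$. Then, conditionally on $X_i$, $X_{i+1}$ is normally distributed since

$$X_{i+1} \mid X_i \sim \mathcal{N}\big(X_i + (b_0 + b_1 X_i)\,\Delta t_i,\ (a_0 + a_1 X_i)^{2\gamma}\,\Delta t_i\big).$$

Accordingly, given $N \in \mathbb{N}$ daily prices $x := (x_1, \dots, x_N)$, the log-likelihood function is given by

$$\ell_x(b_0, b_1, a_0, a_1, \gamma) = \sum_{i=1}^{N-1} \Big[ -\tfrac{1}{2} \log\big(2\pi (a_0 + a_1 x_i)^{2\gamma} \Delta t_i\big) - \frac{\big(x_{i+1} - x_i - (b_0 + b_1 x_i)\Delta t_i\big)^2}{2 (a_0 + a_1 x_i)^{2\gamma} \Delta t_i} \Big].$$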
We consider 2880 trading days for each of the constituents of the S&P 500 index. After every 100 days we numerically maximize, by means of the Constrained Optimization by Linear Approximation (COBYLA) optimizer (Conn et al. 1997, pp. 83-108), the log-likelihood function $\ell_x$ w.r.t. the parameters $a_0$, $a_1$, $b_0$, $b_1$, $\gamma$, where $x = (x_1, \dots, x_{250})$ consists of the last 250 trading days. The results of these estimations are illustrated for a single stock in the left panel of Figure 4.1. Moreover, in the middle panel of Figure 4.1 we display all estimates from all of the considered 20 constituents.
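The estimation step can be sketched in Python as follows. The sketch only covers the Gaussian log-likelihood implied by the Euler-Maruyama discretization and the construction of the rolling windows; the maximization itself is left out and could, for instance, be delegated to SciPy's `scipy.optimize.minimize` with `method="COBYLA"` applied to the negative log-likelihood. The function names and default arguments are assumptions made for illustration.

```python
import math


def log_likelihood(theta, xs, dt=1.0 / 250.0):
    """Gaussian Euler-Maruyama log-likelihood of a series of daily prices xs."""
    b0, b1, a0, a1, gamma = theta
    total = 0.0
    for x, x_next in zip(xs[:-1], xs[1:]):
        var = (a0 + a1 * x) ** (2.0 * gamma) * dt  # conditional variance
        mean = x + (b0 + b1 * x) * dt              # conditional mean
        total += -0.5 * math.log(2.0 * math.pi * var) - (x_next - mean) ** 2 / (2.0 * var)
    return total


def rolling_windows(prices, window=250, step=100):
    """Windows of the last `window` prices, taken every `step` days."""
    return [prices[end - window:end] for end in range(window, len(prices) + 1, step)]


# Toy usage on a synthetic price series (illustration only).
prices = [100.0 + 0.1 * i for i in range(500)]
windows = rolling_windows(prices)
value = log_likelihood((0.0, 0.05, 0.0, 0.2, 1.0), windows[0])
```

Maximizing `log_likelihood` over theta on each window would yield one parameter estimate per window, i.e. a time series of estimates per stock.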

The obtained estimates show a considerable variation over time. For example, the estimator of γ for Apple Inc. ranges from values slightly larger than 0.5 to values around 1 and highlights the advantage of using a generalized affine process rather than a simple affine process where γ would be fixed to 0.5. The variations of all parameter estimates over the considered 20 constituents of the S&P 500 confirm this finding. Also all other parameter estimates clearly exhibit a high degree of uncertainty.

Given this time series of historical parameter estimates, we estimate the uncertainty set $\Theta$ by the obtained minima and maxima of the maximum-likelihood estimations. The obtained estimator is denoted by $\hat{\Theta}$. This represents a conservative approach and takes all past observations into account. More precisely, this constitutes the smallest possible choice given the past observations. Of course, the uncertainty set could also be increased to improve robustness, which however comes at the cost of higher (and therefore potentially less attractive) derivatives' prices and higher hedging costs.

Remark 4.1 (Historical measure vs risk-neutral measure) In this section we are mainly interested in the hedging performance which is typically evaluated under the historical measure. However, there are also cases where one prefers the distribution under the risk-neutral measure, see for example Föllmer and Schied (2004) for a detailed exposition on various hedging concepts. In the latter case one would obtain parameter estimates from liquid derivatives' prices through calibration and then proceed analogously.

We defined the parameter set $\hat{\Theta}$ by the intervals induced by the extreme maximum-likelihood estimates. This corresponds to a conservative approach in which even outliers are deemed to be relevant for the future evolution of the underlying stochastic process. In less conservative approaches one could instead take into account inter-quartile ranges of the estimated parameters or only a grid of parameters associated to the historical estimates. The latter approach avoids that parameter combinations which did not appear in the past (e.g. large values of $\gamma$ and $a_1$ usually do not occur at the same time) are considered as relevant for the future evolution.
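Both the conservative choice and the inter-quartile alternative can be obtained from a list of estimated parameter vectors with a few lines of Python. The estimates below are made-up numbers, the function names are hypothetical, and the inclusive quantile method is just one possible convention.

```python
import statistics


def minmax_set(estimates):
    """Componentwise [min, max] intervals over all estimated parameter vectors."""
    return [(min(col), max(col)) for col in zip(*estimates)]


def interquartile_set(estimates):
    """Componentwise [Q1, Q3] intervals as a less conservative alternative."""
    result = []
    for col in zip(*estimates):
        q1, _, q3 = statistics.quantiles(col, n=4, method="inclusive")
        result.append((q1, q3))
    return result


# Made-up estimates of (b0, b1, a0, a1, gamma) from five windows.
estimates = [
    (0.02, -0.01, 0.30, 0.45, 0.55),
    (0.05, 0.00, 0.35, 0.50, 0.70),
    (-0.03, 0.01, 0.60, 0.40, 0.95),
    (0.00, -0.02, 0.45, 0.55, 0.80),
    (0.10, 0.02, 0.50, 0.60, 0.60),
]
theta_hat = minmax_set(estimates)
theta_iqr = interquartile_set(estimates)
```

Both constructions treat the parameters separately and therefore still admit combinations that never occurred jointly in the past; a grid of historical estimates would avoid this.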

Relying on the estimated uncertainty set Θ̂, we compute, according to Algorithm 1, a hedging strategy for an Asian at-the-money put option with daily observations:

where x_0 corresponds to the respective initial spot value on 09 March 2020 and T = 30 trading days (i.e. maturity 21 April 2020). Further, we compute for each constituent a hedging strategy which only takes the most recent maximum-likelihood estimate, based on the last 250 days, into account. We then evaluate how both hedging strategies perform on the real price evolution of the constituents of the S&P 500 from 09 March 2020 until 21 April 2020 (compare the right panel of Figure 4.1). For this we compare the relative hedging error of the strategies. The results indicate that robust hedging strategies perform better in periods with high volatility, as in the period under consideration. In Table 4.1 we further display the hedging error of strategies that were trained under the assumption that all of the considered parameters lie in intervals except for a single parameter, which is fixed. This analysis allows us to compare and analyse the effect of robustness with respect to single parameters on the hedging error. We observe that taking uncertainty into account is particularly important for the volatility parameters and the parameter γ. More precisely, while fixing the drift parameters does not lead to considerably worse hedging errors, fixing the exponent γ of the volatility term, and in particular the volatility parameter a_1, significantly increases the mean and the standard deviation of the hedging error.
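To make the evaluation step concrete, the following sketch computes the hedging error of a discretely rebalanced strategy for an arithmetic-average Asian put on a single realised path. Several ingredients are assumptions of this sketch and not taken from the paper: the average is formed over the observations after the initial date, the error is defined as payoff minus terminal hedge wealth, and the relative error is normalised by the initial spot. All function names are hypothetical.

```python
import numpy as np

def asian_put_payoff(path, strike):
    """Payoff max(strike - average price, 0); the average excludes the initial date (assumption)."""
    path = np.asarray(path, dtype=float)
    return max(strike - float(np.mean(path[1:])), 0.0)

def hedging_error(path, positions, initial_capital, strike):
    """Payoff minus terminal wealth of the self-financing hedge (sign convention is an assumption)."""
    path = np.asarray(path, dtype=float)
    positions = np.asarray(positions, dtype=float)
    # positions[i] is held over the period from observation i to observation i + 1
    gains = float(np.sum(positions * np.diff(path)))
    return asian_put_payoff(path, strike) - (initial_capital + gains)

def relative_hedging_error(path, positions, initial_capital, strike):
    """Hedging error normalised by the initial spot (normalisation is an assumption)."""
    return hedging_error(path, positions, initial_capital, strike) / float(path[0])

# Toy example with two trading periods and an at-the-money strike.
path = [100.0, 90.0, 80.0]
positions = [-0.5, -0.5]
err = hedging_error(path, positions, initial_capital=4.0, strike=100.0)
rel_err = relative_hedging_error(path, positions, initial_capital=4.0, strike=100.0)
```

In the toy example the average is 85, so the payoff is 15, while the hedge ends with 4 + 10 = 14; the remaining shortfall of 1 is the hedging error.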

For comparison we also report the hedging error when applying the deep hedging approach in a (non-robust) Black-Scholes model, where the parameters are estimated in a consistent manner through maximum likelihood estimation while taking into account the time series of the last 250 trading days. The results of this hedging approach are depicted in the rightmost column of Table 4.1 and show that hedging under the Black-Scholes model leads in the considered period to a mean hedging error and standard deviation comparable to those of an NGA-process with fixed parameters.

This supports our choice of considering the class of generalized affine models in the robust pricing and hedging approach especially during such periods of market turmoil.

While we have now provided evidence for the outperformance of the robust hedging approach over other approaches in a crisis period, the question arises whether the approach is flexible enough to perform comparable to other approaches in periods that would be rather classified as non-crisis periods.

To this end, and to be consistent with the previously introduced methodology, we consider three additional 30 day testing periods, starting 100 trading days, 200 trading days and 300 trading days, respectively, after 09 March 2020. For each additional period we take new maximum-likelihood estimates based on the last 250 days into account and evaluate, for the same payoff function (4.1), the performance of a robust hedging approach, of a non-robust hedging approach and of a hedging approach under a Black-Scholes model. The results of this study are displayed in Table 4.2 and show in particular that in these periods the mean hedging errors and the standard deviations of all approaches are reduced in comparison with the crisis period. The best performing model in these periods turns out to be the NGA-process with fixed parameters, while pursuing a robust hedging approach leads to a slightly higher hedging error and a higher standard deviation. These observations indicate that non-robust approaches perform best in regular out-of-crisis periods, whereas a robust hedging approach performs slightly worse in these periods, presumably since such hedging strategies are adjusted and calibrated to a broader range of possible future market movements. Our investigation of the performance of the hedges in the crisis period (Table 4.1), however, reveals that this broad calibration can provide strong additional protection against unexpected market movements as they can be observed in crises.

Instead of the presented frequentist approach, in which we use the minimal and maximal maximum-likelihood estimates to determine the intervals representing Θ and then assume that the parameters of the SDE are uniformly distributed on Θ, one might also consider other distributions. First, the empirical distribution of the parameter estimates as shown in Figure 4.1 is a natural choice. Second, a Bayesian approach for the determination of the parameter intervals can be implemented, see Duembgen and Rogers (2014). In this approach one starts from a prior distribution (e.g. uniform on some pre-defined intervals) for all of the parameters and then sequentially updates the resulting posterior distributions contingent on the same data which we use for the maximum-likelihood estimates. Eventually, to determine optimal hedging strategies one modifies Algorithm 1 by drawing parameters according to the obtained posterior distributions, as already detailed in Remark 3.2. Alternatively, quasi-Bayesian approaches as in Brignone et al. (2021) can be used, where an asymptotic distribution of parameters is estimated from the quasi-posterior distribution. Since the Bayesian approach may put relatively little weight on extreme parameters, we decided to implement the presented approach, which puts equal weight on all of the parameters that are considered possible. This approach is therefore robust with respect to extreme market movements, for which we provide evidence in the example in Section 4.
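The uniform sampling of parameters can be sketched as follows. This is a minimal illustration under several assumptions that are not confirmed by this section: the dynamics are taken to be dX_t = (b_0 + b_1 X_t) dt + (a_0 + a_1 X_t)^γ dW_t (inferred from the drift and squared-diffusion terms appearing in the appendix), one parameter vector is drawn per path, an Euler-Maruyama scheme is used, and the base of the power is truncated at zero. The function name `simulate_paths` and all numerical values are hypothetical.

```python
import numpy as np

def simulate_paths(x0, lower, upper, n_paths, n_steps, dt, rng):
    """Euler-Maruyama paths with one uniformly drawn parameter vector per path.

    lower, upper: bounds for (b0, b1, a0, a1, gamma) defining the box.
    Returns (paths, theta) with shapes (n_paths, n_steps + 1) and (n_paths, 5).
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    theta = rng.uniform(lower, upper, size=(n_paths, 5))
    b0, b1, a0, a1, gamma = theta.T
    paths = np.empty((n_paths, n_steps + 1))
    paths[:, 0] = x0
    for k in range(n_steps):
        x = paths[:, k]
        # Truncate at zero so that the fractional power is well defined (assumption).
        vol_base = np.maximum(a0 + a1 * x, 0.0)
        dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        paths[:, k + 1] = x + (b0 + b1 * x) * dt + vol_base ** gamma * dw
    return paths, theta

# Hypothetical box for (b0, b1, a0, a1, gamma); 30 daily steps.
lower = [0.0, -0.5, 0.0, 0.2, 0.5]
upper = [0.1, 0.5, 0.0, 0.5, 1.0]
paths, theta = simulate_paths(100.0, lower, upper, n_paths=4, n_steps=30,
                              dt=1.0 / 250.0, rng=np.random.default_rng(1))
```

Replacing the call to `rng.uniform` by draws from an empirical or posterior distribution would give the alternative weighting schemes discussed above without changing the rest of the simulation.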

In this work we studied parameter uncertainty in the class of generalized affine processes and developed a robust hedging approach relying on deep neural networks. This approach shows resilience against unexpected changes in the dynamics of the underlying, which justifies the claimed robustness of this method. Our research is a first step towards the practical application of robust hedging approaches and many questions remain open. The most pressing one is the practical determination of the uncertainty interval: how much risk is one willing to take by considering a smaller interval (which, in good weather conditions, will clearly be cheaper in pricing and hedging)? The second, highly interesting question is how to incorporate transaction costs into the robust deep hedging approach, to treat other dynamics of the underlying and to consider loss functions different from the quadratic one we used in this paper.

Insurance and from Deutsche Forschungsgemeinschaft (DFG) of the grant SCHM 2160/13-1 is gratefully acknowledged.

It is well known that the state space E needs to be chosen in correspondence with Θ. In the case where E = R, this does not pose difficulties, but in the case where E = R_{>0} some care has to be taken. The special case where γ = 1/2 is the content of Proposition 1 in Fadina et al. (2019).

We call the state space E proper for the non-linear generalized affine process A(t, x, Θ) if P(X_s ∈ E, t ≤ s ≤ T) = 1 for all P ∈ A(t, x, Θ) and all 0 ≤ t ≤ T, x ∈ E. The next lemma extends Proposition 1 in Fadina et al. (2019) to the case where γ ≠ 1/2.

Lemma A.1 Assume that E = R_{>0}, b̲_0 > 0, a̲_0 = ā_0 = 0, a̲_1 > 0 and 1/2 < γ̲ ≤ γ̄ ≤ 1. Consider the NGA A(0, x_0, Θ) with x_0 ∈ E. Then it holds for any P ∈ A(0, x_0, Θ) that

Proof: For the proof we rely on the integral test proposed in Theorem 5.2 in Criens (2020). To this end we consider a sufficiently small subset (0, ε) ⊂ E such that b_0 + b_1 x > 0 for all x ∈ (0, ε).

To begin with, we observe the estimates

with constants ā > 0 and u_0 > 0. The next step is to show that v(u, ā)(x) from Equation (5.5) in Criens (2020) explodes as x → 0.

So in the following we consider x < x_0/2. Then,

Then we can estimate (since y < x_0/2), setting β = 2γ − 1 > 0,

for some constant A_1 > 0. Up to constants we can now estimate v(u, ā) from below by

with β = −β −1 − 1. Now it is easy to see that the integral on the right hand side explodes as x → 0 by l'Hospital's rule. □

As a consequence of Lemma A.1 we obtain that the state space E is proper in the following cases:

The next proposition establishes conditions such that the set of semimartingale measures A(t, x, Θ) is not empty.

Proposition A.2 (Existence of generalized affine process) Let γ ∈ [1/2, 1]. If E = R, assume b_0, a_0 > 0 and a_1 = 0, while for E = R_{>0} we assume b_0 > 0, a_0 = 0 and a_1 > 0 and, for γ = 1/2, additionally b_0 > a_1/2. Then for all t ∈ [0, T] and x ∈ E there exists a unique strong solution to the SDE (2.1).

Proof: The proposition follows from Corollary 5.5.16 in Karatzas and Shreve (1991) using the results from Engelbert and Schmidt (1985a,b). First note that in the case E = R_{>0} the function 1/(a_0 + a_1 x)^{2γ} is locally integrable at any x ∈ R_{>0} if a_0 = 0 and a_1 > 0. If E = R, the local integrability follows because a_0 > 0 and a_1 = 0 in that case. Further, for any x, y ∈ R we have

i.e. the function h in Corollary 5.5.16 in Karatzas and Shreve (1991) is given by the strictly increasing function h(z) = κ z^γ with h(0) = 0. Since we chose γ ∈ [1/2, 1], the function h satisfies the condition

Further, the conditions (ND) (a_0 + a_1 x)^{2γ} > 0 for all x ∈ E and (LI) for all x ∈ E there exists an ε > 0 such that ∫_{x−ε}^{x+ε} |b_0 + b_1 y| / (a_0 + a_1 y)^{2γ} dy < ∞ in Karatzas and Shreve (1991) are satisfied when we choose a_0 > 0, a_1 = 0 if E = R and a_0 = 0, a_1 > 0 if E = R_{>0}. Thus, there exists a strong solution to the SDE (2.1), possibly up to an explosion time. Explosions to +∞ in finite time do not occur since we have at most linear growth.

If the state space is R_{>0} and γ = 1/2, then the process X does not reach zero due to Proposition 1 in Fadina et al. (2019). If γ ∈ (1/2, 1], Lemma A.1 implies that again X does not reach zero and the conclusion follows. □
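The role of the restriction γ ∈ [1/2, 1] in the proof can be made explicit with a short computation. The sketch below assumes that the condition on h is the usual Yamada-Watanabe type requirement that the integral of h^{-2} diverges near zero; the displayed condition itself is not reproduced in this text, so this reading is an assumption.

```latex
% Assumed condition: \int_{(0,\varepsilon)} h(z)^{-2}\,dz = \infty for every \varepsilon > 0,
% with h(z) = \kappa z^{\gamma}.
\[
\int_0^{\varepsilon} h(z)^{-2}\,dz
  = \kappa^{-2}\int_0^{\varepsilon} z^{-2\gamma}\,dz
  = \begin{cases}
      \kappa^{-2}\,\big[\log z\big]_{0}^{\varepsilon} = \infty, & \gamma = \tfrac12,\\[4pt]
      \kappa^{-2}\,\Big[\dfrac{z^{1-2\gamma}}{1-2\gamma}\Big]_{0}^{\varepsilon} = \infty, & \tfrac12 < \gamma \le 1,
    \end{cases}
\]
% whereas for \gamma < 1/2 the exponent 1 - 2\gamma is positive and the integral is finite.
```

Under this reading, the integral diverges exactly when 2γ ≥ 1, which is where the lower bound γ ≥ 1/2 enters.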

In this section, we prove Theorem 2.3, which we repeat for the reader's convenience. 

is a viscosity solution to the PDE (2.6).

For the proof we will need some preliminary tools.
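One elementary tool used repeatedly in the estimates that follow is inequality (A.5). For completeness, a short justification is sketched here, assuming q ≥ 1 and a_1, a_2 ≥ 0 (the range of q is an assumption, since it is not stated explicitly in this text).

```latex
% Convexity of x \mapsto x^q on [0,\infty) for q \ge 1 gives
\[
\Big(\frac{a_1 + a_2}{2}\Big)^{q} \le \frac{a_1^{q} + a_2^{q}}{2},
\]
% and multiplying both sides by 2^q yields
\[
(a_1 + a_2)^{q} \le 2^{\,q-1}\,\big(a_1^{q} + a_2^{q}\big).
\]
```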

The proof is a modification of the proof of Lemma 3 in Fadina et al. (2019) and Lemma 5.2 in Neufeld and Nutz (2017) and takes the generalized setting into account.

Proof: Consider P ∈ A(t, x, Θ) and denote by X_s = x + B^P_s + M^P_s, s ≥ t, the semimartingale representation of X from Equation (2.2). We will repeatedly use the elementary inequality (a_1 + a_2)^q ≤ 2^{q−1} (a_1^q + a_2^q) (A.5) and denote c_q := 2^{q−1}. First, the Burkholder-Davis-Gundy (BDG) inequality (see Theorem IV.4.1 in Revuz and Yor (1994)) together with Jensen's inequality and (A.5) yields for any h ∈ [0, T − t] that

Note that the constant C_q ≥ 1 from the BDG inequality depends only on q. We define K = 1 + |b̲_0| + |b̲_1| + |b̄_0| + |b̄_1| + ā_0 + ā_1 and choose any 0 < ε = ε(q) < 1 small enough such that it satisfies

Let us verify that such a fixed ε satisfies the desired property: by the very definition of P ∈ A(t, x, Θ), we have on [t, t + h] that both α and |β^P| are bounded from above by (K + K sup_{0≤s≤h} |X_{t+s}|)^{2γ} ≥ 1 and K + K sup_{0≤s≤h} |X_{t+s}| ≥ 1, respectively, since they are GA-dominated. This, together with Jensen's inequality, yields that

Since K + K sup_{0≤s≤h} |X_{t+s}| ≥ 1 and γ ≤ 1, we have that

Then,

Since the drift is affine dominated, we obtain in a similar way that

Inserting these inequalities into (A.6), considering h ≤ ε, and noting that C_q ≥ 1 implies that

Since h ≤ ε and we chose 0 < ε < 1 such that (A.7) holds, we obtain for the constant

> 0, being independent of t, h, P , that

As P ∈ A(t, x, Θ) was chosen arbitrarily, the claim is proven.

is jointly continuous. In particular, v(t, x) is locally 1/2-Hölder continuous in t and Lipschitz-continuous in x.

Proof: The statement follows similarly to Lemma 4 in Fadina et al. (2019) and Lemma 5.3 in Neufeld and Nutz (2017). For x ≠ y and fixed t ∈ [0, T] it holds that

where L is the Lipschitz constant of the function ψ. Thus, the value function is Lipschitz-continuous in x.

For the local 1/2-Hölder continuity, let t ∈ [0, T) and let 0 ≤ u ≤ T − t be small enough. Then the Lipschitz-continuity, the dynamic programming principle in Proposition 2.2 and Lemma A.3 imply that

with the constant c = c(x, 1) from Lemma A.3. Choosing a sequence (t_n, x_n) converging to (t, x) we have that

The statement follows for n → ∞.

□
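The time-regularity step of the preceding proof can be sketched as follows. The sketch assumes that Lemma A.3 with q = 1 provides a bound of the form sup_P E^P|X_{t+u} − x| ≤ c(u + u^{1/2}) for small u, which is consistent with how the lemma is applied later in (A.19), and it writes L_v for a Lipschitz constant of v in x; both the exact form of the bound and the symbol L_v are assumptions of this sketch.

```latex
\[
\begin{aligned}
|v(t,x) - v(t+u,x)|
 &= \Big|\sup_{P \in \mathcal{A}(t,x,\Theta)} E^{P}\big[v(t+u, X_{t+u})\big] - v(t+u,x)\Big| \\
 &\le \sup_{P \in \mathcal{A}(t,x,\Theta)} E^{P}\big[\,|v(t+u, X_{t+u}) - v(t+u,x)|\,\big] \\
 &\le L_v \sup_{P \in \mathcal{A}(t,x,\Theta)} E^{P}\big[\,|X_{t+u} - x|\,\big]
  \;\le\; L_v\, c\,\big(u + u^{1/2}\big)
  \;\le\; 2 L_v\, c\, u^{1/2} \qquad (u \le 1).
\end{aligned}
\]
```

The first equality is the dynamic programming principle, the second inequality uses the Lipschitz property in x, and the final bound is where the exponent 1/2 originates.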

Proof (of Theorem 2.3): The proof essentially follows the well-known standard arguments in stochastic control, see e.g. the proof of (Neufeld and Nutz 2017, Proposition 5.4). By Lemma A.4, v(t, x) is continuous on [0, T) × R, and we have v(T, x) = ψ(x) by the definition of v. We show that v is a viscosity subsolution of the nonlinear affine PDE defined in (2.7); the supersolution property is proved similarly. We remark that in the subsequent lines within this proof, c > 0 is a constant whose value may change from line to line.

Let (t, x) ∈ [0, T) × R and let ϕ ∈ C^{2,3}_b([0, T) × R) be such that ϕ ≥ v and ϕ(t, x) = v(t, x). By the dynamic programming principle obtained in Proposition 2.2, we have for any 0 < u < T − t that

\[
0 = \sup_{P \in \mathcal{A}(t,x,\Theta)} E^{P}\big[v(t+u, X_{t+u}) - v(t,x)\big]
  \le \sup_{P \in \mathcal{A}(t,x,\Theta)} E^{P}\big[\phi(t+u, X_{t+u}) - \phi(t,x)\big]. \tag{A.10}
\]

Fix any P ∈ A(t, x, Θ), denote as above by (β^P, α) the differential characteristics of the continuous semimartingale X under P, and denote by M^P the P-local martingale part of the P-semimartingale X. Then, Itô's formula yields

\[
\begin{aligned}
\phi(t+u, X_{t+u}) - \phi(t,x)
 ={}& \int_0^u \partial_t \phi(t+s, X_{t+s})\,ds
   + \int_0^u \partial_x \phi(t+s, X_{t+s})\,dM^{P}_{t+s} \\
 &+ \int_0^u \partial_x \phi(t+s, X_{t+s})\,\beta^{P}_{t+s}\,ds
   + \frac12 \int_0^u \partial_{xx} \phi(t+s, X_{t+s})\,\alpha_{t+s}\,ds.
\end{aligned} \tag{A.11}
\]

As ϕ ∈ C^{2,3}_b([0, T) × R), ∂_x ϕ is uniformly bounded, and we see that for small enough 0 < u < T − t the local martingale part in (A.11) is in fact a true martingale, starting at 0. In particular, its expectation vanishes. The next step is to estimate the expectation of the other terms. In this regard, note that

\[
E^{P}\Big[\int_0^u \partial_x \phi(t+s, X_{t+s})\,\beta^{P}_{t+s}\,ds\Big]
 \le \int_0^u E^{P}\Big[\big|\partial_x \phi(t+s, X_{t+s}) - \partial_x \phi(t,x)\big|\,|\beta^{P}_{t+s}|
   + \partial_x \phi(t,x)\,\beta^{P}_{t+s}\Big]\,ds. \tag{A.12}
\]

Since ϕ ∈ C^{2,3}_b, ∂_x ϕ is Lipschitz. Hence, we obtain with the constant K = 1 + |b̲_0| + |b̲_1| + |b̄_0| + |b̄_1| + ā_0 + ā_1 together with Lemma A.3 that for small enough u,

\[
\dots\, |X_{t+v} - x| \,ds \le c\,\big(u^{3} + u^{5/2} + u^{2} + u^{3/2}\big). \tag{A.13}
\]

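Inserting (A.13) into (A.12) yields

\[
E^{P}\Big[\int_0^u \partial_x \phi(t+s, X_{t+s})\,\beta^{P}_{t+s}\,ds\Big]
 \le \int_0^u E^{P}\big[\partial_x \phi(t,x)\,\beta^{P}_{t+s}\big]\,ds
   + c\,\big(u^{3} + u^{5/2} + u^{2} + u^{3/2}\big). \tag{A.14}
\]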

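The same argument applied to ∂_xx ϕ leads to a corresponding bound on ∫_0^u E^P[|∂_xx ϕ(t + s, X_{t+s}) − ∂_xx ϕ(t, x)| · |α_{t+s}|] ds, and we obtain that

\[
E^{P}\Big[\int_0^u \partial_{xx} \phi(t+s, X_{t+s})\,\alpha_{t+s}\,ds\Big]
 \le \int_0^u E^{P}\big[\partial_{xx} \phi(t,x)\,\alpha_{t+s}\big]\,ds
   + c\,\big(u^{3} + u^{5/2} + u^{2} + u^{3/2}\big).
\]

As above, we write θ := (b_0, b_1, a_0, a_1) for an element in Θ. Then, taking expectations in (A.11) and using (A.12)-(A.17) yields

\[
\begin{aligned}
E^{P}\big[\phi(t+u, X_{t+u}) - \phi(t,x)\big]
 &\le c\,\big(u^{3} + u^{5/2} + u^{2} + u^{3/2}\big)
   + \int_0^u \Big(\partial_t \phi(t,x)
   + E^{P}\big[\partial_x \phi(t,x)\,\beta^{P}_{t+s} + \tfrac12\,\partial_{xx} \phi(t,x)\,\alpha_{t+s}\big]\Big)\,ds \\
 &\le c\,\big(u^{3} + u^{5/2} + u^{2} + u^{3/2}\big) + u\,\partial_t \phi(t,x) + \dots
\end{aligned}
\]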

Here, the supremum turns out to be G(X_{t+s}, ∂_x ϕ(t, x), ∂_xx ϕ(t, x)). Note that by the very definition of G,

\[
G(X_{t+s}, p, q) \le G(x, p, q)
 + \sup_{\theta \in \Theta} \Big\{ |b_1|\,|X_{t+s} - x|\,|p| + |a_1|\,|X_{t+s} - x|\,|q| \Big\}.
\]

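Therefore, by using that ϕ ∈ C^{2,3}_b, the definition of the constant K and Lemma A.3, we have

\[
\begin{aligned}
\int_0^u E^{P}\big[G(X_{t+s}, \partial_x \phi(t,x), \partial_{xx} \phi(t,x))\big]\,ds
 &\le u\,G(x, \partial_x \phi(t,x), \partial_{xx} \phi(t,x)) + u\,c\,K\,E^{P}\big[|X_{t+s} - x|\big] \\
 &\le u\,G(x, \partial_x \phi(t,x), \partial_{xx} \phi(t,x)) + c\,K\,\big(u^{2} + u^{3/2}\big).
\end{aligned} \tag{A.19}
\]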

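Combining the above estimates yields

\[
E^{P}\big[\phi(t+u, X_{t+u}) - \phi(t,x)\big]
 \le u\,\partial_t \phi(t,x) + u\,G(x, \partial_x \phi(t,x), \partial_{xx} \phi(t,x))
   + c\,\big(u^{3} + u^{5/2} + u^{2} + u^{3/2}\big) \tag{A.20}
\]

for some constant c > 0 which is independent of P. As the choice of P ∈ A(t, x, Θ) was arbitrary, we deduce from (A.10) that

\[
\begin{aligned}
0 &\le \sup_{P \in \mathcal{A}(t,x,\Theta)} E^{P}\big[\phi(t+u, X_{t+u}) - \phi(t,x)\big] \\
  &\le u\,\partial_t \phi(t,x) + u\,G(x, \partial_x \phi(t,x), \partial_{xx} \phi(t,x))
   + c\,\big(u^{3} + u^{5/2} + u^{2} + u^{3/2}\big).
\end{aligned} \tag{A.21}
\]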

By dividing first in (A.21) by −u and then letting u go to zero, we obtain that −∂_t ϕ(t, x) − G(x, ∂_x ϕ(t, x), ∂_xx ϕ(t, x)) ≤ 0, which proves that v is indeed a viscosity subsolution as desired. □
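For readers who wish to see the last step written out, the limit can be sketched as follows, using only the bound (A.21) as stated above.

```latex
% Dividing (A.21) by -u < 0 reverses the inequality:
\[
0 \;\ge\; -\partial_t \phi(t,x) - G\big(x, \partial_x \phi(t,x), \partial_{xx} \phi(t,x)\big)
      - c\,\big(u^{2} + u^{3/2} + u + u^{1/2}\big).
\]
% The remainder c(u^2 + u^{3/2} + u + u^{1/2}) tends to 0 as u \downarrow 0, hence
\[
-\partial_t \phi(t,x) - G\big(x, \partial_x \phi(t,x), \partial_{xx} \phi(t,x)\big) \;\le\; 0 .
\]
```

Each power of u in the error term of (A.21) is strictly larger than one, which is what makes the remainder vanish after division by u.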

Tensorflow: A system for large-scale machine learning

A model-free version of the fundamental theorem of asset pricing and the super-replication theorem

Robust framework for quantifying the value of information in pricing and hedging

Generalized Feynman-Kac formula under volatility uncertainty

Pricing and hedging derivative securities in markets with uncertain volatilities

Pricing uncertainty induced by climate change

Model-independent bounds for option prices-a mass transport approach

On the super replication price of unbounded claims

Arbitrage and duality in nondominated discrete-time models

Efficient quasi-Bayesian estimation of affine option pricing models using risk-neutral cumulants

Deep hedging

Deep hedging of derivatives using reinforcement learning

Deep hedging of long-term financial derivatives

Equal risk pricing of derivatives with deep hedging

Deep neural network framework based on backward stochastic differential equations for pricing and hedging American options in high dimensions

Duality formulas for robust pricing and hedging in discrete time

European option pricing with stochastic volatility models under parameter uncertainty

On the convergence of derivative-free methods for unconstrained optimization, Approximation theory and optimization: tributes to MJD Powell

Robust pricing and hedging of double no-touch options

No arbitrage in continuous financial markets

A generative adversarial network approach to calibration of local stochastic volatility models

Pricing und Hedging von Derivaten unter Parameterunsicherheit, Master's thesis

Martingale optimal transport and robust hedging in continuous time

Estimate nothing

Robust pricing and hedging of options on multiple assets and its numerics

Capacities, measurable selection and dynamic programming part II: Application in stochastic control problems

On one-dimensional stochastic differential equations with generalized drift

On solutions of one-dimensional stochastic differential equations without drift, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete

Affine processes under parameter uncertainty

Stochastic Finance

Approximation for option prices under uncertain volatility

Bounds on European option prices under stochastic volatility

Non-linear affine processes and path-dependent derivatives

Robust pricing and hedging via neural SDEs, Available at SSRN 3646241

Machine learning for multiple yield curve markets: fast calibration in the Gaussian affine framework

Robust hedging of the lookback option

Deep hedging under rough volatility

Robust pricing-hedging dualities in continuous time

Brownian Motion and Stochastic Calculus

Adam: A method for stochastic optimization

Tightening robust price bounds for exotic derivatives

Robust trading of implied skew

Buy-and-hold property for fully incomplete markets when superreplicating markovian claims

Nonlinear Lévy processes and their characteristics

A deep learning approach to data-driven model-free pricing and to martingale optimal transport

Model-free price bounds under dynamic option trading

Neural network theory

Continuous Martingales and Brownian Motion

Neural networks for option pricing and hedging: a literature review

Uncertain parameters, an empirical stochastic volatility model and confidence limits

We thank the editor and two anonymous referees for several comments which significantly improved our paper. Moreover, we are thankful to David Criens for helpful remarks. Financial support of the NAP Grant Machine Learning based Algorithms in Finance and Insurance and from Deutsche Forschungsgemeinschaft (DFG) of the grant SCHM 2160/13-1 is gratefully acknowledged.