Black-box model risk in finance
Samuel N. Cohen, Derek Snow, Lukasz Szpruch
2021-02-09

Machine learning models are increasingly used in a wide variety of financial settings. The difficulty of understanding the inner workings of these systems, combined with their wide applicability, has the potential to lead to significant new risks for users; these risks need to be understood and quantified. In this sub-chapter, we will focus on a well-studied application of machine learning techniques, to the pricing and hedging of financial options. Our aim will be to highlight the various sources of risk that the introduction of machine learning emphasises or de-emphasises, and the possible risk mitigation and management strategies that are available.

Traditionally, the tractability of pricing and hedging methods was arguably more critical than their accuracy, and the limits of computation determined what methods were useful. The Black-Scholes formula is concise, simple to understand, and can be implemented on a handheld calculator (Lo, 2019); these features were critical to its wide adoption. Similarly, the Heston model benefits from convenient (fast) Fourier transformation methods (see, for example, Gatheral (2006)) and the SABR model from a convenient approximation (see Hagan et al. (2002); Oblój (2008)), which have formed a key part of their attractiveness. While many more sophisticated and accurate models have been developed, computational bottlenecks impeded their wider adoption. In recent years, machine learning models in finance have become streamlined; in just a few lines of packaged code, modellers can develop state-of-the-art models with online computing power and open-source software (Snow, 2020; Dixon et al., 2020).
However, the risks of blindly using machine learning solutions, without understanding their inner workings and inherent drawbacks, are significant. In this sub-chapter, we seek to give an overview of the key issues which arise when using machine learning in finance, and some remedies which have been suggested. Rather than focus on developing a particular algorithm, we take a higher-level view of the risks and challenges which arise in these contexts. We wish to highlight that machine learning is not a panacea for financial markets; instead, it provides tools which allow practitioners to shift between different sources of risk, some of which have not been a primary concern in the past. We will focus on those risks which are a core part of machine learning - the risks inherent in data and in the modelling algorithms used. We will not discuss what the WEF calls the erosion of "human financial talent", where humans lose the skill to challenge and disagree with machine learning systems (McWaters et al., 2019), although this is potentially a significant concern in many financial applications. There are two broad uses of machine learning in finance. The first is to remove computational barriers and enable the use of advanced models in day-to-day business operations. When used in this manner, 'machine learning' provides next-generation computational tools, which are used to speed up and improve traditional modelling. For example, when calibrating an option pricing model one often needs to price many derivatives many times, using a variety of potential parameter values; this is a task that can be improved by using a machine-learnt approximation for the pricing operator. Hutchinson et al. (1994) trained a neural network on simulated data to learn the Black-Scholes option pricing formula.
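As a toy illustration of the Hutchinson et al. idea, one can fit a small network to simulated Black-Scholes prices; this is a minimal sketch, in which the input ranges, network size and training settings are arbitrary choices for illustration, not those of the original paper.

```python
import numpy as np
from scipy.stats import norm
from sklearn.neural_network import MLPRegressor

def bs_call(m, tau, sigma=0.2, r=0.0):
    """Black-Scholes call price as a function of moneyness m = S/K
    and time to maturity tau, normalised by the strike."""
    d1 = (np.log(m) + (r + 0.5 * sigma ** 2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    return m * norm.cdf(d1) - np.exp(-r * tau) * norm.cdf(d2)

# Simulated training data over an (arbitrary) range of inputs
rng = np.random.default_rng(0)
m = rng.uniform(0.8, 1.2, 5000)     # moneyness S/K
tau = rng.uniform(0.1, 1.0, 5000)   # time to maturity, in years
X = np.column_stack([m, tau])
y = bs_call(m, tau)

# A small network learns the pricing function from the simulated data
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
net.fit(X, y)
mae = np.mean(np.abs(net.predict(X) - y))
```

Once trained, evaluating the network is a cheap feed-forward pass, which is the source of the calibration speed-ups discussed below.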
A number of efficient algorithms have recently been developed to approximate parametric pricing operators with flexible modelling assumptions (for example, see Horvath et al. (2020); Jacquier and Oumgari (2019); Ferguson and Green (2018); McGhee (2018); Sabate-Vidales et al. (2018, 2020)). This in turn can eliminate the calibration bottlenecks commonly found in using realistic pricing models. The second application involves a more fundamental change in the approach to modelling and working with data, where traditional, low-dimensional, handcrafted models are replaced with abstract over-parameterized models, using the tools of machine learning to avoid overfitting. These methods may be used to represent the statistical features of underlying assets, to determine prices, hedges and risk properties of portfolios in terms of market observables, or for a combination of these tasks. This application depends in a far more significant way on the historical data available, leading to various challenges: for example, it becomes hard to understand what is driving the price of a derivative, and the data modelling and preprocessing steps might introduce an additional set of risks, which can be a cause of unease for regulators and risk managers. In this sub-chapter, for the sake of concreteness, we focus our attention on the challenge of pricing and hedging derivatives, which we outline in Section 2, and principally on the use of one machine learning method (deep neural networks) for this challenge. In Section 3 we discuss issues connected with the sources of data that are used as inputs into machine learning algorithms, while in Section 4 we are concerned with the risk associated with the way probabilistic modelling is incorporated within machine learning. To get more acquainted with a neural network solution, we will take a closer look at the problem of pricing and hedging an exotic derivative, by trading in a financial market.
Here we give a non-technical description; for a technical primer on neural networks see, for example, the work of Bengio et al. (2017). The inputs to our problem will be a combination of historical market data and commonly accepted handcrafted models (depending on the precise approach); the latter may be used to generate additional simulated data for training purposes. The key outputs are prices, hedging strategies and risk assessments for exotic options and portfolios of exotic options and other assets. The precise role of machine learning in options pricing, and the data used to support it, can vary significantly. Considering one particular class of machine learning methods - neural networks - in Figure 1 we present one way of classifying some applications of this method, looking at whether they are principally concerned with the processing and generation of data, or with building models for financial markets, and how these contribute to different outputs.

Figure 1: Neural Network Pricing and Hedging. This figure illustrates five broad roles for neural networks as part of developing pricing, volatility, and hedging models. Neural networks can learn directly from real data (Data-driven), can be used as a numerical or computational tool (Numerical), can enhance handcrafted models (Hybrid), can generate novel simulated data (Generator), and can be used in reinforcement learning models to develop strategies in a dynamic way (Optimisation).

Many of the use-cases for neural networks boil down to their ability to learn complex, high-dimensional, non-linear relationships; for example, to solve partial differential equations in high dimension, to develop data-driven models with large feature sets, and to find optimal policies in large state-spaces via reinforcement learning. The effectiveness of neural networks in high-dimensional settings suggests they have the ability to overcome the computational curse of dimensionality.
We divide the neural network use-cases into five broad modes of application: Data-driven models, Hybrid models, Numerical approximations, Online Optimisation methods, and Generator models.

(1) Numerical approximations are based on traditional parametric models, and exploit neural networks to, for example, approximate pricing or hedging functionals in the form of solutions to parametric families of PDEs. This approximation can significantly speed up calibration problems. In these applications, the neural network is not trained against real-world historical data, but is used purely to approximate complex functions in an efficient way. The key difficulties in this area are the calibration of hyperparameters and network architectures, and the implementation to industrial standards.
• Some models require computationally expensive procedures, involving solving a partial differential equation (PDE) or performing Monte Carlo simulations, to estimate the option price, implied volatility, or hedging strategy. For these models we can use offline neural networks to efficiently approximate an entire pricing or hedging function (Hutchinson et al., 1994).
• Some applications depend on solving a high-dimensional and/or non-linear PDE, even if the underlying model is simple; for example, to price and hedge path-dependent options or to compute XVAs. Neural networks can be used as a function approximation tool which works well in high dimension, and are particularly efficient at solving PDEs when blended with Monte Carlo simulation (Barucci et al., 1997; Beck et al., 2019; Sabate-Vidales et al., 2020).
• Some problems, in particular the calibration of handcrafted models (for example, Heston or SABR models), require repeated calculation of various option prices under a variety of parameters. By providing an efficient means of approximating this calculation for a range of parameter choices, neural networks speed up the process of calibration, allowing a more efficient use of data.
(Andreou et al., 2010; Bayer et al., 2019; Sabate-Vidales et al., 2018). These methods often depend on simulating option values from the handcrafted model, under a range of parameter values.

(2) Data-driven models rely on real market data to approximate pricing and hedging functions. These models disregard handcrafted models in their entirety and simply use historical, synthetic or simulated data of any type to learn new relationships and features (Ghaziri et al., 2000; Montesdeoca and Niranjan, 2016).

(3) Hybrid models rely on historical, simulated, or synthetic data to approximate pricing and hedging functions, and also constrain or impose knowledge onto the architecture of an otherwise unconstrained neural network.
• Some models first leverage a handcrafted model to estimate prices, and then build a data-driven model to learn the difference or residuals between the observed price and the handcrafted model's estimate (Lajbcygier and Connor, 1997).
• Other models constrain a universal neural network by adding domain knowledge to the architecture, to learn more realistic relationships that increase the interpretability or efficiency of the model, e.g. forcing monotone relationships in a given direction by adding penalties to the loss function (Garcia and Gençay, 2000; Dugas et al., 2009; Gierjatowicz et al., 2020).

(4) Online Optimisation methods: a number of option types, for example American options, benefit from learning optimal stopping rules using neural networks in a reinforcement learning framework; others may benefit from learning a value function or a hedging strategy through optimal control over time, e.g. in a model that takes evolving market frictions into account in an environment or control system (Buehler et al., 2019; Kolm and Ritter, 2019).

(5) Generator models can take any data as input and generate new data that has the same statistical properties.
Data can be generated by applying a calibrated 'handcrafted' model (or a range of handcrafted models) or a machine learning generative model. Alternatively, a generator may learn from data observed in one situation to generate representative data in a related setting. The first of these uses (where the generated data matches statistical properties of historical observations) is called 'synthetic data', and is a subset of 'simulated data', which includes scenarios that were not present in the historical data. The generated data's purpose is principally to aid the performance of machine learning pipelines, for example by providing an environment in which to train further models with reinforcement learning. It is worth noting that the generator, and hence the simulated data, should be seen as a statistical model for our observations. This approach can be viewed as model-based data boosting (Buehler et al., 2020; Ni et al., 2020; Mariani et al., 2019).

Using a combination of these approaches, we can now build an abstract pipeline for learning to price and hedge options.

(1) Using historical data and current market data, build up a collection of training trajectories of the assets under consideration, as well as a representation of the state of the market. As we typically only have one trajectory of past data, one often needs to augment historical observations with models or simulated data. We have discussed two main approaches to this:
(i) Design and train generative models to provide additional realistic data, or to provide a rich parametric class with which to work.
(ii) Train handcrafted models (possibly using neural networks as a numerical tool to speed up calibration) from which to simulate.

(2) Using this data, either
(i) Use reinforcement learning to optimize hedging strategies (and thus determine initial prices), using the simulator as a training environment;
(ii) Learn data-driven pricing relationships and hedging strategies by observing prices in historical data;
(iii) Learn a hybrid model that first trains on simulated data, and then transfers this learning over to real data for efficient training;
(iv) Use a further numerical approach to solve the PDEs arising from the (possibly high-dimensional) models. Equivalently, one can ensure the calibrated model generates trajectories with probabilities from a risk-neutral measure, and use Monte Carlo simulation to estimate prices.

The pipeline we have outlined above gives a very flexible approach to the modelling and pricing of financial derivatives; however, this is not a 'free lunch'. Neural networks are notoriously opaque as a modelling tool, and are often implemented simply as a 'black-box' approach to function approximation (the hybrid models discussed above being a partial exception to this). An important practical question is whether the potential disadvantages of black-box hedging can be justified by increased performance, and whether the risks associated with this approach can be distinguished and quantified. An industry standard for assessing the quality of a new model is to compare it with simpler benchmarks, such as Black-Scholes delta hedging in the presence of transaction costs (Davis et al., 1993; Whalley and Wilmott, 1999). There is preliminary evidence suggesting that, at least in simple, constant volatility settings, these benchmark models have performance close to that of reinforcement learning agents (Mikkilä, 2020). If that is true, these benchmarks should be preferred, because they have easy-to-explain analytical solutions. Ruf and Wang (2020a) have shown, on an out-of-sample test set, that a simple fixed hedging strategy that hedges calls by 0.9 × δ_BS and puts by 1.1 × δ_BS, where δ_BS is the delta under the classic Black-Scholes model, outperformed 14 out of 16 models, including all the supervised neural network models with 1-day rebalancing, and outperformed all models with 2-day rebalancing.
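The fixed scaled-delta benchmark above can be written down in a few lines; the option parameters below are arbitrary illustrative values, not those used by Ruf and Wang.

```python
import numpy as np
from scipy.stats import norm

def bs_delta(S, K, tau, sigma, r=0.0, call=True):
    """Black-Scholes delta; the put delta follows from put-call parity."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * tau) / (sigma * np.sqrt(tau))
    return norm.cdf(d1) if call else norm.cdf(d1) - 1.0

# Fixed scaled hedges: 0.9 x delta for calls, 1.1 x delta for puts
S, K, tau, sigma = 100.0, 100.0, 1.0, 0.2
call_hedge = 0.9 * bs_delta(S, K, tau, sigma, call=True)
put_hedge = 1.1 * bs_delta(S, K, tau, sigma, call=False)
```

The appeal of such a benchmark is exactly its transparency: the hedge ratio is a closed-form function of observable inputs, with a single scaling constant.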
It should be noted that these tests were performed in a simple one-period setting, with no transaction costs. These results do not directly extend to a large basket of derivatives; as a result, more tests are needed. However, these results suggest that more complex black-box models may fail to outperform simpler ones. It is also possible that improved feature selection combined with simple models might be a better solution than a direct application of neural networks. Ruf and Wang (2020a) compared a neural network model with a linear regression model for estimating the hedge ratio, on simulated and real data. Predictors in the linear model included standard model sensitivities under the Black-Scholes model: moneyness, Delta, Vega, implied volatility, time to maturity and Vanna. They conclude that the classical option sensitivities already contain the non-linearities necessary to build an effective hedging strategy for common options, in a financially significant and efficient way. They further showed that this linear regression model outperformed their neural network model. Their approach diverges from that of Buehler et al. (2019), who do not use option sensitivities as variables, but instead rely on the belief that the Greeks indirectly present themselves as non-linear functions that the agent has access to via the market state, in the form of hedging instrument prices. Future experiments should test hedging performance on a basket of derivatives, in a multi-period setting, with dynamic volatility, transaction costs, and environmental feedback. Such a real-life setup could benefit the deep hedging approach (Buehler et al., 2019), as this has the capacity for direct feedback from the environment and online training. We now focus our attention on how the use of machine learning methods highlights risks associated with the underlying financial data.
One can identify three primary sources of risk: biases in the training data, erroneous data or erroneous preprocessing, and legal and regulatory data risks. We will not focus on legal and regulatory risks. As we move the dial from handcrafted towards data-driven models, the data risk increases significantly. On the one hand, handcrafted models are more robust to biases and errors in the data, and the risk of using inadequate models is easier to detect. On the other hand, for data-driven models the training data becomes an integral part of the model, making them more sensitive to data risk. A key issue in financial data is that the majority of data is backward-looking, and there is no guarantee that future behaviour of financial markets will be represented by historical observations. We typically only have one trajectory of historical data - we cannot see what might have happened in different scenarios - which makes it difficult to build a clear view of the range of likely outcomes in the future. Any recent changes in the true underlying state of a system that are not incorporated in a model's training dataset will also lead to biased predictions. These risks are not particular to machine learning - they are well known issues in financial markets; however, the use of machine learning models, which often depend on observed data in a more significant way than traditional handcrafted models, means understanding and managing these risks is critical for the success of these approaches. Here we summarize some key forms of bias that historical data could exhibit:

(a) Backward-looking: most data reflect prices and signals obtained in the past. This means that data could reflect a state of affairs that no longer applies, e.g. an options model might only have access to data from a low-volatility period, or from an old regulatory regime.
Financial markets are reactive and do not follow universal laws; for example, the increase of high-frequency trading has changed the nature of many financial markets (see, for example, the discussion in MacKenzie (2018)). Restricting attention to the very recent past, and projecting a model's predictions only into the near future, can mitigate this concern somewhat, but often results in significantly less data being available for training.

(b) Spurious correlations: for some financial domains it is prudent to record and collect only attributes that have some theoretical basis. For example, it is questionable whether an option pricing model should contain sentiment features. It is often difficult to identify intuitively unreasonable relationships within a black-box model, and the increase in dimensionality of the models being used results in a vast increase in the range of potential relationships that could be inferred (Fan and Zhou, 2016). Spurious correlations are well known in finance, but the increased use of machine learning techniques can exacerbate this problem.

(c) Sample disparity: biases in the sampling procedure might lead to data that does not fairly represent the state of the market. For example, a firm may wish to use the same algorithm for trading over multiple exchanges and geographic locations, each of which has subtly different conventions and data. This introduces biases within the data, which can be magnified through the use of machine learning models, particularly when a model is trained in one setting and then deployed for use elsewhere.

(d) Imbalanced inputs: even when a sample accurately reflects the true state of the market, it may remain imbalanced - rare events may be significant, but are only infrequently represented in historical samples. Many data-driven models are known to favour the performance of the majority outcome, to limit overall model errors (Provost, 2000).
Within a financial setting, this might correspond to a model which performs well in low-volatility regimes, even when high-volatility periods are observable (infrequently) within the training data. A related idea is the use of stressed periods for calculating risks within the Basel accords - these infrequent periods are significant to overall performance, and need to be explicitly taken into account.

(e) Insufficient data: one often has insufficient data to use machine-learning models well. The calibration of neural networks requires significant quantities of data, which are often not available for training in a financial context (Gu et al., 2018). The richer the context in which an algorithm is to be run, and the more finely tuned its behaviour needs to be, the larger the quantity of data needed. It is worth emphasising that this is not to say that the data in finance is 'small', but that often it is not 'large' in the directions needed - we may have enormous datasets due to high-frequency observations of a large number of assets' order books, but these will be of little use in determining good models over long time periods.

A further concern in many applications is that data may display subtle errors, which need to be addressed before it can be effectively used. This is a common concern in many applications of machine learning, and data-cleaning methodologies form a key part of the implementation of these methods.

(a) Observed financial data can fail to satisfy fundamental economic constraints, which can be subtle. For example, as discussed in Cohen et al. (2020), historic options price data, for both listed and OTC contracts, may be inconsistent with no-arbitrage constraints, particularly in emerging markets. If such data is naïvely used when training a trading system, it is plausible that the system would learn to exploit this apparent arbitrage opportunity.
Given that these errors could arise from multiple sources (for example, stale quotes being listed as live in historical data), this can lead to significant error in the resulting learnt behaviour.

(b) When working with time-series data, it is critical to respect information flow when, for example, splitting data into training, validation, and test sets; engineering features; or normalising data. Errors in this process can 'leak' information from the future, leading to unrealistic performance.

(c) Financial data often has a particular concern around precise timekeeping, which may not be reflected in the accuracy of the data given. Particularly when working with very high-frequency data, failing to take into account latency and other implementation issues can have a significant effect, which may not be well reflected or available in historic data (see, for example, the effect of latency in Cartea and Sánchez-Betancourt (2018)). This is particularly the case with the increased attention being given to non-market data sources (for example, signals from online news sources), where historic time-stamping may be of low quality.

(d) Financial data is often heavy-tailed and non-stationary, making it difficult to detect and exclude erroneous data. Typical methods (such as Winsorizing) have the potential to introduce significant bias, particularly when considering extreme events.

There are many process improvements that can be implemented to decrease data risks, for example performing data quality monitoring, documenting and reviewing the manipulation of input data, and educating and training individuals involved in data manipulation tasks. Another key approach, to fix biased and limited data, is to generate synthetic or simulated data which is free from (or even corrects for) these issues. We will outline two key approaches - the top-down approach of synthetic data generation, and the bottom-up approach of agent-based modelling.
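The information-flow concern in point (b) can be made concrete with the simplest possible preprocessing step, normalisation of a return series; this is a sketch on synthetic data, where only the ordering of operations matters.

```python
import numpy as np

# Toy return series; the point is only the ordering of operations.
rng = np.random.default_rng(1)
returns = rng.standard_normal(1000) * 0.01

# Chronological split: no shuffling for time-series data.
split = 800
train, test = returns[:split], returns[split:]

# Fit normalisation statistics on the training window ONLY;
# computing them on the full series would leak future information.
mu, sd = train.mean(), train.std()
train_n = (train - mu) / sd
test_n = (test - mu) / sd  # apply, never refit, on the test window
```

The same discipline applies to any fitted preprocessing step (feature engineering, outlier thresholds, imputation): estimate on the training window, then apply unchanged out-of-sample.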
Synthetic data generation (SDG) is a top-down data generation solution. It can help to address some of the data biases and errors listed above. It does so by augmenting the quantity and quality of historical data, but it does not attempt to provide a simulator which can model feedback effects from an agent's interventions in a market. At a high level, a synthetic data generator attempts to build a probabilistic model which would generate observations similar to historical data. Generative models such as generative adversarial networks (GANs) and variational autoencoders (VAEs) have demonstrated great success in high-dimensional settings (Wan et al., 2017; Lin et al., 2020). If used correctly, SDGs could allow for a more comprehensive approach to future-proofing and validating machine learning pipelines, ameliorating some structural deficiencies in data and amending distributional biases (Louizos et al., 2015). In a financial context, Takahashi et al. (2019) have shown that GANs can be used to generate synthetic data that matches most known stylized features of returns; Ni et al. (2020) have shown how mathematically principled feature extraction methods, such as signature models, can be used to efficiently implement conditional GANs for generic time series data. Related ideas, but combined with VAEs, are presented in Buehler et al. (2020). Henry-Labordere (2019) developed efficient algorithms building upon optimal transport theory, and highlighted an interesting application of data generators for detecting anomalies. Algorithms based on restricted Boltzmann machines have been developed in Kondratyev and Schwarz (2019), who coined the term 'market generators'. An alternative approach is to learn the underlying dynamics of the system, allowing a path to evolve through time - this is the approach taken by neural-SDE models (Gierjatowicz et al., 2020). Fu et al.
(2019) have shown how conditional GANs can be used to produce synthetic data for different market scenarios. Koshiyama et al. (2020) have shown how these methods can be used to validate trading strategies. SDGs still pose a form of modelling risk: a generator is only as good as the data from which it constructs its generating function, and building an SDG involves choosing a metric, a loss function, and a training algorithm for parameter selection. As such, SDGs introduce model risks within the data used to train downstream models, and these risks may be difficult to identify depending on the use-case. At the present time, research in this area lacks standardised benchmarks and theoretical guarantees. Most off-the-shelf methods are not built with financial applications in mind, and are therefore likely to generate simulated data which exhibits arbitrage or other economically unrealistic phenomena. Moreover, many of these models remain black-box and are not easily interpretable. The key benefit, however, is that these methods are expressive and work in high dimensions. This is the main difference when comparing with traditional methods using handcrafted features. Synthetic data can also be generated according to expert opinions and known facts, e.g. conditioned to reproduce observed volatility smiles. SDGs generally offer a more accurate and robust oversampling method than traditional techniques like SMOTE (Synthetic Minority Oversampling Technique), which simply interpolate between existing records (Chawla et al., 2002). They also provide a convenient solution for missing data imputation and outlier treatment (Xu and Veeramachaneni, 2018). Synthetic data generation tools can be used as part of a larger solution to address some of the most common upstream data errors.
They can be used side-by-side with federated learning techniques to improve the quality of single-standing resources, by pooling data across departments, subsidiaries, companies, or data-providers (Goetz and Tewari, 2020). Deep generative models for synthetic data generation remain a new field, and although they have the potential to alleviate some of the known issues of neural network models, it is clear that they have the potential to introduce further risks. Overall, as with other methods, they can be seen as shifting risks away from the quantity and quality of data, by including probabilistic modelling (with its associated risks) at a very early stage in the analysis pipeline.

Agent-based model (ABM) simulators, unlike SDGs, are a bottom-up data solution, and date back to the 1990s. Notable early models include those by Levy et al. (1994) and the Santa Fe Artificial Stock Market (1994; 1996). ABMs model markets as evolving systems of competing, autonomous, interacting agents (LeBaron, 2000). The development of ABMs has seen multiple waves of interest. The first wave of market simulators in the 1990s was a deliberate move away from classical economic theories to advance financial market knowledge; the second wave was a reaction to the failure of economic models to foresee the financial crisis of 2008; the third wave was a call to understand high-frequency trading and the flash-crashes of 2010 and 2013; and the fourth and current wave combines the concerns of the past waves, but emphasises the use of simulators to train machine learning agents. A key advantage of ABMs is that, as bottom-up models, they attempt to learn the feedback effects of agents acting within the market. This has the advantage that these effects can be modelled, but makes training much more difficult - usually involving explicit modelling decisions, and requiring more data to train.
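A minimal, purely illustrative sketch of the bottom-up idea uses 'zero-intelligence' agents with an assumed linear price-impact rule; none of these modelling choices come from the literature cited above, and a realistic ABM would be far richer.

```python
import numpy as np

rng = np.random.default_rng(2)
n_agents, n_steps, impact = 100, 500, 0.01
log_price = np.zeros(n_steps + 1)

for t in range(n_steps):
    # Each zero-intelligence agent independently buys (+1) or sells (-1).
    orders = rng.choice([-1, 1], size=n_agents)
    # The price responds to the aggregate order imbalance: this is the
    # feedback channel that top-down statistical generators do not model.
    log_price[t + 1] = log_price[t] + impact * orders.sum() / n_agents

price = 100.0 * np.exp(log_price)
```

Because the price path is generated by the agents' own orders, an additional learning agent inserted into such a loop would move prices by trading, which is exactly the feedback effect a static historical dataset cannot exhibit.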
Again we see that the issues of historical data not containing counterfactual histories, or being too limited for our purposes, are being addressed, but doing so increases our reliance on statistical models, rather than on observed data. Modelling feedback is important for training environments to be realistic. For example, when hedging or trading strategies are trained and tested on historical data, the success of the model still cannot be reliably demonstrated, even when using holdout sets for validation. Training environments with appropriately modelled feedback can, at least partially, mitigate this issue. Such training environments are also critical for deploying online reinforcement learning solutions, as they allow pre-training of these systems before implementation in the real market. This is critical in applications where the costs and risks of exploration are significant. Agent-based modelling has, in recent years, allowed for the design of high-fidelity simulated markets (Belcak et al., 2020; Byrd et al., 2019). These artificial markets can run millions of in-silico trials to test counterfactual theories, research emergent phenomena, and train and test algorithms. A current trend is that quantitative funds are looking to establish risk management systems that develop scenarios with no historical precedent. With a simulator, one can perform training and backtesting for trading, execution, and placement algorithms under various conditions. Causal assessments can be performed for market impact and market slippage. Lastly, simulators can also be used as a means of generating synthetic data, given that financial data of sufficient granularity is often highly proprietary and/or expensive to access. Issues surrounding data quality are often specific to the particular use-case. The increasing use of varied data sources, often with little standardization, will inevitably result in the preprocessing of financial data becoming more important.
Some approaches, for example the no-arbitrage constraints for option books in Cohen et al. (2020), rely on preprocessing data to conform with prescribed characteristics. In this case, given that the no-arbitrage constraints restrict the range of possible option prices significantly, imposing these requirements has the potential to address errors coming from a variety of sources. These methods can also be run on data coming from an SDG or ABM, in order to ensure that the simulated data is economically reasonable. An approach which can serve to highlight potential concerns is to look at the sampling frequency and periods of data. By comparing the results of using different but comparable datasets, it is possible to gauge the stability of calibration and models, and hence to identify causes for concern. More generally, learning from other areas of machine learning, the development of common examples, codebases and resources, in an open-source manner, has the potential to improve the identification and processing of data errors. A significant risk is that inappropriate methods for dealing with data errors will be separately developed, implemented and used, without sufficient oversight or criticism. The use of well-developed, understood, and standardized tools is a key part of modern machine learning, and the development of preprocessing tools appropriate to finance should be seen in this light. As part of this, the development of publicly discussed use-cases, with realistic data, would allow for new methods to be evaluated in a consistent manner, and for best practice to be developed. While this is the case in other areas of machine learning, there is still much scope for improvement when it comes to financial data and problems. As we have discussed, the use of machine learning changes, but does not eliminate, the use of classical mathematical modelling.
Classical models may appear explicitly in machine learning methods (for example, in a hybrid pricing model), or may be subtly incorporated in the simulations used to support more explicitly data-driven approaches. Typically, however, machine learning methods aim to construct models from flexible ('non-parametric') families, combining the classical tasks of model selection and calibration into a single step. In this section, we will discuss the risks which arise from these modelling decisions, in a machine learning context. It is worth noting that the presence of model risk depends strongly on how machine learning methods are used. Using machine learning tools for numerical procedures typically introduces little additional model risk, as one can often verify the solutions using other techniques. For example, when using a neural network to estimate option prices, for the sake of quickly calibrating the parameters of a classical model, it is straightforward to verify (using traditional PDE or Monte Carlo methods) that the calibrated model gives the correct prices of those options -the neural network is only serving as a numerical tool. Conversely, end-to-end deep reinforcement learning, for example of a hedging strategy, exposes users to risks in multiple forms: models with too many parameters risk overfitting to available data, leading to both poor performance and a misunderstanding of a model's inaccuracy; the common use of synthetic and simulated data hides an additional layer of model risk in the training environment; complex models are more exposed to reward hacking, poisoning attacks, and other adversarial concerns and are typically less interpretable than simpler models. The problem of calculating a price for a financial derivative which is consistent with the market can be seen as equivalent to finding a map that takes market data (e.g. prices of underlying assets, interest rates, prices of liquid options) and returns the no-arbitrage price of the derivative. 
One way to do this is to select a martingale model (to prevent arbitrage) that can be calibrated to market data, by which we mean that the model matches the observed prices of liquid assets. While this is the dominant approach in industry, the introduction of a model necessarily introduces model risk, and there are infinitely many models that can fit market data. In the robust finance paradigm, see (Hobson, 1998; Cox and Obłój, 2011), one takes a conservative approach and, instead of computing a single price, one constructs pricing intervals that are consistent with market data. Without imposing further constraints, the class of all calibrated models might be too large, and consequently, the corresponding pricing intervals too wide to be of practical use (Eckstein et al., 2019). It is therefore natural to consider a smaller search space of models (e.g. SDEs with continuous coefficients) and use data and machine learning to select an appropriate model (i.e. the coefficients of the SDE). This approach has been recently applied in (Gierjatowicz et al., 2020). The key idea is to use SDEs to describe the model dynamics but, instead of fixing the coefficients, to allow the drift and diffusion to be given by an overparametrized neural network. These 'neural SDE' models provide a systematic framework for model selection, but can also produce robust estimates of derivative prices. A concern for model risk is not new in finance, but the use of machine learning methods can be seen as typically emphasising some risks over others. In Table 1, we present an overview of the typical distinctions between handcrafted and machine learning perspectives on model risk.

Table 1: typical distinctions between handcrafted and machine learning perspectives on model risk.

Model complexity. Handcrafted: lower dimensional models which are easy to calibrate, but fail to capture all aspects of the market's behaviour; generally a higher bias than variance, and more prone to underfitting. Machine learning: high dimensional models which require large amounts of data to calibrate, but can capture fine detail when fitted well, and can often incorporate new sources of information in a convenient manner; generally a higher variance than bias, and more prone to overfitting.

Model sensitivity. Handcrafted: few parameters and model inputs; model outputs vary smoothly with calibration and input; well understood sensitivities to erroneous inputs. Machine learning: high-dimensional parameters and data inputs; model outputs can vary sharply with inputs; sensitivities to erroneous inputs can vary significantly.

Adversarial robustness. Handcrafted: reasonably robust calibration, not susceptible to data poisoning attacks; calibration can be easily monitored by users; adversarial defences not a key part of most models. Machine learning: susceptible to attacks, requiring robust training and adversarial defences, but these can be incorporated as a key part of the model; not easily monitored by users.

Model updating. Handcrafted: models naturally incorporate economic intuition and underpinnings; few parameters to update online, but updating is not often a core part of the model. Machine learning: model based on data patterns which may change over time; many parameters need to be updated dynamically, which can lead to unstable behaviour; model updating can be included as a core part of the approach.

Within a machine-learning paradigm, one usually combines the stages of model selection and calibration. Given data on a supposed relationship or phenomenon, one aims to directly fit a model to this data with which to predict, simulate and build understanding. For our example of pricing and hedging of options, we can focus on the task of pricing an option given historical market data. Our data consist of historical observations of market data, and we aim to build a function which can take new observations and provide us with prices in the future. To do this, some basic modelling assumptions are unavoidable: • Does the price of an option depend only on the current market state, on the recent past, or on a long history of market observations? Equivalently, what are the inputs to the pricing function that I wish to find?
• Do I wish to make conditional predictions (say, of an option price given a stock price), or do I wish to give simulations of both simultaneously? • Does the relationship between market observations and prices remain stable through time? If not, how do I choose training periods which are representative of the situations where I will apply my function in the future? • If the observed prices are not perfectly predicted by market data, so that I have noisy observations, are the noise terms independent, or are they correlated between times and assets? In each case, the answers given to these questions will be incorporated in our machine learning model, and introduce model risk at a structural level. These general concerns are common to both classical and machine learning methods; however, the increased flexibility of machine learning methods (as one can include more observations in a model) may suggest that they would be less pronounced in a machine learning approach. Even after these general concerns are addressed, machine learning methods introduce risks similar to the 'model risk' of classical mathematical finance. Within the paradigm of machine learning, models are not chosen explicitly but implicitly, through the choice of training data, training algorithm and the often ad hoc choice of a large parametric model (e.g. a neural network and its architecture). Unlike handcrafted methods that are explicitly specified, or hybrid approaches relying on feature engineering, neural networks construct an internal representation of features to capture and approximate functions. With neural networks, model specification is not in the direct control of the modeller. Due to this flexibility in feature specification, a larger space of plausible models is explored than in traditional or many other machine learning approaches. The cost of this flexibility is that the model selected may not be the 'best' available.
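This non-uniqueness is easy to see in a toy experiment (a sketch with a deliberately small network and plain SGD, not a realistic pricing model): refitting the same architecture to the same data under different random seeds yields visibly different predictions.

```python
import numpy as np

# Toy illustration of training randomness: the same small network, fitted
# to the same data by plain SGD, gives different predictions depending only
# on the seed used for initialisation and mini-batch selection.
def fit_predict(x, y, x_test, hidden=16, epochs=300, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0, 1, (1, hidden)); b1 = np.zeros(hidden)
    w2 = rng.normal(0, 1, (hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        idx = rng.permutation(len(x))[:16]          # random mini-batch
        xb, yb = x[idx], y[idx]
        h = np.tanh(xb @ w1 + b1)                   # forward pass
        err = h @ w2 + b2 - yb
        # backward pass: mean-squared-error gradients
        gw2 = h.T @ err / len(xb); gb2 = err.mean(0)
        gh = err @ w2.T * (1 - h**2)
        gw1 = xb.T @ gh / len(xb); gb1 = gh.mean(0)
        w1 -= lr * gw1; b1 -= lr * gb1; w2 -= lr * gw2; b2 -= lr * gb2
    return (np.tanh(x_test @ w1 + b1) @ w2 + b2).ravel()

rng = np.random.default_rng(42)
x = rng.uniform(-2, 2, (200, 1)); y = np.sin(x) + 0.1 * rng.normal(size=(200, 1))
x_test = np.array([[0.5]])
preds = [fit_predict(x, y, x_test, seed=s)[0] for s in range(10)]
spread = max(preds) - min(preds)   # seed-to-seed dispersion at one input
```

The dispersion `spread` is nonzero purely because of initialisation and mini-batch randomness; averaging over seeds (ensembling) is one common mitigation.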
Since the fitting of traditional models typically involves solving a convex optimisation problem, a best model can be identified, due to the existence of a unique minimum. Neural network fitting problems are typically non-convex, and many good solutions can be found. Adding fuel to the fire, neural networks are known to be sensitive to initialisation conditions (McMormack and Doherty, 1993). Moreover, many sources of randomness are often injected into the training phase of neural networks; these include the use of dropout (where some neurons are randomly set to zero for network regularisation), early stopping (where the process of gradient descent stops when the performance on a validation set stops improving), and stochastic gradient descent (where random selections of observations are used to fit the network). These additional factors introduce uncertainty in the output of neural network models. The injection of noise during training is critical to the performance of these methods, and it leads to so-called implicit regularisation (Neyshabur, 2017). That means that stochastic gradient descent methods select regularised solutions, even though regularisation is not explicitly incorporated at the training stage (Heiss et al., 2019). In this sense, the model selection step of classical approaches is replaced by the choice of training algorithm, which has a less easily understood connection with model performance. Drawing from interpretability research by Lipton (2018), any model's transparency can be broken down into simulatability, decomposability, and algorithmic transparency. With simulatability, a human should be able to step through each of the model's operations in a reasonable time; with decomposability, each part of the model has an intuitive explanation that can be understood in isolation; with algorithmic transparency, there are theoretical guarantees about the behaviour of the algorithm, for example certainty of convergence.
Going down this checklist, it is clear that neural networks lack simulatability and decomposability, because the parameters in the hidden layers do not have an intuitive explanation. Moreover, for non-convex problems, stochastic gradient descent is not guaranteed to converge. Instead, one can show that the weights of neural networks are represented by Monte Carlo samples from an optimal distribution over the parameter space. This perspective allows one to establish convergence guarantees, but does not help with the issue of interpretability (Hu et al., 2019; Jabir et al., 2019). A key selling point of neural networks is their ability to work with high dimensional inputs. However, this comes with a well documented issue of sensitivity, where the learnt relationships vary wildly with small perturbations to the underlying inputs. Models are known to be fragile when using high dimensional inputs, for numerous reasons. First, given the randomness involved in training neural networks, some inputs may spuriously be considered important. This is a particular issue when only limited data is available, or simulated data (from a low dimensional model) is used as training data; simulated data will typically not explore a full range of market conditions (as it is constrained by the model from which it is generated), and so the neural network will not learn to provide good answers when novel conditions are encountered. Secondly, when many inputs are used within a model, there is an increased probability that some variables might not be available when a model is put into practice. Since the model specification of neural networks is implicit, the modeller and end-user of these methods will often no longer understand how the model has been fit, significantly increasing model specification risk. Consequently, it is not clear how we can quantify the sensitivity of the model.
The field is therefore largely left to develop more interpretable model alternatives (Nakagawa et al., 2019) or to use post-hoc explanations to assess and visualise what models have learned (Li et al., 2020). This, however, also comes with risk, as many post-hoc explanations are not robust and may lead to a false sense of security (Anders et al., 2020). The competitive nature of financial markets often leads to particular concerns for machine learning models. As models are used in increasingly automated ways, they need to be able to respond to the pressures placed on them by competitors, who have strong incentives to identify and exploit potential weaknesses of a model. For example, we could consider our challenge of managing an options portfolio, but in a context where market price impact reduces the efficiency of trading. A classic model for order execution with market impact, Almgren and Chriss (2001), yields deterministic policies for executing a large buy or sell order, which may have the undesirable effect of 'information leakage' (revealing your strategy to other market participants) when used in an illiquid market. In the more complex situation of managing a portfolio, one could consider building a neural network model to perform this task (for optimal execution, a model of this type is given in work by Ning et al. (2018)). The additional randomness of the neural network model would arguably assist in preventing information leakage, when compared to the traditional model. Nevertheless, it is a priori unclear whether this additional randomisation would be sufficient, or whether further precautions against information leakage would be needed. Adversarial attacks can be grouped into many categories; for example, attacks can be either intentional or unintentional.
Behzadan and Munir (2018) split them into attacks on model confidentiality, integrity (does the model behave as intended?), and availability (can the model be disabled by an external actor?). Attacks can also be classified by the component that is susceptible to attack; for reinforcement learning, these include the environment, the observation channel, the reward channel, the decision making system, and the online training system. We can consider various ways in which a financial reinforcement learning agent can be attacked, with a simple description and illustrative example of each. These classifications have been adapted from the adversarial threat matrix developed by MITRE in collaboration with Microsoft, IBM, NVIDIA, and Bosch (Kumar et al., 2020). In Table 3, we present examples of adversarial attacks against a trading system. We first list those which are internal to the company, many of which can arise inadvertently in building and implementing machine learning methods, and then follow with examples of attacks that an adversary can exploit without having direct access to a trader's codebase. The examples are our own, and are purely illustrative. These intentional and unintentional attack examples are hypothetical, and relate to problems seen in other machine learning domains. Nonetheless, these examples have significant implications for financial model risk management. A substantial level of compounded risk could exist where several of these susceptibilities overlap. Although there is a need to test and benchmark the robustness and resilience of trading agents with private systems and historical data, these agents ultimately have to move to the real world, where a slight distributional shift could impair performance. In other areas of machine learning, in addition to internal testing, models can be subjected to public audits.
However, in finance the competitive risks from revealing private models are significant, leading to a far lower level of transparency.

Table 3: examples of adversarial attacks against a trading system.

Reward hacking. When training, the stated reward differs from the true reward. Example: a learning agent was trained to create a perfect hedge; however, transaction costs were poorly modelled, leading to poor performance.

Side effects. A reinforcement learning system disrupts the environment by advancing its goal. Example: a model has learned an order execution strategy for an illiquid asset, but by executing this strategy, changes the dynamics of the order book significantly, leading to increased risk.

Distributional shift. The system is trained on one environment, but is unable to adapt to changes. Example: a pricing model was trained on data during normal times, and is unable to react to the higher correlations between assets during crises.

Natural adversarial examples. Even without being attacked, the system fails from natural errors. Example: a pricing model was trained individually for each strike and maturity, resulting in arbitrageable prices being offered in the market.

Common corruption. The system is not able to deal with common corruptions. Example: a pricing model failed due to a halt on trading being placed on a closely related underlying instrument.

Incomplete testing. The system is not tested in the right environment, nor over multiple periods. Example: a pricing model is tested only on one exchange, but is deployed in multiple locations with differing market behaviours.

Poisoning attack. Contaminate the training phase. Example: contaminated data is introduced into a pricing model, for example when using sentiment analysis based on social media.

Model stealing. Recover the entire model. Example: a proprietary model is trained and can be queried online by counterparties. By repeated queries it is possible that the inputs can be matched with the outputs, to reverse engineer the original model.

Model inversion. Recover hidden features. Example: a pricing model is trained using proprietary trading data on market impact. The fitted model is then made public, without the underlying data. By repeated queries, it may be possible to extract the training data used (Fredrikson et al., 2015).

Reprogramming system. Repurpose the system for another use. Example: an online pricing model is used to identify expected future market volatility.

Adversarial example in physical domain. Fool a system by changing some interface component. Example: an adversary determines that a pricing model has sensitivity to the volumes deep in the order book; by posting to this part of the book, they influence the model's behaviour.

Exploit software dependencies. The use of traditional software exploits. Example: the model relies on code dependencies, which are exploited by modifying the code to introduce nonsensical values, leading to a trading halt. (The 2016 NPM/left-pad debacle illustrates this external dependency risk, where a disgruntled developer deleted a tiny piece of code that 'broke' the internet (Collins, 2016).)

A good model not only fits historical data well, but also captures changes in the environments in which it is deployed. The challenge of updating models exists in both handcrafted and machine learning models, and reflects the basic challenge that finance does not operate according to stable physical laws, but arises from the interactions of many agents. The challenge of changing market behaviour can be significant: the overwhelming belief is that the value of a derivative and its underlying are kept in line by no-arbitrage. However, during the 2007-08 financial crisis, these relationships were observed to break down, as arbitrage calculations did not account for counterparty creditworthiness. As a result, a theoretical arbitrage opportunity was observable in the market, but was not available in practice (Baba and Packer, 2009). Handcrafted models typically require updates of only a few parameters to capture a shift in the data distribution. For overparameterised models this may not be the case, and a small change in the data may require a significant change in the model. For example, fraud detection models lose their discriminatory power against maliciously evolving strategies, and hedging strategies have to evolve as market conditions change. Offline machine learning suffers from a lack of robustness to distributional shifts, and hence a lack of online monitoring can significantly impair its performance (Sugiyama and Kawanabe, 2012). This has become particularly clear in recent years in other applications of machine learning. For example, in the airline industry it was quickly realised that the standard machine learning pricing models that study flight patterns, fuel costs, and user behaviour became useless during the Covid-19 pandemic, with data scientists choosing to fall back on traditional macroeconomic modelling (McCartney, 2020). On the other hand, online learning approaches have the promise of being able to dynamically and naturally adapt to new situations (Zeng and Klabjan, 2019; Soleymani and Paquet, 2020). This comes with significant issues, however, as these methods require training at a meta-level: the rate at which they adapt to new information needs to be tuned and adjusted, with rapid adjustment speeds typically associated with increased volatility in performance. Any given model provides only a crude approximation to reality; the risk of using an inadequate model is often hard to detect and quantify. While modern data science techniques are opening the door to more data-driven model selection mechanisms, this comes with new risks, as described previously. In this section, we argue that by combining old and new approaches, it is possible to regain control over newly emerging risks (e.g. lack of interpretability) while improving over classical models currently favoured by industry. We base our presentation on a few hybrid modelling approaches which have recently emerged in the research literature.
A natural idea is to incorporate prior knowledge or modelling into deep learning. This can be achieved by incorporating modelling constraints during training. However, as the number of constraints increases, and hence the search space of possible network parameters decreases, stochastic gradient descent algorithms struggle to find good solutions, so bespoke machine learning methods need to be developed. As mentioned above, using machine learning as a numerical tool introduces only modest model risks, while potentially providing significant speed and accuracy benefits. In Sabate-Vidales et al. (2018), the authors developed deep learning algorithms for solving parametric families of (path-dependent) partial differential equations ((P)PDEs) that arise in pricing and hedging. The key idea in these works is to use a probabilistic representation of the (P)PDE, and to learn both the solution and its gradient simultaneously. An advantage of this approach is that the gradient of the solution to the (P)PDE provides access to the hedging strategies. While this method is of interest in its own right, it can also be used as a control variate for unbiased Monte Carlo pricing. In other words, by combining deep learning with standard Monte Carlo pricing, one can remove the bias due to approximation with neural networks and easily compute confidence intervals (which are, in general, hard to obtain for large networks). This approach has been tested on several models and (path-dependent) payoffs. We stress that while the literature on deep learning for PDEs is growing rapidly, for finance applications it is critical to approximate parametric families of PDEs, where parameters correspond to the possible values of calibrated coefficients of the model. A similar observation has been made elsewhere in the literature. Another interesting approach, which combines ideas emerging from machine learning and classical modelling, has been put forward in (Lyons et al., 2019).
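The control variate idea can be sketched in a toy Black-Scholes setting with zero interest rate (an assumption made for brevity). Here, a Black-Scholes delta with a deliberately misspecified volatility stands in for the learned gradient of the (P)PDE solution: because discrete hedging gains are martingale increments with zero mean, subtracting them leaves the Monte Carlo estimator unbiased while greatly reducing its variance.

```python
import numpy as np
from math import sqrt, erf

# Approximate hedging delta: a Black-Scholes delta with a deliberately
# wrong volatility, standing in for the gradient learned by a network.
def approx_delta(s, k, tau, sigma):
    d1 = (np.log(s / k) + 0.5 * sigma**2 * tau) / (sigma * np.sqrt(tau))
    return 0.5 * (1.0 + np.array([erf(v / sqrt(2.0)) for v in d1]))

def mc_call_price(s0=100.0, k=100.0, t=1.0, sigma=0.2,
                  n_steps=50, n_paths=5_000, use_cv=True, seed=0):
    rng = np.random.default_rng(seed)
    dt = t / n_steps
    s = np.full(n_paths, s0)
    gains = np.zeros(n_paths)
    for i in range(n_steps):
        delta = approx_delta(s, k, t - i * dt, 0.25)   # misspecified vol
        s_next = s * np.exp(-0.5 * sigma**2 * dt
                            + sigma * sqrt(dt) * rng.standard_normal(n_paths))
        gains += delta * (s_next - s)   # martingale increments: zero mean
        s = s_next
    estimator = np.maximum(s - k, 0.0) - (gains if use_cv else 0.0)
    return estimator.mean(), estimator.std() / sqrt(n_paths)

plain, plain_se = mc_call_price(use_cv=False)
cv, cv_se = mc_call_price(use_cv=True)
```

Both estimators target the same price, but the control-variate standard error is substantially below the plain Monte Carlo one, which is what makes the resulting confidence intervals usable even though the 'learned' delta is only approximate.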
The key idea of this signature-based approach is to lift both modelling and pricing into the signature space. Intuitively, signatures provide efficient basis functions for representing functionals defined on path space (e.g. exotic derivatives or non-Markovian models), and play a similar role to polynomials on Euclidean space. In particular, the signature expansion of a path represents the values of integrals against that path, and so can capture the effect of dynamic trading and hedging. The classical idea of replicating an option via trading in the market then reduces to regressing the option payoff on the signature of the underlying and other vanilla securities. It has been shown that one can effectively represent many exotic derivatives using this signature expansion, and consequently obtain the prices of derivatives in terms of the expectation (under the pricing measure) of the signature expansion terms. Consequently, one only needs to calibrate expected signatures to market data, which in some settings can be done efficiently. The advantage of using signatures, when compared with recurrent neural networks, is that the computational cost does not increase with the number of time points in a time-series. The idea of model selection using signatures has also been proposed: here, one still works with a familiar SDE-type model, but aims to learn (possibly non-Markovian) coefficients from data. A viable approach to controlling the risk of non-transparent model specifications is to develop algorithms and training methods that embed expert knowledge into the architecture or training stage of machine learning. A handful of papers have attempted to embed financial domain knowledge into their models. These methods can offer regularisation, efficiency, consistency, and stability benefits.
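To make the signature terms concrete, the first two signature levels of a piecewise-linear path can be computed directly (a sketch; practical work uses dedicated signature libraries and higher truncation levels). Level 1 is the path's total increment, while level 2 collects the iterated integrals S^{ij} = ∫∫_{u<v} dX^i_u dX^j_v, whose antisymmetric part is the Lévy area.

```python
import numpy as np

# First two signature levels of a piecewise-linear path.
# Level 1: the total increment. Level 2: iterated integrals, computed by
# accumulating, for each new segment, the interaction of the increment so
# far with the segment increment (plus the segment's own half-square term).
def signature_levels_1_2(path):
    path = np.asarray(path, float)        # shape (n_points, d)
    inc = np.diff(path, axis=0)           # piecewise-linear increments
    level1 = inc.sum(axis=0)
    d = path.shape[1]
    level2 = np.zeros((d, d))
    running = np.zeros(d)                 # increment accumulated so far
    for delta in inc:
        level2 += np.outer(running, delta) + 0.5 * np.outer(delta, delta)
        running += delta
    return level1, level2

# A closed (counterclockwise) circular path, discretised at 100 points.
t = np.linspace(0.0, 1.0, 100)
path = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)], axis=1)
level1, level2 = signature_levels_1_2(path)
```

For this closed path, level 1 vanishes, while half the antisymmetric part of level 2 recovers the enclosed area (approximately π), illustrating how level 2 retains ordering information that level 1 discards; the symmetric part always satisfies the identity level2 + level2ᵀ = level1 ⊗ level1.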
Drawing from the review by Ruf and Wang (2020b), methods that adjust the architectural design of neural networks include models that incorporate a homogeneity hint, by training a neural network in two parts, the first controlling for moneyness and the other for time-to-maturity (Garcia and Gençay, 1998). Other methods restrict the shape of outputs (Dugas et al., 2001) or enforce no-arbitrage conditions such as convexity and monotonicity of a neural network pricing function (Zheng et al., 2019). Approaches that impart expert knowledge at the training stage include data augmentation, which involves the generation of synthetic data to help with neural network training (Yang et al., 2017), adjustments to the penalty terms of the loss function to promote no-arbitrage (Itkin, 2019; Ackerer et al., 2019), as well as the development of bespoke training algorithms for neural networks for option hedging, including the use of the extended Kalman filter, sequential Monte Carlo, and evolutionary algorithms (Niranjan, 1996; de Freitas et al., 2000; Palmer, 2019). A safe and efficient transition toward using machine learning in finance is only possible when models and methods are well understood and tested on reliable data sets. In other areas of machine learning, standard benchmarks and data sets are a common way to proof-test new methodologies. For example, recent advances in computer vision and reinforcement learning were significantly accelerated by the emergence of challenging benchmarks, such as ImageNet (Deng et al., 2009) or ALE (Bellemare et al., 2013). These benchmarks have enabled open, systematic cross-validation of various AI solutions. In machine learning, the term 'benchmarking' has been used to refer to the evaluation and comparison of machine learning models, particularly regarding their ability to learn patterns from benchmark datasets (Olson et al., 2017).
This process can be thought of as a check to validate the improvement of a new method, but also, more broadly, as a way to identify the respective advantages and disadvantages of each method. Comparisons can be made across a wide range of metrics, for example accuracy in detecting signals, interpretability, and computational complexity. Currently, in finance, various algorithms and machine learning methods are tested on disparate data sets, which are often only accessible to a small community or at high cost. A consequence of this is that very little comparison of methods is done, and we have little understanding of the appropriateness or optimality of these methods. In addition, evaluating new AI techniques on real-world applications often requires expert domain knowledge and consideration of scalability and the cost of development. A key difficulty, in financial applications, is that a more open approach to benchmarking will often involve revealing details of each participant's methodologies. While this is reasonable within the academic community, within industry it is clear that confidentiality is needed, both regarding algorithms and, in some cases, their performance. For this reason, it is important to build our understanding of which problems can be discussed and benchmarked in a public way, and which related data science problems provide insight for those cases where confidentiality is needed. The datasets which the benchmarking literature has studied most thoroughly come from real-world data and from simulated data with known underlying patterns. As alluded to before, in finance there are relatively few datasets that have been made publicly available, and often these contain only a small sample of the data that would be needed in practice. There is therefore a clear opportunity for finance to benefit from synthetic data generators. Synthetic data has been used in other fields 5 but has not yet flourished in the financial literature.
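Even a very simple generator illustrates the idea (the sketch below is a stationary block bootstrap of historical returns, far cruder than the deep generative models discussed earlier): resampling blocks, rather than single observations, preserves some short-range dependence while producing arbitrarily many synthetic samples.

```python
import numpy as np

# A deliberately simple synthetic data generator: block bootstrap of
# historical returns. Blocks of consecutive observations are resampled
# with replacement, preserving some short-range dependence structure.
def block_bootstrap(returns, n_out, block_len=20, seed=0):
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, float)
    out = []
    while sum(len(b) for b in out) < n_out:
        start = rng.integers(0, len(returns) - block_len + 1)
        out.append(returns[start:start + block_len])
    return np.concatenate(out)[:n_out]

# Heavy-tailed stand-in for a historical return series.
historical = np.random.default_rng(1).standard_t(df=4, size=1_000) * 0.01
synthetic = block_bootstrap(historical, n_out=5_000)
```

Such a generator only replays patterns already present in the sample, so, as for SDGs generally, any benchmark built on it inherits the limitations of the original data.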
Benchmarking has its own problems, many of which are not new to machine learning. There has been increasing concern that published research findings are misleading, due to the number of studies addressing the same questions and datasets (Ioannidis, 2005). Benchmarking has a similar problem, in that many models probe the same unchanging datasets, leading to a lack of generalisation. Studies reveal that the accuracy of state-of-the-art deep learning models can drop by 4-10% when moving to a new test set, highlighting the risk of overfitting (Recht et al., 2018). For this reason, the regular evaluation and updating of benchmarks remains important for future development. In order to be reliably implemented, algorithms must be robust with respect to a variety of objectives (e.g. safety, accuracy). Summarizing the range of adversarial challenges outlined above, we see that machine learning pipelines should come with robustness guarantees against: (i) shifts in data distribution (distributional robustness), (ii) intentional input manipulations (adversarial robustness), and (iii) intentional feature manipulation to 'game' the system (strategic robustness). Recent work (Huang et al., 2017) has begun to address these issues for neural network based models. Drawing on adversarial machine learning and distributionally robust optimisation (Rahimian and Mehrotra, 2019; Cohen et al., 2019a; Wicker et al., 2020), it is possible to certifiably train models to provably ensure robustness, by providing guaranteed bounds on the probability of the model output (decision) satisfying a combination of objectives. Data-driven models cannot automatically guarantee model robustness (Kwiatkowska, 2019). An adversarial defence is anything that decreases the efficacy of adversarial attacks.
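To see what such a defence is up against, the following is a minimal sketch of an adversarial input perturbation (the fast gradient sign method applied to a logistic classifier, standing in for a trained trading-signal model; the numbers are purely illustrative):

```python
import numpy as np

# Fast gradient sign method (FGSM) against a logistic classifier: each
# feature is nudged by +/- eps in the direction that most increases the
# loss, often flipping the prediction with a small perturbation.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    # gradient of the logistic loss with respect to the input is (p - y) * w
    p = sigmoid(x @ w + b)
    return x + eps * np.sign((p - y) * w)

w = np.array([2.0, -1.0]); b = 0.0      # fitted classifier (stand-in)
x = np.array([0.3, 0.1])                # classified positive: score > 0
y = 1.0                                 # true label
score_before = x @ w + b
x_adv = fgsm(x, y, w, b, eps=0.4)
score_after = x_adv @ w + b             # sign flips: prediction changes
```

A perturbation of at most 0.4 per feature flips the classification; adversarial training would add such perturbed points, with their correct labels, back into the training set.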
There are a range of techniques that can be used to provide adversarial defenses; they can generally be classified into adversarial training methods, randomisation-based schemes, denoising methods, and provable defenses.
• Adversarial training techniques train a neural network directly on adversarial samples. This is one of the most effective defenses against attacks, as revealed in benchmark studies (Madry et al., 2017) , and can be thought of as a preprocessing technique.
• Randomisation schemes can also protect against perturbations of the inputs. These generally involve some transformation, such as random resizing, or can be achieved by adding a noise layer to the neural network (Liu et al., 2018) .
• Denoising inputs in the prediction phase can help to rectify or remove adversarial perturbations. This denoising can be done with generative adversarial networks or autoencoders, and can be thought of as a postprocessing technique (Xie et al., 2019) .
• Provable defenses are unlike the above approaches, in that they are theoretically proven, rather than only experimentally validated. These methods can certify a level of robustness before the prediction stage (Balunovic and Vechev, 2019) .
The defenses listed here can only verify and protect a system against a limited number of attacks. Security vulnerabilities will have to be dealt with using domain expertise, rather than by relying on generalist defense mechanisms. Adversarial defenses will not protect against a badly developed model, and appropriate fail-safe mechanisms and human oversight remain a critical part of implementation. Explainability allows human oversight of machine learning to be carried out effectively, ensuring that model risk is understood and controlled.
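Returning to the first of these defenses: adversarial training can be sketched on a toy model. The data, the logistic-regression 'network' and the attack budget below are illustrative assumptions, and on a linear model the gains are modest, but the mechanics (generate fast-gradient-sign samples, then train on them) are the same as in the deep setting:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary task: two Gaussian blobs in 2-D.
X = np.vstack([rng.normal(-1.0, 0.7, (200, 2)),
               rng.normal(1.0, 0.7, (200, 2))])
y = np.concatenate([np.zeros(200), np.ones(200)])

def fgsm(w, b, eps):
    """Fast gradient sign attack: move each input in the direction that
    increases the logistic loss, with budget eps per coordinate."""
    grad_x = (sigmoid(X @ w + b) - y)[:, None] * w[None, :]
    return X + eps * np.sign(grad_x)

def train(adversarial=False, eps=0.3, lr=0.1, steps=300):
    w, b = np.zeros(2), 0.0
    for _ in range(steps):
        # Adversarial training: replace the batch by its attacked version.
        Xb = fgsm(w, b, eps) if adversarial else X
        p = sigmoid(Xb @ w + b)
        w -= lr * Xb.T @ (p - y) / len(y)
        b -= lr * float(np.sum(p - y)) / len(y)
    return w, b

for adv in (False, True):
    w, b = train(adversarial=adv)
    clean = np.mean((sigmoid(X @ w + b) > 0.5) == y)
    robust = np.mean((sigmoid(fgsm(w, b, 0.3) @ w + b) > 0.5) == y)
    print(f"adversarial training={adv}: clean={clean:.2f}, robust={robust:.2f}")
```

The same loop structure carries over to neural networks, where the attacked batch is generated by backpropagating through the network to the inputs.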
Understanding the causes behind performance is a common part of risk management; for example, the 'Profit and Loss attribution test', which forms part of the Fundamental Review of the Trading Book (BIS, 2019), requires a bank's hypothetical profits under front-office pricing models to be explained against its back-office risk models and factors, as part of the validation of those risk models. The understanding of models and their risks is a significant challenge in finance. The 2007-2008 financial crisis demonstrated that copulas, especially those proposed by Li (2000) , were underpowered for modelling the risks of CDOs, yet were still too large and complex to be understood and critiqued by users. In contrast, machine learning models are overpowered and come with polished user interfaces, but are even more obscure. Machine learning has been promoted in well-cited papers as a method for systemic risk analysis, with only limited discussion of the risks of using machine learning and its lack of interpretability (Kou et al., 2019; Aziz and Dowling, 2019) . Neural networks are not inherently explainable, as input features become entangled and compressed into a single value via repeated non-linear transformations of weighted sums (Xie et al., 2020) . Explainability can be improved by selecting, from the outset, a more interpretable 'white box' model, that is, by adopting models which are intrinsically easier to query and understand. Neural network models can be made more interpretable through joint training (Hendricks et al., 2016; Iyer et al., 2018) or by including attention mechanisms (Bahdanau et al., 2016; Devlin et al., 2018; Anderson et al., 2018) . Although these solutions apply to neural networks in general, they do not necessarily apply in a reinforcement learning framework. In this setting, rule-based (Verma et al., 2018; Hein et al., 2017) or hierarchical (Shu et al., 2017) methods are available.
The purpose of rule-based methods is to present policies in a high-level human-readable language, e.g. as IF-THEN sequences. Hierarchical methods divide policies into simpler sub-tasks, each of which is individually more interpretable than a flat policy; they are therefore useful for explaining individual decisions, i.e. they provide 'local' interpretability. These interpretable models generally forgo some performance in exchange for comprehensibility. As performance is often the primary concern, explainability techniques which can be applied to a black-box model are also needed. These techniques can be grouped under the name 'post-hoc' explainability. Post-hoc explanation methods are broad, and include perturbation analysis, gradient analysis, example-based explanations, and surrogate modelling for local and global explanations (Adadi and Berrada, 2018) . Different applications and tasks require a different balance between explainability and performance. 'Deep' reinforcement learning is based on neural network models, and adds an additional layer of incomprehensibility to the modelling process (Mnih et al., 2013) . Reinforcement learning models are complex, but it is often possible to use interpretable surrogate models to simplify and represent their actions; this is often easier than developing inherently interpretable models (Puiutta and Veith, 2020) . A range of surrogates are available for this purpose, including genetic programming techniques (Hein et al., 2018) , causal DAGs (Madumal et al., 2019) and tree-based models that approximate predictions (Coppens et al., 2019) . However, when using surrogate models for explainability, it is wise to keep the underlying model as simple as possible, in order to make it easier for a surrogate model to reproduce its outputs.

Models not only have to be validated on historical data, i.e. benchmarked; they also have to be monitored and controlled when running 'live'.
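A minimal sketch of such live monitoring, assuming numpy: here the drift statistic is a two-sample Kolmogorov-Smirnov distance between a validation-time reference window and a live window, and the alert threshold is an illustrative choice that would in practice be calibrated to a tolerated false-alarm rate:

```python
import numpy as np

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of two samples."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(3)
reference = rng.normal(0.0, 1.0, 1000)   # inputs seen at validation time

# Live stream: initially matches the reference, then the mean shifts.
stable = rng.normal(0.0, 1.0, 1000)
shifted = rng.normal(1.5, 1.0, 1000)

THRESHOLD = 0.1   # illustrative alert level
for name, window in [("stable period", stable), ("shifted period", shifted)]:
    d = ks_distance(reference, window)
    flag = "ALERT: possible drift" if d > THRESHOLD else "ok"
    print(f"{name}: KS distance = {d:.3f} -> {flag}")
```

The same pattern applies to monitoring model residuals, or to comparing the model's performance against handcrafted 'control' strategies run in parallel.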
In machine learning, this is related to 'concept drift', which refers to data distributions changing over time, leading to faulty predictions (Žliobaitė et al., 2016) . The hope is that, with online learning incorporated into the approach, models can self-diagnose and self-correct when this occurs, but this is not always the case. Continuous recalibration may not be possible in all settings, due to regulatory requirements and the cost of recalibration (Cohen et al., 2019b) . A good survey of concept and data drift, and how to deal with them, can be found in Gama et al. (2014) . The importance of monitoring, recalibrating and updating systems, and of ensuring sufficient human control, is a key part of the implementation of most automated systems in practice. In a financial setting, we might also want to base the criteria for drift on the performance of other methods (for example, handcrafted strategies) that are run in parallel as 'controls'. This allows one to study those occasions on which the performance of the controls differed significantly from that of the model, highlighting points of concern. Machine learning models need more extensive monitoring procedures than handcrafted approaches, due to the additional risks they carry. Nevertheless, the promise of improved performance, the flexibility of modelling, and the speed advantages associated with embracing these new technologies mean that there is no doubt about their broad incorporation into many parts of the finance industry.

References

Deep smoothing of the implied volatility surface
Peeking inside the black-box: a survey on explainable artificial intelligence (XAI)
Optimal execution of portfolio transactions
Fairwashing explanations with off-manifold detergent
Bottom-up and top-down attention for image captioning and visual question answering
Generalized parameter functions for option pricing
Sig-SDEs model for quantitative finance
Asset pricing under endogenous expectations in an artificial stock market. In: The economy as an evolving complex system II
Machine learning and AI for risk management
Interpreting deviations from covered interest parity during the financial market turmoil of 2007-08
Neural machine translation by jointly learning to align and translate
Adversarial training and provable defenses: Bridging the gap
Neural networks for contingent claim pricing via the Galerkin method
On deep calibration of (rough) stochastic volatility models
Deep splitting method for parabolic PDEs
The faults in our pi stars: Security issues and open challenges in deep reinforcement learning
Fast agent-based simulation framework of limit order books with applications to pro-rata markets and the study of latency effects
The arcade learning environment: An evaluation platform for general agents
Deep learning
MAR32 - Internal models approach: backtesting and P&L attribution test requirements
Deep hedging: hedging derivatives under generic market frictions using reinforcement learning
A data-driven market simulator for small data environments
ABIDES: Towards high-fidelity market simulation for AI research
The shadow price of latency: Improving intraday fill ratios in foreign exchange markets
SMOTE: synthetic minority over-sampling technique
Certified adversarial robustness via randomized smoothing
Switching cost models as hypothesis tests
Detecting and repairing arbitrage in traded option prices
How one programmer broke the internet by deleting a tiny piece of code. Quartz magazine
Distilling deep reinforcement learning policies in soft decision trees
Robust pricing and hedging of double no-touch options
European option pricing with transaction costs
Hierarchical Bayesian models for regularization in sequential learning
ImageNet: A large-scale hierarchical image database
BERT: Pre-training of deep bidirectional transformers for language understanding
Machine Learning in Finance
Incorporating second-order functional knowledge for better option pricing
Incorporating functional knowledge in neural networks
Robust pricing and hedging of options on multiple assets and its numerics
Model inversion attacks that exploit confidence information and basic countermeasures
Time series simulation by conditional generative adversarial net
A survey on concept drift adaptation
Option pricing with neural networks and a homogeneity hint
Pricing and hedging derivative securities with neural networks and a homogeneity hint
The Volatility Surface: A Practitioner's Guide
Neural networks approach to pricing options
Robust pricing and hedging via neural SDEs
Federated learning via synthetic data
Empirical asset pricing via machine learning
Managing smile risk
Particle swarm optimization for generating interpretable fuzzy reinforcement learning policies
Interpretable policies for reinforcement learning by genetic programming
How implicit regularization of neural networks affects the learned function, part I
Generating visual explanations
Generative models for financial data. Available at SSRN 3408007
Robust hedging of the lookback option
Deep learning volatility: a deep neural network perspective on pricing and calibration in (rough) volatility models
Mean-field Langevin dynamics and energy landscape of neural networks
Open graph benchmark: Datasets for machine learning on graphs
Safety verification of deep neural networks
A nonparametric approach to pricing and hedging derivative securities via learning networks
Why most published research findings are false
Deep learning calibration of option pricing models: some pitfalls and solutions
Transparency and explanation in deep reinforcement learning neural networks. AAAI/ACM Conference on AI, Ethics, and Society
Mean-field neural ODEs via relaxed optimal control
Deep PPDEs for rough local stochastic volatility. Available at SSRN 3400035
Dynamic replication and hedging: A reinforcement learning approach
The market generator. Available at SSRN 3384948
Generative adversarial networks for financial trading strategies fine-tuning and combination
Machine learning methods for systemic risk analysis in financial sectors. Technological and Economic Development of Economy
Adversarial machine learning: industry perspectives
Safety verification for deep neural networks with provable guarantees
Improved option pricing using artificial neural networks and bootstrap methods
Agent-based computational finance: Suggested readings and early research
A microscopic model of the stock market: cycles, booms, and crashes
On default correlation: A copula function approach
Beyond the black box: an intuitive approach to investment prediction with machine learning
Using GANs for sharing networked time series data: Challenges, initial promise, and open questions
The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery
Towards robust neural networks via random self-ensemble
Adaptive markets: Financial evolution at the speed of thought
Valuing American options by simulation: A simple least-squares approach
The variational fair autoencoder
Nonparametric pricing and hedging of exotic derivatives
Material signals: A historical sociology of high-frequency trading
Towards deep learning models resistant to adversarial attacks
Explainable reinforcement learning through a causal lens
PAGAN: Portfolio analysis with generative adversarial networks
Coronavirus Has Upended Everything Airlines Know About Pricing
An artificial neural network representation of the SABR stochastic volatility model. Available at SSRN 3288882
Neural network super architectures
Navigating Uncharted Waters: A Roadmap to Responsible Innovation with AI in Financial Services. Part of the Future of
Optimal hedging with continuous action reinforcement learning. Industrial Engineering and Management
Playing Atari with deep reinforcement learning
Extending the feature set of a data-driven artificial neural network model of pricing financial options
Deep recurrent factor model: interpretable non-linear and time-varying multi-factor model
Implicit regularization in deep learning
Conditional Sig-Wasserstein GANs for time series generation
How to Build an Exchange. Jane Street
Double deep Q-learning for optimal execution
Sequential tracking in pricing financial options using model based and neural network approaches
Fine-tune your smile: Correction to Hagan et al. Wilmott Magazine
PMLB: a large benchmark suite for machine learning evaluation and comparison
Artificial economic life: a simple model of a stockmarket
Evolutionary algorithms and computational methods for derivatives pricing
Machine learning from imbalanced data sets 101
Explainable reinforcement learning: A survey
Distributionally robust optimization: A review
Do CIFAR-10 classifiers generalize to CIFAR-10?
Hedging with neural networks
Neural networks for option pricing and hedging: a literature review
Unbiased deep solvers for parametric PDEs
Solving path dependent PDEs with LSTM networks and path signatures
Hierarchical and interpretable skill acquisition in multi-task reinforcement learning
Machine learning in asset management, part 2: Portfolio construction - weight optimization
Financial portfolio optimization with online deep reinforcement learning and restricted stacked autoencoder - DeepBreath
Machine learning in non-stationary environments: Introduction to covariate shift adaptation
Modeling financial time-series with generative adversarial networks
Programmatically interpretable reinforcement learning
Variational autoencoder based synthetic data generation for imbalanced learning
Optimal hedging of options with small but arbitrary transaction cost structure
Probabilistic safety for Bayesian neural networks
Feature denoising for improving adversarial robustness
Explainable deep learning: A field guide for the uninitiated
Synthesizing tabular data using generative adversarial networks
Gated neural networks for option pricing: Rationality by design
Online adaptive machine learning based algorithm for implied volatility surface modeling. Knowledge-Based Systems
Gated deep neural networks for implied volatility surfaces
An overview of concept drift applications. In: Big data analysis: new algorithms for a new society