Learning a functional control for high-frequency finance

Laura Leal, Mathieu Laurière, Charles-Albert Lehalle

2020-06-17

Abstract. We use a deep neural network to generate controllers for optimal trading on high-frequency data. For the first time, a neural network learns the mapping between the preferences of the trader, i.e. risk aversion parameters, and the optimal controls. An important challenge in learning this mapping is that in intraday trading, the trader's actions influence price dynamics in closed loop via the market impact. The exploration-exploitation trade-off generated by efficient execution is addressed by tuning the trader's preferences to ensure that long enough trajectories are produced during the learning phase. The issue of scarcity of financial data is solved by transfer learning: the neural network is first trained on trajectories generated by a Monte-Carlo scheme, leading to a good initialization before training on historical trajectories. Moreover, to answer genuine requests of financial regulators on the explainability of machine-learning-generated controls, we project the obtained "black-box controls" on the space usually spanned by the closed-form solution of the stylized optimal trading problem, leading to a transparent structure. For more realistic loss functions that have no closed-form solution, we show that the average distance between the generated controls and their explainable version remains small. This opens the door to the acceptance of ML-generated controls by financial regulators.

1 Introduction

Financial mathematics uses stochastic control to ensure that market participants operate as intermediaries and not as unilateral risk takers: investment banks have to design risk-replicating strategies, systemic banks have to ensure they have plans in case of fire sales triggered by economic uncertainty, asset managers have to balance risk and returns in their portfolios, and brokers have to make sure that investors' large buy or sell orders are executed without distorting prices. The latter took a primary role in post-2008 markets, since participants understood that the preservation of liquidity is of primary importance. Our paper addresses this last case, which belongs to the academic field of optimal trading, initially introduced by [1] and [3] and then extended in many ways, from sophisticated stochastic control [5] to Gaussian-quadratic approximations that yield closed-form solutions like [13] or [14], or under a self-financing equation context in [12]. Learning was introduced in this field either via on-line stochastic approximation [29] or in the context of games with partial information [9]. More recently, Reinforcement Learning (RL) approaches have been proposed, like in [20] to face the high dimensionality of optimal trading of portfolios, or in [33] to adapt the controls in an online way; see also [15] for an overview. For optimal control problems in continuous time, the traditional approach starts with the cost function and derives, through dynamic programming, a Hamilton-Jacobi-Bellman (HJB) equation involving the optimal control and the value function (that is, the optimum of the Q-function in RL). This Partial Differential Equation (PDE) can be written explicitly only once the dynamics to be controlled are stylized enough.
When it does not have a closed-form solution, this PDE can be solved either by a deterministic method (like a finite difference scheme), which is limited by the dimension of the problem, or by RL approaches, like in [21] for optimal trading or in [7] for deep hedging. More generally, deep learning techniques for PDEs and Backward Stochastic Differential Equations (BSDEs) have recently attracted a lot of interest [36, 24, 34, 17, 23] and found numerous applications such as price formation [35], option pricing [31] or financial systems with a continuum of agents [18, 10, 11]. In this paper, we propose to skip most of the previous steps: we directly learn the optimal control on the discretized version of the dynamics and of the cost function that would usually give rise to the HJB equation, without having to derive it. This approach, also used e.g. in [19, 22] for optimal control, gives us the freedom to address dynamics for which deriving a PDE explicitly is not possible, and to learn directly on trajectories coming from real data.

In optimal trading, the control has a feedback loop with the dynamics via price impact: the faster you trade, the more you move the price in a detrimental way [2]. In our setup the controller sets the trading speed of an algorithm in charge of buying or selling a large amount of stock shares, solving the trade-off between trading fast, to reduce the exposure to the uncertainty of future prices, and trading slowly, so as not to pay too much in market impact. For numerical applications we use high-frequency (HF) data from the Toronto Stock Exchange (TSX) on 19 stocks over two years, each of them generating around 6,500 trades (i.e. data points) per day. We compare different controls: controls generated by a widely known stylized version of the problem that has a closed-form formula [14], controls learned on simulated dynamics whose parameters are estimated on the data in different ways (with or without stationarity assumptions), and controls learned on real data. For the latter case, we transfer the learning from simulated data to real data, to make sure we start with a good initialization of the controller.

In our setup, a controller maps the state space describing the price dynamics and the trader's inventory to a trading speed, which leads to an updated version of the state, and so on until the end of the trading day. The same controller is used at every decision step, which corresponds to one decision every 5 minutes. When the controller is a neural net, it is used multiple times before the loss function can be computed and improved via backpropagation, in the spirit of [30]. Our first main contribution is that we not only train a neural net for given values of the end-user's preferences, but we also train a neural net having two more inputs, namely the trader's preferences (i.e. risk aversion parameters), so that this neural net learns the mapping between the preferences (i.e. the hyper-parameters of the control setup) and the optimal controls. To the best of our knowledge, it is the first time that a neural net performs this kind of "functional learning" for optimal trading: the neural net learns the optimal control for a range of cost functions that are parametrized by the risk preferences. In the paper we call it a "multi-preferences neural net". The second major contribution of this paper is the way we compare the generated controls in a meaningful way, paving the way to methods which satisfy the request of financial regulators about the explainability of learned controls.
We start from the functional space of controls spanned by the closed-form solution of the stylized problem: these controls are non-linear in the remaining time to trade and affine in the remaining quantity to trade (see [9] for a description of the relationship between the optimal controls and the space generated by the h_1(t) and h_2(t) defined later in the paper). Hence we project the learned controls on an affine basis of functions, for each slice of remaining time to trade T − t. Doing so, we provide evidence of the distance between the black-box controls learned by a neural network and the ones with which end-users and regulators are familiar. We can compare the controls in this functional space, and we know, thanks to the R² of the linear regressions, the average distance between this very transparent representation of the controls and the ones generated by the neural controllers. In practice, we show that when the loss function is mainly quadratic, the learned controls, even when they are trained on real data, span roughly the same functional space as the closed-form ones. End-users and regulators can consider them with the same monitoring tools and analytics as more standard (but sub-optimal) controls. When the loss function is more realistic but no longer quadratic, taking into account the mean-reverting nature of intraday price dynamics [27], the generated controls may or may not belong to the same functional space.

The structure of the paper is as follows: Section 2 presents the optimal execution problem setup, focusing on the loss function and the closed-form formula associated with a stylized version of the problem that we will use as a benchmark. It also introduces the architecture of the neural networks we use and our learning strategies. Section 3 describes the dataset and the stylized facts of intraday price dynamics that should be taken into account, and motivates learning in a data-driven environment. Section 4 presents the numerical results and our way to tackle the explainability of the generated controls. Conclusions and perspectives are provided in Section 5.

2 The optimal execution model

Optimal trading deals with an agent who would like to execute a large buy or sell order in the market before a time horizon T. Here, their control is a trading speed ν_t ∈ R. The core of the problem is to find the balance between trading too fast (and hence moving the price in a detrimental way and paying trading costs, as modelled here in (1) and (3)) and trading too slow (and being exposed to not finishing the order and to the uncertainty of future prices, reflected in (2) and (4)). They maximize their wealth while taking into account a running cost and a final cost of holding inventory, and they are constrained by the dynamics of the system, which are described by the evolution of the price S_t, their inventory Q_t, and their wealth X_t. In the market, the price process of the asset they want to trade evolves according to

S_{t+Δt} = S_t + α_t ν_t Δt + σ √Δt ε,    (1)

where α_t > 0 is a parameter that accounts for the drift induced by the trading flow (the permanent market impact, see the Appendix), Δt is the size of a time step, σ > 0 is a constant volatility term and ε ∼ N(0, 1) is a noise. The state of the investor at time t is described by the tuple (T − t, Q_t, X_t), with Q_t being their inventory and X_t being their wealth at time t. To isolate the inventory execution problem from other portfolio considerations, the wealth of the agent at time 0 is taken to be 0, i.e. X_0 = 0. Suppose the agent is buying inventory (the selling problem is symmetric).
Then Q_0 < 0 and the trading speed ν will be mostly positive throughout the trajectory. The evolution of the inventory can be written as

Q_{t+Δt} = Q_t + ν_t Δt.    (2)

The wealth of the investor evolves according to

X_{t+Δt} = X_t − ν_t (S_t + κ ν_t) Δt,    (3)

where κ > 0 is a constant and the term κ ν_t represents the temporary market impact from trading at time t. As we can see, this is a linear function of ν_t that acts as an increment to the mid-price S_t. It can also be seen as the cost of "crossing the spread", or a transaction cost. The cost function of the investor fits the framework considered e.g. in [14]. The agent seeks to maximize over the trading strategy ν the reward function given by

J_{A,φ}(ν) = E[ X_T + Q_T S_T − A |Q_T|^γ − φ ∫_0^T |Q_t|^γ dt ],    (4)

with γ > 0. We mostly focus on the standard case γ = 2, but we will also consider the case γ = 3/2, which better takes into account the sub-diffusive nature of intraday price dynamics, see [27]. X_T, S_T, and Q_T are the random variables denoting the terminal values of the stochastic processes described in the dynamics (1)-(3). A > 0 and φ > 0 are constants representing the risk aversion of the agent: A penalizes holding inventory at the end of the time period, and φ penalizes holding inventory throughout the trading day. Together, they parametrize the control of the optimal execution model and stand for the agent's preferences. The trading agent wants to find the optimal control ν for the cost (4) subject to the dynamics (1)-(3). In the sequel, we solve this problem using a neural network approximation for the optimal control. As a benchmark for comparison, we will use the closed-form solution of the corresponding continuous-time problem, which is derived through a PDE approach. The closed-form solution of the PDE is well defined only when the data satisfies suitable assumptions. The neural net learns the control ν_t directly from the data, while finding a closed-form solution of the PDE is not always possible. It can be shown, see [13], that the optimal control ν* is an affine function of the inventory, which can be written explicitly as

ν*_t = h_1(t)/(2κ) + (α + h_2(t))/(2κ) · Q_t,    (5)

where h_2 and h_1 are the solutions of a system of Ordinary Differential Equations (ODEs).

The neural network setup. The neural network implementation we propose bypasses all the usual derivations of the [13] framework. We no longer need to find the PDE corresponding to the stochastic optimal control problem, and we no longer need to break it down into ODEs and solve the ODE system: we can directly approximate the optimal control. The deep neural network approximation looks for a control optimizing the agent's objective function (4) while being constrained by the dynamics (1)-(3), without any other derivations. We define one single neural network f_θ(·) to be trained for all the time steps. Each iteration of the stochastic gradient descent (SGD) proceeds as follows. Starting from an initial point (S_0, X_0, Q_0), we simulate a trajectory using the control ν_t = f_θ(t, Q_t). Based on this trajectory, we compute the gradient of the associated cost with respect to the neural network's parameters θ. Finally, the network's parameters are updated based on this gradient. In our implementation, the learning rate is updated using Adaptive Moment Estimation (Adam) [26], which is well suited for situations with a large amount of data, and also for non-stationary, noisy problems like the one under consideration. In the Monte-Carlo simulation mode, we generate the random increments of the Brownian motion through the term σ √Δt ε, where ε ∼ N(0, 1) comes from the standard Gaussian distribution.
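To make this training loop concrete, here is a minimal sketch of one such unrolled SGD iteration in TensorFlow 2 (the library used in our implementation). It is only an illustration, not the code used to produce the results of this paper: the architecture matches the description given below (three tanh layers of five nodes with dropout 0.2, Adam with learning rate 5e-4, mini-batches of 64), while the model parameters, initial conditions, function names and number of iterations are arbitrary placeholders, and the state updates follow the discrete dynamics (1)-(3).

```python
import tensorflow as tf

# Illustrative model parameters (placeholders, not the calibrated values of the paper).
T_STEPS, DT = 77, 1.0 / 77          # decision steps over one trading day
ALPHA, SIGMA, KAPPA = 0.1, 0.3, 0.05
A, PHI, GAMMA = 0.01, 0.007, 2.0
Q0, S0 = -1.0, 10.0                 # initial inventory (buy program) and price

# Controller f_theta: (t, Q_t) -> trading speed nu_t.
controller = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="tanh"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5, activation="tanh"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5, activation="tanh"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-4)

def sgd_step(batch_size=64):
    """One SGD iteration: unroll a full trading day, then backpropagate."""
    with tf.GradientTape() as tape:
        S = tf.fill((batch_size, 1), S0)
        Q = tf.fill((batch_size, 1), Q0)
        X = tf.zeros((batch_size, 1))
        running_penalty = tf.zeros((batch_size, 1))
        for k in range(T_STEPS):
            t = tf.fill((batch_size, 1), k * DT)
            nu = controller(tf.concat([t, Q], axis=1), training=True)
            eps = tf.random.normal((batch_size, 1))
            # Closed-loop discrete dynamics: the control feeds back into the price.
            X = X - nu * (S + KAPPA * nu) * DT          # pay mid-price plus temporary impact
            S = S + ALPHA * nu * DT + SIGMA * tf.sqrt(DT) * eps
            Q = Q + nu * DT
            running_penalty += PHI * tf.abs(Q) ** GAMMA * DT
        reward = X + Q * S - A * tf.abs(Q) ** GAMMA - running_penalty
        loss = -tf.reduce_mean(reward)                  # maximize the reward (4)
    grads = tape.gradient(loss, controller.trainable_variables)
    optimizer.apply_gradients(zip(grads, controller.trainable_variables))
    return float(loss)

for it in range(1000):   # the paper uses 100,000 iterations
    sgd_step()
```

In the multi-preferences variant, the preference pair (A, φ) is simply appended to the network input and re-sampled at each iteration, so that a single set of weights covers a whole range of preferences.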
Using the Gaussian increments, along with the ν_t obtained from the neural network and the three process updates (ΔS_t, ΔX_t, ΔQ_t), we can compute the new state variables in discrete time:

S_{t+Δt} = S_t + ΔS_t = S_t + α_t ν_t Δt + σ √Δt ε,
Q_{t+Δt} = Q_t + ΔQ_t = Q_t + ν_t Δt,
X_{t+Δt} = X_t + ΔX_t = X_t − ν_t (S_t + κ ν_t) Δt.

The new state variable Q_{t+Δt}, along with the time variable, then serves as input to learn the control at the next time step. This cycle continues until we have reached the time horizon T. Since we are using the same neural network f_θ for all the time steps, we expect it to learn as a function of both time and inventory.

Neural network architecture and parameters. The architecture consists of three dense hidden layers containing five nodes each, using the hyperbolic tangent as activation function. Each hidden layer has a dropout rate of 0.2. We add one last layer, without activation function, that returns the output. The learning rate is η = 5e-4; the mini-batch size is 64; the tile size is 3; and the number of SGD iterations is 100,000. (The tile size stands for how many samples of the inventory we combine with each sampled pair; for a tile size of 3, each pair is combined with three inventory samples, indexed by the tile index j. This is useful when using real data, because we can generate more scenarios than we would ordinarily be able to.) Every 100 SGD iterations, we perform a validation step in order to check the generalization error, instead of evaluating just the mini-batch error. In our implementation, we used TensorFlow. Although the neural network's basic architecture is not very deep, each SGD step involves T loops over the same neural network, because of the closed loop which the problem entails. The number of layers is thus artificially multiplied by the number of time steps considered.

For the inputs, we have two setups. In the basic setup (as discussed above), the input is the pair (t, Q_t). In the second case, which we call the "multi-preferences neural network", the input is the tuple (t, Q_t, A, φ) and the neural network learns to optimize the objective J_{A,φ}(·) for all (A, φ). Each time we need to solve a given system, we set A and φ to the desired values in the multi-preferences network and we do not need to relearn anything to obtain the optimal controls. For both neural nets, the output is the trading speed ν_t at each time step t, interleaved with the state-variable updates, so that a controller is learned for the full length of the trading day. In order to learn from historical data, we first train the network on data simulated by Monte Carlo and then perform transfer learning on real data.

We would like to compare the control obtained using the PDE solution with the control obtained using the neural network approximation. As stated in equation (5), the optimal control can be written as an affine function of the inventory q_t at any point in time, for t ∈ [0, T]. The shape of this closed-form optimal control belongs to the manifold of functions of t and q spanned by [0, T] × R ∋ (t, q) → h_1(t)/(2κ) + (α + h_2(t))/(2κ) · q ∈ R, where the h_i(t) are non-linear smooth functions of t. To provide explainability of our neural controller, we project its effective controls on this manifold. We naturally obtain two non-linear functions h̃_1(t) and h̃_2(t) and an R²(t) measuring the distance between the effective control and the projected one at each time step t.

Procedure of projection on the "closed-form manifold". For each t, we form a database of all the learned controls ν_t mapped by the neural net to the remaining quantity q_t. This enables us to project ν_t on q_t using an Ordinary Least Squares (OLS) regression. The coefficients β_1(t) and β_2(t) of the regression ν_t = β_1(t) + β_2(t) q_t + ε_t can easily be inverted to give

h̃_1(t) = 2κ β_1(t),    h̃_2(t) = 2κ β_2(t) − α,

for each t ∈ [0, T].
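The projection itself reduces to one ordinary least squares regression per time step. Here is a minimal NumPy sketch (again illustrative: the array layout and function name are ours, chosen for clarity) that regresses the learned control on the inventory for each t and inverts the coefficients into h̃_1(t), h̃_2(t) and R²(t).

```python
import numpy as np

def project_on_closed_form_manifold(nu, q, kappa, alpha):
    """Per-time-step OLS of the learned control on the inventory.

    nu, q : arrays of shape (n_samples, n_steps) holding the controls produced
            by the neural controller and the corresponding inventories.
    Returns h1_tilde(t), h2_tilde(t) and R2(t), one value per time step.
    """
    n_steps = nu.shape[1]
    h1_tilde, h2_tilde, r2 = (np.zeros(n_steps) for _ in range(3))
    for t in range(n_steps):
        # OLS regression nu_t = beta1 + beta2 * q_t + residual
        X = np.column_stack([np.ones_like(q[:, t]), q[:, t]])
        beta, *_ = np.linalg.lstsq(X, nu[:, t], rcond=None)
        beta1, beta2 = beta
        fitted = X @ beta
        ss_res = np.sum((nu[:, t] - fitted) ** 2)
        ss_tot = np.sum((nu[:, t] - nu[:, t].mean()) ** 2)
        r2[t] = 1.0 - ss_res / ss_tot
        # Invert beta1 = h1/(2*kappa) and beta2 = (alpha + h2)/(2*kappa)
        h1_tilde[t] = 2.0 * kappa * beta1
        h2_tilde[t] = 2.0 * kappa * beta2 - alpha
    return h1_tilde, h2_tilde, r2
```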
The R²(t) of each projection associated with t quantifies the distance between the effective neural "black-box" control and the explained one, ν̃_t := h̃_1(t)/(2κ) + (α + h̃_2(t))/(2κ) · q_t. The curve of R²(t) represents how much of the non-linear functional can be projected onto the manifold of closed-form controls at each t. In practice we perform T = 77 OLS projections to obtain the curves of h_1, h_2 and R² (see Figure 2), which provide explainability of our learned controls.

Users' preferences and exploration-exploitation issues. The rate at which the agent executes the whole inventory strongly depends on the agent's preferences A and φ. When they are both large, the optimal control tends to finish trading earlier than T, the end of the trading day. Because of that, in such a regime it is very difficult for the neural net to learn anything around the end of the time interval, since it no longer explores its interactions with the price dynamics and therefore cannot emulate a better control there. For the case of the multi-preferences controller with 4 inputs, including A and φ, taking (A, φ) in a certain range ensures that the neural net will witness long enough trajectories to exploit them. To this end, we took advantage of the closed-form solution of the stylized dynamics and scanned the duration of trajectories for each pair (A, φ). In order to avoid this exploration-exploitation trade-off, we restricted the domain to values of φ smaller than or equal to 0.007, and values of A smaller than or equal to 0.01. This leaves some inventory to execute at the end of the trading day for these parameters, which enables the regression to learn the functional produced by the neural network. For the pair (A, φ) = (0.01, 0.007), we have less than 1% of the initial inventory left to execute, yet it is enough for us to estimate the regression accurately.

3 Optimal execution with data

We are using tick data for trades and quotes of equities traded on the Toronto Stock Exchange for the period that ranges from Jan/2008 until Dec/2009, which results in 503 trading days. Both trades and quotes are available at Level I (no Level II available) for 19 stocks (see Table 2). They cover a diverse range of industries and daily traded volumes. Our method can be directly used for all the stocks, either individually or simultaneously.

We partition both trade and quote files into smaller pickled files organized by date. We merge them day by day to calculate other interesting variables, such as the trade sign, the volume weighted average price (VWAP), and the bid-ask spread at the time of each trade. We drop odd-lots from the dataset, as they in fact belong to a separate order book. Moreover, we only keep the quotes that precede a trade, or a sequence of trades; this saves memory and reduces processing time. The trade data, including the trade sign variable, is further processed into five-minute bins. TAQ data is extremely asynchronous, and there might not be enough data in a given time interval. To avoid this problem, we aggregate the data into bins and focus our analysis on liquid stocks.
This bin size is big enough for us to have data in each bin, and small enough to allow the agent to adjust the trading speed several times throughout the day. Since the market is open from 9:30 until 16:00 and we use five-minute bins, we have 78 time steps, which gives us 77 bins for which to estimate the control on each trading day. We have 91.5% of bins containing data (see Table 1).

The stylized dynamics allowing a PDE formulation of the problem and a closed-form solution of the control make a lot of assumptions about the randomness of price increments, market impact and trading costs. Typically they assume independent and normally distributed, stationary, non-correlated returns, with no seasonality. However, this is not the case for financial time series; see [4, 6, 8] for some interesting discussions. Figure 1 shows the specific characteristics of financial data for the stock MRU. The left plot is a qq-plot, which tells us that the stock returns have a heavy-tailed distribution. The same is confirmed for all stocks in Table 2: we observe high kurtosis for all the distributions of stock returns (a common indication that a distribution has heavier tails than a Gaussian). The middle plot presents the five-minute-lag auto-correlation profile. For this stock, we can see a mean-reversion tendency at the five-minute and one-hour horizons. The first lag of the five-minute auto-correlation is shown for all stocks in Table 2. The right plot shows intra-day seasonality profiles: the bid-ask spread (blue, dashed curve) and the intra-day volume curve (solid, gray curve).

4 Learning the mapping to the optimal control in a closed loop

As mentioned in Section 2.3, the first step is to compare the output of our neural network model to the PDE solution, both using Monte Carlo simulated data. The second step is to understand how seasonality in the data might affect the training and the output. Finally, we continue our learning process on real data, in order to learn its peculiarities in terms of heavy tails, seasonality and auto-correlation. The results and our benchmark are summarized in Figure 2. As stated in equation (5), the optimal control for the stylized dynamics and the PDE is affine in the inventory, hence the associated R² is obviously 1 for any t. For the different neural nets, we perform the projections on the (h_1, h_2) manifold and keep track of the R² curve.

During the learning, it is important that the neural network can observe full trajectories. When the risk aversion parameters are very restrictive, the closed-form solution and the neural net trade so fast that the order is fully executed (i.e. the control stops) far before t = T. Because of that, it is impossible to learn after this stopping time corresponding to q_t = 0. Our workaround for this exploration-exploitation issue has been to use the closed-form solution to select a range of (A, φ) allowing the neural net to generate enough trajectories to observe the dynamics up to t = T. In Figure 2, we mix closed-form controls on the stylized dynamics, neural controls on the same dynamics (using Monte Carlo simulations), neural controls in more realistic simulations (with an intraday seasonality), and ultimately neural controls learned on real data. We also superpose results for neural networks trained only on this regime of preferences and the controls of the multi-preferences neural network, which is trained once and for all and then generates controls for any pair (A, φ) of preferences.
The top panels represent the two components of the projection of the control on the "closed-form manifold", to enable comparison. The R²(t) curves (bottom-left panel) are flat at 1 for the closed-form formula (since 100% of the control belongs to this manifold), whereas they can vary for the other controls (more cases are available in the Appendix). Thanks to this projection, it is straightforward to compare the behaviour of all the controls. The R²(t) curves provide evidence that in this regime the neural controls are almost 100% explained by the closed-form manifold. This does not say that they are similar to the closed-form solutions of the stylized problem, but that they are affine in q_t (though not in t) for each time step t. For non-quadratic costs (we took γ = 3/2 since it is a realistic case [27]) we take (A, φ) = (0.5, 0.1) to be in a regime comparable to that of the other controls. The R²(t) curve shows that for these parameters, the learned control remains extremely close to the closed-form manifold. This kind of evidence can allow risk departments of banks and regulators to get more comfortable with this learned control.

Main differences between the learned controls. The different learning contexts primarily influence h_1(t), that is, a shift in the trading speed. This is an important result by itself in optimal trading: going from stylized dynamics, to simulations with intraday seasonality, and then to real price dynamics does not change the multiplicative term of q_t very much, but shifts the trading speed. This has already been observed in the context of game-theoretical frameworks [9]. When the seasonality is added in the simulation (dotted green curve), the deformation of the control exhibits oscillations that most probably stem from the seasonalities of the right panel of Figure 1. The shift learned on real price dynamics (dashed orange) amplifies some oscillations, which are now less smooth; this is probably due to autocorrelations of price innovations. Moreover, Figure 2 shows that the "functional learning" worked: the mono-preference neural network trained only on this (A, φ) and the multi-preferences one have similar curves.

5 Conclusions and perspectives

In this paper, we succeed in using a neural network to learn the mapping between end-user preferences and the optimal trading speed, allowing one to buy or sell a large number of shares or contracts on financial markets. Prior to this work, various proposals had been made, but the learned controls were always specialized for a given set of preferences. Here, our multi-preferences neural network learns the solution of a class of dynamical systems. Note that the optimal execution dynamics react to the applied control, via the price impact of the buying or selling pressure; the neural network hence has to learn this feedback. Our approach uses a deep neural network whose inputs are the user preferences and the state of the optimization problem, which changes after each decision. The loss function can only be computed at the end of the trading day, once the neural controller has been used 77 times. The backpropagation hence takes place across these 77 steps. We faced some exploration-exploitation issues and solved them by choosing a suitable range of users' preferences to ensure that long enough trajectories are observed during the learning. Our setup leverages transfer learning, starting on simulated data before switching to historical data.
Since we want to understand how the learned controls differ from the closed-form solution of a stylized model that is largely used by practitioners, we learn on different versions of simulated data: from a model corresponding to the stylized one, to one incorporating non-stationarities, and then to real data. To ensure the explainability of our learned controls, we introduce a projection method onto the functional space spanned by the closed-form formula. It allows us to show that most of the learned controls belong to this manifold, and that the adaptation to realistic dynamics is concentrated in a "shift term" h_1 of the control. The most noticeable adaptation of h_1 exhibits slow oscillations that most probably reflect the seasonalities of intraday price dynamics. We then introduce a version of the loss function that better reflects the reality of intraday price dynamics (which are sub-diffusive). Despite the fact that the associated HJB has no closed-form solution, we manage to learn the associated optimal control and show, using our projection technique, that it almost belongs to the same manifold. This approach delivers explainability of learned controls and can probably be extended to contexts other than optimal trading. It should help regulators to have a high level of trust in learned controls, which are often considered "black boxes": our proposal exposes the fraction of the controls that belongs to the manifold practitioners and regulators are familiar with, allowing them to perform the usual "stress tests" on it, and quantifies this fraction as an R² curve that is easy to interpret.

Explainability of learned controls for financial markets. Attempts to use machine learning on financial markets have been booming over the last five years. They span a wide spectrum of applications, from client profiling to nowcasting using databases of texts and satellite images. One area where the acceptance of ML is slow is the automation of hedging strategies. These strategies are part of the functioning of financial markets in the post-2008-crisis environment, because regulations demand that market participants compute and compensate their exposure to as many risk scenarios as possible. Moreover, and for obvious reasons, regulators ask for a better understanding of algorithms that look like "black boxes" before allowing the use of ML in production to improve the efficiency of hedging. This is related to the well-known topic of "explainability of AI". Optimal trading addresses part of these hedging strategies, since a framework like the one used in this paper corresponds to a "large investor", typically a pension fund, which decided to rebalance its portfolio for good reasons (think about the market moves following the spread of COVID-19: the optimal response of pension funds is clearly to rebalance their portfolios to get less exposure to factors like airline companies, to prevent their pensioners from suffering losses in the coming year), but which has to do it at a slow enough pace not to push the price too much to its own detriment [28]. The big COVID-19 drop of the markets, followed by a rebound 10 days later, is due to non-optimal trading strategies: the institutions sold too fast, paying too much for their deleveraging and perturbing the prices with the downward pressure of their overly fast sales.
Being able to adapt trading strategies to realistic features of intraday price dynamics (like seasonality and auto-correlations) both decreases the trading costs of large institutions like pension funds (a saving that consequently goes back to their pensioners) and is profitable to the public price formation, which is otherwise more blurred and does not reflect the "fair value" of traded instruments. In the introduction, we cite papers on improving optimal trading, via reinforcement learning or directly using deep learning. They do not attempt to provide explainable versions of the obtained controls. In this paper we do it in a very practical way that is compliant with the practices of the financial industry: we project the learned controls on the manifold spanned by the controls currently in use by brokers and asset managers, and measure the distance between the obtained projection and the learned controls (via an R² curve that is straightforward to understand). Doing this, we provide evidence that the learned controls we generated are largely contained in this regular manifold, and we allow practitioners and regulators to apply their usual stress-testing and certification practices to the projected controls. It is the first proposal in this direction, and we hope that it will propagate to other areas of finance, like deep hedging. It may also be used in other industries where the regulatory demand for explainability is high.

"Functional learning" for control. The second "broader impact" of this paper is what we called "functional learning" in the scope of optimal trading, with the setup we call the multi-preferences neural network. In other applications of optimal control, the loss function is fixed because it corresponds to one universal use of the controlled object. On financial markets, the loss function is parametrized by hyper-parameters describing the risk aversion of the end-user. In game-theoretic versions of the optimal trading framework, they are usually called "agent preferences" [9]. It is like allowing the driver of an autonomous car to choose the "driving style" she or he wants to use: far below the speed limit (maybe for low-carbon-emission reasons), or just below it, or a "sport mode", etc. On financial markets, these parameters are chosen at the start of trading (t = 0 in our framework) by the end-user, mainly because, on the one hand, models of price impact have confidence intervals and, on the other hand, asset owners can have a lot of reasons to rebalance their portfolios (think about an Australian pension fund that has to sell fast enough to face the exceptional pensioners' redemptions authorized by the government following the COVID-19 crisis; it is clear that the selling speed has to be fast enough to give the money to the pensioners). Today a stylized version of the control problem is numerically solved on the fly (either in closed form, or by a numerical scheme) for the chosen preferences, but the dynamics have to be stylized, hence simplified, enough to allow solving it very quickly. Learning the optimal control each time an end-user chooses new preferences is not fast enough. In this paper the neural net learns to solve the optimal control problem for any preference; in fact it learns the mapping between the preferences and the optimal solution. It effectively learns to solve a whole parametrized family of Hamilton-Jacobi-Bellman equations. Hence it can provide the optimal trading strategy for a given choice of preferences in a flash.
Keep in mind that this optimal strategy is not a single number: it is itself a mapping from the state space of the control problem to the optimal trading speed to be applied. This had a chance of working because it is straightforward to check (in the continuous setting) that the infinite-dimensional optimal control strategy is a smooth function of the two-dimensional vector of preferences (A, φ); hence we ask the neural network to interpolate high-dimensional functions over this two-dimensional grid. Nevertheless, it is far from easy (which is probably why other attempts have not succeeded so far). In the paper, we refer to what we suspect to be the main difficulty under the name of "exploration-exploitation issue". For some pairs of user preferences, the optimal trading speed is so fast that the full order is bought or sold far before T, and hence the neural network cannot learn from the feedback of its control on the dynamics, simply because it has stopped interacting with them. To counter this, we used the closed-form formula of the stylized problem to define a domain of user preferences that is compatible with our T; it helped the convergence of the "functional learning".

For the sake of completeness, we recall how the benchmark solution is obtained; see [13] for more details. The continuous form of the problem defined in equations (1) to (4), when γ = 2, can be characterized by the value function V (we drop the subscripts A and φ to alleviate the notations), defined as

V(t, x, s, q) = sup_ν E[ X_T + Q_T (S_T − A Q_T) − φ ∫_t^T Q_u² du | X_t = x, S_t = s, Q_t = q ].

From dynamic programming, we obtain that the value function V satisfies the following Hamilton-Jacobi-Bellman (HJB) equation:

∂_t V + (σ²/2) ∂_{ss} V − φ q² + sup_ν { α ν ∂_s V + ν ∂_q V − ν (s + κ ν) ∂_x V } = 0,

for t ∈ [0, T), x, s, q ∈ R, with terminal condition V(T, x, s, q) = x + q(s − Aq). If we use the ansatz V(t, x, s, q) = x + qs + u(t, q), with u of the form u(t, q) = h_0(t) + h_1(t) q + h_2(t) q²/2, the optimal control resulting from solving this problem can be written as

ν*(t, q) = (h_1(t) + (α + h_2(t)) q) / (2κ),

which is equation (5). Hence, h_1(t) and h_2(t) act on the control by influencing either the intercept of an affine function of the inventory, in the case of h_1(t), or its slope, in the case of h_2(t). From the HJB equation, identifying the powers of q, these coefficients are characterized by the following system of ordinary differential equations (ODEs):

h_2'(t) = 2φ − (α + h_2(t))² / (2κ),
h_1'(t) = −(α + h_2(t)) h_1(t) / (2κ),
h_0'(t) = −h_1(t)² / (4κ),

with terminal conditions h_0(T) = 0, h_1(T) = 0 and h_2(T) = −2A (a small numerical sketch of this benchmark is given below).

In this section, we provide more details on the implementation of the method based on the neural network approximation. For the neural network, we used a fully connected architecture, with three hidden layers of five nodes each. One forward step in this setup is described by Figure 3, while one round of SGD is represented by Figure 4. The same neural network learns from the state variables obtained in the previous step; it thus learns the influence of its own output through many time steps. We have experimented with using one neural network per time step. However, this method uses more memory and did not provide any clear advantage with respect to the one presented here. From Figures 3 and 4, we can clearly see that the control influences the dynamics. We are, in fact, optimizing in a closed-loop learning environment, where the trader's actions have both a permanent and a temporary market impact on the price. The mini-batch size we used, namely 64, is relatively small. While papers like [16] defend the use of larger mini-batch sizes to take advantage of parallelism, smaller mini-batch sizes such as the ones we are using are shown to improve accuracy in [25], [32] and [37].
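As a complement, here is a minimal numerical sketch of this benchmark: an explicit Euler scheme integrating the ODE system above backward in time from the terminal conditions, from which the closed-form control (5) can be evaluated. It assumes the ODE system and terminal conditions as reconstructed above, and the parameter values are purely illustrative.

```python
import numpy as np

def benchmark_closed_form_control(T=1.0, n_steps=77, alpha=0.1,
                                  kappa=0.05, A=0.01, phi=0.007):
    """Integrate the ODEs for (h1, h2) backward in time and return the
    closed-form control nu*(t, q) = (h1 + (alpha + h2) * q) / (2 * kappa).
    Parameter values are illustrative, not the paper's calibration."""
    dt = T / n_steps
    h1 = np.zeros(n_steps + 1)
    h2 = np.zeros(n_steps + 1)
    h1[-1], h2[-1] = 0.0, -2.0 * A              # terminal conditions at t = T
    for k in range(n_steps - 1, -1, -1):        # explicit Euler, backward in time
        dh2 = 2.0 * phi - (alpha + h2[k + 1]) ** 2 / (2.0 * kappa)
        dh1 = -(alpha + h2[k + 1]) * h1[k + 1] / (2.0 * kappa)
        h2[k] = h2[k + 1] - dt * dh2
        h1[k] = h1[k + 1] - dt * dh1

    def nu_star(k, q):
        """Closed-form trading speed at time-step index k for inventory q."""
        return (h1[k] + (alpha + h2[k]) * q) / (2.0 * kappa)

    return h1, h2, nu_star
```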
Figure 3: Structure of the state variables' simulation (one step).

Permanent market impact. Let S be the space of stocks and D be the space of trading days. If we take five-minute intervals (indicated by the superscript notation), we can write equation (1) for each stock s ∈ S, for each day d ∈ D, and for each five-minute bin indexed by t as

ΔS^{5min}_{s,d,t} = α_{s,t} ν^{5min}_{s,d,t} Δt + σ_s √Δt ε^{5min}_{s,d,t},

where the subscripts s, d, t respectively indicate the stock, the date, and the five-minute interval to which the variables refer, and Δt = 5 min by construction. We have α_{s,t} independent of d, which assumes that, for any given day, the permanent market impact multiplier the agent may have on the price of a particular stock s, for a given time bin t, is the same. And we have σ_s independent of the day d and the time bin t, which means that the volatility related to the noise term is constant for a given stock. As an empirical proxy for the theoretical value ν^{5min}_{s,d,t} representing the net order flow, we use the order flow imbalance observed in the data, Imb^{5min}_{s,d,t}, defined as

Imb^{5min}_{s,d,t} = Σ v^{buy}_{s,d,t} − Σ v^{sell}_{s,d,t},

where v^{buy}_{s,d,t} is the volume of a buy trade and v^{sell}_{s,d,t} is the volume of a sell trade, and we aggregate their net amounts over the five-minute bins. However, since we are estimating the permanent market impact parameter α using data from different stocks, and we would like to perform a single regression for all of them, we re-scale the data used in the estimation. In order to make the data from different stocks comparable, we do the following (a minimal code sketch of this pooled regression is given below):

1. Divide the trade price difference ΔS^{5min}_{s,d,t} by the average bin spread over all days for the given stock, calculated in its respective five-minute bin, ψ̄_{s,t}, where ψ^{5min}_{s,d,t} is the bid-ask spread for a given (stock, date, bin) tuple.

2. Divide the order flow imbalance Imb^{5min}_{s,d,t} by Total Volume^{5min}_{s,d,t}, which stands for the total traded volume of both buys and sells for a given (stock, date, bin) tuple.

This ensures that both the volume and the price are normalized to quantities that can be compared. Each (s, d, t) now gives rise to one data point, and we can thus run the following regression instead, using comparable variables, to find ᾱ:

ΔS^{5min}_{s,d,t} / ψ̄_{s,t} = ᾱ · Imb^{5min}_{s,d,t} / Total Volume^{5min}_{s,d,t} + ε̄^{5min}_{s,d,t},

where ᾱ is the new slope parameter we would like to estimate, and ε̄^{5min}_{s,d,t} is the normalized version of the residual in the per-stock equation above. In order to use ᾱ in a realistic way, we de-normalize the regression equation for each stock:

ΔS^{5min}_{s,d,t} = ᾱ (ψ̄_{s,t} / V̄_{s,t}) Imb^{5min}_{s,d,t} + ψ̄_{s,t} ε̄^{5min}_{s,d,t},

where V̄_{s,t} is the average bin volume for stock s at the time bin t, and ψ̄_{s,t} is the average bin bid-ask spread for the same pair (s, t). The derivation of the temporary market impact κ uses the wealth dynamics (3), where κ appears, and follows similar steps.

In Section 2.3, we discussed the exploration-exploitation trade-off we encountered when projecting the neural network controller onto the manifold of closed-form solutions when γ = 2. In Figure 5, we compare the number of time steps it takes the controller to trade at least 90% of the initial inventory in the market. Rows stand for values of the risk parameter A ∈ {0.01, 0.001, 0.0001}, while columns represent the risk parameter φ ∈ {0.07, 0.007, 0.004, 0.001, 7e-4, 7e-5}. When the trader's risk-aversion preferences give them an incentive to trade faster, the inventory is executed sooner. The controller is then 'exploiting' the available trading choices, and the projection on the closed-form manifold becomes harder to estimate. On the other hand, if the pair (A, φ) defines less restrictive preferences towards execution, there will be some (although not a lot of) inventory left at the end of the trading day, which allows us to learn the projection accurately.
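To make the permanent-impact estimation above concrete, here is a minimal pandas sketch of the pooled, normalized regression. The column names, the through-the-origin OLS and the de-normalization step are our own illustrative choices, not a description of the exact pipeline used for the results.

```python
import pandas as pd

def estimate_permanent_impact(df: pd.DataFrame):
    """Pooled estimation of the normalized impact slope alpha_bar.

    df has one row per (stock, date, bin) with columns:
    'stock', 'bin'            : stock identifier and intraday five-minute bin index,
    'dS'                      : five-minute trade price difference,
    'imbalance'               : signed buy-minus-sell volume in the bin,
    'total_volume', 'spread'  : total traded volume and bid-ask spread in the bin.
    """
    # Per-(stock, bin) averages over all days, used for normalization.
    avg = df.groupby(["stock", "bin"]).agg(
        avg_spread=("spread", "mean"),
        avg_volume=("total_volume", "mean"),
    )
    df = df.join(avg, on=["stock", "bin"])

    # Normalize so that points from different stocks are comparable.
    y = df["dS"] / df["avg_spread"]
    x = df["imbalance"] / df["total_volume"]

    # Single pooled OLS through the origin: y = alpha_bar * x + residual.
    alpha_bar = float((x * y).sum() / (x * x).sum())

    # De-normalize into a per-(stock, bin) impact multiplier.
    df["alpha"] = alpha_bar * df["avg_spread"] / df["avg_volume"]
    return alpha_bar, df[["stock", "bin", "alpha"]].drop_duplicates()
```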
When γ = 3/2, the projection on the "shift term" h_1 is larger than usual, and the absolute value of the "proportional term" h_2 is very close to zero when plotted on the same scale as the h_2 obtained for γ = 2 (on its own scale it increases with time, which is the reverse of the usual behaviour). Nevertheless, this is compensated by the part of the control that lies outside of this space, as indicated by the R²: at the start of the trading process, 40% of the control is explained by the "closed-form manifold"; it then decreases to 0%, meaning that the learned control temporarily lies fully outside this manifold; but it then increases back to 80% at the end of the process. It seems that the best way to finish the trading is already well captured by the closed-form manifold. As we observe from the control plot in the bottom-right of Figure 6, the parameter pair (A, φ) = (0.01, 0.007) does not enforce trading speeds as fast as the pair (A, φ) = (0.5, 0.1) when the loss function is sub-diffusive. Keeping the same preferences as before thus makes the trader less aggressive in a sub-diffusive environment; in order to have the same behavior in terms of control, they would need to behave in a more risk-averse manner.

Figure 6: Explainable parts of the control, h_1(t) and h_2(t); fraction of the explained control, R²(t); and average control given that it is positive, E(ν(t) | ν(t) > 0), for different controllers, for the case γ = 3/2.

References

[1] Optimal execution of portfolio transactions.
[2] Market impacts and the life cycle of investors orders.
[3] Optimal control of execution costs.
[4] Tails, fears, and risk premia.
[5] Optimal control of trading algorithms: a general impulse control approach.
[6] Trades, quotes and prices: financial markets under the microscope.
[7] Deep hedging.
[8] Trading volume and serial correlation in stock returns.
[9] Mean field game of controls and an application to trade crowding.
[10] Convergence Analysis of Machine Learning Algorithms for the Numerical Solution of Mean Field Control and Games: I - The Ergodic Case.
[11] Convergence Analysis of Machine Learning Algorithms for the Numerical Solution of Mean Field Control and Games: II - The Finite Horizon Case.
[12] The self-financing equation in limit order book markets.
[13] Incorporating order-flow into optimal execution.
[14] Algorithmic and high-frequency trading.
[15] Reinforcement learning in economics and finance.
[16] Large scale distributed deep networks.
[17] Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations.
[18] Deep learning methods for mean field control problems with delay.
[19] Sensitivity analysis using Itô-Malliavin calculus and martingales, and application to stochastic optimal control.
[20] Deep reinforcement learning for market making in corporate bonds: beating the curse of dimensionality.
[21] Accelerated share repurchase and other buyback programs: what neural networks can bring.
[22] Deep learning approximation for stochastic control problems.
[23] Solving high-dimensional partial differential equations using deep learning.
[24] Some machine learning schemes for high-dimensional nonlinear PDEs.
[25] On large-batch training for deep learning: generalization gap and sharp minima.
[26] Adam: a method for stochastic optimization.
[27] Optimal starting times, stopping times and risk measures for algorithmic trading.
[28] Market microstructure in practice.
[29] Optimal posting price of limit orders: learning by trading.
[30] Piecewise affine neural networks and nonlinear control.
[31] Pricing options and computing implied volatilities using neural networks. Risks.
[32] Revisiting small batch training for deep neural networks.
[33] Improving reinforcement learning algorithms: towards optimal learning rate policies.
[34] Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.
[35] Universal features of price formation in financial markets: perspectives from deep learning.
[36] DGM: a deep learning algorithm for solving partial differential equations.
[37] The general inefficiency of batch training for gradient descent learning.

Acknowledgments. The authors are extremely grateful for the fruitful discussions with Professor René Carmona, who was also very kind to provide the data we are using. The work of L. Leal and M. Laurière was supported by NSF DMS-1716673 and ARO W911NF-17-1-0578.