title: Bridging the gap between Markowitz planning and deep reinforcement learning
authors: Benhamou, Eric; Saltiel, David; Ungari, Sandrine; Mukhopadhyay, Abhishek
date: 2020-09-30

While researchers in the asset management industry have mostly focused on techniques based on financial and risk planning, like the Markowitz efficient frontier, minimum variance, maximum diversification or equal risk parity, another community in machine learning has, in parallel, started working on reinforcement learning, and more particularly deep reinforcement learning, to solve other decision making problems for challenging tasks like autonomous driving, robot learning, and, on a more conceptual side, game solving like Go. This paper aims to bridge the gap between these two approaches by showing that Deep Reinforcement Learning (DRL) techniques can shed new light on portfolio allocation thanks to a more general optimization setting that casts portfolio allocation as an optimal control problem that is not just a one-step optimization, but rather a continuous control optimization with a delayed reward. The advantages are numerous: (i) DRL maps market conditions directly to actions by design and hence should adapt to a changing environment, (ii) DRL does not rely on any traditional financial risk assumption, such as risk being represented by variance, (iii) DRL can incorporate additional data and be a multi-input method, as opposed to more traditional optimization methods. We present encouraging results from an experiment using convolutional networks.

In asset management, there is a gap between mainstream methods and new machine learning techniques around reinforcement learning, and in particular deep reinforcement learning. The former rely on financial risk optimization and solve the planning problem of the optimal portfolio as a single-step optimization question. The latter make no assumption about risk, perform a more involved multi-step optimization, and solve complex and challenging tasks like autonomous driving (Wang, Jia, and Weng 2018), learning advanced locomotion and manipulation skills from raw sensory inputs (Schulman et al. 2015a; Schulman et al. 2017; Lillicrap et al. 2015), or, on a more conceptual side, reaching supra-human level in popular games like Atari (Mnih et al. 2013), Go (Silver et al. 2017), StarCraft II (Vinyals et al. 2019), etc. One of the reasons often put forward for this situation is that asset management researchers have mostly been trained with an econometric and financial mathematics background, while the deep reinforcement learning community has mostly been trained in computer science and robotics, leading to two distinct research communities that do not interact much with each other. In this paper, we present the two sets of approaches, showing their similarities and differences, in order to bridge the gap between them. Both can help solve the decision making problem of finding the optimal portfolio allocation weights. As this paper aims at bridging the gap between traditional asset management portfolio selection methods and deep reinforcement learning, there are too many works to be cited.
On the traditional methods side, the seminal work is (Markowitz 1952), which has led to various extensions like minimum variance (Chopra and Ziemba 1993; Haugen and Baker 1991; Kritzman 2014), maximum diversification (Choueifaty and Coignard 2008; Choueifaty, Froidure, and Reynier 2012), maximum decorrelation (Christoffersen et al. 2010) and risk parity (Maillard, Roncalli, and Teïletche 2010; Roncalli and Weisang 2016). We review these works in the section entitled Traditional methods. On the reinforcement learning side, the seminal book is (Sutton and Barto 2018). The field of deep reinforcement learning is growing every day at an unprecedented pace, making the citation exercise complicated. But in terms of breakthroughs of deep reinforcement learning, one can cite the work around Atari games from raw pixel inputs (Mnih et al. 2013; Mnih et al. 2015), Go (Silver et al. 2016; Silver et al. 2017), StarCraft II (Vinyals et al. 2019), learning advanced locomotion and manipulation skills from raw sensory inputs (Schulman et al. 2015a; Schulman et al. 2015b; Schulman et al. 2017; Lillicrap et al. 2015), autonomous driving (Wang, Jia, and Weng 2018) and robot learning (Gu et al. 2017).

On the application of deep reinforcement learning methods to portfolio allocation, there is already a growing interest, as recent breakthroughs have put growing emphasis on this method. Hence, the field is growing very rapidly and surveys like (Fischer 2018) are already outdated. Driven initially mostly by applications to cryptocurrencies and Chinese financial markets (Jiang and Liang 2016; Zhengyao et al. 2017; Yu et al. 2019; Wang and Zhou 2019), the field is progressively taking off on other assets (Kolm and Ritter 2019; Liu et al. 2020; Ye et al. 2020; Li et al. 2019; Xiong et al. 2019). More generally, DRL has recently been applied to problems other than portfolio allocation. For instance, (Deng et al. 2016; Zhang, Zohren, and Roberts 2019; Huang 2018; Thate and Ernst 2020; Chakraborty 2019; Nan, Perumal, and Zaiane 2020; Wu et al. 2020) tackle the problem of direct trading strategies, (Bao and yang Liu 2019) handles that of multi-agent trading, while (Ning, Lin, and Jaimungal 2018) examine optimal execution.

We are interested in finding an optimal portfolio, which makes our planning problem quite different from standard planning problems where the aim is to plan a succession of tasks. Typical planning algorithms are variations around STRIPS (Fikes and Nilsson 1971), which starts by analyzing end goals and means, builds the corresponding graph and finds the optimal graph: we start from the goals to achieve and try to find means that can lead to them. Newer work like Graphplan, as presented in (Blum and Furst 1995), uses a novel planning graph to reduce the amount of search needed, while hierarchical task network (HTN) planning leverages classification to structure networks and hence reduce the number of graph searches. Other algorithms, such as search algorithms like A*, B* and weighted A*, full graph search methods like branch and bound and its extensions, and evolutionary algorithms like particle swarm optimization and CMA-ES, are also widely used in AI planning. However, when it comes to portfolio allocation, the standard methods used by practitioners rely on more traditional financial risk-reward optimization problems and rather follow the Markowitz approach presented below.
Traditional methods

The intuition of the Markowitz portfolio is to be able to compare various assets and assemble them taking into account both return and risk. Comparing just the returns of financial assets would be too naive: one has to take into account, in any investment decision, returns together with the associated risk. Risk is not an easy concept. In Modern Portfolio Theory (MPT), risk is represented by the variance of the asset returns. If we take various financial assets and display their returns and risk as in figure 1, we can find an efficient frontier, represented by the red dotted line.

Mathematically, if we denote by $w = (w_1, \ldots, w_l)$ the allocation weights with $1 \geq w_i \geq 0$ for $i = 1, \ldots, l$, summarized as $1 \geq w \geq 0$, with the additional constraint that these weights sum to 1, $\sum_{i=1}^{l} w_i = 1$, we can see this portfolio allocation question as an optimization problem. Let $\mu = (\mu_1, \ldots, \mu_l)^T$ be the expected returns of our $l$ strategies and $\Sigma$ the covariance matrix of the $l$ strategies' returns. Let $r_{\min}$ be the minimum expected return. The Markowitz optimization problem is to minimize the risk given a target of minimum expected return:

$$\text{Minimize}_w \; w^T \Sigma w \quad \text{subject to} \quad \mu^T w \geq r_{\min}, \; \textstyle\sum_{i=1}^{l} w_i = 1, \; 1 \geq w \geq 0. \tag{1}$$

It is solved by standard quadratic programming. Thanks to duality, there is an equivalent maximization with a given maximum risk $\sigma_{\max}$, for which the problem writes as follows:

$$\text{Maximize}_w \; \mu^T w \quad \text{subject to} \quad w^T \Sigma w \leq \sigma_{\max}^2, \; \textstyle\sum_{i=1}^{l} w_i = 1, \; 1 \geq w \geq 0. \tag{2}$$

This seminal model has led to numerous extensions where the overall idea is to use a different optimization objective. As presented in (Chopra and Ziemba 1993; Haugen and Baker 1991; Kritzman 2014), we can for instance be interested in just minimizing risk (if we are not so much interested in expected returns), which leads to the minimum variance portfolio given by the following optimization program:

$$\text{Minimize}_w \; w^T \Sigma w \quad \text{subject to} \quad \textstyle\sum_{i=1}^{l} w_i = 1, \; 1 \geq w \geq 0. \tag{3}$$

Maximum diversification portfolio
Denoting by $\sigma$ the volatilities of our $l$ strategies, given by the square roots of the diagonal elements of the covariance matrix $\Sigma$, $\sigma = (\sqrt{\Sigma_{i,i}})_{i=1..l}$, we can shoot for maximum diversification, with the diversification of a portfolio defined as $D(w) = \langle w, \sigma \rangle / \sqrt{w^T \Sigma w}$. The concept of diversification is simply the ratio of the weighted average of volatilities divided by the portfolio volatility. We then solve the following optimization program, as presented in (Choueifaty and Coignard 2008; Choueifaty, Froidure, and Reynier 2012):

$$\text{Maximize}_w \; \frac{\langle w, \sigma \rangle}{\sqrt{w^T \Sigma w}} \quad \text{subject to} \quad \textstyle\sum_{i=1}^{l} w_i = 1, \; 1 \geq w \geq 0. \tag{4}$$

Maximum decorrelation portfolio
Following (Christoffersen et al. 2010) and denoting by $C$ the correlation matrix of the portfolio strategies, the maximum decorrelation portfolio is obtained by finding the weights that provide the maximum decorrelation, or equivalently the minimum correlation:

$$\text{Minimize}_w \; w^T C w \quad \text{subject to} \quad \textstyle\sum_{i=1}^{l} w_i = 1, \; 1 \geq w \geq 0. \tag{5}$$

Risk parity portfolio
Another approach, following risk parity (Maillard, Roncalli, and Teïletche 2010; Roncalli and Weisang 2016), is to aim for parity in risk contributions and solve the corresponding optimization program, which in its standard form equalizes the risk contributions $w_i (\Sigma w)_i$ across strategies:

$$\text{Minimize}_w \; \sum_{i=1}^{l} \left( \frac{w_i (\Sigma w)_i}{w^T \Sigma w} - \frac{1}{l} \right)^2 \quad \text{subject to} \quad \textstyle\sum_{i=1}^{l} w_i = 1, \; 1 \geq w \geq 0. \tag{6}$$

All these optimization techniques are the usual way to solve the planning question of getting the best portfolio allocation.
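To make the traditional programs above concrete, here is a minimal numerical sketch of the Markowitz program (1) using a generic constrained optimizer. This is not the implementation used in the paper, and the values of mu, Sigma and r_min are purely illustrative assumptions; programs (3) to (6) only differ by their objective and can be solved with the same pattern.

```python
# Minimal sketch: solving the Markowitz program (1) with a generic solver.
# mu, Sigma and r_min are illustrative assumptions, not values from the paper.
import numpy as np
from scipy.optimize import minimize

mu = np.array([0.04, 0.06, 0.05])            # illustrative expected returns
Sigma = np.array([[0.10, 0.02, 0.01],
                  [0.02, 0.08, 0.03],
                  [0.01, 0.03, 0.09]])        # illustrative covariance matrix
r_min = 0.05                                  # minimum expected return target
l = len(mu)

def portfolio_variance(w):
    # objective: w^T Sigma w
    return w @ Sigma @ w

constraints = [
    {"type": "eq",   "fun": lambda w: np.sum(w) - 1.0},   # weights sum to 1
    {"type": "ineq", "fun": lambda w: mu @ w - r_min},     # expected return >= r_min
]
bounds = [(0.0, 1.0)] * l                                  # long-only weights

result = minimize(portfolio_variance, x0=np.full(l, 1.0 / l),
                  bounds=bounds, constraints=constraints, method="SLSQP")
print("optimal weights:", result.x)
```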
We will see in the following section that there are many alternatives leveraging machine learning that remove the cognitive bias of risk and are better able to adapt to a changing environment.

Previous financial methods treat the portfolio allocation planning question as a one-step optimization problem with convex objective functions. There are multiple limitations to this approach:
• they do not relate market conditions to portfolio allocation dynamically;
• they do not take into account that the result of the portfolio allocation may only be evaluated much later;
• they make strong assumptions about risk.

What if we could cast this portfolio allocation planning question as a dynamic control problem, where we have some market information and need to decide at each time step the optimal portfolio allocation, and evaluate the result with a delayed reward? What if we could move from static portfolio allocation to optimal control territory, where we can change our portfolio allocation dynamically when market conditions change? Because the community of portfolio allocation is quite different from the one of reinforcement learning, this approach has been ignored for quite some time, even though there has been a growing interest in the use of reinforcement learning and deep reinforcement learning over the last few years. We present here in greater detail what deep reinforcement learning is, in order to encourage more discussion and exchange between these two communities.

Contrary to supervised learning, reinforcement learning does not try to predict future returns. Nor does it try to learn the structure of the market implicitly. Reinforcement learning does more: it directly learns the optimal policy for the portfolio allocation in connection with the dynamically changing market conditions. As its name indicates, Deep Reinforcement Learning (DRL) is the combination of Reinforcement Learning (RL) and Deep learning (D): deep learning is used to represent the policy function in RL. In a nutshell, the setting for applying RL to portfolio management can be summarized as follows:
• current knowledge of the financial markets is formalized via a state variable denoted by s_t;
• our planning task, which is to find an optimal portfolio allocation, can be thought of as taking an action a_t on this market; this action is precisely the decision of the current portfolio allocation (also called the portfolio weights);
• once we have decided the portfolio allocation, we observe the next state s_{t+1};
• we use a reward to evaluate the performance of our actions. In our particular setting, we can compute this reward only at the final time of our episode, making it quite special compared to standard reinforcement learning problems. We denote this reward by R_T, where T is the final time of our episode. This reward R_T is in a sense similar to the objective function in traditional methods. A typical reward is the final portfolio net performance; it could obviously be another financial performance evaluation criterion like the Sharpe or Sortino ratio.

Following standard RL, we model the problem to solve with a Markov Decision Process (MDP), as in (Sutton and Barto 2018). An MDP assumes that the agent knows all the states of the environment and has all the information to make the optimal decision in every state. The Markov property implies in addition that knowing the current state is sufficient. An MDP is a 4-tuple (S, A, P, R) where S is the set of states, A is the set of actions, P is the state-action to next-state transition probability function P : S × A × S → [0, 1], and R is the immediate reward. The goal of the agent is to learn a policy that maps states to the optimal action, π : S → A, and that maximizes the expected discounted reward. The idea of using a deep network is to represent the function that relates the states dynamically to the action, called in RL the policy and denoted by a_t = π(s_t).
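As an illustration of the portfolio MDP just described, the following is a minimal environment sketch with the usual reset/step interface. The class name, shapes and the simple state definition are assumptions for exposition, not the authors' implementation; in particular, the reward is zero at every intermediate step and equals the final net performance at the end of the episode, matching the delayed-reward setting above, while the one-day execution lag used later in the paper is ignored here for simplicity.

```python
# Minimal sketch of the portfolio-allocation MDP: state = market information,
# action = portfolio weights, reward delayed until the end of the episode.
import numpy as np

class PortfolioEnv:
    def __init__(self, asset_returns):
        self.asset_returns = asset_returns      # array of shape (T, l): daily asset returns
        self.T = len(asset_returns)

    def reset(self):
        self.t = 0
        self.nav = 1.0                          # portfolio value P_0 normalised to 1
        return self._state()

    def _state(self):
        # placeholder state: here simply the current asset returns; the paper uses a
        # much richer set of lagged observations plus contextual data
        return self.asset_returns[self.t]

    def step(self, action):
        # action: allocation weights a_t, applied here to the same-period returns
        portfolio_return = float(action @ self.asset_returns[self.t])
        self.nav *= (1.0 + portfolio_return)
        self.t += 1
        done = self.t >= self.T
        # delayed reward: zero until the final time step, then the net performance
        reward = (self.nav - 1.0) if done else 0.0
        next_state = None if done else self._state()
        return next_state, reward, done

# usage with illustrative data
env = PortfolioEnv(np.random.normal(0.0003, 0.01, size=(250, 4)))
state = env.reset()
```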
The policy function is represented by a deep network because of the universal approximation theorem, which states that any continuous function can be approximated by a neural network provided we have enough layers and nodes. Compared to traditional methods that only solve a one-step optimization, we are solving the following dynamic control optimization program:

$$\text{Maximize}_{\pi} \; \mathbb{E}[R_T] \quad \text{subject to} \quad a_t = \pi(s_t), \; t = 1, \ldots, T. \tag{7}$$

Note that we maximize the expected value of the cumulative reward, E[R_T], because we are operating in a stochastic environment. To make things simpler, let us assume that the cumulative reward is the final portfolio net performance. Let us write P_t the value at time t of our portfolio, r^P_t its return at time t, and r_t the vector of portfolio asset returns at time t. The final net performance writes as $P_T / P_0 - 1 = \prod_{t=1}^{T} (1 + r^P_t) - 1$. The return r^P_t is a function of our planning action a_t as follows: $1 + r^P_t = 1 + \langle a_t, r_t \rangle$, where ⟨·,·⟩ is the standard inner product of two vectors. In addition, if we recall that the policy is parametrized by some deep network parameters θ, a_t = π_θ(s_t), we can make our optimization problem slightly more detailed as follows:

$$\text{Maximize}_{\theta} \; \mathbb{E}\!\left[ \prod_{t=1}^{T} \big(1 + \langle \pi_{\theta}(s_t), r_t \rangle\big) - 1 \right] \quad \text{subject to} \quad a_t = \pi_{\theta}(s_t). \tag{8}$$

It is worth noticing that, compared to the previous traditional planning methods (optimizations 1 to 6), the underlying optimization problem in RL (7) and its rewriting in terms of the deep network parameters θ as presented in (8) have many differences:
• First, we are trying to optimize a function π and not simple weights w_i. Although this function is in the end represented by a deep neural network that admittedly also has weights, this is conceptually very different, as we are optimizing in the space of functions π : S → A, a much bigger space than simply R^l.
• Second, it is a multi-time-step optimization, as it involves results from time t = 1 to t = T, making it also more involved.

If in addition there is some noise in our data and we are not able to observe the full state, it is better to use a Partially Observable Markov Decision Process (POMDP), as presented initially in (Astrom 1969). In a POMDP, only a subset of the information of a given state is available. The partially informed agent cannot behave optimally. It uses a window of past observations to replace states as in a traditional MDP. Mathematically, a POMDP is a generalization of an MDP. A POMDP adds two more variables to the tuple, O and Z, where O is the set of observations and Z is the observation transition function Z : S × A × O → [0, 1]. At each time, the agent is asked to take an action a_t ∈ A in a particular environment state s_t ∈ S, which is followed by the next state s_{t+1} with probability P(s_{t+1} | s_t, a_t). The next state s_{t+1} is not observed by the agent. It rather receives an observation o_{t+1} ∈ O on the state s_{t+1} with probability Z(o_{t+1} | s_{t+1}, a_t).

From a practical standpoint, the general RL setting is modified by taking a pseudo-state formed with a set of past observations (o_{t-n}, o_{t-n+1}, . . . , o_{t-1}, o_t). In practice, to avoid large dimensions and the curse of dimensionality, it is useful to reduce this set and take only a subset of these past observations, with j < n past observations such that 0 < i_1 < . . . < i_j and i_k ∈ N. The set δ_1 = (0, i_1, . . . , i_j) is called the observation lags.
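The pseudo-state built from lagged observations can be sketched as follows. The helper name and the illustrative history are assumptions, and the default lag values mirror the ones quoted in the next paragraph.

```python
# Minimal sketch: building the pseudo-state (o_{t-i_j}, ..., o_{t-i_1}, o_t)
# from a set of observation lags such as the one defined above.
import numpy as np

def lagged_state(observations, t, lags=(0, 1, 2, 3, 4, 20, 60)):
    """Stack past observations at the given lags into a single pseudo-state.

    observations: array of shape (T, n_features); t: current time index.
    Requires t >= max(lags) so that every lagged observation exists.
    """
    return np.stack([observations[t - lag] for lag in lags], axis=0)

# usage: the pseudo-state has shape (len(lags), n_features) and replaces the
# unobservable state s_t
history = np.random.randn(500, 6)             # illustrative observation history
pseudo_state = lagged_state(history, t=200)
```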
In our experiment we typically use lag periods like (0, 1, 2, 3, 4, 20, 60) for daily data, where (0, 1, 2, 3, 4) provides the last week of observations, 20 is the one-month-ago observation (as there are approximately 20 business days in a month) and 60 the three-month-ago observation.

Regular observations
There are two types of observations: regular and contextual information. Regular observations are data directly linked to the problem to solve. In the case of an asset management framework, regular observations are past prices observed over a lag period δ = (0 < i_1 < . . . < i_j). To normalize data, we rather use past returns computed as $r^k_t = \frac{p^k_t}{p^k_{t-1}} - 1$, where $p^k_t$ is the price at time t of asset k. To give information about regime changes, our trading agent also receives the empirical standard deviation computed over a sliding estimation window of length d as $\sigma^k_t = \sqrt{\frac{1}{d} \sum_{u=t-d+1}^{t} (r^k_u - \mu)^2}$, where the empirical mean is computed as $\mu = \frac{1}{d} \sum_{u=t-d+1}^{t} r^k_u$. Hence our regular observations form a three-dimensional tensor, stacking for each asset and each lagged date the corresponding return and volatility.

Context observation
Contextual observations are additional information that provide intuition about the current context. For our asset manager, they are other financial data, not directly linked to its portfolio, assumed to have some predictive power for the portfolio assets. Contextual observations are stored in a 2D matrix denoted by C_t, stacking the past p individual contextual observations. Among these observations, we have the maximum and minimum portfolio strategy returns and the maximum portfolio strategy volatility. The latter information is, as for regular observations, motivated by the stylized fact that standard deviations are useful features to detect crises. The contextual state is thus the matrix C_t of the p most recent contextual observations. The matrix nature of the contextual state C_t implies in particular that we will use 1D convolutions should we use convolutional layers. All in all, the augmented observations are the pair formed by the asset-state tensor and the contextual-state matrix, which feed the two sub-networks of our global network.

In our deep reinforcement learning setting, the augmented asset manager agent needs to decide at each period in which hedging strategy it invests. The augmented asset manager can invest in l strategies, which can be simple strategies or strategies that are themselves run by an asset management agent. To cope with reality, the agent will only be able to act after one period, because asset managers have a one-day turnaround to change their positions. We will see in the experiments that this one-day turnaround lag makes a big difference in the results. As it has access to l potential hedging strategies, the output is an l-dimensional vector that provides how much to invest in each hedging strategy. For our deep network, this means that the last layer is a softmax layer ensuring that the portfolio weights are between 0 and 100% and sum to 1, denoted by (p^1_t, ..., p^l_t). In addition, to include leverage, our deep network has a second output, the overall leverage, which lies between 0 and a maximum leverage value (3 in our experiment) and is denoted by lvg_t. Hence the final allocation is given by lvg_t × (p^1_t, ..., p^l_t). In terms of reward, we consider the net performance of our portfolio from t_0 to the last training date t_T, computed as $P_{t_T} / P_{t_0} - 1$.

We display in figure 2 the architecture of our network. Because we feed our network with both data from the strategies to select and contextual information, our network is a multi-input network. Additionally, as we want these inputs to provide not only the percentages invested in the different hedging strategies (via a softmax activation of a dense layer) but also the overall leverage (via a dense layer with a single output neuron), it is also a multi-output network.
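A minimal sketch of such a multi-input, multi-output architecture is given below, using Keras. The tensor shapes, kernel sizes and activations are illustrative assumptions; the filter counts (5 and 10 convolutions for the asset states, 3 for the contextual states) follow the description given in the next paragraph, and the leverage head is bounded via a scaled sigmoid, which is one possible way to keep it between 0 and the maximum leverage.

```python
# Sketch of a two-input, two-output network: conv sub-networks for asset and
# contextual states, a softmax head for weights and a bounded head for leverage.
from tensorflow.keras import layers, Model

n_lags, n_assets, n_asset_features = 7, 4, 2      # illustrative asset-state tensor shape
n_context_steps, n_context_features = 60, 3       # illustrative contextual-state shape
max_leverage = 3.0

# asset-state sub-network: two convolutional layers (5 then 10 filters)
asset_in = layers.Input(shape=(n_lags, n_assets, n_asset_features), name="asset_states")
x = layers.Conv2D(5, kernel_size=(2, 1), activation="relu")(asset_in)
x = layers.Conv2D(10, kernel_size=(2, 1), activation="relu")(x)
x = layers.Flatten()(x)

# contextual-state sub-network: one 1D convolutional layer (3 filters)
context_in = layers.Input(shape=(n_context_steps, n_context_features), name="context_states")
y = layers.Conv1D(3, kernel_size=3, activation="relu")(context_in)
y = layers.Flatten()(y)

z = layers.Concatenate()([x, y])

weights_out = layers.Dense(n_assets, activation="softmax", name="portfolio_weights")(z)
leverage_raw = layers.Dense(1, activation="sigmoid", name="raw_leverage")(z)
leverage_out = layers.Lambda(lambda v: max_leverage * v, name="leverage")(leverage_raw)

model = Model(inputs=[asset_in, context_in], outputs=[weights_out, leverage_out])
model.summary()
```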
Additional hyperparameters used in the network include an L2 regularization with a coefficient of 1e-8.

Figure 2: network architecture obtained via the tensorflow plot_model function. Our network is very different from standard DRL networks that have a single input and a single output: the contextual information introduces a second input, while the leverage adds a second output.

Because we want to extract features implicitly with a limited number of parameters, and following (Liang et al. 2018), we use convolutional networks, which perform better than simple fully connected layers. For our so-called asset states, named like that because they are the part of the states that relates to the assets, we use two convolutional layers with 5 and 10 convolutions. These parameters were found to be efficient on our validation set. In contrast, for the contextual states part, we only use one convolutional layer with 3 convolutions. We flatten our two sub-networks in order to concatenate them into a single network.

To learn the parameters of the network depicted in figure 2, we use a modified policy gradient algorithm called adversarial, as we introduce noise in the data as suggested in (Liang et al. 2018). The idea of introducing noise in the data is to have some randomness in each training run to make it more robust. This is somewhat similar to dropout in deep networks, where we randomly perturb the network by removing some neurons to make it more robust and less prone to overfitting. Here, we perturb the data directly to create this stochasticity and make the network more robust.

A policy is a mapping from the observation space to the action space, π : O → A. To achieve this, a policy is specified by a deep network with a set of parameters θ. The action is a vector function of the observation given the parameters: a_t = π_θ(o_t). The performance metric of π_θ for the time interval [0, t] is defined as the corresponding total reward function of the interval, $J_{[0,t]}(\pi_\theta) = R\big(o_1, \pi_\theta(o_1), \cdots, o_t, \pi_\theta(o_t), o_{t+1}\big)$. After random initialization, the parameters are continuously updated along the gradient direction with a learning rate λ: $\theta \longrightarrow \theta + \lambda \nabla_\theta J_{[0,t]}(\pi_\theta)$. The gradient ascent optimization is done with the standard Adam (short for Adaptive Moment Estimation) optimizer to have the benefit of adaptive gradient descent with root mean square propagation (Kingma and Ba 2014). The whole process is summarized in Algorithm 1. In our gradient ascent, we use a learning rate of 0.01 and an adversarial Gaussian noise with a standard deviation of 0.002. We run up to 500 iterations, with an early stop condition if there is no improvement on the training set over the last 50 iterations.
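The following is a hedged sketch of such an adversarial policy gradient loop, not the authors' exact Algorithm 1: Gaussian noise is added to the observations at each iteration, the cumulative reward of the episode is computed under the current policy, and the parameters are updated by gradient ascent with Adam. `model` is assumed to be the multi-input network from the previous sketch, the observation tensors and return series are illustrative float32 tensors, and the hyper-parameter values follow the ones quoted above.

```python
# Sketch of the adversarial policy-gradient training loop (assumed form).
import numpy as np
import tensorflow as tf

# illustrative data, shaped consistently with the network sketch above
T, n_assets = 250, 4
asset_obs = tf.random.normal((T, 7, n_assets, 2))
context_obs = tf.random.normal((T, 60, 3))
asset_returns = tf.random.normal((T, n_assets), stddev=0.01)

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
noise_std = 0.002
max_iterations, patience = 500, 50

def episode_reward(model, asset_x, context_x, returns):
    # run the policy over the whole episode and compute the final net performance
    weights, leverage = model([asset_x, context_x], training=True)
    step_returns = tf.reduce_sum(weights * returns, axis=1) * tf.squeeze(leverage)
    return tf.reduce_prod(1.0 + step_returns) - 1.0

best_reward, best_iter = -np.inf, 0
for it in range(max_iterations):
    # adversarial perturbation: Gaussian noise added directly to the observations
    noisy_asset = asset_obs + tf.random.normal(tf.shape(asset_obs), stddev=noise_std)
    noisy_context = context_obs + tf.random.normal(tf.shape(context_obs), stddev=noise_std)
    with tf.GradientTape() as tape:
        reward = episode_reward(model, noisy_asset, noisy_context, asset_returns)
        loss = -reward                         # gradient ascent on the reward
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if reward > best_reward:
        best_reward, best_iter = float(reward), it
    elif it - best_iter > patience:            # early stop if no improvement
        break
```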
We are interested in planning a hedging strategy for a risky asset. The experiment uses daily data from 01/05/2000 to 19/06/2020 for the MSCI and 4 SG-CIB proprietary systematic strategies. The risky asset is the MSCI World index, whose daily data can be found on Bloomberg. We choose this index because it is a good proxy for a wide range of asset manager portfolios. The hedging strategies are 4 SG-CIB proprietary systematic strategies further described below.

Training and testing are done following the extending walk-forward analysis presented in (Benhamou et al. 2020a) and (Benhamou et al. 2020b), with initial training from 2000 to the end of 2006 and testing on a rolling 1-year period. Hence, there are 14 training and testing periods, the testing periods corresponding to the years from 2007 to 2020 and the training done on the period starting in 2000 and ending one day before the start of the testing period.

Systematic strategies are similar to asset managers that invest in financial markets according to an adaptive, predefined trading rule. Here, we use 4 SG CIB proprietary 'hedging strategies' that tend to perform when stock markets are down:
• Directional hedges - react to small negative returns in equities,
• Gap risk hedges - perform well in sudden market crashes,
• Proxy hedges - tend to perform in some market configurations, for example when highly indebted stocks under-perform other stocks,
• Duration hedges - invest in the bond market, a classical diversifier of equity risk in finance.

The underlying financial instruments vary from put options, listed futures and single stocks to government bonds. Some of these strategies are akin to an insurance contract and bear a negative cost over the long run. The challenge consists in balancing cost versus benefit. In practice, asset managers have to decide how much of these hedging strategies is needed on top of an existing portfolio to achieve a better risk reward. The decision making process is often based on contextual information, such as the economic and geopolitical environment, the level of risk aversion among investors and other correlation regimes. The contextual information is modeled by a large range of features:
• the level of risk aversion in financial markets, or market sentiment, measured as an indicator varying between 0 for maximum risk aversion and 1 for maximum risk appetite,
• the bond-equity historical correlation, a classical ex-post measure of the diversification benefits of a duration hedge, measured on 1-month, 3-month and 1-year rolling windows,
• the credit spreads of global corporate bonds - investment grade and high yield, in Europe and in the US - known to be an early indicator of potential economic tensions,
• the equity implied volatility, a measure of the 'fear factor' in financial markets,
• the spread between the yield of Italian government bonds and German government bonds, a measure of potential tensions in the European Union,
• the US Treasury slope, a classical early indicator of US recession,
• and some more financial variables often used as a gauge for global trade and activity: the dollar, the level of rates in the US, and the estimated earnings per share (EPS).

A cross-validation step selects the most relevant features. In the present case, the first three features are selected. The rebalancing of strategies in the portfolio comes with transaction costs, which can be quite high since hedges use options. Transaction costs are like frictions in physical systems. They are taken into account dynamically to penalise solutions with a high turnover rate.

Asset managers use a wide range of metrics to evaluate the success of their investment decisions. For a thorough review of those metrics, see for example (Cogneau and Hübner 2009). The metrics we are interested in for our hedging problem are listed below:
• annualized return, defined as the average annualized compounded return,
• annualized daily-based Sharpe ratio, defined as the ratio of the annualized return over the annualized daily-based volatility, μ/σ,
• Sortino ratio, computed as the ratio of the annualized return over the downside standard deviation,
• maximum drawdown (max DD), computed as the maximum of all daily drawdowns. The daily drawdown is the ratio of the difference between the running maximum of the portfolio value, defined as $RM_T = \max_{t=0..T}(P_t)$, and the portfolio value, over the running maximum of the portfolio value. Hence the drawdown at time T is given by $DD_T = (RM_T - P_T)/RM_T$. The maximum drawdown is the maximum loss in return that an investor will incur if she/he invested at the worst time (at the peak).
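For completeness, here is a minimal sketch of how the metrics listed above can be computed from a series of daily portfolio returns. The 252 business-day annualization convention and the illustrative return series are assumptions, not values taken from the paper.

```python
# Sketch: annualized return, Sharpe, Sortino and maximum drawdown from daily returns.
import numpy as np

def evaluation_metrics(daily_returns, periods_per_year=252):
    daily_returns = np.asarray(daily_returns)
    nav = np.cumprod(1.0 + daily_returns)                    # portfolio value P_t with P_0 = 1
    years = len(daily_returns) / periods_per_year

    annual_return = nav[-1] ** (1.0 / years) - 1.0           # annualized compounded return
    annual_vol = daily_returns.std() * np.sqrt(periods_per_year)
    downside = daily_returns[daily_returns < 0].std() * np.sqrt(periods_per_year)

    running_max = np.maximum.accumulate(nav)                 # RM_t = max_{s<=t} P_s
    drawdowns = (running_max - nav) / running_max            # daily drawdowns DD_t
    return {
        "annual_return": annual_return,
        "sharpe": annual_return / annual_vol,
        "sortino": annual_return / downside,
        "max_drawdown": drawdowns.max(),
    }

# usage with an illustrative return series
print(evaluation_metrics(np.random.normal(0.0003, 0.01, size=2520)))
```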
Overall, the DRL approach achieves much better results than traditional methods, as shown in table 1, except for the maximum drawdown (max DD). Because the time horizon matters in the comparison, we provide risk measures for the last 2 and 5 years to emphasize that the DRL approach appears more robust than traditional portfolio allocation methods. When plotting performance results from 2007 to 2020, as shown in figure 3, we see that the DRL model is able to deviate upward from the risky asset continuously, indicating a steady performance. In contrast, the other financial models are not able to keep their marginal over-performance with respect to the risky asset over time and end up slightly below it.

The reason for the stronger performance of DRL lies in the way it chooses its allocation. Contrary to standard financial methods, which play the diversification card as shown in figure 4, DRL aims at choosing a single hedging strategy most of the time and at changing it dynamically should financial market conditions change. In a sense, DRL does some cherry picking by selecting what it thinks is the best hedging strategy. In contrast, traditional models like Markowitz, minimum variance, maximum diversification, maximum decorrelation and risk parity provide non-null weights for all our hedging strategies and do no cherry picking at all. Nor are they able to change the leverage used in the portfolio, as opposed to the DRL model.

The DRL model can change its portfolio allocation should market conditions change. This is the case from 2018 onwards, with a short deleveraging window emphasized by the small blank disruption during the Covid crisis, as shown in figure 5. We observe in this figure, where we have zoomed in on the year 2020, that the DRL model is able to reduce leverage from 300% to 200% during the Covid crisis (end of February 2020 to start of April 2020). This is a unique feature of our DRL model compared to traditional financial planning models, which do not take leverage into account and keep a leverage of 300% regardless of market conditions.

As illustrated by the experiment, the advantages of DRL are numerous: (i) DRL maps market conditions directly to actions by design and hence should adapt to a changing environment, (ii) DRL does not rely on any traditional financial risk assumptions, (iii) DRL can incorporate additional data and be a multi-input method, as opposed to more traditional optimization methods.

Figure 5: disallocation of the DRL model.

As nice as this work is, there is room for improvement, as we have only tested a few scenarios and only a limited set of hyper-parameters for our convolutional networks.
We should do more intensive testing to confirm that DRL is better able to adapt to a changing financial environment. We should also investigate the impact of more layers and other design choices in our network.

In this paper, we discuss how a traditional portfolio allocation problem can be reformulated as a DRL problem, trying to bridge the gap between the two approaches. We see that the DRL approach enables us to select fewer strategies, improving the overall results, as opposed to traditional methods that are built on the concept of diversification. We also stress that DRL can better adapt to changing market conditions and is able to incorporate more information to make its decisions.

References
Optimal control of Markov processes with incomplete state-information II. The convexity of the loss function.
AAMDRL: Augmented asset management with deep reinforcement learning. arXiv.
The effect of errors in means, variances, and covariances on optimal portfolio choice.
Is the potential for international diversification disappearing? Working Paper.
Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates.
The efficient market inefficiency of capitalization-weighted stock portfolios.
End-to-end training of deep visuomotor policies.
Optimistic bull or pessimistic bear: Adaptive deep reinforcement learning for stock portfolio allocation.
Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop.
Double deep Q-learning for optimal execution.
Trust region policy optimization.
High-dimensional continuous control using generalized advantage estimation. ICLR.
Reinforcement Learning: An Introduction.
Application of deep reinforcement learning in stock trading strategies and stock forecasting.
Adaptive stock trading strategies with deep reinforcement learning methods.
Practical deep reinforcement learning approach for stock trading.
Reinforcement-learning based portfolio management with augmented asset movement prediction states.
Model-based deep reinforcement learning for financial portfolio optimization.
Deep reinforcement learning for trading.

We would like to thank Beatrice Guez and Marc Pantic for meaningful remarks. The views contained in this document are those of the authors and do not necessarily reflect those of SG CIB.