title: Multi-agent reinforcement learning approach for hedging portfolio problem
authors: Pham, Uyen; Luu, Quoc; Tran, Hien
date: 2021-04-19
journal: Soft Computing
DOI: 10.1007/s00500-021-05801-6

Developing a hedging strategy to reduce the risk of losses for a given set of stocks in a portfolio is a difficult task because of the cost of the hedge. In the Vietnamese stock market, hedging a long position in a stock involves a cross-hedge, because there are no put options on individual stocks. In addition, only VN30 stock index futures contracts are traded on the Hanoi Stock Exchange. Inspired by recent achievements of deep reinforcement learning, we explore the feasibility of constructing a hedging strategy automatically by leveraging cooperative multi-agent reinforcement learning techniques, without advanced domain knowledge. In this work, we use 10 popular stocks on the Ho Chi Minh Stock Exchange and VN30F1M (the VN30 index futures contract with one-month settlement) to develop a stock market simulator (including transaction fees, tax, and settlement dates of transactions) for training reinforcement learning agents. We use daily returns as input data for the training process. Results suggest that the agent can learn trading and hedging policies that make profit and reduce losses. Furthermore, we find that our agent can protect portfolios and earn positive profit even when the market collapses systematically. In practice, this work can help investors in Vietnam's stock market improve performance and reduce trading losses, especially when volatility cannot be controlled.

Hedging a position in stock is an attractive topic for both academics and practitioners. The objective of hedging is to minimize market risk due to price fluctuation, maximize profit by speculating on the basis, and construct a portfolio with reduced risk (Floros and Vougas 2004). Portfolio managers have used stock index futures as a means to adjust the desired return and potential loss of a portfolio since the 1980s. The main advantages of index futures as a hedging tool are liquidity and lower transaction costs (Ghosh 1993). However, hedging strategies are not always as effective as expected, because the relationship between the cash price and the futures price is usually not perfect, or the hedged stock position differs from the underlying portfolio of the index contract (Figlewski 1984; Floros and Vougas 2004). The hedge may even increase the risk of a loss that leads to negative return. Hence, hedgers have to determine the optimal hedge ratio to control the risk of the portfolio.

In contrast to supervised and unsupervised learning, reinforcement learning relies mainly on experience from repeated interaction to learn an optimal policy that makes sequential decisions maximizing rewards in a given environment (Sutton and Barto 2018). In a complex and dynamic environment, training may require large amounts of computational power over a long period of time. With the progress of deep learning techniques and computer hardware, reinforcement learning has become more feasible by using deep neural networks as function approximators. From real-time strategy games with long time horizons and high-dimensional observation and action spaces, to self-driving vehicles and data center cooling systems, deep reinforcement learning has been applied to more and more complex real-world challenges (Berner et al. 2019; Evans and Gao 2016; O'Kelly et al. 2018).
In finance, deep reinforcement learning is also widely adopted. Zhang et al. (2020) use various RL algorithms, including deep Q-learning, policy gradients, and Advantage Actor-Critic (A2C), to design trading strategies for continuous futures contracts. They use technical indicators such as moving average convergence/divergence (MACD) and relative strength index (RSI) as part of the input features, and the agent can deliver profits even under heavy transaction costs. Ganesh et al. (2019) develop a multi-agent dealer market for market making with different competitive scenarios and market price conditions; the research suggests that the trained agent can learn to manage its inventory and adapt to its competitors' pricing policies. However, investigating the feasibility of trading stocks and futures at the same time to hedge a portfolio using deep reinforcement learning remains a topical and interesting problem.

In this study, we selected 10 popular stocks on HSX and one stock index futures contract on HNX to build a simulation of the stock market environment with real market data and study the learning performance of our agent. Our objective in this work is to investigate whether cooperating multiple agents can determine an optimal hedging strategy that protects a stock portfolio.

Hedging is a financial strategy to reduce investment risk by taking an opposite position in a related asset to offset losses. Before making any investment, investors have to balance profit against risk, for example, expected returns against the variance of returns. In fact, a dollar of loss can cost an investor or a company more than a dollar of profit benefits them. Hence, the reduction in risk provided by hedging typically also reduces potential profits; this trade-off between profits and risks is a basic problem in finance. Hedging generally involves the use of financial instruments known as derivatives. The two most common derivatives are options (such as a call option, namely the right to buy an asset at a fixed strike price by a predetermined time t in the future, or a put option, i.e., the right to sell) and futures contracts (with which the buyer must purchase, or the seller must sell, the underlying asset at the agreed price on the expiration date).

How much should an open or spot position be hedged? A fixed or "obvious" hedge ratio may increase rather than decrease risk (McDonald 2006). The answer depends on which kind of risk investors consider, and the optimal hedge ratio is obtained accordingly. For example, if the variance of returns is used as the risk measure of a portfolio, the optimal hedge ratio is the minimum variance hedge ratio (MVHR).

Motivated by risk reduction, hedging a stock portfolio with index futures has been an active research topic since it was introduced (Figlewski 1984). A hedger expects the return of a hedged position (e.g., a stock portfolio) to be close to the risk-free interest rate. Many methods have been used to estimate the optimal hedge ratio hr; for instance, the one-to-one hedge, the beta hedge, and the MVHR are some of these methods (Brown 1985; Ederington 1979).
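To make the position-sizing side of these methods concrete, the sketch below (our own illustration, not taken from the paper) shows how a simple beta hedge would size a short index futures position for a long stock portfolio; the function name, the pandas/numpy inputs, and the contract multiplier are assumptions rather than the actual VN30F1M contract specification.

    import numpy as np
    import pandas as pd

    def beta_hedge_contracts(portfolio_returns: pd.Series,
                             index_returns: pd.Series,
                             portfolio_value: float,
                             futures_price: float,
                             contract_multiplier: float = 100_000.0) -> float:
        """Number of index futures contracts for a beta hedge.
        The multiplier is a placeholder, not the VN30F1M specification."""
        # Portfolio beta with respect to the index, estimated from daily returns.
        cov = np.cov(portfolio_returns, index_returns)
        beta = cov[0, 1] / cov[1, 1]
        notional_per_contract = futures_price * contract_multiplier
        # Negative sign: a long stock portfolio is hedged by shorting index futures.
        return -beta * portfolio_value / notional_per_contract

With a portfolio beta close to 1, this reduces to shorting futures with roughly the same notional value as the portfolio, which corresponds to the one-to-one hedge hr = -1 discussed below.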
Butterworth and Holmes (2001) evaluated the hedging effectiveness of stock index futures with four different strategies (i.e., the traditional hedge, the MVHR, least trimmed squares (LTS), and the beta ratio of a cross-hedge) over daily and weekly hedge durations in the UK market. The results suggest that the MVHR and LTS methods are robust for estimating the ratio.

Under the assumption that cash and futures prices move closely together, the one-to-one hedge strategy sets hr = -1. The beta hedge strategy uses the negative of the beta of the cash portfolio as hr; the hedger expects the overall beta of the hedged portfolio to be zero. In practice, however, changes in spot and futures prices are imperfectly correlated. Particularly in the case of a cross-hedge (namely, the use of a derivative on one asset to hedge another asset), the one-to-one and beta hedges may not reduce risk; on the contrary, futures hedging can lead to unexpected losses. The MVHR was introduced to work around this problem by taking the imperfect relationship between prices into account when determining the optimal ratio hr. Let R_s, R_f, and R_h be the returns of the spot position (e.g., the open portfolio), the futures position (e.g., the index futures used for hedging), and the hedged portfolio, respectively. Then

R_h = R_s + h R_f,    Var(R_h) = Var(R_s) + h^2 Var(R_f) + 2h Cov(R_s, R_f).

The optimal ratio h (or hr) minimizing Var(R_h) is

h* = - Cov(R_s, R_f) / Var(R_f).

Furthermore, using ordinary least squares (OLS) regression to estimate the minimum risk hedge, Figlewski (1984) found that the hedging effectiveness of a large-capitalization portfolio can be "fairly good" for a one-week holding period (p. 663). With a diversified portfolio of small stocks, however, the effectiveness is reduced significantly. Basis risk is not negligible even if the spot position is hedged with the index futures itself; when basis risk arises, it can generate profit or loss. It is suggested that holding hedge positions for one day can increase basis risk and reduce hedging effectiveness compared with a one-week hedge.

Arguing that traditional methods of estimating the optimal hedge ratio are misspecified, Ghosh (1993) proposed an error correction model (ECM) to estimate the optimal hedge ratio and evaluate it with out-of-sample forecasts. First, a cointegration test is carried out; second, OLS regression is used to estimate the error correction model. The model incorporates the long-run equilibrium relationship as well as the short-run dynamics. The results show that the optimal hedge ratio is significantly improved, with a higher adjusted R^2 from the ECM than from traditional methods. Also, comparing root-mean-squared errors (RMSE), out-of-sample forecasts from the ECM are found to be better than those of other methods.

Beyond variance and standard deviation, value at risk (VaR) and conditional value at risk (CVaR) are extensively applied to measure market risk for portfolio hedging strategies (Cao et al. 2010; Huggenberger et al. 2016). VaR was introduced by J.P. Morgan in the 1990s and widely adopted to summarize the risk of an entire portfolio at the end of each day (Miller 2018). However, VaR is not a coherent risk measure; to be coherent, a measure must satisfy monotonicity, positive homogeneity, translation invariance, and subadditivity (Artzner 1999; Artzner et al. 1997). CVaR was constructed with these properties as a valid practical alternative to VaR (Acerbi and Tasche 2002). Alexander et al. (2003) show that CVaR is applicable to a wide range of derivatives portfolios, including American options and exotic options. In addition, the CVaR risk metric is found to be suitable for asymmetric return distributions, and the expected loss of a portfolio can be minimized in many circumstances (Topaloglou et al. 2002).
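As a minimal illustration of the MVHR derived above (not part of the original study; it assumes two aligned arrays of daily returns), h* can be computed either from the covariance formula or, equivalently, as the negative OLS slope of spot returns regressed on futures returns.

    import numpy as np

    def min_variance_hedge_ratio(spot_returns: np.ndarray,
                                 futures_returns: np.ndarray) -> float:
        """h* = -Cov(R_s, R_f) / Var(R_f)."""
        cov = np.cov(spot_returns, futures_returns)
        return -cov[0, 1] / cov[1, 1]

    # Equivalent OLS view: h* is the negative slope of R_s regressed on R_f.
    # slope, _ = np.polyfit(futures_returns, spot_returns, 1)
    # h_star = -slope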
Reinforcement learning has been proposed to train trading systems that make profit and adjust risk (Moody et al. 1998). Recurrent learning and Q-learning with neural networks were used to optimize financial performance functions, including risk-adjusted return and immediate utility, for online learning (Moody et al. 1998). Portfolios with continuous quantities of multiple assets were also considered, and the results show that reinforcement learning can avoid large losses when the market crashes. A basis risk hedging strategy was developed with reinforcement learning in Watts (2015): without requiring asset models, a state-action-reward-state-action (SARSA)-based algorithm was applied to find an optimal trading policy that hedges a non-traded asset. Q-learning has been proposed to extend the Black-Scholes-Merton (BSM) model for option pricing and hedging (Halperin 2017). In an attempt to escape Greeks and complete-market assumptions in risk management, a Greek-free approach leveraging deep reinforcement learning has been proposed that focuses on realistic market dynamics and out-of-sample testing performance for optimizing the hedging of a portfolio of derivatives. Deep reinforcement learning has been further investigated for hedging a portfolio of over-the-counter derivatives under generic market frictions, with trading costs and liquidity constraints taken into account.

For a given stochastic environment ε, an agent interacts with the environment by choosing a legal action a_t from the available actions at time step t, a_t ∈ A ≡ {1, ..., L}; the action space can be discrete or continuous. When the selected action is passed to the environment ε, the internal state s_t switches to another state in the state set S. In other words, the process of sequential interaction between the agent and the environment results from mapping perceived states s_t to actions a_t through a policy π. For instance, in the game Dota 2, the internal state can be all the information available to a human player, including positions, health, and the map (Berner et al. 2019). In this research, the internal state consists of the return of each asset in percentage and the position of each asset. In return, the agent receives a reward r_t for the chosen action as feedback, together with the new internal state s_{t+1}, at each time step until a terminal state is reached. The ultimate goal of deep reinforcement learning is to find a policy π that selects optimal actions maximizing the reward signal in each state s_t.

The value of a state measures the total expected return by predicting future rewards with a discount rate γ ∈ [0, 1]. The total accumulated discounted return G_t from time t, looking k time steps into the future, is defined as

G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}.

The state value V_π(s) is defined as in Sutton and Barto (2018):

V_π(s) = E_π[ G_t | S_t = s ].

Similarly, the action value Q_π(s, a) is the expected return when starting from state s, selecting action a, and following policy π thereafter. In value-based reinforcement learning, off-policy Q-learning was introduced to estimate the action value function Q_π(s, a) (Watkins 1989). The algorithm directly approximates the optimal action value function Q*(s, a). By using a neural network as a function approximator, the function can be estimated as Q*(s, a) ≈ Q(s, a, θ); this approach is referred to as a Q-network with weights θ (Mnih et al. 2013).
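For concreteness, a generic tabular Q-learning update is sketched below; it is not the agent used in this paper (which is actor-critic based), and the integer state/action encoding is a simplifying assumption.

    import numpy as np

    def q_learning_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
                          alpha: float = 0.1, gamma: float = 0.99) -> None:
        """One off-policy Q-learning step:
        Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])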
In contrast to value-based methods, policy-based methods select actions directly by parameterizing the policy π(a|s, θ) and using gradient ascent on E[R_t] to find the parameters θ that produce the highest reward. In terms of probability, we can express the policy as π(a|s, θ) = Pr{A_t = a | S_t = s, θ_t = θ}, the probability of taking action a when the environment ε is in state s at time t with parameters θ. Actor-critic algorithms learn approximations of both the value and the policy functions (Konda and Tsitsiklis 2000): to improve performance, the critic learns a value function (e.g., the state value), which is then used to update the policy parameters of the actor.

Extending the single-agent setting, multi-agent learning considers n agents interacting with the environment ε. At state s_t of time step t, each agent selects an action a^i_t in reaction to the state and receives a reward r^i_t, where i ∈ {1, ..., n}. Hence, for any given joint policy π(a|s) := ∏_{i=1}^{n} π^i(a^i|s) and state s ∈ S, the state value function of agent i can be defined as in Zhang et al. (2019):

V^i_{π^i, π^{-i}}(s) = E[ Σ_{t=0}^{∞} γ^t r^i_t | a^i_t ∼ π^i(·|s_t), s_0 = s ],

where −i denotes all agents except the ith agent.

The importance weighted actor-learner architecture (IMPALA) is a decoupled actor-critic style learner that introduces the V-trace off-policy correction to learn a policy π and a baseline function V_π, achieving stability, high data throughput, and efficiency in agent training (Espeholt et al. 2018). Moreover, deep neural networks can be trained efficiently with IMPALA, as suggested in Fig. 2. Suppose a local actor policy μ generates a trajectory (s_t, a_t, r_t)_{t=k}^{t=k+n}. The n-step V-trace target v_k for the value approximation V(s_k) at state s_k is defined as

v_k = V(s_k) + Σ_{t=k}^{k+n-1} γ^{t-k} ( ∏_{i=k}^{t-1} c_i ) δ_t V,    δ_t V = ρ_t ( r_t + γ V(s_{t+1}) − V(s_t) ),

where δ_t V is the temporal difference for V, and ρ_t = min(ρ̄, π(a_t|s_t)/μ(a_t|s_t)) and c_i = min(c̄, π(a_i|s_i)/μ(a_i|s_i)) are truncated importance sampling weights. It is worth noting that the truncation levels are assumed to satisfy c̄ ≤ ρ̄. Furthermore, the value function V_θ and the policy π_ω, with parameters θ and ω respectively, are updated in the directions of

(v_k − V_θ(s_k)) ∇_θ V_θ(s_k)   and   ρ_k ∇_ω log π_ω(a_k|s_k) ( r_k + γ v_{k+1} − V_θ(s_k) ).

An entropy term H(ω) is added to avoid premature convergence and encourage exploration during agent training. Owing to the efficiency of the architecture, the IMPALA algorithm can be used to train multiple tasks concurrently with one set of weights.
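The V-trace target above can be computed with a simple backward recursion. The sketch below is a minimal single-trajectory version under the stated truncation assumption, not the RLlib implementation used later in the paper; it assumes the importance ratios π/μ have already been computed.

    import numpy as np

    def vtrace_targets(rewards, values, bootstrap_value, rhos, cs,
                       gamma=0.99, rho_bar=1.0, c_bar=1.0):
        """V-trace targets v_t for one trajectory of length n.
        rewards, values (V(s_t)), rhos, cs: arrays of length n;
        rhos/cs are raw importance ratios pi(a_t|s_t) / mu(a_t|s_t);
        bootstrap_value is V(s_n). Assumes c_bar <= rho_bar."""
        rewards = np.asarray(rewards, dtype=float)
        values = np.asarray(values, dtype=float)
        rhos = np.minimum(rho_bar, np.asarray(rhos, dtype=float))  # truncated rho_t
        cs = np.minimum(c_bar, np.asarray(cs, dtype=float))        # truncated c_t
        values_next = np.append(values[1:], bootstrap_value)
        deltas = rhos * (rewards + gamma * values_next - values)   # delta_t V
        vs = np.empty_like(values)
        acc = 0.0
        # Backward recursion: v_t - V(s_t) = delta_t V + gamma * c_t * (v_{t+1} - V(s_{t+1}))
        for t in reversed(range(len(rewards))):
            acc = deltas[t] + gamma * cs[t] * acc
            vs[t] = values[t] + acc
        return vs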
We collect daily historical stock price and volume data from the Ho Chi Minh Stock Exchange (HSX) for equities and from the Ha Noi Stock Exchange (HNX) for derivatives, covering the period from September 25, 2017, to May 21, 2020.

Feeding deep neural networks raw inputs rather than handcrafted features is often recommended to achieve higher performance (Krizhevsky et al. 2012). Likewise, research shows that reinforcement learning can exceed human capabilities without human expert data or domain knowledge (Mnih et al. 2013; Silver et al. 2017). Consequently, in this study, instead of applying advanced quantitative finance theories to develop the trading and hedging strategy, the daily return data of each asset collected from the HSX and HNX exchanges are used as the main components of the environment observation. Specifically, we constructed a set of 12 different periods for out-of-sample evaluation: data from September 25, 2017, to May 16, 2019, are used for training, and data from May 17, 2019, to May 21, 2020, for evaluation (see Table 1). We also provide the position and the unrealized profit and loss (P&L) of each asset in the portfolio to our neural network.

Actor-critic-based algorithms use policy and value networks, which can be kept separate or combined; we choose the combined architecture for its computational efficiency. Furthermore, we use an LSTM (Hochreiter and Schmidhuber 1997) alongside a dense layer in the shared network. As suggested in Fig. 3, we use a shallow network architecture in this study. In detail, in the shared network a trajectory is fed to the first fully connected hidden layer with 256 units and a tanh activation function. The next hidden layer is a stateful LSTM with 256 units, also with tanh activation. Finally, the policy and value heads are fully connected linear layers producing a single output per action and the state value, respectively. We use the same network architecture and hyper-parameters for all trading agents without tweaking.

Each agent learns on its own to decide when to buy (go long) or sell (go short) assets; for instance, an agent can cut losses or hold positions overnight without any constraint. Our experiment uses discrete actions. For equities, the trading agent can hold, buy, or sell stocks, without considering the traded volume; stocks can only be sold after the T+2 settlement. Likewise, the derivatives trading agent can hold, long, or short futures contracts, and it can trade continuously because the futures market settles at T+0. Transaction fees are included in every trade return in the stock market simulator (see Algorithm 1: simulation of the stock market environment). For every agent, the reward is 1 if the overall profit of an episode is positive and -1 if it is negative. In addition, we discount the reward for every long position the agent takes in the futures market to encourage hedging. Finally, we use RLlib (Liang et al. 2017) with 32 workers to train the agents, with RMSProp as the optimizer.
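As one possible reading of the reward scheme just described (our own sketch, not Algorithm 1 from the paper; the subtractive form of the long-futures discount and its magnitude are assumptions), the episode reward could be computed as follows.

    from typing import Sequence

    def episode_reward(trade_pnls: Sequence[float], fees: Sequence[float],
                       n_long_futures: int, long_discount: float = 0.1) -> float:
        """Sign-based episode reward: +1 if net profit after fees is positive,
        -1 otherwise, reduced for each long futures position to encourage
        hedging (short) behaviour. The 0.1 discount is a placeholder, not a
        value taken from the paper."""
        net_profit = sum(p - f for p, f in zip(trade_pnls, fees))
        reward = 1.0 if net_profit > 0 else -1.0
        # Each long futures position lowers the reward, nudging the futures
        # agent toward short (hedging) positions.
        reward -= long_discount * n_long_futures
        return reward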
We compare the results of our proposed trading strategy with the performance of a buy-and-hold strategy to determine the effectiveness of the approach. During the evaluation periods, the stock market was highly volatile due to the impact of the COVID-19 pandemic, so a buy-and-hold strategy may lead to negative returns (see Table 2). The trained deep RL agent was deployed to trade out-of-sample market data from May 17, 2019, to May 21, 2020. The results show that the trained agent can protect the portfolio by short selling in the futures market. In addition, in some cases our agent can cut losses and outperform the market return in the equity market, even when the market plunged as traders panic-sold out of fear of the COVID-19 pandemic (see Table 3). Specifically, the VN30 Index lost about 300 points (33%) during the first three months of 2020 (Period 9, Period 10, and Period 11; see Fig. 1). Every stock in the portfolio had a negative market return in those periods, and after transaction and commission fees the RL agent could not maintain a positive return in the equity market. In contrast, our agent executed many orders to open and close positions in the futures market, hedging the equity assets dynamically, which led to a positive return for the portfolio (see Fig. 4). The portfolio profit did not decrease when the market rebounded, owing to the dynamic hedging strategy. However, in some cases the agent cannot outperform the market return. Overall, we show that our method can reduce losses and achieve positive profit in trading. The trading data also suggest that a dynamic hedging strategy for the equities in the portfolio is feasible in the cross-hedging case, and the futures trading agent generated far more profit than losses. During the evaluation periods, our deep RL agent earned a profit of about 30% of the portfolio value and maintained a positive return even when the market collapsed systematically.

This study proposed a feasible approach to cross-hedging in trading, without domain knowledge, by applying deep reinforcement learning. Our results also suggest that the approach can cut losses efficiently when the market is in a selling panic, as happened during the COVID-19 event. Overall, the proposed method can generate positive profit with a dynamic hedging strategy, and the result is desirable as our approach earns more than the risk-free rate (Hancock and Weise 1994). It remains important to develop deterministic agent behavior to maintain reliable outcomes; in future work, we should further study stability and safety in reinforcement learning for trading.

References

Acerbi and Tasche (2002) Expected shortfall: a natural coherent alternative to value at risk
Alexander et al. (2003) Derivative portfolio hedging based on CVaR
Artzner et al. (1999) Coherent measures of risk
Berner et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint
Brown (1985) A reformulation of the portfolio model of hedging
Deep hedging. Available at SSRN
Butterworth D, Holmes P (2001) The hedging effectiveness of stock index futures: evidence for the FTSE-100 and FTSE-Mid250 indexes traded in the UK
Cao et al. (2010) Hedging and value at risk: a semiparametric approach
Ederington (1979) The hedging performance of the new futures markets
Espeholt et al. (2018) IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures
Evans and Gao (2016) DeepMind AI reduces Google data centre cooling bill by 40%
Figlewski (1984) Hedging performance and basis risk in stock index futures
Floros and Vougas (2004) Hedge ratios in Greek stock index futures market
Ganesh et al. (2019) Reinforcement learning for market making in a multi-agent dealer market
Ghosh (1993) Hedging with stock index futures: estimation and forecasting with error correction model
Halperin (2017) QLBS: Q-learner in the Black-Scholes(-Merton) worlds. Available at SSRN 3087076
Hancock and Weise (1994) Competing derivative equity instruments: empirical evidence on hedged portfolio performance
Hochreiter and Schmidhuber (1997) Long short-term memory
Huggenberger et al. (2016) Tail risk hedging and regime switching
Konda and Tsitsiklis (2000) Actor-critic algorithms
Krizhevsky et al. (2012) ImageNet classification with deep convolutional neural networks
Liang et al. (2017) RLlib: abstractions for distributed reinforcement learning. arXiv preprint
Miller (2018) Quantitative financial risk management
Mnih et al. (2013) Playing Atari with deep reinforcement learning
Moody et al. (1998) Reinforcement learning for trading systems and portfolios: immediate vs future rewards
O'Kelly et al. (2018) Scalable end-to-end autonomous vehicle testing via rare-event simulation
Silver et al. (2017) Mastering the game of Go without human knowledge
Sutton and Barto (2018) Reinforcement learning: an introduction, 2nd edn
Topaloglou et al. (2002) CVaR models with selective hedging for international asset allocation
Watkins (1989) Learning from delayed rewards
Watts S (2015) Hedging basis risk using reinforcement learning
Zhang et al. (2019) Multi-agent reinforcement learning: a selective overview of theories and algorithms
Zhang et al. (2020) Deep reinforcement learning for trading

Conflict of Interest: Uyen Pham declares that she has no conflict of interest. Quoc Luu declares that he has no conflict of interest. Hien Tran declares that he has no conflict of interest.

Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.