Portfolio Optimization with 2D Relative-Attentional Gated Transformer
Tae Wan Kim; Matloob Khushi
2020-12-27

Abstract: Portfolio optimization is one of the fields that has attracted the most attention in machine learning research. Many researchers have attempted to solve this problem with deep reinforcement learning because it is inherently well suited to the properties of financial markets. However, most of the resulting models are hardly applicable to real-world trading since they ignore, or extremely simplify, the realistic constraints of transaction costs. These constraints have a significantly negative impact on portfolio profitability. In our research, a conservative level of transaction fees and slippage is considered to keep the experiment realistic. To enhance performance under those constraints, we propose a novel Deterministic Policy Gradient with 2D Relative-attentional Gated Transformer (DPGRGT) model. By applying learnable relative positional embeddings along both the time and assets axes, the model better captures the peculiar structure of financial data in the portfolio optimization domain. In addition, gating layers and layer reordering are employed for stable convergence of Transformers in reinforcement learning. In our experiment using 20 years of U.S. stock market data, our model outperformed baseline models and demonstrated its effectiveness.

I. INTRODUCTION
Portfolio optimization aims to allocate resources optimally across various financial assets to maximize the return while reducing the risks. Since it was theoretically pioneered by [1], many researchers have attempted to solve this problem using various machine learning approaches. In particular, reinforcement learning is a type of machine learning suitable for sequential decision making such as online portfolio rebalancing. In reinforcement learning, the agent improves its policy for deciding an action by repeatedly trying various actions in the environment and maximizing the expected cumulative reward from the environment. This can be implemented by two elements of an agent: the actor, which decides the action, and the critic, which assesses the value of the action with an estimate of the expected cumulative reward. Deep reinforcement learning, reinforcement learning that utilizes deep neural networks in its actor and critic, is known to be efficient in handling financial problems. However, most research that adopted it for portfolio optimization showed a lack of consideration of realistic constraints, which affects the performance of the models. Moreover, the data in the portfolio optimization domain has intractable characteristics: continuous action space, partial observability, and high dimensionality. This research proposes the Deterministic Policy Gradient with 2D Relative-attentional Gated Transformer (DPGRGT) model to tackle these issues. In the ablation study, the model demonstrated profitability as well as stability and outperformed baseline models.

II. RELATED WORK
Various machine learning approaches have been examined for portfolio optimization. The majority of them, including recent research [2], [3], [4], focused on predicting future prices and built portfolios based on the prediction.
However, this two-stage approach can be suboptimal in that minimizing the prediction error can differ from the objective of optimizing portfolios, and relevant data can be lost by using the predicted price alone [5]. Moreover, the performance of the approach is highly dependent on prediction accuracy, and a high degree of accuracy can hardly be achieved with financial data. For these reasons, researchers have employed reinforcement learning as an alternative approach that does not predict future prices. Several researchers [6, 7] used reinforcement learning with a discrete action space, simplifying trading actions to buying, selling, or holding a single asset. A limited number of trading positions was also used in [8], but it is difficult to generalize this approach to large-scale portfolios since expanding the number of assets results in exponential growth of the action space. To address the continuous action space problem, [9] used a policy-based reinforcement learning framework that employs deep neural networks as its approximation functions and returns deterministic continuous action values directly from the policy network. General reinforcement learning approaches are based on the Markov Decision Process, which assumes that the current state depends on the previous state only. However, financial markets are only partially observable [10] from the price and volume at a specific point in time. To address the partial observability of the state, reinforcement learning with recurrent neural networks was proposed in [8, 11, 12], using a time series of observations instead of a single observation to represent a state. However, RNNs, including LSTM, still suffer from long-term dependency problems and show lower performance with longer data [13]. The attention mechanism [14] was proposed to tackle this problem. In particular, the advent of the Transformer, which uses multi-head attention [15], produced state-of-the-art performance in natural language processing and computer vision, but the financial field has not yet benefited from it. Moreover, in [16, 17] a Transformer failed to solve a simple Markov Decision Process problem or was merely comparable to a random model, which suggests that it is extremely difficult to optimize Transformers in a reinforcement learning setting. There have been many studies on portfolio optimization, but most of them are based on unrealistic assumptions about transaction fees and slippage, which have a significant impact on portfolio profitability [18]. Ignoring or extremely simplifying these constraints makes it difficult to apply the algorithms to real-world asset trading. In this regard, this research proposes policy-based deep reinforcement learning using a variation of the Transformer to address realistic constraints as well as profitability.

Since a financial state is only partially observable, this study employs an observation set of historical prices and trading volumes up to time $t$ to represent the state at time $t$. A single observation set is a three-dimensional tensor that consists of five features, the Opening, High, Low, and Closing prices (OHLC) and the trading volumes of the assets, at time $t$. Each feature is an $n \times (m+1)$ matrix, where the $n$ rows represent the time axis and the $m+1$ columns represent the assets axis consisting of cash and $m$ assets. The opening prices $O_t$, high prices $H_t$, low prices $L_t$, closing prices $C_t$, and trading volumes $V_t$ at time $t$ are as follows:

$$O_t = \begin{bmatrix} o_{t}^{0} & o_{t}^{1} & \cdots & o_{t}^{m} \\ \vdots & \vdots & \ddots & \vdots \\ o_{t-n+1}^{0} & o_{t-n+1}^{1} & \cdots & o_{t-n+1}^{m} \end{bmatrix},$$

and $H_t$, $L_t$, $C_t$, and $V_t$ are defined analogously with elements $h$, $l$, $c$, and $v$, where the subscript of each element stands for the time and the superscript stands for the asset. The elements with the superscript 0 in the first column stand for cash and are uniformly set to one.
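As an illustration of this observation layout, the sketch below stacks the five features into a single tensor in Python/NumPy. The function name and the choice to order rows from the most recent day backward are assumptions made for illustration; preprocessing such as log differencing is assumed to happen upstream.

```python
import numpy as np

def build_observation(ohlcv, t, n, m):
    """Build the (5, n, m + 1) observation tensor for time t.

    ohlcv: array of shape (5, T, m) holding open, high, low, close, volume
           for m assets over T days.
    Rows are ordered from time t (first row) back to t - n + 1, and a cash
    column of ones is prepended as asset 0.
    """
    window = ohlcv[:, t - n + 1:t + 1, :][:, ::-1, :]  # (5, n, m), most recent first
    cash = np.ones((5, n, 1))                          # cash column, uniformly one
    return np.concatenate([cash, window], axis=2)      # (5, n, m + 1)
```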
The action defined in this research is the set of proportions of the investment to be rebalanced at time $t$, since actions represented as continuous values can be applied to large-scale portfolios much better than discrete actions. The action $a_t = (a_t^0, a_t^1, \ldots, a_t^m)$ is a portfolio vector at time $t$ whose elements are the weights of resource allocation to cash and the $m$ assets, and the sum of the weights totals one.

The reward is the risk-adjusted return for the action, with transaction costs applied. The previous portfolio value $p_{t-1}$ is a scalar value calculated by the inner product of the current closing prices of the assets and the previously held shares of the assets:

$$p_{t-1} = c_t \cdot s_{t-1},$$

where $c_t = (c_t^0, \ldots, c_t^m)$ is the vector of current closing prices (the first row of the closing-price feature $C_t$) and $s_{t-1} = (s_{t-1}^0, \ldots, s_{t-1}^m)$ is the vector of previously held shares. The weighted portfolio values $p_{t-1} a_t$ are modulated to integer numbers of rebalanced shares of the assets according to the current closing prices as follows:

$$s_t = (p_{t-1}\, a_t) \,//\, c_t,$$

where $//$ stands for the element-wise floor-division operator that returns the integer quotients of element-wise division. The transaction fee rate is assumed conservatively at 20 basis points (0.2%), and the slippage rate is set at half of the proportional bid-ask spread. Since bid-ask spread data is difficult to acquire, the proportional bid-ask spread of a single asset at time $t$ is estimated with the close-high-low estimator of [19], which is computed from the daily closing log-price $c(t)$ and $\eta(t)$, the average of the daily high and low log-prices. The rebalancing cost for a single asset is the transaction fee and slippage proportional to the current closing price and the change in shares of the asset. Thus, the total rebalancing cost is calculated as follows:

$$\text{cost}_t = \sum_{i=1}^{m} \left(f + \delta_t^{i}\right) c_t^{i} \left| s_t^{i} - s_{t-1}^{i} \right|,$$

where $f$ is the transaction fee rate and $\delta_t^{i}$ is the slippage rate of asset $i$. The rebalanced share for cash, $s_t^{0}$, is the remainder of the previous portfolio value after deduction of the assets' rebalanced portfolio value and the rebalancing cost:

$$s_t^{0} = p_{t-1} - \sum_{i=1}^{m} c_t^{i} s_t^{i} - \text{cost}_t.$$

Now, the total rebalanced portfolio value can be calculated as follows:

$$p_t = c_t \cdot s_t = s_t^{0} + \sum_{i=1}^{m} c_t^{i} s_t^{i}.$$

The return is the log return of the portfolio value. Since the return itself does not reflect risks, the reward function used here is the Sortino ratio [20], which is a variation of the Sharpe ratio [21]. The Sharpe ratio is defined as the average of the historical returns from time 1 to $t$ divided by the standard deviation of all the returns, whereas the Sortino ratio is the expected return divided by the standard deviation of the negative returns only. The denominators used in both ratios represent the risks of the portfolios.
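To make the cost and reward computation concrete, the following is a minimal sketch of a single rebalancing step in Python/NumPy. The helper names (estimate_spread, rebalance_step, sortino_ratio) are illustrative rather than taken from the paper, and the two-day indexing in estimate_spread is an assumption about the close-high-low estimator of [19], whose exact form is not reproduced above.

```python
import numpy as np

FEE_RATE = 0.002  # 20 basis points, as assumed in the text

def estimate_spread(c_log, eta_log):
    """Proportional bid-ask spread per day from the closing log-price c and the
    mid-range eta (average of high and low log-prices).
    Two-day indexing is an assumption; negative products are clipped to zero."""
    prod = (c_log[:-1] - eta_log[:-1]) * (c_log[:-1] - eta_log[1:])
    return 2.0 * np.sqrt(np.maximum(prod, 0.0))

def rebalance_step(prev_shares, close, action, slippage):
    """One rebalancing step. prev_shares, close, action, slippage are length m+1
    vectors (index 0 = cash, close[0] == 1, slippage[0] == 0)."""
    prev_value = float(close @ prev_shares)              # p_{t-1}
    new_shares = np.floor(prev_value * action / close)   # element-wise floor division
    cost = np.sum((FEE_RATE + slippage[1:]) * close[1:]
                  * np.abs(new_shares[1:] - prev_shares[1:]))
    new_shares[0] = prev_value - close[1:] @ new_shares[1:] - cost  # cash residual
    new_value = float(close @ new_shares)
    return new_shares, new_value, np.log(new_value / prev_value)

def sortino_ratio(returns, eps=1e-8):
    """Mean return over the standard deviation of negative returns only."""
    downside = returns[returns < 0]
    downside_dev = np.std(downside) if downside.size else 0.0
    return float(np.mean(returns) / (downside_dev + eps))
```

In a training loop, the Sortino-based reward would be computed over the running history of log returns accumulated by rebalance_step.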
In this section, we propose the Deterministic Policy Gradient with 2D Relative-attentional Gated Transformer (DPGRGT) model. The overall architecture is shown in Fig. 1 and is designed in consideration of the characteristics of the portfolio optimization domain data: continuous action space, partial observability, and high dimensionality.

Fig. 1. The overall architecture of the DPGRGT model.

The agent basically follows the structure of Deep Deterministic Policy Gradient [9] for the continuous action space, utilizing Transformer encoders whose structure is robust to the long-term dependencies of partial observability. Specifically, a variation of the Transformer called the 2D Relative-attentional Gated Transformer (RG-Transformer) is used as the core part of its actor, target actor, critic, and target critic networks to deal with the high-dimensional portfolio data. The agent uses a deep neural network as a policy approximator that returns actions with continuous values in a deterministic way [9]. In addition to the actor $\mu$ and critic $Q$ with weights $\theta^{\mu}$ and $\theta^{Q}$, respectively, a separate pair of a target actor $\mu'$ and a target critic $Q'$ with respective weights $\theta^{\mu'}$ and $\theta^{Q'}$ is introduced to ensure stable learning. The target return $y_i$ for the $i$-th sample from the replay buffer is as follows:

$$y_i = r_i + \gamma\, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right),$$

where $s_i$, $a_i$, $r_i$, $s_{i+1}$, and $\gamma$ represent the state, action, reward, next state, and discount factor, respectively. The critic weights are updated by minimizing the loss from the temporal difference error between $y_i$ and $Q(s_i, a_i \mid \theta^{Q})$:

$$L(\theta^{Q}) = \frac{1}{N} \sum_{i} \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^{2}.$$

Also, the policy gradient used to update the actor weights is calculated using the chain rule as follows:

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s = s_i,\, a = \mu(s_i)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s = s_i}.$$

Finally, with a predefined update rate $\tau$, the target actor weights $\theta^{\mu'}$ and the target critic weights $\theta^{Q'}$ are softly updated to $\tau \theta^{\mu} + (1 - \tau)\, \theta^{\mu'}$ and $\tau \theta^{Q} + (1 - \tau)\, \theta^{Q'}$, respectively. During training, the actor adds the action-space noise of [22] for temporally correlated exploration to avoid local optima as follows:

$$a_t = \mu(s_t \mid \theta^{\mu}) + n_t, \qquad n_t = n_{t-1} + \theta_n (\mu_n - n_{t-1}) + \sigma_n \varepsilon_t,$$

where $n_t$ is the noise at time $t$ with fixed parameters $\theta_n$, $\mu_n$, and $\sigma_n$ for the noise generation and $\varepsilon_t$ is standard Gaussian noise. To accelerate the training procedure, an asynchronous episodic training method [23] was adopted. The episodic training is processed in parallel by multiple simulators that accumulate their groups of experience data $(s_i, a_i, r_i, s_{i+1})$ in the experience replay buffer. To encourage the agent to learn highly rewarded trajectories more often, HMemory saves an episode only when it renews the highest episodic reward, and the saved data are sampled with a predefined probability (the HMemory sampling rate).
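As a concrete illustration of the soft target update and the temporally correlated exploration noise described above, the following is a minimal sketch in Python/NumPy. The class and function names are illustrative, and the default noise parameters simply mirror the values reported later in the experiment setup.

```python
import numpy as np

def soft_update(target_weights, online_weights, tau=0.15):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target, per tensor."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]

class OUNoise:
    """Ornstein-Uhlenbeck process [22] for temporally correlated action-space noise."""
    def __init__(self, size, theta=0.13, mu=0.0, sigma=0.2, seed=0):
        self.theta, self.mu, self.sigma = theta, mu, sigma
        self.rng = np.random.default_rng(seed)
        self.state = np.full(size, mu, dtype=np.float64)

    def sample(self):
        # n_t = n_{t-1} + theta * (mu - n_{t-1}) + sigma * epsilon_t
        self.state = self.state + self.theta * (self.mu - self.state) \
            + self.sigma * self.rng.standard_normal(self.state.shape)
        return self.state

# During exploration the noise is added to the deterministic policy output:
# action = actor(state) + noise.sample()
```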
The input for portfolio optimization is historical data of partially observable states and is a tensor of three dimensions: financial features, historical time, and assets. To address the high dimensionality, the agent incorporates a variation of the Transformer capable of identifying positions along both the time and assets axes into the main part of its actors and critics. Fig. 2 shows the structure of the actor network and its 2D relative positional multi-head attention (2D relative attention), where the length dimension stands for the time periods while the height dimension represents the assets. As seen in Fig. 3, the critic network is similar to the actor network, except that it takes the action data as an additional input. Since the Transformer does not have any recurrent or convolutional structures, it requires additional position information. In addition to the sinusoidal encoding used in the original Transformer, [24, 25] showed the effectiveness of incorporating relative positional information into the self-attention for machine translation and music generation. For $l \times d$ dimensional data $X$, the relative attention $Z_h$ for each head is as follows:

$$Z_h = \mathrm{softmax}\!\left( \frac{X_h^{q} (X_h^{k})^{\top} + X_h^{q} (R_h)^{\top}}{\sqrt{d_h}} \right) X_h^{v}, \qquad (19)$$

where $d_h$ is $d$ divided by the number of heads $h$. $X_h^{q}$, $X_h^{k}$, and $X_h^{v}$ are evenly split $l \times d_h$ matrices for each head and are the query, key, and value of the attention, respectively. $R_h$ is a matrix that represents the relative positions between every pair of elements and is gathered from the $l \times d_h$ dimensional relative position embedding $E_h$ learned separately for each head. After multiplying the two expanded tensors $X^{q}$ of shape $(h, l, d_h)$ and $E^{\top}$ of shape $(h, d_h, l)$, "skewing" [24] the result gives rise to the direct calculation of $X_h^{q} (R_h)^{\top}$ for each head with efficient use of memory. To implement 2D relative attention in our model, two relative position embeddings are used for the $l \times (m+1) \times d$ dimensional financial data $X'$, where the additional dimension stands for the height. The embeddings $E^{l}$ and $E^{h}$ learn the relative positional representations for each pair of data elements in the length (time) dimension and the height (assets) dimension, respectively. While $X'^{q}$, $X'^{k}$, and $X'^{v}$ are flattened into $(h, l \cdot (m+1), d_h)$ tensors for the matrix multiplications other than the second term in (19), $X'^{q}$ with its original shape of $(h, l, m+1, d_h)$ is multiplied by the height embedding $(E^{h})^{\top}$ of shape $(h, d_h, m+1)$, and $X'^{q} (E^{h})^{\top}$ is flattened into an $(h \cdot l, m+1, m+1)$ tensor for skewing. Similarly, after multiplying the permuted $X'^{q}$ of shape $(h, m+1, l, d_h)$ and $(E^{l})^{\top}$ of shape $(h, d_h, l)$, the result is flattened into an $(h \cdot (m+1), l, l)$ tensor for skewing. After skewing, both of the calculation outputs result in tensors of shape $(h, l \cdot (m+1), l \cdot (m+1))$ and are added to the first term in place of the original second term in (19), representing the relative positions along both the height and length dimensions of the data. (A simplified reference computation of the relative-position term is sketched after the gating description below.) Lastly, the layer normalization is placed before the multi-head attention, and a gating layer is used to replace the residual connection to enhance the stability of Transformers in reinforcement learning [17]. The gating layer utilizes the structure of a GRU [13] cell as its gating function as follows:

$$r = \sigma\!\left(W^{r} y + U^{r} x\right), \qquad z = \sigma\!\left(W^{z} y + U^{z} x - b^{g}\right),$$
$$\hat{h} = \tanh\!\left(W^{g} y + U^{g} (r \odot x)\right), \qquad g(x, y) = (1 - z) \odot x + z \odot \hat{h},$$

where $x$ is the input from the previous layer (the residual value), $y$ is the output of the current sublayer, $\sigma$ is the sigmoid function, and $\odot$ is element-wise multiplication.
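For readers who want an explicit reference point, the sketch below computes the relative-position term of the attention logits for a single axis by directly gathering Shaw-style relative embeddings [25] in Python/NumPy. It is a simplified illustration, not the paper's implementation: it materializes the gathered embeddings instead of using the memory-efficient skewing of [24], it assumes one embedding per possible offset (2n - 1 entries rather than the per-head l x d_h table described above), and the function name is hypothetical.

```python
import numpy as np

def relative_logits_1d(q, rel_emb):
    """Relative-position term of the attention logits along one axis.

    q:       (heads, n, d_h)         queries along one axis (e.g. time).
    rel_emb: (heads, 2 * n - 1, d_h) one learned embedding per offset
                                     j - i in [-(n - 1), n - 1].
    Returns: (heads, n, n) tensor whose (i, j) entry is q_i . rel_emb[j - i],
             i.e. the X_q R^T term of (19) for this axis.
    """
    heads, n, d_h = q.shape
    # shift offsets j - i into non-negative embedding indices 0 .. 2n - 2
    idx = np.arange(n)[None, :] - np.arange(n)[:, None] + (n - 1)   # (n, n)
    r = rel_emb[:, idx, :]                                          # (heads, n, n, d_h)
    return np.einsum('hid,hijd->hij', q, r)

# In the 2D case, this term would be computed once along the time axis and once
# along the assets axis, each broadcast to (heads, l*(m+1), l*(m+1)) and added
# to the content term Q K^T before the softmax, as described above.
```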
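The GRU-style gating function itself is small enough to sketch directly. The following is a minimal NumPy version of the gating equations above, assuming square d x d weight matrices and a positive bias b_g that initially biases the gate toward passing the residual through; the class name and the initialization scheme are illustrative.

```python
import numpy as np

def _sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

class GRUGate:
    """GRU-style gating layer used in place of the residual connection x + y."""
    def __init__(self, d, b_g=2.0, seed=0):
        rng = np.random.default_rng(seed)
        # six d x d weight matrices; the positive bias b_g pushes the update gate z
        # toward zero at initialization, so the layer starts close to the identity map.
        self.Wr, self.Ur, self.Wz, self.Uz, self.Wg, self.Ug = (
            rng.normal(0.0, d ** -0.5, (d, d)) for _ in range(6))
        self.b_g = b_g

    def __call__(self, x, y):
        """x: residual input from the previous layer, y: sublayer output; shape (..., d)."""
        r = _sigmoid(y @ self.Wr + x @ self.Ur)
        z = _sigmoid(y @ self.Wz + x @ self.Uz - self.b_g)
        h_hat = np.tanh(y @ self.Wg + (r * x) @ self.Ug)
        return (1.0 - z) * x + z * h_hat
```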
The model is trained and evaluated with the assets of nine Dow Jones companies representing each sector: industrials (MMM), financials (JPM), consumer services (PG), technology (AAPL), health care (UNH), consumer goods (WMT), oil & gas (XOM), basic materials (DD), and telecommunications (VZ). The OHLC prices and trading volume data of the assets are collected from Yahoo Finance. The data for the 18 years from 2000 to 2017 are used for training, and the data for the period from 2018 to April 2020 are used for evaluation. Each daily observation set consists of historical data for the most recent 50 days (n = 50), and all the data is log-differenced for the stationarity of the time series. The maximum length of a single episode is set at 50 days, and the initial investment at 100,000 USD. Two separate Adam optimizers with mini-batch size 32 are used to train the actor and critic. The learning rate of both the actor and the critic is 1e-4, and the soft update rate $\tau$ for the target actor and target critic is 0.15. The discount factor $\gamma$ is 0.9, the HMemory sampling rate is 0.2, and the number of threads used for parallel training is 5. The action-space noise parameters $\theta_n$, $\mu_n$, and $\sigma_n$ are 0.13, 0, and 0.2, respectively. For the Transformer, three encoder layers, eight heads, 128-dimensional vectors for the attention hidden layers, and 512-dimensional vectors for the feed-forward network layers are used. All programs were implemented with Python 3 and TensorFlow 2 using Google Colab. As baseline models, two traditional portfolio strategies, MPT (Markowitz's Modern Portfolio Theory) and the Uniform Constant Rebalanced Portfolio (UCRP) strategy [26], are employed. For the ablation study, plain Deep Deterministic Policy Gradient (DDPG), DDPG with Transformer (DDPG_TF), DDPG with 2D Relative-attentional Transformer (DDPG_RP_TF), and DDPG with Gated Transformer (DDPG_GL_TF) are tested under the same conditions. Both the cumulative return and the annualized Sharpe ratio are evaluated to verify the robustness of the models to risks as well as their performance.

Fig. 4 shows the changes in portfolio values for all the models tested with the asset data for the 28-month evaluation period. The steep drop seen around March 2020 comes from the outbreak of COVID-19, which had a severely negative impact on the portfolio values. The portfolio value of the DPGRGT model keeps increasing until the outbreak, and despite the drop, the model outperformed all the other models, proving its resilience and high profitability. DDPG and DDPG with Transformer are slightly better than MPT but show poor performance. DDPG with 2D Relative-attentional Transformer is relatively better than DDPG and DDPG with Transformer, but still worse than the UCRP baseline model. Apart from DPGRGT, DDPG with Gated Transformer is the only model that is more profitable and better performing than UCRP, which points to the effectiveness of the gating layer of the Transformer in reinforcement learning. Table I shows the cumulative returns and the annualized Sharpe ratios of the models. While DPGRGT and DDPG with Gated Transformer are profitable, three of the other models lost over 40% of their initial value. The MPT model, which attempts to maximize its Sharpe ratio at every rebalancing opportunity, is the worst, whereas UCRP, which rebalances its shares of assets to maintain constant investment proportions, is the second best. This result suggests that transaction fees and slippage have a significant influence on rebalancing performance. Under these constraints, the effect of using either the 2D Relative-attentional Transformer or the Gated Transformer alone is also limited. Although the pandemic outbreak undermined the overall performance of the models, DPGRGT, which uses both 2D relative attention and the Gated Transformer, ultimately demonstrated stability and strong performance.

V. CONCLUSION AND FUTURE WORK
This paper proposes a portfolio optimization algorithm based on reinforcement learning using 2D Relative-attentional Gated Transformers. To the best of our knowledge, this is the first research that applies Transformers to reinforcement learning for portfolio optimization. The experiment shows that the 2D relative attention and gating layers improve the performance of Transformers, and combining them creates a synergy effect and produces the best results for portfolio optimization. Since even highly profitable models cannot be applied to real trades unless stability and realistic constraints are taken into consideration, risk is accounted for by incorporating the Sortino ratio into the reward function, and the transaction costs are set at a conservative level to ensure a more practical experiment. In a further study, the relations between periods and assets could be analyzed with the multi-head attention weights used in the Transformer to make the model more interpretable and trustworthy.
Furthermore, experiments utilizing a multi-GPU environment for a wide range of asset types can be considered for the generalization of the model.

REFERENCES
[1] Portfolio Selection.
[2] Adaptive Portfolio Asset Allocation Optimization with Deep Learning.
[3] Forecasting Portfolio Optimization using Artificial Neural Network and Genetic Algorithm.
[4] Portfolio management via two-stage deep learning with a joint cost.
[5] Performance functions and reinforcement learning for trading systems and portfolios.
[6] Deep Direct Reinforcement Learning for Financial Signal Representation and Trading.
[7] Learning to trade via direct reinforcement.
[8] Deep Robust Reinforcement Learning for Practical Algorithmic Trading.
[9] Continuous control with deep reinforcement learning.
[10] Planning and acting in partially observable stochastic domains.
[11] Continuous control with Stacked Deep Dynamic Recurrent Reinforcement Learning for portfolio optimization.
[12] Learning Continuous Control Policies by Stochastic Value Gradients.
[13] Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.
[14] Neural Machine Translation by Jointly Learning to Align and Translate.
[15] Attention is All you Need.
[16] A Simple Neural Attentive Meta-Learner.
[17] Stabilizing Transformers for Reinforcement Learning.
[18] Reinforcement Learning in Financial Markets.
[19] A Simple Estimation of Bid-Ask Spreads from Daily Close, High, and Low Prices.
[20] Performance Measurement in a Downside Risk Framework.
[21] Mutual Fund Performance.
[22] On the Theory of the Brownian Motion.
[23] Asynchronous Episodic Deep Deterministic Policy Gradient: Towards Continuous Control in Computationally Complex Environments.
[24] Music Transformer: Generating Music with Long-Term Structure.
[25] Self-Attention with Relative Position Representations.
[26] Universal data compression and portfolio selection.