title: WaveCorr: Correlation-savvy Deep Reinforcement Learning for Portfolio Management
authors: Marzban, Saeed; Delage, Erick; Li, Jonathan Yumeng; Desgagne-Bouchard, Jeremie; Dussault, Carl
date: 2021-09-14

The problem of portfolio management represents an important and challenging class of dynamic decision making problems, where rebalancing decisions need to be made over time with the consideration of many factors such as investors' preferences, trading environments, and market conditions. In this paper, we present a new portfolio policy network architecture for deep reinforcement learning (DRL) that can exploit cross-asset dependency information more effectively and achieve better performance than state-of-the-art architectures. In particular, we introduce a new property, referred to as asset permutation invariance, for portfolio policy networks that exploit multi-asset time series data, and design the first portfolio policy network, named WaveCorr, that preserves this invariance property when treating asset correlation information. At the core of our design is an innovative permutation invariant correlation processing layer. An extensive set of experiments are conducted using data from both Canadian (TSX) and American (S&P 500) stock markets, and WaveCorr consistently outperforms other architectures with an impressive 3%-25% absolute improvement in terms of average annual return, and up to more than 200% relative improvement in average Sharpe ratio. We also measured an improvement of a factor of up to 5 in the stability of performance under random choices of initial asset ordering and weights. This stability has been found particularly valuable by our industrial partner.
In recent years, there has been a growing interest in applying Deep Reinforcement Learning (DRL) to dynamic decision problems that are complex in nature. One representative class of problems is portfolio management, whose formulation typically requires a large number of continuous state/action variables and a sophisticated form of risk function for capturing the intrinsic complexity of financial markets, trading environments, and investors' preferences. In this paper, we propose a new DRL architecture for solving portfolio management problems that optimize a Sharpe ratio criterion. While there are several works in the literature that apply DRL to portfolio management problems, such as Moody et al. [1998], He et al. [2016], and Liang et al. [2018] among others, little has been done to investigate how to improve the design of the Neural Network (NN) in DRL so that it can more effectively capture the nature of the dependency exhibited in financial data. In particular, it is known that extracting and exploiting cross-asset dependencies over time is crucial to the performance of portfolio management. The neural network architectures adopted in most existing works, such as the Long Short-Term Memory (LSTM) or the Convolutional Neural Network (CNN), however, only process input data on an asset-by-asset basis and thus lack a mechanism to capture cross-asset dependency information. The architecture presented in this paper, named WaveCorr, offers a mechanism to extract information about both time-series dependency and cross-asset dependency. It is built upon the WaveNet structure [Oord et al., 2016], which uses dilated causal convolutions at its core, and a new design of correlation block that can process and extract cross-asset information. In particular, throughout our development, we identify and define a property that can be used to guide the design of a network architecture that takes multi-asset data as input.
This property, referred to as asset permutation invariance, is motivated by the observation that the dependency across assets has a very different nature from the dependency across time. Namely, while the dependency across time is sensitive to the sequential relationship of the data, the dependency across assets is not. To put it another way, given multivariate time series data, the data would not be considered the same if the time indices were permuted, but the data should remain the same if the asset indices are permuted. While this property may appear more than reasonable, as discussed in section 3, a naive extension of CNN that accounts for both time and asset dependencies can easily fail to satisfy it. To the best of our knowledge, the only other works that have also considered extracting cross-asset dependency information in DRL for portfolio management are the recent works of Zhang et al. [2020] and Xu et al. [2020]. While Zhang et al.'s work is closer to ours in that it is also built upon the idea of adding a correlation layer to a CNN-like module, its overall architecture is different from ours and, most noticeably, their design does not satisfy asset permutation invariance, so its performance can vary significantly when the ordering of assets changes. As further shown in the numerical section, our architecture, which has a simpler yet permutation invariant structure, outperforms Zhang et al.'s architecture in many aspects. The work of Xu et al. [2020] takes a very different direction from ours, following a so-called attention mechanism and an encoder-decoder structure; a more detailed discussion is beyond the scope of this paper. Overall, the contribution of this paper is threefold. First, we introduce a new property, referred to as asset permutation invariance, for portfolio policy networks that exploit multi-asset time series data.
Second, we design the first portfolio policy network, named WaveCorr, that accounts for asset dependencies in a way that preserves this invariance. This achievement relies on the design of an innovative permutation invariant correlation processing layer. Third, and most importantly, we present evidence that WaveCorr significantly outperforms state-of-the-art policy network architectures using data from both Canadian (TSX) and American (S&P 500) stock markets. Specifically, our new architecture leads to an impressive 5%-25% absolute improvement in terms of average annual return, up to more than 200% relative improvement in average Sharpe ratio, and a 16% reduction in the maximum daily portfolio loss, compared to the best competing method, during the 2019-2020 period (i.e. the Covid-19 pandemic). Using the same set of hyper-parameters, we also measured an improvement of up to a factor of 5 in the stability of performance under random choices of initial asset ordering and weights, and observed that WaveCorr consistently outperforms our benchmarks under a number of variations of the model, including the number of available assets, the size of transaction costs, etc. Overall, we interpret this empirical evidence as strong support regarding the potential impact of the WaveCorr architecture on automated portfolio management practices, and, more generally, regarding the claim that asset permutation invariance is an important NN property for this class of problems. The rest of the paper unfolds as follows. Section 2 presents the portfolio management problem and its risk averse reinforcement learning formulation. Section 3 introduces the new property of "asset permutation invariance" for portfolio policy networks and presents a new network architecture based on convolution networks that satisfies this property. Section 4 presents the findings from our numerical experiments. We finally conclude in Section 5.
The portfolio management problem consists of optimizing the reallocation of wealth among many available financial assets, including stocks, commodities, equities, currencies, etc., at discrete points in time. In this paper, we assume that there are m risky assets in the market, hence the portfolio is controlled based on a set of weights w_t ∈ W := {w ∈ R^m_+ | Σ_{i=1}^m w_i = 1}, which describes the proportion of wealth invested in each asset. Portfolios are rebalanced at the beginning of each period t = 0, 1, ..., T − 1, which incurs proportional transaction costs for the investor, with commission rates of c_s and c_p for selling and purchasing, respectively. We follow Jiang et al. [2017] to model the evolution of the portfolio value and weights (see Figure 1). Specifically, during period t the portfolio value and weights start at p_{t−1} and w_{t−1}, and the changes in stock prices, captured by a random vector of asset returns ξ_t ∈ R^m, affect the end-of-period portfolio value p'_t := p_{t−1} ξ_t^⊤ w_{t−1} and weight vector w'_t := (p_{t−1}/p'_t) ξ_t ⊙ w_{t−1}, where ⊙ is a term-wise product. The investor then decides on a new distribution of his wealth w_t, which triggers a transaction cost. Denoting the net effect of transaction costs on the portfolio value by ν_t := p_t/p'_t, one finds (as previously reported in the literature) that ν_t is the solution of a fixed-point equation. This, in turn, allows us to express the portfolio's log return during the (t+1)-th period as ln(ν_t(w'_t, w_t)) + ln(ξ_{t+1}^⊤ w_t), where we make explicit the influence of w_t and w'_t on ν_t. We note that in Jiang et al. [2017], the authors suggest approximating ν_t using an iterative procedure. However, we show in Appendix A.1 that ν_t can easily be identified with high precision using the bisection method. In this section, we formulate the portfolio management problem as a Markov Decision Process (MDP) denoted by (S, A, r, P). In this regard, the agent (i.e.
an investor) interacts with a stochastic environment by taking an action a_t ≡ w_t ∈ W after observing the state s_t ∈ S composed of a window of historical market observations, which include the latest stock returns ξ_t, along with the final portfolio composition of the previous period w'_t. This action results in an immediate stochastic reward that takes the shape of an approximation of the realized log return, i.e. r_t(s_t, a_t, s_{t+1}) := ln(f(1, w'_t, w_t)) + ln(ξ_{t+1}^⊤ w_t) ≈ ln(ν(w'_t, w_t)) + ln(ξ_{t+1}^⊤ w_t), for which a derivative is easily obtained. Finally, P captures the assumed Markovian transition dynamics of the stock market and its effect on portfolio weights: P(s_{t+1}|s_0, a_0, s_1, a_1, ..., s_t, a_t) = P(s_{t+1}|s_t, a_t). Following the works of Moody et al. [1998] and Almahdi and Yang [2017] on risk averse DRL, our objective is to identify a deterministic trading policy µ_θ (parameterized by θ) that maximizes the expected value of the Sharpe ratio measured on T-period log return trajectories generated by µ_θ, namely the expectation, under some fixed initial state distribution F, of SR(r_{0:T−1}) := ((1/T) Σ_{t=0}^{T−1} r_t) / sqrt((1/(T−1)) Σ_{t=0}^{T−1} (r_t − r̄)²), where r̄ denotes the average of the T log returns. The choice of using the Sharpe ratio of log returns is motivated by modern portfolio theory (see Markowitz [1952]), which advocates a balance between expected returns and exposure to risks, and where it plays the role of a canonical way of exercising this trade-off [Sharpe, 1966]. While it is inapt at characterizing downside risk, it is still considered a "gold standard of performance evaluation" by the financial community [Bailey and Lopez de Prado, 2012]. In Moody et al. [1998], the trajectory-wise Sharpe ratio is used as an estimator of the instantaneous one in order to facilitate its use in RL. A side benefit of this estimator is to offer some control on the variations in the evolution of the portfolio value, which can be reassuring for the investor.
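Two quantities from this formulation reduce to short computations: the transaction-cost factor ν_t mentioned in the previous subsection, found by bisection, and the trajectory-wise Sharpe ratio SR(r_{0:T−1}). Below is a minimal NumPy sketch; the fixed-point form assumed for ν_t is a standard one and may differ in details from the paper's exact equations, and the risk-free rate and annualization are omitted from SR.

```python
import numpy as np

def nu_factor(w_prime, w, c_s, c_p, tol=1e-10):
    """Transaction-cost factor nu = p_t / p'_t, found by bisection on [0, 1].

    Assumed fixed-point form (a sketch, not the paper's exact equation):
        nu = 1 - c_s * sum((w'_i - nu*w_i)^+) - c_p * sum((nu*w_i - w'_i)^+).
    The residual below is continuous and strictly increasing in nu, so it
    changes sign exactly once on [0, 1] for commission rates below 100%.
    """
    def residual(nu):
        sell = np.maximum(w_prime - nu * w, 0.0).sum()   # proportions sold
        buy = np.maximum(nu * w - w_prime, 0.0).sum()    # proportions bought
        return nu - 1.0 + c_s * sell + c_p * buy

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def trajectory_sharpe(log_returns):
    """Trajectory-wise Sharpe ratio: sample mean of the T log returns over
    their sample standard deviation (guarded against a zero denominator)."""
    r = np.asarray(log_returns, dtype=float)
    return r.mean() / (r.std(ddof=1) + 1e-12)
```

With no rebalancing (w' = w), the bisection returns ν = 1, i.e. no cost is incurred, while any actual trade yields ν slightly below 1; maximizing the trajectory-wise Sharpe ratio rewards high average log returns while penalizing their dispersion, which is the variability-control effect noted above.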
In the context of our portfolio management problem, s_t is composed of an exogenous component s^exo_t, which includes ξ_t, and an endogenous state w'_t that becomes deterministic once a_t and s^exo_{t+1} are known. The objective can therefore be rewritten as an expectation under an arbitrary policy β(s_t), where the effect of µ_θ on the trajectory is computed recursively: the state s̄^θ_t, for t ≥ 1, is obtained from s̄^θ_{t−1} and the action µ_θ(s̄^θ_{t−1}), while s̄^θ_0 := s_0. Hence, ∇_θ SR can be obtained by backpropagation using the chain rule. This leads to the stochastic gradient step θ ← θ + α ∇_θ SR, with α > 0 as the step size. There are several considerations that go into the design of the portfolio policy network µ_θ. First, the network should have the capacity to handle long historical time series data, which allows for extracting long-term dependencies across time. Second, the network should be flexible in its design for capturing dependencies across a large number of available assets. Third, the network should be parsimoniously parameterized so as to achieve these objectives without being prone to overfitting. To this end, the WaveNet structure [Oord et al., 2016] offers a good basis for developing our architecture and was also employed in Zhang et al. Unfortunately, a direct application of WaveNet to portfolio management struggles at processing the cross-asset correlation information. This is because the convolutions embedded in the WaveNet model are 1D, and extending them to 2D convolutions increases the number of parameters in the model, which makes it more prone to over-fitting, a notorious issue particularly in RL. Most importantly, naive attempts at adapting WaveNet to account for such dependencies (as done in Zhang et al.) can make the network sensitive to the ordering of the assets in the input data, an issue that we will revisit below. We first present the general architecture of WaveCorr in Figure 2.
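As an aside before detailing the architecture, the stochastic gradient step above can be illustrated with a toy example. The sketch below is purely illustrative and not the paper's implementation: it uses a hypothetical stand-in policy (a constant softmax allocation held over the whole trajectory, no transaction costs) and replaces backpropagation with central finite differences.

```python
import numpy as np

def sharpe(r):
    """Trajectory-wise Sharpe ratio of a vector of log returns."""
    return r.mean() / (r.std(ddof=1) + 1e-12)

def rollout_log_returns(theta, xi):
    """Hypothetical stand-in policy: a constant allocation w = softmax(theta)
    held through a trajectory of gross returns xi (T x m), ignoring
    transaction costs; the per-period log return is ln(xi_t^T w)."""
    w = np.exp(theta - theta.max())
    w /= w.sum()
    return np.log(xi @ w)

def sgd_step(theta, xi, alpha=0.01, eps=1e-5):
    """One ascent step theta <- theta + alpha * grad_theta SR, with the
    gradient estimated by central finite differences (the paper obtains it
    by backpropagation through the chain rule)."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (sharpe(rollout_log_returns(theta + e, xi))
                   - sharpe(rollout_log_returns(theta - e, xi))) / (2 * eps)
    return theta + alpha * grad
```

With a small step size α, each call to sgd_step moves θ in a direction that increases the trajectory-wise Sharpe ratio of the sampled trajectory, which is the essence of the update performed during training.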
Here, the network takes as input a tensor of dimension m × h × d, where m is the number of assets, h the size of the look-back time window, and d the number of channels (number of features for each asset), and generates as output an m-dimensional wealth allocation vector. The WaveCorr blocks, which play the key role of extracting cross-time/asset dependencies, form the body of the architecture. In order to provide more flexibility in the choice of h, we place a causal convolution after the sequence of WaveCorr blocks to adjust the receptive field so that it includes the whole length of the input time series. Also, similar to the WaveNet structure, we use skip connections in our architecture. The design of the WaveCorr residual block extends a simplified variation [Bai et al., 2018] of the residual block in WaveNet by adding our new correlation layers (with ReLU and concatenation operations following right after). As shown in Figure 3, the block includes two layers of dilated causal convolutions, each followed by a ReLU activation function and a dropout layer. Given an input of dimension m × h × d, the convolutions output tensors of dimension m × h × d, where each slice of the output tensor, i.e. an m × 1 × d matrix, contains the dependency information of each asset over time. By applying different dilation rates in each WaveCorr block, the model is able to extract dependency information over a longer time horizon. A dropout layer with a rate of 50% is used to prevent over-fitting, whereas, as the residual connection's mechanism for preventing gradient explosion/vanishing, we use a 1 × 1 convolution (shown at the top of Figure 3), which ensures that the summation operation is over tensors of the same shape. The Corr layer generates an output tensor of dimension m × h × 1 from an m × h × d input, where each slice of the output tensor, i.e. an m × 1 × 1 matrix, is meant to contain cross-asset dependency information.
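To make the dilation mechanics concrete, here is a NumPy illustration of the per-asset 1D dilated causal convolutions and the resulting receptive-field growth; the actual blocks are trainable multi-channel layers, and the kernel values here are arbitrary choices for illustration.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation):
    """Causal convolution with dilation along the time axis.
    x: (h,) series; kernel: (k,) weights. Output y[t] depends only on
    x[t], x[t - dilation], ..., x[t - (k-1)*dilation] (with left zero
    padding), mirroring the 1 x k convolutions applied per asset."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(kernel[j] * xp[t + pad - j * dilation]
                         for j in range(k)) for t in range(len(x))])

def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of causal convolutions, one per block."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

Stacking small kernels with dilation rates 1, 2, 4, ... roughly doubles the receptive field at each block while the parameter count grows only linearly, which is what lets the network cover long look-back windows parsimoniously.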
The concatenation operator combines the cross-asset dependency information obtained from the Corr layer with the cross-time dependency information obtained from the causal convolutions. Before defining the Corr layer, we pause to introduce a property that will further guide its design, namely asset permutation invariance. This property is motivated by the idea that the set of possible investment policies that can be modeled by the portfolio policy network should not be affected by the way the assets are indexed in the problem. On a block-per-block level, we therefore impose that, when the asset indexing of the input tensor is reordered, the set of possible mappings obtained should also only differ in its asset indexing. More specifically, we let σ : R^{m×h×d} → R^{m×h×d} denote a permutation operator over a tensor T such that σ(T)[i, :, :] = T[π(i), :, :], where π : {1, ..., m} → {1, ..., m} is a bijective function. Furthermore, we consider a block, associated with a set of functions B, to be asset permutation invariant when, for every permutation operator σ, {σ⁻¹∘B∘σ : B ∈ B} = B. One can verify, for instance, that all the blocks described so far in WaveCorr are asset permutation invariant and that asset permutation invariance is preserved under composition (see Appendix A.2.2). With this property in mind, we can now detail the design of a permutation invariant Corr layer via Procedure 1, where we denote by CC : R^{(m+1)×h×d} → R^{1×h×1} the operator that applies an (m + 1) × 1 convolution, and by Concat_1 the operator that concatenates two tensors along the first dimension. In Procedure 1, the kernel is applied to a tensor O^mdl ∈ R^{(m+1)×h×d} constructed by adding the i-th row of the input tensor on top of the input tensor. Concatenating the output tensors from each run gives the final output tensor. Figure 4 gives an example for the case with m = 5 and h = d = 1. Effectively, one can show that the Corr layer satisfies asset permutation invariance (proof in Appendix). Proposition 3.1. The Corr layer block satisfies asset permutation invariance.
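Procedure 1 can be sketched in NumPy as follows; the weights w0, w and bias b stand in for the trainable (m + 1) × 1 convolution kernel, and sharing one scalar weight per channel across time is an assumption of this sketch.

```python
import numpy as np

def corr_layer(T, w0, w, b):
    """Sketch of the permutation invariant Corr layer (Procedure 1).
    T: (m, h, d) input; w0: (d,) weights for the prepended asset row;
    w: (m, d) weights for the m rows of the input; b: scalar bias.
    For each asset i, row i is stacked on top of T and an (m+1) x 1
    convolution collapses the asset and channel axes, giving one
    (1, h, 1) slice per asset; concatenation yields an (m, h, 1) output."""
    m, h, d = T.shape
    out = np.empty((m, h, 1))
    kernel = np.concatenate([w0[None, :], w], axis=0)        # (m+1, d)
    for i in range(m):
        stacked = np.concatenate([T[i:i+1], T], axis=0)      # (m+1, h, d)
        # Sum over the m+1 asset rows and the d channels for every time step.
        out[i, :, 0] = np.einsum('ad,ahd->h', kernel, stacked) + b
    return out
```

Mirroring the construction used in the proof of Proposition 3.1, permuting the input's asset rows together with the corresponding rows of w permutes the output rows in exactly the same way, which is the sense in which the layer treats all asset orderings identically.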
Table 1 summarizes the details of each layer involved in the WaveCorr architecture, including kernel sizes, internal numbers of channels, dilation rates, and types of activation functions. Overall, the following proposition confirms that the WaveCorr portfolio policy network satisfies asset permutation invariance (see Appendix for proof). Proposition 3.2. The WaveCorr portfolio policy network architecture satisfies asset permutation invariance. Finally, it is worth discussing some connections with the recent work of Zhang et al., where the authors propose an architecture that also takes both sequential and cross-asset dependency into consideration. Their proposed architecture, from a high-level perspective, is more complex than ours in that it involves two sub-networks, one LSTM and one CNN, whereas ours is built solely on CNN. Our architecture is thus simpler to implement, less susceptible to overfitting, and allows for more efficient computation. The most noticeable difference between their design and ours lies in the Corr layer block: they use a convolution with an m × 1 kernel to extract dependency across assets and apply a standard padding trick to keep the output tensor invariant in size. Their approach suffers from two issues (see Appendix A.3 for details): first, the kernel in their design may capture only partial dependency information, and second, most problematically, their design is not asset permutation invariant, so the performance of their network can be highly sensitive to the ordering of assets. This second issue is further confirmed empirically in section 4.3. In this section, we present the results of a series of experiments evaluating the empirical performance of our WaveCorr DRL approach. We start by presenting the experimental set-up. We follow with our main study that evaluates WaveCorr against a number of popular benchmarks.
We finally shed light on the superior performance of WaveCorr with comparative studies that evaluate the sensitivity of its performance to permutation of the assets, number of assets, size of transaction costs, and (in Appendix A.6.3) maximum holding constraints. All code is available at https://github.com/saeedmarzban/waveCorr. The Covid-data experiments reuse the same training (2012-2018) and testing (2019-2020) periods, given that hyper-parameters were reused from the previous two studies. We assume with all datasets a constant commission rate of c_s = c_p = 0.05% in the comparative study, while the sensitivity analysis considers no transaction costs unless specified otherwise. Benchmarks: In our main study, we compare the performance of WaveCorr to CS-PPN (from Zhang et al.), EIIE [Jiang et al., 2017], and the equal weighted portfolio (EW). Note that both CS-PPN and EIIE were adapted to optimize the Sharpe-ratio objective described in section 2.2, which exactly accounts for transaction costs. Hyper-parameter selection: Based on a preliminary unreported investigation, where we explored the influence of different optimizers (namely ADAM, SGD, RMSProp, and SGD with momentum), we concluded that ADAM had the fastest convergence. We also narrowed down a list of reasonable values (see Table A.6) for the following common hyper-parameters: initial learning rate, decay rate, minimum rate, look-back window size h, and planning horizon T. For each method, the final choice of hyper-parameter settings was made based on the average annual return achieved on both a 4-fold cross-validation study using Can-data and a 3-fold study with the US-data. The final selection (see Table A.6) favored, for each method, a candidate that appeared among the top 5 best performing settings on both datasets, in order to encourage generalization power among similarly performing candidates. Note that in order to decide on the number of epochs, an early stopping criterion was systematically employed.
Metrics: We evaluate all approaches using out-of-sample data ("test data"). "Annual return" denotes the annualized rate of return for the accumulated portfolio value. "Annual vol" denotes the prorated standard deviation of daily returns. The trajectory-wise Sharpe ratio (SR) of the log returns, the maximum drawdown (MDD), i.e. the biggest loss from a peak, and the average turnover, i.e. the average trading volume, are also reported (see the references for formal definitions). Finally, we report the average "daily hit rate", which captures the proportion of days during which the log returns out-performed EW. Important implementation details: Exploiting the fact that our SGD step involves exercising the portfolio policy network for T consecutive steps (see equation (3)), a clever implementation was able to reduce WaveCorr's training time per episode by a factor of 4. This was done by replacing the T copies of the portfolio policy network producing a_0, a_1, ..., a_{T−1} with an equivalent single augmented multi-period portfolio policy network producing all of these actions simultaneously, while making sure that all intermediate calculations are reused as much as possible (see Appendix A.4 for details). We also implement our stochastic gradient descent approach by updating, after each episode k, the initial state distribution F to reflect the latest policy µ_{θ_k}. This is done so that the final policy is better adapted to the conditions encountered when the portfolio policy network is applied over a longer horizon than T. In this set of experiments, the performances of WaveCorr, CS-PPN, EIIE, and EW are compared over a set of 10 experiments (with random reinitialization of NN parameters) on the three datasets. The average and standard deviation of each performance metric are presented in Table 2, while Figure A.8 (in the Appendix) presents the average out-of-sample portfolio value trajectories.
The main takeaway from the table is that WaveCorr significantly outperforms the three benchmarks on all datasets, achieving an absolute improvement in average yearly returns of 3% to 25% compared to the best alternative. It also dominates CS-PPN and EIIE in terms of Sharpe ratio, maximum drawdown, daily hit rate, and turnover. EW does appear to cause less volatility in the US-data, which leads to a slightly improved SR. Another important observation concerns the variance of these metrics over the 10 experiments. Once again, WaveCorr comes out as generally more reliable than the two other DRL benchmarks on the Can-data, while EIIE appears to be more reliable on the US-data at the price of a lower average performance. Overall, the impressive performance of WaveCorr seems to support our claim that our new architecture allows for a better identification of the cross-asset dependencies. Under conditions of market crisis (i.e. the Covid-data), we finally observe that WaveCorr exposes the investor to much lower short term losses, with an MDD of only 31% compared to more than twice as much for CS-PPN and EIIE, which reflects a more effective hedging strategy. Sensitivity to permutation of the assets: In this set of experiments, we are interested in measuring the effect of asset permutation on the performance of WaveCorr and CS-PPN. Specifically, each method is retrained under random permutations of the initial ordering of the assets; the results are summarized in Table 3 and illustrated in Figure 5. We observe that the learning curves and performance of CS-PPN are significantly affected by asset permutation compared to WaveCorr. In particular, one sees that the standard deviation of the annual return is reduced by a factor of about 5 with WaveCorr. We believe this is entirely attributable to the new structure of the Corr layer in the portfolio policy network. Sensitivity to number of assets: In this set of experiments, we measure the effect of varying the number of assets on the performance of WaveCorr and CS-PPN.
We therefore run 10 experiments (randomly resampling initial NN parameters) with growing subsets of 30, 40, and 50 assets from the Can-data. Results are summarized in Table 4 and illustrated in Figure A.9 (in Appendix). While having access to more assets should in theory be beneficial for portfolio performance, we observe that this is not necessarily the case for CS-PPN. On the other hand, as the number of assets increases, a significant improvement with respect to all metrics is achieved by WaveCorr. This evidence points to a better use of the correlation information in the data by WaveCorr. This paper presented a new architecture for portfolio management that is built upon WaveNet [Oord et al., 2016], which uses dilated causal convolutions at its core, and a new design of correlation block that can process and extract cross-asset information. We showed that, despite being parsimoniously parameterized, WaveCorr satisfies the property of asset permutation invariance (API), whereas a naive extension of CNN, such as in the recent work of Zhang et al., does not. The API property is appealing both from a practical point of view, given that it implies that the investor does not need to worry about how he/she indexes the different assets, and from an empirical point of view, given the evidence that it leads to improved stability of the network's performance. As a side product of our analysis, the results presented in Appendix A.2 lay important foundations for analysing the API property in a larger range of network architectures. In the numerical section, we tested the performance of WaveCorr using data from both Canadian (TSX) and American (S&P 500) stock markets. The experiments demonstrate that WaveCorr consistently outperforms our benchmarks under a number of variations of the model, including the number of available assets, the size of transaction costs, etc.
Regarding the "if" part, we start with the assumed equality, and then use the fact that the same identity holds for all permutation operators σ, where we assume for simplicity of exposition that h = h' and d = d', and exploit the fact that σ⁻¹ is also a permutation operator. A.2.1 Proof of Proposition 3.1. We first clarify that the correlation layer is associated with a parameterized set of functions B = {B_{w,b}} (see Procedure 1). Let σ (associated with the bijection π) be an asset permutation operator. For any correlation layer function B_{w,b} ∈ B, one can construct a new set of parameters w'_0 := w_0, w'_j := w_{π(j)} for all j = 1, ..., m, and b' := b, such that for all input tensors T and all i, B_{w',b'}(T)[i, :, :] = (σ⁻¹∘B_{w,b}∘σ)(T)[i, :, :]. We can therefore conclude that {σ⁻¹∘B∘σ : B ∈ B} ⊇ B. Based on Lemma A.1, we conclude that B is asset permutation invariant. To prove Proposition 3.2, we demonstrate that all blocks used in the WaveCorr architecture are asset permutation invariant (Steps 1 to 3). We then show that asset permutation invariance is preserved under composition (Step 4). Finally, we conclude in Step 5 that WaveCorr is asset permutation invariant. Step 1 - Dilated convolution, causal convolution, sum, and 1 × 1 convolution are asset permutation invariant: The functional classes of a dilated convolution, a causal convolution, a sum, and a 1 × 1 convolution block all have the form B = {B_g : g ∈ G}, where B_g(T)[i, :, :] := g(T[i, :, :]), ∀ i = 1, ..., m, for some set of functions G ⊆ {g : R^{1×h×d} → R^{1×h×d}}. In particular, in the case of dilated, causal, and 1 × 1 convolutions, this property follows from the use of 1 × 3, 1 × [h − 28], and 1 × 1 kernels respectively. Hence, for any g ∈ G, we have that σ⁻¹∘B_g∘σ = B_g. Step 2 - ReLU and dropout are asset permutation invariant: We first clarify that ReLU and dropout on a tensor in R^{m×h×d} are singleton sets of functions of the entry-wise form B(T)[i, j, k] := φ(T[i, j, k]). In particular, in the case of ReLU, φ(x) := max(0, x), while for dropout, φ(x) := x, since a dropout block acts as a feed-through operator at evaluation time.
Hence, we naturally have that σ⁻¹∘B∘σ = B for these singleton blocks. Step 3 - Softmax is asset permutation invariant: We first clarify that the softmax layer applied over the asset dimension of a tensor in R^{m×h×1} is a singleton set of functions B = {B} with B(T)[i, j, 1] := exp(T[i, j, 1]) / Σ_{i'=1}^m exp(T[i', j, 1]). Since permuting the entries of the input of a softmax simply permutes the entries of its output accordingly, we have that σ⁻¹∘B∘σ = B. Hence, we conclude that {σ⁻¹∘B∘σ : B ∈ B} = B. Step 4 - Asset permutation invariance is preserved under composition: Given two asset permutation invariant blocks representing the sets of functions B_1 and B_2, one can define the composition block as B := {B_1∘B_2 : B_1 ∈ B_1, B_2 ∈ B_2}. We have that for all B_1 ∈ B_1 and B_2 ∈ B_2, σ⁻¹∘B_1∘B_2∘σ = (σ⁻¹∘B_1∘σ)∘(σ⁻¹∘B_2∘σ) = B'_1∘B'_2, where B'_1 ∈ B_1 and B'_2 ∈ B_2 come from the definition of asset permutation invariance, and where B' := B'_1∘B'_2 ∈ B. We therefore have that {σ⁻¹∘B∘σ : B ∈ B} ⊇ B. Finally, Lemma A.1 allows us to conclude that B is asset permutation invariant. Step 5 - WaveCorr is asset permutation invariant: Combining Steps 1 to 4 with Proposition 3.1, we arrive at the conclusion that the architecture presented in Figure 2 is asset permutation invariant, since it is composed of a sequence of asset permutation invariant blocks. Assuming for simplicity that m is odd, the "correlational convolution layer" proposed in Zhang et al. takes the form of a set of functions parameterized by an m × 1 kernel with zero padding, where T[i', :, :] := 0 for all i' outside {1, ..., m} represents the padding. Figure A.6 presents an example of this layer when m = 5, h = 1, and d = 1. One can already observe in this figure that correlation information is only partially extracted for some of the assets, e.g. the convolution associated with asset one (cf. first row of the figure) disregards the influence of the fifth asset. While this could perhaps be addressed by using a larger kernel, a more important issue arises with this architecture, namely that the block does not satisfy asset permutation invariance. Proof. When m = 5, h = 1, and d = 1, we first clarify that the correlational convolution layer from Zhang et al. is associated with the set of functions B_{w,b}(T)[i] := b + Σ_{j=1}^5 w_j T[i − 3 + j], where we shortened the notation T[i, 1, 1] to T[i].
Let us consider the asset permutation operator that inverts the order of the first two assets: π(1) = 2, π(2) = 1, and π(i) = i for all i ≥ 3. We will prove our claim by contradiction. Assuming that B is asset permutation invariant, it must be that for any fixed values w̄ such that w̄_4 ≠ w̄_1, there exists an associated pair of values (w', b') that makes B_{w',b'} ≡ σ⁻¹∘B_{w̄,0}∘σ. In particular, the two functions should return the same values for the following three "tensors": T_0[i] := 0, T_1[i] := 1{i = 1}, and T_2[i] := 1{i = 2}. The first implies that b' = 0, since b' = B_{w',b'}(T_0)[1] = σ⁻¹(B_{w̄,0}(σ(T_0)))[1] = 0. However, it also implies that w'_2 = B_{w',0}(T_1)[2] = σ⁻¹(B_{w̄,0}(σ(T_1)))[2] = B_{w̄,0}(T_2)[1] = w̄_4, and that w'_2 = B_{w',0}(T_2)[3] = σ⁻¹(B_{w̄,0}(σ(T_2)))[3] = B_{w̄,0}(T_1)[3] = w̄_1. We therefore have a contradiction, since w̄_4 = w'_2 = w̄_1 is impossible given that w̄_4 ≠ w̄_1. We must therefore conclude that B was not asset permutation invariant. We close this section by noting that this important issue cannot simply be fixed by using a different type of padding or a larger kernel in the convolution. Regarding the former, our demonstration made no use of how padding is done. For the latter, our proof would still hold given that the fixed parameterization (w̄, 0) that we used would still identify a member of the set of functions obtained with a larger kernel. We detail in this section how the structure of the portfolio management problem (2) can be exploited for a more efficient implementation of a policy network, both in terms of computation time and hardware memory. This applies not only to the implementation of the WaveCorr policy network but also to the policy networks in Jiang et al. [2017] and Zhang et al.
In particular, given a multiperiod objective as in (2), calculating the gradient $\nabla_\theta SR$ involves generating a sequence of actions $a_0, a_1, \dots, a_{T-1}$ from a sample trajectory of states $s_0, s_1, \dots, s_{T-1} \in \mathbb{R}^{m\times h\times d}$ over a planning horizon $T$, where $m$ is the number of assets, $h$ the size of the lookback window, and $d$ the number of features. The common way of implementing this is to create a tensor $\mathcal{T}$ of dimension $m\times h\times d\times T$ from $s_0, \dots, s_{T-1}$ and apply the policy network $\mu_\theta(s)$ to each state $s_t$ in the tensor $\mathcal{T}$ so as to generate each action $a_t$. Assuming for simplicity of exposition that the state is entirely exogenous, this procedure is demonstrated in Figure A.7(a), where a standard causal convolution with $d = 1$ and a kernel size of 2 is applied. In this procedure, the memory used to store the tensor $\mathcal{T}$ and the computation time taken to generate all actions $a_0, \dots, a_{T-1}$ grow linearly in $T$, which becomes significant for large $T$. It is possible to apply the policy network $\mu_\theta(s)$ to generate all the actions $a_0, \dots, a_{T-1}$ more efficiently than the procedure described in Figure A.7(a). Namely, in our implementation, we exploit the sequential and overlapping nature of the sample states $s_0, \dots, s_{T-1}$ used to generate the actions $a_0, \dots, a_{T-1}$, which naturally arises in the consideration of a multiperiod objective. Recall firstly that each sample state $s_t \in \mathbb{R}^{m\times h\times d}$, $t \in \{0, \dots, T-1\}$, is obtained from a sample trajectory, denoted by $S \in \mathbb{R}^{m\times(h+T-1)\times d}$, where $s_t = S[:, t+1 : t+h, :]$ for $t = 0, \dots, T-1$. Thus, between any $s_t$ and $s_{t+1}$, the last $h-1$ columns of $s_t$ overlap with the first $h-1$ columns of $s_{t+1}$.
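The slicing and overlap described above can be made concrete with a small numpy sketch (the shapes below are arbitrary illustration values; in 0-indexed Python the 1-indexed slice $S[:, t+1:t+h, :]$ becomes `S[:, t:t+h, :]`):

```python
import numpy as np

m, h, d, T = 3, 5, 2, 4                 # assets, lookback, features, horizon
# a single sample trajectory S of shape (m, h + T - 1, d)
S = np.arange(m * (h + T - 1) * d, dtype=float).reshape(m, h + T - 1, d)

# the T states s_0, ..., s_{T-1}, each sliced from the same trajectory
states = [S[:, t:t + h, :] for t in range(T)]

# each state has shape (m, h, d), and consecutive states share h-1 columns,
# so processing them independently repeats most of the computation
assert states[0].shape == (m, h, d)
assert np.array_equal(states[0][:, 1:, :], states[1][:, :-1, :])
```

Storing `states` as a stacked tensor of shape $(m, h, d, T)$ duplicates the shaded overlap $T$ times, which is exactly the waste the augmented network avoids.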
The fact that there is a significant overlap between any two consecutive states $s_t$ and $s_{t+1}$ already hints that processing each state $s_{t+1}$ separately from $s_t$, as shown in Figure A.7(a), would repeat in the network a large number of calculations identical to those already performed when processing $s_t$, which is wasteful and inefficient. To avoid this issue, we take an augmented approach to applying the policy network. The idea is to use a sample trajectory $S$ directly as input to an augmented policy network $\mu_\theta : \mathbb{R}^{m\times(h+T-1)\times d} \to \mathbb{R}^{m\times T}$, which reduces to exactly the same architecture as the policy network $\mu_\theta(s_t)$ when generating only the $t$-th action. Figure A.7(b) presents this augmented policy network $\mu_\theta(S)$ for our example, and shows how it can be applied to a trajectory $S$ to generate all actions $a_0, \dots, a_{T-1}$ at once. One can observe that the use of an augmented policy network allows the intermediate calculations done for each state $s_t$ (for generating an action $a_t$) to be reused by the calculations needed for the other states (and for generating the other actions). With the exact same architecture as the policy network $\mu_\theta(s)$, the augmented policy network $\mu_\theta(S)$, which takes as input a trajectory of width $h + T - 1$ (thus including $T$ states), by design generates $T$ outputs, each corresponding to an action $a_t$. This not only speeds up the generation of the actions $a_0, \dots, a_{T-1}$ significantly, but also requires far less memory to store the input data, i.e. a tensor of dimension $m\times(h+T-1)\times d$ instead of $m\times h\times d\times T$. The only sacrifice made with this approach concerns the type of features that can be integrated. For instance, we cannot include features that are normalized with respect to the most recent history (as done in Jiang et al. [2017]), given that this breaks the data redundancy between two consecutive time periods.
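A minimal illustration of why the augmented approach produces identical actions: below, a single-asset, single-feature trajectory and a causal convolution with kernel size 2 stand in for the full policy network (this is a toy sketch, not the WaveCorr code). Running the convolution once over the whole trajectory and reading off the last $T$ outputs matches running it separately on each of the $T$ overlapping windows, provided the receptive field fits inside the lookback window $h$:

```python
import numpy as np

def causal_conv(x, w):
    # 1-D causal convolution with kernel size 2 and zero left-padding:
    # y[t] = w[0]*x[t-1] + w[1]*x[t], with x[-1] treated as 0
    xp = np.concatenate([[0.0], x])
    return w[0] * xp[:-1] + w[1] * xp[1:]

rng = np.random.default_rng(0)
h, T = 4, 6
S = rng.normal(size=h + T - 1)      # trajectory of width h + T - 1
w = rng.normal(size=2)

# naive: run the conv once per overlapping window, keep the last output
naive = np.array([causal_conv(S[t:t + h], w)[-1] for t in range(T)])

# augmented: run the conv once over the whole trajectory
augmented = causal_conv(S, w)[h - 1:]

assert np.allclose(naive, augmented)
```

The naive loop performs $O(hT)$ multiply-adds while the augmented pass performs $O(h + T)$, mirroring the memory saving of storing $m\times(h+T-1)\times d$ values instead of $m\times h\times d\times T$.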
Our numerical results, however, seemed to indicate that such restrictions did not come at a price in terms of performance. In practice, it is often required that the portfolio limit the amount of wealth invested in a single asset. This can be integrated into the risk-averse DRL formulation by adding to the objective a penalty term in which $w_{\max}$ is the maximum weight allowed in any asset and $M$ is a large constant. This new objective function penalizes any allocation that goes beyond $w_{\max}$, which encourages $\mu_\theta$ to respect the maximum weight allocation condition. The commission rates are considered to be $c_s = c_p = 0.5\%$, and the experiments here are done over Can-data using the full set of 70 stocks, with a maximum holding of 20%. The results are summarized in Table A.7 and illustrated in Figure A.10. As noted before, we observe that WaveCorr outperforms CS-PPN with respect to all performance metrics. (Figure A.10 reports the average, as a solid curve, and the range, as a shaded region, of the out-of-sample wealth accumulated by WaveCorr, CS-PPN, EIIE, and EW over 10 experiments using Can-data, US-data, and Covid-data.)

References (attributions added only where cited in the text):
- Moody et al. [1998], "Performance functions and reinforcement learning for trading systems and portfolios"
- He et al. [2016], "Deep residual learning for image recognition"
- Liang et al. [2018], "Adversarial deep reinforcement learning in portfolio management"
- Oord et al. [2016] (authors include Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu),
"WaveNet: A generative model for raw audio"
- "Cost-sensitive portfolio selection via deep reinforcement learning"
- "Relation-aware transformer for portfolio policy learning"
- Jiang et al. [2017], "A deep reinforcement learning framework for the financial portfolio management problem"
- "Transaction cost optimization for online portfolio selection"
- "An adaptive portfolio trading system: A risk-return portfolio optimization using recurrent reinforcement learning with expected maximum drawdown"
- "Portfolio selection"
- "Mutual fund performance"
- "The Sharpe ratio efficient frontier"
- "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling"
- "Deep sets"
- "Efficient reinforcement learning in resource allocation problems through permutation invariant multi-task learning"
- "Permutation invariant policy optimization for mean-field multi-agent reinforcement learning: A principled approach"

This appendix is organized as follows. Section A.1 demonstrates a claim made in Section 2 regarding the fact that the solution of $\nu_t = f(\nu_t, w_t, w'_t)$ can be obtained using a bisection method. Section A.2 presents proofs of the two propositions in Section 3. Section A.3 presents further details on the correlation layer from the literature and its two deficiencies. Section A.4 presents further details on the augmented policy network architecture used to accelerate training. Section A.5 presents our hyper-parameter ranges and final selection. Finally, Section A.6 presents a set of additional results. In order to apply the bisection method to solve $\nu = f(\nu)$, we will make use of the following proposition. Proposition A.1. For any $0 < c_s < 1$ and $0 < c_p < 1$, the function $g(\nu)$ admits a unique root on $[0, 1]$. Proof. We first obtain the two bounds $g(0) < 0$, since $c_s < 1$, and $g(1) > 0$, since $\min(c_s, c_p) > 0$. We can further establish the convexity of $g(\nu)$, given that it is the sum of convex functions. A careful analysis reveals that $g(\nu)$ is supported at $\nu = 0$ by a strictly increasing plane, expressed in terms of the indicator function $\mathbb{1}\{A\}$ that returns 1 if $A$ is true, and 0 otherwise.
Hence, by convexity of $g(\nu)$, the fact that this supporting plane is strictly increasing implies that $g(\nu)$ is strictly increasing for all $\nu \geq 0$. Given Proposition A.1, we can conclude that a bisection method can be used to find the root of $g(\nu)$, which effectively solves $\nu = f(\nu)$. We start this section with a lemma (Lemma A.1) that will simplify some of our later derivations. Proof. The "only if" direction follows straightforwardly from the fact that equality between two sets implies that each set is a subset of the other.
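Returning to Proposition A.1, the bisection scheme it justifies is standard and can be sketched as follows. The quadratic below is a toy stand-in for the paper's $g(\nu)$ (whose exact expression is not reproduced in this excerpt); it shares the properties the proposition establishes, namely a sign change on $[0, 1]$ and strict monotonicity:

```python
def bisect_root(g, lo=0.0, hi=1.0, tol=1e-10):
    # find the root of a strictly increasing g with g(lo) < 0 < g(hi),
    # as guaranteed for the transaction-cost equation by Proposition A.1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid          # root lies in the upper half
        else:
            hi = mid          # root lies in the lower half
    return 0.5 * (lo + hi)

# toy stand-in with the same sign pattern: g(0) = -1 < 0 < 1 = g(1)
nu = bisect_root(lambda v: v**2 + v - 1.0)
assert abs(nu - (5 ** 0.5 - 1) / 2) < 1e-8   # analytic root (sqrt(5)-1)/2
```

Each iteration halves the bracket, so reaching tolerance `tol` takes about $\log_2(1/\texttt{tol}) \approx 34$ evaluations of $g$, independent of its exact form.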