title: DeepFolio: Convolutional Neural Networks for Portfolios with Limit Order Book Data
authors: Sangadiev, Aiusha; Rivera-Castro, Rodrigo; Stepanov, Kirill; Poddubny, Andrey; Bubenchikov, Kirill; Bekezin, Nikita; Pilyugina, Polina; Burnaev, Evgeny
date: 2020-08-27

This work proposes DeepFolio, a new model for deep portfolio management based on data from limit order books (LOB). DeepFolio solves problems found in the state-of-the-art for LOB data to predict price movements. Our evaluation consists of two scenarios using a large dataset of millions of time series. The improvements deliver superior results both in cases of abundant and scarce data. The experiments show that DeepFolio outperforms the state-of-the-art on the benchmark FI-2010 LOB dataset. Further, we use DeepFolio for optimal portfolio allocation of crypto-assets with rebalancing. For this purpose, we use two loss functions: the Sharpe ratio loss and minimum volatility risk. We show that DeepFolio outperforms portfolio allocation techniques widely used in the literature.

More than half of the financial world uses electronic limit order books (LOBs), which store records of all transactions, [1], [2]. A limit order is a request to transact with a financial instrument at a price not exceeding a threshold, [3]. Usually, traders set so-called buy limit orders below the current market price; they represent the maximum price that the trader is willing to pay. Conversely, traders set sell limit orders above the current market price; they act as the minimum price at which the trader is willing to sell. LOBs are also gaining popularity in the relatively new and rapidly developing crypto-asset market. The novelty of this market leads to low liquidity and increased stochastic behavior of crypto-asset prices, [4]. It is easy to see the drivers behind the increasing popularity of LOBs. Our example in Figure 1 shows how traders control the price of a transaction and the logic behind a LOB. First, a passive order for one ETH crypto-asset at 260 USDT arrives. Similarly, a retail order to sell three crypto-assets at 300 USDT appears. The sell order matches with three passive orders to buy. Second, a trade executes at 300 USDT, and the LOB removes the buy orders.

Sections I, II, III, and IV were supported by the Ministry of Education and Science of the Russian Federation (Grant no. 14.756.31.0001). Other sections were supported by the Mexican National Council for Science and Technology (CONACYT), 2018-000009-01EXTF-00154.

Modeling LOBs with mathematical methods is a challenging task. Typically, researchers resort to the autoregressive integrated moving average model (ARIMA), [5]. Alternatively, the vector autoregressive model (VAR), [6], is a popular choice; one of its benefits is that it can capture the direction of transactions. However, LOB data is highly stochastic, and the time series are unsteady, which adds noise to the data. This setting makes the creation of dedicated models and the processing of the data demanding. Another limitation of these techniques is that they make assumptions about the data. To overcome these limitations, [7] proposes a state-of-the-art model called DeepLOB. In this work, we propose a LOB-based approach to predict price trends of crypto-assets. Leveraging deep neural networks, we call our approach DeepFolio.
Our proposal achieves superior results and addresses some of the problems of DeepLOB. Moreover, we go a step further and use DeepFolio to build investment portfolios. Thus, this work adds a new entry to the "deep portfolio" literature. The remainder of the paper is structured as follows. We introduce relevant literature in Section II and our data processing in Section III. We then present our methodology in Section IV. Section V describes our experiments and details the results of our method; in that section, we compare against a range of baseline algorithms. In Section VI, we summarise our findings and discuss possible future work.

There have been previous attempts to work with limit order book data using machine learning methods. For example, [8] extracts features using principal component analysis (PCA) and, in a second step, applies linear discriminant analysis (LDA). However, these techniques are suitable only for processing statistical data and are not optimized for working with dynamics. Another critical point is that these models make inherent assumptions about the data. As a result, they yield lower efficiency. Besides [7], there are several works in the literature that focus on applying deep learning and neural networks to process limit order book data and then classify price trends. Along this line of work, one of the most notable entries is [9]. The authors propose a fully convolutional neural network (FCNN) to extract features and perform trend classification. This approach shows significant improvements over more conventional methods such as support vector machines (SVM). Another example of using deep learning as a classifier for LOB data is [10], in which the authors apply an LSTM to perform trend forecasting based on LOB data. Finally, [7] combines these two approaches into a mixed CNN and LSTM neural network. With LOB data, the approach delivers state-of-the-art results in the classification of trends and significantly outperforms approaches using pure CNNs or LSTMs.

The Markowitz mean-variance model is a classic approach that portfolio managers use widely for portfolio building. The central assumption underlying this theory is that the investor has two choices: she will try to maximize profit at a given level of risk or minimize risk at a given level of profit. The Markowitz model builds a broad array of possible portfolios to reach these goals and then chooses one of them through optimization of the risk-return curve. To build the space of possible portfolios, Markowitz proposes to lever three elements: a class of assets, a vector of the average expected returns, and a covariance matrix, [11]. With these, the Markowitz model constructs an array of portfolios with various profitability-risk ratios, [11]. Since the analysis builds on two criteria, the manager selects the portfolios based on three choices:
• She searches for effective, or non-improvable, solutions.
• She chooses the main criterion, i.e., minimum profitability, using the other criteria as constraints.
• She defines a "super criterion," such as a superposition of the previous two options.
In this work, the criteria for choosing the optimal portfolio are the maximum Sharpe ratio, [12], a standard metric for assessing "optimality," and the minimum volatility risk.
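For concreteness, the following sketch illustrates how Markowitz-style baselines of this kind can be computed from an expected-return vector and a covariance matrix. It is a minimal sketch, assuming a long-only constraint, a SciPy-based optimizer, and a toy return matrix; these assumptions are illustrative and not the exact configuration of our experiments.

```python
import numpy as np
from scipy.optimize import minimize

def markowitz_weights(mu, cov, objective="min_vol", risk_free=0.0):
    """Long-only Markowitz allocation.

    mu  : (n,) vector of expected asset returns
    cov : (n, n) covariance matrix of asset returns
    objective: "min_vol" minimizes portfolio volatility,
               "max_sharpe" maximizes the Sharpe ratio.
    """
    n = len(mu)

    def volatility(w):
        return np.sqrt(w @ cov @ w)

    def neg_sharpe(w):
        return -(w @ mu - risk_free) / volatility(w)

    loss = volatility if objective == "min_vol" else neg_sharpe
    constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]  # fully invested
    bounds = [(0.0, 1.0)] * n                                       # no short selling
    w0 = np.full(n, 1.0 / n)                                        # start from the 1/n portfolio
    result = minimize(loss, w0, bounds=bounds, constraints=constraints)
    return result.x

# Toy example with four hypothetical assets.
rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.01, size=(1000, 4))
mu, cov = returns.mean(axis=0), np.cov(returns, rowvar=False)
print(markowitz_weights(mu, cov, "min_vol"))
print(markowitz_weights(mu, cov, "max_sharpe"))
```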
The FI-2010 dataset is the first public labeled dataset of high-frequency limit order book data from financial markets, [13]. It is well suited for assessing and benchmarking the forecasting of price indicators. It consists of normalized representations of time-series data from five stocks of the NASDAQ Nordic stock market, resulting in approximately 40,000,000 time-series samples covering ten consecutive days. The dataset provides three different normalizations: z-score, min-max, and decimal precision normalization. Due to its richness and relevance, it is a good benchmark for LOB-based deep learning models, [7].

Limit order books for crypto-assets are not readily available. Hence, we assemble the datasets using the public API of Binance, [14], a relevant market for the trade of crypto-assets. The collected data covers one year, starting on February 27, 2019, at an hourly resolution. The data consists of orders defined by bid or ask labels, time steps, volumes, and prices. We divide the orders into asks and bids and take the ten best asks, the ten best bids, and their respective volumes within a five-minute interval. As a result, we obtain 40 values for a single time step: 20 ask and bid prices and 20 corresponding volumes. The percentage of missing values is less than 6%, and the missing values are distributed evenly. For data imputation, we consider methods relying on neighboring values, i.e., prices connected to an order volume, such as the simple arithmetic or root-mean-square average. However, such averaging probably distorts the data. For this reason, we use propagation of the last valid value as an additional imputation technique. Moreover, we normalize the data using dynamic z-normalization, see Equation 1, with the mean µ and the standard deviation σ of the previous five days used to normalize the values of the current day. In the financial time series literature, dynamic normalization is a reasonable choice because financial time series are usually affected by regime shifts, [7]. In particular, we can represent crypto-asset prices as a sum; for [15], the sum consists of the primary trend plus some noise, or long-term and short-term volatility. Along these lines, dynamic normalization keeps the data within an appropriate range, whereas applying z-normalization to the whole dataset would destroy the underlying data patterns. Finally, for each point in the dataset, we establish the mid-price outlined in Equation 2, p_t = (a_t + b_t) / 2, the average between the best ask a_t and the best bid b_t. Throughout this work, we use mid-prices for further calculations.

After that, we generate three labels indicating price movements: increase, decrease, or uncertainty. The third label applies whenever an increase or decrease is too small to be confirmed. Since financial data is inherently noisy and highly stochastic, we use label smoothing strategies. For this purpose, we calculate m−, see Equation 3, and m+, see Equation 4. These values denote the averages of the previous and next k mid-prices. We then calculate the "smoothed labels" l_t, outlined in Equation 5 and Equation 6, respectively. These values show relative changes in the asset and its trend, taking into account a k-point smoothing. For the final label distribution, we set a threshold α equal to 0.001, since changes of 0.1% are sufficiently large to indicate a price movement. If l_t > α, the label signals an increase; if l_t < −α, the price is decreasing. We consider the [−α, α] interval to be an intermediate value of l_t: in this case, there is no increase or decrease in price, and the changes are insignificant for this range of values. We present this logic for the crypto-asset BTC: in our example, the green background represents a buy signal, red the sell signal, and white the hold signal.
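The preprocessing above can be summarized in the following sketch. It assumes a pandas DataFrame with a datetime index and approximates the per-day normalization scheme with a rolling five-day window; the exact index conventions behind Equations 1–6 and the k-point averaging follow [7] and are assumptions, not a verbatim reproduction of our pipeline.

```python
import numpy as np
import pandas as pd

ALPHA = 0.001  # 0.1% threshold for a significant price move

def dynamic_znorm(features: pd.DataFrame, window: str = "5D") -> pd.DataFrame:
    """Dynamic z-normalization (Equation 1): standardize each value with the mean
    and standard deviation of the preceding five days. Requires a DatetimeIndex;
    the first window yields NaN and is discarded downstream."""
    mu = features.rolling(window, closed="left").mean()
    sigma = features.rolling(window, closed="left").std()
    return (features - mu) / sigma

def mid_price(best_ask: np.ndarray, best_bid: np.ndarray) -> np.ndarray:
    """Mid-price (Equation 2): average of the best ask and the best bid."""
    return (best_ask + best_bid) / 2.0

def smoothed_labels(p: np.ndarray, k: int) -> np.ndarray:
    """Three-class labels from k-point smoothed mid-prices.

    m_minus averages the previous k mid-prices, m_plus the next k mid-prices;
    l_t is their relative change, thresholded at +/- ALPHA.
    Returns +1 (up), -1 (down), or 0 (hold); endpoints remain 0.
    """
    labels = np.zeros(len(p), dtype=int)
    for t in range(k, len(p) - k):
        m_minus = p[t - k:t].mean()
        m_plus = p[t + 1:t + k + 1].mean()
        l_t = (m_plus - m_minus) / m_minus
        if l_t > ALPHA:
            labels[t] = 1
        elif l_t < -ALPHA:
            labels[t] = -1
    return labels
```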
The first module of DeepFolio consists of three main blocks. The first block is a fully convolutional neural network (FCNN), the second block is an Inception module, and the third is an LSTM network. The input to this network has three elements: a batch size, a sequence length, and features. Hence, we consider this module to be a "CNN+RNN."

The FCNN block has three sub-blocks. In the first sub-block, we have a strided convolutional layer with a kernel size of 1 × 2; thus, it performs convolutions strictly over LOB levels. Two convolutional layers follow in the second sub-block. Due to their kernel sizes of 4 × 1, they capture short-term time dependencies. In the last sub-block of the FCNN, the kernel size expands to 1 × 10, so it performs convolutions over the remaining elements of the feature dimension. Next, we employ an Inception block, [16]. It enables us to capture dynamic behaviors over multiple time scales. This block is equivalent to performing multiple moving averages over different periods; for financial time series analysis, we can use it to capture time-series momentum. The last block, an LSTM, captures long-term temporal dependencies in the data. We feed its output into a fully-connected layer with a softmax activation function. It has three outputs to produce the probabilities of the three possible labels: a negative price trend, −1, a neutral trend, 0, and a positive trend, +1.

B. Problems with the "CNN+RNN" module

a) Extreme sensitivity to the initial model weight allocation: Empirical observations show that using "He uniform" initialization, [17], which practitioners commonly use for the weights of convolutional and recurrent layers, is suboptimal. For both the weight matrices and the biases, the model "dies" early in the training process, resulting in a lack of learning. A better option is to use Glorot uniform, [18], for the weight matrices of the CNN and the input weight matrix of the LSTM, orthogonal initialization, [19], for the recurrent weight matrix of the LSTM, and zero initialization for all biases. In Figure 8, we show this effect; "default" stands for the default initialization, and the second label, "initialization," represents our proposed allocation.

b) Slow learning at the beginning of training: This effect is especially noticeable with the crypto-asset data, which is a smaller dataset than the benchmark FI-2010. Figure 9 depicts that it takes more than 30 epochs before proper training starts.

c) Worse depth-wise scalability: This stems from the first two problems. Unfortunately, the original model offers worse depth-wise scalability, and an increase in depth hampers the training process even further.

In [20], the authors propose using residual connections to improve the learning process of deep convolutional networks. Residual connections allow for better gradient flow through the layers. Inspired by this, we introduce blocks with residual connections into the network. Our objective is to extend the depth of the network.
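As a rough illustration of the "CNN+RNN" module and the proposed weight initialization, the following Keras sketch uses the kernel sizes described above. The filter counts, the channels-last layout, the second strided 1 × 2 convolution, and the omission of the Inception block are simplifying assumptions; this is not the exact DeepLOB or DeepFolio implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Proposed initialization: Glorot uniform for convolutional kernels and the
# LSTM input kernel, orthogonal for the LSTM recurrent kernel, zero biases.
conv_init = dict(kernel_initializer="glorot_uniform", bias_initializer="zeros")

def conv_lrelu(x, filters, kernel_size, strides=(1, 1), padding="same"):
    x = layers.Conv2D(filters, kernel_size, strides=strides,
                      padding=padding, **conv_init)(x)
    return layers.LeakyReLU(0.01)(x)

def build_cnn_rnn(seq_len=100, n_features=40, n_filters=16, lstm_units=64):
    # Channels-last input (time, LOB features, channel); the paper feeds
    # (batch, 1, time, features), the channels-first equivalent.
    inp = layers.Input(shape=(seq_len, n_features, 1))
    x = conv_lrelu(inp, n_filters, (1, 2), strides=(1, 2))  # across LOB levels: 40 -> 20
    x = conv_lrelu(x, n_filters, (1, 2), strides=(1, 2))    # 20 -> 10 (layer count assumed)
    x = conv_lrelu(x, n_filters, (4, 1))                    # short-term time dependencies
    x = conv_lrelu(x, n_filters, (4, 1))
    x = conv_lrelu(x, n_filters, (1, 10), padding="valid")  # collapse remaining feature dim
    x = layers.Reshape((seq_len, -1))(x)                    # sequence of per-step feature vectors
    x = layers.LSTM(lstm_units,
                    kernel_initializer="glorot_uniform",
                    recurrent_initializer="orthogonal",
                    bias_initializer="zeros")(x)
    out = layers.Dense(3, activation="softmax")(x)          # P(down), P(hold), P(up)
    return tf.keras.Model(inp, out)

model = build_cnn_rnn()
model.summary()
```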
We also want to mitigate problems associated with gradient flow and vanishing gradients. In Figure 3, we present the general architecture of DeepFolio. Figure 4 depicts the structure of the residual block we use. It consists of three stacked 3 × 1 convolutions with a leaky rectified linear unit (leaky ReLU) as the activation function, [21], and a shortcut connection. We observe that batch normalization improves the convergence speed dramatically, which aligns with similar results from [22] and [20]. However, at the same time, it hampers the network's ability to learn "deeper" patterns. Other works using deep learning for financial data, such as [7], [9], and [23], do not use batch normalization. We assume that batch normalization might act as a "smoother" and, as a consequence, might wash out deeper patterns in the financial time-series data. In Figure 6, we use the "Inception v2" architecture, an alternative to the canonical Inception block first proposed in [24]. The authors replace the 5 × 5 kernel with two consecutive smaller 3 × 3 kernels, which improves both metrics and computational speed. Based on our empirical observations and a numerical comparison of the two, the gated recurrent unit (GRU) performs on par with the LSTM for most tasks. However, GRUs offer additional benefits: they have a more straightforward structure, which allows them to generalize better in cases of limited data. Our architectural choices are visible in Figure 7, where we present the full architecture of the ResCNN+GRU module of DeepFolio.

A problem of [7] is its dependency on the initial weight allocation. In Figure 8, we can see that our ResCNN+GRU module solves it: it is mostly indifferent to the weight allocation and trains well in both cases. Another problem of [7] is noticeable in the crypto-asset dataset. We run both models on the dataset of the crypto-asset BTC with a prediction horizon of k = 1 to compare the performance. DeepLOB takes more than 30 epochs for the loss to start dropping, while our model starts training at around epochs 8-9. Visually, we confirm in Figure 9 that the problem disappears.

The predicted labels of DeepFolio are convenient for the development of trading strategies. However, in this work, we go a step further than price prediction and trading strategies. Our objective is to generate investment portfolios of crypto-assets. For this, we build a portfolio consisting of four crypto-assets and perform weight rebalancing every 50 minutes. Rather than building portfolios strictly from historical data, we use our predictions. This also guides the choice of the rebalancing period: it is less frequent than the predictive horizon of k = 1, i.e., 5 minutes, yet we still maintain reliable performance. To allocate portfolio weights, the model essentially has a two-step structure. First, we feed the input data to an LSTM network. Then, we pass the LSTM outputs through a fully connected layer with softmax activation. LSTMs are very efficient tools for modeling time series and especially financial data. Our innovation is that we use price movement labels to perform rebalancing, whereas the literature traditionally works with the price and return history.
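Returning to the architecture, a minimal sketch of the residual block of Figure 4 is given below. The filter count, the 1 × 1 projection on the shortcut, and the absence of batch normalization are assumptions consistent with the discussion above, not the exact DeepFolio configuration. A GRU layer (layers.GRU) can replace the LSTM in the earlier sketch, mirroring the ResCNN+GRU choice.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Residual block sketch: three stacked 3x1 convolutions with leaky ReLU
    activations and an additive shortcut connection. Batch normalization is
    omitted, following the discussion in the text."""
    shortcut = x
    if x.shape[-1] != filters:
        # 1x1 projection so the shortcut matches the block output (an assumption)
        shortcut = layers.Conv2D(filters, (1, 1), padding="same")(shortcut)
    for _ in range(3):
        x = layers.Conv2D(filters, (3, 1), padding="same")(x)
        x = layers.LeakyReLU(0.01)(x)
    return layers.Add()([x, shortcut])

# Example: stack two residual blocks on a LOB feature map.
inp = layers.Input(shape=(100, 40, 1))
x = residual_block(inp, filters=16)
x = residual_block(x, filters=16)
model = tf.keras.Model(inp, x)
```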
The following algorithm describes the training scheme. First, the input to the LSTM layer with 64 units consists of price movement indicators: the labels from the ResCNN+GRU module over the rebalancing period, which in our case is 50 minutes. Second, we pass the LSTM output through a softmax layer to obtain portfolio weights, which we use to optimize the objective function. Third, we train the network with an Adam optimizer with a learning rate of 0.001 and a batch size of 64. Fourth, after training, we feed in the predicted labels that the ResCNN+GRU module generates for 50-minute intervals and obtain the rebalanced portfolio weights. Fifth, we move ahead to the next 50-minute interval, again feed the predictions from the ResCNN+GRU module, and update the weights. Finally, we repeat this process for the whole test set.

In this work, we evaluate two different loss functions:
1) Maximization of the Sharpe ratio, as proposed in [7]: L_SR = −E[R_t] / std(R_t), where R_t = Σ_i w_{i,t} r_{i,t} is the portfolio return, r_{i,t} = (p_{i,t} − p_{i,t−1}) / p_{i,t−1} is the return of asset i, and std is the standard deviation. The Sharpe ratio is essentially a form of risk-adjusted return that assesses the "optimality" of the portfolio; portfolios with a higher Sharpe ratio are considered more optimal.
2) Minimization of portfolio volatility (risk): L_MV = std(R_t). This corresponds to the minimization of volatility, which is equal to reducing portfolio risk.

V. EXPERIMENTS

We evaluate our model and compare its performance with the state-of-the-art. Besides, we also consider two more baseline models: a CNN, [9], and an LSTM, [10]. For DeepLOB, we strictly follow the indications in its respective publication. To train the ResCNN+GRU module of DeepFolio, we use an Adam optimizer with a learning rate of 0.01 and ε set to 1. To avoid overfitting, we apply early stopping with checkpointing, which saves the model weights each time the performance improves on the validation set. Our performance metrics are accuracy for FI-2010 and the F1 score for the crypto-assets. If we do not observe improvement on the validation set after 20 epochs, the training stops. L2-regularization helps us tackle overfitting. It is especially relevant for the ResCNN+GRU module of DeepFolio, which can sometimes overfit the training data, e.g., the validation loss starts to grow steadily. We suppose that this is due to the deeper architecture with more parameters.

For the FI-2010 dataset, we divide the ten days into three parts: seven days for training, two days for validation, and the remaining data for testing. We use 40 features from the dataset, accounting for the ten levels of ask prices, bid prices, and their quantities. The last five columns are labels corresponding to the prediction horizons k = 1, 2, 3, 5, and 10. We use only k = 1, 5, 10 for comparison; these labels represent short-term, mid-term, and long-term predictions. We also employ a sliding time window of length T = 100 with a batch size equal to 64, so the input to the network has size (64, 1, 100, 40), where the second dimension is an auxiliary "channel" dimension.

In Table I, we see the benefits of our model. Both DeepLOB and DeepFolio massively outperform the two baseline models, and the difference grows further as the length of the prediction horizon k grows. DeepFolio also outperforms DeepLOB on all metrics, and the performance gap between these two models also grows with the length of k. The architecture of DeepFolio captures the long-term relations in the data better.
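For concreteness, the allocation module and the two loss functions defined above can be sketched as follows. Computing the Sharpe ratio and volatility over each training batch, the input shape, and any layer size beyond those stated in the text are assumptions of this sketch, not the exact training code.

```python
import tensorflow as tf
from tensorflow.keras import layers

N_ASSETS = 4   # number of assets in the portfolio (four in our experiments)
WINDOW = 50    # label steps fed in per rebalancing period (an assumption)

def sharpe_ratio_loss(asset_returns, weights):
    """Negative Sharpe ratio of the portfolio returns in the batch.
    asset_returns (y_true): realized returns r_{i,t}; weights (y_pred): network output."""
    portfolio_returns = tf.reduce_sum(weights * asset_returns, axis=-1)
    mean = tf.reduce_mean(portfolio_returns)
    std = tf.math.reduce_std(portfolio_returns)
    return -mean / (std + 1e-8)

def min_volatility_loss(asset_returns, weights):
    """Standard deviation of the portfolio returns, i.e., portfolio risk."""
    portfolio_returns = tf.reduce_sum(weights * asset_returns, axis=-1)
    return tf.math.reduce_std(portfolio_returns)

def build_allocator(window=WINDOW, n_assets=N_ASSETS):
    """LSTM allocation head: price-movement labels in, portfolio weights out."""
    inp = layers.Input(shape=(window, n_assets))            # label sequence per asset
    x = layers.LSTM(64)(inp)
    out = layers.Dense(n_assets, activation="softmax")(x)   # weights are positive and sum to one
    return tf.keras.Model(inp, out)

allocator = build_allocator()
allocator.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss=sharpe_ratio_loss)                    # or loss=min_volatility_loss
```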
We consider two different cases for the crypto-asset dataset. The first setup is conventional: we train a separate network for each crypto-asset and then validate and test only on the respective crypto-asset. The second approach combines three crypto-assets, BTC, LTC, and ETH, into one dataset. We train on this combined dataset and test on each crypto-asset separately. That way, we can assess the models' ability to generalize. Also, we intentionally hold out Ripple (XRP) entirely in order to additionally back-test the models and evaluate their ability to do transfer learning. For both approaches, we use a sliding time window of T = 60 and a batch size of 64.

For the first case, we employ a 70-15-15 split of the datasets: 70% for training and 15% each for validation and testing. An additional characteristic is that the datasets are unbalanced; hence, we focus on the weighted F1 score to assess the performance of the models. In Table II, we see that both DeepLOB and DeepFolio outperform the baselines, which show worse performance by a large margin on all metrics. This becomes especially evident when we move to longer prediction horizons: the metrics of the baseline methods start dropping rapidly. DeepLOB and DeepFolio also experience a decrease in metrics, but it is not as severe as for the baseline models. Directly comparing DeepFolio and DeepLOB, we can see that DeepFolio gets superior scores across all metrics, although the gap between them is narrow. To investigate the results further, we provide the confusion matrices for the four prediction horizons: Figure 11 for k = 1, Figure 12 for k = 5, Figure 13 for k = 10, and Figure 14 for k = 20.

For the second setup, we split the dataset in the following way. First, we take each crypto-asset from the (BTC, LTC, ETH) trio separately and perform an 80-10-10 train-validation-test split. After that, we concatenate the train parts of the crypto-assets to form a single dataset and repeat the same process for the validation parts, while we keep the test sets separate. The main goal of this setup is to check whether the models can extract general LOB patterns; our inspiration is the work of [25]. To further test the networks' abilities, we perform transfer learning. We select the XRP crypto-asset for this task and feed the entire XRP dataset into models that did not previously see the XRP data. For this setup, we exclude the baseline models, since their performance is limited even when dealing with individual crypto-assets. Thus, we focus primarily on DeepLOB and DeepFolio. Looking at Table III, both models have strong generalizing abilities. However, DeepFolio outperforms in the majority of the cases, and the gaps this time are larger, about 2-3% on average. The transfer learning results are also robust. This means that the neural networks are indeed capable of learning general LOB patterns rather than merely adapting to the data. Overall, in both setups, we can see that DeepFolio outperforms.

To evaluate the performance of our portfolio model, we estimate the portfolio value following [26] and define it as p_t = p_{t−1} (w_{t−1} · r_t), where p_{t−1} is the portfolio value at the beginning of period t, r_t is the price-relative vector at time t, and w_{t−1} is the portfolio weight vector at the beginning of period t. We rebalance every 50 minutes and do not consider transaction costs.
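The portfolio value and the cumulative log-returns reported below can be computed as in the following sketch; the array shapes and the interpretation of r_t as a vector of price relatives follow [26] and are assumptions of this sketch.

```python
import numpy as np

def portfolio_values(price_relatives, weights, p0=1.0):
    """Portfolio value over time: p_t = p_{t-1} * (w_{t-1} . r_t), following [26].

    price_relatives: (T, n) array, element-wise ratio of asset prices at the
                     end and the beginning of each rebalancing period;
    weights:         (T, n) array of portfolio weights set at the start of
                     each period (rows sum to one).
    Transaction costs are ignored, as in the text.
    """
    period_growth = np.sum(weights * price_relatives, axis=1)
    return p0 * np.cumprod(period_growth)

def cumulative_log_returns(values, p0=1.0):
    """Cumulative log-returns, as plotted for the strategy comparison."""
    return np.log(values / p0)
```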
In Figure 10, we present the cumulative log-returns of the various portfolio strategies. Here, 1/n is the naive equal-weights portfolio, Markowitz SR corresponds to the Markowitz model with the Sharpe ratio criterion, and Markowitz MV to the Markowitz model with the minimum-variance criterion. DeepFolio SR uses the Sharpe ratio as the loss function, whereas DeepFolio MV uses volatility instead. DeepFolio with the Sharpe ratio has the best performance on the test dataset. Notably, the testing period starts around February 2020, when the crisis induced by COVID-19 hit the global markets.

Table IV presents a global comparison of results. To fully understand each method's performance, we compare the following parameters: first, the expected and mean returns; second, the standard deviation of portfolio returns and the Sharpe ratio; and third, the ratio between positive and negative returns over the test period. We can see that all reallocation strategies work well. Nevertheless, DeepFolio with the Sharpe ratio shows the best values for all parameters, with the only exception being the standard deviation.

We propose DeepFolio to address problems in the state-of-the-art. Our model surpasses its performance on the benchmark dataset, and we observe similar behavior for the crypto-asset dataset, despite the latter being scarcer and favoring smaller models. We also show that DeepFolio is capable of learning general patterns in the LOB data rather than merely adapting to the data at hand; we demonstrate this through transfer learning on a previously unseen crypto-asset. We generate price movement predictions from LOBs and show that they can also be used for short-term portfolio allocation. We equip these portfolios with rebalancing strategies. Such an approach overcomes the pitfalls of classical methods of portfolio optimization. We also test the model with two different loss functions: the maximization of the Sharpe ratio and the minimization of volatility. Extensive tests show that DeepFolio with the Sharpe ratio performs best and outperforms all other approaches. Portfolio managers can use the results of this work for a myriad of assets. For assets with high liquidity, we expect even better performance, since they are less prone to stochastic fluctuations. In conclusion, our approach serves as a building block for an automated portfolio-building and optimization framework.
REFERENCES
[1] Liquidity and information in order driven markets
[2] Limit order markets: A survey
[3] Technical analysis of the futures markets
[4] The new electronic trading regime of dark books, mashups and algorithmic trading
[5] Stock price prediction using the ARIMA model
[6] Vector autoregressive models for multivariate time series
[7] DeepLOB: Deep convolutional neural networks for limit order books
[8] Temporal bag-of-features learning for predicting mid price movements using high frequency limit order book data
[9] Forecasting stock prices from the limit order book using convolutional neural networks
[10] Using deep learning to detect price change indications in financial markets
[11] Portfolio selection
[12] Mutual fund performance
[13] Benchmark dataset for mid-price forecasting of limit order book data with machine learning methods
[14] Official documentation for the Binance APIs and streams
[15] Long- and short-term cryptocurrency volatility components: A GARCH-MIDAS analysis
[16] Going deeper with convolutions
[17] Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification
[18] Understanding the difficulty of training deep feedforward neural networks
[19] Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
[20] Deep residual learning for image recognition
[21] Rectifier nonlinearities improve neural network acoustic models
[22] Batch normalization: Accelerating deep network training by reducing internal covariate shift
[23] Temporal relational ranking for stock prediction
[24] Rethinking the inception architecture for computer vision
[25] Universal features of price formation in financial markets: perspectives from deep learning
[26] A deep reinforcement learning framework for the financial portfolio management problem