title: Dynamic and Context-Dependent Stock Price Prediction Using Attention Modules and News Sentiment
authors: Koenigstein, Nicole
date: 2022-03-13

The growth of machine-readable data in finance, such as alternative data, requires new modeling techniques that can handle non-stationary and non-parametric data. Due to the underlying causal dependence and the size and complexity of the data, we propose a new modeling approach for financial time series data, the α_t-RIM (recurrent independent mechanism). This architecture makes use of key-value attention to integrate top-down and bottom-up information in a context-dependent and dynamic way. To model the data in such a dynamic manner, the α_t-RIM utilizes an exponentially smoothed recurrent neural network, which can model non-stationary time series data, combined with a modular and independent recurrent structure. We apply our approach to the closing prices of three selected stocks of the S&P 500 universe as well as their news sentiment scores. The results suggest that the α_t-RIM is capable of reflecting the causal structure between stock prices and news sentiment, as well as the seasonality and trends. Consequently, this modeling approach markedly improves the generalization performance, that is, the prediction of unseen data, and outperforms state-of-the-art networks such as long short-term memory models.

Non-stationary time series data is common in finance. In the case of price series, each value depends on a long history of prior levels. However, most machine learning models that predict financial time series expect stationary inputs. Consequently, these models rely on standard stationarity transformations, such as integer differentiation, to produce returns; these returns have a memory cut-off, and hence the series loses important signals [1], [2]. In addition, in the field of causal inference, the concept of independent or autonomous processes has proven to be important because such processes make a model capable of cause-and-effect inference [3], [4]. Thus, a complex model may be thought of as a collection of separate processes or "causal" modules. As a result, individual modules may remain robust or invariant even when other modules change, as in a distribution shift [5], [6].

Furthermore, most machine learning models are monolithic and based on bottom-up signals, i.e. directly observed content, in contrast to top-down signals, which are based on past experience and short-term memory. Moreover, human cognition has a modular structure with sparse interactions, as described by Carruthers in his book The Architecture of the Mind [7]. Carruthers argues that one of the distinctive characteristics of the human mind is that it is composed of numerous cognitive systems, each of which communicates with only a small number of other systems or experts, and over whose internal processes those other systems have little influence. The human mind is thus flexible and capable of practical reasoning, and thereby gains the capacity for scientific thinking. Consequently, if we think of the brain as solving problems by using different systems (or modules), we hypothesize that it could be beneficial to leverage this kind of structure by learning separate processes that can be reused, constructed, and flexibly re-purposed. Humans also seldom utilize all available inputs to complete tasks.
For these reasons, using sparse interactions and focusing attention in machine learning models may reduce learning difficulties by minimizing interference. In other words, models that learn in this manner may more accurately reflect the causal structure of the data and hence generalize more effectively [8], [5]. Accordingly, within this context, we analyze how such a modeling approach can be used to incorporate stock prices and news sentiment to predict stock price movements.

The recurrent independent mechanisms (RIMs) proceed in three steps at each time step:

1. A subset of modules is selectively activated, based on the relevance of the input to each module.
2. The activated modules independently process the information made available to them.
3. The RIMs communicate with one another sparingly via key-value attention; the active modules then gather contextual information from all the other modules and consolidate this information in their hidden state.

The entire model is subdivided into k small subsystems, so-called RIMs, each of which represents a recurrent model that captures the dynamics in observable sequences. As a consequence, each RIM has its own unique functions that are automatically trained from data, resulting in the vector-valued state h_{t,k} at time step t. Each RIM also has its own parameters θ_k, which are shared across time steps. However, the RIM modules are active and updated only when the input is relevant. At each time step, the attention selects and then activates a subset of the RIMs.

Soft attention takes the product of a query, represented as a matrix of dimension N_r × d, with d being the dimension of each key, and a collection of N_o objects, each associated with a key as a row in the matrix K^T (N_o × d), and after normalizing (using Softmax) produces the output

Softmax(QK^T / √d) V,

where Q, K and V are the query, key and value matrices, respectively. The Softmax is applied to each row of the argument matrix, resulting in a set of convex weights; as a consequence, a convex combination of the values in the rows of V is obtained. Note that when the attention is focused on one element of a specific row, which means that the Softmax is saturated, it selects one of the objects and copies its value to row j of the result. Further, the d dimensions of the keys may be divided into heads, each of which has its own attention matrix and writes independently calculated values.

Without attention, neurons in neural networks operate on fixed variables, fed by the previous layer. The key-value attention mechanism instead enables a dynamic selection of which variable instance is used as the input to each RIM's dynamics. These inputs may originate from an external source or be the output of another RIM. Thus, the model learns to dynamically choose those RIMs that are relevant to the present input. In this specific application of key-value attention, the RIMs provide the queries, while the current input provides the keys and values.

The input attention paid to a particular RIM can be summarized as follows. At time t, the input x_t is seen as a collection of rows of a matrix, to which a row of zeros is concatenated:

X = ∅ ⊕ x_t.

Following this, linear transformations are applied to generate keys (K = XW^e, one for each input element and the null element), values (V = XW^v, one for each element), and queries (Q = h_t W^q_k, one for each RIM attention head).
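To make these steps concrete, the following is a minimal NumPy sketch of the per-RIM input attention. The array shapes, the weight names, and the rule of activating the RIMs that attend least to the null (all-zero) row follow the original RIM formulation and are illustrative assumptions rather than the exact implementation used in this work.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def input_attention(x_t, h_t, W_e, W_v, W_q, n_active):
    """Key-value input attention over the current input for each RIM.

    x_t      : (n_elements, d_in)        current input, one row per element
    h_t      : (n_rims, d_hidden)        hidden states of all RIMs
    W_e, W_v : (d_in, d_key) / (d_in, d_val)  shared key/value projections
    W_q      : (n_rims, d_hidden, d_key) one query projection per RIM
    Returns the attended values per RIM and the indices of the n_active
    RIMs selected for update (those attending least to the null row).
    """
    d_key = W_e.shape[1]
    # Concatenate a null (all-zero) row so a RIM can attend to "no input".
    X = np.vstack([np.zeros((1, x_t.shape[1])), x_t])    # (1 + n_elements, d_in)
    K = X @ W_e                                           # keys
    V = X @ W_v                                           # values
    Q = np.einsum('kh,khd->kd', h_t, W_q)                 # one query per RIM
    A = softmax(Q @ K.T / np.sqrt(d_key), axis=-1)        # (n_rims, 1 + n_elements)
    attended = A @ V                                      # per-RIM attended input
    # RIMs with the smallest attention mass on the null row are activated.
    active = np.argsort(A[:, 0])[:n_active]
    return attended, active

# toy usage: two input elements (e.g. a price feature and a sentiment feature)
rng = np.random.default_rng(0)
n_rims, d_in, d_hid, d_key, d_val = 4, 3, 8, 16, 16
x_t = rng.normal(size=(2, d_in))
h_t = rng.normal(size=(n_rims, d_hid))
W_e = rng.normal(size=(d_in, d_key))
W_v = rng.normal(size=(d_in, d_val))
W_q = rng.normal(size=(n_rims, d_hid, d_key))
att, active = input_attention(x_t, h_t, W_e, W_v, W_q, n_active=2)
print(att.shape, active)
```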
W^v is a weight matrix translating an input element to the associated value vector of the weighted attention, and W^e is similarly a weight matrix mapping the input to the keys. W^q_k is a weight matrix for each RIM that relates the RIM's hidden state to its queries, and ⊕ refers to the concatenation operator at the row level. Thus, the input attention for RIM k can be written as

Softmax( h_t W^q_k (XW^e)^T / √d ) XW^v.

The RIMs use multiple heads for the input and communication attention, analogously to "Attention is all you need" [22]. In general, the RIMs operate independently by default, and the attention mechanism allows the model to share information among the modules. Furthermore, the activated RIMs are allowed to read from all the other RIMs. Non-activated RIMs do not need to change their value, because they are not related to the current input; nonetheless, they may retain important contextual information. Communication between the RIMs therefore makes use of residual connections, as described in the paper "Relational recurrent neural networks" [23]. Figure 1 illustrates the described RIM dynamics. The illustrations were adapted from the original paper and partially modified to fit the use case of this work.

The RIMs were originally used with gated recurrent units [24] and LSTMs [25]. We decided against these, because the α-RNN and α_t-RNN are much simpler architectures, yet perfectly suitable for forecasting stationary and non-stationary time series, respectively. The α-RNNs are a generic family of exponentially smoothed RNNs that excel at modeling non-stationary time series data as seen in financial applications. They characterize the time series' non-linear partial autocorrelation structure and directly capture dynamic influences such as seasonality and trends. The α-RNN is almost identical to a standard RNN except for the addition of a scalar smoothing parameter, which provides the recurrent network with extended memory, that is, autoregressive memory beyond the sequence length. To extend RNNs into dynamic time series models, the hidden state ĥ_t is combined with the exponentially smoothed output h̃_t through a time-dependent, convex combination, which makes the model capable of modeling non-stationary time series data:

h̃_t = α_t ĥ_t + (1 − α_t) h̃_{t−1}.

Thus, smoothing may be thought of as a type of dynamic forecast error correction. Alternatively, smoothing may be viewed as a weighted summation of the lagged hidden states with equal or decreasing weights, placing the weight α_{t−s} Π_{r=1}^{s} (1 − α_{t−r+1}) on the s ≥ 1 lagged hidden state ĥ_{t−s}:

h̃_t = α_t ĥ_t + Σ_{s≥1} α_{t−s} Π_{r=1}^{s} (1 − α_{t−r+1}) ĥ_{t−s}.

Note that if α_{t−r+1} = 1 for some lag r, the weights vanish for all larger lags, so the model forgets the hidden states beyond the r-th lag. While the α_t-RNN is free to specify how α is updated (including altering the update frequency) in response to the hidden state and input, using a recurrent layer is a convenient choice. The activated RIMs in the α_t-RIM use the α_t-RNN as their per-RIM independent transition dynamics. This choice was made because, for industrial forecasting, LSTMs and GRUs are likely over-engineered; light-weight exponentially smoothed architectures capture the key properties while being superior to, and more robust than, simple RNNs.

We evaluated our modeling approach on financial time series data without and with news sentiment, hereafter denoted as univariate and bivariate, respectively. To further study our model's performance, we compared it with two RNN models, namely a simple RNN and an LSTM.
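As an illustration of the per-RIM transition dynamics described above, the following is a minimal sketch of an exponentially smoothed recurrent cell in the spirit of the α_t-RNN. Producing the scalar α_t with a sigmoid over a linear map of the current input and the previous smoothed state is one convenient choice, in line with the remark that a recurrent layer may drive the update; all layer sizes and the untrained parameters are illustrative, not the exact implementation.

```python
import numpy as np

def alpha_t_rnn(x, d_hidden, rng=np.random.default_rng(1)):
    """Minimal exponentially smoothed RNN in the spirit of the alpha_t-RNN.

    x : (T, d_in) input sequence. At each step a plain RNN candidate state
    h_hat is computed and blended with the previous smoothed state via a
    time-varying smoothing coefficient alpha_t in (0, 1):
        h_smooth_t = alpha_t * h_hat_t + (1 - alpha_t) * h_smooth_{t-1}
    """
    T, d_in = x.shape
    # illustrative, untrained parameters
    W_x = rng.normal(scale=0.1, size=(d_in, d_hidden))
    W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
    b = np.zeros(d_hidden)
    w_a = rng.normal(scale=0.1, size=(d_in + d_hidden, 1))   # produces alpha_t
    b_a = np.zeros(1)

    h_smooth = np.zeros(d_hidden)
    outputs = []
    for t in range(T):
        # candidate hidden state of a plain RNN cell
        h_hat = np.tanh(x[t] @ W_x + h_smooth @ W_h + b)
        # time-dependent scalar smoothing coefficient from input and previous state
        alpha = 1.0 / (1.0 + np.exp(-(np.concatenate([x[t], h_smooth]) @ w_a + b_a)))
        # convex combination: autoregressive memory beyond the sequence length
        h_smooth = alpha * h_hat + (1.0 - alpha) * h_smooth
        outputs.append(h_smooth)
    return np.stack(outputs)

# toy usage: 21-day look-back window of two features (e.g. price and sentiment)
seq = np.random.default_rng(2).normal(size=(21, 2))
print(alpha_t_rnn(seq, d_hidden=8).shape)   # (21, 8)
```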
All models were implemented in TensorFlow v2.4.1 [26]; the α_t-RIM is a custom implementation, while the SimpleRNN and LSTM layers from the TensorFlow Keras API were used for the baselines. Further, time series cross-validation was performed using separate training, validation, and test sets for all the models and stocks. Each set represents a contiguous sample period, with the test set containing the most recent observations, in order to maintain the data's temporal structure and prevent look-ahead bias. The hidden layers were activated using the Tanh function. The uniform initialization of Glorot and Bengio [27] was used for the non-recurrent weight matrices, and an orthogonal matrix was used to initialize the recurrent weights for stability, which ensures that the absolute values of the eigenvalues are initially bounded by one [28]. Adam [29] was used as the optimizer.

The hyperparameters for the RNN and LSTM were determined using time series cross-validation with five folds. The number of hidden units was evaluated from 5 up to and including 250 in steps of 5, the L1 regularization over 0.0001, 0.001, 0.01, and 0.1, and the dropout rate over 0.1, 0.2, 0.3, 0.4, 0.5, and 0.6. The setup for the α_t-RIM was identical, with the exception that its hyperparameters were determined with a three-fold randomized grid search, due to the 12 hyperparameters of the model. See Appendix Subsection 6.1 for more information on the model's hyperparameters. Within this setup, we used three metrics: mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). The first two were used to evaluate the models during training, validation, and testing, to examine their generalization performance on unseen data. The MAPE was computed after training on the re-scaled testing dataset, where re-scaled means that the data transformations were reverted in order to retrieve the original closing prices, and separately for each prediction step ahead, as our prediction horizon was five days.

Two data sources were used for the analyses: end-of-day pricing data from quandl.com and sentiment data from 2015 onwards, provided by YUKKA Lab, Berlin. The two datasets were then joined and contained 6.5 years of complete data from within the S&P 500 Index. To account for variability in the experiments, the trading volume, total article count, and number of positive and negative articles of all stocks were analyzed, resulting in the following chosen stocks:

1. Amazon: high dollar volume and high article count.
2. Brown-Forman: low dollar volume and low article count.
3. Thermo Fisher: medium dollar volume and medium-high article count.

To account for the noise in the sentiment data, we smoothed the sentiment score with a convolutional kernel filter [30], [31] during pre-processing. The datasets were then divided into training, validation, and testing sets. The training set was standardized using its own mean and standard deviation rather than those of the whole time series. Additionally, to avoid introducing systematic bias, the identical normalization was applied to the validation and test sets; in other words, the mean and standard deviation of the training set were used to normalize the validation and test sets. Further, log transformations were performed to decrease the data variability and bring the data closer to a normal distribution.
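The split-dependent scaling described above can be summarized in a few lines. The sketch below, assuming the log transform is applied before standardization, fits the statistics on the training split only, reuses them for the validation and test splits, and reverts the transforms before computing the MAPE; the synthetic price series and the helper names are illustrative rather than the published code.

```python
import numpy as np

def fit_scaler(train_prices):
    """Statistics are computed on the (log-transformed) training split only."""
    log_train = np.log(train_prices)
    return log_train.mean(), log_train.std()

def transform(prices, mean, std):
    return (np.log(prices) - mean) / std      # log transform, then standardize

def inverse_transform(z, mean, std):
    return np.exp(z * std + mean)             # revert to original closing prices

def mape(y_true, y_pred):
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# toy usage with a synthetic closing-price series
rng = np.random.default_rng(3)
prices = np.linspace(100.0, 160.0, 300) + rng.normal(0.0, 2.0, 300)
train, val, test = prices[:200], prices[200:250], prices[250:]

mean, std = fit_scaler(train)
# the training statistics are reused for the validation and test splits
train_z, val_z, test_z = (transform(s, mean, std) for s in (train, val, test))

# after prediction, the transforms are reverted before computing the MAPE
pred_z = test_z + 0.01                        # stand-in for a model's output
print(round(mape(test, inverse_transform(pred_z, mean, std)), 3))
```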
Moreover, due to the strong downward impact of the COVID-19 pandemic on stock prices, the observations covering the steep downward move in the chosen stocks were removed, as we expected the models to perform poorly during this phase. This resulted in the following divisions:

1. Training set: 2015-01-05 to 2019-12-31
2. Validation set: 2020-05-04 to 2020-12-28
3. Testing set: 2020-12-29 to 2021-06-08

Moreover, we analyzed three different look-back time windows, 5, 10, and 21 days, to study the short- and long-term effects on generalization across networks and time windows. After comparing the results for all the networks, stocks, and time windows, we observed that the results exhibited the same pattern. Performance on the training sets differed little, that is, the MSE and MAE were almost identical during the training phase for the three networks, namely the simple RNN, the LSTM, and the α_t-RIM, but for the simple RNN and the LSTM the values of the metrics increased steadily from the validation to the testing sets. Accounting for these results and for the comparison at each prediction step ahead leads to two observations. First, our model outperforms the simple RNN and the LSTM for each prediction time step and during both the validation and testing phases. Figures 3-8 depict the results for the 10-day look-back and 5-day-ahead prediction from the model evaluation and clearly demonstrate how the RNN and LSTM fail to generalize on the validation and testing sets. Second, the α_t-RIM is capable of using the sentiment score as an additional feature and improves its overall performance with the use of news sentiment. Regarding the input lags needed for the 5-day-ahead prediction, 10 lags resulted in the best overall performance for all tested networks, with two exceptions: the α_t-RIM performed slightly better with 21 input lags for Brown-Forman stock, and the LSTM also performed better with 21 input lags, but for Thermo Fisher stock. A comparison of the results for Brown-Forman stock with 10 input lags is presented in Tables 1 and 2, with the best values in bold.

This work demonstrates the effectiveness of combining an RIM with an exponentially smoothed RNN for modeling non-stationary time series data. As our findings indicate, the α_t-RIM outperforms the simple RNN and the LSTM, particularly within the validation and testing phases. This observation leads to the inference that the simple RNN and LSTM are not able to model changes in the unseen data, and it demonstrates that the α_t-RIM can generalize effectively on unseen data. Additionally, the bivariate time series prediction results suggest that the α_t-RIM is capable of extracting patterns from news sentiment and using them as an additional input within its communication pattern to further stabilize the prediction accuracy. This result substantiates our earlier statement that it is necessary to utilize a new modeling approach that takes advantage of key-value attention, an exponentially smoothed RNN, and subsystems to model financial time series data in a dynamic and context-dependent way. Furthermore, in addition to the more accurate predictions it provides, the α_t-RIM has the advantage that the attention weights and the activation pattern of the modules can be visualized to further study the behavior of the model.
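One way to perform such an inspection, assuming the activation decisions of the modules are recorded at every step during inference, is a simple heatmap of the per-step activation pattern. The array layout and the matplotlib-based helper below are illustrative and not part of the published code.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_activation_pattern(active_mask, rim_labels=None):
    """Visualize which RIMs were activated at each prediction time step.

    active_mask : (n_rims, T) 0/1 array collected during inference,
                  e.g. from the input-attention selection at every step.
    """
    n_rims, T = active_mask.shape
    fig, ax = plt.subplots(figsize=(8, 2 + 0.3 * n_rims))
    ax.imshow(active_mask, aspect='auto', cmap='Greys', interpolation='nearest')
    ax.set_xlabel('time step')
    ax.set_ylabel('RIM module')
    if rim_labels is not None:
        ax.set_yticks(range(n_rims))
        ax.set_yticklabels(rim_labels)
    ax.set_title('Module activation pattern over the test window')
    fig.tight_layout()
    return fig

# toy usage with a random activation pattern for 6 RIMs over 110 test days
mask = (np.random.default_rng(4).random((6, 110)) > 0.6).astype(float)
fig = plot_activation_pattern(mask, rim_labels=[f'RIM {k}' for k in range(6)])
fig.savefig('activations.png')
```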
In closing, for further research, we suggest investigating the possibility of including DSelect-k [32] in the α_t-RIM, which would allow the model to learn to select a different number of modules at each time step.

Due to constraints among the model's hyperparameters (e.g., the number of activated RIMs has to be smaller than or equal to the number of modules k), standard cross-validation could not be performed. Therefore, a special function was implemented to generate a list of dictionaries to be fed into the grid search as a parameter grid. The list encompasses the following parameters:

• Units: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 25, 30,

References:
[1] Advances in Financial Machine Learning.
[2] Machine learning in non-stationary environments: introduction to covariate shift adaptation.
[3] Causality: Models, Reasoning and Inference.
[4] The consciousness prior. CoRR.
[5] Elements of causal inference: foundations and learning algorithms.
[6] On causal and anticausal learning.
[7] The Architecture of the Mind: Massive Modularity and the Flexibility of Thought.
[8] The Architecture of Complexity.
[9] Cost-sensitive estimation of ARMA models for financial asset return data.
[10] Comparison of ARIMA and artificial neural networks models for stock price prediction.
[11] Stock price prediction using LSTM, RNN and CNN-sliding window model.
[12] Stock prediction using sentiment analysis and long short term memory.
[13] Industrial forecasting with exponentially smoothed recurrent neural networks.
[14] Describing multimedia content using attention-based encoder-decoder networks.
[15] Attention-based LSTM for aspect-level sentiment classification.
[16] Attention in natural language processing.
[17] Financial series prediction using attention LSTM.
[18] AT-LSTM: An attention-based LSTM model for financial time series prediction.
[19] Yoshua Bengio, and Bernhard Schölkopf. Recurrent independent mechanisms.
[20] Tracking the world state with recurrent entity networks.
[21] Neural relational inference for interacting systems.
[22] Attention is all you need.
[23] Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent neural networks.
[24] Empirical evaluation of gated recurrent neural networks on sequence modeling.
[25] Long Short-Term Memory.
[26] TensorFlow: A system for large-scale machine learning.
[27] Understanding the difficulty of training deep feedforward neural networks.
[28] Orthogonal RNNs and long-memory tasks.
[29] Adam: A method for stochastic optimization.
[30] Time series modelling with unobserved components, by Matteo M. Pelagatti.
[31] Polynomial, spline, gaussian and binner smoothing are carried out building a regression on custom basis expansions.
[32] DSelect-k: Differentiable selection in the mixture of experts with applications to multi-task learning.

Acknowledgements: The author would like to thank YUKKA Lab, Berlin, for providing the raw data for this research. Furthermore, the author would like to thank Matthew Dixon and Saeed Amen, who provided significant support to the research with their insights and expertise. Finally, the author would like to thank Jörg Osterrieder for his comments and suggestions on this paper.