key: cord-0326846-uacbhzov
authors: Rahimikia, Eghbal; Zohren, Stefan; Poon, Ser-Huang
title: Realised Volatility Forecasting: Machine Learning via Financial Word Embedding
date: 2021-08-01
journal: nan
DOI: 10.2139/ssrn.3895272
sha: 20ed677bbc4a38944153d325a61e08970763f4d8
doc_id: 326846
cord_uid: uacbhzov

We develop FinText, a novel, state-of-the-art financial word embedding from the Dow Jones Newswires Text News Feed Database. Incorporating this word embedding in a machine learning model produces a substantial increase in volatility forecasting performance on days with volatility jumps for 23 NASDAQ stocks from 27 July 2007 to 18 November 2016. A simple ensemble model, combining our word embedding and another machine learning model that uses limit order book data, provides the best forecasting performance for both normal and jump volatility days. Finally, we use Integrated Gradients and SHAP (SHapley Additive exPlanations) to make the results more 'explainable' and the model comparisons more transparent.

Many studies have identified news as a major contributor to volatility (Engle and Ng, 1993; Engle and Martins, 2020; Conrad and Engle, 2021). In recent years, researchers have shown an increased interest in using natural language processing (NLP) and machine learning (ML) methods to extract relevant information from textual data such as news. So far, however, despite dramatic successes in other fields, these new techniques have attracted very little attention from the finance and economics scholarly communities (Gentzkow et al., 2019). This paper explores the use of news in realised volatility (RV) forecasting using a state-of-the-art word embedding approach. Instead of using pre-trained Google and Facebook word embeddings, we develop FinText (available for download from rahimikia.com/fintext), a purpose-built word embedding for financial textual analysis, based on the Dow Jones Newswires Text News Feed, which covers different news services around the world with added emphasis on financial news. To the best of our knowledge, this is the first comprehensive study that develops a financial word embedding and uses it with a convolutional neural network (CNN) for RV forecasting. Unlike the Loughran-McDonald (LM) dictionary approach (Loughran and McDonald, 2011), which relies solely on predefined sets of words for extracting sentiment, our method extracts a substantial amount of information from a big financial textual dataset without using any manually predefined resources or model assumptions. Moreover, to analyse the effect of a specific token on volatility forecasts, we use Explainable AI methods to make the forecasting performance evaluation more transparent and understandable.

Most RV forecasting studies use historical RV as the primary source to predict next-day volatility using a linear model. Heterogeneous autoregressive (HAR) models, which are simple yet effective linear models for RV forecasting, were first introduced by Corsi (2009). The further development of the HAR family of models continued with the HAR-J (HAR with jumps) and CHAR (continuous HAR) models of Corsi and Reno (2009), the SHAR (semivariance-HAR) model of Patton and Sheppard (2015), and the HARQ model of Bollerslev et al. (2016). Rahimikia and Poon (2020a) provide a valuable comparison of the HAR family of models and show that, for 23 NASDAQ stocks, the CHAR model is the best performing model. They also supplement the CHAR model with limit order book (LOB) data and sentiment variables extracted from financial news.
The resulting CHARx model shows that news and LOB data provide statistically significant improvements in RV forecasts. By separating normal volatility days and volatility jump days, Rahimikia and Poon (2020a) show that negative sentiment has a clear impact on improving the RV forecasting performance on normal volatility days. Although the aforementioned work has successfully demonstrated that adding more information from news data improves RV forecasting performance, it is limited by its reliance on linear regression for forecasting.

During the last decade, there has been a growing number of publications focusing on the theory and application of ML in financial studies. Recent evidence suggests that this group of models can outperform traditional financial models in portfolio optimisation (Ban et al., 2018), LOB models for short-term price predictions (Zhang et al., 2018; Sirignano and Cont, 2019; Zhang and Zohren, 2021), momentum strategies (Lim et al., 2019; Poh et al., 2021; Wood et al., 2021), estimation of the stochastic discount factor, equity premium prediction using newspaper articles (Adämmer and Schüssler, 2020), measuring asset risk premiums (Gu et al., 2020), return prediction (Jiang et al., 2020), classifying venture capitalists (Bubna et al., 2020), designing trading strategies, latent factor modelling (Gu et al., 2021), hedge fund return prediction (Wu et al., 2020), bond return prediction (Bianchi et al., 2021), and return forecasting using news photos (Obaid and Pukthuanthong, 2021), to name a few. In the context of RV forecasting, Rahimikia and Poon (2020b) comprehensively examined the performance of ML models using big datasets such as the LOB and news stories from the Dow Jones Newswires Text News Feed for 23 NASDAQ stocks. They show that LOB data has very strong forecasting power compared to the HAR family of models, and that adding news sentiment variables to the dataset improves the forecasting power only marginally. More importantly, they find that this holds only for normal volatility days, not volatility jumps. However, that study remains narrow in focus, dealing only with sentiment extracted from the LM dictionary. The two principal limitations of the LM dictionary are that it does not capture language complexities and that it was developed from financial statements, not news stories. Apart from a few studies focusing on a novel statistical model for sentiment extraction to predict asset returns (Ke et al., 2019), topic modelling (Bybee et al., 2020), designing a sentiment-scoring model to capture the sentiment in economic news articles (Shapiro et al., 2020), and developing a word embedding for analysing the role of corporate culture during the COVID-19 pandemic (Li et al., 2020), there has so far been little focus on more advanced NLP models for financial forecasting. Much of the current NLP literature focuses on word embeddings, a more sophisticated word representation that paved the way for modern text-oriented ML models. In a major review of textual analysis in accounting and finance, Loughran and McDonald (2016) warn that these more complex ML models potentially add more noise than signal. Nevertheless, we set out to investigate the usefulness of a more advanced NLP structure for RV forecasting.
Part of the aim of this study is to develop, for the first time, a financial word embedding, named FinText, and compare it with publicly available general word embeddings. Another objective is to determine whether a word embedding, together with a convolutional neural network (CNN) trained solely on news data, is powerful enough for RV forecasting. Finally, this study shines new light on these debates through the use of Explainable AI methods to interrogate these ML models. We show that our financial word embedding is more sensitive to financial context than general word embeddings from Google and Facebook. Using 23 NASDAQ stocks from 27 July 2007 to 18 November 2016, we observe that, using just the previous day's news headlines, all developed NLP-ML models strongly outperform the HAR family of models, the extended models in Rahimikia and Poon (2020a), and the other proposed ML models in Rahimikia and Poon (2020b) for forecasting RV on volatility jump days. Furthermore, the FinText word embedding developed with the Word2Vec algorithm and Skip-gram model marginally outperforms all other word embeddings. This finding highlights the potential importance of news in more advanced NLP and ML frameworks for detecting volatility jumps. It also shows that the model captures rich textual information that traditional dictionary-based sentiment models cannot. Our findings also show that political news headlines have less predictive power than stock-related news. Comprehensive robustness checks focusing on different numbers of days for news aggregation, different filter (kernel) sizes, and making the embedding layer trainable support these findings. Another important finding that emerges from this study is that a simple ensemble model, combining the best performing model trained on LOB data in Rahimikia and Poon (2020b) and the best performing NLP-ML model in this study, can substantially outperform all traditional and ML models for both normal volatility and volatility jump days. This shows the importance of combining financial information and textual news information in a single forecasting model. Using Explainable AI methods, we show that words classified as negative (e.g., 'loss') in the LM dictionary can cause either an increase or a decrease in the RV forecast. The Explainable AI measures can also analyse the impact of any word (or combination of words) on RV forecasting, which is not feasible in the dictionary approach.

This paper is structured into seven sections. Section 2 deals with the theory of word embedding and the Word2Vec and FastText algorithms. Section 3 introduces our word embedding, FinText. News preprocessing steps are covered in Subsection 3.1, and the evaluation and representation of the proposed word embedding in Subsection 3.2. Section 4 gives a brief review of RV forecasting, followed by related models in Subsection 4.1. The proposed NLP-ML model for RV forecasting is introduced in Subsection 4.2, and Explainable AI methods in Subsection 4.3. Next, Section 5 presents the findings of the research, focusing on stock-related news in Subsection 5.1, hot political news in Subsection 5.2, a comparison of models in Subsection 5.3, a proposed ensemble model in Subsection 5.4, and Explainable AI results in Subsection 5.5. Section 6 looks at robustness checks and, finally, Section 7 concludes with a discussion and future research directions.
A word embedding is one of the most important recent developments in NLP, where a word representation is in the form of a real-valued vector such that words that are closer in the vector space are expected to be similar in meaning. Before the development of word embeddings, each token was represented by a one-hot encoding: a binary vector of zeros except at the index of that token in the vocabulary list. The vectors of any two tokens are orthogonal to each other. Hence, a word embedding is a better representation of semantics. Specifically, a word embedding is a matrix $E \in \mathbb{R}^{M \times N}$, where M is the dimension size and N is the number of unique tokens in the vocabulary list W. Each token is represented by a vector of M values; in the literature, M is usually about 300. Word2Vec (Mikolov et al., 2013) and FastText (Bojanowski et al., 2017) are two of the most efficient algorithms for training word embeddings. Subsection 2.1 and Subsection 2.2 briefly review these two simple and efficient algorithms.

2.1 Word2Vec

Mikolov et al. (2013) proposed supervised learning models with Skip-gram and continuous bag-of-words (CBOW) log likelihoods for the Word2Vec algorithm:

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-k \leq j \leq k,\ j \neq 0} \log p(w_{t+j} \mid w_t) \qquad \text{(1a)}$$

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-k \leq j \leq k,\ j \neq 0} \log p(w_t \mid w_{t+j}) \qquad \text{(1b)}$$

where T is the total number of tokens in the sequence X = {t_1, t_2, ..., t_T}, k is the window size around the chosen token w_t, and p(w_{t+j} | w_t) and p(w_t | w_{t+j}) are the probabilities of correct predictions of the Skip-gram and CBOW models, respectively. In the Skip-gram, the input (middle) token is used to predict the context (surrounding tokens), whereas the context (surrounding) tokens are used to predict the middle token in CBOW. It is generally agreed that the faster CBOW is suited for training larger datasets, while the Skip-gram is more efficient for training smaller datasets. Both models aim to maximise the aggregate predictive probability in Equation (1a) and Equation (1b) based on a simple neural network architecture. For both Skip-gram and CBOW, the softmax operation for calculating the conditional probability is defined as follows:

$$p(w_c \mid w_t) = \frac{\exp(u_{w_c}^{\top} u_{w_t})}{\sum_{i=1}^{N} \exp(u_i^{\top} u_{w_t})} \qquad \text{(2a)} \qquad\qquad p(w_t \mid w_c) = \frac{\exp(u_{w_t}^{\top} \bar{u}_{w_c})}{\sum_{i=1}^{N} \exp(u_i^{\top} \bar{u}_{w_c})} \qquad \text{(2b)}$$

where w_c and w_t are the context and target tokens, u_{w_c} and u_{w_t} are the trainable vectors of the context token w_c and the target token w_t, and N is the number of tokens in the vocabulary. For the CBOW model in Equation (2b), as there is more than one context token, the average of the context token vectors, $\bar{u}_{w_c}$, is used. Both models are trained using stochastic gradient descent. When there is a large number of tokens in the vocabulary (N), Equation (2a) and Equation (2b) are computationally expensive. In this case, hierarchical softmax (Morin and Bengio, 2005) can be used instead. Another alternative is the negative sampling method with a binary logistic regression (Mikolov et al., 2013). In this method, the training set is the pair of target and context tokens, and K tokens are randomly chosen from a specific distribution. The output is one for the true token pair and zero for all other pairs. Mikolov et al. (2013) found that values of K from 5 to 20 are useful for small training sets, while for large training sets K can be as small as 2 to 5.

2.2 FastText

FastText is an extension of Word2Vec. For example, take 'profit' as a token and set the n-gram size equal to 3 (i.e. n = 3); the corresponding set of sub-tokens is <pr, pro, rof, ofi, fit, it>, where < and > indicate the start and the end of the token. The original token <profit> is also added to this list. More formally, for token w, $u_w = \sum_{i=1}^{n} u_i$, where u_w is the vector of token w, n is the number of n-grams of this token, and u_i is the vector of each sub-token. In this enhanced representation, each token consists of a bag of character n-grams.
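To make the subword decomposition concrete, the short Python sketch below reproduces the 'profit' example; the function names are our own, and a real FastText implementation hashes the n-grams into a fixed number of buckets rather than storing each one in a dictionary.

```python
import numpy as np

def char_ngrams(token: str, n: int = 3) -> list:
    """Character n-grams of a token, with '<' and '>' marking its
    boundaries, plus the full bracketed token itself."""
    padded = f"<{token}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams + [padded]

print(char_ngrams("profit"))
# ['<pr', 'pro', 'rof', 'ofi', 'fit', 'it>', '<profit>']

def token_vector(token: str, subword_vectors: dict, dim: int = 300) -> np.ndarray:
    """FastText-style token vector: the sum of its sub-token vectors."""
    vecs = [subword_vectors[g] for g in char_ngrams(token) if g in subword_vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)
```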
The FastText algorithm is more computationally intensive than Word2Vec, but it is more powerful for learning rare and out-of-vocabulary (OOV) tokens and morphologically rich languages (MRL) (Bojanowski et al., 2017).

At the time of writing, several well-known pre-trained word embeddings are available. These include the Word2Vec embedding of Mikolov et al. (2018), with three million unique tokens, trained on the Google news dataset of about 100 billion words, and the FastText embedding of Joulin et al. (2016), with one million unique tokens, trained on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news dataset. It is arguable whether these general word embedding models are adequate for specialist financial forecasting. Some words have a very different meaning when used in a specific financial context, e.g. apple as a fruit and Apple as the technology company. To address this concern, we train a new word embedding, called FinText, using the Dow Jones Newswires Text News Feed database from 1 January 2000 to 14 September 2015. While specialising in corporate news, this big textual dataset is among the best textual databases covering finance, business, and political news services worldwide. Subsection 3.1 describes the database, the preprocessing steps, and the properties of the word embedding developed from it, and Subsection 3.2 compares our FinText with the Google and Facebook (WikiNews) embeddings mentioned above.

This study uses all types of news (viz. financial, political, weather, etc.) from the Dow Jones Newswires Text News Feed from 1 January 2000 to 14 September 2015. All duplicate news stories and stories without a headline or body are removed. Extensive text preprocessing of the news stories is required to eliminate redundant characters, sentences, and structures. Table A1 in the Appendix presents a brief review of the cleaning rules applied. Each rule is defined by a regular expression and may contain different variations; for brevity, only one variation is shown in that table. The text cleaning procedures fall into five main categories: 1) Primary, 2) Begins with, 3) Ends with, 4) General, and 5) Final checks. 'Primary' extracts the body of news from the extensible markup language (XML), removing XML-encoding characters (XMLENCOD), converting XML to text (parsing), converting uppercase to lowercase letters, and removing tables. 'Begins with' and 'Ends with' remove, respectively, parts that begin and end with the specified structures. 'General' caters for patterns that may appear in any part of the news stories. Finally, 'Final checks' removes links, emails, phone numbers, short news (fewer than 25 characters), and leading and trailing spaces. These five sets of rules are applied to news headlines and bodies separately. Due to the importance of numbers in accounting and finance, numerical values are kept.

After cleaning the dataset, tokenisation breaks the headlines and news bodies into sentences and words. Common bigram (two-word) phrases are detected and replaced with their bigram form. All tokens with fewer than five occurrences are ignored, the bigram scoring function proposed by Mikolov et al. (2013) is used with ten as the threshold value, and the maximum vocabulary size is set to 30 million to keep memory usage under control. Finally, the '_' character is used for gluing pairs of tokens together. For example, if 'financial' and 'statement' appear consecutively more often than the threshold allows, they are replaced by 'financial_statement' as a new token.
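As an illustration, this bigram step can be reproduced with the Phrases class in gensim, whose default scoring is the Mikolov et al. (2013) function used here. This is a minimal sketch assuming gensim 4.x, with toy sentences of our own:

```python
from gensim.models.phrases import Phrases

# Tokenised sentences from the cleaned news corpus (toy examples).
sentences = [
    ["the", "financial", "statement", "was", "released", "today"],
    ["a", "financial", "statement", "for", "the", "quarter"],
    # ... millions of further tokenised sentences
]

# Pairs whose bigram score exceeds the threshold of 10 are glued with '_',
# e.g. 'financial statement' becomes 'financial_statement'.
bigram_model = Phrases(sentences, min_count=5, threshold=10, delimiter="_")
sentences_with_bigrams = [bigram_model[s] for s in sentences]
```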
Altogether, FinText consists of 2,733,035 unique tokens. Following the preprocessing steps, the Word2Vec and FastText algorithms are applied with the window size, minimum count, number of negative samples, and number of iterations all set equal to 5. The initial learning rate (alpha) is 0.025, the minimum learning rate is 0.0001, and the exponent of the negative sampling distribution is 0.75. The dimension of the word embeddings is 300. All these parameter values are those proposed by the algorithms' developers. Results for the Skip-gram and CBOW models are reported for both the Word2Vec and FastText algorithms, and these results are compared with pre-trained word embeddings from Google (Word2Vec algorithm) and Facebook (FastText algorithm).

[Table notes: Word2Vec is due to Mikolov et al. (2013), and FastText, an extension of the Word2Vec algorithm, to Bojanowski et al. (2017). FinText is the word embedding developed on the Dow Jones Newswires Text News Feed database; Google is a publicly available word embedding trained on part of the Google news dataset with about 100 billion words; WikiNews is a publicly available word embedding trained on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news dataset (Mikolov et al., 2018). CBOW and Skip-gram are the supervised learning models of Mikolov et al. (2013). WordSim-353 (Agirre et al., 2009) is a gold-standard collection for measuring word relatedness and similarity, and Simlex (Hill et al., 2015) is another gold-standard collection that focuses on similarity rather than relatedness or association.]

For each word embedding, principal component analysis (PCA) is applied to the 300-dimensional vectors, and Figure 1 presents the resulting 2D visualisation. The tokens are chosen from groups of technology companies ('microsoft', 'ibm', 'google', and 'adobe'), financial services and investment banks ('barclays', 'citi', 'ubs', and 'hsbc'), and retail businesses ('tesco' and 'walmart'). Dimension 1 (x-axis) and Dimension 2 (y-axis) show the first and second principal components. Word2Vec is shown in the top row and FastText in the bottom row. Figure 1 shows that only FinText clusters all groups correctly, with Word2Vec producing generally better results than FastText. Next, we challenged all word embeddings to produce the top three tokens that are most similar to 'morningstar'. This token is not among the training tokens of Google. WikiNews's answers are 'daystar', 'blazingstar', and 'evenin'. Answers from FinText (Word2Vec/Skip-gram) are 'researcher_morningstar', 'tracker_morningstar', and 'lipper'.
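For reference, the sketch below shows how such an embedding can be trained and queried with gensim using the hyper-parameter values listed above; it is a minimal illustration assuming gensim 4.x, not the authors' exact training script.

```python
from gensim.models import Word2Vec

# Skip-gram Word2Vec with the hyper-parameters reported in the text:
# window, min_count, negative samples and epochs all 5; 300 dimensions.
model = Word2Vec(
    sentences_with_bigrams,  # tokenised, bigram-merged corpus from the earlier sketch
    vector_size=300,
    window=5,
    min_count=5,
    sg=1,                    # 1 = Skip-gram, 0 = CBOW
    negative=5,
    ns_exponent=0.75,        # exponent of the negative sampling distribution
    alpha=0.025,             # initial learning rate
    min_alpha=0.0001,        # minimum learning rate
    epochs=5,
)

# Nearest-neighbour and odd-one-out queries as used in the text.
print(model.wv.most_similar("morningstar", topn=3))
print(model.wv.doesnt_match(["usdgbp", "euraud", "usdcad"]))
```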
When asked to find the unmatched token in a group of tokens such as ['usdgbp', 'euraud', 'usdcad'], a collection of exchange rate mnemonics, Google and WikiNews could not find these tokens, while FinText (Word2Vec/Skip-gram) produces the sensible answer, 'euraud'. Many word embeddings are able to solve word analogies such as king:man :: woman:queen. Table 3 lists some challenges we posed and the answers produced by the group of word embeddings considered here. While we do not have a financial gold-standard collection, it is clear that our financial word embedding is more sensitive to financial contexts and able to capture very subtle financial relationships. It is beyond the scope of this paper to test these word embeddings rigorously because the focus here is on realised volatility forecasting. We leave the investigation of the robustness of FinText to future research.

The previous section illustrated that the FinText word embeddings developed in this paper are more sensitive to financial relationships. Here we aim to use these embeddings in the context of volatility forecasting. Engle and Ng (1993) and Engle and Martins (2020) have already shown that news is a potential contributor to volatility. Therefore, we use word embeddings as part of an ML model, called NLP-ML, to see whether word embeddings are useful in forecasting realised volatility. In particular, Subsection 4.1 gives a brief review of RV forecasting, Subsection 4.2 presents NLP-ML, and Subsection 4.3 introduces Explainable AI.

Assume the asset price P_t follows the stochastic process

$$d \log P_t = \mu_t \, dt + \sigma_t \, dW_t \qquad \text{(3)}$$

where µ_t is the drift, W_t is the standard Brownian motion, and σ_t is the volatility process (a càdlàg function). RV, defined below, is used as a proxy for the unobserved integrated variance, $IV_t = \int_{t-1}^{t} \sigma_s^2 \, ds$:

$$RV_t = \sum_{i=1}^{M} r_{t,i}^2 \qquad \text{(4)}$$

where M = 1/δ is the sampling frequency and r_{t,i} is the i-th intraday return of day t.

To date, the HAR family of models is the most popular group of econometric models for forecasting RV. All HAR models follow the general specification

$$RV_{t+1} = f\left(RV_{t-i},\ J_t,\ BPV_{t-i},\ RV^{\pm}_{t-i},\ RQ_{t-i}\right) \qquad \text{(5)}$$

where RV_{t+1} is the forecasted RV, RV_{t−i} is the average RV of the last i days, J_t is the jump component, BPV_{t−i} is the average bipower variation of the last i days, defined as $BPV_t = \mu_1^{-2} \sum_{i=2}^{M} |r_{t,i}||r_{t,i-1}|$ with $\mu_1 = \sqrt{2/\pi}$ (Corsi and Reno, 2009), $RV^{\pm}_{t-i}$ is the average realised semivariance computed from positive/negative intraday returns (Patton and Sheppard, 2015), RQ_{t−i} is the average realised quarticity of the last i days, and f is a linear regression. Rahimikia and Poon (2020a) analysed 23 NASDAQ stocks and found the CHAR model to be the best performing HAR model. In Equation (5), the values of i for the BPV term in the CHAR model are the previous day (i = 1), the average of the last week (i = 7), and the average of the last month (i = 21) (Corsi and Reno, 2009).

In this study, consistent with Rahimikia and Poon (2020a) and Rahimikia and Poon (2020b), the training period is from 27 July 2007 to 11 September 2015 (2,046 days), and the out-of-sample period is from 14 September 2015 to 18 November 2016 (300 days). RV is calculated during NASDAQ market trading hours (9:30 AM to 4:00 PM Eastern Time). The rolling window method is applied as the forecasting procedure, and the LOB data from LOBSTER is used to calculate RV after applying the cleaning steps described in these studies. The RV descriptive statistics of the 23 NASDAQ stocks are presented in Table 4. These tickers are chosen based on their liquidity (high to low) and the availability of data. The sample period is from 27 July 2007 to 18 November 2016. Apart from a few exceptions (e.g., NVDA and MU), as liquidity decreases, RV becomes more positively skewed with a slightly higher mean.
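For concreteness, the sketch below computes daily RV, the jump-robust bipower variation, and CHAR-style regressors from a series of 5-minute returns. It is a minimal pandas illustration under our own naming, assuming a DatetimeIndex restricted to trading hours, and is not the authors' exact pipeline.

```python
import numpy as np
import pandas as pd

def realised_variance(r: pd.Series) -> pd.Series:
    """Daily RV: the sum of squared 5-minute returns within each trading day."""
    return r.pow(2).groupby(r.index.date).sum()

def bipower_variation(r: pd.Series) -> pd.Series:
    """Daily BPV = mu1^(-2) * sum_i |r_i||r_{i-1}| with mu1 = sqrt(2/pi);
    robust to jumps, so it estimates the continuous part of variation."""
    mu1 = np.sqrt(2.0 / np.pi)
    by_day = r.groupby(r.index.date)
    return by_day.apply(lambda x: (x.abs() * x.abs().shift(1)).sum()) / mu1**2

def char_regressors(bpv: pd.Series) -> pd.DataFrame:
    """CHAR regressors: daily BPV and its weekly (7-day) and monthly (21-day)
    averages, shifted so the row for day t+1 uses information up to day t."""
    return pd.DataFrame({
        "bpv_d": bpv,
        "bpv_w": bpv.rolling(7).mean(),
        "bpv_m": bpv.rolling(21).mean(),
    }).shift(1)
```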
Figure 2 is an abstract representation of our NLP-ML model. {X_{(t,1)}, X_{(t,2)}, ..., X_{(t,k_t)}} is the vector of k_t tokens from the news headlines on day t. When k_t is less than 500, a padding process fills the vector to 500 with 'NONE' so that the daily input to the neural network has the same length. The reason for using only the news headline and not the news body is that putting all the news bodies together makes the sequence of tokens extremely long, even for just one day. A very long token sequence increases the number of model parameters, stresses the computation, and could result in over-fitting, especially as our training sample size is relatively small. Also, the news headline is generally the most important abstract of the news body. As shown in Figure 2, the word embedding section is separated into days with news and days without news. For days with news, each token X_{(t,k_t)} has a 1 × 300 word embedding vector from one of the six pre-trained word embeddings in Section 3. These vectors are fixed and made non-trainable to reduce the number of parameters to be trained. This results in a 500 × 300 sentence matrix to be fed into a CNN layer. On days when there is no news, the vector is initially filled with random numbers that can be trained by the neural network. After the CNN, a fully connected neural network (FCNN) turns the signals into a single RV forecast, RV_{t+1}. Following Bollerslev et al. (2016), an 'insanity' filter is applied: for each rolling window, the minimum, maximum, and average of the training RVs are calculated, and any RV forecast greater (smaller) than the maximum (minimum) value is replaced by the rolling window average RV.

Figure 3 illustrates the structure of the CNN used in this study. Starting from the sentence matrix from Figure 2 for the news headline 'apple looks to be further beefing up siri.', three filters of size 3, 4, and 5 are applied simultaneously with valid padding and a stride size of 1. The outputs are three 1-dimensional feature maps of sizes 498, 497, and 496; max pooling and an FCNN are then applied, and the output of the network is the RV of the next day (RV_{t+1}).

Specifically, following Kim (2014), let X_i ∈ R^M be the M-dimensional token vector corresponding to the i-th token in the news headline. We know from Figure 2 that M = 300, and news headlines with fewer than 500 tokens are padded so that n = 500. Let X_{i:i+j} refer to the concatenation of token vectors X_i, X_{i+1}, ..., X_{i+j}:

$$X_{i:i+j} = X_i \oplus X_{i+1} \oplus \cdots \oplus X_{i+j} \qquad \text{(6)}$$

where ⊕ is the concatenation operator. A convolution operation involves a filter W ∈ R^{hM}, which is applied to a window of h tokens to produce a new feature:

$$c_i = f\left(W \cdot X_{i:i+h-1} + b\right) \qquad \text{(7)}$$

where b ∈ R is a bias term and f is a nonlinear function. This filter is applied to each possible window of tokens in the sentence {X_{1:h}, X_{2:h+1}, ..., X_{n−h+1:n}} to produce a feature map

$$C = [c_1, c_2, ..., c_{n-h+1}] \in \mathbb{R}^{n-h+1} \qquad \text{(8)}$$

As the next step, max pooling (Ĉ = max{C}) is applied; this step ensures that the most important feature is chosen (Collobert et al., 2011). To convert the max-pooling layer to the RV of the next day (RV_{t+1}), an FCNN is used as the last layer. The activation function of both the CNN and the FCNN is the rectified linear unit (ReLU), the optimisation algorithm is Adam (Kingma and Ba, 2014), and MSE is the objective function of this network. To prevent the NLP-ML model from over-fitting, L2 regularisation with the weight decay value set equal to 3 is used for both the CNN and the FCNN, while the dropout rate is set equal to 0.5 between the CNN and the FCNN. The timespan for the headlines of day t starts at 9:30 AM Eastern Time on day t and ends at 9:30 AM Eastern Time on day t + 1. Daily training of this model is computationally intensive; therefore, the training process is repeated every five days, and the trained model is used for the following days. In order to have reproducible results, a random number generator (RNG) with the same seed is used for all trained models.
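A minimal Keras sketch of this architecture is given below. It matches the shapes described above but simplifies details the paper leaves open (the exact FCNN depth, the separate trainable embedding for no-news days, and the weight-decay parameterisation), so it should be read as an approximation rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import Model, layers, regularizers

MAX_LEN, EMB_DIM = 500, 300  # padding length and word-embedding dimension

def build_nlp_ml(embedding_matrix, n_filters=3, l2=3.0):
    """NLP-ML sketch: frozen embedding, three parallel 1-D convolutions
    (kernel sizes 3, 4, 5), max pooling, and an FCNN head with MSE loss."""
    inp = layers.Input(shape=(MAX_LEN,), dtype="int32")
    emb = layers.Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=EMB_DIM,
        weights=[embedding_matrix],
        trainable=False,            # pre-trained vectors are kept fixed
    )(inp)                          # -> (batch, 500, 300) sentence matrix
    branches = []
    for k in (3, 4, 5):             # feature maps of length 498, 497, 496
        c = layers.Conv1D(n_filters, k, activation="relu", padding="valid",
                          kernel_regularizer=regularizers.l2(l2))(emb)
        branches.append(layers.GlobalMaxPooling1D()(c))  # max pooling per filter
    x = layers.Concatenate()(branches)
    x = layers.Dropout(0.5)(x)      # dropout between the CNN and the FCNN
    out = layers.Dense(1, activation="relu",
                       kernel_regularizer=regularizers.l2(l2))(x)
    model = Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model
```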
Dow Jones Newswires Text News Feed provides a tagging system for finding news stories related to a specific stock. Therefore, in this study, considering the timespan of our analysis and the availability of this tagging system during that timespan, the 'about' tag is used for extracting the stock-related news for each ticker. This tag denotes a story about a ticker but not of particular significance. For hot political news, news stories with a 'hot' tag and 'politics' as the subject tag are used. The 'hot' tag means the news story is deemed 'important' or 'timely' in some way. The main reason for adding 'hot' as an extra tag here is to reduce the length of the daily token sequence as much as possible. Returning to padding, Figure 4 shows the distribution of daily tokens (headlines) for stock-related news in Figure 4a and hot political news in Figure 4b. What can be clearly seen from these figures is that, for hot political news, padding removes more tokens than for stock-related news. This is expected because the daily number of hot political news stories is usually higher than that of stock-related news stories.

This subsection describes two popular methods, integrated gradients (IG) and Shapley additive explanations (SHAP), for making AI models more transparent. Sundararajan et al. (2017) defined IG with a Riemann sum as follows:

$$IG_i = (X_i - X'_i) \times \frac{1}{m} \sum_{k=1}^{m} \frac{\partial F\!\left(X' + \frac{k}{m}(X - X')\right)}{\partial X_i} \qquad \text{(9)}$$

where IG_i is the integrated gradient for token i, X is the set of weights for the daily news headlines, X' is the baseline (here, a sequence of zeros), m is the number of steps in the Riemann sum, X' + (k/m)(X − X') is the linear interpolation between the baseline X' and X, F(.) is the trained NLP-ML model, ∂F(.)/∂X_i is the gradient of the NLP-ML model with respect to token i, the sum divided by m is the average gradient, and (X_i − X'_i) scales the integrated gradient. In this study, we use Gauss-Legendre quadrature (with 50 steps and a batch size of 100) instead of the Riemann sum for the approximation. The IG approach is applicable to any differentiable model; it has a theoretical foundation, is easy to use, and is not computationally intensive. Moreover, IG provides an individual importance (attribution) for each token: a higher attribution value indicates a push towards a higher RV forecast, and vice versa for a lower attribution value. It is, however, not possible to rank the tokens in order to find out which has the best predictive power for realised volatility.
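The Riemann-sum form of IG translates directly into code. The sketch below is our minimal TensorFlow illustration, assuming a model variant that takes the embedded 500 × 300 sentence matrix as input (gradients must be taken with respect to the embedded tokens, not integer token ids).

```python
import tensorflow as tf

def integrated_gradients(model, x, baseline=None, m=50):
    """Riemann-sum approximation of IG for one input x of shape
    (seq_len, emb_dim); the baseline defaults to a sequence of zeros."""
    if baseline is None:
        baseline = tf.zeros_like(x)
    total = tf.zeros_like(x)
    for k in range(1, m + 1):
        point = baseline + (k / m) * (x - baseline)  # linear interpolation
        with tf.GradientTape() as tape:
            tape.watch(point)
            pred = model(tf.expand_dims(point, 0))
        total += tape.gradient(pred, point)          # accumulate gradients
    return (x - baseline) * total / m                # scale the average gradient

# Summing the result over the embedding dimension gives one attribution per
# token; positive values push the RV forecast up, negative values pull it down.
```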
Lundberg and Lee (2017) proposed the SHAP method based on coalitional game theory. Shapley values φ_i, defined below, measure the importance of each input i to a model f given a set S of inputs (here, tokens in the daily news headlines) and the model output f(S), the volatility forecast:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \left[ f(S \cup \{i\}) - f(S) \right] \qquad \text{(10)}$$

where F is the set of all tokens. The benefits of the SHAP approach include a solid theoretical foundation in game theory and no requirement for differentiable models. SHAP also satisfies local accuracy (the sum of the individual token attributions equals the forecasted RV), missingness (a missing token has no attributed impact, φ_i = 0), and consistency (if a change in a specific token has a more considerable impact on a first model than on a second model, the importance of this token should be higher for the first model than for the second). However, it is computationally intensive and, like other permutation-based approaches, it does not consider feature dependencies and may generate misleading results. Here, we use a high-speed approximation algorithm, Deep SHAP, based on DeepLIFT (Shrikumar et al., 2017), to calculate SHAP values.

For each stock i, the performance difference between NLP-ML model j and CHAR is measured as

$$\Delta_{MSE,i,j} = MSE_{i,j} - MSE_{i,CHAR} \qquad \text{(11)}$$

and is summarised across the 23 stocks by its average and median,

$$Avg\,\Delta_{MSE,j} = \frac{1}{23}\sum_{i=1}^{23} \Delta_{MSE,i,j}, \qquad Med\,\Delta_{MSE,j} = \underset{i}{\mathrm{median}}\ \Delta_{MSE,i,j} \qquad \text{(12)}$$

In equations (11) and (12), MSE can be replaced by QLIKE (Patton, 2011) or mean directional accuracy (MDA). For MSE and QLIKE, a negative value in equations (11) and (12) indicates that the NLP-ML model outperforms CHAR.

Outperformance over the whole set of benchmarks is assessed with the reality check (RC) test, whose null hypothesis can be written as H_0: min_{k=1,...,n} E[L_k − L_0] ≤ 0, where L_k is the loss from benchmark k (the HAR family of models), L_0 is the loss from the specific NLP-ML model, and n is the number of benchmark models (in this study, n = 8). Rejection of H_0 means that the loss from the NLP-ML model is significantly smaller than that from all benchmark models. For this RC test, we follow the stationary bootstrap of Politis and Romano (1994) with 999 re-samplings and an average block length of 5 (Bollerslev et al., 2016; Rahimikia and Poon, 2020a,b).

Rahimikia and Poon (2020a) showed the importance of separating normal volatility days and volatility jump days when evaluating out-of-sample forecasting performance. A day is defined as a volatility jump day when RV for that day is greater than Q3 + 1.5 IQR, where IQR = Q3 − Q1, and Q1 and Q3 are, respectively, the first and third quartiles of RV. Applying this criterion to the sample of 23 stocks, on average about 90% of the 300 days in the out-of-sample period are classified as normal volatility days, and the remaining 10% as volatility jump days.
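In code, the jump-day rule and the losses are straightforward; the sketch below uses our own function names, and the QLIKE expression is one common parameterisation consistent with Patton (2011).

```python
import numpy as np

def is_jump_day(rv: np.ndarray) -> np.ndarray:
    """Flag a day as a volatility jump when RV > Q3 + 1.5 * IQR."""
    q1, q3 = np.percentile(rv, [25, 75])
    return rv > q3 + 1.5 * (q3 - q1)

def mse(rv_true: np.ndarray, rv_pred: np.ndarray) -> float:
    return float(np.mean((rv_true - rv_pred) ** 2))

def qlike(rv_true: np.ndarray, rv_pred: np.ndarray) -> float:
    """QLIKE loss: rv/forecast - log(rv/forecast) - 1, averaged over days."""
    ratio = rv_true / rv_pred
    return float(np.mean(ratio - np.log(ratio) - 1.0))
```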
Table 5 reports the out-of-sample forecasting results for NLP-ML models with 3, 8, 13, 18, 23, 28, 33, 38, and 43 CNN filters. An NLP-ML model with a lower number of filters is less complex than one with a higher number of filters. In Table 5, RC is the percentage of tickers with outstanding NLP-ML performance against all HAR-family models at the 5% and 10% significance levels with MSE (QLIKE) as the loss function, and 'Avg' and 'Med' are, respectively, the Avg ∆_{MSE,j} and Med ∆_{MSE,j} defined above. The top panel of Table 5 is for normal volatility days, while the bottom panel is for volatility jump days. The bottom panel clearly shows that NLP-ML dominates the HAR models on volatility jump days. Using the MSE loss function, our 'FinText(skip-gram)' and 'FinText(FastText/skip-gram)' are the best performing word embeddings. The MDA results in Table A2 in the Appendix corroborate this finding. With the QLIKE loss function, Google and WikiNews show slightly better results. On normal volatility days in the top panel, all NLP-ML models underperform and are dominated by the HAR models.

Figure 6 plots the results in Table 5. The top panel of Figure 6 is for normal volatility days, while the bottom panel is for volatility jump days. The left column is based on MSE, the right column on QLIKE. The solid line represents the median, while the dashed line represents the mean. A negative value of the loss difference (either MSE or QLIKE) means NLP-ML outperforms, and a positive value means the CHAR model is better. The bar chart represents the proportion of RC tests showing that NLP-ML outperforms the HAR family of models; a value over 50% means NLP-ML is better, and a value closer to 100% means NLP-ML dominates. As in Table 5, the most striking result in Figure 6 is the dominance of the NLP-ML models on volatility jump days. Figure A1 shows that NLP-ML models are generally good in directional forecasts, especially for normal volatility days.

This subsection provides a performance summary of hot political news for RV forecasting. The results are presented in Figure 7. In line with the stock-related news results in Subsection 5.1, NLP-ML models based on hot political news show an improvement in forecasting power for volatility jumps, but not to the same degree for normal volatility days. A comparison of stock-related news in Figure 6 and hot political news in Figure 7 also reveals that hot political news has less RV forecasting power for volatility jumps than stock-related news. However, focusing on the RC results, the statistically significant improvement from hot political news is not negligible for either the MSE or the QLIKE loss function. Interestingly, the MDA results in Figure A2 show an improvement in RV forecasting of normal volatility days for, on average, more than 60% of tickers. These results draw our attention to the importance of considering hot political news (or other types of news) as a potential contributor to RV. Turning to a more detailed analysis, the hot political news results in Figure 7 need to be interpreted cautiously. This subsection has attempted to summarise the contribution of other types of news, such as hot political news, to RV forecasting. Understanding the impact of different types of news is vitally important but beyond the scope of this study. Therefore, the following parts focus on the stock-related news-based models as the best performing group of models in this study.

Here, we compare three approaches for forecasting realised volatility, viz. CHARx, ML (with long short-term memory (LSTM)), and NLP-ML.
Rahimikia and Poon (2020a) conducted a full investigation into the HAR family of models and concluded that the CHAR structure is the most stable and provides the best RV forecasting performance. The authors went on to create CHARx, i.e., CHAR with exogenous variables, and found some forecasting power from LOB variables and LM sentiment variables extracted from the Dow Jones Newswires Text News Feed (stock-related news only). Later, Rahimikia and Poon (2020b) put all the CHARx and other variables into an LSTM model and found that this ML structure significantly improved the forecasting power on normal volatility days. Here, we compare the forecasting power of the CHARx and ML models (the ML group) above with the best performing NLP-ML model ('FinText(skip-gram)') from the previous section. Table 6 shows the differences between these three approaches. From this table, it is clear that, compared with CHARx and the ML group, NLP-ML models use substantially less information: variable types (news, not financial data), news types (headline only, not body and headline together), historical information (one day, not 23 days), and training frequency (every five days, not every day). Regarding news sentiment, the ML group uses the LM dictionary sentiment variables (negative, positive, uncertainty, litigious, weak modal, moderate modal, strong modal, and constraining) and a news count. The LOB variables are calculated based on 5 and 10 levels of bid, ask, or both, and include, among others, type-1 slope (Naes and Skjeltorp, 2006), type-2 slope (Kalay et al., 2004), and depth.

The results are presented in Figure 8, from which a few observations can be made. First, the ML group provides significantly greater forecasting power on normal volatility days but does poorly on volatility jump days. In contrast, NLP-ML provides significantly greater forecasting power on volatility jump days but does poorly on normal volatility days. As the MDA results in Figure A3 in the Appendix show, compared with the ML group, NLP-ML provides better forecasting performance for volatility jumps but worse performance for normal volatility days. Also, considering the RC results, in some cases CHARx is marginally good for volatility jumps, and its forecasting performance on normal volatility days is mixed, depending on whether one uses the median or the mean measure. In summary, we deduce from the analyses here that, first, there is important textual information that cannot be extracted with the dictionary approach and, second, there is essential information in the financial numbers that is not in the text. Hence, in Subsection 5.4 below, we use an ensemble model that combines both sources of information for RV forecasting and works for both normal volatility days and volatility jump days.

In this study, the ensemble model forecast is simply the arithmetic mean of the forecast from the best performing NLP-ML model from the previous section and the forecast from the best performing OB-ML model in Rahimikia and Poon (2020b). The OB-ML model contains 134 variables extracted from the LOB and the autoregressive volatility variables of the HAR family of models. Rahimikia and Poon (2020b) found the news sentiment variables to have only marginal predictive power; since the NLP-ML model has a much more powerful textual information extraction routine, we decided not to choose the news sentiment-based models from that study.
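The ensemble and the insanity filter described earlier reduce to a few lines of code; the sketch below is a minimal illustration with our own function names.

```python
import numpy as np

def ensemble_forecast(rv_nlp_ml, rv_ob_ml):
    """Ensemble RV forecast: the arithmetic mean of the NLP-ML and
    OB-ML forecasts for each out-of-sample day."""
    return 0.5 * (np.asarray(rv_nlp_ml) + np.asarray(rv_ob_ml))

def insanity_filter(forecasts, train_rv):
    """Bollerslev et al. (2016) filter: any forecast outside the
    [min, max] range of the training-window RVs is replaced by the
    training-window average RV."""
    lo, hi, avg = np.min(train_rv), np.max(train_rv), np.mean(train_rv)
    f = np.asarray(forecasts, dtype=float).copy()
    f[(f < lo) | (f > hi)] = avg
    return f
```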
In the heatmaps of Figure 10, a darker cell shows an improvement and a lighter cell a degradation in performance. Figure 10 shows that the performance is not sensitive to the number of CNN filters in the NLP-ML model, which means one could choose a small number of filters to save computation time. However, the performance is sensitive to the number of units in OB-ML: a smaller number is preferred for normal volatility days, and a larger number for volatility jump days. This is in line with the findings in Rahimikia and Poon (2020b). The heatmap for the MDA loss function in Figure A5 leads to the same conclusion. In summary, this ensemble approach again stresses the importance of the information content of both financial numbers and news for forecasting RV. When we study news with ML, a simpler structure with a small number of CNN filters is sufficient, and this textual information tends to be good for predicting volatility jumps. With financial numbers, however, the process differs between normal day and jump day forecasts. A compromise is to choose a small number of units for normal day forecasts, without damaging the jump day forecast, and to let the news and the NLP-ML model take care of jump day forecasts. Future research could investigate whether a weighted average or even more complex ensemble models could perform better.

Over the last few years, great efforts have been made to better understand ML models, which are often described as black boxes. Here, we use two Explainable AI methods, viz. IG and SHAP, to analyse the impact of textual information on volatility forecasting in an ML model. A brief review of IG and SHAP is given in Subsection 4.3. Considering the SHAP values for the first ticker (AAPL), all appearances of the token 'loss' increased the RV forecast, but with slightly different magnitudes. The behaviour is very different for the second ticker (ADSK): apart from one incident, all other appearances of 'loss' reduced the RV forecasts. In Figure 11 it can be seen that 'loss' led to an increase (decrease) in RV forecasts 42 (18) times; this is the case whether we measure the impact with IG or with SHAP. 'Loss' is an important token for MU but has almost no effect on AMGN. In contrast, the dictionary approach always classifies 'loss' as a negative term, possibly with the implied prediction that it always increases the RV forecast by a fixed amount. ML via the FinText word embedding allows for a much greater variety of contextual relationships between 'loss' and the RV forecast.

Next, 'donald trump' is chosen as a political bigram (i.e. a pair of words), with attributions produced from 'FinText(skip-gram)' with 3 CNN filters. From the IG measure, there are now 1900 (492) cases where 'donald trump' led to an increase (decrease) in the RV forecast; for the SHAP measure, these numbers are 1589 (803). So 'donald trump' is a highly sensitive bigram and tends to increase market volatility. What is interesting about the results from hot political news is that this approach is able to identify the impact of any desired term, not just a limited group of terms in a dictionary.

To gain some appreciation of the context within which a token affects volatility forecasts, we present a visualisation using the stock-related news for Apple (ticker 'AAPL') on 2016-10-2 and hot political news on 2016-07-20. Figure 13 presents a SHAP visualisation in the top panel for the stock-related news about Apple and an IG visualisation in the bottom panel for the hot political news. Note that 'NONE' appears repeatedly in the bottom panel as a result of the padding procedure.
A token coloured red (blue) indicates an increase (decrease) in the RV forecast, and a darker colour indicates a greater intensity. Figure 13a shows that the majority of the highlighted tokens are related to the fourth-quarter earnings of Apple, and these tokens together increased the RV forecast. Not only did the NLP-ML model learn the link between the earnings release and a higher volatility forecast on the following day, but it also appears to have identified the news tag, i.e. '> aapl', that might have been used to denote earnings-related information. For the hot political news in Figure 13b, the key phrases 'republican national convention nominates donald trump', 'donald trump secures republican', and 'gop presidential nomination' (GOP is short for 'Grand Old Party', a colloquialism for the Republican party) caused an increase in the RV forecast. This illustrates the capability of NLP-ML in detecting important political news that has an effect on volatility. Moreover, the different shades of colour show that NLP-ML assigns very different weights to the words in a phrase, reflecting its ability to handle complexity.

[Figure 13a content: the visualised day of Apple stock-related headlines, dominated by fourth-quarter earnings items (e.g. 'apple 4q sales $ 46.9b > aapl', 'apple 4q eps $ 1.67 > aapl', segment and regional revenue lines, first-quarter guidance, and Tim Cook interview quotes).]
In this section, we assess the impact of NLP-ML parameters and forecasting structures on the RV forecasting performance reported so far. Here, 'FinText(skip-gram)', our best performing NLP-ML model with 3, 8, 13, 18, 23, 28, 33, 38, and 43 CNN filters, is the benchmark. The robustness check studies four modifications to the benchmark. First, the '3 days' group changes the input from the news headlines of the previous day to those of the previous three days. Similarly, the second modification, '5 days', takes as input the headlines from the previous five days. (Both modifications reduce the number of days without news: for the 23 tickers, moving to three and five days decreases the number of days without news by 45.22% and 64.32%, respectively, for in-sample data, and by 49.34% and 69.89%, respectively, for out-of-sample data.) The third modification, 'Long filter size', changes the filter (kernel) sizes from 3, 4, and 5 to 8, 9, and 10. Finally, the fourth modification, 'Trainable layer', makes the fixed embedding layer trainable; this means that the initial embedding layer parameters obtained from FinText can be changed during training.

The results are presented in Figure 14. [Figure 14 notes: the bar charts give the percentage of tickers with outstanding performance under the MSE loss function (Figure 14a) and the QLIKE loss function (Figure 14b) at the 0.05 significance level of the RC test against all HAR-family models; the hashed bars are the 'FinText(skip-gram)' benchmark group from Figure 6; the dashed (solid) line is the difference between the average (median) out-of-sample loss of the specified model and the CHAR model (the best performing HAR-family model in Rahimikia and Poon (2020a)) for 23 tickers, with negative values showing improvement; the horizontal dashed line represents no improvement.]

Figure 14 reveals that moving to the '3 days' and '5 days' input durations causes a substantial degradation in forecasting performance on normal volatility days: apart from an increase in the mean and median values, the RC values are substantially reduced or drop to zero. A degradation in forecasting performance on volatility jump days is also observable, especially when QLIKE is the loss function. The directional forecast results for the MDA loss function in Figure A6 show a degradation in forecasting performance, especially on volatility jump days. These findings may be due to the fact that a long input duration makes the NLP-ML model more complicated, so that it would require more than the 2,046 days of data available for training. This could be one major challenge researchers face in future studies in this area. Figure 14 also shows little effect from changing the filter size, a finding generally consistent with the MDA results in Figure A6. Finally, making the word embedding layer trainable causes a substantial degradation in the RV forecasting performance on normal volatility days while not providing any noticeable improvement on volatility jump days. The MDA results in Figure A6 do not show any substantial difference in forecasting performance.
The main reason why the NLP-ML models with trainable embedding layers did not improve forecasting performance lies in the fact that this change substantially increases the number of trainable parameters. Therefore, more training samples are required, which is again not feasible with the current financial data sample size.

In this paper, we developed a financial word embedding called FinText (available for download from rahimikia.com/fintext), based on the Dow Jones Newswires Text News Feed database (2000-2015), using the Word2Vec and FastText algorithms. Our financial word embedding performed less well on general-purpose benchmarks when compared with Google's and Facebook's word embeddings. However, when challenged with detecting unique financial relationships, FinText is better and more sensitive in detecting financial jargon. Our focus in this paper, though, is to test these pre-trained word embeddings in realised volatility forecasting using a machine learning model. Using data for 23 NASDAQ stocks from 27 July 2007 to 18 November 2016, we found evidence that headlines of stock-related news produce a substantial improvement in forecasting daily realised volatility on volatility jump days, beating all HAR-family models in Rahimikia and Poon (2020a) and the machine learning models in Rahimikia and Poon (2020b) that did not make use of word embeddings. For forecasting realised volatility on volatility jump days using stock-related news, our proposed model, named NLP-ML, performs marginally better with FinText than with general word embeddings such as the Google and Facebook ones. This study has also identified that political news, albeit to a lesser extent, improves realised volatility forecasting performance. Furthermore, since financial news with the NLP-ML model performs well on volatility jump days, while the limit order book model (OB-ML) in Rahimikia and Poon (2020b) performs well on normal volatility days, we create an ensemble model, combining both NLP-ML and OB-ML, which dominates all HAR-family models on both normal volatility days and days with volatility jumps. Our ensemble model is simply the arithmetic average of the forecasts obtained from the best performing NLP-ML and OB-ML models. Finally, we use two Explainable AI methods to measure the impact of given tokens on realised volatility forecasts; such performance attribution is not feasible in the classical dictionary-based approach to textual analysis.

In conclusion, we demonstrate that our purpose-built financial word embedding has superior volatility forecasting power when used to analyse financial news headlines via a machine learning model. Here, we forecast the realised volatility of 23 NASDAQ stocks; it would be interesting to extend the forecasting exercise to other financial contexts, such as stock returns, limit order book depth, trading volume, etc. Furthermore, the scope of this study was limited because the training sample is small; as a result, we tested only news headlines and not the news bodies. Moreover, instead of daily forecasting, it would be interesting to test the NLP-ML forecasting power on higher-frequency data.
We also hope that this study paves the way for other state-of-the-art natural language processing and machine learning research in different financial contexts beyond realised volatility forecasting.

[Table A2 notes: the average and median ∆MDA of 23 stocks (a negative value indicates degradation, a positive value improvement), and the percentage of tickers with outstanding NLP-ML performance for different numbers of CNN filters at the 5% and 10% significance levels of the RC test against all HAR-family models, based on MDA.]

[Figure A5 notes: heatmaps of MDA-Avg, MDA-Med, and MDA-RC for the NLP-ML model (x-axis: 3, 8, 13, 18, 23, 28, 33, 38, and 43 filters) combined with the OB-ML model of Rahimikia and Poon (2020b) (y-axis: 5, 10, 15, 20, and 25 units); the first (second) row contains the results for normal volatility days (jumps); MDA-Avg (Med) is the difference between the average (median) out-of-sample MDA of the specified combination and the CHAR model for 23 tickers; MDA-RC is the percentage of tickers with outstanding performance under the MDA loss function at the 0.05 significance level of the RC test against all HAR-family models; a darker cell shows improvement, a lighter cell degradation. The OB-ML inputs include type-1 slope (5 and 10 levels, bid and ask), type-2 slope, and depth variables.]

[Figure A6 notes: same layout as Figure 14 but with the MDA loss function; the bar chart is the percentage of tickers with outstanding performance at the 0.05 significance level of the RC test against all HAR-family models, with the hashed bars being the 'FinText(skip-gram)' benchmark group from Figure 6; '3 days' and '5 days' change the input duration, 'Long filter size' uses kernel sizes 8, 9, and 10, and 'Trainable layer' makes the embedding layer trainable; the dashed (solid) line is the difference between the average (median) out-of-sample MDA of the specified model and the CHAR model for 23 tickers, with positive values showing improvement; the horizontal dashed line represents no improvement.]

References

Adämmer and Schüssler (2020). Forecasting the equity premium: mind the news!
Agirre et al. (2009). A study on similarity and relatedness using distributional and WordNet-based approaches.
Ban et al. (2018). Machine learning and portfolio optimization.
Bianchi et al. (2021). Bond risk premiums with machine learning.
Bojanowski et al. (2017). Enriching word vectors with subword information.
Bollerslev et al. (2016). Exploiting the errors: A simple approach for improved volatility forecasting.
Bubna et al. (2020). Venture capital communities.
Bybee et al. (2020). The structure of economic news.
Deep learning in asset pricing.
Collobert et al. (2011). Natural language processing (almost) from scratch.
Conrad and Engle (2021). Modelling volatility cycles: the (MF)^2 GARCH model.
Corsi (2009). A simple approximate long-memory model of realized volatility.
Corsi and Reno (2009). HAR volatility modelling with heterogeneous leverage and jumps. Available at SSRN 1316953.
Engle and Martins (2020). Measuring and hedging geopolitical risk.
Engle and Ng (1993). Measuring and testing the impact of news on volatility.
Gentzkow et al. (2019). Text as data.
Gu et al. (2020). Empirical asset pricing via machine learning.
Gu et al. (2021). Autoencoder asset pricing models.
Hill et al. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation.
Jiang et al. (2020). (Re-)Imag(in)ing price trends.
Joulin et al. (2016). FastText.zip: Compressing text classification models.
Kalay et al. (2004). Measuring stock illiquidity: An investigation of the demand and supply schedules at the TASE.
Ke et al. (2019). Predicting returns with text data.
Kim (2014). Convolutional neural networks for sentence classification.
Kingma and Ba (2014). Adam: A method for stochastic optimization.
Li et al. (2020). The role of corporate culture in bad times: Evidence from the COVID-19 pandemic.
Lim et al. (2019). Enhancing time-series momentum strategies using deep neural networks.
Loughran and McDonald (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks.
Loughran and McDonald (2016). Textual analysis in accounting and finance: A survey.
Lundberg and Lee (2017). A unified approach to interpreting model predictions.
Mikolov et al. (2013). Efficient estimation of word representations in vector space.
Mikolov et al. (2018). Advances in pre-training distributed word representations.
Mikolov et al. (2013). Distributed representations of words and phrases and their compositionality.
Morin and Bengio (2005). Hierarchical probabilistic neural network language model.
Naes and Skjeltorp (2006). Order book characteristics and the volume-volatility relation: Empirical evidence from a limit order market.
Obaid and Pukthuanthong (2021). A picture is worth a thousand words: Measuring investor sentiment by combining machine learning and photos from news.
Patton (2011). Volatility forecast comparison using imperfect volatility proxies.
Patton and Sheppard (2015). Good volatility, bad volatility: Signed jumps and the persistence of volatility.
Poh et al. (2021). Building cross-sectional systematic strategies by learning to rank.
Politis and Romano (1994). The stationary bootstrap.
Rahimikia and Poon (2020a). Big data approach to realised volatility forecasting using HAR model augmented with limit order book and news. Available at SSRN 3684040.
Rahimikia and Poon (2020b). Machine learning for realised volatility forecasting. Available at SSRN 3707796.
Shapiro et al. (2020). Measuring news sentiment.
Shrikumar et al. (2017). Learning important features through propagating activation differences.
Sirignano and Cont (2019). Universal features of price formation in financial markets: perspectives from deep learning.
Sundararajan et al. (2017). Axiomatic attribution for deep networks.
Wood et al. (2021). Slow momentum with fast reversion: A trading strategy using deep learning and changepoint detection.
Wu et al. (2020). A cross-sectional machine learning approach for hedge fund return prediction and selection.
Zhang and Zohren (2021). Multi-horizon forecasting for limit order books: Novel deep learning approaches and hardware acceleration using intelligent processing units.
Zhang et al. (2018). BDLOB: Bayesian deep convolutional neural networks for limit order books.
Deep reinforcement learning for trading.

Table A1: Textual data cleaning rules (abridged; each rule is a regular expression, with only one variation shown and 'XX' denoting arbitrary text).
Primary: extracting the body of news from XML; removing XML-encoding characters.
Pattern-based removals (Begins with / Ends with / General): '(END)'; '(email-e-mail): XX'; 'for (more-further) (information-from marketwatch), please visit: XX'; '(phone-fax-contact-dgap-ad-hoc-dgap-news): XX'; '(EMAIL; @XX)'; 'image available: XX'; 'copyright XXXX, XX'; 'URL source: XX'; '(more to follow) XX'; 'to read more, visit: XX'; 'end of (message-corporate news) XX'; '(view source-view original content) (with-on) XX'; 'source: XX URL'; '(investor relations-investor contact) XX'; 'XX contributed to this article'; 'like us on XX'; 'view source version on XX'; 'view original content XX'; '(fax-tell-contact-dgap-ad-hoc-dgap-news): (contacts-web site):'; 'ratings actions from baystreet:'; 'cannot parse story for notes, kindly refer'; 'lipper indexes:'; 'to subscribe to'; 'following is the related link:'; 'for full details, please click on'.
Final checks: removing links and emails; removing short news (fewer than 25 characters); removing leading and trailing spaces; removing phone numbers.