key: cord-0617074-4cdm4l02 authors: Pasch, Stefan; Ehnes, Daniel title: StonkBERT: Can Language Models Predict Medium-Run Stock Price Movements? date: 2022-02-04 journal: nan DOI: nan sha: f21e7b4dff8c62938ab73387f3775d4d7ac2f55b doc_id: 617074 cord_uid: 4cdm4l02 To answer this question, we fine-tune transformer-based language models, including BERT, on different sources of company-related text data for a classification task to predict the one-year stock price performance. We use three different types of text data: News articles, blogs, and annual reports. This allows us to analyze to what extent the performance of language models is dependent on the type of the underlying document. StonkBERT, our transformer-based stock performance classifier, shows substantial improvement in predictive accuracy compared to traditional language models. The highest performance was achieved with news articles as text source. Performance simulations indicate that these improvements in classification accuracy also translate into above-average stock market returns. In recent years, transformer-based language models have received strong interest in both academic circles and the industry. The development of these language models has accelerated since the publication of BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2018) , which set new high scores in various NLP tasks and applications, such as question-answering, fact-checking, or hate-speech detection. Unsurprisingly, transformer models have influenced the literature and work that links textual information, such as corporate communication or tweets, with stock price performance. While initial attempts to use NLP methods for this purpose have mostly relied on bag-of-word type models (Loughran and McDonald 2011; Loughran and McDonald 2016; Jiang et al. 2019 ), transformer-based approaches allow to capture more complex aspects of textual information. By integrating contextual information into the evaluation of words, initial evidence suggests that transformer models indeed achieve higher accuracy in predicting stock price movements compared to previous approaches that are, for example, based on naive bayes or dictionary methods (Chen 2021; Sivri et al. 2022 ). Most of this work has focused on the stock movements in the immediate aftermath of the publication of the corresponding documents, i.e., 5 to 30 days after a publication event. Yet, many investors do not engage in day-trading and are interested in stock performance over a longer time span. A natural question that arises is whether predictions would also work on longer time horizons. Moreover, most language models that try to predict stock prices analyze only a single source of text data, e.g., one type of corporate communication. However, some types of company-related text data may be more informative than others. We aim to fill these research gaps in the following way: First, in this paper, we investigate whether language models are able to predict the stock performance for longer time frames. To do this, we analyze the stock price movements on a one-year time frame after the publication of related text data. Second, we compare three different sources of text data and evaluate to what extent the performance of language models is dependent on the type of underlying documents. Specifically, we use news articles, blog posts and annual reports. 
To address these questions, we use a financial news dataset (Gennadiy 2020), containing news articles and opinion pieces, as well as official Form 10-K annual report filings, for 250 firms between 2012 and 2019. Additionally, we gathered information on the stock performance of these companies. We link these data sources to fine-tune transformer models that classify the one-year stock performance based on the corresponding text data. Analyzing these language models on our sample reveals that they classify the one-year performance with an accuracy of up to 10 percentage points above the expected accuracy of a random stock movement classifier. Additionally, we find that BERT-based language models outperform traditional language classifiers in all our specifications. Further, the analysis confirms that the performance of these models depends strongly on the underlying text data, showing a clear ranking: news data lead to the highest, blog articles to the second-highest, and company reports to the lowest performance. This also provides interesting economic insights, as our results suggest that news articles contain the information that is most "valuable" to an AI. Potentially, blog articles, given their speculative nature, only add noise compared to news articles, whereas the informational content of company reports may be too sparse. In a supplementary analysis, we investigate to what extent the performance of the news-based model translates into stock returns, showing that for our sample and observation period, the recommended picks indeed performed well compared to the average performance of the entire sample.

Finance, accounting, and economics scholars have long been interested in the interaction between textual information and stock price movements (Cutler et al. 1989; Tetlock 2007; Groß-Klußmann and Hautsch 2011). Most of these studies link news coverage or corporate communications to stock price movements, initially using rather unsophisticated but nonetheless effective bag-of-words methods, such as dictionary-based sentiment analysis (Loughran and McDonald 2011). Official corporate disclosures in particular have been scrutinized, with studies finding relationships between the readability of documents and stock returns; additionally, the tone of the documents and the information provided to investors affect stock market returns (Loughran and McDonald 2016). News articles and their relation to stock returns have also received a lot of attention: researchers found evidence suggesting that rising negative sentiment in a firm's news coverage lowers the firm's returns (Ahmad et al. 2016). Media coverage effects may, however, not be persistent, as increased coverage and visibility can, for example, also generate momentum returns that tend to wane in the long run (Hillert et al. 2014). Similarly, social media sentiment has been linked to stock market returns (Duz Tan and Tas 2021; Sprenger et al. 2014).

In recent years, deep-learning-based language models have made great strides in understanding textual information, specifically by taking the surrounding context into consideration when building word representations (Chan et al. 2020). Starting with LSTM-based models, such as ELMo (Peters et al. 2018), and then with the introduction of transformer-based models, particularly BERT (Devlin et al. 2018),
deep learning models have strongly improved the ability of NLP systems to answer questions, infer textual meaning, or summarize text. Moreover, specifically for text classification, BERT models achieve higher accuracy than traditional NLP models, such as TF-IDF-based models, in various applications (Gonzalez-Carvajal and Garrido-Merchan 2020). In the domain of finance, transformer-based approaches also outperform traditional NLP approaches, such as dictionary methods, in various tasks, including sentiment analysis, sentence boundary detection, and question answering for finance-related texts. Moreover, further performance gains could be achieved by pre-training finance-specific language models like FinBERT (Liu et al. 2020). Not surprisingly, researchers have adopted these deep learning approaches to predict stock price movements by applying such language models to company-related text data, such as tweets or company reports. For example, Sawhney et al. (2020) combine word encodings of Twitter texts with graph neural networks to predict whether stocks decrease or increase within a 5-day lag window. Similarly, Sonkiya et al. (2021) use BERT and a GAN to predict stock prices based on news articles in a 5-to-30-day window. However, to the best of our knowledge, the ability to predict medium-run stock movements with this technology has not yet been investigated.

One of our principal goals is to test what type of textual information is most useful for predicting medium-run stock price development. Accordingly, we scrutinize three types of company-related text data: First, we use a dataset of historical financial news coverage (Gennadiy 2020), which contains both traditional news articles (which we refer to as "news") and stock market opinion pieces (which we refer to as "blogs"). Second, we use annual reports (Form 10-K filings).

We prepared these data as follows: First, since the amount of coverage an individual equity receives is highly skewed, for example toward large firms and stocks with larger trading volumes, we truncated the number of equities to the 250 corporations that received the most coverage in the dataset. This ensures that every equity has at least 170 individual items associated with it. Additionally, the temporal distribution of entries is highly skewed toward the later years. We therefore only include data from 2012 onward in our analysis, as earlier years partly contain fewer than 1,000 news items in total. For the annual reports, we relied on the data provided by Bill McDonald and Tim Loughran at the University of Notre Dame, using all available annual reports from the same companies that we selected from the news and blog dataset. In contrast to the news and blog data, annual reports are published only once a year and are overall much more extensive. To deal with this, we split the reports into individual paragraphs.

To prepare the data further, we split it into a train, development, and test set. The train set covers the years 2012-2017 and comprises 225 equities. The dev set covers 10% of the companies (25) and also comprises the years 2012-2017. Our test set covers the same 225 firms as the train set, but the entire year of 2019. Essentially, there is a one-year gap between the train and the test set. This is necessary because we want to predict the average stock return after one year: if we included data from 2018, there would be a large overlap between the stock-price developments of observations in the training set and the test set, which could lead to spurious results. Accordingly, we did not use news and blog data from 2018 in our main specification.
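As a minimal sketch, this firm- and time-based split can be implemented as follows, assuming a pandas dataframe `df` with illustrative columns `ticker`, `year`, `text`, and `label` (the column names are for illustration only and not taken from our pipeline):

```python
import pandas as pd

# Keep the 250 most-covered firms (coverage is heavily skewed toward
# large, frequently traded companies) and drop years before 2012.
top_firms = df["ticker"].value_counts().nlargest(250).index
df = df[df["ticker"].isin(top_firms) & (df["year"] >= 2012)]

# Hold out 10% of firms (25) for the dev set; the remaining 225 firms
# form the train and test sets.
dev_firms = set(pd.Series(top_firms).sample(n=25, random_state=0))

train_df = df[(df["year"] <= 2017) & ~df["ticker"].isin(dev_firms)]
dev_df = df[(df["year"] <= 2017) & df["ticker"].isin(dev_firms)]
# One-year gap: 2018 is skipped so that the one-year-ahead returns seen
# in training do not overlap with those evaluated in the test year.
test_df = df[(df["year"] == 2019) & ~df["ticker"].isin(dev_firms)]
```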
For these companies, we also gathered daily stock price data from Yahoo Finance. To abstract from general market movements, we look at the abnormal price increases or decreases for each stock, that is, a stock's price change relative to the average price change of the market. These abnormal one-year returns were then split into tertiles, dividing stocks into three equally large groups of over-, under-, and average-performers. Hence, we would expect a random classifier to reach an accuracy of 33%.

We fine-tuned various transformer-based language models to classify the above-mentioned texts according to the expected stock performance of the corresponding company. We mainly present results for fine-tuning BERT-Base (Devlin et al. 2018); for all models, we use identical fine-tuning specifications.
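To illustrate the setup, below is a minimal fine-tuning sketch using the Hugging Face transformers Trainer; it is not our exact specification. Apart from using a single training epoch (which, as discussed below, yielded our best results), the hyperparameters shown are assumptions, and `train_df`/`dev_df` refer to the illustrative split sketched earlier:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # under-, average-, over-performer

def tokenize(batch):
    # Truncate to BERT's 512-token input limit and pad to a fixed length.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

train_ds = Dataset.from_pandas(train_df[["text", "label"]]).map(tokenize, batched=True)
dev_ds = Dataset.from_pandas(dev_df[["text", "label"]]).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="stonkbert",
    num_train_epochs=1,              # one epoch performed best in our runs
    learning_rate=2e-5,              # assumed; a common BERT fine-tuning value
    per_device_train_batch_size=16,  # assumed
)
Trainer(model=model, args=args, train_dataset=train_ds,
        eval_dataset=dev_ds).train()
```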
A classical way to conduct supervised NLP tasks is to treat the input text as a bag of words and analyze word frequencies (Gonzalez-Carvajal and Garrido-Merchan 2020). A common approach is to form Term Frequency-Inverse Document Frequency (TF-IDF) matrices, which weight how often a word occurs in a text by the inverse of its frequency of occurrence in the entire document corpus. Hence, we vectorize the text inputs using TfidfVectorizer from sklearn, similar to Gonzalez-Carvajal and Garrido-Merchan (2020). Based on these vectorized text inputs, traditional machine learning algorithms can then be applied in a second step to predict the class of the corresponding text. In particular, we use logistic regressions and XGBoost as benchmarks.

Table 2 reports the performance of the fine-tuned BERT models (StonkBERT) on a hold-out test set for the three text sources. Our first finding is that StonkBERT consistently outperforms the traditional methods in predicting the return category, irrespective of the data source under scrutiny. In fact, in some cases a random draw would have outperformed the traditional models, while the StonkBERT results consistently beat a random draw, albeit only narrowly for the annual reports. More importantly, we find that the text source has a sizable influence on the achieved accuracy. The lowest performance is achieved with annual reports, indicating that the information therein is largely already priced in, or at least of limited long-term value to investors. In contrast, models trained on blogs and news articles performed better. Comparing the accuracy between news and blogs, we find that the informational content of news is most valuable. Our best model achieves an accuracy of 43%, i.e., 10 percentage points above a random draw. Note that this is roughly in line with the work that focuses on short-term stock price movements.

Though the language models were trained on a simple classification task with three categories, we also analyze to what extent the models' predictions translate into stock market returns. That is, we analyze the average abnormal one-year return in our news article test sample based on the predictions from StonkBERT. First, we calculate the performance of the three prediction groups (good, medium, bad) in our test period, where performance is measured as the average abnormal one-year performance in a rolling window. Table 3 reports the results. In our test set, the included companies showed an average performance of 6.29%. The firms that StonkBERT predicted as "good", however, showed an average performance of 16.83% in the one-year period after the predictions were made. The firms predicted as "average" showed a performance of just 4.72%, and those predicted as "bad" of -3.17%, respectively. Correspondingly, the classification into the three performance groups was indeed associated with substantial differences in one-year stock returns. Moreover, we looked at the Top-10 predictions, i.e., the 10 firms for which StonkBERT predicted the highest probability of being a "good" firm. These firms in fact outperformed the market by an even higher margin. For the firms with the highest predicted probability of being "bad", however, we do not find that they further underperformed compared to all firms that were predicted as "bad". We also conducted performance simulations for articles published in 2018 and again find performance differences between the predicted categories, albeit the outperformance of the Top-10 relative to the market was more modest. Interestingly, we find that the firms predicted as "good" outperformed the other categories both before and after the Covid-related stock market crash in March 2020.
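A minimal sketch of this evaluation is shown below, assuming a dataframe `results` with one row per test-set firm and illustrative columns `pred_class`, `prob_good`, and `abnormal_return` (names are for illustration only); for simplicity, plain averages stand in for the rolling-window averages described above:

```python
import pandas as pd  # `results`: one row per firm in the news test sample

market_avg = results["abnormal_return"].mean()        # sample-wide baseline
by_class = results.groupby("pred_class")["abnormal_return"].mean()

# Top-10 picks: firms with the highest predicted probability of "good".
top10_avg = results.nlargest(10, "prob_good")["abnormal_return"].mean()
print(market_avg, by_class, top10_avg, sep="\n")
```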
We set out to test whether transformer-based language models can learn valuable information from text data to predict the stock performance of affiliated firms over a one-year period. Our results provide two interesting findings: First, the predictive capability of the model depends heavily on the informational value of the underlying text data; in all specifications and models, the news sample outperformed both the blog and the annual report samples. Second, we found that state-of-the-art transformer models indeed outperform traditional NLP approaches in predicting stock returns.

While our results are encouraging, several important limitations need to be kept in mind. Among finance scholars, the dominant theory is the efficient market hypothesis (Fama 1970), which holds that stock prices reflect all available market information, making it impossible to systematically outperform the market. Correspondingly, our results should be taken with a grain of salt, and there are various reasons why NLP models such as StonkBERT may not outperform the market in the long run. First, the successful prediction may be a result of our specific time period, which includes a very strong outperformance of tech stocks in general that was further amplified after the Covid crisis emerged. Our model may thus have learned that tech stocks outperformed their peers during the training period and extrapolated this trend into the future. However, a purely industry-based effect should have been detected by the traditional models as well. Potentially, transformer-based models are able to pick up more fine-grained information, for example specific technological trends within industries (e.g., cloud computing or machine learning). Another reason why we suspect that the results are not entirely driven by industry effects is that the annual-report-based models were unable to pick up such effects, despite convincing and thorough evidence that the contents of annual reports are an excellent predictor of a firm's industry (Hoberg and Phillips 2016). Similarly, our models may mainly capture a momentum effect, which has been widely studied in the finance literature (Jegadeesh and Titman 2001).

Moreover, as our models learn from historical success factors, they could also be susceptible to stock bubbles. It should be noted, however, that the StonkBERT model not only predicted outperformers, which could potentially reflect a bubble, but was also able to detect underperformers.

We find the highest accuracy for language models working with news data, the second highest with blog articles, and the lowest with annual reports as text source. A potential explanation for the comparably weak results of our annual report models could be that the information density in annual reports is too low: for example, they contain various standard phrases and largely cover past events that may already be priced in by the market (Yuan et al. 2021). Another concern is their limited frequency, which means that only new information in close temporal proximity to the report can be learned by the model. Further, seasonal fluctuations are not captured, as annual reports are generally released around the same time each year. An interesting avenue for future research could therefore be to test other, more frequent corporate communication, such as earnings reports, quarterly reports, or Form 8-K filings, which could disentangle the frequency concern from considerations around the limited informational content of such documents. Regarding the performance difference between blogs and news, a potential explanation in our estimation is the increased noise created by bloggers' speculative attempts to beat the market. Additionally, since these articles are usually framed as opinion pieces, their average news content should be lower than that of news articles.

For most language tasks, larger transformer models tend to outperform smaller ones; for example, BERT-Large outperforms BERT-Base in the seminal BERT paper (Devlin et al. 2018). We, however, find that no model, including various "large" models, could outperform a BERT-Base model. Moreover, we achieved the best results with just one epoch of fine-tuning the language model on the stock performance classification task. A potential explanation is that the classification task at hand is susceptible to overfitting. Another open question concerns the length of the training and test periods. In our approach, we used a simple heuristic based on the amount of available data. Including news data from the 1980s, for example, is unlikely to improve the model's predictive performance for today; it may even introduce noise or outdated information that decreases performance. On the other hand, a longer training period could prevent overfitting on short-term trends that do not reflect fundamental values. Similarly, for the test period, it remains unclear to what extent the performance differences between the predicted groups persist beyond the one-year horizon.
References

Ahmad et al. (2016). Media-expressed negative tone and firm-level stock returns.
Araci (2019). FinBERT: Financial Sentiment Analysis with Pre-trained Language Models.
BERT is the Word: Predicting Stock Prices with Language Models.
Chan et al. (2020). German's Next Language Model.
Chen (2021). Stock Movement Prediction with Financial News using Contextualized Embedding from BERT.
Clark et al. (2020). ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.
Cutler et al. (1989). What moves stock prices?
Devlin et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Duz Tan and Tas (2021). Social Media Sentiment in International Stock Returns and Trading Activity.
Fama (1970). Efficient Capital Markets: A Review of Theory and Empirical Work.
Gonzalez-Carvajal and Garrido-Merchan (2020). Comparing BERT against traditional machine learning text classification.
Groß-Klußmann and Hautsch (2011). When machines read the news: Using automated text analytics to quantify high frequency news-implied market reactions.
Hillert et al. (2014). Media Makes Momentum.
Hoberg and Phillips (2016). Text-Based Network Industries and Endogenous Product Differentiation.
Jegadeesh and Titman (2001). Profitability of Momentum Strategies: An Evaluation of Alternative Explanations.
Jiang et al. (2019). Manager sentiment and stock returns.
Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
Liu et al. (2020). FinBERT: A Pretrained Financial Language Representation Model for Financial Text Mining.
Loughran and McDonald (2016). Textual Analysis in Accounting and Finance: A Survey.
Loughran and McDonald (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.
Peters et al. (2018). Deep contextualized word representations.
Sawhney et al. (2020). Deep Attentive Learning for Stock Movement Prediction From Social Media Text and Company Correlations.
Sivri et al. (2022). From Statistical to Deep Learning Models: A Comparative Sentiment Analysis Over Commodity News.
Sonkiya et al. (2021). Stock price prediction using BERT and GAN.
Sprenger et al. (2014). Tweets and Trades: the Information Content of Stock Microblogs.
Tetlock (2007). Giving Content to Investor Sentiment: The Role of Media in the Stock Market.
Yuan et al. (2021). Transformer-Based Counterfactual Augmentation for Earnings Call Analysis.