key: cord-209932-1lsv7cel authors: Challet, Damien; Ayed, Ahmed Bel Hadj title: Predicting financial markets with Google Trends and not so random keywords date: 2013-07-17 journal: nan DOI: nan sha: doc_id: 209932 cord_uid: 1lsv7cel

We check the claims that data from Google Trends contain enough information to predict future financial index returns. We first discuss the many subtle (and less subtle) biases that may affect the backtest of a trading strategy, particularly when based on such data. As expected, the choice of keywords is crucial: by using an industry-grade backtesting system, we verify that random finance-related keywords do not contain more exploitable predictive information than random keywords related to illnesses, classic cars and arcade games. We show, however, that other keywords applied to suitable assets yield robustly profitable strategies, thereby confirming the intuition of Preis et al. (2013).

Taking the pulse of society with unprecedented frequency and accuracy is becoming possible thanks to data from various websites. In particular, data from Google Trends (GT hereafter) report the historical search volume interest (SVI) of given keywords and have been used to predict the present [7] (called nowcasting in [5] ), that is, to improve estimates of quantities that are still being formed but whose figures are only revealed at the end of a given period. These include unemployment, travel and consumer confidence figures [7] , quarterly company earnings (from searches about their salient products) [8] , GDP estimates [5] and influenza epidemics [15] . Asset prices are determined by traders. Some traders look for, share and ultimately create information on a variety of websites. Therefore asset prices should be related to the behavior of website users.
This syllogism has been investigated in detail in [9] : the price returns of the components of the Russell 3000 index are regressed on many factors, including GT data, and these factors are averaged over all of the 3000 assets. Interestingly, the authors find inter alia a significant correlation between changes in SVI and individual investors' trading activity. In addition, on average, variations of SVI are negatively correlated with price returns over a few weeks during the period studied (i.e., in sample). The need to average over many stocks is due to the amount of noise in both price returns and GT data, and to the fact that only a small fraction of the people who search for given keywords actually trade later. The claim of [24] is much stronger: it states that future returns of the Dow Jones Industrial Average are negatively correlated with SVI surprises related to some keywords, hence that GT data contain enough information to predict financial indices. Several subtle (and not so subtle) biases prevent their conclusions from being as forceful as they could be. Using a robust backtest system, we are able to confirm that GT data can be used to predict future asset price returns, thereby placing their conclusions on a much more robust footing.

Raw asset prices are well described by suitable random walks that contain no predictability whatsoever. However, they may be predictable if one is able to determine a set of conditions using either only asset returns (see e.g. [21] for conditions based on asset cross-correlations) or external sources of information. Google Trends provides normalized time series of the number of searches for given keywords with a weekly time resolution [28] , denoted by $v_t$. [24] propose the following trading strategy: defining the previous baseline search interest as the moving average $\bar v_t = \frac{1}{k}\sum_{i=0}^{k-1} v_{t-i}$, one goes short during the following week if $v_t > \bar v_{t-1}$ and long otherwise (this is the rule implemented by the code given in the appendix). Nothing prevents one from considering the inverse strategy, but average price reversion over the next one or two weeks with respect to a change of SVI was already noticed by other authors [9, 11] .
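As a minimal sketch of this rule, assuming the conventions of the R code given in the appendix (short when the SVI exceeds its lagged moving average, long otherwise), the snippet below applies it to synthetic data with base R only. The function names `trailing_mean` and `strategy_returns` and the random inputs are illustrative assumptions, not part of the original study.

```r
# Illustrative sketch on synthetic data; all names and inputs are hypothetical.
set.seed(1)

# Right-aligned moving average: element t averages x[t-k+1], ..., x[t]
trailing_mean = function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

# Weekly returns of the rule: short after a positive SVI surprise, long otherwise.
# v[t] is the SVI of week t, r_next[t] the asset return over week t+1.
strategy_returns = function(v, r_next, k = 5) {
  vbar = trailing_mean(v, k)
  vbar_lagged = c(NA, head(vbar, -1))   # \bar v_{t-1}
  pos = 2 * (v > vbar_lagged) - 1       # +1 if v_t > \bar v_{t-1}, else -1
  perf = -pos * r_next                  # the rule bets on a reversal
  perf[!is.na(perf)]                    # drop the first k weeks (no average yet)
}

n = 200
v = round(50 + 10 * rnorm(n))     # stand-in for weekly SVI
r_next = rnorm(n, sd = 0.02)      # stand-in for next-week returns
perf = strategy_returns(v, r_next, k = 5)
tstat = mean(perf) / sd(perf) * sqrt(length(perf))
print(tstat)
```

Since the inputs here are independent random series, the printed t-stat carries no meaning beyond showing how the quantity is computed; with real data, `v` and `r_next` must be aligned so that the position only uses information available at the time of trading.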
Instead of trying to predict the Dow Jones Industrial Average index, we use the time series of SPY, which mirrors the Standard & Poor's 500 index. This provides a weak form of cross-validation, the two time series being highly correlated but not identical. For the same reason, we compute returns from Monday to Friday close prices instead of Monday to Monday, which keeps index returns in sync with GT data (which range from Sundays to Saturdays).

Prediction is hard, especially about the future. But prediction about the future in the past is even harder. This applies in particular to the backtesting of a trading strategy, that is, to the computation of its virtual gains in the past. It is prone to many kinds of biases that may significantly alter its reliability, often positively [14, 20] . Most of them are due to the regrettable and possibly inevitable tendency of the future to creep into the past.

Tool bias is the most overlooked bias. It explains in part why backtest performances are often very good in the 80s and 90s, but less impressive since about 2003, even when one accounts for realistic estimates of total transaction costs. Finding predictability in old data with modern tools is indeed easier than it ought to be. Think of applying CPU- or memory-intensive methods to pre-computer-era data. The best known law of the increase of computational power is named after Gordon Moore, who noticed that the optimal number of transistors in integrated circuits increases exponentially with time (with a doubling time $\tau \simeq 2$ years) [23] . But other important aspects of computation have been improving exponentially with time, so far, such as the amount of computing per unit of energy (Koomey's law, $\tau \simeq 1.5$ years [18] ) or the price of storage (Kryder's law, $\tau \simeq 2$ years [19] ). Remarkably, these technological advances are mirrored by the evolution of a minimal reaction timescale in financial data [16] .
In addition, the recent ability to summon and unleash almost at once deluges of massive cloud computing power on large data sets has changed the ways financial data can be analyzed. It is very hard to account for this bias. For educational purposes, one can familiarize oneself with past computer abilities with virtual machines such as qemu [2] tuned to emulate the speed and memory of computers available at a given time for a given sum of money. The same kind of bias extends to progress in the statistics and machine learning literature, and even to the way one understands market dynamics: using a particular method is likely to give better results before its publication than, say, one or two years later. One can stretch this argument to the historicity of the methods tested on financial data at any given time because they follow fashions. At any rate, this is an aspect of backtesting that deserves a more systematic study.

Data are biased in two ways. First, when backtesting a strategy that depends on external signals, one must first ask oneself whether the signal was actually available at the dates that it contains. GT data were not reliably available before 6 August 2008, being updated randomly every few months [27] . Backtests at earlier dates include an inevitable part of science fiction, but are still useful to calibrate strategies. The second problem is that data are revised, for several reasons. Raw financial data often contain gross errors (erroneous or missing prices, volumes, etc.), but these are the data one would have had to use in the past. Historical data downloaded afterwards have often been partly cleaned. [10] give good advice about high-frequency data cleaning. Revisions are also very common for macro-economic data. For example, Gross Domestic Product estimates are revised several times before the definitive figure is reached (about revision predictability, see e.g. [13] ).
More perversely, data revision includes format changes: the type of data that GT returns was tweaked at the end of 2012. It used to be made of real numbers whose normalization was not completely transparent; it also gave uncertainties on these numbers. Quite consistently, the numbers themselves would change within the given error bars every time one downloaded data for the same keyword. Nowadays, GT returns integer numbers between 0 and 100, 100 being the maximum of the time series and 0 its minimum; small changes of GT data are therefore hidden by the rounding process; error bars are no longer available, but it is fair to assume that a fluctuation of ±1 should be considered irrelevant. In passing, the process of rounding final decimals of prices sometimes introduces spurious predictability, which is well known for FX data [17] .

Revised data also concern the investible universe. Freely available historical data do not include delisted stocks. This is a real problem as assets come and go at a rather steady rate: today's set of investible assets is not the same as last week's. Accordingly, components of indices also change. Analyzing the past behavior of the components of today's index is a common way to force-feed a backtest with future information and therefore has an official name: survivor(ship) bias. It is known to bias measures of average performance considerably. For instance [14] shows that it causes an overestimation of backtest performance in 90% of the cases of long-only portfolios in a well chosen period. This is consistent since, by definition, companies that have survived have done well.
Early concerns were about the performance of mutual funds, and various methods have been devised to estimate the strength of this bias given the survival fraction of funds [3, 12] . Finally, one must mention that backtesting strategies on untradable indices, such as the Nasdaq Composite Index, is not a wise idea since no one could even try to remove predictability from them.

Which keywords to choose is of course a crucial ingredient when using GT for prediction. It seems natural to think that keywords related to finance are more likely to be related to financial indices, hence to be more predictive. Accordingly, [24] build a keyword list from the Financial Times, a financial newspaper, aiming at biasing the keyword set. But this bias needs to be controlled with a set of random keywords unrelated to finance, which was neglected. Imagine indeed that some word related to finance was the most relevant in the in-sample window. Our brain is hardwired to find a story that justifies this apparent good performance. Statistics is not: to test that the average performance of a trading strategy is different from zero, one uses a t-test, whose result will be called t-stat in the following, and is defined as $z = \frac{\mu}{\sigma}\sqrt{N}$, where $\mu$ stands for the average of strategy returns, $\sigma$ their standard deviation and $N$ is the number of returns; for $N > 20$, $z$ looks very much like a Gaussian variable with zero average and unit variance. [24] wisely compute t-stats: the best keyword, debt, has a t-stat of 2.3. The second best keyword is color and has a t-stat of 2.2. Both figures are statistically indistinguishable, but debt is commented upon in the paper and in the press; color is not, despite having equivalent "predictive" power.

Let us now play with random keywords that were known before the start of the backtest period (2004).
We collected GT data for 200 common medical conditions/ailments/illnesses, 100 classic cars and 100 all-time best arcade games (reported in appendix A) and applied the strategy described above with k = 10 instead of k = 5. Table I reports the t-stats of the best 3 positive and negative performances (the latter can be made positive by inverting the prescription of the strategy) for each set of keywords. We leave the reader to ponder what (s)he would have concluded if bone cancer or Moon Patrol had been more finance-related. This table also illustrates that the best t-stats reported in [24] are not significantly different from what one would obtain by chance: the t-stats reported here being mostly equivalent to Gaussian variables, one expects 5% of their absolute values to be larger than 1.96, which explains why keywords such as color also have a good t-stat. Finally, debt is not among the three best keywords when applied to SPY from Monday to Friday: its performance is unremarkable and unstable, as shown in more detail below.

Nevertheless, the t-stats of finance-related terms reported in [24] are biased towards positive values, which is compatible with the reversal observed in [9, 11] , and with the results of Table I. This may show that the proposed strategy is able to extract some amount of the possibly weak information contained in GT data. Another explanation for this bias could have been coding errors (this is not the case). Time series prediction is easy when one mistakenly uses future data as current data in a program, e.g. by incorrectly shifting time series; we give the code used in the appendix. A very simple and effective way of avoiding this problem is to replace, alternately, price returns and external data (GT here) with random time series. If backtests persist in giving positive performance, there are bugs somewhere.
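The random-series check can be sketched as follows, under the assumption that the strategy is the moving-average rule of the appendix. Both the search data and the returns are replaced by independent Gaussian noise, so that the resulting t-stats should behave like standard Gaussian variables, with roughly 5% of them exceeding 1.96 in absolute value. The helper names `trailing_mean` and `rule_tstat` and all parameter values are hypothetical.

```r
# Sanity check on pure noise; all names and parameter values are illustrative.
set.seed(42)

# Right-aligned moving average over k points
trailing_mean = function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

# t-stat of the rule "short if v_t > \bar v_{t-1}, long otherwise"
rule_tstat = function(v, r_next, k = 10) {
  vbar_lagged = c(NA, head(trailing_mean(v, k), -1))
  perf = -(2 * (v > vbar_lagged) - 1) * r_next
  perf = perf[!is.na(perf)]
  mean(perf) / sd(perf) * sqrt(length(perf))
}

n_weeks = 400
n_trials = 500
tstats = replicate(n_trials, {
  v = rnorm(n_weeks)                   # random stand-in for the SVI
  r_next = rnorm(n_weeks, sd = 0.02)   # random stand-in for next-week returns
  rule_tstat(v, r_next)
})

# Fraction of "significant" results obtained from noise alone
frac_significant = mean(abs(tstats) > 1.96)
print(frac_significant)
print(mean(tstats))
```

A fraction far above 5%, or an average t-stat clearly different from zero, would point to a misalignment between signal and returns in the backtest code rather than to genuine predictability.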
The aim of [24] was probably not to provide us with a profitable trading strategy, but to attempt to illustrate the relationship between collective searches and future financial returns. It is however striking that no in- and out-of-sample periods are considered (this is surprisingly but decreasingly common in the literature). We therefore cannot assess the trading performance of the proposed strategy, which can only be judged by its robustness and consistency out-of-sample, that is, by both the information content and the viability of the strategy. We refer the reader to [20] for an entertaining account of the importance of in- and out-of-sample periods.

F. Keywords from the future

[24] use keywords that have been taken from the editions of the FT dated from August 2004 to June 2011, determined ex post. This means that keywords from 2011 editions are used to backtest returns in e.g. 2004. Therefore, the set of keywords injects information about the future into the past. A more robust solution would have been to use editions of the FT available at or before the time at which the performance evaluation took place. This is why we considered sets of keywords known before 2004.

G. Parameter tuning/data snooping

Each set of parameters, which include keywords, defines one or more trading strategies. Trying to optimize parameters or keywords is called data snooping and is bound to lead to unsatisfactory out-of-sample performance. When backtest results are presented, it is often impossible for the reader to know if the results suffer from data snooping. A simple remedy is not to touch a fraction of historical data when testing strategies and then to use it to assess the consistency of performance (cross-validation) [14] . More sophisticated remedies include White's reality check [26] (see e.g. [25] for an application of this method). Data snooping is equivalent to having no out-of-sample period, even when backtests are properly done with sliding in- and out-of-sample periods.
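The effect of data snooping and the hold-out remedy can be illustrated with a toy experiment, sketched here under the assumption that each keyword yields a series of weekly strategy returns that is pure noise. The best keyword is selected on an in-sample window and then re-evaluated on data that were never touched. The sizes, the split and all variable names are hypothetical choices.

```r
# Toy data-snooping experiment on noise; all names and sizes are illustrative.
set.seed(7)

tstat = function(x) mean(x) / sd(x) * sqrt(length(x))

n_weeks = 400
n_keywords = 100
# One column per keyword: weekly strategy returns, pure noise by construction
perf = matrix(rnorm(n_weeks * n_keywords, sd = 0.02), nrow = n_weeks)

in_sample = 1:300      # used to pick the "best" keyword
out_sample = 301:400   # left untouched until the final check

t_in = apply(perf[in_sample, ], 2, tstat)
best = which.max(t_in)

t_best_in = t_in[best]                       # inflated by the selection
t_best_out = tstat(perf[out_sample, best])   # honest estimate
print(c(in_sample = t_best_in, out_of_sample = t_best_out))
```

Because the best of 100 noise series is selected, its in-sample t-stat is large by construction, whereas its out-of-sample t-stat is just one more draw from the null distribution.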
Let us perform some in-sample parameter tuning. Once the financial asset has been chosen, the proposed strategy has only one parameter: the number of time-steps $k$ over which the moving average $\bar v_t$ is computed. Figure 1 reports the t-stat of the performance associated with the keyword debt as a function of $k$. Its sign is relatively robust against changes over the range $k \in \{2, \dots, 30\}$ but its typical value in this interval is not particularly exceptional (between 1 and 2). Let us now take the absolute best keyword from the four sets, Moon Patrol. Both the values and the stability range of its t-stat are much better than those of debt (see Figure 2 ), but this is most likely due to pure chance. There is therefore no reason to trust one keyword more than the other.

Assuming an average cost of 2 bps (0.02%) per trade, 104 trades per year and 8 years of trading (2004-2011), cumulated transaction fees amount to about 0.02% × 104 × 8 ≈ 17%, i.e. they diminish the performance associated with any keyword by about 20%. As a beneficial side effect, periods of flat performance without fees become periods of negative performance once transaction costs are accounted for, which provides more realistic expectations. Costs related to spread and price impact should also be included in a proper backtest.

Given the many methodological weaknesses listed above, one may come to doubt the conclusions of [24] . We show here that they are correct. The first step is to avoid the methodological problems listed above. One of us has used an industrial-grade backtest system and more sophisticated strategies (which therefore cause tool bias). First, let us compare the resulting cumulated performance of the three random keyword sets that we defined, plus the set of keywords from the Financial Times. For each set of keywords, we choose as inputs the raw SVI, lagged SVI, and various moving averages of SVI, together with past index returns. It turns out that none of the keyword sets brings information able to predict index movements significantly (see Fig. 4 ).
This is not incompatible with the results of [9, 11, 24] . It simply means that the signal is probably too weak to be exploitable in practice. The final part of the performances is of course appealing, but this comes from the fact that Monday close to Friday close SPY returns have been mostly positive during this period: any machine learning algorithm applied to returns alone would likely yield the same result. So far we can only conclude that a given proper (and not overly stringent) backtest system was not able to find any exploitable information in the four keyword sets, not that the keyword sets do not contain enough predictive information. To conclude, we use the same backtest system with some GT data, with exactly the same parameters and input types as before. The resulting preliminary performance, reported in Fig. 4 , is more promising and shows that there really is consistently some predictive information in GT data. It is not particularly impressive when compared to the performance of SPY itself, but is nevertheless interesting since the net exposure is always close to zero (see [6] for more information).

Sophisticated methods coupled with careful backtests are needed to show that Google Trends contains enough exploitable information. This is because such data include too many searches that are probably unrelated to the financial assets for a given keyword, and even more that are unrelated to actual trading. When one restricts the searches by providing more keywords, GT data often only contain information at a monthly time scale, or no information at all. If one goes back to the algorithm proposed by [24] and the compatible findings of [9, 11] , it is hard to understand why future prices should systematically revert one week after a positive SVI surprise, and vice versa. The reversal is weak and only valid on average. It may be the most frequent outcome, but profitability is much higher if one knows what triggers reversal or trend following.
There is some evidence that supplementing GT data with news leads to much improved trading performance (see e.g. [4] ). Another paper by the same group suggests a much more promising source of information: it links the changes in the number of visits on Wikipedia pages of given companies to future index returns [22] . Further work will investigate the predictive power of this type of data.

We acknowledge stimulating discussions with Frédéric Abergel, Marouanne Anane and Thierry Bochud.

[2] Qemu, a fast and portable dynamic translator. USENIX, 2005. URL www.qemu.org
[3] Survivorship bias in performance studies
[4] Quant 3.0 - harnessing the mood of the web
[5] Nowcasting is not just contemporaneous forecasting
[6] Encelade Capital Internal Report. Final version available on request
[7] Predicting the present with Google Trends
[8] In search of earnings predictability
[9] In search of attention
[10] An Introduction to High-Frequency Finance
[11] Measuring economic uncertainty and its impact on the stock market
[12] Survivor bias and mutual fund performance
[13] News and noise in G-7 GDP announcements
[14] Behind the smoke and mirrors: Gauging the integrity of investment simulations
[15] Detecting influenza epidemics using search engine query data
[16] Critical reflexivity in financial markets: a Hawkes process analysis
[18] Implications of historical trends in the electrical efficiency of computing
[19] After hard drives - what comes next? Magnetics
[20] Stupid data miner tricks: overfitting the S&P 500
[21] Dissecting financial markets: sectors and states
[22] Quantifying Wikipedia usage patterns before stock market moves
[23] Cramming more components onto integrated circuits. Electronics
[24] Quantifying trading behavior in financial markets using Google Trends
[25] Data-snooping, technical trading rule performance, and the bootstrap
[26] A reality check for data snooping
[27] Google trends - Wikipedia, The Free Encyclopedia
[28] When requesting data restricted to a given quarter, GT returns daily data

We have downloaded GT data for the following keywords, without any manual editing.

Illnesses: Kidney stone, Leukemia, Liver tumour, Lung cancer, Malaria, Melena, Memory Loss, Menopause, Mesothelioma, Migraine, Miscarriage, Mucus In Stool, Multiple sclerosis, Muscle Cramps, Muscle Fatigue, Muscle Pain, Myocardial infarction, Nail Biting, Narcissistic personality disorder, Neck Pain, Obesity, Obsessive-compulsive disorder, Osteoarthritis, Osteomyelitis, Osteoporosis, Ovarian cancer, Pain, Panic attack, Paranoid personality disorder, Parkinson's disease, Penis Enlargement, Peptic ulcer, Peripheral artery occlusive disease, Personality disorder, Pervasive developmental disorder, Peyronie's disease, Phobia, Pneumonia, Poliomyelitis, Polycystic ovary syndrome, Post-nasal drip, Post-traumatic stress disorder, Premature birth, Premenstrual syndrome, Propecia, Prostate cancer, Psoriasis, Reactive attachment disorder, Renal failure, Restless legs syndrome, Rheumatic fever, Rheumatoid arthritis, Rosacea, Rotator Cuff, Scabies, Scars, Schizoid personality disorder, Schizophrenia, Sciatica, Severe acute respiratory syndrome, Sexually transmitted disease, Sinusitis, Skin Eruptions, Skin cancer, Sleep disorder, Smallpox, Snoring, Social anxiety disorder, Staph infection, Stomach cancer, Strep throat, Sudden infant death syndrome, Sunburn, Syphilis, Systemic lupus erythematosus, Tennis elbow, Termination Of Pregnancy

Classic cars: Iso Griffo A3L, Ferrari 275 GTB/4, 1967 Shelby Mustang KR500, Corvette Stingray, Dodge Challenger, Dodge Charger, Dodge Dart Swinger, Facel Vega Facel II, Ferrari 250, Ferrari 250 GTO, Ferrari 250 GTO, Ferrari 275, Ferrari Daytona

Arcade games: Missile Command, Moon Buggy, Moon Patrol, Ms. Pac-Man, Naughty Boy, Pac-Man, Paperboy, Pengo, Pitfall!, Pole Position, Pong, Popeye, Punch-Out!!, Q*bert

Here is a simple implementation in R of the strategy given in [24] . We do mean "=" instead of "<-".

    library(zoo)  # provides rollmeanr() and lag() for zoo objects

    # loadGTdata(), loadYahooData() and getFutureReturns() are helper functions
    # that are not listed here
    computePerfStats=function(filename, k=10, getPerf=FALSE){
      gtdata=loadGTdata(filename)
      if(is.null(gtdata) || length(gtdata)<100){
        return(NULL)
      }
      spy=loadYahooData('SPY')
      spy_rets=getFutureReturns(spy)          # spy_rets is a zoo object, contains r_{t+1}
      gtdata_mean=rollmeanr(gtdata,k)         # \bar v_t
      gtdata_mean_lagged=lag(gtdata_mean,-1)  # \bar v_{t-1}
      pos=2*(gtdata>gtdata_mean_lagged)-1     # +1 if v_t > \bar v_{t-1}, -1 otherwise
      perf=-pos*spy_rets                      # short after a positive SVI surprise
      perf=perf[which(!is.na(perf))]
      if(getPerf){
        return(perf)
      }else{
        return(t.test(perf)$statistic)
      }
    }