title: Learning from pandemics: using extraordinary events can improve disease now-casting models
authors: Mesquita, S.; Vieira, C. H.; Perfeito, L.; Goncalves Sa, J.
date: 2021-01-20
DOI: 10.1101/2021.01.18.21250056

Online searches have been used to study different health-related behaviours, including monitoring disease outbreaks. An obvious caveat is that several reasons can motivate individuals to seek online information, and models that are blind to people's motivations are of limited use and can even mislead. This is particularly true during extraordinary public health crises, such as the ongoing pandemic, when fear, curiosity, and many other reasons can lead individuals to search for health-related information, masking the disease-driven searches. However, health crises can also offer an opportunity to disentangle the different drivers and learn about human behavior. Here, we focus on the two pandemics of the 21st century (2009-H1N1 flu and Covid-19) and propose a methodology to discriminate between search patterns linked to general information seeking (media driven) and search patterns possibly more associated with actual infection (disease driven). We show that by learning from such pandemic periods, with high anxiety and media hype, it is possible to select online searches and improve model performance in both pandemic and seasonal settings. Moreover, and despite the common claim that more data is always better, our results indicate that a lower volume of the right data can be better than a large volume of apparently similar data, especially in the long run. Our work provides a general framework that can be applied beyond specific events and diseases, and argues that algorithms can be improved simply by using less (better) data. This has important consequences, for example, for the accuracy-explainability trade-off in machine learning.

Infectious diseases pose great health risks to human populations worldwide. To mitigate these risks, public health institutions have set up surveillance systems that attempt to rapidly and accurately detect disease outbreaks. These systems typically include sentinel doctors and testing labs, and enable a timely response which can limit and even stop outbreaks. However, even when in place, detection and mitigation mechanisms can fail, leading to epidemics and, more rarely, pandemics, as we are currently experiencing. In fact, disease surveillance mechanisms that rely only on highly trained personnel are typically expensive, limited, and slow. It has been extensively argued that these should be complemented with "Digital Era" tools, such as online information, mobility patterns, or digital contact-tracing 1-3. Online behaviours, such as searches on Google, have proven to be very relevant tools, as health information seeking is a prevalent habit of online users 4. This methodology has been applied to follow other epidemics, such as Dengue 5-7, Avian Influenza 8, and Zika 9. In the case of Influenza, a very common infectious disease, the potential of online-based surveillance methods gained broad support with the launch of Google Flu Trends (GFT) in 2008 10.
GFT attempted to predict the timing and magnitude of influenza activity by aggregating flu-related search trends and, contrary to traditional surveillance methods, provided reports in near real-time 11, without the need for data on clinical visits and lab reports. More recently, many others have found strong evidence that the collective search activity of flu-infected individuals seeking health information online provides a representative signal of flu activity 12-16. However, flu infection is not the sole (and perhaps not even the strongest) motivation for individuals to seek flu-related information online 17. This is particularly true during extraordinary times, such as pandemics, when it is reasonable to expect individuals to have various degrees of interest, ranging from curiosity to fear to actual disease 18. In fact, the GFT model missed the first wave of the 2009 flu pandemic and overestimated the magnitude of the severe 2013 seasonal flu outbreak in the USA 17,19. This led many authors to suggest that high media activity can lead to abnormal Google search trends, possibly leading to estimation errors 17,20-24. This "media effect" was also observed by others studying Zika 25-27, and contributed to the disenchantment with the potential of such tools, particularly during such "extraordinary times". However, if we could decouple searches mostly driven by media, anxiety, or curiosity from the ones related to actual disease, we could not only improve disease monitoring but also deepen our understanding of online human behavior. In the case of Google search trends, identifying which terms are more correlated with media exposure and reducing their influence on the model is crucial to correcting past errors. In this paper, we propose that the characteristics that make pandemics unique and hard to now-cast, such as media hype, can also be used as opportunities, for two main reasons: 1) as pandemics tend to exacerbate behaviors, the noise (media) is of the same order of magnitude as the signal (cases), making it more visible and allowing us to discriminate between the two; and 2) information seeking becomes less common as the pandemic progresses 18,28, and these different dynamics can be used when selecting the search terms. In fact, instead of ignoring pandemic periods, studying what happens during the worst possible moment can help us understand which search terms are more associated with the disease and which were prompted by media exposure. This solution might avoid over-fitting and make the predictive model more robust over time, especially during seasonal events. Therefore, we focus on the only two WHO-declared pandemics of the 21st century and aim to learn from pandemics to now-cast seasonal epidemics (or secondary waves of the same pandemic), improving current models by incorporating insights from information-seeking behavior. The first pandemic of the 21st century was caused by an Influenza A(H1N1)09pdm strain (pH1N1), which emerged in Mexico in February 2009 29. By June 2009, pH1N1 had spread globally, with around 30,000 confirmed cases in 74 countries. In most countries, pH1N1 displayed bi-phasic activity: a spring-summer wave and a fall-winter wave 30,31. The fall-winter wave was overall more severe than the spring-summer wave, as it coincided with the common flu season (in the Northern Hemisphere), which typically provides optimal conditions for flu transmission 32.
The pandemic was officially declared to be over in August 2010 and a total of 18,449 laboratory-confirmed pH1N1-attributable deaths were counted (WHO, 2009). This number was later revised and pH1N1-associated mortality is now believed to have been 15 times higher than the original official number 33. The second pandemic of this century was caused by the SARS-CoV-2 virus, first identified on the last day of 2019 in the Chinese city of Wuhan. To date, Covid-19 has infected more than 78 million people and killed more than 1.7 million people worldwide. Both Covid-19 and influenza viruses cause respiratory diseases with manifestations ranging from asymptomatic or mild to severe disease and death. They share a range of symptoms and trigger similar public health measures due to common transmission mechanisms. Both pandemics have led to a great surge in media reports and public attention across many platforms, from traditional to online social media. However, there are several differences between the two pandemics: there is still a lot of uncertainty and lack of knowledge surrounding the SARS-CoV-2 virus, including its lethality (although it is certainly higher than that of the flu for older age groups), whether it displays seasonal behaviour, its transmission frequency and patterns, whether infection confers lasting immunity, or what its long-term health effects are, respiratory or otherwise 34-38. Moreover, the Covid-19 pandemic led to unique public health measures and what might be considered the largest lockdown in history, with authorities implementing several preventive measures, from social distancing to isolating entire countries. These restrictions have been instrumental in reducing the impact of the pandemic, but most decision-makers acknowledged the need to loosen the confinement measures. In the interest of economic and social needs, several countries re-opened schools and businesses, and many experienced surges in cases and deaths 39, often referred to as second and even third waves. At this point, and as vaccines start to be distributed, mostly in developed countries, all tools that can help identify outbreaks are of utmost importance, and different countries are deploying different measures, such as conditional movement and contact-tracing apps. For all these reasons, improving fast, online surveillance is even more crucial now than it was in 2009, and there are already several studies on using online data to explain and forecast Covid-19 dynamics 40-45. However, and despite its potential, separating media hype from reporting of actual disease cases (be it on Google, Facebook, or any other platform), and understanding their impact on collective attention, has been considered a huge challenge. One of the main reasons is that these patterns are intertwined with the actual spread of a disease within a population. Therefore, we learn from the 2009 flu pandemic and propose a system to improve the signal-to-noise ratio of online searches and now-cast the current Covid-19 pandemic.
The 2009 influenza pandemic offers a great case study as it was extensively researched: precise signals of pandemic flu infections were obtained through large-scale laboratory confirmations 46; several studies analyzed the media's behaviour during the pandemic 47-49, including the collection of news pieces and news counts; and, as the pandemic emerged at a time of widespread Internet usage 50, several online datasets are available (including the collective behaviour of millions of users through their search trends on Google). Building on these datasets and adding insights from human behaviour, we apply our framework to the current Covid-19 pandemic and provide a robust and possibly generalizable system. Improving the signal (disease) to noise ratio is fundamental in disease surveillance. As extraordinary events, such as pandemics, tend to become the dominant story nearly everywhere, fear and curiosity can increase, and so do searches for information. First, we asked whether there is a correspondence between the number of cases (for both the 2009 flu and Covid-19), media reports, and searches on Google for disease-related terms ("flu" and "Covid-19", respectively). We focused on the US in the case of the 2009 flu and on Spain during the Covid-19 pandemic. Both countries had a large number of cases and good data availability and, in the case of Spain, already a strong second Covid-19 wave, as detailed in the Methods. Figure 1 shows the number of confirmed infections, news mentions, and GT searches in the United States for the 2009 pandemic (a) and in Spain for the current one (b). Since news now travels faster than pathogenic agents, news coverage of the 2009 flu pandemic (figure 1a) peaked in the last week of April, while the first peak in cases happened later, at the end of June. More relevant is that, by the time H1N1 infections reached their highest peak in the US (in October/November, during the regular flu season), the frequency of online searches for "flu" and of news mentions had significantly decreased. In the case of the Covid-19 pandemic (figure 1b), early news mentions began in late 2019, when the disease was identified in China, but the first cases in Spain were only identified in February 2020 (for a similar analysis of the US case, see the supplementary materials). As observed before, there was a disconnect between the intensity of the disease and both its visibility in the media and the volume of Google searches 17,19, raising the important question of whether we can discriminate between different drivers of online searches.
Figure 1 (caption, partial): [...] and Google Trends searches for the term "Covid-19" (pink) between February and November 2020. All datasets are normalized to their highest value in the period. We can see a quick increase in media activity in both situations that precedes the number of cases of infection. In both panels, searches for the terms 'flu' or 'Covid-19' display a pattern more similar to the media activity trend (Pearson correlation between the search term and media of 0.85 for the flu pandemic and 0.44 for the Covid-19 pandemic, compared to 0.27 and -0.03 between the search term and cases of infection, respectively).
Given that the searches for "flu" and "Covid-19" do not closely follow the variation in the number of confirmed cases, we asked whether we could identify particular search terms with a higher correlation with the disease progression.
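For illustration, the Figure 1 comparison can be sketched as follows (a minimal example, not the authors' code; the column names 'cases', 'news', and 'searches' are assumed placeholders):

```python
import pandas as pd

def normalize(series: pd.Series) -> pd.Series:
    """Rescale a weekly series to its maximum value in the period."""
    return series / series.max()

def compare_drivers(weekly: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlations between cases, news counts, and searches.

    'weekly' is assumed to hold one row per week, with columns
    'cases', 'news' and 'searches' (e.g. the GT series for "flu").
    """
    norm = weekly[["cases", "news", "searches"]].apply(normalize)
    return norm.corr(method="pearson")
```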
We started by selecting a large number of search terms related to each disease (see supplementary materials for the full list), all of which could a priori be considered useful for now-casting. Using hierarchical clustering, we identified three distinct clusters in both the 2009 flu and Covid-19 (figures 2a and 2d). Figures 2b and 2e show the centroids of each cluster, revealing the existence of different dynamics. In the case of the flu in the US, one cluster has a strong peak in the second half of 2009, another has the strongest (almost unique) peak in the first half, and a third cluster has much less clearly defined peaks (figure 2b). The first cluster (orange) shows a strong correlation with the number of pH1N1 confirmed cases (r = 0.78, p = 4 × 10⁻¹⁶) and a lower correlation with media (r = 0.60, p = 2 × 10⁻⁸), while the second cluster (purple) has the opposite trend (figure 2c; r = 0.16, p = 0.2 with pH1N1 cases and r = 0.83, p = 3 × 10⁻²⁰ with media). The third has an intermediate correlation with flu cases and a poor correlation with media reports. As an additional test, we asked whether there was evidence that cases or media preceded any of the clusters. We performed a Granger causality test and show that media precedes cluster 2 but not cluster 1 (supplementary materials). Neither cases nor media showed significant results for clusters 1 or 3. The grouping of the search terms is not intuitive from their meaning. Interestingly, there is no clear pattern in the search terms that could have indicated that some would be more correlated with cases or media attention. For example, symptoms such as 'fever' or 'cough' appear in cluster 3, together with 'Guillain-Barré syndrome' and 'disinfectant', while cluster 1 contains 'vaccine' and 'treatment' along with the strain of the virus and 'hand sanitizer'. In the case of Covid-19, the clusters are not so well defined, as shown by the smaller relative length of the internal branches of the clustering dendrogram (figure 2d). This is likely due to a) the smaller time frame considered (roughly half of that of H1N1 - figure 1), b) the lower search volume, explained by the much smaller population of Spain when compared to the US, and c) the real-time nature of the analysis. Still, we could identify three clear clusters and a very similar pattern (figure 2e): the first cluster (again orange) shows two broad peaks, the second larger than the first. The second cluster (purple) shows a clear first plateau, between March and May 2020, and the third cluster (green) a much sharper peak, encompassing little over one month. When we repeated the correlation analysis, we again identified a cluster (C1, orange) that strongly correlates with the number of cases (r = 0.71, p = 8 × 10⁻⁶) but less with the media (r = 0.52, p = 0.003), and a cluster (C2, purple) with the opposite pattern (a correlation with cases of r = 0.13, p = 0.45 and with media of r = 0.71, p = 2 × 10⁻⁶) (figure 2f). Cluster 3 (green) correlates poorly with both the number of confirmed cases and media attention.
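The Granger-causality check mentioned above (whether media precedes a cluster centroid) can be sketched as follows (illustrative only; the lag range is an assumption, and the inputs are the weekly cluster centroid and media counts):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

def media_precedes_centroid(centroid: pd.Series, media: pd.Series, maxlag: int = 4) -> dict:
    """P-values of the test that 'media' Granger-causes 'centroid' at lags 1..maxlag.

    grangercausalitytests expects a two-column array and tests whether the
    second column helps predict the first one.
    """
    data = np.column_stack([centroid.values, media.values])
    results = grangercausalitytests(data, maxlag=maxlag)
    # keep the p-value of the SSR F-test at each lag
    return {lag: res[0]["ssr_ftest"][1] for lag, res in results.items()}
```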
Thus, and despite the strong entanglement and time-coincidence between the cases and the media, particularly in the case of the current pandemic, these results show that 1) not all pandemic-related search trends show the same patterns, and 2) some of the patterns may be driven by media attention whereas others may be driven by the number of cases. That very similar search terms display such different time patterns is interesting in itself, but only useful if they have predictive power. Therefore, we asked whether the search terms identified as correlating with the number of confirmed cases (during a pandemic) could be used to forecast seasonal epidemics. The rationale is that if we can reduce the noise caused by media coverage and identify the terms that are more resilient to outside factors, we can make seasonal forecasting more robust. Our goal was therefore not to devise the best possible model, but rather to test whether particular search terms perform better than others. To do this, we took advantage of extensively available seasonal flu data and chose two simple models: a linear regression and a non-linear random forest (details in the Methods). We then tested the predictive power of the models when using all search terms from figure 2a (which we call "all data") or just the terms from the clusters identified in figure 2b. For both models and all dataset variations, we used three years of data to predict the fourth and assessed the performance of the model only on the prediction season (see Methods for details). Figure 3 and table 1 show the performance of the two models (figures 3a and 3b), measured by the root-mean-square error (RMSE) and the coefficient of determination, R². In general, both models perform similarly, with a mean R² above 0.7. In both cases, using all data (pink line) is not better than just using the terms more correlated with the number of cases during the pandemic (cluster 1, orange line), and on average cluster 1 performs better than all terms in both the linear regression (R² = 0.81 for cluster 1 vs R² = 0.71 for all data) and the random forest (R² = 0.86 for cluster 1 vs R² = 0.81 for all data). It can also be observed that cluster 1 terms (orange) tend to have a more consistent performance (shown by the smaller standard deviation: σ = 0.08 for cluster 1 with the linear regression and σ = 0.06 with the random forest, vs σ = 0.163 and σ = 0.104, respectively, when considering all data). It is important to note that some of the features from clusters 2 and 3 might be better local predictors, and that can explain the performance of the models when using all search terms, but overall, using only the pre-identified terms of cluster 1 is better. This indicates that 1) insights from pandemics can be used in seasonal forecasting models, and 2) refining the search-term selection, by choosing the terms less sensitive to media hype, might reduce over-fitting and improve model robustness.
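A minimal sketch of this sliding-window evaluation (not the authors' pipeline; the data layout, one standardized DataFrame per season with the search-term columns and a 'cases' column, is assumed) could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def sliding_window_eval(seasons, feature_cols, model_factory, window=3):
    """Train on 'window' consecutive seasons and test on the following one."""
    scores = []
    for start in range(len(seasons) - window):
        train = pd.concat(seasons[start:start + window])
        test = seasons[start + window]
        model = model_factory()
        model.fit(train[feature_cols], train["cases"])
        pred = model.predict(test[feature_cols])
        scores.append({
            "rmse": np.sqrt(mean_squared_error(test["cases"], pred)),
            "r2": r2_score(test["cases"], pred),
        })
    return scores

# Example comparison, e.g. all terms vs. only the cluster-1 terms:
# sliding_window_eval(seasons, all_terms, LinearRegression)
# sliding_window_eval(seasons, cluster1_terms,
#                     lambda: RandomForestRegressor(n_estimators=100))
```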
Figure 3 (caption, partial): Each dot represents the squared difference between the prediction and the empirical data, averaged over one season. Cluster 1 (orange) shows better results in almost all seasons and has a smaller standard deviation (shaded area) when compared to cluster 2 (purple) or all data (pink). In both cases, three years were used as training and the models were tested on the following year, in a sliding-window process.
We then asked whether these results could be used in the current pandemic. This is a more challenging setting for several reasons: first, the data is arriving in close to real-time and with varying quality (the number of tests, the criteria for testing, and the reporting formats have been changing over time, even within the same country); second, there is no indication that Covid-19 might become a seasonal disease, and the periodicity of new outbreaks, if any, remains unknown; third, reporting is now happening on many different online platforms, at an even faster pace than in 2009; and, more importantly, fourth, we do not have a large number of past seasons to train our models on. Still, we employed a similar approach to test whether the rationale of the flu pandemic could be applied to Covid-19. The US pandemic situation has been peculiar, with different states having widely different infection rates and risk levels 51. Also, at the time of this study, there were no states with clear, strong second waves or evidence of seasonality. Therefore, we focused on Spain, one of the first countries to have a clear and strong second wave, and trained the models on the first (February-June) wave to try to now-cast the second (June-November) wave. Data for the US can nevertheless be found in the supplementary materials, with results very consistent with what we observed in the case of Spain. Figure 4 shows that, again, using only the features from cluster 1 (orange) offers a much better prediction than using the search terms from clusters 2 (purple) or 3 (supplementary materials), despite the fact that cluster 1 has a much smaller number of terms. The result is particularly striking in the case of the random forest (figure 4b, compare pink and orange). These results further support the idea that, by selecting online data using a semi-manual approach, it is possible to improve disease now-casting.
Figure 4 (caption, partial): [...] in Spain. Each dot shows the squared difference between the prediction and the empirical data in each week. Cluster 1 (orange) presents better results in almost all weeks and has a smaller standard deviation (shaded area) when compared to cluster 2 (purple) or all data (pink). In both cases, the first wave was used to train the model.
In the past, the inclusion of online data in surveillance systems has improved disease prediction over traditional syndromic surveillance systems, while also showing some very obvious caveats. Online-data-based surveillance systems have many limitations and challenges, including noisy data, demographic bias, privacy issues, and, often, very limited prediction power. Previous approaches have assumed that if a search term is a good predictor of cases in one year, it will be a good predictor in the following years 11,52, when in fact search terms may be associated with both cases and media hype in a particular year, but soon lose their association with one or the other (especially when media interest fades).
Moreover, and taking into consideration that these approaches often use a single explanatory variable, meaning the model ignores the variability in individual search-query tendencies over time, terms highly correlated with disease cases at a certain moment can be highly correlated with media reports as well, but over time some might lose their association with one or the other. However, and despite the described limitations, there are several successful examples of using online behaviour as a proxy for "real-world" behaviour in disease settings, and it is increasingly clear that such data can offer insights not limited to disease forecasting 16,53-56. Pandemics have been particularly ignored in digital now-casting because they represent (hopefully) rare events in which people's behaviour changes, making forecasting even more challenging. A large part of these behavioural changes is driven by the excess media attention: people become curious and possibly afraid, and start looking for more (different?) information. This is in contrast with seasonal outbreaks, where there is less relative media attention, there is more common knowledge, and people's online searches might be primarily driven by actual disease. In general, the notions that online search data is too noisy and that the models used have limited prediction power have led people to try to increase the type and quantity of data, or to build more complex models. However, we argue that this tension, between using the large potential of online data and the so-called "data hubris", can be balanced in the opposite direction, by including behavioural knowledge and human curation to reduce the amount of data required, while keeping the models simple and explainable. In this study, we applied this approach to two pandemics and showed that, contrary to general arguments that "more data trumps smarter algorithms" 57, we can use such extraordinary events to improve seasonal forecasting, and we argue that lowering the volume of data can reduce over-fitting while maintaining the quality of the predictions. This was done by actively discriminating between search queries that are very sensitive to media and queries possibly more driven by symptoms. Our approach combines elements of human curation and blind statistical inference. On the one hand, our initial term list is based on knowledge of the disease. On the other, the clustering algorithm is blind to the actual meaning of the terms. This leads to unusual term pairings, such as "oseltamivir" (cluster 2), a drug used to treat flu, being separate from "flu treatment" (cluster 1). We can explain this separation by considering that the media is more likely to mention the name of the drug, but that sick people might not remember it; however, a priori, we might not have thought this distinction important. Finally, the choice of the best cluster is again based on human curation, by looking at the correlations with media and cases, which we postulate are the main drivers behind search queries. In many general now-casting problems, a similar semi-automated approach is probably more fruitful than a fully automated, data-hungry methodology.
This approach can also be particularly useful in countries where data is sparse or suffers from significant bias or delays. Even within Europe, data collection and reporting have been inconsistent, limiting global epidemiological analysis 57. Methods such as the one we describe here cannot replace the need for strong, centralized data-collection systems (through the European, American, or other CDCs), but they might help fill existing gaps while surveillance networks are built or reinforced. In addition to improving now-casting models, finding different search patterns in Google Trends can offer insights into the behaviours of internet users. Specifically, by clustering search trends on a topic, we can ask whether there are different motivations behind them. If there are hypotheses about what those motivations are, they can also be tested by correlating them with the cluster centroids, as we do here. For example, the search terms from the media-related clusters (clusters 2) could be further analyzed to discriminate which terms are more often found in newspapers versus television, offering insight into the preferred news media. This methodology opens new doorways into connecting online and offline behaviour. Overall, we add to the ongoing work on using digital tools and online data to improve disease monitoring and propose a new tool to now-cast infectious diseases, combining statistical tools and human curation, that can prove useful in the monitoring of current and future pandemics and epidemics.
Data for the 2009 pandemic was collected for the USA, from March 2009 to August 2019, as it offered reliable data on a large number of people. This was not possible for Covid-19, as this pandemic is reaching different states at different times and second or third waves are mostly caused by surges in new states rather than by a nation-wide, simultaneous epidemic. Still, the supplemental text shows that three clusters are observed, one more correlated with cases than the rest. Data for the Covid-19 pandemic was collected for Spain, from January 2nd to November 15th 2020, as it was the country with the highest number of reliable second-wave cases, offering at least one training and one testing period. Data from Google search trends (GT) 58 was extracted for the United States and Spain, for both the flu and Covid-19 pandemics, through the GT API. It provides a normalized number of queries for a given keyword, time, and country 11,59. Search terms were selected to cover various aspects of pandemic and seasonal flu, and of Covid-19, such as symptoms, antivirals, personal care, institutions, and pandemic circumstantial terms. This was done with the help of the "related queries" option that Google Trends provides, which returns what people also search for when they search for a specific term. Terms that contained many "zeros" interspersed with high values were indicative of low search volume and were removed. In the end, we had 49 flu-related weekly search trends in the United States and 63 Covid-related terms in Spain. Time periods were December 2019 to September 2020 in the case of Spain and September 2009 to September 2019 in the case of the USA, to cover pre-pandemic, pandemic, and post-pandemic periods. In the case of the US flu pandemic, search terms were extracted for each season separately, with a season being defined as going from September 1st to October 1st of the following year. GT time series were extracted in September 2020 in the case of Spain, and in July 2020 in the case of the US.
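As an illustration of this extraction step, the following sketch uses the unofficial pytrends wrapper around Google Trends (an assumption for illustration only; the paper's own tooling is not specified, and the terms and dates below are examples, not the study's word lists):

```python
from pytrends.request import TrendReq

def fetch_trends(terms, timeframe="2019-12-01 2020-09-30", geo="ES"):
    """Return normalized (0-100) weekly interest for up to five search terms."""
    pytrends = TrendReq(hl="en-US")
    pytrends.build_payload(terms[:5], timeframe=timeframe, geo=geo)
    interest = pytrends.interest_over_time()
    return interest.drop(columns=["isPartial"], errors="ignore")

# Example (hypothetical Spanish terms):
# trends = fetch_trends(["covid sintomas", "fiebre", "mascarilla"])
```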
Data was binned at a weekly resolution, to match that of reported cases and remove daily variation. Both word lists are reported in the supplemental text. The pandemic-flu United States media dataset contains the weekly counts of TV news broadcasts and print media pieces that mentioned "flu" or "influenza". It includes the NBC, CBS, CNN, FOX, and MSNBC networks, obtained from the Vanderbilt Television News Archive 60, and The New York Times, from the NYT API (https://developer.nytimes.com/). The Covid-19 media dataset, for both the USA and Spain, was obtained through Media Cloud 61, an online open-source platform containing an extensive global news corpus starting in 2011. The query "Covid-19 OR Coronavirus" was used to track media coverage of the pandemic over time; it aggregated articles that contained either keyword or both. For the US, we searched the collections "United States - National" (#34412234) and "United States - State & Local" (#38379429), which include 271 national and 10,457 local media sources, respectively. For Spain, we used the collections "Spain - National" (#34412356), which includes 469 media sources, and "Spain - State & Local" (#38002034), which includes 390 media sources. Data on confirmed infections for both pH1N1 and SARS-CoV-2 are publicly available. US pH1N1 cases were extracted from the CDC's National Respiratory and Enteric Virus Surveillance System 62. In the case of Covid-19 in the US, national and state-level cases were extracted from Our World in Data (based on ECDC data) 63 and from The New York Times 64, respectively, in August 2020. In the case of Covid-19 in Spain, data was obtained from the WHO 39. Google search terms were independently extracted from Google Trends 65. While all search queries include a value of 100, not all include a zero (if there were no weeks with less than 1% of the maximum weekly volume), so all series were re-scaled between 0 and 100. These were clustered using hierarchical clustering, computing the pairwise Euclidean distance between words and using Ward's linkage method (an agglomerative algorithm) to construct the dendrograms shown in figure 2. Clustering was performed in Python, using scipy.cluster.hierarchy.dendrogram 66. The number of clusters was determined through visual inspection of the dendrogram. This task was performed using data from the pandemic period, which for the H1N1 pandemic was between March 2009 and August 2010, and for Covid-19 between December 2019 and September 2020. The datasets for seasonal flu were collected similarly to those of the pandemic. They are aggregated by week, and seasons were defined by visual inspection, varying from season to season, over the 9 years of data. Each dataset (cases and search time series) in each season was standardized so that its mean value was 0 and its standard deviation was 1. The model was trained with 3 seasons and tested on the 4th. In the case of Covid-19 in Spain, the data was split around the week with the fewest cases (June). The first wave was then used to train and the second to test. In each case, a model of the form

I_i = β_0 + β_1·W_1,i + β_2·W_2,i + ... + β_n·W_n,i + ε_i (1)

was trained, where I_i is the number of infections in week i, W_j,i is the (standardized) volume of search term j in week i, β_0 is the intercept, β_1 to β_n are the coefficients of each search term, and ε_i is the error. The coefficients were estimated so as to minimize the sum of the squared errors across all weeks. The regression was implemented in Python using sklearn.linear_model.LinearRegression 67 with default parameters.
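A minimal sketch of the clustering step described above (not the authors' code; function and variable names are illustrative) rescales each series, computes term-to-term Euclidean distances, and applies Ward's agglomerative linkage with scipy:

```python
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_terms(trends: pd.DataFrame, n_clusters: int = 3):
    """trends: one column per search term, one row per week."""
    # Re-scale every series to the 0-100 range used by Google Trends
    rescaled = trends.apply(lambda s: 100 * (s - s.min()) / (s.max() - s.min()))
    # Rows become terms after transposing, so distances are between terms
    distances = pdist(rescaled.T.values, metric="euclidean")
    tree = linkage(distances, method="ward")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, dict(zip(rescaled.columns, labels))

# The dendrogram itself (as in figure 2) can be drawn with:
# tree, assignment = cluster_terms(trends)
# dendrogram(tree, labels=list(trends.columns))
```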
For each dataset, a random forest model was trained using sklearn.ensemble.RandomForestRegressor 68, implemented in Python. The hyperparameters - number of estimators, max features, and max depth - were selected through cross-validation using GridSearchCV, from [10, 20, 50, 100, 200, 500, 1000], [0.6, 0.8, "auto", "sqrt"], and [2, 4, 5, 6], respectively.
Big data opportunities for global infectious disease surveillance
Quantifying sars-cov-2 transmission suggests epidemic control with digital contact tracing
Digital epidemiology
Online health search
Using web search query data to monitor dengue epidemics: a new model for neglected tropical disease surveillance
Prediction of dengue incidence using search query surveillance
Correlation between google trends on dengue fever and national surveillance report in indonesia
Dynamic forecasting of zika epidemics using google trends
Google flu trends
Detecting influenza epidemics using search engine query data
Forecasting the 2013-2014 influenza season using wikipedia
Separating fact from fear: Tracking flu infections on twitter
Combining search, social media, and traditional data sources to improve influenza surveillance
Evaluating google, twitter, and wikipedia as tools for influenza surveillance using bayesian change point analysis: a comparative analysis
Early and real-time detection of seasonal influenza onset
The parable of google flu: traps in big data analysis
Mass media and the contagion of fear: the case of ebola in america
Reassessing google flu trends data for detection of seasonal and pandemic influenza: a comparative epidemiological study at three geographic scales
Google disease trends: an update
Nine challenges in incorporating the dynamics of behaviour in infectious diseases models
Media coverage of public health epidemics: Linking framing and issue attention cycle toward an integrated theory of print news coverage of epidemics
Modelling the effects of media during an influenza epidemic
The effects of media reports on disease spread and important public health measurements
The impact of news exposure on collective attention in the united states during the 2016 zika epidemic
Fear of zika: Information seeking as cause and consequence
Understanding fear of zika: Personal, interpersonal, and media influences
Public anxiety and information seeking following the h1n1 outbreak: blogs, newspaper articles, and wikipedia visits
Origins of the 2009 h1n1 influenza pandemic in swine in mexico
Surveillance for influenza during the 2009 influenza a (h1n1) pandemic-united states
Initial surveillance of 2009 influenza a (h1n1) pandemic in the european union and european economic area
Absolute humidity modulates influenza survival, transmission, and seasonality
Estimated global mortality associated with the first 12 months of 2009 pandemic influenza a h1n1 virus circulation: a modelling study
Misconceptions about weather and seasonality must not misguide covid-19 response
Management of post-acute covid-19 in primary care
Systems biological assessment of immunity to mild versus severe covid-19 infection in humans
Long-term health consequences of covid-19
Will coronavirus disease 2019 become seasonal?
An early warning approach to monitor covid-19 activity with multiple digital traces in near real-time
Divergent modes of online collective attention to the covid-19 pandemic are associated with future caseload variance
A machine learning methodology for real-time forecasting of the 2019-2020 covid-19 outbreak using internet searches, news alerts, and estimates from mechanistic models
Predicting covid-19 incidence through analysis of google trends data in iran: data mining and deep learning pilot study
Internet search patterns reveal clinical course of disease progression for covid-19 and predict pandemic spread in 32 countries
Association of the covid-19 pandemic with internet search volumes: a google trends™ analysis
Detection of influenza a (h1n1) v virus by real-time rt-pcr
How the media reported the first days of the pandemic (h1n1) 2009: results of eu-wide media analysis
Swine flu and hype: a systematic review of media dramatization of the h1n1 influenza pandemic
"Pandemic public health paradox": time series analysis of the 2009/10 influenza a/h1n1 epidemiology, media attention, risk perception and public reactions in 5 european countries
Internet usage in 2010 - households and individuals. Eurostat Data in Focus
Real-time, interactive website for us-county-level covid-19 event risk assessment
Assessing google flu trends performance in the united states during the 2009 influenza virus a (h1n1) pandemic
Using big data to predict collective behavior in the real world
The cost of racial animus on a black candidate: Evidence using google search data
Forecasting private consumption: survey-based indicators vs. google trends
Estimating the effects of non-pharmaceutical interventions on covid-19 in europe
Google trends: a web-based tool for real-time surveillance of disease outbreaks
New york times covid-19 data
A hands-on guide to google data
The authors would like to thank members of the SPAC lab for comments and critical reading of the manuscript. This work was partially funded by FCT grant DSAIPA/AI/0087/2018 to JGS and by PhD fellowships SFRH/BD/139322/2018 and 2020.10157.BD to CHV and SM, respectively. All authors participated in project conception, data analysis, and paper writing.