key: cord-1018079-ui33jc8g
authors: Rabiolo, A.; Alladio, E.; Morales, E.; McNaught, A. I.; Bandello, F.; Afifi, A. A.; Marchese, A.
title: Forecasting the COVID-19 epidemic integrating symptom search behavior: an infodemiology study
date: 2021-03-12
journal: nan
DOI: 10.1101/2021.03.09.21253186
sha: 2173e66456bb7fbc3c7f2c121dcd15a4a15278f5
doc_id: 1018079
cord_uid: ui33jc8g

Background: Previous studies have suggested associations between trends of web searches and traditional COVID-19 metrics. It remains unclear whether models incorporating trends of digital searches lead to better predictions.

Methods: An open-access web application was developed to evaluate Google Trends and traditional COVID-19 metrics via an interactive framework based on principal components analysis (PCA) and time series modelling. The app facilitates the analysis of symptom search behavior associated with COVID-19 disease in 188 countries. In this study, we selected data from eight countries as case studies to represent all continents. PCA was used to perform data dimensionality reduction, and three different time series models (Error Trend Seasonality, Autoregressive integrated moving average, and feed-forward neural network autoregression) were used to predict COVID-19 metrics over the upcoming 14 days. The models were compared in terms of prediction ability using the root-mean-square error (RMSE) of the first principal component (PC1). The predictive ability of models generated with both Google Trends data and conventional COVID-19 metrics was compared with that of models fitted with conventional COVID-19 metrics only.

Findings: The degree of correlation and the best time-lag varied as a function of the selected country and topic searched; in general, the optimal time-lag was within 15 days. Overall, predictions of PC1 based on both search terms and traditional COVID-19 metrics performed better than those not including Google searches (median [IQR]: 1.43 [0.74-2.36] vs.
1.78 [0.95-2.88], respectively), but the improvement in prediction varied as a function of the selected country and timeframe. The best model varied as a function of the country, time range, and period selected. Models based on a 7-day moving average led to considerably smaller RMSE values than those calculated with raw data (median [IQR]: 0.74 [0.47-1.22] vs. 2.15 [1.55-3.89], respectively).

Interpretation: The inclusion of digital online searches in statistical models may improve the prediction of the COVID-19 epidemic.

COVID-19 is a new entity, and the dynamics of its propagation are difficult to predict. In the absence of compelling evidence, health and political decisions have been strongly driven by a wide variety of statistical models and simulation scenarios to forecast the COVID-19 epidemic. Still, large variations exist among the different models with respect to the predicted number of infected people, time to reach a peak of new cases, course of the epidemic, and identification of outbreaks. [1] One key limitation of such models is that they rely heavily on the number of confirmed infected subjects, who usually seek medical attention because of moderate to severe symptoms. However, confirmed cases are most likely only a small proportion of the true number of cases, as the vast majority of infected individuals have asymptomatic or mildly symptomatic disease. [2]

There is increasing interest in the potential of 'big data' analysis to predict future areas of COVID-19 outbreaks and incidence of cases based on symptom search behaviors. In the past, search query data have been used to facilitate early detection and near real-time estimates of influenza and dengue. [3] These seminal works were helpful in introducing later improvements to predictive models that might help guide health authorities to mount rapid responses, and to design more efficient surveillance programs for COVID-19.
[4] A few studies have shown a correlation between Google Trends for searches of medical terms and COVID-19 metrics, [5] suggesting that incorporating Google Trends data into conventional metrics could lead to better nowcasting and forecasting of the COVID-19 epidemic.

In this study, we systematically evaluate patterns of web queries for COVID-19 clinical manifestations and develop an open-access web application for exploring their correlations with COVID-19 propagation. We implement models integrating conventional COVID-19 metrics with Google Trends data and compare them to those not containing Google Trends data. The results of this study provide a framework for digital surveillance of COVID-19 using open-access big data.

This medRxiv preprint (this version posted March 12, 2021; https://doi.org/10.1101/2021.03.09.21253186) is made available under a CC-BY-NC-ND 4.0 International license. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Daily new confirmed, cumulative, and per-million numbers of COVID-19 cases and deaths for all available countries were automatically exported from the COVID-19 Data Repository by the Center for Systems Science and Engineering at Johns Hopkins University (source: https://github.com/CSSEGISandData/COVID-19). [6] The selected countries used as case studies are given in the results section below.
The choice of countries was arbitrary, guided by the following principles: representation of the five continents; inclusion of countries where the COVID-19 epidemic had different levels of severity and different evolutions over time; inclusion of countries where Google is the preferred search engine; exclusion of countries with limited access to the internet; exclusion of countries where one or more Google Trends topics had only zero or missing values in the selected time frame; exclusion of countries whose reliability in terms of data reporting has been questioned. As data were fully anonymized and publicly available, no ethical approval was required.

The Google Trends API was used to extract trends of Google searches for the most common COVID-19 signs and symptoms in those countries. [7] For each search term, geographic region, and time frame selected, Google Trends outputs an 'interest over time' (IOT) index, which estimates the relative search volume on a normalized scale from 0 (no searches) to 100 (search term popularity peak). Twenty topics were identified on the basis of the most frequent signs and symptoms of COVID-19 and included: abdominal pain, ageusia, anorexia, anosmia, bone pain, chills, conjunctivitis, cough, diarrhea, eye pain, fatigue, fever, headache, myalgia, nasal congestion, nausea, rhinorrhea, shortness of breath, sore throat, tearing. [8] [9] [10] [11] Google Trends queries were carried out with the «topic» function, which includes all the related terms sharing the same concept in different languages. This approach ensures that searches for closely related symptom types are appropriately grouped together.
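To make the 0-100 scale concrete, the following minimal Python sketch rescales a series of search volumes so that its peak equals 100. It is an illustration only: the exact Google Trends formula is not public, and the function name `interest_over_time` and the input values are hypothetical. The sketch also shows why the index is relative to the chosen time frame: the same day can receive a different value when the window, and hence the peak, changes.

```python
def interest_over_time(raw_volumes):
    """Rescale non-negative search volumes so that the peak equals 100.

    Illustrative assumption: simple peak normalization, not Google's formula.
    """
    peak = max(raw_volumes)
    if peak == 0:
        # No searches at all in the window: every value stays at 0.
        return [0 for _ in raw_volumes]
    return [round(100 * v / peak) for v in raw_volumes]


# Hypothetical raw volumes for one topic over four days.
iot = interest_over_time([2, 4, 10, 5])
print(iot)
```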
For each country and search term, data were automatically exported as csv files for two pre-specified timeframes: (i) five years of weekly data from 12/Apr/2015 to 5/Apr/2020 to study the long-term pattern of the search terms, and (ii) daily data from 22/Jan/2020 to 20/Dec/2020. As Google Trends allows daily data to be exported for up to 9 months, daily data were reconstructed by means of an overlapping method. [12]

Interest-over-time values for the five-year interval were used to distinguish topics with a significant deviation from their long-term pattern from the onset of the COVID-19 epidemic. For seasonal queries, trends were isolated from seasonal and random components with an additive decomposition method (Supplementary Figure 1); for non-seasonal queries, trends were extracted by smoothing the time series with a one-year moving average. Decomposition plots were visually inspected, and topics with no clear change in their five-year trends from January 2020 were excluded from the subsequent analyses.

The relationship between the daily IOT values for the selected topics and COVID-19 confirmed deaths and new cases was investigated in the shorter time [...] confirmed cases and deaths to blunt the day-by-day fluctuation.

Principal Components Analysis (PCA) was used to perform data dimensionality reduction, decrease the number of input variables, and filter out noisy or redundant information. For each country, two PCA models were calculated: one using unprocessed data and the other using 7-day moving average smoothing. For the sake of comparability of the variables, PCA was applied to standardized data (i.e., with zero mean and unit variance). The PCA model was graphically inspected through PCA score and loading plots. PCA was assessed via 5-fold cross-validation, and the results obtained in each test sample were averaged.
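The smoothing, standardization, and PCA steps described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a trailing 7-day window (the text does not say whether the average was trailing or centered), extracts only PC1 by power iteration on the correlation matrix, and runs on made-up numbers. All function and variable names are hypothetical.

```python
import math


def moving_average(series, window=7):
    """Trailing moving average; the first window-1 points are dropped."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]


def standardize(column):
    """Rescale a column to zero mean and unit (population) variance.

    Assumes the column is not constant (otherwise the variance is zero).
    """
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / sd for x in column]


def first_pc(columns, iterations=500):
    """Return (loadings, scores) of PC1 via power iteration.

    Assumes the uniform start vector is not orthogonal to PC1, which holds
    for positively correlated variables such as the toy data below.
    """
    z = [standardize(c) for c in columns]
    n, p = len(z[0]), len(z)
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(p)]
            for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(v[j] * z[j][t] for j in range(p)) for t in range(n)]
    return v, scores


# Hypothetical daily values: new cases and IOT for two symptom topics.
cases = [10, 12, 15, 20, 26, 30, 41, 50, 58, 70, 75, 90]
fever = [5, 8, 9, 14, 13, 20, 22, 30, 28, 35, 41, 44]
cough = [3, 4, 6, 6, 9, 11, 12, 15, 19, 18, 22, 25]

smoothed = [moving_average(s) for s in (cases, fever, cough)]
loadings, pc1_scores = first_pc(smoothed)
print(loadings)
print(pc1_scores)
```

Because the variables are standardized first, a topic with large IOT swings does not dominate PC1 simply because of its scale.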
The amount of variance explained by each principal component (PC) in the model was inspected with scree plots, and, based on the elbow and Kaiser rules, the first two PCs (PC1 and PC2) were subsequently used for time-series modelling. [13]

Three different time series models were fitted on PC1 and PC2 values. The models tested were: Error Trend Seasonality (ETS), Autoregressive integrated moving average (ARIMA), and a feed-forward neural network autoregression (NNAR) model with one hidden layer. [14] Models were fitted on a 30-day window and used to predict future PC1 and PC2 values up to 14 days ahead. The fourteenth predicted day was aligned to the peak and base of each wave. The new data scores predicted with the time series models were then reinserted into the model as input variables. For each country, the three models were compared in terms of their ability to predict PC1 and PC2 using the root-mean-square error (RMSE) of the predicted values. For each time-series model, the predictive ability of the models generated with raw data and with the 7-day moving average was compared. To further assess the PCA models based on both Google Trends data and conventional COVID-19 metrics, we also generated predictive models based on conventional COVID-19 metrics only; we then compared the predictive ability of models with and without Google Trends data by means of RMSE for each country.

An open-source web-application (https://predictpandemic.org) was developed in R Shiny. [15] Data are collected, imported, and updated daily for 188 countries from the sources mentioned above. Within the web-application, the user can select specific countries, timeframes, and moving averages.
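The RMSE-based comparison of competing forecasts described above can be sketched as follows. This is only an illustration of the evaluation step: the two forecasters below (a naive last-value forecast and a linear drift forecast) are deliberately simple stand-ins, not the ETS, ARIMA, or NNAR models used in the study, and the PC1 values are synthetic. The 30-day fitting window and 14-day horizon follow the text; all names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))


def naive_forecast(history, horizon=14):
    """Stand-in model: repeat the last observed value."""
    return [history[-1]] * horizon


def drift_forecast(history, horizon=14):
    """Stand-in model: extend the straight line from first to last point."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * h for h in range(1, horizon + 1)]


def compare_models(history, actual, models):
    """Return the name of the lowest-RMSE model and all RMSE values."""
    errors = {name: rmse(actual, model(history, len(actual)))
              for name, model in models.items()}
    return min(errors, key=errors.get), errors


# Synthetic PC1 scores: 30 days for fitting, the next 14 days held out.
pc1 = [0.1 * t for t in range(44)]
history, held_out = pc1[:30], pc1[30:]

best, errors = compare_models(
    history, held_out, {"naive": naive_forecast, "drift": drift_forecast})
print(best, errors)
```

The same comparison applies unchanged to models fitted with and without Google Trends inputs: each produces a 14-day forecast of PC1, and the one with the smaller RMSE on the held-out days is preferred.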
The web-application allows users to generate line graphs and streamgraphs to visualize IOT values and COVID-19 metrics, and to view worldwide trends over time in the form of a choropleth map. Relationships between the variables at the various lags can be explored with cross-correlations. The web-application allows fitting and evaluating PCA models, fitting a time-series model (either ETS or ARIMA), predicting PC components or any of the input variables of the model (including numbers of new cases and deaths), and evaluating the model performance graphically and with various metrics, such as RMSE and mean [...]

[...] Figures 9-16). The change of interest over time for all the topics is illustrated in Figure 2. Overall, the IOT values of the selected topics had a peak in March in all the selected countries. In Italy, France, South Africa, and, to a lesser extent, the UK and the US, there was a decrease in the searched terms after the first peak followed by a second peak. In India and Brazil, searches of medical terms remained high after the first peak, and no second peak was seen. In Australia, the IOT values of the selected topics returned to the pre-peak values soon after the first peak in March and remained low and stable. Cross-correlations between each topic and the number of confirmed COVID-19 cases are reported in Supplementary Tables 1-8 [...] Figure 17).
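The cross-correlation at various lags mentioned above can be illustrated with a short Python sketch that correlates search interest on day t with confirmed cases on day t + lag and reports the lag with the highest Pearson coefficient. It is a minimal sketch on synthetic data, not the routine used in the web-application; the 15-day upper bound mirrors the lag range reported in the abstract, and all names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def best_lag(searches, cases, max_lag=15):
    """Correlate searches at day t with cases at day t + lag.

    Returns the lag (in days) with the highest correlation and all values.
    """
    results = {}
    for lag in range(max_lag + 1):
        x = searches[:len(searches) - lag]
        y = cases[lag:]
        results[lag] = pearson(x, y)
    return max(results, key=results.get), results


def wave(t):
    """Synthetic bell-shaped epidemic wave centered on day 20."""
    return math.exp(-((t - 20) / 6) ** 2)


# Synthetic example: cases follow the same wave as searches, 5 days later.
searches = [wave(t) for t in range(60)]
cases = [wave(t - 5) for t in range(60)]

lag, correlations = best_lag(searches, cases)
print(lag, round(correlations[lag], 3))
```

A positive best lag means that search interest leads the case counts by that many days, which is the sense in which searches can anticipate traditional metrics.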
In this study, we investigated the relationship between Google Trends searches of symptoms associated with COVID-19 and confirmed COVID-19 cases and deaths using PCA. We found that some of the searched terms showed an unusually high recent online interest that deviated considerably from their expected behavior and anticipated the peak of confirmed COVID-19 cases by days to weeks. This pattern was consistent across different countries and of similar magnitude. We developed and validated predictive models to forecast the COVID-19 epidemic based on the combination of Google Trends searches of symptoms associated with COVID-19 and traditional COVID-19 metrics. We found that models incorporating Google Trends data generally performed better than those based solely on traditional COVID-19 metrics. We also developed a web-application (https://predictpandemic.org) to translate our approach into action.

Modeling epidemics is a complex task that depends on several assumptions and entails numerous limitations. The choice of data input is a crucial part of model development for accurate predictions. [16] Most of the current models for COVID-19 rely on confirmed cases or deaths, but neither of these measures is satisfactory. Confirmed COVID-19 cases represent only a part of all infected subjects, as those with mild symptoms may not seek medical attention or get tested. Also, the number of confirmed cases is highly dependent on the number of tests performed, which varies greatly between countries and across the phases of the epidemic. Confirmed COVID-19 deaths are likely to be a more reliable measure, but they occur in the final stages of the disease and are, therefore, a poor indicator for detecting outbreaks at their earliest stages. Also, COVID-19 deaths are not uniformly reported
among the different countries and may vary as a function of the healthcare systems, population demographics, and public health status.

In the past decade, there has been increasing interest in the use of internet big data to understand patterns of disease and population behaviors, and to conduct surveillance of infectious diseases. Despite being initially welcomed with enthusiasm, models based only on Google Trends data proved inaccurate in determining the absolute numbers of cases in epidemics, but were helpful in identifying temporal dynamics, anticipating peaks, and improving forecasting when used in conjunction with (and not in place of) traditional data. [17, 18]

Our study identified consistent patterns of Google searches for several symptoms and signs associated with COVID-19 across the studied countries. Overall, Google searches of COVID-19 symptoms followed a similar trend to that of the COVID-19 epidemic, and, importantly, anticipated traditional COVID-19 metrics. This behavior can contribute to early recognition of new waves and epidemic peaks.

[...] RMSE when obtained on a 7-day moving average, rather than on daily data. This result is not surprising, as Google Trends data have high daily fluctuations, and reported COVID-19 cases oscillate greatly, reflecting testing and reporting practices and contingencies.
[24, 25] To translate our results into practice so that the scientific community, agencies, and even curious users could potentially use them, we developed a web-application, freely available at https://predictpandemic.org. The application is interactive and updates the data on a daily basis, so it operates in nearly real-time. It allows the user to visualize data for 188 countries for any chosen time frame. Also, traditional COVID-19 metrics and the IOT of Google search terms can be visualized globally on different graphs. The application allows the user to explore cross-correlations among selected keywords, to generate predictive models with the default or a user-selected subset of variables, and to check model performance.

Another potential use of Google Trends is the opportunity to filter the results by precise geographic location to obtain more granular data. These filters include nations, but also states, regions, and even cities in some areas. As not all areas of a country are equally affected or have the same risk of COVID-19 outbreaks, this function may allow regional predictions. We are currently planning to provide more granular data in our web-application.

The present study has limitations. The Google Trends algorithm is a 'black box', and the exact calculation formula for the interest over time and the raw data have never been made public. Search results may differ slightly when downloaded on different computers or on different days, although we assessed search-research reliability, which was excellent for most of the topics included in this study (data not shown). The exclusion of those symptoms with no significant
deviation from their five-year trends reduced the possibility of spurious correlations, but it was not possible to account for seasonality in the selected topics. In other words, a small proportion of the increasing trend in some topics might be explained by their usual seasonal variations. The results of this study may not apply to countries where Google is not a popular search engine or where Google is censored or limited in its use. However, this approach can be applied to other search engines (e.g., Baidu, Yahoo, Naver), as was done in previous studies on different diseases and on COVID-19 in Hubei province, China. [26, 27] Geographical areas and groups of people (elderly people and children) with scarce Internet access cannot be studied with this strategy, and our results may not apply to largely rural countries. This study included only the most common clinical manifestations of COVID-19, and only a few selected countries were included as case studies. However, information and models for every country can be found on our web-application.

Future work will include increased data granularity, allowing information and predictions at a regional level. Also, other metrics of interest, such as hospitalizations, will be included in our analysis as outcomes. Finally, we are planning to allow the user to generate a one-page report for each individual country, summarizing the most relevant information.

In conclusion, the results of this study show that Google Trends based on online searches during the COVID-19 pandemic may anticipate outbreaks by up to two weeks. The inclusion of digital online searches in statistical models may improve the nowcasting and forecasting of the COVID-19 epidemic, and could be used as one of the surveillance systems employed by government agencies and supranational organizations to refine their monitoring of COVID-19 disease.
We provide a free web-application operating with nearly real-time data that can be used by any user to [...]

[...] Symptoms screened at the 5-year analysis

Topic            Seasonality    Deviation from 5-year trend
Abdominal Pain   Nonseasonal    No
Ageusia          Nonseasona[...]

Topic categorization into deviating and not-deviating from their 5-year trend was determined by visual inspection of decomposition plots.

Why is it difficult to accurately predict the COVID-19 epidemic?
Clinical characteristics of coronavirus disease 2019 (COVID-19) in China: A systematic review and meta-analysis
Detecting influenza epidemics using search engine query data
Using networks to combine "big data" and traditional surveillance to improve influenza predictions
Use of Google Trends to investigate loss-of-smell-related searches during the COVID-19 outbreak
An interactive web-based dashboard to track COVID-19 in real time
Confirmed cases and deaths by country, territory, or conveyance
Practical Multivariate Analysis
Automatic Time Series Forecasting: The forecast Package for R
Interactive Web-Based Data Visualization with R, plotly, and shiny: Taylor & Francis Ltd
Accurate estimation of influenza epidemics using Google search data via ARGO
Big data. The parable of Google Flu: traps in big data analysis
Syndromic surveillance models using Web data: the case of scarlet fever in the UK
Incidence and Mortality Data Reflect Diagnostic and Reporting Factors
Reliability of Google Trends: Analysis of the Limits and Potential of Web Infoveillance During COVID-19 Pandemic and for Future Monitoring