key: cord-0154410-pnapi6dt
authors: Wang, Tao; Ma, Simin; Baek, Soobin; Yang, Shihao
title: COVID-19 Hospitalizations Forecasts Using Internet Search Data
date: 2022-02-03
journal: nan
DOI: nan
sha: e61b88ea1d79f4a83407fa0def96853051a3a748
doc_id: 154410
cord_uid: pnapi6dt

As COVID-19 spreads across the globe and new variants keep emerging, reliable real-time forecasts of COVID-19 hospitalizations are critical for public health decisions on the allocation of medical resources such as ICU beds, ventilators, and personnel in preparation for surges of the COVID-19 pandemic. Inspired by the strong association between public search behavior and hospital admissions, we extended a previously proposed influenza tracking model, ARGO (AutoRegression with GOogle search data), to predict national and state-level COVID-19 new hospital admissions up to two weeks ahead. Leveraging COVID-19 related time series information and Google search data, our method robustly captures the surges caused by new COVID-19 variants and self-corrects at both the national and state level. In our retrospective out-of-sample evaluation over a 12-month comparison period, our method achieves on average a 15% error reduction over the best alternative models collected from the COVID-19 forecast hub. Overall, we show that our method is flexible, self-correcting, robust, accurate, and interpretable, making it a potentially powerful tool to assist health-care officials and decision making for current and future infectious disease outbreaks.

COVID-19, an acute respiratory syndrome disease caused by a coronavirus, has spread worldwide, causing over 120,695,785 reported cases and 4,987,755 reported deaths [1]. During the continuous spread of COVID-19, many variants (alpha, delta, omicron, etc.) have emerged, leading to drastic surges in hospital admissions and shortages in health care resources [2].
Therefore, an accurate hospital admissions forecasting model is crucial to inform hospitals and policy makers of the possibility and timing of rapid changes, so that they can respond to and prepare for future COVID-19 spread. The Centers for Disease Control and Prevention (CDC) [3] has been collecting hospitalization predictions from various research teams and producing ensemble and baseline predictions since May 2020. According to the weekly COVID-19 forecast submissions compiled by the CDC [3], machine learning [4, 5] and compartmental models [6, 7] are the most popular forecasting approaches. For example, Rodríguez et al. [4] use a neural network architecture that takes COVID-19 time series and mobility information as inputs, whereas Jin et al. [5] use attention and transformer models to compare and combine past COVID-19 trends for future forecasts. On the other hand, Vespignani et al. [6] and Kinsey et al. [7] adapt the SEIR compartmental model as the baseline structure and combine it with exogenous variables, including spatial-temporal and mobility information, to build more sophisticated models that capture COVID-19 disease dynamics and forecast hospitalizations.

Meanwhile, statistical and data-driven models that take advantage of COVID-19 public search information for hospitalization predictions have not drawn much attention. Over the last decade, numerous studies have suggested that online search data could be a valuable component for monitoring and forecasting infectious diseases such as influenza [8, 9, 10, 11, 12, 13], HIV/AIDS [14], dengue [15], etc. For instance, Google Flu Trends (GFT) [8], a digital disease detection system that uses the volume of

We conducted a retrospective evaluation of weekly hospital admissions for the period between January 4, 2021 and December 27, 2021, at both the national and state level.
To evaluate prediction performance, we calculated the root mean square error (RMSE), the mean absolute error (MAE) and the Pearson correlation coefficient (Cor) of one-week-ahead and two-week-ahead predictions (detailed in the Methods section). All comparisons are based on the original scale of the ground truth of new hospital admissions released by the U.S. Department of Health and Human Services (HHS) [17].

National one-week-ahead and two-week-ahead predictions of new hospitalizations were generated using (i) the ARGO-inspired model, (ii) a persistence (naive) model and (iii) an AR7 model. The naive method simply uses the past week's hospitalizations as the current week's forecast, without any modeling effort. AR7 is an autoregressive model of lag 7 (implemented in the R package forecast [18]). For a fair comparison, all models were trained on a 56-day rolling window. Retrospective out-of-sample predictions of daily national hospitalizations were made every week from January 4, 2021 to December 27, 2021 by the three models and were then aggregated into weekly new hospitalizations.

To further demonstrate the prediction accuracy and robustness of ARGO, we collected hospitalization predictions of two benchmark methods (COVIDhub-baseline and COVIDhub-ensemble) from the COVID-19 forecast hub [3]. The COVIDhub-baseline is a persistence method that uses the latest daily observation as its future daily predictions [3]. The COVIDhub-ensemble uses the medians of the hospitalization predictions submitted to the COVID-19 forecast hub as its "ensemble" forecasts. We also provide a full comparison with the top 5 teams from the CDC's COVID-19 forecast hub in the Supplementary Materials (Table S4).

Table 1 summarizes the national one-week-ahead and two-week-ahead prediction performance from January 4, 2021 to December 27, 2021. During this period, ARGO outperforms all the benchmark models in every error metric for both one-week-ahead and two-week-ahead predictions.
Specifically, for the national one-week-ahead predictions, ARGO performs better than the best alternative method by around 27% in RMSE, 39% in MAE and 1.5% in Cor. The two-week-ahead ARGO forecasts show a slightly smaller error reduction in RMSE and MAE and a larger increase in Cor. The results demonstrate that ARGO is able to produce accurate and robust retrospective out-of-sample national one-week-ahead and two-week-ahead hospitalization predictions during the evaluation period.

Table 1. Error metrics of national one-week-ahead and two-week-ahead new hospitalization predictions. The best scores are highlighted in boldface. All comparisons are based on the original scale of hospitalizations released by HHS. Methods are sorted by their average RMSE over one-week-ahead and two-week-ahead predictions.

On average, the ARGO model outperforms the best alternative method by approximately 18% in RMSE, 25% in MAE and 4% in Cor. Overall, ARGO has better predictions than all the benchmark methods during our comparison period. All of the time series forecasting methods exhibit some delay, to varying degrees, because their input features are lagged. Fortunately, by utilizing Google search information, ARGO is able to overcome such delayed behavior and is the only method that captures the hospital admission peaks around April 2021 and September 2021, as well as the surge around December 2021 possibly caused by omicron, for both one-week-ahead and two-week-ahead predictions. Moreover, by leveraging both time series information and Internet search information, ARGO is able to avoid sudden spikes and drops in its predictions. ARGO is also self-correcting and can quickly recover from over-shooting (e.g. between August and September 2021 for the two-week-ahead predictions).

We also conducted retrospective out-of-sample one-week-ahead and two-week-ahead predictions for the 51 states in the U.S. (including Washington D.C.) during the same period of January 4, 2021 to December 27, 2021.
Table 2 summarizes the average error metrics of all methods' state-level predictions from January 4, 2021 to December 27, 2021. For the one-week-ahead predictions, ARGO achieves uniformly the best performance in all error metrics. Compared with the two benchmark models from the COVID-19 forecast hub, ARGO yields

The ARGO-inspired model combines autoregressive COVID-19 related information and online search data. It is able to produce accurate, reliable real-time hospitalization predictions at both the national and state level for one to two weeks ahead. The state-level real-time hospitalization predictions made by ARGO could help local public health officials make timely allocation decisions for healthcare resources such as ventilators, ICU beds, personal protective equipment, personnel, etc., as well as account for and promptly prepare for future surges of the COVID-19 pandemic caused by new virus variants. Furthermore, our ARGO hospitalization prediction model is a straightforward adaptation of the original ARGO model for influenza [9], which reduces the chance of overfitting and again demonstrates ARGO's robustness and flexibility.

Although ARGO shows strong performance in hospitalization forecasts, its accuracy depends on the reliability of its inputs. Google search data can be noisy due to the instability of Google Trends' sampling approach and public fear. Especially for state-level Google search data, low search intensity can make the search data unrepresentative of people's real interest. Fortunately, the IQR filter [19] and moving average smoothing applied to the Google search data reduce the risk caused by such noisiness and help ARGO produce robust output. To further account for the instability in state-level Google search queries, the query terms are identified using national-level data, where the search frequencies are more representative, with lower noise and higher stability.
ARGO selects the most representative search queries according to their Pearson correlation coefficients with hospital admissions. In addition, the national-level search frequencies are directly used as input features for state-level predictions. An optimal delay between the search data and the hospital admissions is also identified for each query term. Altogether, ARGO is able to overcome the sparsity issues of Google search queries and produce robust future estimates (Figure 1) while avoiding overfitting.

Another challenge in using online search data to estimate hospitalizations is that the predictive information in Google search data dies down as the forecast horizon expands (shown in Tables 1 and 2). In our results for COVID-19 hospitalization predictions, the optimal lags (delays) of some Google search terms are relatively small, as shown in Table 3, which indicates that those Google search queries are more effective for short-term prediction of hospitalizations. Nevertheless, by leveraging COVID-19 related time series information, ARGO is able to adjust the focus between the time series and the Internet search information features when forecast

We focused on national hospital admission predictions and state-level predictions for 51 states in the United States, including Washington D.C. The inputs consist of confirmed incremental cases, percentage of vaccinated population, confirmed new hospital admissions, and Google search query frequencies. Both state-level and national data were obtained directly from the respective data sources outlined in this section. Our prediction method is inspired by ARGO [9], with details presented in this section as well.

We used reported COVID-19 confirmed incremental cases from JHU CSSE data [20], the percentage of vaccinated population from the Centers for Disease Control and Prevention (CDC) [21] and COVID-19 confirmed new hospital admissions from HHS [17]. The data sets were collected from July 15, 2020 to January 15, 2021.
Google Trends provides the estimated Google search frequency for a specified query term [22]. We obtained online search data from Google Trends [22] for the period from July 15, 2020 to January 15, 2021. To retrieve the time series of search frequencies for a desired query, one needs to specify the query's geographical information and time frame on Google Trends. The frequency returned by Google Trends is obtained by sampling all raw Google search frequencies that contain this query [22]. We collected daily Google search frequencies of 256 top searched COVID-19-related terms based on previous work on COVID-19 death forecasts [19].

The raw Google search frequencies obtained from Google Trends [22] are observed to be unstable and sparse [19]. Such instability and sparsity can negatively affect the prediction performance of linear regression models, which are sensitive to outliers. To deal with such outliers in Google search data, we used an IQR filter [19] to remove and correct outliers on a rolling window basis. Search data points that lie more than 3 standard deviations from the past 7-day average are examined and removed [19].

The trends of Google search frequencies are often a few days ahead of hospitalizations, indicating that the search data might contain predictive information about hospitalizations. Figure 3 demonstrates the delay between Google search query frequencies and national hospitalizations. To fully utilize the predictive information in national Google search terms, we apply optimal lags [19] to the filtered Google search frequencies to match the trends of national hospitalizations. For each query, a linear regression of COVID-19 new hospitalizations is fitted against the lagged Google search frequency, considering a range of lags. The lag that results in the lowest mean squared error is selected as the optimal lag for that query. The data used to find the optimal lags are from August 1, 2020 to December 31, 2020.
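The lag selection described above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function name `optimal_lag`, the candidate range `max_lag`, and the synthetic series are all assumptions made here, since the text only says that "a range of lags" is considered. Each candidate lag gets an ordinary least squares fit of admissions on lagged search volume, and the lag with the lowest mean squared error is kept.

```python
import numpy as np


def optimal_lag(search, admissions, max_lag=14):
    """Return (best_lag, mse_by_lag) for one query term.

    For each lag in 0..max_lag, fit admissions[t] ~ a + b * search[t - lag]
    by ordinary least squares and record the mean squared error. Every lag is
    scored on the same days (t >= max_lag) so the errors are comparable.
    max_lag is a hypothetical choice; the source does not state the range.
    """
    search = np.asarray(search, dtype=float)
    admissions = np.asarray(admissions, dtype=float)
    n = len(admissions)
    if len(search) != n:
        raise ValueError("search and admissions must have the same length")
    if n <= max_lag + 2:
        raise ValueError("series too short for the requested max_lag")

    y = admissions[max_lag:]  # regression targets, shared by all lags
    mse_by_lag = {}
    for lag in range(max_lag + 1):
        x = search[max_lag - lag : n - lag]  # search volume `lag` days earlier
        design = np.column_stack([np.ones_like(x), x])  # intercept + slope
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        mse_by_lag[lag] = float(np.mean((y - design @ coef) ** 2))

    best_lag = min(mse_by_lag, key=mse_by_lag.get)
    return best_lag, mse_by_lag


# Synthetic illustration (made-up data): admissions follow search by 5 days.
rng = np.random.default_rng(0)
days = np.arange(150)
search = 50 + 30 * np.sin(2 * np.pi * days / 60) + rng.normal(0, 1, days.size)
admissions = np.empty(days.size)
admissions[5:] = 10 + 2 * search[:-5]
admissions[:5] = admissions[5]  # pad the first days, which have no lagged value
admissions += rng.normal(0, 0.5, days.size)

best_lag, mse_by_lag = optimal_lag(search, admissions, max_lag=14)
print("selected lag (days):", best_lag)
```

Scoring all lags on a common set of days is a design choice of this sketch; it avoids favoring lags that happen to be evaluated on fewer or easier observations.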
Among the 256 COVID-19 related terms, we further selected the queries whose correlation coefficients with national COVID-19 hospitalizations were larger than 0.6 for the period from August 1, 2020 to December 31, 2020. We applied a 7-day moving average to further smooth out weekly fluctuations in the selected Google search queries. Table 3 shows the 11 selected important terms as well as their optimal lags. Table 3 supports our hypothesis that when people get infected, they probably first search for general queries like "symptoms of the covid-19", as this query has a relatively large optimal lag. After the symptoms develop, people might begin to search for specific symptoms such as "loss of smell", which has a relatively smaller optimal lag.

Let $y_{t,r}$ be the daily hospital admissions of region $r$ on day $t$; $X_{k,t}$ be the Google search data of term $k$ on day $t$; $c_{t,r}$ be the JHU COVID-19 incremental confirmed cases on day $t$ of region $r$; $v_{t,r}$ be the cumulative percentage of people vaccinated by day $t$ in region $r$; and $\mathbb{I}_{\{t,d\}}$ be the weekday indicator for day $t$ (i.e. $\mathbb{I}_{\{t,1\}}$ indicates day $t$ being Monday). Standing on day $T$, to predict the $l$-day-ahead hospital admissions of state $r$, $\hat{y}_{T+l,r}$, we used a penalized linear estimator as follows:

$$\hat{y}_{T+l,r} = \mu_{y,r} + \sum_{i=0}^{I} \alpha_{i,r}\, y_{T-i,r} + \sum_{j \in J} \beta_{j,r}\, c_{T+l-j,r} + \sum_{m \in M_r} \gamma_{m,r}\, y_{T,m} + \phi_{Q,r}\, v_{T+l-Q,r} + \sum_{k=1}^{K} \delta_{k,r}\, X_{k,T+l-\hat{O}_k} + \sum_{d=1}^{6} \tau_{d,r}\, \mathbb{I}_{\{T+l,d\}}$$

where $I = 6$, considering one consecutive week of lagged daily hospital admissions; $J = \max(\{7, 28\}, l)$, considering lagged confirmed cases; $M_r$ is the set of geographically neighboring states of state $r$; $Q = \max(7, l)$, considering vaccination data lagged by one week; $\hat{O}_k = \max(O_k, l)$ is the adjusted optimal lag for term $k$; and $K = 11$, considering the 11 selected Google search terms. The coefficients for $l$-day-ahead predictions of region $r$, $\{\mu_{y,r}, \alpha = (\alpha_{0,r}, \ldots, \alpha_{6,r}), \beta = (\beta_{1,r}, \ldots, \beta_{|J|,r}), \gamma = (\gamma_{1,r}, \ldots, \gamma_{|M_r|,r}), \phi = \phi_{\max(7,l),r}, \delta = (\delta_{1,r}, \ldots, \delta_{11,r}), \tau = (\tau_{1,r}, \ldots, \tau_{6,r})\}$, were computed by

$$\underset{\mu_{y,r},\alpha,\beta,\gamma,\phi,\delta,\tau,\lambda}{\operatorname{argmin}} \sum_{t=T-M-l+1}^{T-l} \omega^{T-l-t+1} \Big( y_{t+l,r} - \mu_{y,r} - \sum_{i=0}^{I} \alpha_{i,r}\, y_{t-i,r} - \sum_{j \in J} \beta_{j,r}\, c_{t+l-j,r} - \sum_{m \in M_r} \gamma_{m,r}\, y_{t,m} - \phi_{Q,r}\, v_{t+l-Q,r} - \sum_{k=1}^{K} \delta_{k,r}\, X_{k,t+l-\hat{O}_k} - \sum_{d=1}^{6} \tau_{d,r}\, \mathbb{I}_{\{t+l,d\}} \Big)^2 + \lambda_\alpha \|\alpha\|_1 + \lambda_\beta \|\beta\|_1 + \lambda_\gamma \|\gamma\|_1 + \lambda_\phi |\phi| + \lambda_\delta \|\delta\|_1 + \lambda_\tau \|\tau\|_1$$

where $M = 56$ is the length of our training period and $\omega = 0.8$ is the exponentially time-decaying weight, which assigns higher weight to more recent observations. Region $r$ ranges over the U.S. and its 51 states, including Washington D.C. For U.S. national-level training, the hospitalizations of neighboring states, $y_{t,m}$, and their coefficients, $\gamma$, are excluded. To address the sparsity of Google search data, we used an L1-norm penalty. For simplicity, the hyperparameters $\lambda = (\lambda_\alpha, \lambda_\beta, \lambda_\gamma, \lambda_\phi, \lambda_\delta, \lambda_\tau)$ for the L1-norm penalty were set to be equal and obtained via 10-fold cross-validation.

With the formulation above, on each Monday from January 4, 2021 to December 27, 2021, we iteratively trained our model and made national and state-level retrospective out-of-sample hospitalization predictions up to 14 days into the future. We then aggregated the daily predictions into one-week-ahead and two-week-ahead predictions. For example, $\hat{y}_{T+1:T+7,r} = \sum_{i=1}^{7} \hat{y}_{T+i,r}$ and $\hat{y}_{T+8:T+14,r} = \sum_{i=8}^{14} \hat{y}_{T+i,r}$ are the one-week-ahead prediction and the two-week-ahead prediction on day $T$ of region $r$, respectively.

Root Mean Squared Error (RMSE) between a hospitalization estimate $\hat{y}_t$ and the true value $y_t$ over the period $t = 1, \ldots, T$ is $\sqrt{\frac{1}{T} \sum_{t=1}^{T} (\hat{y}_t - y_t)^2}$. Mean Absolute Error (MAE) between an estimate $\hat{y}_t$ and the true value $y_t$ is $\frac{1}{T} \sum_{t=1}^{T} |\hat{y}_t - y_t|$. Correlation is the Pearson correlation coefficient between $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_T)$ and $y = (y_1, \ldots, y_T)$. All estimates $\hat{y}_t$ and true values $y_t$ were aggregated weekly before calculating RMSE, MAE and Cor.

After filtering out teams with incomplete submissions and misaligned prediction dates, we collected predictions of the top 5 models from the COVID-19 forecast hub [3] for comparison.

Coronavirus pandemic (COVID-19). Our World in Data
States with the biggest hospital staffing shortages
The United States COVID-19 forecast hub dataset.
medRxiv
DeepCOVID: An operational deep learning-driven framework for explainable real-time COVID-19 forecasting
Inter-series attention model for COVID-19 forecasting
Global epidemic and mobility model
Detecting influenza epidemics using search engine query data
Accurate estimation of influenza epidemics using Google search data via ARGO
Combining search, social media, and traditional data sources to improve influenza surveillance
Improved state-level influenza nowcasting in the United States leveraging Internet-based data and network approaches
Using Internet searches for influenza surveillance
Monitoring influenza epidemics in China with search query from Baidu
Using search engine big data for predicting new HIV diagnoses
Advances in using Internet searches to track dengue
A predictive Internet-based model for COVID-19 hospitalization census
Department of Health and Human Services. Healthdata.gov COVID-19 reported patient impact and hospital capacity by state timeseries
forecast: Forecasting functions for time series and linear models
COVID-19 forecasts using Internet search information in the United States
An interactive web-based dashboard to track COVID-19 in real time
COVID-19 vaccinations in the United States
FAQ about Google Trends data

T.W., S.M., S.B. and S.Y. designed the research; T.W., S.M., S.B. and S.Y. performed the research; T.W. and S.B. analyzed data and conducted the experiment(s); T.W. and S.M. analyzed the results. T.W., S.M. and S.Y. wrote the paper. All authors reviewed the manuscript.

The authors declare no competing interests.