key: cord-0471238-76la0noi authors: Yi, Dingdong; Ning, Shaoyang; Chang, Chia-Jung; Kou, S. C. title: Forecasting unemployment using Internet search data via PRISM date: 2020-10-20 journal: nan DOI: nan sha: 58e1412bfbbcdf94802360ed6720e3e8f853a843 doc_id: 471238 cord_uid: 76la0noi Big data generated from the Internet offer great potential for predictive analysis. Here we focus on using online users' Internet search data to forecast unemployment initial claims weeks into the future, which provides timely insights into the direction of the economy. To this end, we present a novel method PRISM (Penalized Regression with Inferred Seasonality Module), which uses publicly available online search data from Google. PRISM is a semi-parametric method, motivated by a general state-space formulation that contains a variety of widely used time series models as special cases, and employs nonparametric seasonal decomposition and penalized regression. For forecasting unemployment initial claims, PRISM outperforms all previously available methods, including forecasting during the 2008-2009 financial crisis period and near-future forecasting during the COVID-19 pandemic period, when unemployment initial claims both rose rapidly. PRISM can be used for forecasting general time series with complex seasonal patterns. Driven by the growth and wide availability of Internet and online platforms, big data are generated with an unprecedented speed nowadays. They offer the potential to inform and transform decision making in industry, business, social policy and public health (Manyika et al., 2011; McAfee and Brynjolfsson, 2012; Chen et al., 2012; Khoury and Ioannidis, 2014; Kim et al., 2014; Murdoch and Detsky, 2013) . Big data can be used for developing predictive models for systems that would have been challenging to predict with traditional small-sample-based approaches (Einav and Levin, 2014; Siegel, 2013) . For instance, numerous studies have demonstrated the potential of using Internet search data in tracking influenza outbreaks (Ginsberg et al., 2009; Yang et al., 2015; Ning et al., 2019) , dengue fever outbreaks (Yang et al., 2017) , financial market returns (Preis et al., 2013; Risteski and Davcev, 2014) , consumer behaviors (Goel et al., 2010) , unemployment (Ettredge et al., 2005; Choi and Varian, 2012; Li, 2016) and housing prices (Wu and Brynjolfsson, 2015) . We consider here using Internet users' Google search to forecast US unemployment initial claims weeks into the future. Unemployment initial claims measure the number of jobless claims filed by individuals seeking to receive state jobless benefits. It is closely watched by the government and the financial market, as it provides timely insights into the direction of the economy. A sustained increase of initial claims would indicate rising unemployment and a challenging economic environment. Weekly unemployment initial claim is the (unadjusted) total number of actual initial claims filed under the Federal-State Unemployment Insurance Program in each week ending on Saturday. The Employment and Training Administration (ETA) of the US Department of Labor collects the weekly unemployment insurance claims reported by each state's unemployment insurance program office, and releases the data to the public at 8:30 A.M. (eastern time) on the following Thursday. 
Thus, the weekly unemployment initial claim data are reported with a one-week delay: the number reported on Thursday 2 of a given week is actually the unemployment initial claim number of the preceding week. For accessing the general economic trend, it is, therefore, highly desirable for government agencies and financial institutions to predict the unemployment situation of the current week, which is known as nowcasting (Giannone et al., 2008) , as well as weeks into the future. In this article, we use the general phrase forecasting to cover both nowcasting (the current week) and predicting into future weeks. In contrast to the one-week delayed unemployment reports by the Department of Labor, Internet users' online search of unemployment related query terms provides highly informative and real-time information for the current unemployment situation. For instance, a surge of people's Internet search of "unemployment office", "unemployment benefits", "unemployment extension", etc. in a given week could indicate an increase of unemployment of that week, as presumably more people unemployed are searching for information of getting unemployment aid. Internet search data, offering a real-time "peek" of the current week, thus, augments the delayed official time-series unemployment data. There are several challenges in developing an effective method to forecast weekly unemployment initial claims with Internet search data. First, the volatile seasonality pattern accounts for most of the variation of the targeted time series. Figure 1 (bottom) plots the weekly unemployment initial claims from 2007 to 2016; the seasonal spikes are particularly noteworthy. A prediction method should address and utilize the strong seasonality in order to achieve good prediction performance. Second, the method needs to effectively incorporate the most up-to-date Internet search data into the modeling of targeted time series. Third, as people's search pattern and the search engine both evolve over time, the method should be able to accommodate this dynamic change. Most time series models rely on state-space models to deal with seasonality, where the latent components capture the trend and seasonality (Aoki, 1987; Harvey, 1989 ). Among the time series models, structural time series models and innovation state-space models are two main frameworks (Harvey and Koopman, 1993 ; Durbin and Koopman, plementary Material A2 for more discussion). For nowcasting time series with seasonal pattern, Varian (2013, 2014) developed a Bayesian method based on the structural time series model, using a spike-and-slab prior for variable selection, and applied it to nowcast unemployment initial claims with Google search data by treating the search data as regressors. Alternative to this regression formulation, Banbura et al. (2013) proposed a nowcasting method using a factor model, in which targeted time series and related exogenous time series are driven by common factors. Here we introduce a novel prediction method PRISM, which stands for Penalized Regression with Inferred Seasonality Module, for forecasting times series with seasonality, and use it to forecast unemployment initial claims. Our method is semi-parametric in nature, and takes advantage of both the state-space formulation for time series forecast-ing and penalized regression. 
With the semi-parametric method PRISM, we significantly expand the range of time series models available for forecasting, going beyond traditional approaches, which are often tailored to individual parametric models. PRISM offers a robust and more accurate forecasting alternative to traditional parametric approaches and effectively addresses the three aforementioned challenges in forecasting time series with strong seasonality. First, our method accommodates various nonparametric and parametric seasonal decomposition methods, which allows it to handle the strong and volatile seasonality of the target series.

While the forecasting target here is the unemployment initial claims, we want to highlight that PRISM applies to forecasting other time series with complex seasonal patterns.

The weekly (non-seasonally adjusted) initial claims are our targeted time series. The initial claims for the preceding week are released every Thursday. The time series of the initial claims from 1967 to the present is available at https://fred.stlouisfed.org/series/ICNSA.

The real-time Internet search data we used were obtained from Google Trends (www.google.com/trends) with the Python package pytrends. The Google Trends website, which is publicly accessible, provides the weekly (relative) search volume of search query terms specified by a user. Specifically, for a user-specified query term, Google Trends provides an integer-valued weekly time series (after 2004); each number in the time series, ranging from 0 to 100, represents the search volume of that query term in a given week divided by the total online search volume of that week, normalized to integer values from 0 to 100, where 100 corresponds to the maximum weekly search volume within the time period (specified by the user). Figure A1 in the Supplementary Material illustrates the Google Trends time series of several search query terms over a 5-year span.

The search query terms in our study were also identified from the Google Trends tool. One feature of Google Trends is that, in addition to the time series of a specific term (or a general topic), it also returns the top query terms that are most highly correlated with the specified term. In our study, we used a list of the top 25 Google search terms that are most highly correlated with the term "unemployment". Table A1 lists these 25 terms, which were generated by Google Trends on January 11, 2018; they include 12 general unemployment-related query terms, such as "unemployment office", "unemployment benefits" and "unemployment extension", as well as 13 query terms that are combinations of state names and "unemployment", such as "California unemployment" and "unemployment Florida".

PRISM employs a two-stage estimation procedure for forecasting a time series y_t using its lagged values and the available exogenous variables x_t. The derivation and rationale of each step are described subsequently.

Input: target time series {y_{1:(t-1)}} and exogenous time series {x_{t_0:t}}. In the forecasting of unemployment initial claims, {y_{1:(t-1)}} is the official weekly unemployment initial claim data reported with a one-week delay, and {x_{t_0:t}} is the multivariate Google Trends data starting from 2004.

Stage 1 of PRISM: Seasonal decomposition. Estimate the seasonal component {γ̂_{1:(t-1)}} and the seasonally adjusted component {ẑ_{1:(t-1)}} of the target time series, using data available at time t.

Stage 2 of PRISM: Penalized regression. Forecast the target time series using

ŷ_{t+l} = μ_y^{(l)} + Σ_{j=1}^{K} α_j^{(l)} ẑ_{t-j} + Σ_{j=1}^{K} δ_j^{(l)} γ̂_{t-j} + Σ_{j=1}^{p} β_j^{(l)} x_{j,t},    (2.1)

where the coefficients are estimated by a rolling-window L_1-penalized linear regression using historical data for each forecasting horizon: l = 0 corresponds to nowcasting; l ≥ 1 corresponds to forecasting future weeks. Note that for notational ease, we will suppress "(l)" in the superscripts of the coefficients in the subsequent discussion.
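To make the two-stage procedure concrete, below is a minimal sketch in Python, under stated assumptions: `claims` is a weekly pandas Series of initial claims observed through week t-1, and `trends` is a weekly DataFrame of Google Trends volumes (one column per query term) on the same weekly index but extending one week further, to the current week t. The function name, column names, and the single regularization parameter `lam` are illustrative choices, not the paper's implementation; the authors' R package PRISM.forecast is the reference implementation.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL
from sklearn.linear_model import Lasso  # sample_weight in fit() needs scikit-learn >= 0.23

def prism_forecast(claims, trends, horizon=0, K=52, N=156, w=0.985, lam=1.0):
    # Stage 1: nonparametric seasonal decomposition (STL is the paper's default);
    # the paper applies it to the most recent M = 700 observations.
    recent = claims.iloc[-700:]
    stl = STL(recent, period=52).fit()
    gamma_hat = stl.seasonal              # inferred seasonal component
    z_hat = recent - gamma_hat            # seasonally adjusted component

    # Predictors at each forecast origin u: K lags of z_hat and gamma_hat, plus x_u.
    idx = trends.index                    # extends one week beyond `claims`
    feats = {}
    for j in range(1, K + 1):
        feats[f"z_lag{j}"] = z_hat.reindex(idx).shift(j)
        feats[f"g_lag{j}"] = gamma_hat.reindex(idx).shift(j)
    X = pd.concat([pd.DataFrame(feats, index=idx), trends], axis=1)

    # Target for origin u and horizon l is y_{u+l}; it is missing for the current
    # week, which is exactly the row we want to predict.
    y = claims.reindex(idx).shift(-horizon)
    train = pd.concat([y.rename("y"), X], axis=1).dropna().iloc[-N:]

    # Stage 2: discounted, L1-penalized regression. A single lambda is used here for
    # simplicity; (2.5) allows separate penalties for the time-series and search blocks.
    # In practice lam would be chosen by cross-validation and predictors standardized.
    weights = w ** np.arange(len(train))[::-1]   # heavier weight on more recent weeks
    model = Lasso(alpha=lam, max_iter=100000)
    model.fit(train.drop(columns="y"), train["y"], sample_weight=weights)

    return float(model.predict(X.iloc[[-1]])[0])  # forecast of y_{t + horizon}
```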
PRISM is motivated by a general state-space formulation for the target time series, Eq. (2.2), in which y_t decomposes into a seasonally adjusted component z_t and a seasonal component γ_t, each driven by latent state vectors (h_t and s_t, respectively) that evolve linearly with Gaussian innovations iid ∼ N(0, H); θ = (w, F, v, P, H) collects the parameters. Our state-space formulation contains a variety of widely used time series models, including structural time series models (Harvey, 1989) and additive innovation state-space models (Aoki, 1987; Ord et al., 1997; Hyndman et al., 2008).

PRISM also models the contemporaneous information from exogenous variables. Let x_t = (x_{1,t}, x_{2,t}, ..., x_{p,t}) be the vector of the exogenous variables at time t. We postulate a state-space model for x_t on top of y_t, instead of adding them as regressors as in traditional models. In particular, at each time t, we assume a multivariate normal distribution for x_t conditional on the level of unemployment initial claims y_t:

x_t | y_t ∼ N(μ_x + β y_t, Q),    (2.3)

where β = (β_1, ..., β_p), μ_x = (μ_{x_1}, ..., μ_{x_p}), and Q is the covariance matrix. x_t is assumed to be independent of {y_l, x_l : l < t} conditional on y_t. For {y_t} following the general state-space model (2.2), our joint model for y_t and x_t thus has a hidden-Markov-like structure: the target series {y_t} evolves according to (2.2), and at each time t the exogenous vector x_t is generated from the contemporaneous y_t through (2.3), independently of the past.

To forecast y_{t+l} under the above model assumptions at time t, we consider the predictive distribution of y_{t+l} by conditioning on the historical data {y_{1:(t-1)}} and the contemporaneous exogenous time series {x_{t_0:t}}, as well as the latent seasonal component {γ_{1:(t-1)}}; z_{1:(t-1)} is known given y_{1:(t-1)} and γ_{1:(t-1)}. We can derive a universal representation of the predictive distribution p(y_{t+l} | z_{1:(t-1)}, γ_{1:(t-1)}, x_{t_0:t}), which is normal with mean linear in z_{1:(t-1)}, γ_{1:(t-1)} and x_t as in (2.1) (see Supplementary Material for the proof). This observation leads to our two-stage semi-parametric estimation procedure PRISM for nowcasting y_t and forecasting y_{t+l} (l ≥ 1) using all available information at time t.

PRISM estimates the unobserved seasonal components γ_{1:(t-1)} in the first stage. For this purpose, various seasonal decomposition methods can be used, including nonparametric methods such as the classical additive seasonal decomposition (Stuart et al., 1963) and parametric methods based on innovation state-space models. We use the method of Seasonal and Trend decomposition using Loess (STL) (Cleveland et al., 1990) as the default choice. The STL method is nonparametric; it is widely used and robust for decomposing time series with few assumptions owing to its nonparametric nature. At every time t for forecasting, we apply the seasonal decomposition method (such as the default STL) to the historical initial claims observations y_{1:(t-1)} available at that time, obtaining the estimated seasonal components γ̂_{1:(t-1)} and the seasonally adjusted series ẑ_{1:(t-1)} = y_{1:(t-1)} − γ̂_{1:(t-1)}.

For each fixed forecasting horizon l (≥ 0), we estimate y_{t+l} by the linear predictive equation

ŷ_{t+l} = μ_y + Σ_{j=1}^{K} α_j ẑ_{t-j} + Σ_{j=1}^{K} δ_j γ̂_{t-j} + Σ_{j=1}^{p} β_j x_{j,t},    (2.4)

where for notational ease we have suppressed l in the coefficients and used the generic notations μ_y, α_j, δ_j, with the understanding that there is a separate set of {μ_y, α = (α_1, ..., α_K), δ = (δ_1, ..., δ_K), β = (β_1, ..., β_p)} for each l. At each time t and for each forecasting horizon l, the regression coefficients μ_y, α, δ and β are obtained by minimizing the discounted, L_1-penalized least-squares criterion

Σ_{τ=t-N}^{t-1} w^{t-τ} ( y_τ − μ_y − Σ_{j=1}^{K} α_j ẑ_{τ-l-j} − Σ_{j=1}^{K} δ_j γ̂_{τ-l-j} − Σ_{j=1}^{p} β_j x_{j,τ-l} )² + λ_1 ( Σ_{j=1}^{K} |α_j| + Σ_{j=1}^{K} |δ_j| ) + λ_2 Σ_{j=1}^{p} |β_j|,    (2.5)

where N is the length of a rolling window, w is a discount factor, and λ_1 and λ_2 are nonnegative regularization parameters.

There are several distinct features of our estimation procedure. First, a rolling window of length N is employed. This is to address the fact that the parameters in the predictive Eq. (2.4) can vary with time t. In our case, people's search patterns and the search engine tend to evolve over time, and it is quite likely that the same search phrases would contribute in different ways over time to the response variable.
Correspondingly, the coefficients in (2.4) need to be estimated dynamically each week, and the more recent observations should be considered more relevant than the distant historical observations for inferring the predictive equation of the current time. The rolling window of observations and the exponentially decreasing weights are utilized for this purpose. Our use of exponential weighting is related to the weighted least-squares formulation that is usually referred to as discounted weighted regression in the econometrics literature (Ameen and Harrison, 1984; Taylor, 2010).

Second, since the number of unknown coefficients in (2.4) tends to be quite large compared to the number of observations within the rolling window, we applied L_1 regularization in our rolling-window estimation (Tibshirani, 1996), which gives robust and sparse estimates of the coefficients. Up to two L_1 penalties are applied: one on (α, δ) and one on β, as they represent two sources of information: information from the time series components {ẑ_t} and {γ̂_t}, and information from the exogenous variables {x_t}.

Third, PRISM is a semi-parametric method. The predictive Eq. (2.4) is motivated and derived from our state-space formulation (2.2). However, the estimation is not parametric in that (i) the seasonal and seasonally adjusted components are learned nonparametrically in Stage 1, and (ii) the coefficients in (2.4) are dynamically estimated each week in Stage 2. Combined together, the two stages of PRISM give us a simple and robust estimation procedure. This approach is novel and different from the typical approaches for linear state-space models, which often estimate unknown parameters using specific parametrizations and select a model based on information criteria (Hyndman et al., 2008).

In the case when exogenous time series {x_t} are not available, PRISM estimates y_{t+l} according to the following linear predictive equation:

ŷ_{t+l} = μ_y + Σ_{j=1}^{K} α_j ẑ_{t-j} + Σ_{j=1}^{K} δ_j γ̂_{t-j},

i.e., (2.4) without the exogenous term.

The semi-parametric nature of PRISM makes it more difficult to construct predictive intervals for PRISM forecasts, as we cannot rely on parametric specifications, such as posterior distributions, for predictive interval construction. However, the fact that we are forecasting time series suggests a (nonparametric) method for constructing predictive intervals based on the historical performance of PRISM. For nowcasting at time t, given the historical data available up to time t − 1, we can evaluate the root mean squared error of nowcasting over the last L time periods as

ŝe_t = [ (1/L) Σ_{τ=t-L}^{t-1} (ŷ_τ − y_τ)² ]^{1/2},

where ŷ_τ is the real-time PRISM estimate for y_τ generated at time τ. Under the assumption of local stationarity and normality of the residuals, ŝe_t would be an estimate of the standard error of ŷ_t. We can thus use it to construct a predictive interval for the current PRISM estimate. A 1 − α point-wise predictive interval is given by

[ŷ_t − z_{α/2} ŝe_t, ŷ_t + z_{α/2} ŝe_t],

where z_{α/2} is the 1 − α/2 quantile of the standard normal distribution. Supplementary Material A10 shows that the empirical residuals are approximately normal, supporting our construction. The point-wise predictive intervals for forecasting into future weeks can be constructed similarly. In practice, we set L = 52, estimating ŝe_t based on the forecasts of the most recent one-year window.

In our forecasting of unemployment initial claims, we applied a 3-year rolling window of historical data to estimate the parameters in (2.5), i.e., N = 156 weeks, and used as exogenous variables the 25 nationwide query terms related to "unemployment" in Table A1. The weekly search volume data from Google Trends are limited to at most a 5-year span per download.
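As a concrete illustration of pulling such a fixed 5-year window of weekly search volumes, here is a hedged sketch using the third-party pytrends package; the term list and date range below are examples (the full term list is in Table A1), and requests may be throttled by Google.

```python
import pandas as pd
from pytrends.request import TrendReq

terms = ["unemployment office", "unemployment benefits", "unemployment extension"]
pytrends = TrendReq(hl="en-US", tz=360)

frames = []
for term in terms:
    # One term per request, so each series keeps its own 0-100 relative scale
    # within the specified 5-year window.
    pytrends.build_payload([term], geo="US", timeframe="2004-01-01 2008-12-31")
    df = pytrends.interest_over_time()
    frames.append(df[[term]])          # drop the 'isPartial' indicator column

trends = pd.concat(frames, axis=1)     # weekly 0-100 relative search volumes
print(trends.tail())
```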
The subsequent normalization of the search volumes done by Google is based on the search query term and the specified date range, which implies that the absolute values of the search volumes are normalized differently between different 5-year ranges. However, within the same set of 5-year data the relative scale of the variables is consistent (as they are normalized by the same constant). Therefore, to avoid the variability from different normalizations across different 5-year spans, and to ensure the coherence of the model, for each week we used the same set of 5-year-span data downloaded for both the training data and the prediction.

For the choice of the discount factor, we took w = 0.985 as the default. This follows the suggestion of Lindoff (1997), who recommended setting the discount factor between 0.95 and 0.995 for practical applications.

For the regularization parameters λ_1 and λ_2 in (2.5), we used cross-validation to minimize the mean squared predictive error (i.e., the average of the prediction errors from each validation set of data). We found empirically that the extra flexibility of having two separate λ_1 and λ_2 does not give improvement over fixing λ_1 = λ_2. In particular, we found that for every forecasting horizon l = 0, 1, 2, 3, in the cross-validation process of setting (λ_1, λ_2) for separate L_1 penalties, over 80% of the weeks showed that the smallest cross-validation mean squared error when restricting λ_1 = λ_2 is within 1 standard error of the global smallest cross-validation mean squared error. For model simplicity, we thus chose to further restrict λ_1 = λ_2 when forecasting unemployment initial claims.

We used the root-mean-squared error (RMSE) and the mean absolute error (MAE) to evaluate the performance of different methods. For an estimator {ŷ_t} and horizon l, the RMSE and MAE are defined, respectively, as

RMSE_l(ŷ, y) = [ (1/(n_2 − n_1 − l + 1)) Σ_{t=n_1+l}^{n_2} (ŷ_t − y_t)² ]^{1/2},
MAE_l(ŷ, y) = (1/(n_2 − n_1 − l + 1)) Σ_{t=n_1+l}^{n_2} |ŷ_t − y_t|,

where n_1 + l and n_2 are respectively the start and the end of the forecasting period for each l.

We applied PRISM to produce forecasts of weekly unemployment initial claims for the time period of 2007 to 2016 for four time horizons: real-time, one, two, and three weeks ahead of the current time. We compared the forecasts to the ground truth, the unemployment initial claims subsequently released by the Department of Labor, and benchmarked PRISM against three alternative methods, (a) BSTS, (b) BATS and (c) TBATS, as well as a naive method that simply uses the most recently released weekly number (i.e., that of the previous week) as the prediction for the current week, one, two, and three week(s) later. The naive method serves as a baseline. Both BATS and TBATS are based on innovation state-space models; BATS is an acronym for key features of the model: Box-Cox transform, ARMA errors, Trend, and Seasonal components; TBATS extends BATS to handle complex seasonal patterns with trigonometric representations, and the initial T connotes "trigonometric". BSTS only produces nowcasting; it does not produce numbers for forecasting into future weeks.

As PRISM uses both historical unemployment initial claims data and Google search information, to quantify the contribution of the two resources we also applied PRISM without the Google search information. We denote this method as "PRISM w/o x_t". For fair comparison, in generating retrospective estimates of unemployment initial claims, we reran all methods each week using only the information available up to that week, i.e., we obtained the retrospective estimation as if we had relived the testing period of 2007−2016. The two exponential smoothing methods, (b) BATS and (c) TBATS, only use historical initial claims data and do not offer the option of including exogenous variables in their forecasting, while the method (a) BSTS allows the inclusion of exogenous variables.
Thus, for forecasting at each week t, BSTS takes the historical initial claim data and Google Trends data as input, whereas BATS and TBATS use historical initial claim data only. PRISM was applied twice: with and without Google search information. The results of BSTS, BATS and TBATS were produced by their respective R packages under their default settings.

Table 1 presents the performance of forecasting (including nowcasting) unemployment initial claims over the entire period of 2007−2016 for the four forecasting horizons. The RMSE and MAE numbers reported here are relative to the naive method, i.e., the number reported in each cell is the ratio of the error of a given method to that of the naive method. The absolute error of the naive method is reported in the parentheses. BSTS does not produce numbers for forecasting into future weeks, as its R package outputs predictions of the target time series only with exogenous variables inputted.

Several observations emerge from Table 1. First, PRISM outperforms all the alternative methods for every forecasting horizon. Second, the methods that incorporate Google search information produce better nowcasting results than the other methods that use only historical initial claim data. Third, the contribution of contemporaneous Google Trends data becomes less significant in forecasting future weeks, as evidenced by the shrinking performance gap between PRISM and "PRISM w/o x_t" from nowcasting to forecasting. Fourth, among the three methods that only use historical initial claim data, the predictive method based on PRISM without Google information outperforms the exponential smoothing methods BATS and TBATS.

Following the suggestion of a referee, we further compared the performance of PRISM to the other methods with an additional metric: the Cumulative Sum of Squared forecast Error Differences (CSSED) (Welch and Goyal, 2008), which calculates the cumulative difference in mean squared error (MSE) between PRISM and the alternative. The CSSED at time T for an alternative method m is defined as CSSED_{m,PRISM}(T) = Σ_{t=1}^{T} (e²_{t,m} − e²_{t,PRISM}), where e_{t,m} and e_{t,PRISM} are the prediction errors at time t for method m and PRISM, respectively. The detailed comparison results are given in Supplementary Material A13, which again shows that the advantage of PRISM over the alternatives is consistent over the whole evaluation period.

Table 1: Performance of PRISM and the alternative methods over 2007−2016 for four forecasting horizons: real-time, 1 week, 2 weeks and 3 weeks ahead. RMSE and MAE here are relative to the error of the naive method; that is, the number reported is the ratio of the error of a given method to that of the naive method; the absolute RMSE and MAE of the naive method are reported in the parentheses. The boldface indicates the best performer for each forecasting horizon and each accuracy metric.

To assess the statistical significance of the improved prediction power of PRISM compared to the alternatives, we conducted the Diebold-Mariano test (Diebold and Mariano, 1995), which is a nonparametric test for comparing the prediction accuracy of two time-series forecasting methods. The p-values for the four forecasting horizons are as follows:

Method          real-time      1 week         2 weeks        3 weeks
PRISM w/o x_t   7.86 × 10^-5   2.09 × 10^-2   2.10 × 10^-2   3.95 × 10^-3
BSTS            7.14 × 10^-3   --             --             --
BATS            9.17 × 10^-8   3.60 × 10^-8   1.05 × 10^-9   1.88 × 10^-9
TBATS           5.20 × 10^-9   7.24 × 10^-3   2.59 × 10^-3   1.86 × 10^-2

To further assess PRISM's performance, we applied PRISM to produce out-of-sample forecasts of weekly unemployment initial claims for the period of 2017-2019. Note that the PRISM methodology, including all the model specifications, was frozen at the end of 2016, so this evaluation is completely out of sample.
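For readers who want to reproduce this style of evaluation, the following sketch (not the authors' code) computes RMSE relative to the naive forecast, the CSSED curve defined above, and a basic Diebold-Mariano statistic with squared-error loss; `y_true`, `y_prism`, `y_naive` and `y_bench` are hypothetical aligned numpy arrays of weekly values.

```python
import numpy as np
from scipy.stats import norm

def rmse(pred, truth):
    return np.sqrt(np.mean((pred - truth) ** 2))

def cssed(y_true, y_bench, y_prism):
    # Cumulative sum of squared-error differences; positive values favor PRISM.
    return np.cumsum((y_bench - y_true) ** 2 - (y_prism - y_true) ** 2)

def dm_test(err_a, err_b, horizon=1):
    """Basic Diebold-Mariano test for equal accuracy under squared-error loss."""
    d = err_a ** 2 - err_b ** 2                  # loss differential
    T = len(d)
    dc = d - d.mean()
    # Long-run variance via autocovariances up to horizon-1 lags (rectangular kernel).
    lrv = dc @ dc / T
    for k in range(1, horizon):
        lrv += 2 * (dc[k:] @ dc[:-k]) / T
    stat = d.mean() / np.sqrt(lrv / T)
    pval = 2 * (1 - norm.cdf(abs(stat)))         # two-sided normal approximation
    return stat, pval

# Example usage with hypothetical arrays:
# rel_rmse = rmse(y_prism, y_true) / rmse(y_naive, y_true)
# curve = cssed(y_true, y_bench, y_prism)
# stat, p = dm_test(y_bench - y_true, y_prism - y_true, horizon=1)
```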
In this out-of-sample evaluation (Figure 4), PRISM uniformly outperforms the other methods in comparison and gives rather stable error reduction from the naive method over the years, both in- and out-of-sample.

The global shutdown due to the COVID-19 pandemic has heavily impacted the US economy and job market. In particular, the numbers of unemployment claims in the US have skyrocketed to record-breaking levels, with more than 40 million people in total filing for initial claims since the start of the pandemic. This phenomenon has attracted significant attention from major news media and the general public (Cohen, 2020a,b; Casselman, 2020). As the weekly unemployment claims remain "stubbornly high" (Casselman, 2020), concerns about significant layoffs and a severe economic downturn persist. Accurate and reliable forecasting of near-future unemployment claims would thus provide very valuable insights into the trend of the general economy during such trying times. In light of this, we further applied PRISM to the out-of-sample data of the COVID-19 pandemic period to evaluate its performance and robustness to such an unusual economic shock.

The wide availability of data generated from the Internet offers great potential for predictive analysis and decision making. Our study on using Internet search data to forecast unemployment initial claims illustrates this potential. PRISM is an example where traditional statistical models are brought together with more recent statistical tools, such as L_1 regularization and dynamic training. In this article we focus on using Internet search data to forecast unemployment initial claims weeks into the future. We introduced a novel method PRISM for forecasting time series with strong seasonality. PRISM is semi-parametric and can be generally applied with or without exogenous variables. PRISM is motivated from a general state-space formulation that contains a variety of widely used time series models as special cases.

We conclude this article with a few remarks on the detailed implementation of the PRISM method. We used real-time Internet search data from Google Trends, which provides publicly available data through subsampling and renormalization: a data set undergoes subsampling (Google draws a small sample from its raw search data for a search query term) and renormalization (after the sampling, Google normalizes and rounds the search volumes so that they become integers between 0 and 100 for each search query term) when downloaded. Due to the subsampling and renormalization, the search term volumes are noisy and variable (Yang et al., 2015). The L_1 regularization adopted in PRISM has shown advantages in extracting the signals and reducing redundant information from Google search data (see Supplementary Material A9 and A11). Furthermore, the dynamic training with a rolling window accounts for changes in search engine algorithms, people's search patterns, economic trends and other patterns that change over time (Burkom et al., 2007; Ning et al., 2019). This is also evident in Figure A11, where each of the 25 candidate search terms has distinct patterns, coming in and out of the dynamically fitted model throughout the time. The discount factor adopted also gives more weight to more recent data to capture more recent economic trends and Google search changes, which is similar to the data tapering idea proposed by Dahlhaus (1988) and Dahlhaus et al. (1997). The R package PRISM.forecast that implements the PRISM method is available on CRAN at https://CRAN.R-project.org/package=PRISM.forecast.
We also made the code available at https://github.com/ryanddyi/prism.

Details of the methodology, derivation and performance of PRISM are presented as follows. First, the exact search query terms used in our study, which were identified from Google Trends, are presented. Second, the general state-space formulation that motivates PRISM is presented together with a few widely used special cases. Third, the predictive distribution for PRISM forecasting, together with the mathematical proof, is described in detail. Fourth, the robustness of PRISM to the choice of the seasonal decomposition method, to the choice of the discount factor, to the choice of the length of the training window, and to the choice of the number of observations for the seasonal decomposition is presented. Fifth, the effect of regularization in PRISM is studied. Sixth, we investigate the normality of the residuals, which supports PRISM's predictive interval construction. Seventh, the fitted coefficients of PRISM are discussed. Eighth, we further compare PRISM with two additional benchmark methods. Ninth, one additional performance metric, CSSED, is studied for comparing PRISM with the alternative methods. Tenth, the year-by-year forecasting performance of different methods at longer forecasting horizons (1-week, 2-week, and 3-week ahead) is reported. Eleventh, the predictive intervals of PRISM for longer-horizon forecasting (1-week, 2-week, and 3-week ahead) are presented. Twelfth, we report the results of different methods for forecasting longer horizons during the COVID-19 pandemic period.

The real-time Internet search data we used were from Google Trends (www.google.com/trends). The search query terms that we used in our study were also identified from the Google Trends tool. One feature of Google Trends is that, in addition to the time series of a specific term (or a general topic), it also returns the top query terms that are most highly correlated with the specified term. In our study, we used a list of the top 25 Google search terms that are most highly correlated with the term "unemployment". Table A1 lists these 25 terms, which were generated by Google Trends on January 11, 2018. Figure A1, upper panel, illustrates the Google Trends time series of several search query terms over a 5-year span. Comparing these time series to the lower panel of Figure A1, which shows the unemployment initial claims in the same time period, it is evident that the former provide a noisy signal about the latter.

On the Google Trends site, the weekly data are available for at most a 5-year span in a query; the data would be automatically transformed to monthly data if one asks for more than 5 years. To model and forecast the weekly unemployment claims for the entire period of 2007-2016, we downloaded separate weekly data sets from Google Trends, covering 2004-2008, 2006-2010, 2008-2012, 2010-2014, and 2012-2016. To avoid the variability due to the normalization and to ensure the coherence of the model, for each week we kept both the training data and the data used for prediction within the same 5-year span of downloaded data. For each search term, we downloaded Google Trends data based on the same 5-year range and combined the terms into the multivariate x_t. Then we trained the model and made predictions based on a rolling window of 3 years.
So for the same set of data downloaded with the range of 2004-2008, we are able to make predictions for weeks in 2007-2008; similarly, the data covering 2006-2010 will give predictions for weeks in 2009-2010, etc. See Figure A2 for an illustration. Our state-space model, Eq. (2.2) in the main text, contains a variety of widely used time series models, including structural time series models and additive innovation statespace models. Under this general formulation, a specific parametric model can be obtained by specifying the state-space models for z t and γ t along with their dependence structure H. The following AR model with seasonal pattern is a special case: modeling z t as an autoregressive process with lag N and assuming a dummy variable formulation with period S for the seasonal component γ t : The dummy variable model for the seasonal component implies that sum of the seasonal components over the S periods, ∑ S−1 j=0 γ t−j , has mean zero and variance σ 2 ω . In the model ( set h t = (1, z t , z t−1 , . . . , z t−N+1 ) and s t = (γ t , γ t−1 , . . . , γ t−S+2 ), then it reduces to the model (A1). The basic structural model assumes that a univariate time series is the sum of trend, seasonal and irregular components, each of which follows an independent stochastic process (Harvey 1989). The model is where µ t is the trend component, and γ t and t are the seasonal and irregular components, respectively. The trend is often specified by a local level model where µ t is the level and δ t is the slope. η t and ζ t are assumed mutually independent. For time series with S periods, the seasonal component can be specified through the seasonal dummy variable model which is the same as the seasonal component in the seasonal AR model (A1). Alternatively, the seasonal pattern can be modeled by a set of trigonometric terms at seasonal frequencies λ j = 2πj/S (Harvey 1989): where ω j,t and ω * j,t , j = 1, . . . , [S/2], are independent and normally distributed with common variance σ 2 ω . Under our general state-space model, Eq. (2.2) of the main text, if we take z t = µ t + t and h t = (µ t , δ t ), then it specializes to structural time series models. In particular, for the dummy variable seasonality of (A4), s t in the general model corresponds to s t = (γ t , γ t−1 , . . . , γ t−S+2 ); and for the trigonometric seasonality of (A5), s t in the general model corresponds to s t = (γ 1,t , . . . , γ [S/2],t , γ * 1,t , . . . , γ * [S/2],t ). An alternative to structural time series models, which have multiple sources of error, innovation state-space model (Aoki 1987) , where the same error term appears in each equation, is also popular. These innovation state-space models underlie exponential smoothing methods, which are widely used in time series forecasting and have been proven optimal under many specifications of the innovation state-space model (Ord et al. 1997; Hyndman et al. 2008 ). Among exponential smoothing methods, Holt- Winters' method (Holt 1957; Winters 1960 ) is developed to capture both trend and seasonality, and it postulates a model specification similar to the basic structural model (A2)-(A4). In particular, Holt-Winters' additive method is where the components µ t , δ t and γ t represent level, slope and seasonal components of time series, and t iid ∼ N (0, σ 2 ) is the only source of error. Since Eq. 
(A6a) can be rewritten as we observe that model (A6) is special case of our general formulation with The Holt-Winters model is among a collection of innovation state-space models that are summarized in Hyndman et al. (2008) using the triplet (E, T, S), which represents model specification for the three components: error, trend and seasonality. For instance, (A6) is also referred to as local additive seasonal model or ETS (A,A,A) , where A stands for additive. Our general state-space formulation, Eq. (2.2) in the main text, also incorporates many useful model extensions as special cases, including the damped trend (Gardner Jr and McKenzie 1985) and multiple seasonal patterns (Gould et al. 2008; De Livera et al. 2011) . For example, the damped trend double seasonal model extends model (A6) to include a factor φ ∈ [0, 1) and a second seasonal component as follows: Our general model contains this extended model, where t−S 2 +1 ). The motivation of our general state-space formulation is to collectively consider all possible models under it and to semi-parametrically obtain the prediction under this large class of models. In comparison, traditional time series studies often rely on parameter estimation of specified models such as those highlighted in the previous subsections. For instance, exponential smoothing is tailored for computing the likelihood and obtaining maximum likelihood estimates of the innovation state-space models. For other parametric models with multiple sources of error, their likelihood might be evaluated by the Kalman filter, but the parameter estimation can be difficult in many cases. In the traditional parametric times series model setting, model selections are often applied by optimizing certain selection criteria (e.g. AIC or BIC), but when the class of models under consideration become really large such as Eq. (2.2) of the main text, traditional model selection methods encounter serious challenges (as they lack scalability) to operate on such a wide range of models. As a consequence, traditional parametric time series models often consider a much smaller collection of models compared to Eq. (2.2) of the main text. The cost of focusing on a small class of models is that the forecasting accuracy can substantially suffer as the risk of model misspecification is high. To relieve these challenges and improve the performance of forecasting, we use our general state-space formulation to motivate a semi-parametric method for forecasting time series. We derive and study a linear predictive model that is coherent with all possible models under Eq. (2.2) of the main text. With forecasting as our main goal, we essentially transform the question from the inference of a complicated class of statespace models into penalized regression and forecasting based on a linear prediction formulation. Under our general state-space model -Eq. (2.2) and Eq. (2.3) of the main text -given the historical data {y 1:(t−1) } and contemporaneous exogenous time series {x t 0 :t }, the predictive distribution for forecasting y t+l (l ≥ 0) at time t would be p(y t+l | y 1:(t−1) , x t 0 :t ). In PRISM, we consider the predictive distribution of y t by further conditioning on the latent seasonal component {γ t }: p(y t+l | y 1:(t−1) , γ 1:(t−1) , x t 0 :t ). (A1) Note that since z t = y t − γ t for all t, z 1:(t−1) is known given y 1:(t−1) and γ 1:(t−1) . The advantage of working on (A1) is that we can establish a universal representation of the predictive distribution as given by the next proposition. 
Proposition 1. Under our model -Eq. (2.2) and (2.3) of the main text -y t+l (l ≥ 0) conditioning on {z 1:(t−1) , γ 1:(t−1) , x t 0 :t } follows a normal distribution with the conditional mean E(y t+l | z 1:(t−1) , γ 1:(t−1) , x t 0 :t ) linear in z 1:(t−1) , γ 1:(t−1) and x t . As a partial result that leads to Proposition 1, we have, without the exogenous variables, the conditional distribution of y t+l | z 1:(t−1) , γ 1:(t−1) is normal with mean linear in z 1:(t−1) and γ 1:(t−1) . Based on Proposition 1, we can represent p y t+l | z 1:(t−1) , γ 1:(t−1) , x t 0 :t as where µ (l) We first prove the case for l = 0. Let r t = (z t , γ t ) . Since y t = z t + γ t for all t, y t | y 1:(t−1) , γ 1:(t−1) is equivalent to z t + γ t | z 1:(t−1) , γ 1:(t−1) . By treating r t as a 2-dimensional observable and α t = (h t , s t ) as the state vector, we can rewrite Eq. (2.2) of the main text as Denote r 1:t = (r 1 , . . . , r t ) and α 1:t = (α 1 , . . . , α t ) . According to the property of Gaussian linear state-space model, r 1:t and α 1:t jointly follows a multivariate normal distribution. Therefore, the sub-vector r 1:t is also normal, and r t | r 1:(t−1) follows a bivariate normal distribution with mean linear in r 1:(t−1) , i.e. where Γ t and Σ t are determined by Φ, Λ and H. For any given parameters, the above distribution can be numerically evaluated through Kalman filter. Here, we focus only on the general analytical formulation. Following (A2) and y t = 1 r t , we have y t | r 1:(t−1) ∼ N (1 Γ t r 1:(t−1) , 1 Σ t 1). Thus, given r 1:(t−1) , or equivalently z 1:(t−1) and γ 1:(t−1) , y t has a univariate normal distribution with mean linear in z 1:(t−1) and γ 1:(t−1) . When taking the exogenous variable x t into account, we have p(y t | r 1:(t−1) , x t 0 :t ) ∝ p(x t | y t )p(y t | r 1:(t−1) ). Since it follows that Hence, y t | r 1:(t−1) , x t 0 :t is normal, since the above equation is an exponential function of 8 a quadratic form of y t . By reorganizing the terms, we have and Var y t | r 1:(t−1) , Therefore, y t | z 1:(t−1) , γ 1:(t−1) , x t 0 :t has a normal distribution with mean linear in z 1:(t−1) , γ 1:(t−1) and x t . Next, we prove the case for l ≥ 1. We consider the following predictive distribution p y t+l | x t 0 :t , r 1:(t−1) ∝ p y t+l , r t | x t 0 :t , r 1:(t−1) dr t ∝ p (y t+l | x t 0 :t , r 1:t ) p r t | x t 0 :t , r 1:(t−1) dr t . Since x t 0 :t is independent of y t+l conditional on y 1:t , p(y t+l | x t 0 :t , r 1:t ) = p(y t+l | r 1:t ). Similarly, x t 0 :(t−1) is independent of r t conditional on r 1:(t−1) , which implies p(r t | x t 0 :t , r 1:(t−1) ) = p(r t | x t , r 1:(t−1) ). Note that p r t | x t , r 1:(t−1) ∝ p(x t | r t )p(r t | r 1:(t−1) ). Thus, we have p y t+l , r t | x t 0 :t , r 1:(t−1) ∝ p (y t+l | r 1:t ) p r t | x t , r 1:(t−1) ∝ p (y t+l | r 1:t ) p(x t | r t )p(r t | r 1:(t−1) ). In the first part of the proof, we have learned that r 1:(t+l) is multivariate normal. Similar to (A2), we can write r t+l | r 1:t as r t+l | r 1:t ∼ N (Γ t,l r 1:t , Σ t,l ), where Γ t,l and Σ t,l are determined by Φ, Λ and H. Hence, y t+l | r 1:t ∼ N (1 Γ t,l r 1:t , 1 Σ t,l 1). Combining the above results with (A3), we have p y t+l , r t | x t 0 :t , r 1: whose right hand side is an exponential function of a quadratic form of y t+l and r t . Hence, y t+l , r t | x t 0 :t , r 1:(t−1) is multivariate normal. Consequently, the marginal distribution of y t+l | x t 0 :t , r 1:(t−1) is univariate normal. 
Moreover, the conditional expectation is E y t+l | x t 0 :t , r 1:(t−1) = E E (y t+l | r 1:t ) | x t 0 :t , r 1:(t−1) = E Γ t,l (r 1:(t−1) , r t ) | x t 0 :t , r 1:(t−1) where E(r t | x t , r 1:(t−1) ) is linear in x t and r 1:(t−1) . Therefore, y t+l | x t 0 :t , r 1:(t−1) is univariate normal with mean linear in x t and r 1:(t−1) . We compare the performance of PRISM with two seasonal decomposition methods: STL and the classic additive decomposition. Both methods decompose target time series y t into the trend component T t , the seasonal component S t and the irregular component R t : In the classic additive decomposition, the estimated trend componentT t is calculated from the moving average of {y t }. The seasonal component S t for a certain week is assumed to be the same in each period, and is estimated by the average of the detrended value, y t −T t for the specific week. For example, assuming 52 weeks in a year, the seasonal index for the fifth week is the average of all the detrended fifth week values in the data. In contrast, STL relies on a sequence of applications of loess smoother to generate the seasonal and trend components. The implementation of STL involves two loops. In the outer loop, robustness weights are assigned to each data point to reduce the effects of outliers, while the inner loop iteratively updates the trend and seasonal components. In the inner loop iteration, the seasonal components are updated using detrended time series similar to classic additive decomposition, and the trend components are calculated by loess smoothing of the deseasonalized time series. Both STL and the classic additive seasonal decomposition are options in the R package of PRISM with STL being the default setting. Table A2 describes the performance of PRISM with the two different methods. The numerical result of PRISM is quite robust to the choice of the seasonal decomposition method with STL providing slightly better overall result. We tested the effect of the discount factor w. Lindoff (1997) suggested setting w between 0.95 and 0.995 for practical applications. Table A3 shows the performance of PRISM with different w ∈ [0.95, 1] (note that w = 1 corresponds to no discount). The performance of PRISM is seen to be quite robust for the different choices of w between [0.95, 0.995]. The discount factor w is an option in our R package of PRISM, and the default value is set to be 0.985. We chose w = 0.985 as it gives the optimal real-time prediction accuracy and close to optimal prediction accuracy in longer forecasting horizons, as reported in Table A3 . We can also see that comparing with no discounting (w = 1), our default choice of discount factor can provide 1% to 5% more error reduction in terms of the relative errors to the naive method. This motivates us to incorporate the discount factor in our model. In this section we conducted an empirical analysis on the choice of the rolling window length in training PRISM. With the limited, 5-year availability of weekly Google search data (more details in Section 2.1 and Supplementary Material A1), we varied the rolling window lengths between 2 and 4 years to keep the fitting and prediction within the same set of downloaded data. Table A4 reports the result for various training window length. We observe that the default 3-year window gives the leading performance. It appears that the 3-year window provides a good trade-off between the timeliness of short-term training data and the statistical efficiency from long-term, large-size data. 
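To make the comparison between the two decomposition options above concrete (STL versus the classical additive decomposition), here is a small sketch using statsmodels; `claims` is a hypothetical weekly pandas Series spanning several years, and the parameter choices are illustrative.

```python
from statsmodels.tsa.seasonal import STL, seasonal_decompose

# STL: loess-based; robust=True activates the outer loop that downweights outliers.
stl_fit = STL(claims, period=52, robust=True).fit()

# Classical additive decomposition: moving-average trend and a seasonal pattern that
# repeats an identical 52-week cycle every year.
classical = seasonal_decompose(claims, model="additive", period=52)

# Seasonally adjusted series under each method (playing the role of z_hat in Stage 1):
z_stl = claims - stl_fit.seasonal
z_classical = claims - classical.seasonal
```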
For the seasonal decomposition in Stage 1 of PRISM, we used M = 700 historical initial claim observations as the default choice. A relatively large M (for example, more than 10 years of data, M ≥ 520) is preferred to ensure the stability of the decomposition. We also conducted an experiment on the sensitivity of the choice of M. Table A5 reports the result, which shows that the performance of PRISM is quite robust to the choice of M, and that the default choice of M = 700 gives rather favorable performance.

We now assess different types of regularization in fitting the coefficients of PRISM, examining the effects of the L_1, L_2 and elastic net (Zou and Hastie 2005) penalties. For L_2 regularization, we replaced the L_1-norm penalty in (2.5) with L_2 norms (i.e., using ridge regression). For the elastic net, the L_1 penalty in (2.5) is replaced by a penalty that mixes the L_1 and squared L_2 norms, where a is the weight between the L_1 and L_2 penalization. Cross-validation is used for selecting both tuning parameters λ and a. The results of using different types of regularization are given in Table A6. We can see the clear advantages of the L_1 penalty in error reduction, especially in longer-horizon forecasts. Unlike the L_1 regularization, the L_2 penalty will not produce zeros in the fitted coefficients. That is, noisy search terms that are not relevant for the prediction will remain in the model with small, non-zero coefficients, which may lead to more variability in the prediction. In contrast, with the L_1 penalty, the noise and less relevant information from these search terms tend to be eliminated, while only the most predictive search terms are selected to remain in the model. Similarly, although the elastic net is in theory more flexible with both L_1 and L_2 penalties, in practice it does not do as well as the L_1 penalization due to the noisy search terms. The results here thus show that the noise reduction by the L_1 penalty is quite effective in improving the performance.

A10. Normality of Residuals

Here we provide empirical evidence on the normality of the residuals for constructing the predictive intervals. Figure A3 shows the normal Q-Q plot of the fitted residuals by PRISM from 2008-2016. Except for a few points, the vast majority of the fitted residuals fall along the normal line, which supports the construction of the predictive intervals by PRISM in Section 2.8 of the main text.

We report the dynamically fitted coefficients by PRISM from each week's prediction in 2008-2016 in Figure A4. Focusing on the coefficients of the Google search terms (the bottom section of the heatmap), we can see the sparsity of the coefficients induced by the L_1 penalty. On average, 6.20 out of the 25 search terms are selected and included in each week's model by PRISM during 2008-2016. Note that all 25 terms have been included at some point.
In addition, we can also see different search terms come in and out of the model over time, indicating the dynamic movement for each term in its contribution to the final forecasting. In this section, we compare PRISM with two additional methods: the seasonal AR model and an AR model with a single Google search term as the exogenous variable (D'Amuri and Marcucci 2017). We fitted the seasonal AR with R package forecast (Hyndman et al. 2007) where the order for AR component is selected by BIC. For the method of D'Amuri and Marcucci (2017), following the procedure specified in the paper, the Google search volume of "jobs" was included as the exogenous variable. Table A7 summarizes the comparison result. It shows that PRISM leads both methods by a large margin. To assess the statistical significance of the improved prediction power of PRISM over the alternative methods, the p-values of the Diebold-Mariano test (Diebold and Mariano 1995) are reported in Table A8 (the null hypothesis being that PRISM and the alternative method in comparison have the same prediction accuracy in RMSE). With all the p-values smaller than 0.008, Table A8 shows that the improved prediction accuracy of PRISM over the seasonal AR model and the method of D'Amuri and Marcucci (2017) is statistically significant in all of the forecasting horizons evaluated. In this section, we compare PRISM with the benchmark methods according to an additional metric: Cumulative Sum of Squared forecast Error Differences (CSSED) Welch and Goyal (2008) , which was recommended by a referee. The CSSED at time T for benchmark method m is defined by CSSED m,PRISM = ∑ T t=1 (e 2 t,m − e 2 t,PRISM ), where e t,m and e t,PRISM are the prediction errors at time t for the benchmark method m and PRISM respectively. A positive CSSED indicates the benchmark is less accurate than PRISM. Figures A5 -A8 showcase the CSSED between different benchmarks and PRISM over the period of 2007-2016. Consistent with the RMSE and MAE reported in Table 1 of the main text, PRISM is the leading method. We also see a major gain in prediction accuracy for PRISM in 2009 during the financial crisis. This also indicates the robustness of PRISM by incorporating Google search information in forecasting. Table A9 compares the performance of PRISM and the alternative methods for longerhorizon forecasting (namely, forecasting 1-week, 2-week and 3-week ahead) during the COVID-19 pandemic period. The evaluation period is from April 11, 2020 to July 18, 2020, the time period where the signal from the unprecedented jump of unemployment initial claims due to the COVID-19 pandemic trickled in and challenged longer horizon forecasting. With input from Google search information, PRISM shows clear advantages over the alternative methods in 1-week ahead and 2-week ahead (i.e., near-future) predictions. As we are looking further into the future, the information from Google search data becomes less powerful for prediction, and the performance gap between PRISM and the other methods shrinks. This observation is consistent with what we observed in Table 1 of the main text. For 3-week ahead forecasting, PRISM is the second best, trailing TBATS (although TBATS performs quite poorly in nowcasting -often worse than the naive method -as shown in Table 5 and Figure 6 of the main text). Table A9 : Performance of PRISM and benchmark methods during COVID-19 pandemic period (April 11, 2020 to July 18, 2020) for three forecasting horizons: 1 week, 2 weeks, and 3 weeks. 
RMSE and MAE here are relative to the error of the naive method; that is, the number reported is the ratio of the error of a given method to that of the naive method; the absolute RMSE and MAE of the naive method are reported in the parentheses. The boldface indicates the best performer for each forecasting horizon and each accuracy metric.

Similar to the nowcasting performance in Figures 2 and 4, PRISM gives the leading performance in most of the years for longer-horizon forecasting.

References
Discount weighted estimation
State space modeling of time series
Now-casting and the real-time data flow
Automated time series forecasting for biosurveillance
Rise in Unemployment Claims Signals an Economic Reversal
Business intelligence and analytics: From big data to big impact
Predicting the present with Google Trends
STL: A seasonal-trend decomposition
Rise in Unemployment Claims Signals an Economic Reversal
Automatic time series for forecasting: the forecast package for R
Forecasting with exponential smoothing: the state space approach
Big data meets public health
Big-data applications in the government sector
Nowcasting with big data: is Google useful in presence of other information
On the optimal choice of the forgetting factor in the recursive least squares estimator
Big data: The next frontier for innovation, competition, and productivity. McKinsey & Company
Big data: The management revolution
The inevitable application of big data to health care
Accurate regional influenza epidemics tracking using Internet search data
Estimation and prediction for a class of dynamic nonlinear statistical models
Quantifying trading behavior in financial markets using Google Trends
Can we use daily Internet search query data to improve predicting power of EGARCH models for financial time series volatility
Bayesian variable selection for nowcasting economic time series
Predicting the present with Bayesian structural time series
Time series analysis and its applications: with R examples
Predictive analytics: The power to predict who will click, buy, lie, or die
The advanced theory of statistics. Griffin
Exponentially weighted methods for forecasting intraday time series with multiple seasonal cycles
Regression shrinkage and selection via the lasso
A comprehensive look at the empirical performance of equity premium prediction
Forecasting sales by exponentially weighted moving averages
The future of prediction: How Google searches foreshadow housing prices and sales
Advances in using Internet searches to track dengue
Accurate estimation of influenza epidemics using Google search data via ARGO
Regularization and variable selection via the elastic net