key: cord-0570482-nx5amoup authors: Morris, Michael; Hayes, Peter; Cox, Ingemar J.; Lampos, Vasileios title: Estimating the Uncertainty of Neural Network Forecasts for Influenza Prevalence Using Web Search Activity date: 2021-05-26 journal: nan DOI: nan sha: e174c5c7d5ad395446a856ea42e78e5c35235854 doc_id: 570482 cord_uid: nx5amoup Influenza is an infectious disease with the potential to become a pandemic, and hence, forecasting its prevalence is an important undertaking for planning an effective response. Research has found that web search activity can be used to improve influenza models. Neural networks (NN) can provide state-of-the-art forecasting accuracy but do not commonly incorporate uncertainty in their estimates, something essential for using them effectively during decision making. In this paper, we demonstrate how Bayesian Neural Networks (BNNs) can be used to both provide a forecast and a corresponding uncertainty without significant loss in forecasting accuracy compared to traditional NNs. Our method accounts for two sources of uncertainty: data and model uncertainty, arising due to measurement noise and model specification, respectively. Experiments are conducted using 14 years of data for England, assessing the model's accuracy over the last 4 flu seasons in this dataset. We evaluate the performance of different models including competitive baselines with conventional metrics as well as error functions that incorporate uncertainty estimates. Our empirical analysis indicates that considering both sources of uncertainty simultaneously is superior to considering either one separately. We also show that a BNN with recurrent layers that models both sources of uncertainty yields superior accuracy for these metrics for forecasting horizons greater than 7 days. Influenza is an infectious respiratory disease responsible for 290,000 to 650,000 deaths annually according to the World Health Organisation. Estimating the prevalence of influenzalike-illness (ILI) in a population and forecasting its future trajectory is an important area of research (Shaman & Karspeck, 2012; Shaman et al., 2013; Yang et al., 2015b; Nsoesie et al., 2014) as it can contribute to better-informed policy decisions for public health, e.g. when to recommend the use of antiviral drugs. Established approaches to epidemic forecasting, such as mechanistic models, are commonly based on syndromic, clinical, and demographic data (Birrell et al., 2011; Shaman & Karspeck, 2012) . These methods attempt to mathematically model a disease, usually as a set of ordinary differential equations. They rely on assumptions such as random and uniform mixing of populations which often do not hold (Roberts et al., 2015) . In recent years, a considerable body of work has shown that data generated by web users during their interaction with search engines or social media platforms can be used to obtain influenza prevalence models with sufficient accuracy. This has been shown both for now-casting (Lampos & Cristianini, 2010; Culotta, 2010; Lampos et al., 2015; Yang et al., 2015a; Wagner et al., 2018) and forecasting (Paul et al., 2014; Volkova et al., 2017; Yang et al., 2017) . Neural Networks (NN) have been applied to many forecasting tasks with state of the art accuracy, including financial modelling (Gately, 1995) , hydrological forecasting (Thirumalaiah & Deo, 2000) , power load forecasting (Barbounis et al., 2006) , as well as to the task of influenza forecasting from social media data (Volkova et al., 2017) . 
However, traditional NNs do not provide uncertainty estimates. Influenza forecasts, and forecasts in general, are only useful when they have an associated uncertainty. It is problematic to make an informed decision based on a forecast without knowing the likelihood that it is accurate. Uncertainty estimates also aid the fusion of disparate information sources, which are common in syndromic surveillance of diseases. There are generally two main types of uncertainty in modelling problems (Der Kiureghian & Ditlevsen, 2009): data uncertainty and model uncertainty. Model uncertainty, also known as epistemic uncertainty, deals with uncertainty in the model's parameters (Kendall & Gal, 2017). This captures uncertainty about information that is not contained in the training dataset. It can be reduced by having a more extensive dataset; a model will be more confident about data that is very similar to what it has seen during training, and less confident about out-of-distribution data. This is relevant to our task as disease transmission patterns and user search behaviour are expected to change over time. Bayesian models place a prior distribution over the model parameters (Yao et al., 2019; Li & Gal, 2017; Foong et al., 2019) to account for model uncertainty, but they neglect data uncertainty (Gal, 2016). Data uncertainty, also known as aleatoric uncertainty, is inherent in the observations and is often caused by noise or sampling errors. This is relevant to our task as search query data can be extremely noisy (e.g. the data represents a pseudo-random 10-15% sample of all searches) and the uncertainty in the ILI rate changes throughout the year, with very little uncertainty outside the main circulation months. Data uncertainty is modelled by placing a distribution at the output of an NN (Bishop, 1994; Nix & Weigend, 1994; Le et al., 2005). To account for both of these uncertainties we combine the methods, creating a Bayesian Neural Network (BNN) which has an output distribution, as described in Kendall & Gal (2017). We find that a combination of these two methods gives a good estimation of uncertainty without compromising forecasting performance when forecast horizon windows are greater than 7 days.

Our paper makes the following contributions:
1. We propose an uncertainty modelling solution for ILI forecasting using web search data and BNN layers. We show that our method can be incorporated into common NN architectures, such as feed-forward (FF) and recurrent neural networks (RNN).
2. We investigate the performance on ILI forecasting of traditional FF and RNN models and compare it to models that account for model or data uncertainty, as well as their combination.
3. In addition to common regression error functions (e.g. mean squared error), we also use metrics which penalise both errors in the mean and confidence interval of a forecast. We show that estimating uncertainty causes little or no accuracy degradation, and that an RNN that simultaneously estimates model and data uncertainty provides the most accurate forecasts and confidence intervals.

Methods for time series forecasting are applicable to many different areas, from meteorology (Murphy & Winkler, 1984; Gneiting & Raftery, 2005) to financial modelling (Kaastra & Boyd, 1996; Cao & Tay, 2001; Abu-Mostafa & Atiya, 1996), and health (Brookmeyer et al., 2007; Hoot et al., 2008; Soyiri & Reidpath, 2013).
Traditional NNs have shown promise as forecasters for influenza and are well-suited for incorporating large feature spaces that capture aggregate online user activity (Aiken et al., 2019; Volkova et al., 2017; Venna et al., 2018; Xue et al., 2017; Adhikari et al., 2019). However, their original formulations do not provide confidence intervals for forecasts, something that reduces their practical utility. This has also been highlighted by influenza forecasting challenges such as CDC's FluSight (Reich et al., 2019). Gaussian Processes (GP) have been used to now-cast or forecast influenza or COVID-19 mortality with uncertainty estimates (Lampos et al., 2015; Zimmer & Yaesoubi, 2020; Lampos et al., 2021). However, scalability to larger datasets, which from a modelling standpoint cover more meaningful time-spans, is challenging as GPs have O(n^3) complexity, where n is the number of training samples (Liu et al., 2020). NNs can be modified to provide similar uncertainty estimates. Dropout can be used as an approximation to deep GP models (Gal & Ghahramani, 2016a; Kendall et al., 2015; Gal, 2016). This requires careful tuning and ultimately does not behave as desired, given that the uncertainty generated by dropout does not reduce as more data becomes available (Osband, 2016; Hron et al., 2017). Another approach is to use BNNs, which specify a distribution over their weights and attempt to learn the posterior distribution. The majority of BNNs in forecasting are optimised using Markov Chain Monte Carlo (MCMC) (Liang, 2005; Niu et al., 2012; Cauchemez et al., 2004; Li et al., 2011; Zhang et al., 2011), but are limited by the lack of scalability of existing MCMC algorithms for large training samples and feature spaces (Papamarkou et al., 2019; Kucukelbir et al., 2017). This prevents MCMC from being a viable solution for neural network ILI forecasting with uncertainty estimates. Variational approaches that fit an approximate posterior can avoid the scaling issues of MCMC (Hinton & Van Camp, 1993; Blundell et al., 2015). However, variational methods have rarely been adopted as they tend to deliver sub-optimal performance, usually converging very slowly (Graves, 2011; Blundell et al., 2015). Recent works applying the heuristics of batch normalisation and learning rate scheduling appear to have reduced this problem (Ioffe & Szegedy, 2015; Loshchilov & Hutter, 2016; Goyal et al., 2017; Osawa et al., 2019). In this paper, we use variational inference to optimise a BNN for model uncertainty. We also combine this with data uncertainty by outputting a distribution, computing uncertainties as in Kendall & Gal (2017). This improves the final uncertainty estimates and is computationally tractable.

3. Methods

For the forecasting task, we have a set of inputs X = [x_1, ..., x_T] and a set of target outputs y = [y_1, ..., y_T] across T time points (days). At each time point t ∈ [1, T], we construct an input x_t composed of m web search query frequencies from time t − ℓ to time t, and ILI rates from t − ℓ − δ to t − δ. The number of historical time points that we consider is denoted by ℓ, and δ is a delay (in days) due to the time it takes for public health systems to obtain and curate a representative sample of ILI rates based on doctor visitations.
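To make this construction concrete, a minimal sketch follows, assuming the daily query-frequency matrix and ILI series are NumPy arrays; the array names, the inclusive window endpoints, and the example horizon are illustrative choices, not taken from the paper.

import numpy as np

def build_sample(queries, ili, t, history=28, delay=7, horizon=14):
    # queries: array of shape (T, m), daily frequencies of m search queries
    # ili:     array of shape (T,),  daily ILI rates
    # Returns one input matrix of shape (history, m + 1) and its target, the ILI
    # rate `horizon` days ahead of day t.
    q_window = queries[t - history + 1 : t + 1]                # query window ending at day t
    ili_window = ili[t - delay - history + 1 : t - delay + 1]  # delayed ILI window ending at t - delay
    x_t = np.column_stack([q_window, ili_window])
    y_t = ili[t + horizon]
    return x_t, y_t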
The input x_t is an (m + 1) × ℓ matrix, which is flattened into an (m + 1) · ℓ-dimensional vector when used in an FF NN; it is maintained as an (m + 1) × ℓ matrix when an RNN is used. We also have a target ILI rate y_t, associated with x_t, which is a future ILI rate at time point t + γ, where γ is the forecasting horizon. In our experiments, we conduct γ = 7, 14, and 21 days ahead forecasting. For simplicity, in the remainder of the manuscript, we will drop the subscript t and refer to these variables as x and y. We experiment with both FF and RNN models. For the RNN we use the Long Short-Term Memory (LSTM) architecture. We obtain forecasts using deterministic model formulations, models that account for data uncertainty, BNNs that account for model uncertainty, and BNNs that combine model and data uncertainty.

Feed-Forward Model. A single-layer FF NN is used as the simplest model we modify with these uncertainty methods. We flatten each input x and pass it into a hidden dense layer with 25 units and a ReLU activation function (max(0, x)). This feeds into the output layer, which depends on the uncertainty method we employ and will be described in the following paragraphs. All FF models are trained with an exponential learning rate scheduler using ADAM as an optimiser (Kingma & Ba, 2014).

Long Short-Term Memory Model. RNNs are well-suited and common for time series modelling tasks (Kalchbrenner & Blunsom, 2013; Sundermeyer et al., 2012). LSTMs (Hochreiter & Schmidhuber, 1997) are a popular form of RNN. We use a many-to-one LSTM architecture (Graves et al., 2006), predicting the ILI rate only at a single time point γ days ahead. Input data is passed to the model sequentially using a rolling window. As the main focus of our work is incorporating uncertainty estimates and not assessing complex NN architectures, we deploy the following straightforward architecture. We use a single LSTM layer that returns only the final element of its output sequence, followed by a dense layer, and then an output layer. More complex architectures need to be investigated with the caveat that our dataset is still relatively small when it comes to NNs (14 years of daily data), and optimising a network with many parameters will be more challenging in this case. The output layers are again dependent on the uncertainty method we employ. All LSTM models are trained with a cosine learning rate scheduler and an ADAM optimiser.

Traditional NNs are deterministic and require modification to provide estimates of uncertainty. We first present the deterministic approach without any uncertainty consideration, and then we detail our approach to data or model uncertainty, as well as their combination.

Deterministic Approach. Let us assume that an NN with L layers is being deployed. In our experiments, L = 1 or 2 for the FF- or the LSTM-based NN architectures, respectively. We denote the j-th activation in the l-th layer as a_j^[l]. The output layer has a single unit and generates a point prediction ŷ = f_Φ(x), where f_Φ denotes an NN with parameters composed of weights and biases Φ. We optimise Φ by minimising the mean squared error (MSE) between ŷ and y for T time points. We use backpropagation to find values for Φ.

Data Uncertainty. Data uncertainty is caused by noise in the data. This usually arises due to sampling or measurement error. It can be represented as homoscedastic uncertainty, where uncertainty is constant for every input, or as heteroscedastic uncertainty, where the uncertainty is dynamic and dependent on the input x.
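Before turning to the heteroscedastic case, the two deterministic architectures described above can be sketched as follows, assuming a Keras-style implementation; the LSTM hidden size, the input dimensions, and the learning-rate details omitted here are assumptions rather than values stated in the text.

import tensorflow as tf
from tensorflow.keras import layers

history, m = 28, 170  # assumed: 28-day window, 170 selected queries (plus one ILI channel)

# Deterministic feed-forward model: flattened input, one 25-unit hidden layer, point output
ff_v = tf.keras.Sequential([
    tf.keras.Input(shape=(history, m + 1)),
    layers.Flatten(),
    layers.Dense(25, activation="relu"),
    layers.Dense(1),
])

# Deterministic many-to-one LSTM: single recurrent layer, dense layer, point output
lstm_v = tf.keras.Sequential([
    tf.keras.Input(shape=(history, m + 1)),
    layers.LSTM(32),  # hidden size not given in the text; illustrative
    layers.Dense(25, activation="relu"),
    layers.Dense(1),
])

ff_v.compile(optimizer="adam", loss="mse")
lstm_v.compile(optimizer="adam", loss="mse")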
Heteroscedastic models are useful where parts of the observation space may be noisier than others (Gal, 2016). The ground truth ILI rate is nearly constant when influenza is not circulating, but highly variable during the flu season. This makes a homoscedastic model unsuitable for this problem. As such, we only consider the heteroscedastic case. We assume that the uncertainty is normally distributed for each data point. We separate our prediction into a mean and a standard deviation (Kendall & Gal, 2017):

[ŷ, σ̂] = f_Φ(x) .   (1)

To accomplish this, we expand the output layer of the NN from one to two units, a_1^[L] and a_2^[L] (Bishop, 1994). The mean, ŷ, is the first activation of the output layer and uses a linear activation function, ŷ = a_1^[L]. The standard deviation, σ̂, is computed by taking the softplus of a_2^[L]:

σ̂ = log(1 + exp(ρ a_2^[L])) ,   (2)

where ρ > 0 is a sharpening factor (see Appendix). The softplus ensures that the standard deviation is always positive. As we are assuming that the noise is normally distributed and we are outputting the parameters of a Gaussian distribution, we train the model using maximum likelihood estimation. This is the same as minimising the negative log-likelihood (NLL):

NLL(y, ŷ, σ̂) = (1/T) Σ_{t=1}^{T} [ (y_t − ŷ_t)² / (2σ̂_t²) + log(σ̂_t²) / 2 ] ,   (3)

where y = (y_1, ..., y_T) is a series of ILI rates (ground truth), ŷ = (ŷ_1, ..., ŷ_T) is a series of ILI rate predictions, and σ̂ = (σ̂_1, ..., σ̂_T) is a series of associated uncertainties. We use the NLL as a loss function and use backpropagation to find values for Φ. The first component of Eq. 3 contains a residual term equivalent to the MSE and an uncertainty normalisation term. The second component prevents the model from predicting an infinitely large σ̂. We do not need to explicitly train with an 'uncertainty ground truth' as it is implicit in the NLL.

Model Uncertainty. Model uncertainty is caused by the model having a limited understanding of the system which generates the data (O'Hagan, 2004; Tagasovska & Lopez-Paz, 2019). This uncertainty can usually be reduced by observing more data. An architecture which accounts for model uncertainty should recognise when it is shown out-of-distribution inputs and be less confident. This is not possible when using data uncertainty alone. To capture model uncertainty, a prior distribution is first placed over the model parameters, e.g. Φ ∼ N(0, I) for a Gaussian prior. In the literature this is referred to as a BNN (Li & Gal, 2017; Yao et al., 2019; Foong et al., 2019). Bayesian inference is used to compute the posterior distribution over the parameters, p(Φ|X, y). We can then sample from this distribution K times and make K predictions to build a range of forecasts. The variance of the means of these forecasts is the variance due to model uncertainty. For each sample of weights Φ′ from the posterior, a prediction N(ŷ′, σ) is made, where ŷ′ is calculated by the model from the sampled weights and σ is a hyperparameter. Each prediction is required to be a distribution so that the expected likelihood for each sample Φ′ can be calculated during training. The scale of σ is set to be similar to y. BNNs are difficult to perform inference on. The posterior is given by

p(Φ|X, y) = p(y|X, Φ) p(Φ) / p(y|X) .   (4)

The denominator of Eq. 4 contains the marginal probability p(y|X), which cannot be computed analytically in this setting (Kendall & Gal, 2017). To mitigate against this, several approximations have been proposed (Graves, 2011; Blundell et al., 2015; Gal & Ghahramani, 2016a; Blei et al., 2017). In approximate inference, the averaging over all weights is replaced by an optimisation task.
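As a concrete illustration of the heteroscedastic output of Eqs. 1-3 described above, a minimal sketch of a two-unit output head trained with the Gaussian NLL is given below. The exact placement of the sharpening factor ρ inside the softplus, the input dimensions, and the dropped additive constant are assumptions of this sketch.

import tensorflow as tf

RHO = 0.25  # sharpening factor; value reported for the -d models in the hyper-parameter settings

def gaussian_nll(y_true, output):
    # Heteroscedastic Gaussian NLL in the spirit of Eq. 3; `output` holds a raw mean
    # and a raw standard deviation per sample. Constant terms are dropped.
    y_hat = output[:, 0:1]
    sigma = tf.math.softplus(RHO * output[:, 1:2])  # positive std; exact use of rho is assumed
    return tf.reduce_mean(tf.square(y_true - y_hat) / (2.0 * tf.square(sigma))
                          + tf.math.log(sigma))

history, m = 28, 170  # assumed input dimensions (28-day window, 170 queries + 1 ILI channel)
ff_d = tf.keras.Sequential([
    tf.keras.Input(shape=(history, m + 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(2),  # [raw mean, raw standard deviation]
])
ff_d.compile(optimizer="adam", loss=gaussian_nll)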
We constrain the posterior p(Φ|X, y) to a simple distribution q(Φ) from a family of candidate distributions Q, which we define as Gaussians. We choose the q(Φ) which minimises the Kullback-Leibler (KL) divergence to the true posterior p(Φ|X, y):

q*(Φ) = argmin_{q ∈ Q} KL( q(Φ) || p(Φ|X, y) ) .   (5)

This is also intractable, as it contains the marginal probability from Eq. 4. We can, however, avoid this issue by minimising the negative of the evidence lower bound (ELBO), given by

ELBO(θ, ω) = E_{q_θ(Φ)}[ log p(y|X, Φ) ] − KL( q_θ(Φ) || p_ω(Φ) ) ,   (6)

where ω and θ are parameters used to learn p_ω(Φ) and q_θ(Φ) (described later). The negative ELBO is equal to the KL divergence in Eq. 5 up to a constant. The first component of Eq. 6 is the expected likelihood, which encourages the model to choose a q_θ(Φ) that explains the data well. The second component is the negative KL divergence between the posterior and prior distributions. This behaves similarly to a regulariser and encourages the model to choose a simple q_θ(Φ). The prior is chosen to represent how we anticipate the model to behave before it has observed any data. When there is limited training data, the prior component dominates the loss function. When more data is observed, the model becomes more confident and relies more on the likelihood component. The prior and posterior distributions are parameterised by functions that are conditioned on x and output the parameters of a Gaussian distribution. These functions are single-layer FF NNs with parameters ω and θ, respectively. In the prior we fix the standard deviation and parameterise the mean, p_ω(Φ) = N(µ_p, σ_p) = N(f_ω(x), σ_p). In the posterior we parameterise both the mean and the standard deviation, q_θ(Φ) = N(µ_q, σ_q). Here f_θ(x) outputs a mean µ_q and a standard deviation σ_q in the same way as in Eq. 1 (using ρ_q as a sharpening factor). In practice, we utilise a BNN with a posterior distribution over the parameters in the final layer only. The other layers of the network can be interpreted as providing a lower-dimensional representation of the inputs to a Bayesian model (Tran et al., 2018). During training we sample once from q_θ(Φ) and compute the ELBO. We take gradients with respect to ω and θ to perform gradient descent and optimise the Bayesian layer. We back-propagate the loss through the sampled weights to optimise the deterministic weights in the preceding network layers. At prediction time, the model is called K times. Weights Φ′ ∼ q_θ(Φ) are sampled each time the model is called and thus give a different prediction N(ŷ′, σ). We assume that these predictions are normally distributed and use them to calculate the parameters of a Gaussian distribution N(ŷ, σ̂), where ŷ and σ̂ are the mean and standard deviation of the means ŷ′ of the K sampled forecasts. As we are using a Gaussian output and only our output layer uses a distribution over its weights, it would be possible to analytically compute the posterior distribution. However, we choose to use variational inference for this task as it is a more general solution that would allow any layer in the network to use a distribution over its weights.

Combining Model and Data Uncertainty. To capture both data and model uncertainty, we combine the data uncertainty model with a BNN. To do this we use a BNN which outputs [ŷ, σ̂]. We approximate the posterior over the BNN by minimising the negative ELBO, using σ̂ in the computation of the expected likelihood. At prediction time we draw weights from our posterior, Φ′ ∼ q_θ(Φ), and use them to obtain a model output with predictive mean and standard deviation, [ŷ′, σ̂′] = f_Φ′(x). This process is repeated K times, each giving a different prediction.
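A minimal sketch of a variational output layer of the kind described above is given below. It is a simplification: the prior and posterior parameters here are free variables of a mean-field Gaussian rather than functions of x, and the dataset size used to scale the KL term is an assumed value. Stacking this layer after deterministic layers and calling the model K times at prediction time yields the sampled forecasts used in the combination step that follows.

import tensorflow as tf

class BayesianDense(tf.keras.layers.Layer):
    # Output layer with a Gaussian mean-field posterior over its weights, trained
    # with the reparameterisation trick. The prior is N(0, prior_std^2) per weight.
    def __init__(self, units, prior_std=0.5, n_train=1000, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.prior_std = prior_std
        self.n_train = n_train  # assumed number of training samples, scales the KL term

    def build(self, input_shape):
        d = int(input_shape[-1])
        self.w_mu = self.add_weight(name="w_mu", shape=(d, self.units),
                                    initializer="glorot_uniform")
        self.w_rho = self.add_weight(name="w_rho", shape=(d, self.units),
                                     initializer=tf.keras.initializers.Constant(-3.0))
        self.b = self.add_weight(name="b", shape=(self.units,), initializer="zeros")

    def call(self, x):
        w_sigma = tf.math.softplus(self.w_rho)
        # reparameterisation trick: a fresh weight sample per forward pass
        w = self.w_mu + w_sigma * tf.random.normal(tf.shape(self.w_mu))
        # KL(q || p) between diagonal Gaussians: the second component of the negative ELBO
        kl = tf.reduce_sum(tf.math.log(self.prior_std / w_sigma)
                           + (tf.square(w_sigma) + tf.square(self.w_mu))
                           / (2.0 * self.prior_std ** 2) - 0.5)
        self.add_loss(kl / self.n_train)
        return tf.matmul(x, w) + self.b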
The mean of our combined uncertainty prediction is the mean of the K sampled forecasts, ŷ = µ(ŷ′). The standard deviation of the predicted distribution is given by (Kendall & Gal, 2017):

σ̂² = (1/K) Σ_{k=1}^{K} σ̂′_k² + (1/K) Σ_{k=1}^{K} ŷ′_k² − ( (1/K) Σ_{k=1}^{K} ŷ′_k )² .   (7)

We evaluate forecasting accuracy based on the following metrics: mean absolute error (MAE), root mean squared error (RMSE), bivariate correlation (r), and symmetric mean absolute percentage error (SMAPE) between the predicted, ŷ = [ŷ_1, ..., ŷ_T], and the target, y = [y_1, ..., y_T], values. SMAPE accounts for different magnitudes in different flu seasons and provides more relevant error estimates when averaging across all the test periods. We also use a smoothed delay-to-peak (SDP) metric, which measures the delay between the actual and the predicted peak of a flu season in days, after smoothing both time series (15-day moving average). Smoothing is relevant to avoid misleading outcomes when, for example, a flu season has two peaks in close temporal proximity. A negative SDP value means that the model predicted the peak early, while a positive one means the peak was predicted late. We consider its absolute value when computing the mean performance over multiple test seasons.

For probabilistic forecasts, the aforementioned metrics cannot distinguish between errors with large or small uncertainties. We need an error metric that penalises both the forecast and the associated uncertainty estimate. The NLL used to train the NNs, which we defined in Eq. 3, is one such metric. A criticism of the NLL is that it over-penalises errors where the difference between the actual and forecasted value is much greater than the associated uncertainty (Gneiting & Raftery, 2007). An alternative that does not exhibit this property is the continuous ranked probability score (CRPS) (Gneiting & Raftery, 2007), which is given by

CRPS(y, ŷ, σ̂) = (1/T) Σ_{t=1}^{T} σ̂_t [ z_t (2Ψ(z_t) − 1) + 2ϕ(z_t) − 1/√π ],  with z_t = (y_t − ŷ_t)/σ̂_t ,

where z_t standardises the Gaussian forecast N(ŷ_t, σ̂_t), and ϕ and Ψ denote the probability density function and the cumulative distribution function of a standard Gaussian variable. CRPS is a probabilistic metric that generalises to the MAE when the standard deviation is 0. A discussion of the trade-off between NLL and CRPS is provided in the Appendix.

Weekly ILI rates for England from January 1, 2004 to December 30, 2018 are obtained from the Royal College of General Practitioners (RCGP), which utilises a sentinel doctor network spread across the country. Weekly ILI rates represent the rate of patients (per 100,000 of the population) visiting medical clinics with symptoms of influenza. Weekly data is converted to daily through a linear interpolation centred around the Thursday of each week. Daily web search frequencies for England are obtained from the Google Health Trends API for the same period as the ILI rates. We use a predetermined pool of 20,856 health-related search queries. We apply a semantic filter based on word embedding representations to extract queries that are related to influenza (Lampos et al., 2017; Zou et al., 2018). We also filter queries based on their correlation with ILI rates. The selected queries change with the test year as the correlations can differ from one training period to another. On average across the 4 test periods, 170 queries are selected. We also smooth query frequencies using a harmonic weighted average over the past 7 days.

Evaluation Protocol. We report results averaged over 4 test periods (2014-15 to 2017-18) that cover the respective annual flu seasons. Each test period starts on August 23 and ends on August 22 of the following year.
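Returning to the combination of the K sampled forecasts (Eq. 7) and the closed-form Gaussian CRPS defined above, a short NumPy sketch of both computations follows; the array layouts are assumptions.

import numpy as np
from scipy.stats import norm

def combine_samples(means, stds):
    # Combine K sampled forecasts into one Gaussian per time point (Eq. 7).
    # means, stds: arrays of shape (K, T)
    y_hat = means.mean(axis=0)
    var = (stds ** 2).mean(axis=0) + (means ** 2).mean(axis=0) - means.mean(axis=0) ** 2
    return y_hat, np.sqrt(var)

def crps_gaussian(y, y_hat, sigma):
    # Closed-form CRPS for Gaussian forecasts, averaged over all time points.
    z = (y - y_hat) / sigma
    per_point = sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))
    return per_point.mean()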
Min-max normalisation is applied to each search query frequency time series (based on the training set each time). We do not normalise the ILI rate inputs as they are on the same scale as the output, and this normalisation can negatively affect forecasting accuracy. In all models, the number of historical time points, ℓ, is 28 days for both search queries and ILI rates. ILI rates are delayed by a week, i.e. δ = 7, to simulate a practical setting, as syndromic surveillance data is commonly reported with a delay. We report results for γ = 7, 14, and 21 days.

Baselines. We compare our approach to a GP model as formulated by Zimmer & Yaesoubi (2020). This model is trained using the latest available data at the start of each week. The GP formulation does not use search query data, as an increased input space dimensionality needs a different covariance structure and otherwise negatively affects accuracy. To compare with it in a more meaningful way, we also train models that do not use web search activity data. In addition to the NN architectures, we also include a naïve model (also known as a persistence model) as a benchmark. This uses the most recently available value of the input ILI rate as its forecast. Finally, a historical averages model (Hist.) computes the ILI rate and uncertainty on a date as the average and variance of the ILI rates on the same date in previous years. The naïve estimates are superior for short-term forecasts, while historical averages are better for long-term forecasts.
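The naïve and historical averages baselines described above admit a direct sketch; the array layouts and helper names below are illustrative rather than the paper's implementation.

import numpy as np

def naive_forecast(ili, t, delay=7):
    # Persistence baseline: the most recently available ILI rate is the forecast.
    return ili[t - delay]

def historical_average(ili_by_season, day_index):
    # Historical averages baseline: mean and variance of the ILI rate observed on the
    # same day of the season in previous years.
    # ili_by_season: array of shape (n_past_seasons, season_length) -- assumed layout
    past = ili_by_season[:, day_index]
    return past.mean(), past.var()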
Table 1. Metrics averaged across 4 test years for FF, LSTM, and 3 baseline models. We evaluate the statistical significance between the metrics within a model architecture. Superscripts indicate that an estimate is different in a statistically significant way from the deterministic baseline of the same architecture and from the equivalent model in the other NN architecture (†), respectively. The best result for each metric is in bold. We append model abbreviations with -v for (vanilla) deterministic models, -d, -m, and -c for data, model, and combined uncertainty, and -c-nq for combined uncertainty without using search queries.

Table 1 enumerates performance outcomes for all forecasting tasks, models, and error metrics we considered, averaged over the 4 year-long test periods. For all metrics except the correlation (r), a lower number is better. We append NN model abbreviations with -v for (vanilla) deterministic models, -d, -m or -c respectively for data, model or combined uncertainty, and -c-nq for combined uncertainty without using search query data. We find that the combined -c models generally perform the best in terms of their confidence intervals and accuracy; however, there is an exception for forecasting 7 days ahead (γ = 7). Here the deterministic FF-v model outperforms the combined FF-c model by 21% and 33% in terms of SMAPE and RMSE, respectively. This is in part due to the model's good performance on the 2017-18 flu season, which had a higher peak. Arguably, for decision and planning purposes, precise uncertainty intervals may be an acceptable compromise for a small accuracy reduction. We find that for forecast horizons γ = 14 or 21 days, the LSTM-c model is equivalent to or better than all other methods. We also evaluate the effect of using search query frequencies as inputs in the FF-c and LSTM-c models, by removing the queries from the inputs and training otherwise identical models. We find that models trained without search query data are less accurate and less confident, confirming that search query data is useful for forecasting.

Standard Regression Accuracy Metrics. Standard metrics (MAE, RMSE, SMAPE, and r) do not consider corresponding uncertainty estimates, but are, of course, important. NNs are superior to the baseline models, with the exception of the data uncertainty -d models. For forecasting horizon γ = 7, FF models provide the best forecasts. For γ = 14, FF-v and LSTM-v are not statistically different, and the -v, -m and -c models perform similarly. For γ = 21, which is the most challenging forecasting task in our experiments, the best-performing models are based on the LSTM architecture. We see that for γ = 7 or 14 days, the LSTM-c model is not different in a statistically significant way from the baseline LSTM-v model. However, for γ = 21 this is not the case; according to SMAPE, LSTM-c is the best performing model in a statistically significant way.

Uncertainty. The combined uncertainty -c models always perform better than the data uncertainty -d and model uncertainty -m models in terms of both NLL and CRPS. Data or model uncertainty, when used in isolation, tend to underestimate uncertainty; this is resolved when they are used together. We find that the -c-nq models sometimes outperform the -c models in terms of NLL. This is due to the NLL's tendency to heavily penalise forecasts which are over-confident. The -c-nq models are by far the least confident of the models (Figures A2 and A3) and, as a result, are less likely to have a bad prediction which strongly affects the NLL. Figure 1 shows how the uncertainty differs between models. The -m model underestimates uncertainty but has a reasonably accurate mean, while the -d model has an inaccurate mean and less confidence. The -c model has a good mean prediction and a reasonable confidence interval (metrics for this are shown in Table 1). Figure 1 also shows that, as γ increases, the uncertainty surrounding the forecasts increases, as expected.

We evaluate how well the uncertainties are calibrated in the same way as in Kendall & Gal (2017). We generate confidence intervals for 0 to 3σ confidence (0% to 99.7%) and compute the frequency with which the ground truth falls within this interval. For example, we expect the ground truth to be within a 50% confidence interval 50% of the time. We show this in calibration plots in Figure 2. The diagonal line y = x represents perfect calibration. When a model is too confident, it will be below this line, and when it is too uncertain, it will be above it.

(Figure caption: -d, -m, and -c denote data, model, and combined uncertainty, and -c-nq combined uncertainty without using search queries. Confidence intervals indicate the standard deviation of the mean obtained by training with different weight initialisation seeds over 10 runs.)

In every case the combined models have the best calibrated uncertainty, i.e. their curves are closest to y = x. LSTM-c and FF-c have the best-calibrated uncertainty for all γ's. We also observe that as γ increases, the calibration lines deviate further from the diagonal. The GP model underestimates uncertainty significantly. Finally, the -c-nq models overestimate uncertainty, which is also reflected in their low NLL values.
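The calibration evaluation described above can be sketched as follows; the number of confidence levels is an arbitrary choice.

import numpy as np
from scipy.stats import norm

def calibration_curve(y, y_hat, sigma, n_levels=30):
    # Empirical coverage of central Gaussian confidence intervals from 0 to 3 sigma.
    z_values = np.linspace(0.0, 3.0, n_levels)
    nominal = 2.0 * norm.cdf(z_values) - 1.0          # e.g. z = 3 -> ~99.7%
    empirical = np.array([np.mean(np.abs(y - y_hat) <= z * sigma) for z in z_values])
    return nominal, empirical                         # well calibrated: empirical ~= nominal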
Epidemic Forecasting Analysis. During the 2015-16 flu season, the ILI rate peaked much later than in other seasons (mid-March compared to early January in most seasons). As a result, models tend to overestimate the flu rate in January and then predict the true peak at an inaccurate future point, resulting in a much larger SDP. The peak intensity of the flu season does not appear to affect the SDP. Flu rates in 2017-18 were significantly increased compared to other recent flu seasons, so the models tended to underestimate the flu rate. In addition, the larger face value of the maximum flu rate during this season increases the error metrics (other than the SDP) for it compared to other years. The models are most accurate during the 2014-15 and 2016-17 flu seasons. In general, the variation in intensity and timing during the flu season makes forecasting challenging. Because of this, the models must exhibit meaningful and well-tuned uncertainty estimates which reflect the variations from year to year. In our experiments, the models with combined uncertainty are visibly the best at capturing this (Figures 1 and 2).

The uncertainty in a forecast is important for decision making and planning, especially in the context of public health interventions. We showed how BNNs can be used to model both data and model uncertainties. Empirical assessment showed that for a 7-day forecast horizon there was some degradation in forecast accuracy when uncertainty is considered. However, this knowledge of uncertainty may be a worthwhile compromise for a small accuracy reduction. For longer time horizons (14, 21 days) there was little or no degradation in performance while still having accurate estimates of uncertainty. Modelling both uncertainties proved to be better than either method individually. This was confirmed by calibration curves (Figure 2). LSTM models were best for 14 and 21 days ahead. The ability to make probabilistic forecasts with NNs makes them a viable option for real-world epidemiological forecasting where knowledge of uncertainty is required.

Hyper-Parameter Settings. We have not optimised hyper-parameters using a validation set, as our main focus was not in establishing state-of-the-art performance. Our results are still significantly better compared to their respective baselines. In our experiments, we have used the following hyper-parameter settings. For the -d NNs, we set ρ = 0.25. For the -m NNs, we set ρ_q = 10 (posterior sharpening factor), σ_p = 0.5 (prior standard deviation), and σ = 5 (output standard deviation). For the -c NNs, we set ρ_q = 10, σ_p = 0.5, and ρ = 0.25. We use batch normalisation to improve the training of all models besides the deterministic FF model, which is negatively affected by it. Batch normalisation layers using default parameters (momentum = 0.99, epsilon = 0.001) are inserted between layers of the network, and normalise the input's mean and variance. In the models that combine uncertainty types, we investigated the effect that the number K of sampled models has on the prediction. We looked at values of K between 2 and 1,000 and found that for K > 25 the evaluation metrics do not change significantly. Therefore, in our experiments, we use K = 100 as this ensures that our output distribution converges. We found that for FF models an exponential learning rate scheduler is best. LSTM models are trained with a cosine learning rate scheduler with warm-up (Loshchilov & Hutter, 2016; Goyal et al., 2017). We train the models for a fixed 200 epochs.
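The two learning rate schedules mentioned above can be sketched as plain functions of the epoch index; the base learning rate, warm-up length, and decay rate are illustrative values, not those used in the paper.

import numpy as np

def exponential_lr(epoch, base_lr=1e-3, decay=0.97):
    # Exponential decay, as used for the FF models (values are illustrative).
    return base_lr * decay ** epoch

def cosine_lr_with_warmup(epoch, base_lr=1e-3, warmup=10, total=200):
    # Cosine schedule with a linear warm-up phase, as used for the LSTM models.
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    progress = (epoch - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1.0 + np.cos(np.pi * progress))

# e.g. tf.keras.callbacks.LearningRateScheduler(lambda epoch: cosine_lr_with_warmup(epoch))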
Statistical Significance. We use a two-tailed t-test with Bonferroni correction to evaluate whether there is a statistically significant (p ≤ 0.05) change in accuracy between different approaches. We train each model 10 times with different initialisation seeds and generate metrics for each for comparison.

Limitations. The models in this work are limited in that they depend on the existence of search query and epidemiological data. Such data is only available in countries where both a significant Internet usage and an established health infrastructure are present. The ground truth is also not representative of the true ILI prevalence in the population, as it is measured by the RCGP and does not account for people who do not go to the doctor when they have ILI symptoms. However, we expect that any consistent biases in the ground truth will not affect the relevance of the machine learning task. Finally, we have not optimised the hyper-parameters of the NNs, but just set them manually to reasonable values. The performance we obtained was still significantly better than the baselines we compared it against.

Future Work. Future work will look at improving the accuracy of the forecasting models by incorporating more complex architectures. Solutions might also explore the use of probabilistic weights in more layers of the network to give more accurate posterior approximations. Assessing the accuracy of our approach in more countries and sub-regions, as well as on different infectious diseases, e.g. COVID-19, are also natural next steps.

In Figures A2 and A3, we can see that the models that do not use web search data are not yielding confident estimates for any of the considered forecasting horizons. The predictions from these models contain many invalid estimates (incorrect spikes and multiple peaks during a single flu season). As expected, this issue becomes more pronounced for greater forecasting horizons. Notably, the forecasts for the 2017-18 flu season are surprisingly good for both models. This can cause metrics which are dependent on scale to be skewed against models which perform poorly on this season. To this end, the SMAPE scores for the models without queries are much poorer. In Figure A4, we show the predictions obtained from the FF models. The accuracy for γ = 7 is better than the one obtained from the LSTM models; however, for γ = 14 and 21 the LSTM models are comparable or better (see Figure 1).

Figure A4. Comparison of FF models for ILI forecasting tasks. FF-v is a deterministic model, and FF-d, FF-m, FF-c estimate data, model, and combined uncertainty, respectively. The number following the model names denotes γ (the forecasting horizon). It can be seen that as γ increases the predictions become less accurate and less confident.
References

Introduction to Financial Forecasting
Epideep: Exploiting embeddings for epidemic forecasting
Towards the use of neural networks for influenza prediction at multiple spatial resolutions
Long-term wind speed and power forecasting using local recurrent neural network models
Bayesian modeling to unmask and predict influenza A/H1N1pdm dynamics in London
Mixture Density Networks
Variational inference: a review for statisticians
Weight uncertainty in neural networks
Forecasting the global burden of Alzheimer's disease
Financial forecasting using support vector machines
A Bayesian MCMC approach to study transmission of influenza: application to household longitudinal data
Towards Detecting Influenza Epidemics by Analyzing Twitter Messages
Aleatory or epistemic? Does it matter? Structural Safety
'In-Between' Uncertainty in Bayesian Neural Networks
Uncertainty in deep learning
Dropout as a Bayesian approximation: representing model uncertainty in deep learning
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks
Neural networks for financial forecasting
Strictly proper scoring rules, prediction, and estimation
Training imagenet in 1 hour
Practical Variational Inference for Neural Networks
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
Keeping the neural networks simple by minimizing the description length of the weights
Long Short-Term Memory
Forecasting emergency department crowding: a discrete event simulation
Batch normalization: Accelerating deep network training by reducing internal covariate shift
Kaastra, I. and Boyd, M. Designing a neural network for forecasting financial and economic time series
Recurrent continuous translation models
What uncertainties do we need in Bayesian deep learning for computer vision?
Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding
A method for stochastic optimization
Automatic differentiation variational inference
Tracking the flu pandemic by monitoring the Social Web
Advances in nowcasting influenza-like illness rates using search query logs
Enhancing feature selection using word embeddings: The case of flu surveillance
Tracking COVID-19 using online search
Heteroscedastic Gaussian Process Regression
Bayesian adaptive combination of short-term wind speed forecasts from neural network models
Dropout inference in Bayesian neural networks with alpha-divergences
Bayesian neural networks for nonlinear time series forecasting
When Gaussian Process meets big data: A review of scalable GPs
Stochastic gradient descent with warm restarts
Probability Forecasting in Meteorology
Short-term load forecasting using Bayesian neural networks learned by hybrid Monte Carlo algorithm
Estimating the mean and variance of the target probability distribution
A systematic review of studies on forecasting the dynamics of influenza outbreaks
Dicing with the unknown
Practical deep learning with Bayesian principles
Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout
Challenges in Bayesian inference via Markov chain Monte Carlo for neural networks
Twitter improves influenza forecasting
A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States
Nine challenges for deterministic epidemic models
Forecasting seasonal outbreaks of influenza
Real-time influenza forecasts during the 2012-2013 season
An overview of health forecasting
LSTM neural networks for language modeling
Single-model uncertainties for deep learning
Hydrological forecasting using neural networks
Bayesian layers: A module for neural network uncertainty
A novel data-driven model for real-time influenza forecasting
Forecasting influenza-like illness dynamics for military populations using neural networks and social media
The added value of online user-generated content in traditional methods for influenza surveillance
Influenza activity surveillance based on multiple regression model and artificial neural network
Accurate Estimation of Influenza Epidemics using Google Search Data via ARGO
Using electronic health records and internet search information for accurate influenza forecasting
Forecasting influenza epidemics in Hong Kong
Quality of uncertainty quantification for Bayesian neural network inference
Explicitly integrating parameter, input, and structure uncertainties into Bayesian Neural Networks for probabilistic hydrologic forecasting
Influenza Forecasting Framework based on Gaussian Processes
Multi-Task Learning Improves Disease Models from Web Search
Transfer Learning for Unsupervised Influenza-like Illness Models from Online Search Data

Acknowledgements

The authors would like to thank Simon Moura for the constructive feedback in early versions of this work. We would also like to acknowledge RCGP for providing syndromic surveillance data, and Google for providing access to the Google Health Trends API. I.J.C. and V.L. would like to acknowledge all levels of support from the EPSRC project "i-sense: EPSRC IRC in Agile Early Warning Sensing Systems for Infectious Diseases and Antimicrobial Resistance" (EP/R00529X/1). I.J.C. and V.L. would also like to acknowledge the support from a Google donation funding the project "Modelling the prevalence and understanding the impact of COVID-19 using web search data".
Trade-off between CRPS and NLL. Here we discuss the differences between the CRPS and NLL loss functions. The optimal solution for CRPS is the same as for NLL (i.e. confident and correct). CRPS, however, is more forgiving when the confidence is high and the accuracy is poor. Figure A1 illustrates this point. Here the blue and green curves depict the NLL and CRPS scores, respectively, as a function of the standard deviation (x-axis). Predictions are represented by the red diagonal line. The true value to be predicted is y = 0. The first point on the diagonal line has zero variance, but predicts y = −1, i.e. an erroneous value with perfect confidence (zero uncertainty). We observe that CRPS penalises this with a score of 1. In contrast, the NLL tends to infinity. As we move from left to right, the error in y is initially decreasing while our uncertainty is increasing. As our estimate approaches the true value of y = 0, which occurs when the standard deviation (x-axis) is 0.25, both curves approach a minimum value. As we continue to move from left to right, the error in y begins to increase along with the uncertainty. At the right-most side, we have a prediction of y = 1 with a standard deviation of 0.5. Here, the CRPS score is similar to when the standard deviation was zero, while the NLL is about 2.3, i.e. the NLL metric much more strongly penalises errors that are outside of the uncertainty region. We do not favour one measure over the other, and report both in our results.
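The reading of Figure A1 above can be reproduced with a short numerical check; the near-zero standard deviation below stands in for the zero-variance end of the red line, and the NLL here includes its additive constant.

import numpy as np
from scipy.stats import norm

def nll(y, mu, sigma):
    # Gaussian negative log-likelihood for a single prediction
    return 0.5 * np.log(2.0 * np.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)

def crps(y, mu, sigma):
    # Closed-form CRPS for a single Gaussian prediction
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

y_true = 0.0
for mu, sigma in [(-1.0, 1e-3), (0.0, 0.25), (1.0, 0.5)]:
    print(f"mu={mu:5.2f} sigma={sigma:5.3f}  NLL={nll(y_true, mu, sigma):12.2f}  CRPS={crps(y_true, mu, sigma):6.3f}")
# mu=-1, sigma~0   : CRPS ~ 1 while the NLL explodes
# mu= 0, sigma=0.25: both scores are near their minima
# mu= 1, sigma=0.5 : NLL ~ 2.2, CRPS ~ 0.73 -- close to the values read off Figure A1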