A Bayesian - Deep Learning model for estimating Covid-19 evolution in Spain

Stefano Cabras

2020-05-20

This work proposes a semi-parametric approach to estimating Covid-19 (SARS-CoV-2) evolution in Spain. Considering the sequences of cases, deaths, and recovered of all Spanish regions, it combines modern Deep Learning (DL) techniques for analyzing sequences with the usual Bayesian Poisson-Gamma model for counts. The DL model provides a suitable description of the observed sequences, but no reliable uncertainty quantification around it can be obtained. To overcome this, we use the DL prediction as an expert elicitation of the expected number of counts, thus obtaining the posterior predictive distribution of counts in an orthodox Bayesian analysis. The resulting model allows us both to predict the future evolution of the sequences in all regions and to estimate the consequences of eventual future scenarios.

Understanding and predicting the evolution of Covid-19 (SARS-CoV-2) diffusion has become of primary importance in today's Spanish society. The study of disease spread is an old topic in statistical epidemiology, and the present disease is not an exception. Many epidemic-monitoring models rely on the well-known SIR model, in which the basic reproduction rate $R$ is the main parameter of interest, since $R < 1$ indicates that the infection will naturally disappear. The appeal of the SIR model is that $R$ is a simple function of the SIR model parameters. The SIR model in its basic version assumes that model parameters are constant over time and space, although versions with time-varying parameters are currently under development (Chen et al., 2020). However, since the beginning of spatial-temporal statistical analysis, epidemiological modeling has evolved by incorporating more sophisticated models derived from statistical applications in different fields. For instance, from time-series analysis, Covid-19 evolution is viewed as a sequence of counts (of cases, deaths, etc.), and autoregressive and moving-average models can be used (Agosto et al., 2016; Agosto and Giudici, 2020). From a spatial-analysis perspective, we consider the evolution of Covid-19 at the areal level (in contrast to point data) in a specified time domain. We feel that Bayesian modeling, more than just another way to build models with complicated structures, is necessary because it takes into account the uncertainty that data, especially noisy data, bring to any statement on Covid-19 evolution. In this context, Bayesian models derived from time-series analysis with random spatial effects have been proposed in Paul and Held (2011) for epidemic monitoring and already applied to Covid-19 diffusion in Italy by Giuliani et al. (2020) (a generalization of Agosto and Giudici (2020)). The problem with the usual Bayesian spatial-temporal models is that they assume specific parametric forms for the evolution along time (e.g., an ARMA process) and for the spread among areas, the latter mostly given by a neighborhood matrix plugged into a Conditional Autoregressive (CAR) model, which assumes that neighborhoods are defined beforehand instead of adapting to the observed data. It is well known that the disease diffusion process is not linear at all, and while surrounding areas are a good approximation of disease spread at a local level, they may not be optimal at a larger scale.
Considering the specific case of Spain, we can, for instance, note that although Cataluña and the Canary Islands are not neighboring areas of Madrid, they are connected to the Madrid region by high-speed trains and flights. This alone may be enough to call into question the usefulness of a neighborhood structure defined beforehand and encoded in the neighborhood matrix. Furthermore, there can also be dynamics among the features of Covid-19 spread, namely cases, deaths, and recovered. For instance, the presence of one or more super-spreaders may connect regions that are far apart. Finally, we try to account for an interesting and still unmodeled dynamic: the current number of deaths may be an effect of past undetected cases, and conversely, the current number of cases may presage future increases in the numbers of cases and deaths. Stated differently, the question is: does it make sense to read the sequence of counts from past to future (as usual in time series), or should we also read them in the opposite (retrospective) way? We do not settle this question; instead, we analyze the counts in both directions. This is precisely what we do when reading a text: a new word can sometimes be understood by reading forward rather than just looking backward.

To complicate things even more, we have to take into account the fact that the data-collection process is far from regular. First of all, in Spain there is no centralized official unit dedicated to collecting Covid-19 data and assuring data consistency. At the day of writing, there only exist groups of academics that spontaneously collect data and share them online (see, for instance, https://github.com/datadista/datasets/tree/master/COVID%2019 and code.montera34.com), while in other countries, such as Italy, there exists an official repository maintained by the Italian civil protection department (https://github.com/pcm-dpc/COVID-19). However, even these official data are questioned as to their usefulness for analyzing Covid-19 in Italy (Bartoszek et al., 2020). In this paper we only use the version of the counts provided by the Instituto de Salud Carlos III (www.isciii.es). Such noise in the data requires robust statistical methods. In practice, Covid-19 spread is the result of complex dynamics along time, space, and features, which can either be modeled by good experts or estimated from the available noisy data. The latter approach is chosen here: modern machine-learning techniques are used to create such a super expert (from data, of course) that will elicit priors in a Bayesian model for counts. The discrepancy between observations and the expert is properly taken into account by Bayes' theorem, which finally allows predicting counts along with the corresponding uncertainty.

The rest of the paper is organized as follows: Section 2 discusses the two-step analysis, whose results are reported in Section 3. With the proposed approach we are also able to forecast eventual scenarios, which are presented in Section 3.4. Section 4 reports possible generalizations of the approach, which are beyond the scope of the paper but can be quickly implemented using the code in the GitHub project: https://github.com/scabras/covid19-bayes-dl.

Let $Y_{tsj} \in \{0, 1, 2, \ldots\}$ be the random variable of interest representing the counts in region $s = 1, \ldots, S = 19$ at day $t = 1, \ldots, T$ of the number of cases ($j = 1$), deaths ($j = 2$), and recovered ($j = 3$). The aim of the paper is to estimate

$$\Pr(Y_{tsj} \mid \mathcal{F}_{t-1}), \qquad (1)$$

where $\mathcal{F}_{t-1}$ represents the process filtration up to day $t - 1$.
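To make the data structure concrete, the following sketch shows one way the counts $Y_{tsj}$ could be assembled from the ISCIII file used later in Section 3. This is a minimal illustration, not the paper's code: the column names (`CCAA`, `FECHA`, `Casos`, `Fallecidos`, `Recuperados`) are assumptions about the CSV layout and may not match the actual file.

```python
import pandas as pd

# Sketch only: the column names below are assumptions, not verified
# against the current version of the ISCIII CSV.
url = "https://covid19.isciii.es/resources/serie_historica_acumulados.csv"
raw = pd.read_csv(url, encoding="latin-1")
raw["FECHA"] = pd.to_datetime(raw["FECHA"], dayfirst=True)

# One column per (region s, count type j): S regions x 3 features.
wide = raw.pivot_table(index="FECHA", columns="CCAA",
                       values=["Casos", "Fallecidos", "Recuperados"])

# The file stores cumulative counts: difference them to obtain the daily
# increments Y_tsj (noisy, as discussed in Section 3).
counts = wide.diff().clip(lower=0).fillna(0).to_numpy()  # T x (S*3)
```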
The Bayesian approach is needed to properly account for conditioning on the observed data $D$, while the non-parametric approach implemented in the DL model enters because $\mathcal{F}_{t-1}$ is the evolution of all sequences, that is, over all possible $s$ and $j$ up to day $t - 1$. Estimating conditionally on such an $\mathcal{F}_{t-1}$ would be possible even with parametric models, but at the cost of requiring strong expertise to fix, beforehand, the parametric form of $P$, including how past and future Covid-19 evolution are related. To avoid assumptions derived from such an unavailable expert, we only fix the family of $P$ to be the Negative Binomial, derived from the usual Poisson likelihood and a Gamma prior on the Poisson mean, which will be specified below. Hence, $\Pr(Y_{tsj} \mid \mathcal{F}_{t-1})$ is Negative Binomial with mean $E(Y_{tsj}) = \eta_{tsj}$. To estimate $\eta_{tsj}$, we first project all observations up to $t - 1$ into a point guess $\hat{y}_{tsj}$ using the bidirectional LSTM model specified later, which takes as input all sequences up to day $t - 1$. This is illustrated in Section 2.1. Secondly, we derive the posterior distribution of $\eta_{tsj}$ by assuming, a priori, $E(\eta_{tsj}) = \hat{y}_{tsj}$. Bayes' theorem finally allows us to obtain (1), as illustrated in Section 2.2.

A deep learning (DL) model is a neural network with many layers of neurons (Schmidhuber, 2015); it is an algorithmic approach rather than a probabilistic one, see Breiman (2001) for the merits of both approaches. Each neuron is a deterministic function, so that a neuron of a neuron is a function of a function, along with an associated vector of weights $w = (w_1, \ldots, w_k)$. Essentially, for a generic response variable $Y_i$ of the $i$th statistical unit and a corresponding predictor $X_i$, we have to estimate

$$Y_i = f_k(f_{k-1}(\cdots f_1(X_i) \cdots)), \qquad (2)$$

and the larger $k$ is, the "deeper" the network. With many stacked layers of neurons all connected (a.k.a. dense layers), it is possible to capture strong nonlinearities and interactions among variables. The estimation approach underpinning a DL model is that of function composition, in contrast to the additive-function approach underpinning the usual regression techniques, including the most modern ones (e.g., smoothing splines, nonparametric regression, etc.), in which $Y_i = w_1 f_1(X_i) + w_2 f_2(X_i) + \ldots + w_k f_k(X_i)$. A thorough review of DL is beyond the scope of this paper and can be found, for instance, in Schmidhuber (2015). When the $f(X)$ functions are linear in their arguments, the DL model can also be interpreted as maximum a posteriori estimation of $\Pr(Y \mid X, \text{Data})$ under Gaussian process priors (Polson and Sokolov, 2017). However, despite this, and because of its complexity, the whole distribution $\Pr(Y \mid X, \text{Data})$ cannot be evaluated, only its mode. Fitting a DL model consists of estimating the vector of weights $w$. The estimation requires evaluating a multidimensional gradient, which cannot be computed jointly for all observations because of its dimensionality and complexity. Recalling that the derivative of a composite function is the product of the derivatives of the inner functions (i.e., the well-known chain rule $(f \circ g)' = (f' \circ g) \cdot g'$), for computational feasibility this is implemented as a tensor product, which in turn is implemented in the hardware (not software) of Graphics Processing Unit (GPU) devices. A GPU can deal in parallel with thousands of threads at a relatively lower cost than the usual eight threads of a normal CPU. That is what has made the DL approach to data analysis popular in recent years, as it can be run even on a mobile phone.
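The contrast between the additive structure of classical regression and the composed structure in (2) can be made concrete with a toy sketch. This is plain Python with arbitrary basis functions and weights chosen only for illustration; it is not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)

# Additive structure (classical regression): y ~ w1*f1(x) + w2*f2(x) + ...
def additive(x, w):
    basis = [np.sin(x), np.cos(x), x ** 2]   # fixed, pre-chosen basis functions
    return sum(wi * fi for wi, fi in zip(w, basis))

# Composed structure as in (2): y ~ f_k(... f_2(f_1(x)))
def layer(z, w, b):
    return np.maximum(0.0, w * z + b)        # one "neuron": linear map + ReLU

def composed(x, params):
    z = x
    for w, b in params:                      # a function of a function of ...
        z = layer(z, w, b)
    return z

y_add = additive(x, w=[0.5, -0.2, 0.1])
y_cmp = composed(x, params=[(1.2, 0.0), (-0.7, 0.3), (0.5, -0.1)])
```

Stacking even a few such composed layers already produces interactions and nonlinearities that the additive form can only approximate with a much richer basis.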
Such tensor products, along with the entire DL machinery, are evaluated on batches of observations (in this case, sequences of counts), and they are implemented, for instance, in the open-source software known as Google TensorFlow (Abadi et al., 2015). Most DL models are suitable for independent observations, for which batches can be drawn at random from the sample and then used to estimate $w$. Such models cannot be used here because our observations are sequences and thus not independent. For these purposes we resort to specific DL models belonging to the class of recurrent neural networks (RNNs) (Lim and Zohren, 2020). Fundamentals of RNNs and Long Short-Term Memory (LSTM) models (a specific instance of RNNs) can be found in Sherstinsky (2020). Unfortunately, that paper needs some translation for a statistical audience; basically, an RNN acts similarly to a Hidden Markov Model (HMM) or, more specifically, to a dynamic linear model (West and Harrison, 1989) for observed sequences $Y_1, \ldots, Y_t$. RNNs differ from the usual HMM in that they do not manage conditional probabilities but signals, which can be viewed as point guesses of the observed and hidden states represented by the above-mentioned deterministic nodes of the DL network. Instead of being justified by Markovian processes, the modeling approach is derived by posing an additive deterministic model on the evolution equation evaluated at time $t^*$, $\frac{dY(t)}{dt}\big|_{t=t^*}$, with the same structure as in (2), in which the evolutions of the sequence $\frac{dY(t)}{dt}\big|_{t=t^*-k}$ are the predictors. In this case, $k$ has the meaning of a lag in time-series analysis and also counts the hidden layers in this model architecture. An RNN with, say, $k$ hidden layers implies that the evolution at time $t^*$ is a nonlinear function of the $k$ past evolution equations; if the observations were not informative for estimating such a complex function with the corresponding $w$, the gradient would vanish, $w$ could not be updated, and this, in turn, would prevent the training of the model. To avoid the vanishing-gradient problem, the LSTM model introduces an adjustment (Sherstinsky, 2020) into the gradient that keeps it from being exactly zero. This allows estimating short-term effects (terms of the evolution equation near $t^*$) as well as long-term effects ($t$ far from $t^*$). Finally, the structure of the network can be further enriched by connecting an LSTM network that analyzes sequences in the order $1, \ldots, t-1, t$ with another LSTM network that analyzes the sequences in the opposite order $t, t-1, \ldots, 1$. These architectures, called bidirectional LSTMs, are used in Artificial Intelligence (AI) language understanding and generation; they are described, for instance, in Wöllmer et al. (2010) and implemented in Google TensorFlow. Finally, if we let $Y$ be a vector of $S \times 3$ time series, each of length $T$, and $X$ the corresponding vector of past (future) values of $Y$, then once $w$ has been estimated (namely, the network has been trained), we end up with the guess $\hat{y}_{tsj}$ of the mode of $\Pr(Y_{tsj} \mid \mathcal{F}_{t-1})$ in (1), as the bidirectional LSTM model jointly considers the past evolution of Covid-19 in Spain. The point guess $\hat{y}_{tsj}$ is a suitable projection of the evolution of all sequences into the space of sequence guesses, just as a sample mean is used to project observed quantities. To recover the Bayesian analysis, it can be considered an elicitation of the expected number of counts at time $t$, region $s$, and count type $j$ (cases/deaths/recovered).
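As a bridge to the training described in Section 3, here is a hedged sketch of how the sequences could be arranged into supervised pairs for such a network. The helper `make_windows` is hypothetical, and the lookback of $k = 14$ days anticipates the choice made later in the paper.

```python
import numpy as np

def make_windows(series: np.ndarray, k: int = 14):
    """Build (input window, next-day target) pairs from a T x M count matrix.

    Each input holds the k previous rows of all M = S*3 sequences, so the
    network conditions jointly on every region and count type, mirroring
    the filtration F_{t-1} described above.
    """
    X, y = [], []
    for t in range(k, series.shape[0]):
        X.append(series[t - k:t, :])   # days t-k, ..., t-1 (all sequences)
        y.append(series[t, :])         # day t, all sequences at once
    return np.stack(X), np.stack(y)

# counts: T x 56 matrix of daily increments (see the earlier sketch)
# X, y = make_windows(counts, k=14)   # X: (T-k, 14, 56), y: (T-k, 56)
```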
From an empirical Bayes point of view, the guess $\hat{y}_{tsj}$ can also be considered a point estimate of the prior hyper-parameter, obtained by setting $E(\eta_{tsj}) = \hat{y}_{tsj}$. The aim of this second and last step in the analysis of $Y_{tsj}$ is to derive the uncertainty for the counts at time $t$, region $s$, and count type $j$, given the guess $\hat{y}_{tsj}$ obtained in the step described in Section 2.1. The model is very simple: for each count,

$$Y_{tsj} \mid \eta_{tsj} \sim \text{Poisson}(\eta_{tsj}), \qquad (3)$$

$$\eta_{tsj} \sim \text{Gamma}(\text{shape} = b \, \hat{y}_{tsj},\ \text{rate} = b), \qquad (4)$$

where $b = 1$ is the prior sample size, such that the prior information is that of a sample of one observation with an observed number of counts $\hat{y}_{tsj}$. The posterior predictive distribution (1) at an observed count is a Negative Binomial,

$$Y_{tsj} \mid y_{tsj} \sim \text{Neg-Bin}\left(\text{mean} = \frac{b \, \hat{y}_{tsj} + y_{tsj}}{b + 1}\right), \qquad (5)$$

while the prediction at an unobserved count is obtained using the corresponding (informative) prior predictive distribution,

$$Y_{tsj} \sim \text{Neg-Bin}\left(\text{mean} = \hat{y}_{tsj}\right). \qquad (6)$$

All other quantities of interest, such as cumulative counts, $R$, and aggregations at higher territorial levels, are obtained by simulating from (5) and (4).

In this section, we apply the above model to the problem of estimating Covid-19 evolution in the Spanish regions. The analysis is implemented in the following GitHub project: https://github.com/scabras/covid19-bayes-dl, where code and results for all regions can be conveniently browsed at https://scabras.github.io/covid19-bayes-dl/ and are updated with newer observations than those considered in this work. Here we limit ourselves to illustrating results for some regions, namely the Canary Islands (CN), Cataluña (CT), and Madrid (MD), and also for the whole of Spain. The analysis is, however, conducted using data from all regions. Data come from the Instituto de Salud Carlos III and can be downloaded as a CSV file at https://covid19.isciii.es/resources/serie_historica_acumulados.csv. From this database we only use the cumulative numbers of cases (obtained as the sum of the results of two tests: polymerase chain reaction (PCR) and antibody), deaths, and recovered. These cumulative numbers are not updated regularly, especially at the beginning of the spread and sometimes during weekends. Hence, there is a clear warning that differences between cumulative counts from day to day do not necessarily represent daily increments. In fact, as shown in Figure 1, the data are very noisy, especially in small regions such as the Canary Islands and at the beginning of the period, which for the present paper starts on March 20th and ends on May 10th, 2020, for a total of T = 80 days and 56 (i.e., S × 3) sequences of counts. The noisy counts reported in Figure 1 justify the need for robust models to describe the data. Specifically, we used a bidirectional LSTM model looking back (and forward) up to k = 14 days, the usual two weeks for an asymptomatic case to become symptomatic and hopefully appear in the database as a case. Therefore, at each time t, the filtration used by the backward direction of the LSTM refers to the history of the process up to two weeks before, while the forward part of the LSTM network is devoted to estimating the evolution two weeks ahead. The DL network is made of two stacked layers of neurons: the first is a bidirectional LSTM layer with 32 fully connected LSTM cells per direction, and the second is a layer of 56 fully connected linear neurons. The model capacity is around 26 thousand weights (nodes are all connected), most of which will remain at zero, as the data are not informative enough to update all of the initial zero weights.
Such an architecture can model:

• in-sequence evolution, that is, $Y_{tsj} \mid Y_{t'sj}$ for all $t' \neq t$;

• between-sequence evolution, that is, $Y_{tsj} \mid Y_{t's'j'}$ for all $t' \neq t$, $s' \neq s$, $j' \neq j$.

This translates into saying that, for instance, the number of cases in Madrid can be affected by past cases and will affect the future number of cases (in sequence), and likewise the cases in Madrid can be a consequence of, or have effects on, past and future cases/deaths/recovered in all other regions. This is the main contribution of the proposed way of representing the data to the existing literature on Covid-19 evolution mentioned above. The model is trained for 200 steps (epochs), and the training sample is made of batches of 10 random sequences drawn at each optimization step. The training sample per step consists of 10 rows of the original data matrix $D$ of size $T \times 56$, along with the corresponding previous $k = 14$ rows. Two out of the 10 sequences in each batch are used to evaluate the prediction error; these are called validation sequences. In the first steps, the validation sequences have not been used to train the model, but in the last steps they are likely to have been used for training in earlier steps. Therefore, the validation error is essentially an in-sample error and thus overoptimistic. Further, as the sample is not very large and DL techniques are known to be data-hungry, we have to prevent overfitting by discarding random signals in the data that are not consistent across batches. To do this, at each step we set a certain percentage of weights to 0 at random. In machine-learning jargon this is called dropout: we set a 10% dropout of the weights in the evolution equations (the so-called recurrent dropout) and also between the output states of the 32 LSTM nodes. The effect on spurious non-zero weights is the following: if, during the training steps, some weights are updated away from zero because of outliers, then these updates are likely to be returned to 0 by the dropout. The same does not occur for weights whose values are not related to outliers. This is important because sometimes the number of cases increases unexpectedly due to an update in the counts caused by administrative effects (for instance, holidays) rather than as a result of the underlying evolution of the Covid-19 spread alone. All nodes in the network are linear in their arguments and, to further robustify the analysis, the weights are estimated by minimizing the prediction error defined as the mean absolute error between estimated and observed counts. Figure 2 reports the estimation error over the 200 steps; it decreases with the steps, meaning that the network is learning from the available sample. Figure 2 can also be interpreted as showing a lack of overfitting, given that the number of parameters is much larger than the number of observed days while the error remains positive, reaching a plateau. This is related to the concept of model capacity, which differs from that of model dimension typical of the usual regression analysis (without shrinkage methods). In fact, in the usual regression models, the data are used to impose an estimated value on all coefficients. This is not necessary here, and only relevant and consistent information is passed along the nodes of the network. This is why overfitting is somewhat limited and the training error is bounded away from zero. The estimated LSTM model can guess the Covid-19 evolution one day ahead given all the history, as shown in Figure 3, and we can also use it to predict, for instance, the next 30 days.
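A minimal Keras sketch of the architecture and training setup just described (32-unit LSTMs per direction, 10% dropout and recurrent dropout, a 56-neuron linear output layer, mean absolute error loss, 200 epochs, batches of 10). The optimizer choice and the validation-split mechanics below are assumptions; the paper's actual code is in the linked GitHub project.

```python
from tensorflow.keras import layers, models

K, M = 14, 56  # lookback of two weeks, 56 sequences (regions x count types)

model = models.Sequential([
    # Bidirectional LSTM: one 32-unit LSTM per reading direction.
    # dropout=0.1 drops input connections; recurrent_dropout=0.1 drops
    # the recurrent (evolution-equation) connections, as in the text.
    layers.Bidirectional(
        layers.LSTM(32, dropout=0.1, recurrent_dropout=0.1),
        input_shape=(K, M),
    ),
    # Dense linear output: one guess per sequence (region x count type).
    layers.Dense(M, activation="linear"),
])

# Mean absolute error as the robust loss mentioned in the text;
# the optimizer here is an assumption, not stated in the paper.
model.compile(optimizer="adam", loss="mae")

# X: (n, 14, 56) windows, y: (n, 56) next-day counts (see earlier sketch);
# validation_split=0.2 approximates the 2-out-of-10 validation sequences.
# model.fit(X, y, epochs=200, batch_size=10, validation_split=0.2)
```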
Even though the observed sequences are noisy, the point is that such noise is common to many regions, and paradoxically it turns out to be "regular noise": the network can detect the effect of reporting counts during and just after weekends. This would not have been possible with the simpler linear regression models mentioned in the introduction, unless very informative experts were available to assess the interactions among the 56 series of counts. Finally, the subsequent projections indicate that the number of daily cases and deaths will not stay steady at almost 0 for Madrid and Cataluña and is expected to increase at some point in the middle of June. The number of recovered is expected to decrease in Cataluña and Madrid and to stay low during June as well. Of course, we have no clue how probable the above statements are, and this is addressed in Section 3.3. A further problem in directly using the output of a DL model is that it is very difficult to understand why it produces a given decrease or increase, and this casts doubt on the direct use of such output as a final prediction. Despite this, the usual sampling argument remains: why should a model that did well in the past fail in the future? Is its past success due to overfitting? This is the question that also justifies the subsequent Bayesian analysis, which at least attaches probability evaluations to the two parts of the question: 1) how well the model did in the past, and 2) how much to trust the predicted future.

The results in Figure 3 and the above LSTM model do not convey a proper evaluation of the uncertainty around the estimated one-day-ahead evolution. For this purpose we calculate the posterior predictive distribution of $Y_{tsj}$ when it is observed and the prior predictive distribution for unobserved counts. The key is to use the above AI expert, which elicits the prior mean with $\hat{y}_{tsj}$ as described in (4). Such an expert can be trusted based on the pros and cons described in Section 3.2. We trust this expert as much as one observation ($b = 1$), implying that the predictive mean is the simple mean of the observed and predicted values, while the predictive variance equals the mean inflated by a factor of 3/2 under (5) and of 2 under (6). Given the likelihood (3) and the prior (4), we calculated, for all $t, s, j$, the posterior predictive distribution (5) for observed counts and the prior predictive distribution (6) for unobserved ones. As we are also interested in other evolution features, such as territorial aggregations, cumulative counts, and $R$ (here defined crudely, for cases, as the ratio $Y_{ts(j=1)} / Y_{(t-1)s(j=1)}$), we simulated 1000 (independent) samples from (5) and (6). The most important result is the prediction along with its uncertainty, which is the estimate of (1). This is the base result of the whole modeling process, and in Figure 4 we report observed and predicted counts along with 95% equal-tail credible intervals. From Figure 4 we can see that the model shows an excess of overconfidence, especially for the initial evolution, which can be explained by the fact that at the beginning of the Covid-19 spread there were not enough data to understand its evolution. This is also reflected, although less markedly, in the cumulative counts shown in Figure 9 in the Appendix. Considering the predictions for all regions, it is possible to obtain the prediction for all of Spain in terms of daily increments (Figure 5) and cumulative counts (Figure 10 in the Appendix).
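The simulation from (5) and (6) can be sketched as follows: a minimal numpy version under the Poisson-Gamma conjugacy stated above, with the negative binomial sampled by compounding Gamma and Poisson draws. Function names are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
b = 1.0  # prior sample size, as in (4)

def posterior_predictive(y_obs, y_hat, n_sim=1000):
    """Draws from (5): Gamma posterior for eta, then Poisson counts.

    Posterior: eta | y ~ Gamma(shape = b*y_hat + y_obs, rate = b + 1),
    so the predictive mean is (y_hat + y_obs)/2 and the variance is
    3/2 times the mean when b = 1, as stated in the text.
    """
    eta = rng.gamma(shape=b * y_hat + y_obs, scale=1.0 / (b + 1.0), size=n_sim)
    return rng.poisson(eta)

def prior_predictive(y_hat, n_sim=1000):
    """Draws from (6): prior Gamma(b*y_hat, b) for unobserved days,
    giving predictive mean y_hat and variance 2*y_hat when b = 1."""
    eta = rng.gamma(shape=b * y_hat, scale=1.0 / b, size=n_sim)
    return rng.poisson(eta)

# Example: observed 120 cases, DL guess 100 -> predictive mean near 110.
draws = posterior_predictive(y_obs=120, y_hat=100.0)
lo, hi = np.percentile(draws, [2.5, 97.5])   # 95% equal-tail interval
```

Summaries such as cumulative counts, territorial aggregations, or the crude $R$ ratio are then obtained by applying the corresponding transformation to each of the 1000 simulated paths.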
From Figure 5 we can appreciate that a mild downward trend is expected in the numbers of cases and deaths, but the speed of this decrease is not comparable to that just after the peak. That is, the "famous" curve is now definitely a plateau. Once (1) has been estimated, we can also predict $R$ at the regional level (see Figure 11 in the Appendix) as well as for all of Spain. The latter is reported in Figure 6 and indicates that $R$ is expected to be around 1 in the future.

Figure 6: Posterior predictive distribution (mean and 95% credible intervals, shaded area delimited by yellow bounds) for observed days (red points), along with the prediction for the 30 days after the last observed one. This is the estimate of (1) for observed days and future predictions.

This section is devoted to analyzing the impact of hypothetical scenarios and thus illustrates how the model could eventually be used by public policymakers. It has a twofold purpose: first, to analyze the impact of plausible scenarios by making predictions and, second, to attempt to interpret the overall model, given that the LSTM projections lack a straightforward interpretation. For instance, the effect of a perturbation in one region on the other regions can shed light on the spatial relation between that region and the others, which is encoded in the bidirectional LSTM; this is useful for interpreting the spatial effects of the LSTM along with their associated uncertainty, retrieved from the Bayesian step of the analysis. Suppose that during the last 10 days up to the last observed day, the number of deaths in the Madrid region had been artificially incremented by 20% every day. What would be the impact on the rest of the country? Figure 7 reports the differences from the actual estimates (Figure 5) and the predicted increases in cases, deaths, and recovered over all of Spain during and after the hypothesized 10 days. Given the current situation, an increase in the number of cases is not expected with certainty, due to the underlying contagion process assumed by the model. Suppose now that the same increments as in scenario 1 occur in the Canary Islands, which have a smaller population than the Madrid region. Figure 8 reports the impact on all of Spain. We can see that the overall counts of cases, deaths, and recovered indicate a significant increment for all of Spain, although the absolute values are not practically important when compared with the counts already observed. Why this differential effect with respect to scenario 1? This is a question of model interpretation, which is very difficult for the posed model, and the lack of an answer may reasonably cast doubts on its validity. However, we may think that the Canary Islands are more isolated than Madrid and act as a different country, with responses to this huge increment of cases (relative to their population) different from Madrid's. Also, the dynamics depicted in Figure 8 make sense, as cases lead to a subsequent fraction of deaths and then a long recovery time.

The proposed modeling approach, more than a clean and specific model, allows taking into account part of the complexity of Covid-19 evolution in Spain by summarizing it in a complicated model structure represented by the bidirectional LSTM network, whose output is further processed by a Bayesian Poisson-Gamma model. This finally accounts for the randomness of the underlying estimated evolution.
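A hedged sketch of how such a scenario could be pushed through the pipeline, reusing the hypothetical objects from the earlier sketches (`counts`, the trained `model`): perturb the relevant column over the last 10 observed days (one plausible reading of the 20% daily increment, compounded here), roll the network forward to obtain new elicited means, and then propagate them through the prior predictive (6).

```python
import numpy as np

def scenario_forecast(counts, model, col, days=10, factor=1.20,
                      horizon=30, k=14):
    """Increment one sequence (e.g. deaths in Madrid) by 20% per day over
    the last `days` observed days, then roll the DL model forward.

    `counts` is the T x 56 matrix and `model` the trained network from the
    earlier sketches; `col` indexes the perturbed (region, feature) column.
    """
    perturbed = counts.astype(float).copy()
    perturbed[-days:, col] *= factor ** np.arange(1, days + 1)

    window = perturbed[-k:, :].copy()
    path = []
    for _ in range(horizon):
        y_hat = model.predict(window[None, :, :], verbose=0)[0]
        y_hat = np.clip(y_hat, 0.0, None)          # counts cannot be negative
        path.append(y_hat)
        window = np.vstack([window[1:], y_hat])    # slide the window forward

    return np.array(path)  # horizon x 56 matrix of elicited means y_hat
```

Comparing this path with the unperturbed forecast, with uncertainty attached via the prior predictive draws for each elicited mean, reproduces the kind of differences reported in Figures 7 and 8.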
All the exposition here can be applied to other definitions of regions, for instance, provinces or hospital areas, or even areal data coming from the geo-localization of cases/deaths/recovered. It can include other covariates for estimating Covid-19 evolution, such as meteorological data, if they were relevant. The available commented code at https://scabras.github.io/covid19-bayes-dl/ makes all such generalizations of the somewhat simplified analysis easy; the analysis presented here assumes that the whole evolution process can be known from the observed counts. The important message here is that, with a problem as complex as the estimation of Covid-19 evolution, it may not be convenient to rely on a single approach (only machine learning or only parametric Bayes); a combination of them is more profitable. The drawback of this hybrid approach is that it becomes difficult to assess its overall theoretical reliability. For instance, we know very well that the Poisson-Gamma model leads to estimators that are close to the true value (in mean squared error) when the prior precision is not too high. However, we do not have explicit theoretical results on the consistency of LSTM models, although there exist many successful applications of them (Karpatne et al., 2017). Even less do theoretical consistency results exist when these two approaches are put in sequence. In contrast, we face a real (not theoretical) problem of predicting and forecasting Covid-19 evolution, and further theory may come in the future to support or disprove the two-stage approach proposed here.

Figure 9: Predicted and observed cumulative counts. Posterior predictive distribution (mean and 95% credible intervals, shaded area) for observed days (red points), along with the prediction for the 30 days after the last observed one. This is the estimate of (1).

Figure 10: Posterior predictive distribution (mean and 95% credible intervals, shaded area) for observed days (red points), along with the prediction for the 30 days after the last observed one. This is the estimate of (1).

Figure 11: Predicted and observed base reproduction index R by region. Posterior predictive distribution (mean and 95% credible intervals, shaded area) for the crude daily estimates (red points), along with the prediction for the 30 days after the last observed one. Note that these are unstable ratios whose distributions can be very skewed, and the point means can lie outside the corresponding 95% credible intervals.

References

Abadi, M., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Agosto, A., et al. (2016). Modeling corporate defaults: Poisson autoregressions with exogenous covariates (PARX).

Agosto, A. and Giudici, P. (2020). A Poisson autoregressive model to understand COVID-19 contagion dynamics. Available at SSRN 3551626.

Bartoszek, K., et al. (2020). Are official confirmed cases and fatalities counts good enough to study the COVID-19 pandemic dynamics? A critical assessment through the case of Italy.
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author).

Chen, Y.-C., et al. (2020). A time-dependent SIR model for COVID-19 with undetectable infected persons.

Giuliani, D., et al. (2020). Modelling and predicting the spread of coronavirus (COVID-19) infection in NUTS-3 Italian regions.

Karpatne, A., et al. (2017). Theory-guided data science: A new paradigm for scientific discovery from data.

Lim, B. and Zohren, S. (2020). Time series forecasting with deep learning: A survey.

Paul, M. and Held, L. (2011). Predictive assessment of a non-linear random effects model for multivariate time series of infectious disease counts.

Polson, N. and Sokolov, V. (2017). Deep learning: A Bayesian perspective.

Schmidhuber, J. (2015). Deep learning in neural networks: An overview.

Sherstinsky, A. (2020). Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network.

West, M. and Harrison, J. (1989). The Dynamic Linear Model.

Wöllmer, M., et al. (2010). Bidirectional LSTM networks for context-sensitive keyword detection in a cognitive virtual agent framework.

The author is responsible for the content of this work; however, it is worth mentioning that this work would not have been possible without the encouragement of colleagues at the Department of Statistics of Universidad Carlos III de Madrid (Spain).