key: cord-0864051-4ffbqpkk authors: Fenga, Livio title: Forecasting the CoViD19 Diffusion in Italy and the Related Occupancy of Intensive Care Units date: 2020-04-01 journal: nan DOI: 10.1101/2020.03.30.20047894 sha: 905ff29124af7e2b7698a6816dee5f1ee7065a85 doc_id: 864051 cord_uid: 4ffbqpkk This paper provides a model based method for the forecast of the total number of currently CoVoD19 positive individuals and of the occupancy of the available Intensive Care Units in Italy. The predictions obtained, for a time horizon of 10 days starting from March 29th, will be provided at a national as well as at a more disaggregate levels, following a criterion based on the magnitude of the phenomenon. While the Regions which have been hit the most by the pandemic have been kept separated, the less affected ones have been aggregated into homogeneous macroareas. Results show that , within the forecast period considered (March 29th April 7th ) , all of the Italian regions will show a decreasing number of CoViD-19 positive people. Same for the number of people who will need to be hospitalized in a Intensive Care Unit (ICU). These estimates are valid under constancy of the Government s current containment policies. In this scenario, Northern Regions will remain the most affected ones and no significant outbreak are foreseen in the southern regions. On March 19th, the death toll paid by Italy for the spread of the virus CoViD-19 amounted to 3405 deaths, the highest paid by a single country in the World. Despite an hard and relatively timely lock-down policy implemented by the Government, on March 26 this figure has risen to 8165 deaths. In such an emergency situation, a reliable forecast method for the infection's development is essential for policy and decision makers to design evidence-based policies and to implement fast actions to curb the spread of the infection. In particular, predicting the number of people currently tested positive for CoViD-19 (thereafter "positive cases") could be useful to draw the epidemiological curve of the infection and therefore to predict its peak. Other than for this variable, the forecasting procedure presented in this paper is used to predict the future values of another crucial variable, i.e. the number of people needing hospitalization in a Intensive Care Unit (ICU). The Italian ICUs system is at the moment severely stressed due to the spread of the disease, therefore predictions of future ICUs demand could be fruitfully considered in the design and the implementation of operational schemes. The forecast horizon for both the variables is of 10-day starting from March 29th. Page 1 of 19 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https: //doi.org/10.1101 //doi.org/10. /2020 Since the Italian regions are affected in different extents by the CoViD-19, it has been decided to perform the forecasting exercise for the following geographical areas: Lombardia, Piedmont, Valle d'Aosta, Veneto, Friuli Venezia Giulia, Trentino Alto Adige, Lazio and Campania. The remaining Regions have been grouped in the following macro-areas: "Center" (Marche, Umbria and Toscana) and "South" (Abruzzo, Molise, Puglia Basilicata, Calabria, Sicilia and Sardegna). At least other two reasons justify such a break down: 1. the different starting times recorded for the lock-downs; 2. the Southern regions have been hit less severely and therefore, especially at the beginning of the observation period, show several zeroes or low numbers across the considered time span. In essence, in this study the available official data have been employed in a three step procedure, i.e.: 1. data pre-processing, in which data anomalies are identified and corrected according to an approach of the type a Kalman filter; 2. univariate forecasting, based on a autoregressive moving average (ARMA) model for number of positive cases and ICU; 3. bootstrap-based generation of predicted values and confidence intervals. . This paper employs the data related to COVID-19, collected and regularly updated by the Italian National Institute of Health (an agency of the Italian Ministry of Health) and by the Italian Civil Protection Department. The whole data set is freely and publicly available in a comprehensive database, accessible on the Internet at the web address https://www.iss.it/. It collects crucial data related to all the persons tested for COVID-19 -from the outbreak of the pandemics (February the 24 th ) -and, in particular, it 1. is a collection of 21 data points -representing 19 Italian Regions plus the two autonomous provinces of Trento and Bolzano -for each day of infections; 2. considers crucial variables, such as positive cases, recovered cases, deaths, number of people hospitalized and number of people admitted to Intensive Cure Units (ICU). As already pointed out, in the present study, the variables of interest are the number of people who have been: 1. tested positive for SARS-CoV-2 (in what follows denoted by the bold Latin letter V); 2. hospitalized in a ICU (which will be denoted by the bold Latin letter U). It is worth to outlining how, according to the regulations issued by the Italian government, only the people showing moderate to severe symptoms, generally associated with the infection, or which have been in close proximity with at least one positive person, are tested. Therefore, the predictions obtained are to be referred to the sample, as no attempt have been made to carry out inferences procedures for variable estimation at the population level. Page 2 of 19 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101 /2020 In order to correctly process the data, all the regions showing no positive cases at the beginning of the recording period and/or low values along the whole time span have been aggregated into macro areas. This has been done to i) give more meaningful results and ii) save degrees of freedom (which are always precious in short time series). In details, the prediction exercise will be performed on the following Regions/macro areas: The north Italy Regions -presently the more severely affected by the pandemic -have been treated separately along with two other regions, i.e. Lazio and Campania, since their major cities -Rome and Naples -deserve special attention for the institutional role played and the population density exhibited. On the other hand, the Regions showing less worrying figures have been aggregated into macro areas according to their geolocation. The only exception is Valle d'Aosta, which has been left separated as no aggregation options could be found. To simplify notation, for both the variables of interest V and U , the following convention is introduced: Here, the upper left superscript (denoted by the upper case Latin letter K) refers to the geographical areas (i.e. North, Center and South) whereas the subscript j is associated to the number the different regions or macroareas are codified with, as above detailed. For example, by the symbols A V 6 and B V 2 the number of positive cases for the Emilia Romagna region and the Center macro-area "Marche -Umbria -Toscana" are respectively identified. Page 3 of 19 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.03.30.20047894 doi: medRxiv preprint Missing data and other anomalies become the first challenge when designing predictive models, as statistical methods, in general, are designed and tested under the assumption of no missing observations (Moritz et al. (2015) ). Before delving into the details of the proposed procedure, a word of caution is needed since, unfortunately, a visual inspection of the database suggests the presence of a number of anomalous data both at a regional and country level. The detected anomalies might be associated to the biological samples collecting process and the testing procedures. In fact, the typical lab-workflow is governed by a set of rigid protocols which might be critically affected by factors such as the availability of manpower, swabs, reagents and other laboratory materials. In emergency situations, such a workflow can be disrupted and temporal inconsistency might appear as a result. For example, a set of samples might be delivered to a laboratory with longer than usual delays with respect to the time of collection or a given lab can only complete a screening process for a certain number of samples. In both the cases, one day (or more) shift in the release of the lab results can be reasonably expected. A further source of anomalies is represented by the data entry and data editing processes, carried out in working environment necessarily affected by the risk of contagious and under rigid deadlines. An example of such anomalous data is given in Figure 1 , where the series 1 V 1,t (Lombardia) is depicted. Here, some data points showing values inconsistent with the overall pattern are clearly noticeable. Given the (very small) available sample size, the relative weight of such data is almost surely not negligible and can introduce severe distortions in the model parameter inference procedures and thus in the predicted values. In order to correct those data, a Kalman-smoother state-space model (Simon (2001) ) has been applied. In particular, the Kalman smoother adopted is of the type fixed point smoothing. This algorithm is designed to obtain the estimate of a realization w N (the time t N is fixed N < K) of a given random variable W t , given a set of observations Z k = {z k |0 ≤ N ≤ k} Sage and Melsa (1971) . In figure 2 the corrected version of the series 1 V 1,t -resulting by applying the Kalman smoother -is depicted. Not only this series lends itself to a better visual inspection but, more importantly, is more suitable to be processed by the adopted prediction model. The approach used in this paper relies on i) the theory of stochastic process and ii) a resampling method. While the former is necessary to generate the input (predicted values) of the bootstrap algorithm, as well as to justify the employment of the outlier correction method, the latter serves the purpose of 1. generating the final predictions, which are affected by a reduced amount of uncertainty (with respect to those generated by the stochastic model) 2. yielding the related confidence intervals. The approach proposed in the present paper relies on the assumption that the (transformed) time series K V j, t and K V j, t are approximately a realization of a process of the type ARMA (Autoregressive Moving Average) (Makridakis and Hibon (1997) ). . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.03.30.20047894 doi: medRxiv preprint Let X = (X t ) t∈Z be a real 2 nd order stationary process, it is said to admit a ARMA(p,q) representation (p,q ∈ Z ) if, for some constant a 1 ....a p ,b 1 ...b q , will be: under the following conditions: Here F t denotes the sigma algebra induced by (j), j ≤ t and p j=0 a j Z j and p j=0 b j Z j are assumed not to have common zero. The above conditions assure that X t can be represented as: with β j decreasing to 0 at geometric rate. Page 5 of 19 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.03.30.20047894 doi: medRxiv preprint The dynamics of the series under investigations are not suitable for this theoretical framework which requires 2 nd -order stationarity; this is achieved by preprocessing the series according to the following filter: log(∇ d ), being the symbol ∇ the difference operator and the exponent d indicating the order of the difference. To fully understand the role played by ∇, the backward operator B is now introduced. In essence, B moves the time index of an observation back by p time intervals, i.e. B p x t = X t−p , and thus we have that (2) In order to extract valuable information from our data and, at the same time, decrease the total amount of uncertainty associated to the outcomes of the ARMA model, a resampling procedure has been employed. Among the several resampling methods for dependent data available -many of which freely and publicly available in the form of powerful routines working under software packages such as Python ® or R ® -the adopted resampling method is of the type Maximum Entropy Bootstrap (MEB). Proposed by Vinod (2006) and subsequently improved (see, e.g., Vinod (2016) ), it is based on basic assumptions which are different from those usually followed by standard schemes. In more details, while in the classic bootstrap an ensemble Ω represents the population of reference the observed time series is drawn from, in MEB a large number of ensembles (subsets), say {ω 1 , . . . , ω N } becomes the elements belonging to Ω, each of them containing a large number of replicates {x 1 , . . . , x J }. Page 6 of 19 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101 /2020 Unlike standard bootstrap schemes, in the MEB case the resample set Ω mimics the observed realization of the underlying stochastic process, in MEB a large number of subsets, say {ω 1 , . . . , ω N } becomes the elements belonging to Ω, each of them containing a large number of replicates {x 1 , . . . , x J }. Among the important features of the MEB scheme, it is worth mentioning the consistency of its bootstrap samples with the ergodic theorem (see, e.g., Birkhoff (1931) ) and with the probabilistic structure of the observed time series. In Figure 3 an example of the application of MEB for the variable 1 V t,1 is given. In what follows, the proposed procedure is presented in a step-by-step fashion. 1. Eqn. 1 is estimated for both V t and U t so that the model orders (Eqn. 1) M 1 and M 2 become available; 2. for each time series V t and U t the MEB procedure is applied so that the sets V and U -each containing B = 500 "bona fide" replications -are available, i.e. Figure 4 the set V for the variable 1 V t,1 is given); 3. for each of the replications stored in V, Eqn. 1 is estimated according to the model order selected, i.e. M 1 , and the 1 to 10-step-ahead predictions -as well as the 5% and 95% bootstrap confidence interval -are generated; 4. the B predictions and the confidence intervals obtained in the previous step are stored in the B × 3 matrix [F V (h); h = 1, 2, . . . , 10], whose columns are: lower bootstrap confidence interval, bootstrap prediction and upper bootstrap confidence interval, respectively denoted by the symbols CI * L,b (h), V * t,b , CI * U,b (h) b = 1, . . . , B. is then extracted along with the ≈ 95% confidence intervals CI * L,b (h = 1) and CI * U,b (h = 1), computed according to the t-percentile method. The explanation of this procedure goes beyond the scope of this paper, therefore the interested reader is referred to the excellent paper by Berkowitz and Kilian (2000) . 6. CI * L (h = 2, . . . , 10) and CI * U (h = 2, . . . , 10) (the subscript b is omitted for brevity) are computed conditional to a subset of V, say V, made up of the bootstrap replications whose range falls between the minimum and maximum values of the values of the confidence intervals computed for h = 1. In symbols: 7. steps 1-6 are repeated for U t , so that a new matrix of prediction of dimension B × 3 is built, i.e. [F V (h) h = 1, 2, . . . , 10], whose columns are as in F U (h) and denoted by the symbols CI * U (h),Û * t , CI * U (h). Unfortunately, the whole procedure cannot be considered fully automatic since the estimation of Eqn. 1 (step 1) is required. Page 7 of 19 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https: //doi.org/10.1101/2020.03.30.20047894 doi: medRxiv preprint 5.1. The adopted models The stochastic model structures identified for both V t and U t are almost always of the type ARMA (1,0), with the exception of Campania (ARMA(0,1), for both the variables V t and U t )) and Emilia Romagna, for which the best model for the variable U t is of the type ARMA(1,1). The most suitable prefilter (Eqn. 2) has been always of the type d=3 difference of the natural log of the variables of interest. At the national level (data have been plotted in Figure 4 ) the peak in the number of CoViD-19-positives will be reached on 2 April, with a number of predicted positive close to 77,000. The maximum forecasted value for the occupied ICU -expected for April 4 -will be 4280. These values have been calculated using an indirect methodology, i.e. by summing up the estimates obtained at a disaggregated level. Regarding the results obtained at a disaggregated level, the models outcomes are now commented. • Lombardia -the most affected region -will reach the peak of positive cases (25963) and of the demand of ICUs (1425) respectively on 2 and 4 April; • Emilia Romagna is the second most affected Region by COVID-19 but still shows a very high number of victims. The trend of infected people will reach its peak on April 5th whereas the number of cases in Intensive Care will continue to grow at a progressively slower rate over the forecasting period; Page 8 of 19 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101 /2020 • Veneto is the third Region for number of deaths. Here, the number of positive cases, as well as the number of cases in ICU, will reach the peak on April 3rd; • for Piedmont -the fourth Region for number of victims -the predicted positive cases will reach the peak on March 29th (6635) whereas the persons in ICU will be 431 on 31 March, when the peak is predicted; • Liguria will begin a process of relative reduction of positive cases as early as March 29th. The number of cases in Intensive Care, after a period of stability (lasting until 31 March), will start a slow decreasing path; • Positive cases in Trentino Alto Adige -which incorporates the cities of Trento and Bolzano -are projected to be 2158 on March 30th and then a decreasing trend is expected. The ICU beds occupied in this region will reach its peak on around 3 April; • The positive cases in Friuli Venezia Giulia show a relatively stable trend in the first half of the prediction interval with a peak around April 4th, after that the absolute number of cases will start decreasing. The number of cases in ICU will reach the peak between 30 March and 1 April; • Valle D'Aosta is a small region which has been relatively less impacted by the virus. Here, a downward trend is expected to start on March 31st (for the positive cases) and around 31 March (cases in ICUs); • The upward trend in the number of positive cases of Lazio is estimated to stop on 31 March and to reach the minimum at the end of the forecasting people (1821 cases). The number of ICU cases is estimated to reach its peak on the period 1-3 April; • The Macro-area Center will reach its peak at the very beginning of the month of April (for the variable V ) whereas for the variable U the estimated peak day is around 31 March; • Campania will reach the peak of contagions on April 5 whereas ICU cases will do on the previous day; • The remaining southern regions (Abruzzo, Molise, Puglia, Basilicata, Calabria, Sicily and Sardinia) will show an upward trend in the number of future positive cases lasting until 6 April, where 6355 cases are predicted. The number of persons requiring an ICU will reach the peak on 4 April (348 is the estimated number of cases). Page 9 of 19 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https: //doi.org/10.1101 //doi.org/10. /2020 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https: //doi.org/10.1101 //doi.org/10. /2020 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101 /2020 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101 /2020 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101 /2020 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10. 1101 /2020 The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy or position of the Italian National Institute of Statistics. Recent developments in bootstrapping time series Proof of the ergodic theorem ARMA models and the Box-Jenkins methodology Comparison of different methods for univariate time series imputation in R Estimation theory with applications to communications and control Embedded systems programming New bootstrap inference for spurious regression problems Maximum entropy ensembles for time series inference in economics