key: cord-0451882-dw2076zq authors: Cardoso, M'ario; Cavalheiro, Andr'e; Borges, Alexandre; Duarte, Ana F.; Soares, Am'ilcar; Pereira, Maria Joao; Nunes, Nuno J.; Azevedo, Leonardo; Oliveira, Arlindo L. title: Modeling the geospatial evolution of COVID-19 using spatio-temporal convolutional sequence-to-sequence neural networks date: 2021-05-06 journal: nan DOI: nan sha: d719828259e3dbe8ea0e800032b13c4eb774fcde doc_id: 451882 cord_uid: dw2076zq Europe was hit hard by the COVID-19 pandemic and Portugal was one of the most affected countries, having suffered three waves in the first twelve months. Approximately between Jan 19th and Feb 5th 2021 Portugal was the country in the world with the largest incidence rate, with 14-days incidence rates per 100,000 inhabitants in excess of 1000. Despite its importance, accurate prediction of the geospatial evolution of COVID-19 remains a challenge, since existing analytical methods fail to capture the complex dynamics that result from both the contagion within a region and the spreading of the infection from infected neighboring regions. We use a previously developed methodology and official municipality level data from the Portuguese Directorate-General for Health (DGS), relative to the first twelve months of the pandemic, to compute an estimate of the incidence rate in each location of mainland Portugal. The resulting sequence of incidence rate maps was then used as a gold standard to test the effectiveness of different approaches in the prediction of the spatial-temporal evolution of the incidence rate. Four different methods were tested: a simple cell level autoregressive moving average (ARMA) model, a cell level vector autoregressive (VAR) model, a municipality-by-municipality compartmental SIRD model followed by direct block sequential simulation and a convolutional sequence-to-sequence neural network model based on the STConvS2S architecture. We conclude that the convolutional sequence-to-sequence neural network is the best performing method, when predicting the medium-term future incidence rate, using the available information. Modeling the spatio-temporal dynamics of the evolution of the COVID-19 pandemic is important for a variety of reasons, which include health service planning, adoption of containment measures, and allocation of resources. In fact, knowing the incidence rate in a given area enables public health officials and individuals to adopt the best strategies to contain the disease and reduce the risk of infection. Regrettably, it is difficult to obtain data that accurately reflects the incidence rate at any given time and place, since statistics are collected at an aggregate level at the country, province, municipality, or city level. Therefore, detailed maps of the incidence rate are usually not available, in part because of data collection limitations but also because, in general, it is not possible to determine exactly where people became infected with COVID-19. However, since contagion depends on individual and social behaviour and only happens when people are in close proximity, incidence rate maps remain an excellent tool to plan and act in order to contain an epidemic that propagates mainly through person-toperson contact. In previous work the authors have proposed an incidence rate spatial model for COVID-19 in Portugal, based on a geostatistical framework and using the confirmed positive tested cases reported by the Portuguese Directorate-General for Health (DGS acronym, in Portuguese) [1] . These incidence rate maps are a very good approximation to the real incidence rate since they are derived from official data and require only mild assumptions about the way epidemics spread through a territory. After a brief review of related work, in section 2, we describe the method used to derive the incidence rate maps in section 3. To test the accuracy of the modeling algorithms in the incidence rate data, we implemented and evaluated four different methods, described in section 4. The results obtained with these models are reported in section 5. Section 6 presents the conclusions gathered from this work and proposes a few directions for future work. In this section, we provide a brief overview of related work on geospatial modeling of disease risks, with a focus on the methods that take into account the continuous nature of the territories. Data concerning diseases is commonly available by administrative regions. For many infectious diseases, statistics on the number of cases by countries, municipalities or other administrative regions are widely available. Thus, the most common approaches to map and visualize the risk of disease is through the use of choropleth maps, where those pre-defined spatial regions are colored in accordance with recorded incidence rates. In the wake of the COVID-19 pandemic, many such analyses have been performed and made available, e.g, [2] [3] [4] . However, such maps suffer from the intrinsic limitation caused by the arbitrary boundaries imposed by the pre-defined administrative regions considered and exhibit undesirable discontinuities at these boundaries. Geostatistics provides the framework to model spatial variability and to account for data with varying spatial support delivering desegregated and continuous maps of the risk of disease, overcoming the limitations of choropleth maps. Additionally, geostatistical risk maps can integrate data uncertainty resulting from population size, which is very important in this case, given that the population density varies widely across the country. A number of methods that estimate the spatial risk of diseases from known datapoints have been proposed [5] , including distancebased approaches and kernel densitiy estimation (KDE) methods. These and other methodologies may be used to convert point data to continuous risk maps. However, these methods have been designed to model static or slow-varying distributions, and cannot be trivially adapted to dynamic modeling of diseases. Combined with timeseries prediction, KDE has had some success in visualizing disease cluster evolution but depends critically on the accuracy of the timeseries prediction method. Direct estimation of the geospatial risk of COVID-19 in Switzerland using high-resolution spatiotemporal data analysis [6] based on the modified space-time density-based spatial clustering of application with noise (MST-DBSCAN) algorithm [7] has been proposed. However, high spatiotemporal resolution health data is rarely available, due to limited information and other limiting factors, such as personal data protection policies. Our modeling of the geospatial risk model, described in Section 3 is based on geostatistical block sequential simulation [8] , assuming a Poisson model for rare diseases as proposed by Waller and Gotway [9] to estimate the local distributions functions [10] . Given the actual municipality-level data, with varying support, this method makes it possible to integrate this uncertainty in the spatial model to generate the incidence rate maps and to assess its spatial uncertainty at any region in the country. Compartmental models have been extensively used in epidemiological studies. The first proposals to group populations in different compartments and to use differential equations to model the transitions between compartments, deriving the dynamics of the processes, date back more than a century [11, 12] . Compartmental models use continuous differential equations to model the evolution of an epidemic in a population of individuals, where each individual is assigned a state (a compartment) that determines its stage in the epidemic process [12, 13] . Since then, many articles have been published on the use of compartmental models to predict and simulate the dynamics of epidemics. In the last year alone, hundreds of articles have been published on the application of compartmental models to study the dynamics of the COVID-19 pandemic in many countries and locations, such as China, Italy, and India [14] [15] [16] . A complete description of the many applications of the models, as well as the variants proposed, is outside the scope of this work. Our own application of compartmental models to predict the dynamics of the pandemic in Portugal is based on the Susceptible-Infected-Recovered-Dead (SIRD) model, which uses four different compartments. Despite the model simplicity, many assumptions and approximations are required to make it fit real-world data, including the consideration of time-varying parameters, adjustments for small-sample, high-variance, data, and estimation of unknown values in the observed data series. We used, with adaptations described in section 4.3, the methods proposed in recent work that used data from the COVID-19 pandemic in New York City, Madrid, and Stockholm, among other cities, together with various states, countries, and regions to estimate a SIRD model [17] . Recurrent neural network (RNN) architectures, based on Long Short-Term Memory (LSTM) units [18] or gated recurrent units (GRU) [19] have been extensively employed in forecasting tasks involving time-series data. However, these models consider the input data as independent sequences of vectors and do not explore the spatial context available in spatio-temporal data. To address this limitation, spatial convolution operations were combined with LSTMs, thereby exploiting the abilities of convolutional neural networks (CNNs) and RNNs to effectively model spatial and temporal information, respectively. In the Convolutional LSTM (ConvLSTM) approach [20] , all input data structures are 3D tensors, with the first dimension corresponding to either the number of measurements or the number of feature maps, and the last two dimensions representing the spatial dimensions (i.e., width and height). By replacing the product operation in the original LSTM with the convolution operation, the future states of a certain cell depend on the inputs and past states of itself and its local neighbours. Many different architectures have been proposed using ConvL-STM building blocks, applied to different problems, such as weather prediction from satelite image sequences [21] , air quality forecasting [22] , and video frame prediction [23] . In the architecture that served as the basis for this work [24] the authors used a standard encoder-decoder structure comprised of multiple stacked ConvL-STM layers, with each decoder layer initializing its hidden states from the output of the corresponding encoder layer. The final predictions of the model are given by the concatenation of the hidden states from the decoder network followed by a 1 × 1 convolution. This method was adapted to this problem by introducing a number of improvements, described in section 4.4. We use the incidence rate, the number of confirmed positive tests in a region, in a period of 14 consecutive days, divided by the population of the region, as a proxy for assessing the risk of infection. In particular, we consider the incidence rate per municipality ( ), defined by where ( ) is the number of confirmed positive tests in the 14 days preceding a given day , at each municipality, , and is the respective population size. We will use to denote the incidence rate at a specific location (or cell) in the country. In the sequence, we may drop the explicit dependence on to simplify the notation, when not required. The daily numbers of new cases by municipality are reported regularly by the Portuguese Directorate-General for Health (DGS), and they have been converted to predicted incidence rate by dividing by the population count of each municipality, as given by official statistics, and scaled to 100,000 inhabitants (thus corresponding to the number of new cases in the most recent 14 days per 100,000 inhabitants). This will be the scale used in the maps and reports in this paper. Since not all cases are detected, this incidence rate may underestimate the real incidence rate. However, it is reasonable to assume that the ratio between the actual number of newly infected persons and the number of detected cases is approximately equal in the different regions of the country, enabling us to use the incidence rate computed in this way as a proxy to the real incidence rate. In practice, we may be underestimating the incidence rate by a signficant factor, since it is well known that a relevant fraction of the COVID-19 cases are asymptomatic and remain undetected. This has an impact on the value of the parameters of the dynamic models, since important characteristics of the process such as the group immunity threshold, depend on the actual rates of infection and recovery. In practice, we believe this underestimation does not have a strong impact in the accuracy of the models (other than a systemic underestimation), since all model parameters are estimated from actual data and updated as time goes by in order to reflect the dynamics of the pandemic. Therefore, the only significant impact of the fact that not all cases are detected is the underestimation of the incidence rate, by a factor that can be considered approximately constant, if one assumes that there are no significant differences in testing strategy from region to region. This assumption can be justified by the relative homogeneity of the health system in a country like Portugal. We consider the country divided in a regular grid of = 40608 square cells, which, in this work, have a dimension of 2 km by 2 km (see Figure 1 for an example of the discretization used). 2. The methodology can be trivially adapted to a different grid. We compute the incidence rate maps by estimating for each cell in the country, using geostatistical stochastic simulation, namely direct block sequential simulation (block-DSS) [25] , which is based on a stationary spatial covariance model (i.e., stationary variogram model). This modelling technique allows the daily update of COVID-19 incidence rate maps at high-resolution together with the associated spatial uncertainty. The geostatistical incidence rate maps are continuous (up to the scale of the discretization) and their spatial distribution follows a given spatial pattern as revealed by a variogram model inferred from the data. The incidence rate maps do not show any sharp discontinuities and are not influenced by administrative boundaries, such as municipality borders. Since incidence rates refer to different population sizes, these must be weighted when calculating the experimental variogram such that municipalities with large populations have greater weighting. Below, we briefly describe this geostatistical simulation method. A full description of the method can be found in previous work by the authors [1] . We assume that the geometric centroid of each municipality has coordinates ( , ). Data is available for the 278 municipalities in mainland Portugal, as illustrated in Figure 2 . For clarity, in this section we will drop the explicit dependency on . The daily update of incidence rates known for each municipality are assigned to each centroid (weighted by population density) and are used as experimental data in the geostatistical simulation. This is a reasonable approximation given the relatively small area of each municipality when compared with the area of the country. Within a geostatistical framework is interpreted as a realization of a random variable corresponding to the true, and unknown, incidence rate in municipality . The expected value of [ ] is an approximation of the incidence rate. depends on the population size of each municipality. For example, a given municipality with small population size (i.e., a small denominator) will have high variance and consequently higher uncertainty in the incidence rate estimate. We use the Poisson kriging model [10] to define the risk variance Block-DSS [25] is a stochastic sequential simulation method based on the Poisson kriging model and an adequate modelling technique to deal with data with varying spatial support (i.e., municipalities with varying size and shape) and to make spatial predictions with change of support, e.g., from area (block) to point support. In this case the scale related to the map cells can be referred as point support, because it denotes a small area when compared with the municipality areas. The following sequence of steps summarizes the block-DSS algorithm [25] : (1) Generate a random path that visits each cell, , of the simulation grid; (2) At each location along the random path, , search the conditioning data within a given neighborhood dependent on the variogram model. The conditioning data comprises the closest experimental data, previously simulated values and block data; (3) Calculate the local covariance values considering spatial covariance matrices between: block-to-block, block-to-point, point-to-block and point-to-point. The point data represents the incidence rates assigned at the centroid of each municipality and the block data are defined as the spatial linear average of point values. These matrices are built to solve the block kriging system and obtain the local mean and variance kriging estimate at location [25] ; (4) Draw a value, , from the global probability distribution function centered at the local mean and bounded by the local variance obtained in (3); (5) Add the simulated value to the data set and repeat steps (3) to (6) until all grid cells are simulated for one realization; We applied the algorithm to compute the incidence values for each cell in the territory, for each day in the period under consideration. Figure 3 depicts an example of this computation, for a specific day in the period. Block-DSS generates alternative incidence rate maps, designated realizations, at each run, as the random path changes and consequently the conditioning data when simulating a location . A given set of realizations provides the value of the uncertainty of incidence rate distribution in a given cell. In this work we simulate daily sets of one hundred realizations of incidence rates for mainland Portugal. Then, we computed the median incidence rate model as it represents the most likely incidence rate at a given period of time and location. The method also derives confidence intervals for the incidence rate in each cell (see example in Figure 4 ). The value of the incidence rate at each cell represents a good approximation, given existing data limitations, to the real incidence rate at that location. Both the incidence risk maps and the confidence intervals are made available as supplemental data. To model the spatio-temporal spread of the COVID-19 epidemic in Portugal, we implemented four models: • an ARMA autoregressive-moving-average at the cell level; • a vector-autoregressive model at the cell level that takes into account the previous values of all other cells; • a municipality level compartmental SIRD model coupled with block-DSS kriging; • a spatio-temporal convolutional sequence-to-sequence neural network. All models produce predictions at the cell level. These predictions are then compared with the gold standard obtained using the method described in section 3. The baseline ARMA(p,q) model is a simple, cell-level, autoregressivemoving-average model [26] . In this model, the incidence rate for each cell is computed using where , ( ) and ( ) are calculated, for each cell , using linear regression, in order to minimize the observed errors, ( ). We used the ARIMA package in Python to compute an independent ARMA model for each cell . The resulting model was used to predict the temporal evolution of the incidence rate ( ) for each cell at time . The ARMA model suffers from a strong limitation, since it computes the evolution of the incidence rate at each cell not taking into account the values of any other cells. To circumvent this limitation we also tested a vector autoregressive (VAR) model [27] , which computes the incidence rate at each cell using all the values from the other cells, in accordance with where denotes the incidence rate and is an × time-invariant matrix, calculated, together with , using linear regression, in order to minimize the observed errors, ( ). We used the VAR package in Python to compute a VAR model for the data and a prediction of ( ), for each cell at time . = , = , where , , and are the rates of infection, recovery and mortality, respectively. We computed a separate SIRD model for each municipality, , with independent parameters , , and , for a total of 278 municipalities. Furthermore, as other authors have done [17] , we removed the assumption of the basic SIRD model that the rates , , and are constant and depend only on the nature of the epidemic and the fixed spreading dynamics. Since, in practice, the behavior of an epidemic is sensitive to changes in human behavior and to other environmental variables, we used a generalized version of the SIRD model that uses time-varying rates, which are calculated from known pandemic data, at each municipality , resulting in a series of values for each parameter: where ( ) represents the new confirmed cases at time , ( ) = ( ) − ( − 1) the daily recovered, ( ) = ( ) − ( − 1) the daily deaths, and ( ) the total active cases at time , given by To avoid instability and to reflect the fact that the , , and rates vary slowly, once the parameters are calculated for the entire historical data, kernel smoothing is applied to reduce noise. We use Nadaraya-Watson as the kernel with a bandwidth of 10. Each Portuguese municipality is treated as an independent region, modeled by its own set of SIRD equations, which can be used to forecast the evolution of the cases. In order to predict the future number of infections for each municipality, we solve the SIRD model forward in time using the last day of historical data as the initial conditions for , , and . We obtain the values for ( ), ( ) and ( ) to use during the forecast period based on the historical smoothed parameter series. This can be done in three ways: (1) Using their last known value; (2) Using an average of the last days; (3) Estimating via extrapolation. Equations 9, 10, and 11 compute all parameters relative to the number of active cases. This can lead to severe fluctuations of parameter values when the population of the municipality is small, since the variance of the parameters will be high. We mitigate this by introducing a pseudo-count component that forces parameter values to consider not only regional but national data as well -therefore avoiding unfounded radical shifts. Let ( ), ( ) and ( ) be the parameters for municipality at time , and ( ), ( ), ( ) the values for the rates calculated on the national level. We can now calculate * ( ) as * ( ) = ( ) + ( ) where is the pseudo-count. If = 0, then the national parameters are ignored. As increases so does their influence but remains inversely proportional to the region's population size. The values for * ( ), * ( ) are computed in the same way. * ( ), * ( ), and * ( ) are then used in equations 5 to 8 to model the evolution of the relevant variables, instead of using the rates computed directly from the historical time series. Privacy laws create a challenge in obtaining the number of deaths for each region daily. Furthermore, it is difficult to determine the exact recovery date for every single patient accurately. As a way of overcoming these limitations, we approximate the values for each region's number of recoveries and deaths based on its population size, the mean recovery period, and the national mortality ratio. We used a mean recovery period of 14 days, which led to the best fit of the models to the real data. First, we calculate the national mortality rate for each day : where is the current day, is the expected recovery time, ( ) is the number of deaths in the country on day , and ( − ) is the total number of infected people in the country at time − . Then, for each municipality and for each day , we estimate the number of recoveries and deaths, as follows: where ( ), ( ), ( ) are the number of recoveries, total active cases and deaths at time , respectively. Computing the geographical incidence rate using the SIRD model. By solving the SIRD equations forward in time, with the estimated parameters, smoothed using pseudo-counts, we obtain the values of ( ) for each municipality and each time . From ( ) we can trivially compute the number of new cases, for each day, which makes it possible to compute the incidence rate ( ) for each time . This data, generated by the SIRD model, is then used instead of the real data, as input to block-DSS algorithm, using the procedure described in section 3 to compute the incidence rate in each cell of the territory at time , ( ). This approach uses recently proposed techniques for spatio-temporal modeling using neural networks, resulting in a model derived from the STConvS2S architecture [24] , coupled with new normalizationactivation layers and learnable parameters intrinsic to each location. The STConvS2S architecture is a sequence-to-sequence model comprised exclusively of convolutional layers, which are suited to capture spatial features by design (see Figure 5 ). Convolutions are performed with factorized 3D kernels = × × , where is the size of the temporal kernel and is the size of the spatial kernel. The first component is a temporal block that extracts temporal features through the temporal kernels (i.e., ×1×1), while the second component is a spatial block that receives the output of the previous block in order to extract spatial features through the spatial kernels (i.e., 1 × × ). Both the temporal and spatial blocks are comprised of successive convolutional blocks with normalization-activation operations, with the temporal block using causal convolutions to maintain temporal coherency during prediction (i.e., predictions for a time-step make no use of future information from time-steps + 1 onward). The feature maps generated throughout the architecture are adequately padded in order to maintain the initial dimensions, and the intermediate feature maps are increased by a factor of 2 in each layer, with the last convolutional layer in both the spatial and temporal block reducing the number back to the initial amount. Instead of the traditional batch normalization and activation function operations following each convolution, we apply an extension of the normalization-activation layers recently proposed [28] . The chosen operation is an adapted version of the EvoNormB0 layer, which was the best performing batch-based version reported, and explicitly models spatio-temporal scenarios, considering sequences of two-dimensional inputs when calculating both the batch and instance variance (i.e., the values associated to each time-step in the input sequence are considered separately when computing the variance). Let 1 , and denote learnable parameter vectors, and let , , ,ℎ and , ,ℎ represent the variance of a mini-batch and the variance of a single instance, respectively. This extension, denoted here as B03D, is defined as follows: , ,ℎ ( ) + · + (17) Another technique used in our model was the extension of the input data with learnable local features and weights, intrinsic to each spatial location, as originally proposed for the area of wind forecasting, where changes in location imply changes in the model parameters [29] . This technique seeks to improve spatio-temporal forecasting scenarios by combining both (i) global location-invariant features, and (ii) location-specific features. Typical CNNs are mostly translation invariant. Translating an input and convolving with a filter will yield the same result as translating the feature map resulting from a convolution between and . Therefore, standard CNNs treat each spatial location equally and thus learn global patterns. This is often insufficient in many To allow the learning of such features, we used two complementary techniques, Learnable Inputs (LI) and Local Weights (LW). The LIs correspond to a set of trainable parameters with the same spatial dimensions as the original input, concatenated with the input before processing with a convolutional layer. These parameters allow the network to learn local features regarding every spatial location, which can complement the learned global patterns or be ignored by the convolutional kernels in situations where they are not beneficial. The LWs further reinforce the individualized learning of different spatial locations, through a locally-connected layer (i.e., a convolutional layer with a different filter at each input region, allowing different spatial locations to be weighed according to their relevance) of weights over the input, resulting in trainable weights with the same spatial dimensions as the input. Similar to LIs, these weights are afterwards concatenated with the original input. In our experiments, both the LIs and LWs are concatenated with the inputs on the channel dimension (sharing the LIs across the different input time-steps and using different LWs in each timestep), prior to every convolutional layer. LIs are implemented with a locally connected layer that receives as input a constant unitary tensor with the same spatial dimensions as the original input, and the LWs are implemented with a separate locally connected layer that receives the original tensor as input. We performed an empirical comparison of the four methods described in section 4 by using them to predict the incidence rate 7 and 10 days ahead (corresponding to time-step + , with = 7 and = 10), using only past data up to time-step . The methods were used to predict this variable from September 7th, 2020, until February 28th, 2021. Each model was initially trained with data from the first six months of the pandemic (March 1st 2020 -August 31st 2020) and, as the prediction moved forward, was allowed to use existing data up to days before the day being predicted. During the periods used for training and testing (twelve months) Portugal faced three major waves of the infection. The prediction period included the two largest waves, which reached incidence rates never seen during the training period. Figure 6 shows the 14day incidence rates, per 100,000 inhabitants, for the whole country, during the first year of the pandemic. The ARMA and VAR baseline models used as input data the entire past sequence for each cell up to time . They were then used to predict the next days sequentially. We considered only the last value in the output sequence, corresponding to the time-step + , and the process was repeated by sliding the window one day into the future and concatenating the ground-truth value corresponding to the time-step + 1 with the existing data. For the SIRD compartmental models no training is necessary. As explained in section 4.3.5, the time-varying parameters for each municipality were determined and used to solve the equations forward in time. For each parameter we calculate each day in the historical dataset and then use extrapolation to project its values forward days. By solving the equations for each municipality we get a prediction of the future number of new cases for that municipality, . We then applied the block-DSS algorithm to determine the incidence rate for each cell in the territory, as described in section 4.3.5. The application of the block-DSS algorithm used the same spatial covariance matrix as the one used to build the gold standard data set. For the STConvS2S architecture, the model received as input a sequence of contiguous days, and generated a sequence corresponding to the next days. The final prediction is thus given by considering only the last time-step in the output sequence. This process was repeated by sliding the input window one day into the future, as was done for the ARMA and VAR models. Furthermore, in order to incrementally update the parameters as new data becomes available, we apply a simple online learning procedure, fine-tuning the model with each new input sequence for 5 epochs, after the prediction was made. This enables the model to quickly adapt to sudden changes. Regarding hyperparameter choices, for the ARMA model we consider = 7 and = 1. For the VAR model, we used = 4. For the SIRD model, we used the pseudo-count parameters = 10, 000 in the prediction 7 days ahead and = 100, 000 in the prediction 10 days ahead. This values were selected as the one that obtained better results. As in the original proposal, the STConvS2S architecture used 3 convolutional layers for both the temporal and spatial block, plus the final convolution to reduce the channel dimensionality back to one. The initial number of convolutional filters was set to 32, and each filter was of dimensionality 5 × 5. In order to select the most optimal Local Inputs and Learnable Weights configuration, we varied the filter size of the LWs between 1 × 1 × , 2 × 2 × and 1 × 1 × 1 (i..e, in this last case, considering a direct element-wise weighing unique to each spatial location and time-step in the input sequence), where corresponds to the number of time-steps in the input sequence. Furthermore, we varied the number of LIs and LWs ∈ {1, 2, 3}. The parameters were finally set as = 1 × 1 × 1 and = 2. Training relied on the AdaMod optimizer using a learning rate of 10 −3 and batch size of 5. Furthermore, for the STConvS2S model, in order to fit in the GPU memory and to avoid losing information over individual regions, the same architecture was trained and tested on three different subsets of the input region, corresponding to the upper third, middle third, and lower third of Portugal. The final prediction results were then concatenated, resulting in a prediction encompassing all of mainland Portugal. The predictions made by each method were compared with the reference incidence rate values, obtained as described in section 3. We computed two commonly used figures of merit in order to compare the predictions of the models with the reference data, the RMSE and the sMAPE. For a given day , the root mean square error (RMSE) between the predicted value and the reference value, averaged over all cells, is given by: where ( ) is, as before, the incidence rate in cell at time and ( ) is the value predicted by the model. Since absolute values of the incidence rate vary greatly by location, it is also relevant to compute a second figure of merit, the symmetric mean average error (sMAPE): Figure 7 and 8 show the evolution of these two figures of merit for the four methods used (day 0 corresponds to September 7th, 202o). The superior performance of the STConvS2S method is clear, when compared with both the baseline ARMA and VAR models, and the SIRD model. As always, the incidence rate is reported as the number of new cases in the last 14 days per 100,000 inhabitants, and the value of the RMSE uses the same scale. Table 1 shows the values for the RMSE and the sMAPE of the predictions, averaged over all days in the period September 7th, 2020, to February 28th, 2021. This table also shows the clear superiority of the STConvS2S algorithm. The SIRD model seems to be more stable as the horizon of the prediction is increased, although we needed to adjust the value of the hyperparameter , so that the longer-term prediction uses more heavily the national level data in the computation of the parameters of the dynamics. Figures 9 and 10 illustrates the quality of the predictions for week seventeenth of the predicted period, the week after Christmas (December 28th 2020 -January 3rd 2021). This was a particularly relevant week, since it corresponded to a sharp inflection of the tendency and the start of the third wave. The predictions were made with the data available until December 21st, 2020, for the first day in the week. The window was then adjusted by one day for each successive day. The results clearly show the superior predictive ability of the STConvS2S model which, in this particular week, only exhibited significant error in a sparsely populated region in the south of Portugal, Alentejo, while the other models exhibited a much more significant error in different parts of the country. Other weeks exhibit similar behavior, although there is significant variation in the model errors over time, as shown in Figures 7 and 8. The results obtained in our tests have shown conclusively the superior predictive ability of the STConvS2S method, when compared with the other methods. We attribute this superiority to the ability of this model to take into consideration the incidence rate at neighboring cells, when predicting the evolution of that rate in a given cell. Although, conceptually, the VAR model could also use this information, it has no access to the position of each cell and, even though it beats the basic ARMA method, it does not reach the level of precision attained by the STConvS2S model. The SIRD model suffers from the same limitations as the ARMA model, in that it cannot use information from the neighboring municipalities to infer the dynamics of the pandemic. In principle, this model should have been the one better tuned to the reference data, since it uses exactly the same method to infer cell level incidence rates from municipality level rates as the approach used to obtain the reference data: the Block-DSS krigging algorithm. We attribute the lower precision of this method to two factors: the inherent limitations of the time-varying SIRD models to make predictions far in the future due to the changing pandemic dynamics; and the inability of the SIRD model to take into consideration the geographical relations between municipalities. We conjecture that a modified SIRD model that uses information from neighboring municipalities may improve significantly these results. We computed an approximation to the spatio-temporal incidence rate of COVID-19 in mainland Portugal, using a methodology that uses the official municipality level information made available by the Portuguese Directorate-General for Healh (DGS) post-processed by direct block sequential simulation (block-DSS). This dataset is made available with the publication of this paper, and was used as a gold standard for the spatio-temporal evolution of the COVID-19 incidence rates during the first year, in this specific geography. We then used these data as a gold standard to test the behavior of four predictive models: an autoregressive-moving-average model, a vector autoregressive model, a model based on the combination of a SIRD compartmental model with block-DSS, and a model based on the STConvS2S methodology. We concluded that the STConvS2S neural network performed significantly better than the alternatives considered, and should therefore be considered the state-of-the-art for this problem. We invite researchers interested in applying spatiotemporal prediction methods to use these data, and to compare the predictions with the ones obtained by the four methods reported in this study. All the data used in this paper, a significant set of additional results including the day-by-day previsions, maps and videos of incidence rate, as well as the code for the SIRD and the STConvS2S models is available at http://covid.vps.tecnico.ulisboa.pt/models. The baseline incidence rate data and the previsions of this model have been integrated in the interactive information site available at http://covid.vps.tecnico.ulisboa.pt. Geostatistical COVID-19 infection risk maps for Portugal A multimethod approach for county-scale geospatial analysis of emerging infectious diseases: a cross-sectional case study of COVID-19 incidence in Germany Social disparities in the first wave of COVID-19 infections in Germany: A county-scale explainable machine learning approach Mapping the burden of COVID-19 in the United States Spatial point pattern analysis and its application in geographical epidemiology Geospatial digital monitoring of COVID-19 cases at high spatiotemporal resolution Characterizing diffusion dynamics of disease clustering: A modified space-time DBSCAN (MST-DBSCAN) algorithm Stochastic simulation model for the spatial characterization of lung cancer mortality risk and study of environmental factors Applied spatial statistics for public health data Geostatistical analysis of disease data: estimation of cancer mortality risk from empirical frequencies using Poisson kriging An application of the theory of probabilities to the study of a priori pathometry -Part i Containing papers of a mathematical and physical character The mathematical theory of infectious diseases and its applications. Charles Griffin & Company Ltd, 5a Crendon Street, High Wycombe, Bucks HP13 6LE Chinese and Italian COVID-19 outbreaks can be correctly described by a modified SIRD model A time-varying SIRD model for the COVID-19 contagion in Italy Studying the progress of COVID-19 outbreak in India using SIRD model Estimating and simulating a SIRD model of COVID-19 for many countries, states, and cities Long short-term memory Learning phrase representations using RNN encoder-decoder for statistical machine translation Convolutional LSTM network: A machine learning approach for precipitation nowcasting Psique: Next sequence prediction of satellite images using a convolutional sequence-to-sequence network Plumenet: Large-scale air quality forecasting using a convolutional lstm network Predrnn: Recurrent neural networks for predictive learning using spatiotemporal LSTMs STConvS2S: Spatiotemporal convolutional sequence to sequence network for weather forecasting A package for geostatistical integration of coarse and fine scale data Prediction and regulation by linear least-square methods Macroeconomics and reality Evolving normalization-activation layers Localized convolutional neural networks for geospatial wind forecasting The authors would like to thank André Peralta Santos and the Portuguese Directorate-General for Health, for making available the municipality level epidemiological data. This work was funded by projects INTAKE, SMOCK and MARÉ, financed by the Portuguese Science Foundation.