key: cord-0438770-8hwpkoma authors: Fiorini, Stefano; Ciavotta, Michele; Maurino, Andrea title: Listening to the city, attentively: A Spatio-Temporal Attention Boosted Autoencoder for the Short-Term Flow Prediction Problem date: 2021-03-01 journal: nan DOI: nan sha: f94cd0b6386962107eda4579ffa157911c2f5f3d doc_id: 438770 cord_uid: 8hwpkoma In recent years, studying and predicting alternative mobility (e.g., sharing services) patterns in urban environments has become increasingly important as accurate and timely information on current and future vehicle flows can successfully increase the quality and availability of transportation services. This need is aggravated during the current pandemic crisis, which pushes policymakers and private citizens to seek social-distancing compliant urban mobility services, such as electric bikes and scooter sharing offerings. However, predicting the number of incoming and outgoing vehicles for different city areas is challenging due to the nonlinear spatial and temporal dependencies typical of urban mobility patterns. In this work, we propose STREED-Net, a novel deep learning network with a multi-attention (spatial and temporal) mechanism that effectively captures and exploits complex spatial and temporal patterns in mobility data. The results of a thorough experimental analysis using real-life data are reported, indicating that the proposed model improves the state-of-the-art for this task. In recent years, many researchers have studied and created models to describe and predict mobility dynamics, or flow prediction in urban areas. This interest is motivated by the need to comprehend displacement dynamics, which are also rapidly changing due to alternative electric and shared public transport systems, to define effective regulatory strategies for human mobility and freight transport in the smart city [1] . A clear example of the public decision-maker's regulatory action concerns the capacity constraints on public transport (in Europe) imposed to control the spread of the SARS-CoV-2 virus that significantly reshaped mobility habits in large urban areas. Mobility models are traditionally deployed to plan actions over the long term, but acting in a timely (even preventive) manner is becoming increasingly necessary to achieve a high quality of service and availability. This need has grown in recent years, mainly due to the introduction of new sharing services, which according to the latest Moovit report [2] , are increasingly appreciated by consumers for the shorter commuting time and the greater flexibility to reach areas not well served by public transport. In this regard, this research draws on the following two guiding scenarios. In the first one, we consider a company, which runs shared mobility services, that exploits demand forecasting models to improve short-term resource planning. In the second scenario, a public administration wants to predict the number of vehicles entering the different areas of the city to identify in advance possible traffic jam conditions. In both cases, the area of interest is divided into a grid, each element of which is a region. The number of vehicles entering (Inflow) and exiting (Outflow) each region must be predicted. This problem has inherently spatio-temporal characteristics; evidently, the vehicular flow entering (exiting) a region does not only present temporal dependencies (time of day, flow in the previous hours) but also spatial dependencies as it strongly depends on the traffic leaving (entering) adjacent areas. 
Formally, such considerations relate to two widely recognized properties in the study of displacement dynamics [3], namely temporal and spatial correlation. Mobility data are innately continuous time series generally not associated with abrupt changes. This means that displacement dynamics in temporally close periods share similarities, and this phenomenon is all the more true as the data sampling frequency increases. Moreover, different neighboring areas featuring similar functional characteristics (e.g., residential, commercial and industrial areas) often show correlated traffic patterns. However, the models proposed in the literature for predicting vehicular flows entering or leaving a particular area generally tend to consider all areas adjacent to the one considered as equally predictive. In practice, the district structure of the city tends to be irregular, which calls for a mechanism that can identify these zones and exploit this information to improve the forecast. Finally, external factors also have a profound impact on the use of vehicles. For instance, it is well known in the literature that weather conditions and the day of the week (workdays vs. weekends) affect displacement dynamics, especially for lightweight transport means like bikes. These considerations have steered the design of our proposal, STREED-Net (Spatio-Temporal REsidual Encoder-Decoder Network), a novel and effective deep learning network for flow prediction. The main contributions of this paper can be summarized as follows: 1. We propose a novel prediction architecture that includes two different attention blocks to acquire customized temporal and spatial information, i.e., able to adapt specifically to the city and the means of transport considered, identifying districts in the city. 2. Moreover, to the best of our knowledge, STREED-Net is the first autoencoder architecture that combines the use of convolutional blocks with residual connections, a series of Cascade Multiplicative Units (CMUs) and two different attention mechanisms. 3. Finally, this work presents a methodologically sound comparative performance assessment of various models from the literature on real-life datasets. The analyses presented consider different types of loss functions and KPIs. Results indicate that STREED-Net outperforms the considered state-of-the-art approaches. The rest of this paper is organized as follows. In section 2, the literature on techniques used in flow and traffic prediction is analyzed. Section 3 defines the flow prediction problem in urban areas, while in section 4 the core deep learning techniques exploited in this work and the proposed framework are described in detail. In section 5, data and results of experiments are presented and analyzed. Finally, the conclusions and recommendations for future work are discussed in section 6. Several studies have addressed the problem of predicting vehicle flows in urban environments. This problem was initially modelled as a time series prediction problem for each city area and approached through classical statistical methods at first, and Artificial Neural Networks (ANNs) (e.g., deep learning) later. In particular, different statistical methods have been applied, including autoregressive integrated moving average (ARIMA) [4], Kalman filtering [5], and their variants, as well as other classical approaches such as Bayesian networks [6], Markov chains [7], and Support Vector Regression (SVR) models [8].
Other approaches have used k-means clustering, principal component analysis, and self-organizing maps to mine spatio-temporal performance trends [9]. However, classical statistical models show some weaknesses when applied to the flow prediction problem, namely i) they are unable to capture the spatial dependencies between the various areas, because data for each region of the city are considered as independent time series, and ii) they fail to capture the nonlinear relationship between space and time, which is essential for reliable prediction. Further studies overcame these downsides by considering spatial relationships [10] and external factors (e.g., environment and weather conditions [11]) within traditional time-series prediction methods. ANNs have been exploited in flow prediction for their capability of capturing the non-linear spatial and temporal relationships within data. Initial works using ANNs followed two main approaches. The first one exploits variants of Recurrent Neural Networks (RNNs) [12], such as i) Long Short-Term Memory (LSTM) [13] and ii) Gated Recurrent Unit (GRU) [14], whose architectures can effectively capture both the long-term patterns and the short-term fluctuations of time series. The second research line applies models based on Convolutional Neural Networks (CNNs) to identify spatial dependencies in traffic networks, treating dynamic traffic data as a sequence of frames [15]. However, these neural networks in their standard configuration (derived from the image recognition field) can only identify either the spatial or the temporal patterns of traffic flow data. Spatial and temporal information are inherent in traffic data, making it essential to consider both aspects at the same time when predicting mobility dynamics. In this direction, deep learning-based approaches have recently been proposed that exploit architectures able to capture spatial and temporal patterns, including 2D convolutions and residual units [16, 17], 3D convolutions [18], 3D convolutions and LSTM [19], a combination of 2D and 3D convolutions [20], and autoencoder architectures with Cascade Multiplicative Units (CMUs) [21]. In recent years, with the development of graph convolutional networks [22], which can be used to capture the structural characteristics of a graph, we are witnessing their use in the field of traffic prediction. In [23] the authors propose DCRNN, a model that captures spatial characteristics through random walks on the graph and temporal features through an encoder-decoder architecture, while in [24] the authors apply the Temporal Graph Convolutional Network (T-GCN) model, which combines a graph convolutional network (GCN) with a gated recurrent unit (GRU). Finally, in [25] the authors propose a method for forecasting traffic flow based on dynamic graphs: the traffic network is modeled by dynamic probability graphs, graph convolution is performed on the dynamic graphs to learn spatial features, and the result is combined with LSTM units to learn temporal features. We provide a more detailed description of a selection of the approaches mentioned above in section 5.
Given a tessellation of the area of interest (henceforth referred to as city) into regularly-shaped regions, a set of historical observations regarding trajectories of vehicles within the city and, possibly, other spatial and non-spatial data sources for a reference time horizon T_H of H time points, the citywide vehicle flow prediction problem [18] is defined as the problem of minimizing the prediction error for vehicle Inflow and Outflow at time t, that is, the first time point after T_H. In the literature, there are several definitions of location/region with different granularity and different semantic meaning [26]. However, when it comes to traffic forecasting, the majority of works use a rectangular tessellation, which maximizes the number of neighboring areas. Similarly, in this study, the geographical space of interest (city) is logically partitioned into a regular grid of size N × M oriented by longitude and latitude [16]. Each element of the grid is termed region and is addressable through a pair of coordinates (n, m) corresponding to the n-th row and the m-th column of the grid. The term Inflow (Outflow, respectively) refers to the number of vehicles entering (leaving) a specific region in the considered time unit (Figure 1a) [18]. More specifically, the Inflow (Outflow) indicates the number of pedestrians, cars, public transport and sharing vehicles entering (leaving) the region in a certain time period. As shown in Figure 1b, by analyzing the movement data of the vehicles, it is possible to obtain the Inflow and Outflow matrices, which encompass the information about displacements between the areas of the city at each time t. More in detail, let τ_i = {s_i^1, s_i^2, ..., s_i^t} be a trajectory, where s_i^t represents the position of vehicle i at time t, and let T be a collection of trajectories. The Inflow (Outflow, respectively) of region (n, m) at time t is the number of trajectories in T that enter (leave) the region during time interval t, i.e., that move from a position outside the region to a position inside it (from inside the region to outside it, respectively). Finally, the state of the vehicular flow at time t can be represented by a tensor (also referred to as frame in what follows) F_t ∈ R^(N×M×C), where C indicates the number of flow variables considered in the analysis, in this specific case C = 2 (Inflow/Outflow), whereas N × M is the total number of regions in the city. Then, to take into account the temporal dependence over the time horizon T_H (divided into H time points), the flow representation is extended to a tensor of four dimensions F ∈ R^(H×N×M×C), which represents the main input to our problem. The problem at issue then becomes predicting F_t given a volume, that is, a sequence of past tensors V ⊂ F. It is worth noting that the resulting problem shows several similarities with the frame prediction problem [19], since the tensor F can be seen as a four-dimensional volume composed of H consecutive images, each featuring C channels.
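To make the problem formulation concrete, the following minimal Python sketch shows how an Inflow/Outflow frame F_t could be derived from trajectory data on a regular N × M grid. It is an illustrative reconstruction under simple assumptions (trajectories already mapped to grid cells, one frame per time interval); the function name flow_frame and the toy data are hypothetical and are not part of the datasets or code described in this paper.

import numpy as np

def flow_frame(trajectories, n_rows, n_cols):
    """Build one Inflow/Outflow frame F_t of shape (n_rows, n_cols, 2).

    `trajectories` is a list of vehicle trajectories observed during one
    time interval; each trajectory is a sequence of (row, col) grid cells.
    Channel 0 counts vehicles entering a region from a different region,
    channel 1 counts vehicles leaving a region towards a different region.
    """
    frame = np.zeros((n_rows, n_cols, 2), dtype=np.float32)
    for traj in trajectories:
        for (r_prev, c_prev), (r_cur, c_cur) in zip(traj[:-1], traj[1:]):
            if (r_prev, c_prev) != (r_cur, c_cur):
                frame[r_cur, c_cur, 0] += 1.0    # Inflow of the destination region
                frame[r_prev, c_prev, 1] += 1.0  # Outflow of the origin region
    return frame

# Stacking H consecutive frames yields the 4-D input tensor F in R^(H x N x M x 2).
example = [[(0, 0), (0, 1), (1, 1)], [(1, 1), (1, 1)]]  # two toy trajectories
F_t = flow_frame(example, n_rows=16, n_cols=8)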
STREED-Net is an Autoencoder-based deep learning model that combines convolutions and CMUs with two different types of attention (spatial and temporal). This section presents STREED-Net, detailing its components and relationships, prefacing it with a brief introduction to the main underpinning concepts, namely the autoencoder architecture and the attention mechanism. Autoencoder architecture. Given a set of unlabeled training examples {x_1, x_2, x_3, ...}, where x_i ∈ R^n, an autoencoder neural network is an unsupervised learning algorithm that applies backpropagation setting the target values to be equal to the inputs, y_i = x_i. It is a neural network that is trained to learn a function h_{W,b}(x) = x̂ ≈ x, where W and b are the weights and biases of the ANN, respectively. In other words, an autoencoder is a learned approximation of the identity function, producing an output x̂ that is as similar as possible to x. The overall network can be decomposed into two parts: an encoder function h = f(x), which maps the input vector space onto an internal representation, and a decoder that transforms it back, that is, x̂ = g(h). This type of architecture has been applied successfully to different difficult tasks, including traffic prediction [21]. Attention mechanism. In Deep Neural Networks (DNNs), the attention mechanism helps focus on important features of the input, shadowing the others. This paradigm is inspired by the human neurovisual system, which quickly scans images and identifies sub-areas of interest, optimizing the usage of the limited attention resources [27]. Similarly, the attention mechanism in DNNs determines and stresses the most informative features in the input data, that is, those likely to be most valuable to the current activity. Recently, attention has been widely applied to different areas of deep learning, such as natural language processing [28], image recognition [29], image captioning [30], image generation [31] and traffic prediction [32]. The encoder structure depicted in Figure 3 is the first block of the STREED-Net architecture. It is composed of an initial convolutional layer, a series of residual units, and a final convolutional layer. Unlike similar approaches (e.g., STAR [17]), the proposed encoder structure introduces three novel aspects: i) each layer is time-distributed, meaning that the model learns from a sequence of frames (for time coherence) instead of focusing on each frame singularly; ii) it applies further convolutions after the residual units, so as to reduce the frame size; and iii) it applies Batch Normalization (BN) after each convolution to avoid vanishing/exploding gradient problems and achieve faster, more efficient optimization [33, 34]. Unlike other works from the literature, where distant temporal information is also used (from the previous day and the previous week), the encoder takes as input a four-dimensional tensor F ∈ R^(H×N×M×2). This tensor is a sequence of consecutive three-dimensional frames conveying flow information of nearby periods (with regard to the prediction time t). Such a tensor (also referred to as closeness in the literature [18]) is obtained by selecting the p time points preceding the prediction time t, i.e., the sequence [F_(t-p), ..., F_(t-2), F_(t-1)]. In this way, STREED-Net can focus on the most recent dynamics only. Each frame in F is processed by the convolutional layer to extrapolate spatial information. It is worth noting that in Figure 2 the encoder is represented as a collection of identical blocks in parallel execution on the input frames instead of (as in reality) a single convolution applied sequentially. Such a representation is used to highlight that a time-distributed layer is trained by taking into account all input frames simultaneously. The use of this approach leads the model to identify temporal (that is, inter-frame) dynamics, rather than looking only at spatial dependencies within each frame. Each convolutional layer is followed by a Rectified Linear Unit (ReLU) activation function and a BN layer.
Formally, we have:

E_t^(0) = BN(ReLU(W_e * F_t + b_e)),

where E_t^(0) corresponds to the output of the first convolutional layer, F_t is one of the p frames in input to the model, * is the convolution operator, and W_e and b_e are the weights and biases of the respective convolutional operation. Next, L encoder blocks (see Figure 3) are placed. Each of these blocks is composed of a residual unit followed by a downsampling layer:

E_t^(l) = ds(ResUnit(E_t^(l-1))), for l = 1, ..., L.

As far as the downsampling is concerned, it has been implemented as a strided convolution:

ds(X) = BN(ReLU(W_ds * X + b_ds)),

where the convolution in ds has kernel size and stride parameters set so as to halve the height and width of the input frame. The rationale behind the design of this architecture is threefold: i) a deep structure is needed for the model to grasp dependencies not only among neighboring regions but also among distant areas; ii) deep networks are difficult to train, as they suffer both from exploding or vanishing gradients and from a greater tendency to overfit due to the large number of parameters; to mitigate these obstacles and make training more efficient, we introduced residual units; finally, iii) the downsampling layers were introduced to ensure translational equivariance [35]. Finally, the encoder structure ends with a closing convolution-ReLU-BN sequence,

E_t^(L+1) = BN(ReLU(W_c * E_t^(L) + b_c)),

whose main objective is to reduce the number of feature maps. In this way, the next architectural component (i.e., the Cascading Hierarchical Block) will receive and process a smaller input, reducing the computational cost of the CMU array. The output of the encoder is a tensor E^(L+1) ∈ R^(H × N/2^L × M/2^L × C'), where C' is the number of feature maps generated by the last convolution of the encoder.
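As an illustration of the encoder just described, the following is a minimal PyTorch sketch of a time-distributed convolution, a residual unit, and a stride-2 downsampling block. It is a sketch under stated assumptions, not the authors' implementation (which is released separately): the class names, the 64-filter width, the channel-first (B, H, C, N, M) layout and the exact placement of BN/ReLU are illustrative choices.

import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Two 3x3 convolutions with BN/ReLU and an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

class EncoderBlock(nn.Module):
    """Residual unit followed by a stride-2 convolution that halves H and W."""
    def __init__(self, channels):
        super().__init__()
        self.res = ResidualUnit(channels)
        self.down = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
        )
    def forward(self, x):
        return self.down(self.res(x))

def time_distributed(layer, x):
    """Apply a 2-D layer independently to each of the H frames in (B, H, C, N, M)."""
    b, h, c, n, m = x.shape
    y = layer(x.reshape(b * h, c, n, m))
    return y.reshape(b, h, *y.shape[1:])

# Toy usage: 4 closeness frames of a 16x8 grid with 2 flow channels, lifted to 64 maps.
x = torch.randn(1, 4, 2, 16, 8)
stem = nn.Sequential(nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(64))
e0 = time_distributed(stem, x)               # (1, 4, 64, 16, 8)
e1 = time_distributed(EncoderBlock(64), e0)  # (1, 4, 64, 8, 4)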
A connection section between the encoder and the decoder is provided to handle the temporal relationships among the frames. Unlike other works that combine CNNs with RNNs such as LSTM [36], STREED-Net implements a Cascading Hierarchical Block with Cascade Multiplicative Units (CMUs) [21], which computes the hidden representation of the current state directly from the input frames of both the previous and the current time steps, rather than modeling the temporal dependency through a transition from the previous state to the current state as recurrent networks do. This solution is designed to explicitly model the dependency between different time points by conditioning the current state on the previous state, improving the model accuracy; incidentally, it also reduces training times. The fundamental constituent of the CMU architecture is the Multiplicative Unit (MU) [37], a non-recurrent convolutional structure whose neuron connectivity, except for the lack of residual connections, is quite similar to that of LSTM [38]; the output, however, only depends on the single input frame h. Formally, MU is defined by the following equation set:

g_1 = σ(W_1 * h + b_1)
g_2 = σ(W_2 * h + b_2)
g_3 = σ(W_3 * h + b_3)
u = tanh(W_4 * h + b_4)
MU(h; W) = g_1 ⊙ tanh(g_2 ⊙ h + g_3 ⊙ u),

where σ is the sigmoid activation function, * the convolution operator and ⊙ the element-wise multiplication operator. W_1 ~ W_4 and b_1 ~ b_4 are the weights and biases of the respective convolutional gates, and W denotes all MU parameters. CMU incorporates three MUs. Unlike MU, CMU accepts two consecutive frames as input to explicitly model the temporal dependencies between them. The more recent frame is fed to one MU to capture the spatial information of the current representation. The older frame is instead processed by two MUs in sequence to overcome the time gap. The partial outputs are then added together and finally, thanks to two gated structures containing convolutions along with non-linear activation functions, the output of the CMU (X_t^(l+1)) is generated. CMU is described by the following equations:

h_left = MU(MU(X_(t-1)^(l); W_1); W_1)
h_right = MU(X_t^(l); W_2)
h = h_left + h_right
o = σ(W_o * h + b_o)
X_t^(l+1) = o ⊙ tanh(W_h * h + b_h),

where W_1 and W_2 are the parameters of the MUs in the left branch and of the MU in the right branch, respectively, and W_o, W_h, b_o and b_h are the weights and biases of the corresponding convolutional gates. The cascading hierarchical block uses CMUs to process all frames at the same time (see Figure 4), combining them pairwise in a hierarchical fashion until a single representation X_cmu ∈ R^(N/2^L × M/2^L × C') is obtained.
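Before moving on to the external-factor branch and the decoder, the following minimal PyTorch sketch illustrates the MU and CMU described above. It is an illustrative reconstruction, not the original implementation: the 3×3 kernels, the weight sharing between the two MUs of the older branch, and the class names are assumptions.

import torch
import torch.nn as nn

class MU(nn.Module):
    """Multiplicative Unit: a non-recurrent convolutional gate over a single frame."""
    def __init__(self, channels, k=3):
        super().__init__()
        p = k // 2
        self.g1 = nn.Conv2d(channels, channels, k, padding=p)
        self.g2 = nn.Conv2d(channels, channels, k, padding=p)
        self.g3 = nn.Conv2d(channels, channels, k, padding=p)
        self.u = nn.Conv2d(channels, channels, k, padding=p)
    def forward(self, h):
        g1, g2, g3 = torch.sigmoid(self.g1(h)), torch.sigmoid(self.g2(h)), torch.sigmoid(self.g3(h))
        u = torch.tanh(self.u(h))
        return g1 * torch.tanh(g2 * h + g3 * u)

class CMU(nn.Module):
    """Cascade Multiplicative Unit: fuses an older and a newer frame."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.old1, self.old2 = MU(channels, k), MU(channels, k)  # two MUs on the older frame
        self.new = MU(channels, k)                               # one MU on the newer frame
        self.out_gate = nn.Conv2d(channels, channels, k, padding=k // 2)
        self.out_val = nn.Conv2d(channels, channels, k, padding=k // 2)
    def forward(self, older, newer):
        h = self.old2(self.old1(older)) + self.new(newer)
        return torch.sigmoid(self.out_gate(h)) * torch.tanh(self.out_val(h))

# Toy usage on two consecutive encoded frames of shape (B, C, N', M').
f_prev, f_cur = torch.randn(1, 16, 4, 2), torch.randn(1, 16, 4, 2)
fused = CMU(16)(f_prev, f_cur)  # (1, 16, 4, 2)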
To integrate external information, such as the day of the week, holidays, and weather conditions, STREED-Net features a specific input branch. Its input is a one-dimensional vector containing information referring to the prediction time t. Through two stacked fully connected layers, this information is encoded and conveyed into the mainstream of the network: the first layer is used to embed each sub-factor, while the second reshapes the external-factor embedding to match the size of the CMU output. The decoder is the last component of STREED-Net and its task is to generate the flow prediction starting from the latent representation that corresponds to the output of the cascading hierarchical block. As shown in Figure 5, the decoder takes as input a tensor z = X_cmu + X_ext, with z ∈ R^(N/2^L × M/2^L × C'), which is the sum of the outputs of the hierarchical structure and of the network dedicated to incorporating external factors. X_ext is added at this point of the network to allow the model to exploit external information during the decoding phase. The decoder architecture features a structure that is somewhat symmetrical to that of the encoder, with an array of residual units preceded and followed by a convolutional layer. Nevertheless, this symmetry is breached by two significant differences. The first one is the presence of a long skip connection before every residual unit. The long skip connection is used to improve the accuracy and to recover fine-grained details from the encoder; another benefit is a significant speed-up in model convergence [39]. A generic decoding block D^(l), for l ∈ {1, ..., L}, can be formally defined as the sequential application of the following three operations:

u^(l) = Conv2DTranspose(D^(l-1))
sc^(l) = u^(l) + ERU_1^(L+1-l)
D^(l) = ResUnit(sc^(l)),

with D^(0) = z, where D^(l-1) corresponds to the input of the block, Conv2DTranspose indicates the transposed convolution operation (also known as deconvolution), which doubles the height and width of the input, and sc^(l) (skip connection) is the sum of u^(l) with ERU_1^(L+1-l), i.e., the output of the encoder residual unit at level L + 1 - l for the most recent frame. The residual units of the decoder are structured exactly like those of the encoder. The second difference is the presence of two attention blocks (viz., channel and spatial attention) before the final convolutional layer. More details are provided in the following subsections. After the convolutional stage of the decoder, a three-dimensional tensor D^(L) ∈ R^(N×M×C') is obtained, with channel size C'. Since the channel dimension also includes the temporal aspects compressed by the cascading hierarchical block, channel attention has been introduced to identify and emphasize the most valuable channels. Figure 6 shows the structure of the channel attention block. Given the input tensor D^(L), a channel attention map A_c ∈ R^(1×1×C') is created by applying reduction operations along the channel dimension. More in detail, through global average pooling and global max pooling performed simultaneously, two different feature maps (X_max and X_avg) of size 1 × 1 × C' each are produced. They go through two fully connected layers that allow the model to learn (and assess) the importance of each channel. The first layer performs a dimensionality reduction, downsizing the input feature maps to 1 × 1 × C'/s based on the choice of the reduction ratio s; the second layer restores the feature maps to their original size. This approach has proven to increase the model efficiency without accuracy reduction [32]. Once these two steps have been completed, the two resulting feature maps are combined into a single tensor through a weighted summation:

A_c = σ(Λ_1 ⊙ X'_avg + Γ_1 ⊙ X'_max),

where X'_avg and X'_max are the outputs of the fully connected layers, Λ_1 and Γ_1 are two trainable tensors with the same size as the two feature maps, and σ is the activation function. Λ_1 and Γ_1 are learned during the training phase and weight the relative importance of each element of the two feature maps. Finally, the process of computing channel attention can be summarized as:

D' = A_c ⊗ D^(L),

where D' is the operation output and ⊗ denotes the element-wise multiplication (with A_c broadcast over the spatial dimensions). Cities are made up of a multitude of different functional areas. Areas have different vehicle concentrations and mobility patterns; thus, the spatial attention mechanism has the task of identifying where the most significant areas are located and of scaling their contribution to improve the prediction. The spatial attention map A_s ∈ R^(N×M×1) can be calculated by applying pooling operations along the channel axis to highlight informative regions [40]. Therefore, first the global average pooling and global max pooling operations are applied along the channel axis and, as in the channel attention block, two distinct feature maps of size N × M × 1 are obtained. Subsequently, each feature map passes through a convolutional layer with a filter size of 4 × 4, unlike what is done in other approaches from the literature [41], where the filter is set to 7 × 7. It is worth noting that the filter size depends on the size of the areas that make up the city; for the case studies addressed in this work (see Section 5), which feature rather large regions, the proposed model does not need to focus on large area clusters. As for the channel attention block, the two feature maps are combined into a single map through a weighted summation followed by the sigmoid activation function:

A_s = σ(Λ_2 ⊙ Y_avg + Γ_2 ⊙ Y_max),

where Y_avg and Y_max are the convolved feature maps and Λ_2 and Γ_2 are trainable tensors analogous to Λ_1 and Γ_1. Finally, the process of computing spatial attention can be summarized as:

D'' = A_s ⊗ D',

where D'' is the output of the spatial attention block, which is then fed to the final convolutional layer to produce the prediction.
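The two attention blocks can be sketched as follows in PyTorch. This is an illustrative reconstruction consistent with the description above, not the released code: the shared bottleneck MLP, the default reduction ratio, the handling of the even 4 × 4 kernel (the sketch resizes the attention map back to the input size), and the sequential application of channel and spatial attention are assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global avg/max pooling, a bottleneck MLP, weighted fusion."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        # Trainable weights balancing the avg- and max-pooled branches.
        self.lam = nn.Parameter(torch.ones(channels))
        self.gam = nn.Parameter(torch.ones(channels))
    def forward(self, x):                       # x: (B, C, N, M)
        avg = self.mlp(x.mean(dim=(2, 3)))      # (B, C)
        mx = self.mlp(x.amax(dim=(2, 3)))       # (B, C)
        a = torch.sigmoid(self.lam * avg + self.gam * mx)
        return x * a[:, :, None, None]          # broadcast over the spatial grid

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise avg/max maps, small convolutions, weighted fusion."""
    def __init__(self, kernel_size=4):
        super().__init__()
        pad = kernel_size // 2
        self.conv_avg = nn.Conv2d(1, 1, kernel_size, padding=pad)
        self.conv_max = nn.Conv2d(1, 1, kernel_size, padding=pad)
        self.lam = nn.Parameter(torch.tensor(1.0))
        self.gam = nn.Parameter(torch.tensor(1.0))
    def forward(self, x):                       # x: (B, C, N, M)
        avg = self.conv_avg(x.mean(dim=1, keepdim=True))
        mx = self.conv_max(x.amax(dim=1, keepdim=True))
        a = torch.sigmoid(self.lam * avg + self.gam * mx)
        # An even kernel shifts the spatial size by one; resize the map back to (N, M).
        a = nn.functional.interpolate(a, size=x.shape[2:], mode="nearest")
        return x * a

feat = torch.randn(1, 16, 16, 8)
out = SpatialAttention()(ChannelAttention(16)(feat))  # same shape as feat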
This section reports on an extensive experimental evaluation of the proposed model by comparing it against several reference models (see Section 5.1) using three different performance metrics and three different case studies (detailed in Section 5.2). An ablation study and an analysis of computational complexity complete the section. The proposed model is compared against the following state-of-the-art methods, expressly devised to solve the citywide vehicle flow prediction problem [18]: ST-ResNet [16]: it is one of the first deep learning approaches to traffic prediction. It predicts the flow of crowds into and out of each individual region of the city. ST-ResNet uses three residual networks that model the temporal aspects of closeness, period, and trend separately. MST3D [18]: this model is architecturally similar to ST-ResNet. The three time dependencies and the external factors are independently modeled and dynamically merged by assigning different weights to different branches to obtain the final forecast. Differently from ST-ResNet, MST3D learns to identify spatio-temporal correlations using 3D convolutions. ST-3DNet [20]: the network uses two distinct branches to model the temporal components of closeness and trend, while the daily period is left out. Both branches start with a series of 3D convolutional layers used to capture the spatio-temporal dependencies among the input frames. In the closeness branch, the output of the last convolutional layer is linked to a sequence of residual units to further investigate the spatial dependencies between the frames of the closeness period. The most innovative architectural element is the Recalibration Block, inserted at the end of each of the two main branches to explicitly model the contribution that each region makes to the prediction. 3D-CLoST [19]: the model uses sequential 3D convolutions to capture spatio-temporal dependencies. Afterwards, a fully connected layer encloses the information learned in a one-dimensional vector that is finally passed to an LSTM block. LSTM layers in sequence allow the model to capture the temporal dependencies of the input. The output of the LSTM section is added to the output produced by the network for external features. The result is multiplied by a mask, which allows the user to introduce domain knowledge: the mask is a matrix with null values in correspondence with the regions of the city that never have Inflow or Outflow values greater than zero (such areas may or may not exist depending on the conformation of the city), while it contains 1 in all other locations. STAR [17]: this approach aims to model temporal dependencies by extracting representative frames of closeness, period and trend. However, unlike other solutions, the structure of the model consists of a single branch: the frames selected for the prediction are concatenated along the channel axis to form the main input to the network. In STAR as well, there is a sub-network dedicated to external factors, and the output it generates is immediately added to the main network input. Residual learning is used to train the deep network to derive the detailed outcome for the expected scenarios throughout the city. PredCNN [21]: this network builds on the core idea of recurrent models, where previous states in the network have more transition operations than future states. PredCNN employs an autoencoder with CMUs, which proved to be a valid alternative to RNNs. Unlike the models discussed above, this approach considers only the temporal component of closeness but has a relatively complex architecture. The key idea of PredCNN is to sequentially capture spatial and temporal dependencies using CMU blocks. Historical Average (HA): the algorithm generates Inflow and Outflow forecasts by averaging the values observed on the same day of the week and at the same time of day as the instant to be predicted. This classical method represents a baseline in our comparative analysis, as it has not been developed specifically for the flow prediction problem. Excluding MST3D, which has been entirely reimplemented strictly following the indications of the original paper, and PredCNN, whose original code has been completed with some missing parts, for all the other models the implementation released by the original authors has been used.
The STREED-Net code, together with all the code realized for this research work, is freely available 1. We conclude this section by pointing out that, although the literature offers numerous proposals for deep learning models based on graphs, with performances often superior to those of convolutional models, for the problem addressed in this paper the preliminary experiments that we conducted with graph-based models did not lead to satisfactory results. The main reason for this apparent contradiction resides in the nature of the problem considered, whose basic assumption is to be able to observe only the inflow and outflow across all areas of the city. Such a scenario is feasible and more realistic than one in which the trajectory or origin-destination pair of all vehicles is known, but it makes it impossible to create graphs with nontrivial connections (i.e., not between adjacent areas) for the problem under consideration. Graph-based models under such conditions have not proven to be sufficiently accurate for the case studies considered. Three real-life case studies are considered for the experimental analysis, which differ in both the city considered (New York and Beijing) and the type of vehicle considered (bicycle and taxi). This choice allows the models to be assessed on usage patterns that are expected to be significantly distinct. A brief description of the considered case studies follows. BikeNYC. In this first case study the behavior of bicycles in New York City is analyzed. The data has been collected by the NYC Bike system in 2014, from April 1 to September 30. Records from the last 10 days form the testing data set, while the rest is used for training. The length of each time period is 1 hour. TaxiBJ. The second case study concerns taxi flows in the city of Beijing; the corresponding data set is available via [16]. TaxiNYC. Finally, a data set containing data from a fleet of taxicabs in New York is considered. Data have been collected from January 1, 2009 to December 31, 2014. The last four weeks are test data and the others are used for training purposes. The length of each time period is set to one hour. This case study has been specifically created to perform a more thorough and sound experimental assessment than those presented in the literature. The city of New York has been tessellated into 16 × 8 regions, while the city of Beijing has been divided into 32 × 32 areas; the discrepancy in the number of regions considered is due to the large difference in extension between the two cities: the Beijing area (16,800 km2) is 22 times bigger than the New York area (781 km2). The Beijing taxi data set (TaxiBJ) and the New York bike data set (BikeNYC) are available via [16]; they are already structured to carry out the experiments reported in this work. As for the TaxiNYC dataset, available for experiments on GitHub 2, it has been expressly built for this work by processing and structuring data available from the NYC government website 3. A Min-Max normalization has been applied to all data sets to scale traffic values to the range [-1, 1]. Note, however, that in the experiments a denormalization is applied to the predicted values before they are used in the evaluation. In the three experiments, public holidays, metadata (i.e., DayOfWeek, Weekday/Weekend) and weather have been considered as external factors. Specifically, the meteorological information reports the temperature, the wind speed, and the specific atmospheric situation (viz., sun, rain and snow).
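For completeness, the following small Python sketch shows the kind of Min-Max normalization to [-1, 1] and the corresponding denormalization applied to predictions before evaluation. The class name and the toy data are illustrative; the released preprocessing code may differ in details.

import numpy as np

class MinMaxScaler:
    """Scales flow volumes to [-1, 1] and maps predictions back before evaluation."""
    def fit(self, data):
        self.lo, self.hi = float(data.min()), float(data.max())
        return self
    def transform(self, data):
        return 2.0 * (data - self.lo) / (self.hi - self.lo) - 1.0
    def inverse_transform(self, scaled):
        return (scaled + 1.0) * (self.hi - self.lo) / 2.0 + self.lo

train = np.random.randint(0, 300, size=(100, 16, 8, 2)).astype(np.float32)
scaler = MinMaxScaler().fit(train)          # fit on training flows only
x = scaler.transform(train)                 # values now in [-1, 1]
flows_back = scaler.inverse_transform(x)    # denormalize before computing RMSE/MAPE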
This section presents and discusses the results of experiments performed by running STREED-Net and the models presented in Section 5.1 on the three case studies. Three different evaluation metrics are used in this study to compare the results obtained: Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE) and Absolute Percentage Error (APE). RMSE and MAPE are defined as follows:

RMSE = sqrt( (1 / (2·N·M)) · Σ_(n,m) [ (ι_(n,m) − ι̂_(n,m))² + (ω_(n,m) − ω̂_(n,m))² ] )

MAPE = (100 / (2·N·M)) · Σ_(n,m) [ |ι_(n,m) − ι̂_(n,m)| / ι_(n,m) + |ω_(n,m) − ω̂_(n,m)| / ω_(n,m) ],

where ι_(n,m) and ω_(n,m) are the observed Inflow and Outflow for region (n, m) at time t, ι̂_(n,m) and ω̂_(n,m) are, respectively, the predicted Inflow and Outflow for the same region and time, and N × M is the total number of regions in the city; APE is computed analogously to MAPE, but accumulating rather than averaging the absolute percentage errors over the regions. Finally, it is worth noting that, to account for and reduce the inherent stochasticity of learning-based models, each experiment was repeated ten times (replicas) using a different random seed in each replica. Mean and standard deviation are reported for each metric to provide a robust indication of the overall behavior of the compared methods. For the BikeNYC case study, the STREED-Net parameters have been set as follows. The number n of input frames has been set to 4 and the number L of encoding and decoding blocks has been set to 2. This decision has been dictated by the size of the grid (16 × 8): setting L greater than 2 (for example, 3) would result in an encoder output tensor of size 4 × 2 × 1 × C', which would be too small to allow the CMU block to effectively capture the time dependencies in the section located between the encoder and the decoder. After some preliminary tests, the number of convolutional filters has been set to 64 in the first layer of the encoder and in the subsequent blocks, while in the last layer it has been set to 16. In this way, the dimensionality of the input goes from I ∈ R^(4×16×8×2) to O ∈ R^(4×4×2×16) at the encoder output. Symmetrically, the convolutions within the decoder use 64 filters, except for the final layer, which uses only 2 filters to generate the prediction of the Inflow and Outflow channels. The kernel size, the batch size and the learning rate have been optimized with the Bayesian optimization technique; the resulting values are a kernel size of 3, a batch size of 16 and a learning rate of 0.0001, while the number of epochs is set to 150. As for the models from the literature, they have been arranged and trained carefully following the parameter values and indications reported in the respective publications. As shown in Table 1, STREED-Net outperforms all other considered approaches for all evaluation metrics. In addition, the small standard deviation values are evidence of the robustness of the proposed approach. Nonetheless, it is worth observing that all learning-based approaches return similar results. We believe this is mainly due to the reduced size of the data set, which does not allow the models to be adequately trained. Moreover, the tessellation used in this case study (widely used in the literature), with a small grid of dimensions 16 × 8, tends to level off the metrics and hinder a more precise performance assessment. As with the experiment discussed above, for the TaxiBJ case study the parameters of the models have been set according to the specifications given in the respective publications. In the case of STREED-Net, the hyperparameters are kept unchanged with respect to the previous experiment, except for the number L of encoding and decoding blocks, which has been increased to 3 because the grid is larger (32 × 32) in this experiment and more convolutional layers are needed to map the input tensor of the model.
Also for this experiment, the kernel size, batch size, and learning rate parameters have been optimized with Bayesian optimization, and the best values found were 3, 16, and 0.0001, respectively. The number of epochs has been set to 150. Notice that these values are the same used in the BikeNYC experiment. As can be seen from Table 2, STREED-Net outperforms all other methods, reducing RMSE, MAPE, and APE by 6%, 4.4%, and 1.5%, respectively, compared to the second-best approach. The difference in performance in favor of the proposed model is more appreciable in this experiment because the data set used for the training process is larger, but also because the number of regions is higher. This last consideration highlights how the proposed model appears suitable for real-world scenarios, i.e., where high model accuracy and dense tessellation are required (i.e., the city is partitioned into a large number of small regions). As mentioned earlier, the TaxiNYC case study was created specifically to evaluate the behavior of the proposed model on a wider set of scenarios than those usually considered in the literature. Consequently, in order to make a fair comparison, it was necessary to search for the best configuration of hyperparameters not only for the STREED-Net model but also for all the other approaches considered. The optimized parameters and the corresponding values used in the training phase have been determined separately for each model; unreported configuration values have been set as for the BikeNYC case study, since the two experiments share the same map size (16 × 8). It is worth noting that preliminary experiments showed a convergence issue in the training phase of both the STAR and ST-ResNet models. In particular, they were unable to converge for any combination of parameters. This behavior is due to the strong presence of outliers and to the concentration of the relevant Inflow and Outflow values in a few central regions of the city. To overcome this issue, Batch Normalization layers have been inserted in the structure of the two models. In particular, Batch Normalization layers have been added after each convolution present in the residual units (a possibility already foreseen in the original implementations) and after the terminal convolution of the networks (an option not considered in the source code provided by the original authors). For this reason, ST-ResNet and STAR are marked with an asterisk (ST-ResNet* and STAR*) in Table 3, which summarizes the experimental results. As can be seen from Table 3, STREED-Net achieves excellent results in this experiment as well, ranking as the best model for two out of three evaluation metrics. In particular, as far as the RMSE is concerned, the performance obtained is very close to the best one (achieved by ST-ResNet*, which is considerably different from the original ST-ResNet), while the MAPE and APE values position it as the best model. In this section, an ablation study conducted on STREED-Net is presented, in which variations in the input structure and in the network architecture are analyzed.
For reasons of space, the study refers only to the BikeNYC case study and does not involve the full combination of all possible variants of the proposed model; rather, it aims to assess the impact on the performance metrics of some parameters (namely, the number of input time points n) and of specific architectural choices (viz., long skip connections, attention blocks, and the external factors input branch), while keeping all other conditions fixed. More precisely, in what follows STREED-Net is compared against the 5 different variations described below: • STREED-Net N3. Same architecture as STREED-Net, but input volumes with 3 frames ([F_(t-3), F_(t-2), F_(t-1)]). • STREED-Net N5. Same architecture as STREED-Net, but input volumes with 5 frames ([F_(t-5), F_(t-4), F_(t-3), F_(t-2), F_(t-1)]). • STREED-Net NoLSC. STREED-Net with the long skip connections between encoder and decoder removed. • STREED-Net NoAtt. STREED-Net without the attention blocks. • STREED-Net NoExt. STREED-Net without the external factors. Notice that the study does not consider the variations with n = 1 and n = 2, as such values would not allow the network to capture meaningful temporal patterns between traffic flows. Table 4 reports the results of the ablation study. Each data point in the table has been obtained by performing the training procedure 10 times for each model variation, changing the random seed each time, and evaluating the resulting network on the test set; the mean and standard deviation are reported. The results show that, regarding the time horizon, n = 4 allows the model to obtain better results for the BikeNYC case study. This means that, considering the particular setup, for the city of New York 4 hours of data allow the model to predict the dynamics of bicycle mobility more accurately, whereas considering a greater amount of information (n = 5) would reduce the accuracy of the network. It is plausible to believe that considering a larger number of temporal instants would lead the network to grow in the number of parameters to be trained and thus require a larger amount of data to identify possible longer-term patterns. From the architectural point of view, the two components, attention blocks and long skip connections, confirm their importance in improving the performance of the proposed model: their removal accounts for a 2.36% and 3.64% increase in RMSE, respectively. In particular, as regards the attention blocks, not only does STREED-Net reach lower average error values, but the standard deviation is also reduced, proving that the attention blocks are effective in helping the network single out the most meaningful information and in making the training process more stable. Finally, the experiment also shows a strong impact of the long skip connection mechanism, which, as illustrated in subsection 4.5, connects the encoder to the decoder to convey fine-grained details through the network. A brief analysis of the number of trainable parameters and of the computational complexity (measured in number of FLOPs) of each model for the different case studies is reported in this section. As far as the number of trainable parameters is concerned, as shown in Table 5, STREED-Net has a generally low number compared to the other models, as only STAR features fewer parameters to train. Such a reduced number of parameters is due to the fact that the dimensionality of the input is reduced by the encoder downsampling mechanism. The model with the highest number of parameters is 3D-CLoST, which uses both 3D convolutions and LSTM.
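As a side note, trainable-parameter counts such as those compared above can be reproduced for any model with a few lines of code; the following PyTorch sketch is illustrative only (the toy network is a stand-in, not one of the compared models), and FLOPs estimates would additionally require a profiling tool.

import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    """Number of trainable parameters of a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example with a small stand-in network.
toy = nn.Sequential(nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 2, 3, padding=1))
print(count_trainable_parameters(toy))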
Finally, Table 6 provides the computational complexity of each model, in terms of floating point operations (FLOPs) as in [42], for each case study. As can be seen from the results obtained, the model with the highest computational complexity is PredCNN, which is based on CMUs, while 3D-CLoST is the model with the shortest forward and backward times. STREED-Net, instead, has a middle-range computational complexity compared to the other models, despite its autoencoder structure, the use of attention blocks, and CMUs. This occurs because, although the number of network parameters is small, the network employs high-complexity operators. However, the training and execution times of STREED-Net are compatible with its applicability in full-scale real-world scenarios. Predicting vehicular flow is one of the central topics in the domain of intelligent mobility. It is a challenging task, influenced by several complex factors, such as spatio-temporal dependencies and external factors. In this study, we have developed a new deep learning architecture, based on convolutions and CMUs, to forecast the Inflow and Outflow in each region of the smart city. A comprehensive experimental campaign has been conducted on three different real-world case studies. The results show that STREED-Net consistently outperforms state-of-the-art models in predicting mobility dynamics across the experiments conducted, on the three performance metrics considered. This work also reports and analyzes the results of an ablation study and of a complexity analysis. As for future developments, the possible integration of other external factors, such as the territorial characteristics of each geographical area, should be tested. Moreover, it would be appropriate to increase the granularity of the city tessellation, as well as to conduct transfer learning experiments to study the applicability of the proposed model to scenarios with a reduced amount of data available.
[1] Urban computing: concepts, methodologies, and applications
[3] Spatio-temporal ensemble method for car-hailing demand prediction
[4] ARIMA model for network traffic prediction and anomaly detection
[5] Adaptive Kalman filter approach for stochastic short-term traffic flow rate prediction and uncertainty quantification
[6] A Bayesian network approach to traffic flow forecasting
[7] A hidden Markov model for short term prediction of traffic conditions on freeways
[8] Travel-time prediction with support vector regression
[9] Spatiotemporal patterns in large-scale traffic speed prediction
[10] The simpler the better: A unified approach to predicting original taxi demands based on large-scale online platforms
[11] Spatial variation of the urban taxi ridership using GPS data
[12] A long short-term memory recurrent neural network framework for network traffic matrix prediction
[13] Deep learning: A generic approach for extreme condition traffic forecasting
[14] Learning phrase representations using RNN encoder-decoder for statistical machine translation
[15] Learning traffic as images: A deep convolutional neural network for large-scale transportation network speed prediction
[16] Predicting citywide crowd flows using deep spatio-temporal residual networks
[17] STAR: A concise deep learning framework for citywide human mobility prediction
[18] Exploiting spatio-temporal correlations with multiple 3D convolutional neural networks for citywide vehicle flow prediction
[19] 3D-CLoST: A CNN-LSTM approach for mobility dynamics prediction in smart cities
[20] Deep spatial-temporal 3D convolutional neural networks for traffic data forecasting
[21] PredCNN: Predictive learning with cascade convolutions
[22] Semi-supervised classification with graph convolutional networks
[23] Diffusion convolutional recurrent neural network: Data-driven traffic forecasting
[24] T-GCN: A temporal graph convolutional network for traffic prediction
[25] Dynamic graph convolutional network for long-term traffic flow prediction with reinforcement learning
[26] Encyclopedia of Geographic Information Science
[27] Mechanisms of visual attention in the human cortex
[28] Neural machine translation by jointly learning to align and translate
[29] Learning multi-attention convolutional neural network for fine-grained image recognition
[30] Show, attend and tell: Neural image caption generation with visual attention
[31] DRAW: A recurrent neural network for image generation
[32] Attention-based deep ensemble net for large-scale online taxi-hailing demand prediction
[33] Batch normalization: Accelerating deep network training by reducing internal covariate shift
[34] How does batch normalization help optimization?
[35] Deep Learning
[36] City-wide traffic congestion prediction based on CNN, LSTM and transpose CNN
[37] International Conference on Machine Learning
[38] Long short-term memory
[39] Inception-v4, Inception-ResNet and the impact of residual connections on learning
[40] Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer
[41] CBAM: Convolutional block attention module
[42] Benchmark analysis of representative deep neural network architectures