title: ETA Prediction with Graph Neural Networks in Google Maps
authors: Derrow-Pinion, Austin; She, Jennifer; Wong, David; Lange, Oliver; Hester, Todd; Perez, Luis; Nunkesser, Marc; Lee, Seongjae; Guo, Xueying; Wiltshire, Brett; Battaglia, Peter W.; Gupta, Vishal; Li, Ang; Xu, Zhongwen; Sanchez-Gonzalez, Alvaro; Li, Yujia; Veličković, Petar
date: 2021-08-25
DOI: 10.1145/3459637.3481916

Travel-time prediction constitutes a task of high importance in transportation networks, with web mapping services like Google Maps regularly serving vast quantities of travel time queries from users and enterprises alike. Further, such a task requires accounting for complex spatiotemporal interactions (modelling both the topological properties of the road network and anticipating events -- such as rush hours -- that may occur in the future). Hence, it is an ideal target for graph representation learning at scale. Here we present a graph neural network estimator for estimated time of arrival (ETA) which we have deployed in production at Google Maps. While our main architecture consists of standard GNN building blocks, we further detail the usage of training schedule methods such as MetaGradients in order to make our model robust and production-ready. We also provide prescriptive studies: ablating on various architectural decisions and training regimes, and qualitative analyses on real-world situations where our model provides a competitive edge. Our GNN proved powerful when deployed, significantly reducing negative ETA outcomes in several regions compared to the previous production baseline (40+% in cities like Sydney).

Figure 1: Google Maps estimated time-of-arrival (ETA) prediction improvements for several world regions, when using our deployed graph neural network-based estimator. Numbers represent the relative reduction in negative ETA outcomes compared to the prior approach used in production. A negative ETA outcome occurs when the ETA error from the observed travel duration is over some threshold, and acts as a measure of accuracy.

Web mapping services like Google Maps are used not only by everyday users finding the (optimal) routes between waypoints, but also by enterprises which rely on accurate mapping features (such as food delivery services, which may seek to provide accurate and optimised delivery time estimates). One critical task in this context is expected time of arrival (ETA) prediction: given current conditions on the road network, and for a particular candidate route between waypoints, predict the expected time of travel along this route. The significance of such a predictor is clear: accurate ETA predictions allow traffic participants to make more informed decisions, potentially avoiding congested areas and minimising overall time spent in traffic.

It is important to note that powerful ETA predictors need to meaningfully reason about conditions which take place in the future, and which may not be directly obvious from the current road state. As a simple example, the number of traffic participants may substantially increase during daily rush-hour periods, when most people commute to and from the workplace -- but in many cases such situations may be more subtle. Further, there is a wealth of historical travel data available which may be used to support the detection and exploitation of such subtleties. As such, this problem is substantially amenable to machine learning approaches.
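Since negative ETA outcomes are the headline metric throughout this paper, a small illustrative sketch of how such a rate could be computed may be helpful. The exact error definition and threshold are not published here, so the relative-error rule below is an assumption, not the production definition.

```python
# Illustrative sketch of the negative-ETA-outcome rate. The paper only states
# that an outcome is negative when the ETA error from the observed travel
# duration exceeds some threshold; the relative-error rule and the 0.25
# threshold here are assumptions.
def negative_eta_outcome_rate(predicted_etas, observed_durations, threshold=0.25):
    """Fraction of trips whose ETA error exceeds the (relative) threshold."""
    flags = [
        abs(eta - actual) / actual > threshold
        for eta, actual in zip(predicted_etas, observed_durations)
    ]
    return sum(flags) / len(flags)
```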
As the road network is naturally modelled by a graph of road segments and intersections, ETA prediction is amenable to graph representation learning [1, 2, 10] approaches, particularly graph neural networks (GNNs) [8, 15, 25]. Here we present our graph neural network model for ETA prediction, which we deployed in production at Google Maps, observing significant reductions in negative ETA outcomes across all trips worldwide -- above 40% in cities like Sydney -- compared to the previous production baseline. See Figure 1 for improvements in additional regions. A negative ETA outcome occurs when the ETA error from the observed travel duration is over some threshold, and acts as a measure of accuracy.

The task of arrival time estimation can be described as providing an estimated travel time given a user-provided starting location and a route suggested by a prior system. This problem is difficult because it needs to take into account both spatial information (captured within the road network) and temporal information (the evolution of traffic conditions over time). In particular, we need to anticipate that travel times across road segments may be affected by traffic conditions that are further away, or not immediately on the user's track -- potentially also accounting for "background" signals such as traffic light configurations. We also need to anticipate that traffic conditions may have changed by the time a user traverses a road segment further along their route.

Contributions. We demonstrate how a powerful graph neural network estimator can be constructed for the ETA prediction task, while following simple design principles and carefully designed loss functions. Such a GNN then remains stable and performant when deployed in production, providing substantial gains against prior production baselines which did not use the road network graph representations. Beyond this, our key contributions to this problem are as follows:

• Innovations on how the road network data is featurised and presented to the GNN;
• Innovations in the GNN model design and operation. While the GNN itself is constructed of standard building blocks, we found several benefits for the training regime (such as MetaGradients [31] and semi-supervised training [16, 26]) as well as architectural ablations (e.g. carefully tuned combinations of aggregators [5]) that proved particularly important for maintaining the stability of the GNN in this regime and making it production-ready;
• Finally, we deploy the trained GNN model in production within Google Maps, observing strong real-world benefits. As displayed in Figure 1, we noticed clear quantitative improvements in negative ETA outcomes. Besides the online performance, we performed several offline ablations on various architectural and setup decisions, as well as visualisations of particular traffic situations where our model has a clear advantage over the prior baseline.

On the one hand, our work represents another in a series of successful web-scale deployed applications of graph neural networks with real-time positive user experience impact [12, 18, 32, 33]. On the other, we offer ablative studies and insights into the GNN model's predictions which simultaneously (a) elucidate the kinds of real-world traffic use cases where such a model may provide a significant edge; and (b) provide prescriptive advice for other practitioners in the general area of user-facing traffic recommendation.
We outline several related works that either directly inspire the building blocks of our proposed GNN architecture and pipeline, or sit more broadly in the space of GNNs for traffic forecasting.

Graph representation learning. Being a graph neural network, our work directly builds upon the previous years of progress in graph representation learning. In particular, our model is fundamentally based on the Graph Network framework [1] and follows the encode-process-decode [11] paradigm. This allows us to align better with the iterative nature of traffic computations, as well as pathfinding algorithms. Improving such an alignment is known to help GNNs generalise better to a diverse distribution of graphs [27].

Travel-time prediction. Given its wide applicability, travel-time prediction is a problem that has historically been closely studied under various settings. Earlier work modifies convolutional neural networks to be mindful of the spatial properties of the trajectory [28] and employs graph neural networks coupled with recurrent mechanisms [13]. The concurrently developed work of Yuan et al. [35] makes an important observation in this space: travel time is often largely dependent on the historical travel times at the given time-of-day -- an observation that we extensively use. The recent CurbGAN framework [38] may also be of interest, with its leveraging of generative adversarial networks for evaluating urban development plans in the context of estimated travel times.

Spatiotemporal traffic forecasting. Using GNNs to forecast traffic conditions is historically a very active direction: most influential papers in the area of spatiotemporal graph representation learning rely on traffic datasets. The spatiotemporal setup typically assumes a graph structure over which each node maintains a time series -- in the case of road networks, such time series typically correspond to historical speeds measured at regular time intervals. While our GNN model detailed here processes static inputs, and does not directly forecast future speeds, having some estimate of future traffic flow behaviour is likely to be meaningful for ETA prediction. One of the first influential papers in the area was the diffusion-convolutional recurrent neural network (DCRNN) [17]. This paper illustrated how explicitly accounting for the graph structure provides a significant reduction in forecasting error over several horizons. This generic blueprint of a spatial GNN component combined with a temporal component yielded several subsequent proposals, including STGCN [34], GaAN [37], Graph WaveNet [29], ST-GRAT [19], StemGNN [3], and GMAN [39]. We refer interested readers to [23] for a survey of interesting problems and methodologies in the area. Lastly, we find it important to mention that the assumption of the time series being aligned across nodes does not necessarily hold in practice -- we may instead expect data to be provided in an asynchronous fashion. A recent line of research on dynamic graph representation learning [20, 30] tries to explicitly account for such situations, on both the featurisation and the topological side.

Graph neural networks at scale. As graph neural network methodologies became more widely available and scalable [4, 9, 21], this also enabled a wider range of web-scale applications of graph representation learning.
Perhaps the most popularised applications concerned recommender systems, which are very naturally representable as a graph-structured task, with Pinterest being one of the earliest adopters [18, 33]. GNNs have also been deployed for product recommendation at Amazon [12], e-commerce applications at Alibaba [32], engagement forecasting and friend ranking in Snapchat [22, 24], and, most relevantly, they are powering traffic predictions within Baidu Maps [6].

Training regimes and components. While the core components of our architecture correspond to the Graph Network paradigm, ETA prediction in production invites particularly unstable training conditions across many batches of queries, particularly over routes of different scales. We adapted the MetaGradients [31] methodology from reinforcement learning, which allowed us to dynamically tune the learning rate during training and stabilise it across many uneven query batches, enabling a production-ready GNN model. Further, there exist many patterns inherent to the road network topology that may be particularly pertinent to forecasting traffic flow. Similar motifs in the road network of the same region are likely to correspond to similar traffic dynamics -- as such, automatically discovering and compressing them is likely to be useful. To enable such effects, we have directly adapted established unsupervised GNN methodologies such as graph auto-encoders [16] and deep graph infomax [26] -- for both of them, we demonstrate benefits to more accurate ETA predictions. Additionally, different kinds of heuristic computations over road networks may require different ways of aggregating information. For example, shortest path-finding would require optimising the signals over a node's neighbourhood, whereas flow estimation may require summing such signals. Inspired by the principal neighbourhood aggregation [5] architecture, we have investigated various combinations of GNN aggregation functions, showing them to be beneficial across many modelling scenarios.

In this section, we describe our travel time prediction problem in the context of arrival time estimation. We also provide the technical details of the GNN models that we used for this problem, including their architecture and training details. The key components of our method are models that operate on networks of route segments in order to leverage their structure, and that predict travel times for multiple time horizons into the future, in order to provide end users with more accurate ETAs. Key technical details of our GNN models are the various stabilizing methods that are used to reduce training variance, which have a critical impact on user experience. Further, we will demonstrate that several changes to our launched model provide offline improvements under specific conditions.

To achieve accurate travel time estimates, we model the road network using supersegments, which are sequences of connected road segments that follow typical traffic routes. For a given starting time, we learn the travel time of each supersegment for different fixed time horizons into the future. At serving time, the sequence of supersegments in a proposed route is queried sequentially, for increasing horizons into the future. Specifically, this process sequentially uses the prediction time of an earlier supersegment to determine the relevant fixed horizons for the next supersegment, and interpolates between the prediction times of the fixed horizons to arrive at a travel time for the next supersegment.
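As a concrete illustration, the following is a minimal sketch of this serving-time procedure. It assumes a per-supersegment predictor callback `predict(seg, horizon)` (a hypothetical interface, not the production API) and uses the fixed horizon grid detailed in the next subsection.

```python
# Minimal sketch of serving-time ETA accumulation with horizon interpolation.
import bisect

HORIZONS = [0, 600, 1200, 1800, 3600]  # fixed prediction horizons, in seconds

def route_eta(supersegments, predict, departure_offset=0.0):
    """Accumulate travel time over a route, one supersegment at a time."""
    elapsed = departure_offset
    for seg in supersegments:
        # The time already spent travelling decides which pair of fixed
        # horizons brackets the current supersegment.
        hi = min(bisect.bisect_right(HORIZONS, elapsed), len(HORIZONS) - 1)
        lo = max(hi - 1, 0)
        t_lo, t_hi = predict(seg, HORIZONS[lo]), predict(seg, HORIZONS[hi])
        if HORIZONS[hi] == HORIZONS[lo]:
            travel = t_lo
        else:
            # Linear interpolation between the bracketing horizons, clamped
            # so we never extrapolate past the last horizon.
            w = (elapsed - HORIZONS[lo]) / (HORIZONS[hi] - HORIZONS[lo])
            w = min(max(w, 0.0), 1.0)
            travel = (1.0 - w) * t_lo + w * t_hi
        elapsed += travel
    return elapsed - departure_offset
```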
This process enables us to provide accurate travel time estimates in a scalable way, taking into account both spatial and temporal information.

A supersegment S is defined as a graph S = (V, E), where each node s_i ∈ V is a road segment, and an edge e_{ij} ∈ E exists if two segments s_i and s_j are connected (see Figure 2 for an overview). The supersegment travel time prediction task is then: given a supersegment S, predict the travel time across the supersegment at different horizons into the future, at times t, t+h_1, t+h_2, ..., t+h_K, where t is the prediction time and h_1, h_2, ..., h_K correspond to the different horizons.

Figure 2: An example road network with shared traffic volume, which is partitioned into segments of interest (left). Each segment is treated as a node (middle), with adjacent segments connected by edges, thus forming a supersegment (right). Note that for extended supersegments (discussed in Section 4.1.4), extra off-route nodes may be connected to the graph.

3.1.1 Data. The world is divided into regions that confine similar driving behaviors and capture most trips without splitting. Data is constructed for each of these regions in order to build region-specific GNN models. For training and evaluation, each example was collected from a traversal event across a supersegment and its underlying segments; a unique supersegment may therefore appear in multiple examples, corresponding to multiple traversals over that supersegment. The actual traversal times along segments and supersegments, in seconds, are used as node-level and graph-level labels for prediction. The features, describing the traffic conditions of the supersegments prior to the traversal event, were collected backwards in time by the duration of each fixed horizon. The road segments are on average 50-100 meters long, and the supersegments contain on average about 20 road segments. The horizons we use are 0 seconds, 600 seconds, 1200 seconds, 1800 seconds, and 3600 seconds. In Section 4, we also investigate the trade-off of expanding supersegment graphs to include neighbouring nodes, or segments that extend beyond an immediate route.

For this task, we utilize both real-time information that captures traffic patterns at the time of prediction, and historical data for traffic patterns specific to the time-of-day and day-of-week when the prediction is made. On a segment (node) level, we provide the average real-time and historical segment travel speeds and times, as well as segment length and segment priority (e.g. road classifications such as highways) as features. On a supersegment (graph) level, we additionally provide real-time supersegment travel times as a feature. Real-time travel speeds and times are provided for 17 two-minute windows prior to horizon 0. Historical speeds and times are provided for five 8-minute windows prior to horizon 0, and seven 8-minute windows after horizon 0, with the values being an average across the past 17 weeks. We additionally provide learnable segment-level and supersegment-level embedding vectors (of sizes 16 and 64, respectively). This enables sharing of information whenever the same segment appears in different supersegments, and whenever the same supersegment appears in different candidate routes.

Possible baselines for this problem include (a) nonparametric methods, which compute travel times by averaging speeds across segments and supersegments, and (b) segment-level models, which bypass the graph structure induced by supersegments. Such models predict travel times along every segment in isolation, adding them together to predict across supersegments. Prior production models used segment-level linear regression models.

Our GNN model leverages full Graph Network (GN) blocks exactly as described in [1]. Specifically, a GN block is defined with three "update" functions φᵉ, φᵛ, φᵘ, corresponding to edge-, node- and global (supersegment)-level updates, and three "aggregation" functions ρᵉ→ᵛ, ρᵉ→ᵘ, ρᵛ→ᵘ:

  e'_k = φᵉ(e_k, v_{s_k}, v_{t_k}, u)
  v'_i = φᵛ(v_i, ē'_i, u),  with  ē'_i = ρᵉ→ᵛ({e'_k : t_k = i})
  u' = φᵘ(v̄', ē', u),  with  v̄' = ρᵛ→ᵘ({v'_i}),  ē' = ρᵉ→ᵘ({e'_k})

For each edge k, φᵉ takes in the edge features e_k, the source and target node features v_{s_k} and v_{t_k}, and the supersegment features u, and outputs an edge representation e'_k. For each node i, φᵛ takes in the node features v_i, the aggregated edge representations ē'_i of node i's edges, and the supersegment features u, and outputs a node representation v'_i. Finally, φᵘ takes in the aggregated node representations v̄', the aggregated edge representations ē', and the supersegment features u, and outputs a supersegment representation u'.

We compose 3 GN blocks into an encode-process-decode architecture [11], and learn a separate model for each horizon h. These blocks are the encoder GN_enc, the processor GN_proc, and the decoder GN_dec. The encoder is first applied to the supersegment S with the previously described raw features. It produces latent representations of the nodes, edges and the supersegment itself. These latents are then passed to the processor, which is applied 2 times (with shared parameters) to update these representations. Finally, these representations are transformed into appropriate predictions by applying the decoder. The model predictions are the supersegment travel time ŷ_{t+h} (decoded from u'), the segment-level travel times ŷ⁽ⁱ⁾_{t+h} (decoded from each v'_i), and cumulative segment-level travel times ŷ⁽ⁱ'ᶜ⁾_{t+h}.

The update functions φ are all simple multilayer perceptrons (MLPs). Aggregation functions are summations. While this is common practice, there exist benefits from using multiple aggregator functions, such as summing and averaging, and concatenating their results together [5]. For certain regions and horizons in offline testing, this was true in our case as well. We will discuss the comparative benefits of various aggregators in Section 4.
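To make the above wiring concrete, here is a minimal NumPy sketch of the encode-process-decode composition with sum aggregation. It is illustrative only: random linear+ReLU layers stand in for the trained MLP update functions, and the layer sizes and toy graph are our own choices.

```python
# Minimal sketch of GN blocks with sum aggregation (illustrative stand-in).
import numpy as np

rng = np.random.default_rng(0)

def mlp(d_in, d_out):
    W = rng.normal(0.0, 0.1, (d_in, d_out))
    return lambda x: np.maximum(x @ W, 0.0)  # one linear + ReLU layer

def gn_block(d_e, d_v, d_u, d_out):
    phi_e = mlp(d_e + 2 * d_v + d_u, d_out)  # edge update
    phi_v = mlp(d_v + d_out + d_u, d_out)    # node update
    phi_u = mlp(d_out + d_out + d_u, d_out)  # supersegment (global) update

    def apply(e, v, u, senders, receivers):
        n_e, n_v = len(e), len(v)
        # e'_k = phi_e(e_k, v_sk, v_tk, u)
        e_new = phi_e(np.concatenate(
            [e, v[senders], v[receivers], np.repeat(u[None], n_e, 0)], axis=1))
        # Sum incoming edge representations per node, then update each node.
        agg_e = np.zeros((n_v, e_new.shape[1]))
        np.add.at(agg_e, receivers, e_new)
        v_new = phi_v(np.concatenate(
            [v, agg_e, np.repeat(u[None], n_v, 0)], axis=1))
        # u' = phi_u(summed nodes, summed edges, u)
        u_new = phi_u(np.concatenate([v_new.sum(0), e_new.sum(0), u]))
        return e_new, v_new, u_new

    return apply

# Encode-process-decode: the processor is applied twice with shared parameters.
d = 32
enc = gn_block(d_e=4, d_v=8, d_u=3, d_out=d)
proc = gn_block(d_e=d, d_v=d, d_u=d, d_out=d)
dec = gn_block(d_e=d, d_v=d, d_u=d, d_out=1)  # decode 1-d travel times

senders, receivers = np.array([0, 1]), np.array([1, 2])  # segments 0 -> 1 -> 2
e = rng.normal(size=(2, 4))   # toy edge features
v = rng.normal(size=(3, 8))   # toy segment features
u = rng.normal(size=(3,))     # toy supersegment features

e, v, u = enc(e, v, u, senders, receivers)
for _ in range(2):
    e, v, u = proc(e, v, u, senders, receivers)
_, segment_times, supersegment_time = dec(e, v, u, senders, receivers)
```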
While training our GNN models, we found it very useful to use a combination of various loss functions. This resulted in strong serving performance when deployed in production. The primary bottleneck to deploying was the high variability of models during training. Variance is an issue in application because models saved at different points of training will result in different end user experiences. To address this, we use a MetaOptimizer that makes use of MetaGradients [31] for adapting the learning rate during training, as well as applying an exponential moving average (EMA) over model parameters during evaluation and serving.

3.3.1 Losses. Although our main objective is to predict supersegment-level travel time, we found that using auxiliary losses for good segment-level travel time prediction helps the final performance. Our final loss for a specific horizon h is

  ℓ = ℓ_ss + λ_s ℓ_s + λ_sc ℓ_sc,

with additional weight decay, where λ_s = 1 and λ_sc = 0.15, and ℓ_ss, ℓ_s and ℓ_sc are the supersegment-level, segment-level and cumulative segment-level losses, respectively. The losses make use of the supersegment and segment free flow times (denoted f and f_i), an estimate of traversal times when little to no traffic is present. We predict the cumulative travel times ŷ⁽ⁱ'ᶜ⁾_{t+h} in addition to segment-level travel times in order to avoid an accumulation of errors over segments. Each term is a Huber loss with delta value δ, which is important for being less sensitive to outliers. The losses are summed up across the segments and supersegments, so the supersegment-level loss is more influenced by supersegment length, whereas the segment-level loss is more independent of length. We exponentially scale the Huber loss down for examples with greater free flow times, to better control the loss's growth.
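A sketch of this composite objective follows, under stated assumptions: the paper does not give the Huber delta or the exact form of the free-flow down-scaling, and the cumulative predictions come from a separate decoder head rather than being derived from the per-segment outputs.

```python
# Sketch of the composite loss: Huber terms at supersegment, segment and
# cumulative-segment level. Delta and the exponential free-flow scaling form
# are illustrative assumptions, not the production settings.
import numpy as np

def huber(pred, target, delta=1.0):
    err = np.abs(pred - target)
    quad = np.minimum(err, delta)
    return 0.5 * quad ** 2 + delta * (err - quad)

def eta_loss(ss_pred, ss_label, seg_preds, seg_cum_preds, seg_labels,
             free_flow_time, lambda_s=1.0, lambda_sc=0.15):
    # Scale the loss down exponentially for examples with greater free-flow
    # times (the exact schedule is an assumption here).
    scale = np.exp(-free_flow_time / 3600.0)
    l_ss = huber(ss_pred, ss_label)                             # supersegment
    l_s = np.sum(huber(seg_preds, seg_labels))                  # per segment
    l_sc = np.sum(huber(seg_cum_preds, np.cumsum(seg_labels)))  # cumulative
    return scale * (l_ss + lambda_s * l_s + lambda_sc * l_sc)
```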
We also experimented with purely unsupervised auxiliary losses, and will discuss their trade-offs in Section 4.

3.3.2 MetaGradients. In order to reduce the variance of the recovered GNN models across epochs, we resort to a careful treatment of the learning rate for each step of training. Our method of choice, MetaGradients [31], follows the online cross-validation paradigm and jointly optimizes the hyper-parameter with the model parameters at each training iteration. Although the original work targets reinforcement learning, the approach generalizes well to supervised learning applications. MetaGradients frames the optimization problem as L(τ, θ, η), where τ is a training example, θ are the model parameters, η is the hyper-parameter (e.g., the learning rate) and L is the original loss function additionally parameterized by η. For a new training example τ', an underlying update to the model parameters θ' = θ + f(τ, θ, η) with some update function f (e.g. a gradient step) additionally parameterized by η, and a reference meta-parameter value η̄, MetaGradients computes the update for η using the following derivative:

  ∂L(τ', θ', η̄)/∂η = ∂L(τ', θ', η̄)/∂θ' · ∂θ'/∂η,

where the gradient with respect to the model parameters is multiplied by the factor ∂θ'/∂η. Xu et al. [31] have shown that, in practice, this second factor can be approximated using an accumulative trace:

  ∂θ'/∂η ≈ μ ∂θ/∂η + ∂f(τ, θ, η)/∂η,

with trace decay parameter μ. In our application, we have successfully combined MetaGradients with the Adam optimizer [14].

3.3.3 Exponential moving averages. During evaluation and serving, we use an EMA of the model parameters, θ_EMA ← α θ_EMA + (1 − α) θ, with α = 0.99. EMAs were another key component in reducing variance in saved models within training runs.
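The paper pairs MetaGradients with Adam; purely for illustration, the sketch below uses plain SGD, for which ∂θ'/∂η = −g, together with the accumulative trace and the parameter EMA. The trace decay μ, the reuse of the current batch for the meta step, and the learning-rate floor are our own simplifications.

```python
# Simplified MetaGradients sketch with SGD (production uses Adam).
# `grad(theta, batch)` is an assumed callback returning the loss gradient
# for a flat parameter vector `theta`.
import numpy as np

def train(theta, batches, grad, lr=1e-3, meta_lr=0.01, mu=0.9,
          meta_every=100, ema_decay=0.99):
    trace = np.zeros_like(theta)  # accumulates an estimate of d(theta)/d(lr)
    ema = theta.copy()            # parameters used for evaluation and serving
    for step, batch in enumerate(batches):
        g = grad(theta, batch)
        theta = theta - lr * g    # underlying update f(tau, theta, lr)
        trace = mu * trace - g    # for SGD, d(theta')/d(lr) = -g
        if (step + 1) % meta_every == 0:
            # dL/d(lr) via the chain rule: fresh gradient dotted with trace.
            g_next = grad(theta, batch)
            lr = max(lr - meta_lr * float(g_next @ trace), 1e-6)
        ema = ema_decay * ema + (1.0 - ema_decay) * theta
    return ema, theta, lr
```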
Although our models were launched across regions world-wide, we evaluate our model specifically over New York (NYC), Los Angeles (LAX), Tokyo (TYO) and Singapore (SGP), finding that the results generalize to other regions (see Figure 1). For offline evaluation, we used training data collected during January 2020. For online evaluation, our data was collected during November 2020. We note that November 2020 data may be susceptible to COVID-19 traffic patterns, and hence we attempted to select regions that minimize this deviation for evaluation, in order to capture performance under regular traffic conditions. The distributions of the datasets and their extended versions are described in Table 1.

Table 1: Information on datasets, including the number of total supersegments and unique supersegments in each training and evaluation dataset, the average number of segments in a supersegment computed across a training dataset, and the average length of segments computed across a training dataset.

We evaluate our GNN models against nonparametric travel times computed using real-time speeds and historical speeds, as well as models that do not leverage graph structure:

• Real-time travel times: Summation of segment-level travel times computed using segment speeds averaged over a two-minute window prior to the time of prediction. These are numbers computed from data.
• Historical travel times: Summation of segment-level travel times computed using segment speeds specific to the hour of day and day of week at the time of prediction, averaged across the last 17 weeks. These are numbers computed from data.
• DeepSets: Transform node representations using a segment-level feed-forward network, followed by a sum aggregation over the segment-level representations, before using a final feed-forward network to make graph-level predictions [36]. This effectively treats supersegments as a "bag-of-segments", ignoring their graph structure.

We ablate over the specific elements of our launched GNNs that contribute to performance improvements and variance reduction, including learnable segment and supersegment embeddings.

Embeddings. We ablated over the embeddings defined for unique segment and supersegment IDs. We use a dimension of 16 for segment-level embeddings and a dimension of 64 for supersegment-level embeddings across all models that use embeddings. The embedding vocabularies are region-specific, covering 99.5% of segment IDs and supersegment IDs in the corresponding datasets, and assigning the rest to be out-of-vocabulary (OOV), spread out across 200 and 20 OOV buckets for segments and supersegments, respectively.

MetaGradients & EMA. We directly compare our models trained with and without MetaGradients. When trained with MetaGradients, we set the meta-learning rate to 0.01 and the update frequency to every 100 iterations. We compare our GNN models that use EMA (with a decay factor of α = 0.99) to models that do not.

We also describe and compare extensions to the launched GNN models that show offline improvements, and which can be useful in specific settings. This includes extended supersegment graph data, combinations of aggregators, and unsupervised auxiliary losses.

Extended Supersegments. We investigate performance improvements for GNN models that operate on extended supersegments. Extended supersegments are larger graphs that include nodes from neighbouring segments, in addition to the original segments. These include segments from nearby, possibly connected traffic. The prediction task and labels remain the same: the goal is still to predict travel times across the original segments and supersegments within the extended supersegments. This enables our models to make use of additional spatial context -- such as traffic congestion on a connecting route -- at the expense of data storage, and slower training and inference. We note that the same GNN architecture is used for both supersegment and extended supersegment experiments, so the model storage cost remains constant. Further, we experiment with binary features that indicate whether segments are on the main route in an extended graph.

Combinations of Aggregators. We further investigated the benefits of swapping out the default sum aggregation function with combinations of different aggregators. We combine aggregators by concatenating their outputs; for example, the combination [Min, Max] has a node-level aggregation function of

  ē'_i = [ min_{j ∈ N(i)} e'_{ji} ∥ max_{j ∈ N(i)} e'_{ji} ],

where N(i) are the road segments that neighbour segment i. We ablate over all combinations of Max, Min, SqrtN, and Sum. In the interests of brevity, only the best-performing ones are included in Table 6.
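A sketch of such a combined aggregator follows, using the same message/receiver indexing as the GN sketch earlier. The SqrtN definition below (sum scaled by 1/√N) is one common convention, assumed here rather than quoted from the paper.

```python
# Sketch of concatenated aggregator combinations such as [Min, Max].
import numpy as np

AGGREGATORS = {
    "sum": lambda m: m.sum(axis=0),
    "min": lambda m: m.min(axis=0),
    "max": lambda m: m.max(axis=0),
    "sqrtn": lambda m: m.sum(axis=0) / np.sqrt(len(m)),  # assumed definition
}

def aggregate(edge_msgs, receivers, num_nodes, combo=("min", "max")):
    """Concatenate the chosen aggregations of each node's incoming messages."""
    receivers = np.asarray(receivers)
    width = edge_msgs.shape[1]
    out = np.zeros((num_nodes, len(combo) * width))
    for i in range(num_nodes):
        incoming = edge_msgs[receivers == i]
        if len(incoming):  # isolated nodes keep an all-zeros row
            out[i] = np.concatenate([AGGREGATORS[a](incoming) for a in combo])
    return out
```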
Unsupervised Auxiliary Losses. Lastly, we experimented with incorporating unsupervised losses, which are designed to produce representations that capture the structural information of input graphs. This, in turn, may simplify the discovery of interesting motifs within the road network topology, which may implicitly drive traffic dynamics. In all experiments, these losses are combined with the supervised objective (the loss ℓ defined in Section 3.3.1) to assist learning. We consider two main kinds of unsupervised loss functions. Firstly, deep graph infomax (DGI) [26] encourages the node embeddings to encode global structural properties, by carefully contrasting the road network against an artificially corrupted version. Conversely, the graph auto-encoder (GAE) [16] leverages a link prediction-style loss and is hence mindful of local topology.

For offline evaluation, we use the root mean squared error (RMSE) between the predicted supersegment travel times ŷ_j and the actual traversal times y_j, in seconds, over N supersegment traversals:

  RMSE = √( (1/N) Σ_{j=1}^{N} (ŷ_j − y_j)² ).

Reported values are computed using the temporally held-out test datasets of the same regions. Other metrics, such as mean absolute error (MAE), show similar results and are thus omitted. RMSE is used to make decisions about which architectures are used in online evaluations and launches, so we focus on that metric here.

Baseline Comparison. Table 2 shows the offline evaluation results of our GNN models compared to the baselines, averaged over five random seeds.

Table 2: Offline model performance against baseline methods, reported in RMSE and averaged across five random seeds. The graph network (GN) significantly outperforms all reported baselines for each region and horizon (p < 0.01 using a t-test).

Across all studied regions and horizons, our GNN models outperform the baselines. Compared to the closest baseline, DeepSets, our GNN models show an improvement of 0.12 to 0.77 RMSE. One note is that these improvements are computed over individual supersegments, and they will accumulate over entire routes -- hence our improvements may quickly become more important on a whole-route level. Across all of the considered regions and horizons, the improvements of the GNN are statistically significant: p < 0.01 using a t-test on the individual run results, across five runs.

Embeddings. Table 3 studies the effect of the various learnable input embeddings, by attempting to ablate them out. Across most regions and horizons, such embeddings are useful for improving RMSE.

MetaGradients & EMA. Figure 3 compares training with and without applying MetaGradients and EMA decay, over several seeds. For brevity, we only show results for two settings (TYO horizon 0, LAX horizon 3600); the trends generalize across different regions and horizons. Both MetaGradients and EMA contribute to lowering within-run and across-run variance, which is critical for inference in production. Further, we observed that MetaGradients consistently decays the learning rate over time, which ensures stable training.

Extended Supersegments. Table 4 compares GNNs that operate on supersegments to variants that operate on extended supersegments. Combining extended supersegments with additional binary features that indicate whether a segment or an edge is part of the original supersegment is capable of improving RMSE (by 0.01 to 0.29). In a scenario where data storage and inference cost increases are not core issues, extended supersegments may hence be a useful data representation.

Unsupervised Auxiliary Losses. Table 5 compares the performance between different unsupervised losses in terms of RMSE. Augmenting the loss function with deep graph infomax (DGI) or graph auto-encoders (GAE), weighted by a tuned parameter, achieves an improvement of 0.05 to 0.23 RMSE. The optimal variant and hyperparameters were different for each region and prediction horizon; thus, it is likely expensive to tune such losses for all existing regions and each horizon. However, GN+DGI would be a reasonable option, as it showed some degree of improvement in all cases. Note that there are several important design decisions involved with deploying DGI, e.g. deciding the readout, corruption, and discriminator functions. We, however, did not explore many options in this space, as the potential performance improvements were not deemed significant enough for the additional costs incurred in a production environment.
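For illustration, here is a sketch of how a GAE-style link-prediction term [16] could be added alongside the supervised objective. The inner-product decoder follows the GAE formulation; the uniform negative-sampling scheme and the loss weight are our own choices, not the production configuration.

```python
# Sketch of a GAE-style auxiliary loss: node latents reconstruct supersegment
# adjacency via inner products, with uniform random negative sampling.
import numpy as np

def gae_auxiliary_loss(node_latents, senders, receivers, num_neg=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(node_latents)

    def edge_logits(s, r):
        return np.sum(node_latents[s] * node_latents[r], axis=-1)

    def log_sigmoid(x):
        return -np.logaddexp(0.0, -x)  # numerically stable log(sigmoid(x))

    pos = log_sigmoid(edge_logits(senders, receivers))       # observed edges
    neg_s = rng.integers(0, n, size=num_neg * len(senders))  # random pairs,
    neg_r = rng.integers(0, n, size=num_neg * len(senders))  # treated as non-edges
    neg = log_sigmoid(-edge_logits(neg_s, neg_r))
    return -(pos.mean() + neg.mean())

# Combined objective (the 0.1 weight is illustrative):
# total_loss = eta_loss(...) + 0.1 * gae_auxiliary_loss(v_latents, senders, receivers)
```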
Combinations of Aggregators. Table 6 compares model performance when using different aggregation functions. When compared to our production default of sum, other variants may achieve an improvement of up to 0.34 RMSE, depending on the region and horizon. Since the optimal aggregation function differs between regions and horizons, it is likely difficult and expensive to tune it for all regions and horizons of the world. Using all five aggregations at once is a reasonable preference, and it may be beneficial to investigate dynamic aggregators as a future line of work.

We compare our GNNs against baselines in an online setting in order to show the effectiveness of the models during serving. Both real-time travel times and historical travel times are used in precursor systems and fall-back systems. For online evaluation, we use the root mean squared error between the predicted track travel times ŷ_j and the actual travel times y_j, computed over all N tracks within a week (specifically, January 15th 6:00 PM to January 22nd 2021 6:00 PM):

  RMSE_track = √( (1/N) Σ_{j=1}^{N} (ŷ_j − y_j)² ).

We note that the GN and DeepSets models additionally use a series of fallback models for segments that do not appear in supersegments, and for segments that do not have real-time or historical travel time input features, whereas the same is not done for real-time and historical travel times. Table 7 shows the results of our GNN models compared to the baselines in the online setting. It shows a similar pattern to Table 2, with GNNs outperforming the baseline methods. Compared to DeepSets, the strongest baseline approach, GNNs outperform by 0.05 to 1.35 RMSE in this online setting.

In this section, we provide specific examples that give insight into scenarios where our GNN models are able to provide more accurate estimated times of arrival compared to baselines, by better making use of spatial and temporal information. Figure 4 (left) shows the supersegment (which is part of LAX) which we will use to illustrate these qualitative findings.

Figure 4 (above) compares the predicted travel times over the supersegment to the actual travel times, over a range of departure times. The upward slopes of the GNN predicted travel times match the upward slopes of the actual travel times better than the real-time travel times baseline. In this setting, our GNN model is able to detect traffic congestion more accurately than the baseline, by making use of traffic information from a larger neighborhood.

Figure 4 (below) compares predicted travel times over the supersegment for various horizons into the future. The predictions of our GNN for h = 3600 better match the actual travel times 60 minutes into the future, compared to the predictions of our GNN for h = 0. This example illustrates the value of having multiple prediction horizons in providing accurate ETAs.

Figure 4: Qualitative analysis over a supersegment in LAX (shown on the left). Above: Travel times and predictions over different departure times. Our GNN is able to detect traffic congestion more accurately than using real-time travel times. Below: Travel times and predictions for different horizons into the future, over different route departure times. Having multiple prediction horizons is important for providing accurate ETAs.

Caching. It is challenging to meet the latency requirements of Google Maps while keeping the cost of evaluating Graph Net models low. The path in an ETA request usually includes multiple supersegments, and making predictions for these supersegments on the fly is neither practical nor scalable. Approaches like CompactETA [7] learn high-level representations that can be queried at inference time by a very simple MLP. To resolve this issue, we instead created a shared lookup table caching fresh predictions for predefined supersegments over a fixed set of horizons, which is periodically updated. The server fetches the necessary predictions upon a request, and interpolates the predictions of adjacent horizons to compute the estimated arrival time for each supersegment. We have tested an update frequency of 15 seconds, and found no measurable regression up to a refresh period of 2 minutes, indicating that staleness is not an issue.
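A minimal sketch of such a periodically refreshed lookup table follows (the interface and refresh logic are hypothetical). A `lookup` like this could back the `predict` callback in the interpolation sketch shown earlier.

```python
# Sketch of a shared prediction cache keyed by (supersegment_id, horizon),
# refreshed periodically rather than evaluated per request.
import time

HORIZONS = [0, 600, 1200, 1800, 3600]

class PredictionCache:
    def __init__(self, predict_fn, supersegment_ids, refresh_secs=120):
        self._predict = predict_fn          # assumed model-evaluation callback
        self._ids = supersegment_ids        # predefined supersegments
        self._refresh_secs = refresh_secs   # tested between 15 s and 2 min
        self._stamp, self._table = 0.0, {}

    def _maybe_refresh(self):
        if time.time() - self._stamp >= self._refresh_secs:
            self._table = {(s, h): self._predict(s, h)
                           for s in self._ids for h in HORIZONS}
            self._stamp = time.time()

    def lookup(self, supersegment_id, horizon):
        self._maybe_refresh()
        return self._table[(supersegment_id, horizon)]
```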
Turn speeds. It is common to have a multimodal speed distribution on a busy road segment heading into a split of traffic, such as an intersection or a highway ramp. For such cases, we use the real-time and historical speed for the specific turn which the supersegment follows.

Coverages. Currently, we have about 1 million predefined supersegments that cover the most common routes taken by our users. The effect is that most freeways, major arterials and some popular short-cuts in metropolitan areas are selected, whereas less busy surface streets are not covered. Multiple supersegments may cover any given road segment, to account for multiple common routes that traverse through that road segment. For less frequently visited road segments, we use simpler per-segment models.

We presented how a graph neural network can be engineered to accurately predict travel time estimates along candidate routes. Applying stabilising techniques such as MetaGradients and EMA was a necessary addition to make the GNNs production-ready. We deployed our GNN model for ETA prediction in Google Maps, where it now serves user queries worldwide. We presented offline metrics, online evaluations, and user studies, all showing significant quantitative improvements from using GNNs for ETA prediction. Further, through extensive ablations on various design and featurisation choices, we provide prescriptive advice on deploying these kinds of models in practice. We believe our findings will be beneficial to researchers applying GNNs to any kind of transportation network analysis -- which is certain to remain a very important application area for graph representation learning as a whole.
Acknowledgments. We would like to thank Paulo Estriga and Adam Cain for designing several of the figures featured in this paper.

References
[1] Relational inductive biases, deep learning, and graph networks
[2] Geometric deep learning: going beyond Euclidean data
[3] Spectral Temporal Graph Neural Network for Multivariate Time-series Forecasting
[4] FastGCN: fast learning with graph convolutional networks via importance sampling
[5] Principal neighbourhood aggregation for graph nets
[6] ConSTGAT: Contextual Spatial-Temporal Graph Attention Network for Travel Time Estimation at Baidu Maps
[7] CompactETA: A Fast Inference System for Travel Time Prediction
[8] Neural message passing for quantum chemistry
[9] Inductive representation learning on large graphs
[10] Graph representation learning
[11] Relational inductive bias for physical construction in humans and machines
[12] P-Companion: A Principled Framework for Diversified Complementary Product Recommendation
[13] Stochastic origin-destination matrix forecasting using dual-stage graph convolutional, recurrent neural networks
[14] Adam: A method for stochastic optimization
[15] Semi-supervised classification with graph convolutional networks
[16] Variational graph auto-encoders
[17] Diffusion convolutional recurrent neural network: Data-driven traffic forecasting
[18] PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest
[19] ST-GRAT: A Novel Spatio-temporal Graph Attention Network for Accurately Forecasting Dynamically Changing Road Speed
[20] Temporal graph networks for deep learning on dynamic graphs
[21] SIGN: Scalable Inception Graph Neural Networks
[22] Graph Neural Networks for Friend Ranking in Large-scale Social Platforms
[23] Machine learning for spatiotemporal sequence forecasting: A survey
[24] Knowing your FATE: Friendship, Action and Temporal Explanations for User Engagement Prediction on Social Apps
[25] Graph attention networks
[26] Deep graph infomax
[27] Neural execution of graph algorithms
[28] When will you arrive? Estimating travel time based on deep neural networks
[29] Graph WaveNet for deep spatial-temporal graph modeling
[30] Inductive Representation Learning on Temporal Graphs
[31] Meta-gradient reinforcement learning
[32] AliGraph: A comprehensive graph neural network platform
[33] Graph convolutional neural networks for web-scale recommender systems
[34] Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting
[35] Effective Travel Time Estimation: When Historical Trajectories over Road Networks Matter
[36] Deep Sets
[37] GaAN: Gated attention networks for learning on large and spatiotemporal graphs
[38] Curb-GAN: Conditional Urban Traffic Estimation through Spatio-Temporal Generative Adversarial Networks
[39] GMAN: A Graph Multi-Attention Network for Traffic Prediction