GOPHER: Categorical probabilistic forecasting with graph structure via local continuous-time dynamics
Ke Alexander Wang, Danielle Maddix, Yuyang Wang
2021-12-18

Abstract

We consider the problem of probabilistic forecasting over categories with graph structure, where the dynamics at a vertex depend on its local connectivity structure. We present GOPHER, a method that combines the inductive bias of graph neural networks with neural ODEs to capture the intrinsic local continuous-time dynamics of our probabilistic forecasts. We study the benefits of these two inductive biases by comparing against baseline models that help disentangle the benefits of each. We find that capturing the graph structure is crucial for accurate in-domain probabilistic predictions and for more sample-efficient models. Surprisingly, our experiments demonstrate that the continuous-time evolution inductive bias brings little to no benefit, despite reflecting the true probability dynamics.

In categorical probabilistic forecasting, we seek to predict a discrete probability distribution p(t) at some instantaneous time t, based on observed time-stamped data [10]. Consider the example of forecasting the most likely locations of the next earthquake over a finite set of locations at time t, given the history of earthquake times and locations. We can view the locations as vertices V of a graph G = (V, E) with edges E that represent adjacency. Specifically, the probability of an earthquake at node v ∈ V in the near future is mostly influenced by the probability of earthquakes at nodes within its neighborhood. This type of graph structure also appears in other problems, including traffic forecasting [28], information diffusion in social networks [1], epidemic diffusion [24, 13], and urban conflict patterns [16], and it is an example of a marked temporal point process [6].

In this paper, we consider categorical probabilistic forecasts where a graph structure informs us of the local dynamics governing p(t) over time. We formalize the intuition that each component of the probability vector p(t) ∈ R^{|V|} obeys local dynamics using the differential equation

    dp_v(t)/dt = g(p_v(t), {p_u(t) : u ∈ N(v)}),    (1)

which we use to inform our model's inductive bias. Here, g governs the local dynamics, N(v) ⊆ V denotes the set of neighboring nodes of v, and p_v denotes the probability at node v. To capture the equivariant local dynamics of our forecast p(t), we propose GOPHER, a model that learns a neural ODE [4] with graph neural network (GNN) [26] dynamics. GOPHER introduces two inductive biases to aid probabilistic forecasting over graph-structured categories by 1) utilizing the graph structure explicitly and 2) introducing temporal evolution through a neural ODE. To disentangle the benefits of these two biases, we introduce two baseline models, each ablating one of the biases. We find that utilizing the known graph structure is key, resulting in 10x improvements in accuracy and sample efficiency. On the other hand, explicitly modelling the temporal dynamics surprisingly yields little benefit.

* Work done as an intern at Amazon Research.

Let G = (V, E) be a graph, and let t_i ∈ R_{≥0} denote the timestamp of an event at node v_i ∈ V. Given G and an irregularly sampled dataset D = {(t_i, v_i)}_{i=1}^N, we want to learn the probability vector p(t) ∈ R^{|V|}, whose entries give the probability of an event at each node v ∈ V at any time t.
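As a toy illustration of the locality constraint in Equation (1), the sketch below evaluates dp/dt on a small directed graph in which each node only sees its own probability and those of its neighbors. The graph, and the particular choice of g, are hypothetical and purely illustrative; they are not taken from the paper.

```python
# Toy illustration (not from the paper) of the locality in Equation (1):
# dp_v/dt depends only on p_v and the probabilities of its neighbors N(v).
import numpy as np

# Hypothetical directed graph as an adjacency list: node -> neighbors N(v).
neighbors = {0: [1, 2], 1: [2], 2: [0]}

def local_dynamics(p, g):
    """Evaluate dp/dt component-wise; each entry only sees its own neighborhood."""
    dp = np.zeros_like(p)
    for v, nbrs in neighbors.items():
        dp[v] = g(p[v], p[nbrs])        # g(p_v, {p_u : u in N(v)})
    return dp

# One possible g, chosen only for illustration: drift toward the neighborhood mean.
g = lambda p_v, p_nbrs: p_nbrs.mean() - p_v

p0 = np.array([0.7, 0.2, 0.1])
print(local_dynamics(p0, g))            # rate of change of each node's probability
```

Note that an arbitrary g of this form need not conserve total probability, which is exactly the issue the parameterization below is designed to avoid.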
We wish to model the dynamics of p(t) such that the change in the probability at node v depends only on the neighborhood N(v) around v, as described in Equation 1. However, directly parameterizing g from Equation 1 with a neural ODE can violate conservation of probability: 1^⊤ p(t) = 1. Instead of explicitly enforcing the sum constraint in our neural ODE, we model the dynamics in a continuous-time embedding space from which we derive the dynamics dp/dt. Specifically, let Z_0 ∈ R^{|V|×D}, where D denotes the embedding space dimension. We use z_{0,i} to denote row i of Z_0 at the initial time t_0, corresponding to the embedding of node v_i. We then model the dynamics of the continuous-time embeddings Z(t) ∈ R^{|V|×D} via

    dZ(t)/dt = g(Z(t), G),    Z(t_0) = Z_0,    (2)

where g is the learned graph neural network (GNN) dynamics. To map Z(t) to a probability space while preserving equivariance, we learn a shared projection π : R^D → R such that

    p_{v_i}(t) = exp(π(z_i(t))) / Σ_{j=1}^{|V|} exp(π(z_j(t))),

i.e., a softmax over the projected node embeddings. Provided that g and π are differentiable, which we ensure by using smooth activation functions, our model implicitly captures the local temporal dynamics of our problem in Equation 1. Finally, we train GOPHER by maximizing the log likelihood Σ_{i=1}^N log p_{v_i}(t_i) with respect to the parameters of π, g, and the initial condition Z_0 (see the sketch at the end of this section).

Incorporating node attributes. In some cases, G may have node attributes {a_v}_{v∈V} that affect the interaction dynamics, such as the geographical coordinates of each node in a spatial graph or the demographics of a user in a social network. Node attributes can easily be incorporated by letting the initial node embeddings be a learned function of the attributes, z_{0,v} = ψ_v(a_v), and optimizing with respect to the parameters of {ψ_v}.

Related works. Our paper lies at the intersection of probabilistic forecasting, neural ODEs, and graph neural networks (GNNs), and can be seen as the discrete analogue of continuous normalizing flows [4, 11, 5] on manifolds [17, 18]. Probabilistic forecasting seeks to predict a full distribution at each time step [14, 10], with contemporary methods often relying on deep probabilistic models [22, 25, 21, 20]. A direct application of categorical probabilistic forecasting is marked temporal point processes, which learn the rate of an event of type v at time t, summarized by the conditional intensity function λ(t, v) = λ(t) · p_v(t) [6]. The inductive bias of a learnable ODE with GNN dynamics has also been explored in other problems, including graph generation [7], node classification [19, 2], multi-particle trajectory prediction [19], learning partial differential equations [15], and knowledge graph forecasting [12].
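To make the parameterization concrete, below is a minimal PyTorch-style sketch of the model in Equation 2, assuming the torchdiffeq and torch_geometric packages are available. The class names, layer sizes, and loss helper are illustrative assumptions rather than the authors' exact implementation.

```python
# A minimal sketch of the GOPHER parameterization in Equation 2, assuming the
# torchdiffeq and torch_geometric packages. Class names, layer sizes, and the
# loss helper below are illustrative assumptions, not the authors' exact code.
import torch
import torch.nn as nn
from torchdiffeq import odeint           # black-box ODE solver
from torch_geometric.nn import GINConv   # graph isomorphism network layer [27]

class GNNDynamics(nn.Module):
    """dZ/dt = g(Z(t), G): GIN dynamics acting on the node embeddings."""
    def __init__(self, dim, edge_index):
        super().__init__()
        self.edge_index = edge_index      # (2, num_edges) LongTensor describing G
        self.gin = GINConv(nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim)))

    def forward(self, t, z):              # torchdiffeq expects f(t, y)
        return self.gin(z, self.edge_index)

class Gopher(nn.Module):
    def __init__(self, num_nodes, dim, edge_index):
        super().__init__()
        self.z0 = nn.Parameter(torch.randn(num_nodes, dim))   # learned initial condition Z_0
        self.dynamics = GNNDynamics(dim, edge_index)
        # Shared projection pi: R^D -> R, applied to every node embedding.
        self.pi = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 1))

    def log_p(self, query_times):
        """Log of p(t) at strictly increasing query times t > t_0 = 0."""
        ts = torch.cat([torch.zeros(1), query_times])
        zs = odeint(self.dynamics, self.z0, ts)[1:]    # (T, |V|, D) embeddings Z(t)
        logits = self.pi(zs).squeeze(-1)               # (T, |V|) per-node scores
        return torch.log_softmax(logits, dim=-1)       # each row of p(t) sums to one

def nll(model, event_times, event_nodes):
    """Negative log likelihood, -sum_i log p_{v_i}(t_i), over observed events."""
    log_p = model.log_p(event_times)                   # (N, |V|)
    return -log_p[torch.arange(len(event_nodes)), event_nodes].sum()
```

The baselines introduced in Section 3 skip the ODE solve entirely and instead pass the node embeddings, concatenated with the query time t, directly through the same GIN layer (NAIVEGNN) or an MLP (NAIVEMLP), followed by the projection.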
3 Results: I Can't Believe Temporal Dynamics Don't Matter!

Synthetic datasets. We apply our method to model the mark component of a marked temporal point process (TPP) occurring on the nodes of a graph, such that p_v(t) is the probability of an event occurring on vertex v at time t. We create a synthetic dataset where events occur over time on a directed graph G, with node probabilities that obey graph advection as an example of local dynamics [3]. Graph advection conserves the total probability by ensuring 1^⊤ dp/dt = 0. We represent the graph G by its weighted adjacency matrix A, where A_{uv} > 0 for (u, v) ∈ E. We sample sequences of events over the time interval [0, T] from a homogeneous Poisson process with constant temporal intensity λ(t) = λ = 2.5 and temporal node probability p(t) ∈ R^{|V|} governed by the graph advection equation [3]

    dp_v(t)/dt = Σ_{u : (u,v) ∈ E} A_{uv} p_u(t) − Σ_{w : (v,w) ∈ E} A_{vw} p_v(t)

(a minimal sketch of this generative process is given at the end of this section). We create two graph structures for our synthetic datasets, a ring graph and a random geometric graph, and visualize the advection dynamics on each graph over time in Figure 1; see Appendix D for more details on their construction. We also visualize the advection dynamics of each component of p(t) for the ring graph in Figure 9 of Appendix D. We use T = 5 seconds for the ring graph dataset and T = 1 second for the geometric graph dataset. Since the timestamps are sampled from a Poisson process and are not equidistantly spaced over [0, T], the continuous-time aspect of the problem is clearly evident in the dataset.

Evaluating each inductive bias of GOPHER. We evaluate the accuracy and sample-efficiency improvements from incorporating graph structure and modelling temporal dynamics in GOPHER. To disentangle the effects of these two inductive biases, we compare our model to two baseline models. The first is a two-layer MLP that acts on node embeddings concatenated with time, which has neither of the above inductive biases. The second is a single-layer GNN that also acts on node embeddings concatenated with time; it incorporates the explicit graph structure, but does not incorporate dynamical systems structure. We refer to these models as NAIVEMLP and NAIVEGNN, respectively. In our experiments, GOPHER learns g using a Graph Isomorphism Network (GIN) layer [27] parameterized by a two-layer MLP; we use another two-layer MLP for the projection π. NAIVEGNN uses the same GIN architecture and projection, except that it does not learn a differential equation. Finally, NAIVEMLP replaces the GIN layer with a two-layer MLP. See Appendix D for further details on our experiment hyperparameters.

Figure 2 shows the KL divergence between the ground truth p(t) and the learned predictions over time for the ring graph. We summarize the KL divergence over [0, T] in Figure 3 (plotted against the number of training sequences) by its geometric mean, since the error varies over multiple orders of magnitude over time [9]. In both figures, we show 95% confidence intervals over 3 seeds. For both datasets, there is a 10x difference in accuracy between the graph-structured models and NAIVEMLP, indicating that utilizing the graph structure is greatly beneficial. Though NAIVEGNN does not explicitly model the local temporal dynamics of the datasets, it performs nearly identically to our model GOPHER in fitting p(t) over the training interval [0, T]. In principle, GOPHER has the best chance of extrapolating to the [T, 2T] time period not seen during training, since it explicitly models the local dynamics. However, GOPHER's poor extrapolation ability suggests that its learned dynamics do not actually reflect the true dynamics. Indeed, in Figure 7 of Appendix C we show that although GOPHER can fit the training data well, it is brittle to edge deletions, further indicating that GOPHER does not learn the true dynamics.
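For concreteness, here is a minimal sketch of the synthetic data-generating process described at the beginning of this section: graph advection for p(t) plus a homogeneous Poisson process for the event times. It assumes NumPy and SciPy, and the 3-node directed ring and its edge weights are hypothetical placeholders rather than the graphs used in the experiments.

```python
# A minimal sketch of the synthetic data-generating process: graph advection for
# p(t) plus a homogeneous Poisson process for event times. The 3-node ring and
# its weights are hypothetical placeholders, not the experimental graphs.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

# Weighted adjacency matrix A with A[u, v] > 0 for each directed edge (u, v).
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
L_out = np.diag(A.sum(axis=1)) - A        # out-degree graph Laplacian

def p_at(t, p0):
    """Advection solution p(t) = expm(-L_out^T t) p0; preserves 1^T p(t) = 1."""
    p = expm(-L_out.T * t) @ p0
    return p / p.sum()                    # guard against round-off before sampling

p0 = np.array([1.0, 0.0, 0.0])            # initial probability mass on node 0
lam, T = 2.5, 5.0                          # Poisson intensity and time horizon

# Homogeneous Poisson arrivals on [0, T]; each event's node is drawn from p(t_i).
num_events = rng.poisson(lam * T)
event_times = np.sort(rng.uniform(0.0, T, size=num_events))
event_nodes = np.array([rng.choice(len(p0), p=p_at(t, p0)) for t in event_times])
```

The matrix form dp/dt = −L_out^⊤ p(t) used in the sketch, with L_out = D_out − A the out-degree Laplacian, is equivalent to the component-wise advection equation above.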
Real-world dataset. We use data released publicly by the New York Times [23] on daily COVID-19 cases in the state of New Jersey to construct a real-world categorical probabilistic forecasting dataset, following the preprocessing script of Chen et al. [5]. We aggregate the cases by county and form a graph with 21 nodes, where each node is a county and each edge is a shared county border. Using the train/test split from Chen et al. [5], we obtain per-event log likelihoods (mean ± one standard deviation over 3 seeds) of −2.766 ± 0.003 for NAIVEMLP, −2.768 ± 0.003 for NAIVEGNN, and −2.767 ± 0.000 for GOPHER. However, these likelihoods are not representative of the model differences, since we find a large distribution shift between the train and test distributions, shown in Figure 5 of Appendix A. This distribution shift causes the models to perform equally poorly on the test set. In actuality, NAIVEMLP completely fails to capture variations in p(t) over time, as shown in Figure 4.

Although the inductive biases of GOPHER directly reflect properties of categorical forecasting with local continuous-time dynamics, our experiments find that, surprisingly, explicitly modelling the temporal dynamics does not improve performance. Most of the performance gains of GOPHER come from incorporating the graph structure, which can be done with a simple baseline model like NAIVEGNN. The failure of GOPHER can be attributed to the fact that the learned dynamics in the embedding space do not accurately reflect the ground-truth dynamics in probability space.

A Distribution shift in our real-world dataset

Figure 5 compares the train and test distributions over counties, showing the distribution shift discussed above.

Figure 7 (Appendix C): The predicted p(t) after we remove all edges from the ring graph. Notice that the ground truth for a completely disconnected graph is p(t) = p(0) for all t. However, all of the models fail completely on this new disconnected graph, suggesting that they do not learn the true dependence of p(t) on the graph structure.

D.1 Model architectures

We use two-layer MLPs with 64 hidden units per layer whenever we use an MLP. We use Swish activations to ensure smoothness of our dynamics. We also use an augmented neural ODE [8], treating 16 of the 64 hidden dimensions as augmented dimensions.

GOPHER. We parameterize the dynamics g from Equation 2 using one graph isomorphism network (GIN) layer parameterized by an MLP. We also use an MLP to model the projection π.

NAIVEGNN. We use the same architecture as GOPHER, namely a GIN layer followed by a projection π. However, instead of using the model to parameterize ODE dynamics, we directly pass the node embeddings, concatenated with time t, through the GNN.

NAIVEMLP. We replace the GIN layer of NAIVEGNN with an MLP, keeping all else the same.

D.2 Training procedure and dataset details

To maximize hardware parallelism, we parallelize our neural ODE computation across sequence timesteps and across sequences using the time-reparameterization trick outlined in Chen et al. [5]. For the synthetic datasets, we use the AdamW optimizer with a 0.01 learning rate and a batch size of 64 for 30 epochs. For the COVID-19 dataset, we use a 3 × 10^{-4} learning rate and a batch size of 4 for 15 epochs. Here, each batch consists of multiple sequences drawn from the training period [0, T]. We use T = 5 for the ring graph, T = 1 for the geometric graph, and T = 8 for the New Jersey counties graph. We generate the ring graph dataset using hand-set coefficients for the edge weights A_{uv} that allow for counter-clockwise transport. We generate the geometric graph dataset by generating a random geometric graph via the networkx Python package and drawing a random sample of {A_{uv}}.

References

The role of social networks in information diffusion
GRAND: Graph Neural Diffusion
Advection on graphs
Neural Ordinary Differential Equations
Neural Spatio-Temporal Point Processes
An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. Probability and Its Applications
Augmented Neural ODEs
Simplifying Hamiltonian and Lagrangian Neural Networks via Explicit Constraints
Probabilistic Forecasting
FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models
Temporal Knowledge Graph Forecasting with Neural ODE
Dynamics of an SIS reaction-diffusion epidemic model for disease transmission
Forecasting: Principles and Practice
Learning continuous-time PDEs from sparse data with graph neural networks
Discovering Latent Network Structure in Point Process Data
Neural manifold ordinary differential equations
Riemannian Continuous Normalizing Flows
Continuous-Depth Neural Models for Dynamic Graph Prediction
Deep State Space Models for Time Series Forecasting
Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows
DeepAR: Probabilistic forecasting with autoregressive recurrent networks
Coronavirus (Covid-19) Data in the United States
Bridging physics-based and data-driven modeling for learning dynamical systems
Deep Factors for Forecasting
A Comprehensive Survey on Graph Neural Networks
How Powerful are Graph Neural Networks?
Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting