Learning the Dynamics of Physical Systems from Sparse Observations with Finite Element Networks
Marten Lienen, Stephan Günnemann
2022-03-16

We propose a new method for spatio-temporal forecasting on arbitrarily distributed points. Assuming that the observed system follows an unknown partial differential equation, we derive a continuous-time model for the dynamics of the data via the finite element method. The resulting graph neural network estimates the instantaneous effects of the unknown dynamics on each cell in a meshing of the spatial domain. Our model can incorporate prior knowledge via assumptions on the form of the unknown PDE, which induce a structural bias towards learning specific processes. Through this mechanism, we derive a transport variant of our model from the convection equation and show that it improves the transfer performance to higher-resolution meshes on sea surface temperature and gas flow forecasting against baseline models representing a selection of spatio-temporal forecasting methods. A qualitative analysis shows that our model disentangles the data dynamics into their constituent parts, which makes it uniquely interpretable.

Figure 1: Finite Element Networks predict the instantaneous change of each node by estimating the effect of the unknown generating dynamics on the domain volume that the node shares with its neighbors.

The laws driving the physical world are often best described by partial differential equations (PDEs) that relate how a magnitude of interest changes in time with its change in space. They describe how the atmosphere and oceans circulate and interact, how structures deform under load, and how electromagnetic waves propagate (Courant & Hilbert, 2008). Knowledge of these equations lets us predict the weather (Coiffier, 2011), build sturdier structures, and communicate wirelessly. Yet, in many cases we only know the PDEs governing a system partially (Isakov, 2006) or not at all, or solving them is too computationally costly to be practical (Ames, 2014). Machine learning researchers try to fill in these gaps with models trained on collected data. For example, neural networks have been trained for weather forecasts (Shi et al., 2015) and fluid flow simulations (Belbute-Peres et al., 2020), both of which are traditionally outcomes of PDE solvers. Even the dynamics of discrete dynamical systems such as traffic (Li et al., 2018) and crowds (Zhang et al., 2017) have been learned from data. A challenge facing these models is the high cost of acquiring training data, so the data is usually only available sparsely distributed in space. Since graphs are a natural way to structure sparse data, models incorporating graph neural networks (GNNs) have been particularly successful for spatio-temporal forecasting (Wu et al., 2019). Our implementation is available at https://www.daml.in.tum.de/finite-element-networks/

In the domain of physical processes, we can reasonably assume that the observed system follows a PDE. There are mainly two ways to incorporate this assumption as a-priori knowledge into a model. First, we can encode a known PDE into a loss function that encourages the model to fulfill the equation (Raissi et al., 2019). Alternatively, we can derive the model structure itself from known laws such as the convection-diffusion equation (de Bézenac et al., 2018).
We will follow the second approach. Consider a dynamical system on a bounded domain Ω ⊂ R^d that is governed by the PDE

∂_t u = F(t, x, u, ∂_x u, ∂²_x u, ...)    (1)

on functions u : [0, T] × Ω → R^m. If we have a dense measurement u_0 : Ω → R^m of the current state of the system and a solution u that satisfies Eq. (1) for all t ∈ [0, T] and also fulfills the initial condition u(0, x) = u_0(x) at all points x ∈ Ω, we can use u as a forecast for the state of the system until time T. From a spatio-temporal forecasting perspective, this means that we can forecast the evolution of the system if we have a continuous measurement of the state, know the dynamics F, and can find solutions of Eq. (1) efficiently. Unfortunately, in practice we only have a finite number of measurements at arbitrary points and only know the dynamics partially or not at all.

Contributions. An established numerical method for forecasts in systems with fully specified dynamics is the finite element method (FEM) (Brenner et al., 2008). In this paper, we introduce the first graph-based model for spatio-temporal forecasting that is derived from FEM in a principled way. Our derivation establishes a direct connection between the form of the unknown dynamics and the structure of the model. Through this connection, our model can incorporate prior knowledge on the governing physical processes via assumptions on the form of the underlying dynamics. We employ this mechanism to derive a specialized model for transport problems from the convection equation. The way that the model structure arises from the underlying equation makes our models uniquely interpretable. We show that our transport model disentangles convection and the remainder of the learned dynamics such as source/sink behavior, and that the activations of the model correspond to a learned flow field, which can be visualized and analyzed. In experiments on multi-step forecasting of sea surface temperature and gas flow, our models are competitive against baselines from recurrent, temporal-convolutional, and continuous-time model classes, and the transport variant improves upon them in the transfer to higher-resolution meshes.

In the following, we will outline how to approximate a solution u to the dynamics in Eq. (1) from an initial value u_0 by discretizing u in space using finite elements. Let X be a set of points with a triangulation T of d-dimensional, non-overlapping simplices

T = {∆^(1), ..., ∆^(K)},  ∆^(j) ⊆ X,  |∆^(j)| = d + 1,    (2)

such that ∪_{∆∈T} CH(∆) equals the domain Ω, where CH(∆) is the convex hull of simplex ∆. So a simplex ∆^(j) ∈ T representing the j-th mesh cell is the set of vertices of that cell, and the domain volume covered by the cell is the convex hull CH(∆^(j)) of the vertices. We will assume u to be a scalar field, i.e. u : [0, T] × Ω → R. If u is a vector field, we treat it as a system of m scalar fields instead. For a detailed introduction to FEM, we refer the reader to Igel (2017).

Basis Functions. A priori, we assume that the unknown solution u to our problem lies in an infinite-dimensional function space U. The first step in FEM to make the problem numerically feasible is to approximate U with a finite-dimensional linear subspace Ũ. This subspace can then be written in terms of linear combinations of basis functions, Ũ = span{ϕ^(1), ..., ϕ^(N)}. There are many possible bases, and the choice determines various qualities of the resulting procedure, such as the continuity of the approximation and the sparsity pattern of the mass matrix in Eq. (7).
In our case, we choose the so-called P1 basis of piecewise linear functions (hat functions), see Fig. 2a (Igel, 2017). There are as many basis functions as there are points, and each is uniquely defined by being linear when restricted to a single cell ∆ ∈ T and the constraint

ϕ^(j)(x^(i)) = δ_ij.

So the basis function ϕ^(j) is 1 at x^(j), falls linearly to 0 on mesh cells adjacent to x^(j), and is 0 everywhere else. The resulting finite-dimensional function space Ũ is the space of linear interpolators between values at the vertices, see Fig. 2b. An important property is that if we expand u ∈ Ũ in this basis, the value of u at the i-th node is just its i-th coefficient.

Galerkin Method. A piecewise linear approximation u ∈ Ũ is not differentiable everywhere and therefore cannot fulfill Eq. (1) exactly. So instead of requiring an exact solution, we ask that the residual R(u) = ∂_t u − F(t, x, u, ...) be orthogonal to the approximation space Ũ with respect to the inner product ⟨u, v⟩_Ω = ∫_Ω u(x) · v(x) dx at any fixed time t. In effect, we are looking for the best possible solution in Ũ. Because Ũ is generated by a finite basis, the orthogonality requirement decomposes into N equations ⟨R(u), ϕ^(i)⟩_Ω = 0, one for each basis function. Plugging the residual back in and using the linearity of the inner product, we can reconstruct a system of equations that resembles the PDE that we started with. At this point, we can stack the system of N equations into a vector equation. If we plug the basis expansion u = Σ_{j=1}^N c_j ϕ^(j) into the left-hand side, we get a linear system

A ∂_t c = m    (7)

where A_ij = ⟨ϕ^(i), ϕ^(j)⟩_Ω is the so-called mass matrix, c is the vector of basis coefficients of u, and m_i = ⟨F(t, x, u, ...), ϕ^(i)⟩_Ω captures the effect of the dynamics F. The left-hand side evaluates to A ∂_t c because the basis functions are constant with respect to time. The right-hand side cannot be further simplified without additional assumptions on F.

Method of Lines. If we can evaluate the right-hand side m, we can solve the linear system in Eq. (7) for the temporal derivatives of the coefficients of u at each point in time. In fact, we have converted the PDE into a system of ordinary differential equations (ODEs), which we can solve with an arbitrary ODE solver given an initial value c^(0), as in Fig. 2c. This is known as the method of lines, because we solve for u along parallel lines in time. To find a vector field u : [0, T] × Ω → R^m instead of a scalar field, we treat the m dimensions of u as a system of m scalar fields. This results in m copies of Eq. (7), which we need to solve simultaneously. Because the mass matrix A is constant with respect to u, we can combine the system into a matrix equation

A ∂_t C = M    (8)

where C, M ∈ R^{N×m} are the stacked c and m vectors, respectively. In summary, the spatial discretization with finite elements turns the PDE (1) into the matrix ODE (8).
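To make the discretization concrete, the following sketch assembles the sparse P1 mass matrix A_ij = ⟨ϕ^(i), ϕ^(j)⟩_Ω over a Delaunay mesh and takes one explicit method-of-lines step. It uses the exact local mass matrix of a triangle and a toy right-hand side m(t, c) as a stand-in for the unknown dynamics; this is an illustrative sketch, not the paper's implementation (which uses scikit-fem for the inner products).

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import spsolve
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
points = rng.random((50, 2))              # scattered measurement locations
tri = Delaunay(points)                    # triangulation T of the domain

# Exact local P1 mass matrix on a triangle of area |D|: (|D| / 12) * [[2,1,1],[1,2,1],[1,1,2]].
local = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 2.0]]) / 12.0
rows, cols, vals = [], [], []
for simplex in tri.simplices:
    p = points[simplex]
    area = 0.5 * abs((p[1, 0] - p[0, 0]) * (p[2, 1] - p[0, 1])
                     - (p[2, 0] - p[0, 0]) * (p[1, 1] - p[0, 1]))
    for a in range(3):
        for b in range(3):
            rows.append(simplex[a])
            cols.append(simplex[b])
            vals.append(area * local[a, b])
A = coo_matrix((vals, (rows, cols)), shape=(len(points),) * 2).tocsc()

# Method of lines: A dc/dt = m(t, c) is an ODE in the basis coefficients c.
def euler_step(c, t, dt, m):
    return c + dt * spsolve(A, m(t, c))   # one explicit Euler step

c = np.sin(np.pi * points[:, 0])          # initial coefficients = nodal values
c = euler_step(c, 0.0, 1e-2, lambda t, c: -c)  # toy right-hand side m(t, c) = -c
```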
Message-Passing Neural Networks (MPNNs) are a general framework for learning on graphs that encompasses many variants of graph neural networks (Gilmer et al., 2017). It prescribes that nodes in a graph iteratively exchange messages and update their state based on the received messages for P steps. For a graph G = (V, E) with nodes V and edges E, and initial node states h_v^(0), step p + 1 computes

h_v^(p+1) = f_upd(h_v^(p), ⊕_{(w,v)∈E} f_msg(h_w^(p), h_v^(p), e_wv))    (9)

where f_msg maps node states and edge attributes to messages, ⊕ is a permutation-invariant aggregation, and f_upd updates a node's state with the aggregated incoming messages. The final node states h_v^(P) can then be interpreted as per-node predictions directly or passed as node embeddings to downstream systems.

In this work, we employ a slight generalization of the above to undirected hypergraphs, i.e. graphs where the edges are sets of an arbitrary number of nodes instead of having a cardinality of exactly 2. For such a hypergraph G = (V, E) with nodes V and hyperedges ε = {u, v, w, ...} ∈ E, and initial node states h_v^(0), step p + 1 computes

h_v^(p+1) = f_upd(h_v^(p), ⊕_{ε∈E : v∈ε} f_msg(ε)_v)    (10)

Note that f_msg jointly computes a separate message f_msg(ε)_v for each node v participating in a hyperedge ε.

Consider a set of nodes X representing N points in d-dimensional space. At each node, we measure m features y^(t_0,i) ∈ R^m. Our goal is to predict y^(t_j) at future timesteps t_j by solving PDE (1). For it to be well-defined, we need a continuous domain encompassing the nodes X. By default, we construct such a domain by Delaunay triangulation. The Delaunay algorithm triangulates the convex hull of the nodes into a set of mesh cells, i.e. d-dimensional simplices, which we represent as sets of nodes denoting the cell's vertices. The PDE will then be defined on the domain Ω = ∪_{∆∈T} CH(∆). One shortcoming of this algorithm is that it can produce highly acute, sliver-like cells on the boundaries of the domain if interior nodes are very close to the boundary of the convex hull. We remove these slivers in a post-processing step, because they have small area and thus contribute negligibly to the domain volume, but can connect nodes which are far apart. See Appendix H for our algorithm. As an alternative to Delaunay triangulation, a mesh can be explicitly specified to convey complex geometries such as disconnected domains, holes in the domain, and non-convex shapes in general.

As a first step towards solving the PDE, we need to represent y^(t_0) as a function on Ω. We define the function u_0(x) = Σ_{i=1}^N y^(t_0,i) ϕ^(i)(x), such that its coefficients as a vector in the P1 basis encode the data, and evaluating it at the nodes, u_0(x^(i)) = y^(t_0,i), agrees with the data by construction. In Section 2.1, we have seen that solving PDE (1) approximately under an initial condition u(t_0, x) = u_0(x) can be cast as an initial value ODE problem over the trajectory of the coefficients of a solution in time. Above, we encoded the features y^(t_0) in the coefficients of u_0, meaning that Eq. (8) exactly describes the trajectory of the features through time. The dynamics of the features are thus given by

A ∂_t Y^(t) = M    (11)

where Y^(t) ∈ R^{N×m} is the feature matrix and A_ij = ⟨ϕ^(i), ϕ^(j)⟩_Ω is the mass matrix. We call the matrix M on the right-hand side the message matrix, and it is given by

M_ik = ⟨F_k(t, x, u, ...), ϕ^(i)⟩_Ω    (12)

where u encodes the predicted features Y at time t in its coefficients. Our goal is to solve the feature dynamics (11) with an ODE solver forward in time to compute the trajectory of Y and thereby predict y^(t_j). To apply an ODE solver, we have to compute ∂_t Y at each time t, which consists of, first, evaluating the right-hand side and, second, solving the resulting sparse linear system. Solving Eq. (11) for ∂_t Y with the exact sparse mass matrix A carries a performance penalty on GPUs, because sparse matrix solving is difficult to parallelize. To avoid this operation, we lump the mass matrix. Lumping of the mass matrix is a standard approximation in FEM for time-dependent PDEs to reduce computational effort (Lapidus & Pinder, 1999), and it leads to good performance in practice. We employ the direct variant that diagonalizes the mass matrix by row-wise summation,

Ā_ii = Σ_{j=1}^N A_ij,    (13)

making the inversion of A trivial. The remaining step in solving Eq. (11) is evaluating the right-hand side.
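A minimal numeric illustration of the direct (row-sum) lumping in Eq. (13): once A is diagonalized, solving Eq. (11) reduces to an elementwise division. The small A and M below are toy stand-ins.

```python
import numpy as np

# Toy mass matrix (N = 3 nodes) and message matrix (m = 2 features).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]]) / 12.0
M = np.ones((3, 2))

a_lumped = A.sum(axis=1)          # Eq. (13): row-wise sums form the diagonal
dY_dt = M / a_lumped[:, None]     # solving Eq. (11) becomes an elementwise division
```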
Then we can break the elements of M up into the contributions from individual cells adjacent to each node. The sum terms capture the effect that the dynamics have on the state of the system such as convection and chemical reaction processes given the geometry of the mesh cell, the values at the nodes, and the basis function to integrate against. However, the dynamics F are of course unknown and need to be learned from data. Yet, if we were to introduce a deep model f θ ≈ F in Eq. (14), we would run into two problems. First, each evaluation would require the numerical integration of a neural network over each mesh cell against multiple basis functions and adaptive-step ODE solvers can require hundreds of evaluations. Second, conditioning f θ on just t, x and u(x) would not provide the model with any spatial information. Such a model would not be able to estimate spatial derivatives ∂ x u internally and could therefore only represent PDEs that are actually just ODEs. We deal with both problems at once by factoring the inner products in Eq. (14) into Such a scalar coefficient F ∆ depends jointly on all x in a mesh cell as well as their solution values u(x). Therefore our deep model f θ naturally also needs to be conditioned on these spatially distributed values, giving it access to spatial information and making it possible to learn actual PDE dynamics. With these theoretical foundations in place, we can formally define Finite Element Networks (FENs). Let G = (X , T ) be the undirected hypergraph with nodes X and hyperedges T corresponding to the meshing of the domain, so each mesh cell ∆ ∈ T forms a hyperedge between its vertices. Let f θ be a neural network that estimates the cell-wise dynamics F ∆ from time, cell location, cell shape and the values at the vertices are the local coordinates of the cell vertices and y (t) ∆ = {y (t,i) | x (i) ∈ ∆} are the features at the vertices at time t. We call f θ the free-form term, because it does not make any assumptions on the form of F and can in principle model anything such as reactions between features or sources and sinks. f θ conditions on the cell centers µ ∆ and the local vertex coordinates x ∆ separately to help the model distinguish between cell position and shape, and improve spatial generalization. Define the following message and update functions in the MPNN framework where m (i) = ∆∈T f msg (∆) x (i) are the aggregated messages for node x (i) . Performing one message passing step (P = 1) in this model is exactly equivalent to solving for the feature derivatives ∂ t Y (t) in Eq. (11) with the lumped mass matrix from Eq. (13) and the message matrix in Eq. (14) with the factorization from Eq. (15) and f θ as a model for F ∆ . So making a forecast with FEN for T points in time t = (t 0 , t 1 , ..., t T ) based on a sparse measurement y (t0) means solving an ODE where at each evaluation step t e we run one message passing step in the MPNN described above to compute the current time derivatives of the features ∂ t y| t=te . Because FENs model the continuous-time dynamics of the data, they can also be trained on and make predictions at irregular timesteps. By controlling the information available to f θ , we can equip the model with inductive biases. If we omit the current time t, the learned dynamics become necessarily autonomous. Similarly, if we do not pass the cell position µ ∆ , the model will learn stationary dynamics. 
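The sketch below shows one FEN message-passing step for a stationary, autonomous model on a 2-d triangle mesh; it is a simplified illustration under assumptions, not the reference implementation. f_theta stands for any MLP with output dimension (d + 1) · m, the sorting of cell vertices by polar angle described below is omitted for brevity, and the helper name fen_step is ours. It uses the fact that ⟨1, ϕ^(i)⟩_∆ = |∆| / 3 for P1 hat functions on triangles.

```python
import torch

def fen_step(t, y, points, cells, f_theta, lumped_mass):
    # One FEN message-passing step: returns dy/dt with shape (N, m).
    # y: (N, m) node features, points: (N, 2), cells: (K, 3) vertex indices,
    # lumped_mass: (N,) diagonal of the lumped P1 mass matrix (Eq. 13).
    p = points[cells]                        # (K, 3, 2) vertex coordinates per cell
    mu = p.mean(dim=1)                       # (K, 2) cell centers
    local = p - mu[:, None, :]               # (K, 3, 2) local vertex coordinates
    feats = y[cells]                         # (K, 3, m) features at the vertices
    area = 0.5 * torch.abs(
        (p[:, 1, 0] - p[:, 0, 0]) * (p[:, 2, 1] - p[:, 0, 1])
        - (p[:, 2, 0] - p[:, 0, 0]) * (p[:, 1, 1] - p[:, 0, 1]))  # (K,)

    # Cell-wise dynamics estimate F_Delta: one coefficient per vertex and feature.
    inp = torch.cat([mu, local.flatten(1), feats.flatten(1)], dim=-1)
    coeff = f_theta(inp).view(cells.shape[0], 3, y.shape[1])       # (K, 3, m)

    # <1, phi_i>_Delta = |Delta| / 3 for P1 hat functions on triangles.
    msg = coeff * (area / 3.0)[:, None, None]

    dy = torch.zeros_like(y)
    dy.index_add_(0, cells.reshape(-1), msg.reshape(-1, y.shape[1]))  # aggregate over T_i
    return dy / lumped_mass[:, None]                                  # divide by lumped mass

# With torchdiffeq (as in the paper's setup), forecasting then amounts to:
# odeint(lambda t, y: fen_step(t, y, points, cells, f_theta, lumped_mass), y0, t)
```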
An orthogonal and at least as powerful mechanism to induce inductive biases into the model is through assumptions on the form of the unknown dynamics F. Let us assume that the dynamics are not completely unknown, but that a domain expert told us that the generating dynamics of the data include a convective component. So the dynamics for feature k are of the form

∂_t u_k = −∇ · (v^(k)(t, x, u, ...) u_k) + F(t, x, u, ...)

where v^(k)(t, x, u, ...) ∈ R^d is the divergence-free velocity field of that feature and F represents the still unknown remainder of the dynamics. We assume the velocity field to be divergence-free, because we want the velocity field to model only convection and absorb sources and sinks in the free-form term. By following similar steps with the convection term −∇ · (v^(k)(t, x, u, ...) u_k) as for the unknown dynamics, we arrive at a specialized term for modeling convection. Let g_θ be a neural network that estimates one velocity vector per cell and attribute. Ensuring that the learned velocity field is globally divergence-free would be a costly operation, but by sharing the same predicted velocity vector between all vertices of each cell, we can cheaply guarantee that the velocity field is at least locally divergence-free within each cell. The corresponding message function, which we derive in Appendix B, is

(f^tr_msg(∆)_{x^(i)})_k = −Σ_{x^(j)∈∆} y^(t,j)_k v_∆,k^⊤ ⟨∇ϕ^(j), ϕ^(i)⟩_∆,   v_∆,k = g_θ(t, µ_∆, x_∆, y^(t)_∆)_k.

With this, we can extend FEN with specialized transport-modeling capabilities by adding the two message functions and define Transport-FEN (T-FEN) as a FEN with the message function

f_msg(∆)_{x^(i)} = f^ff_msg(∆)_{x^(i)} + f^tr_msg(∆)_{x^(i)}

where f^ff_msg is the free-form message function from above. Consequently, T-FEN learns both a velocity field to capture convection and a free-form term that fits the unknown remainder of the dynamics.
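As a numerical illustration of the transport message above (derived in Appendix B), the sketch below computes the convective message of one triangular cell to its vertices for a single scalar feature. It relies on ∇ϕ^(j) being constant on the cell, so that ⟨∇ϕ^(j), ϕ^(i)⟩_∆ = ∇ϕ^(j) |∆| / 3 for every vertex i; the helper name transport_message is ours, and v_cell stands in for the output of g_θ.

```python
import numpy as np

def transport_message(points, cell, y_cell, v_cell):
    # cell: (3,) vertex indices, y_cell: (3,) nodal values of one feature,
    # v_cell: (2,) velocity vector predicted by g_theta for this cell.
    p = points[cell]                                 # (3, 2)
    B = np.hstack([np.ones((3, 1)), p])              # P1 basis: phi_j = a0 + a1*x + a2*y
    grads = np.linalg.inv(B)[1:, :]                  # (2, 3), column j = grad phi_j (constant on the cell)
    area = 0.5 * abs((p[1, 0] - p[0, 0]) * (p[2, 1] - p[0, 1])
                     - (p[2, 0] - p[0, 0]) * (p[1, 1] - p[0, 1]))
    # <grad phi_j, phi_i>_Delta = grad phi_j * |Delta| / 3 is identical for all i,
    # so every vertex of the cell receives the same convective message.
    msg = -(area / 3.0) * np.sum(y_cell * (v_cell @ grads))
    return np.full(3, msg)
```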
Network Architecture. We instantiate both f_θ and g_θ as multilayer perceptrons (MLPs) with tanh non-linearities. The input arguments are concatenated after sorting the cell vertices by the angle of x^(i) − µ_∆ in polar coordinates. This improves generalization, because f_θ and g_θ do not have to learn order invariance of the nodes. Both MLPs have an input dimension of 1 + d + (d + 1) · (m + d) in the non-autonomous, non-stationary configuration, since |∆| = d + 1. The free-form term f_θ computes one coefficient per vertex and attribute and therefore has an output dimension of (d + 1) · m, whereas the transport term estimates a velocity vector for each cell and attribute, resulting in an output dimension of m · d. The number of hidden layers and other configuration details for each experiment are listed in Appendix D. We use scikit-fem (Gustafsson & McBain, 2020) to compute the inner products between basis functions over the mesh cells and the dopri5 solver in torchdiffeq (Chen et al., 2018) to solve the resulting ODE. See Appendix E for details on model training.

Baselines. We evaluate FEN and T-FEN against a selection of models representing a variety of temporal prediction mechanisms: Graph WaveNet (GWN) combines temporal and graph convolutions (Wu et al., 2019); Physics-aware Difference Graph Network (PA-DGN) estimates spatial derivatives as additional features for a recurrent graph network (Seo et al., 2020); and the continuous-time MPNN-based model by Iakovlev et al. (2021), which we will call CT-MPNN, uses a general MPNN to learn the continuous-time dynamics of the data. See Appendix D for the configuration details. GWN and PA-DGN use 12 and 5 timesteps respectively as input, while the ODE-based models make their predictions based on a single timestep. To ensure a fair comparison, we do not evaluate any of the models on the first 12 timesteps of the test sets.

Datasets. For our experiments we have chosen two real-world datasets and a synthetic one. The Black Sea dataset provides daily mean sea surface temperature and water flow velocities on the Black Sea over several years. ScalarFlow is a collection of 104 reconstructions of smoke plumes from 2-second-long, multi-view camera recordings (Eckert et al., 2019). In each recording, a fog machine releases fog over a heating element in the center of the domain, which then rises with the hot air through convection and buoyancy. CylinderFlow contains fluid flows around a cylinder in a channel, simulated with the incompressible Navier-Stokes equations (Pfaff et al., 2021). All our experiments are run on hold-out data, i.e. the year 2019 of the Black Sea data and the last 20 recordings of ScalarFlow. See Appendix C for a detailed description of the datasets, our sparse subsampling procedure, choice of normalization, and train-validation-test splits.

Multi-step Forecasting. The main objective of our models as well as the baselines is accurate forecasts, and we train them to minimize the mean absolute error over 10 prediction steps. We chose this time horizon because it is long enough that a model needs to learn non-trivial dynamics to perform well, and more accurate dynamics lead to better performance. At the same time, it is short enough that not too much uncertainty accumulates. For example, if we train any of the models on Black Sea on much longer training sequences, they learn only diffusive smoothing, because the correlation between the temperature distribution today and in 30 days is basically nil, so that the only winning move is not to play, similar to how forecasting the weather for more than 10 days is basically impossible (Mailier, 2010).

Figure 3: Predictive accuracy of models trained on 1000 nodes and evaluated on increasingly fine subsamplings of the test data.

Super-Resolution. Graph-based spatio-temporal forecasting models make predictions on all nodes jointly. Training these models on many points or fine meshes therefore requires a prohibitive amount of memory to store the data points and large backwards graphs for backpropagation. One way around this problem is to train on coarse meshes and later evaluate the models on the full data. In Fig. 3 we compare FEN and T-FEN against the strongest baseline on the Black Sea dataset. While the prediction error of all three models increases with mesh resolution, the FEN models deteriorate at a markedly slower rate than CT-MPNN. We ascribe this to the fact that in FEN and T-FEN the estimation of the dynamics and their effect on the nodes are to a certain extent separated. The terms f_θ and g_θ estimate the dynamics on each cell, but the effect on the node values is then controlled by the inner products of basis functions, which incorporate the shape of the mesh cells in a theoretically sound way. CT-MPNN, on the other hand, learns these aspects jointly, which makes it more difficult to generalize to different mesh structures. See Appendix F for the complete results on both datasets and a visual comparison of low- and high-resolution data.

Extrapolation. While we train the models on 10-step forecasting, it is instructive to compare their predictions on longer time horizons. Fig. 4 shows the predictions of the three strongest models on a smoke plume after 60 steps. Because these models are all ODE-based, they were conditioned on a single timestep and have been running for hundreds of solver steps at this point.
First, we notice that all three models have the fog rise at about the same, correct speed. However, with CT-MPNN the fog disappears at the bottom, because the model does not learn to represent the fog inlet as a fixed, local source, as opposed to our models. Comparing FEN and T-FEN, we see that the former's dynamics also increase the density further up in the fog column where no physical sources exist, while the latter, with its separate transport term modeling convection, keeps the density stable also over longer periods of time. See Appendix G for a comparison with all baselines at multiple time steps.

Interpretability. In FEN, the free-form term f_θ models the dynamics F as a black box, and as such its activations are equally opaque as the dynamics we derived it from. However, as we have seen in Section 3.1, by making assumptions on the structure of F we impose structure onto the model, and this structure in turn assigns meaning to the model's components. For a T-FEN, the transport term corresponds to a velocity field for each attribute, while the free-form term absorbs the remainder of the dynamics. Fig. 5 shows that this disentanglement of dynamics works in practice. In the ScalarFlow data, we have a fog inlet at the bottom, i.e. a source of new energy, and from there the fog rises up by convection. Plotting the contributions of f_θ and g_θ to ∂_t y separately shows that the free-form term represents the inlet while the transport term is most active in the fog above. Instead of inspecting the contributions to ∂_t y, we can also investigate the learned parameter estimates directly. Fig. 6 shows that the transport term learns a smooth velocity field for the sea surface temperature on the Black Sea.

Spatio-Temporal Forecasting. Various DNNs, and in particular GNNs, have been designed to model complex spatial and temporal correlations in data, with a wide range of forecasting applications.

Figure 5: Contribution to ∂_t y from the free-form and transport term in a T-FEN on a snapshot of ScalarFlow.

To maximize the reproducibility of our experimental results, we have fixed seeds for every domain subsampling, training run, and evaluation, and recorded these in configuration files, so that everything can be re-run. We obtained all seed values from numpy.random.SeedSequence to guarantee that the streams of random numbers from the random number generators (RNGs) are independent and have high sample quality. While we fixed all seeds, model training is still non-deterministic, because the message-passing step in graph-based models relies on a scatter operation. Scattering is implemented via atomic operations on GPUs, which can re-order floating-point operations such as additions between runs with the same seed, inducing non-determinism (Fey & Lenssen, 2019). Nonetheless, we have observed that multiple runs from the same seed only diverge slowly as training progresses, and the variability in performance between runs from different seeds is small for all models, as can be seen in the standard deviation of the mean absolute error in Table 1. We used an NVIDIA GeForce GTX 1080 Ti for all experiments and evaluations.
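A minimal sketch of this seeding scheme: spawning children from one numpy.random.SeedSequence yields statistically independent streams. The root entropy value below is made up for illustration.

```python
import numpy as np

root = np.random.SeedSequence(20220316)          # hypothetical root entropy
subsample_ss, train_ss, eval_ss = root.spawn(3)  # statistically independent children
rngs = [np.random.default_rng(ss) for ss in (subsample_ss, train_ss, eval_ss)]
print(rngs[0].integers(0, 1000, size=3))
```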
We have presented a way to derive a spatio-temporal forecasting model directly from FEM. In addition to establishing a connection between this well-known theory behind many scientific simulations and neural ODE models, our derivation opens a way to include knowledge in the form of PDEs directly into the structure of a machine learning model, as we have shown by the example of the convection equation and T-FEN. Extensions of the model to other first- or second-order linear PDEs such as the heat equation or the wave equation should follow with a similar derivation. However, our approach cannot be applied directly when derivatives of higher order are involved. The problem fundamentally stems from our choice of basis functions. The P1 basis is piecewise linear and thus has vanishing second- and higher-order derivatives. This means that, at the end of the model derivation, any basis function involved in fixed weights such as ⟨∇ϕ^(j), ϕ^(i)⟩_∆ can appear with at most a first derivative; otherwise the expression will evaluate to 0. A solution would lie in extending our approach to higher-order basis functions, though this is a non-trivial endeavor. In such a setting, the values at the nodes of a triangular mesh would no longer suffice to determine all coefficients in the new basis. In future research, we will work on constraining these new coefficients and making learning with higher-order basis functions a well-posed learning problem.

Let F be just a convection term; because of the linearity of the inner product ⟨·, ·⟩_Ω, we can ignore other additive components of F in this derivation. In the rest of this section, we write v_k for v^(k)(t, x, u, ...). We begin by plugging F into the message matrix from Eq. (14),

M_ik = Σ_{∆∈T_i} ⟨−∇ · (v_k u_k), ϕ^(i)⟩_∆,

where we have already split the domain into the mesh cells. Now we apply the product rule, ∇ · (v_k u_k) = (∇ · v_k) u_k + v_k · ∇u_k, and use our assumption that the velocity field is divergence-free, i.e. ∇ · v_k = 0, which leaves

M_ik = −Σ_{∆∈T_i} ⟨v_k · ∇u_k, ϕ^(i)⟩_∆.

Now we expand u_k in the P1 basis (remember that we have encoded the data y in the coefficients of u) and make use of the linearity of the inner product and gradient once more. Because the inner product is restricted to ∆, we can restrict the inner sum to vertices of ∆:

M_ik = −Σ_{∆∈T_i} Σ_{x^(j)∈∆} y^(t,j)_k ⟨v_k · ∇ϕ^(j), ϕ^(i)⟩_∆.

Finally, we factorize the inner products in the same way as we did for the free-form term and plug in the neural network g_θ approximating the velocity field,

M_ik ≈ −Σ_{∆∈T_i} Σ_{x^(j)∈∆} y^(t,j)_k v_∆,k^⊤ ⟨∇ϕ^(j), ϕ^(i)⟩_∆,   v_∆,k = g_θ(t, µ_∆, x_∆, y^(t)_∆)_k.

If we stack this expression over all features k and translate the result into the MPNN framework, we arrive at the transport term that we present as part of T-FEN in Section 3.1.

In this section, we give a brief overview of each of the datasets we used in Section 4 and how we pre-processed them.

Subsampling. Both real-world datasets that we use provide large amounts of dense data on a 2D or 3D grid, which we subsample to simulate sparse measurements of a physical system and to keep the amount of data per timestep manageable. A non-trivial problem is the selection of sampling points, and we decided to use the k-Medoids algorithm. This way, we ensure that we use actual data points as sampling points and reach a roughly uniform but still random cover of the domain. A uniform covering is especially important for the Black Sea dataset because of the non-convex shape of the domain and the small bays on the border that we still want to cover.
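The paper does not tie the subsampling to a specific implementation; the sketch below uses KMedoids from the scikit-learn-extra package, whose medoids are actual data points, matching the requirement above. The point counts are placeholders.

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

rng = np.random.default_rng(0)
dense_points = rng.random((5000, 2))     # stand-in for the dense grid of valid data points

k = 1000                                 # number of sparse sampling points
kmedoids = KMedoids(n_clusters=k, init="k-medoids++", random_state=0).fit(dense_points)
sample_idx = kmedoids.medoid_indices_    # medoids are actual data points by construction
sample_points = dense_points[sample_idx]
```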
The CylinderFlow dataset is a collection of simulated flows characterized by velocity and pressure fields released by Pfaff et al. (2021). The domain is a two-dimensional channel with impenetrable boundaries at the top and bottom and a circular obstacle of varying size and position that blocks the flow. On the left-hand boundary, the velocity field is fixed to an inflow profile that is constant over time but varies between samples. Depending on the obstacle, inflow profile, and flow parameters, the resulting velocity and pressure fields exhibit distinct behaviors, varying between stable flows and oscillations. Pfaff et al. used the COMSOL solver to generate the data by solving the incompressible Navier-Stokes equations. See Fig. 7 for a visualization of the velocity and pressure fields as well as the simulation mesh. The training set contains 1000 simulations, while the validation and test sets have 100 each. Every sequence consists of 600 steps of the flow sampled at a time resolution of ∆t = 0.01s. Spatially, the data is discretized on meshes with about 1800 nodes. A special property of this dataset is that each sequence has a different mesh, because each mesh has a notch representing the cylinder obstacle, which varies in size and position. To enforce the boundary conditions, we fix the velocity fields for all methods on the boundaries except for the outflow, similar to Pfaff et al. On the boundaries of the channel as well as the cylinder boundary, we fix the velocities to zero. The velocity inflow profile stays constant over time, and we hold it fixed during training and inference. The pressure, however, is unconstrained throughout the domain and can also vary on the boundaries. Due to the size of the dataset, we select only the first 100 sequences for training and train only on subsequences starting at a multiple of 10 steps, in effect eliminating overlapping training sequences and reducing the size of the training dataset by another factor of 10. For the evaluation, we choose the test subsequences in the same way, i.e. every subsequence of length 10 starting at time steps 0, 10, and so on. We normalize each feature by its mean and standard deviation across all training sequences.

The Black Sea data is a reanalysis dataset of daily and monthly mean fields of variables such as temperature, flow velocity, and salinity on the Black Sea at 31 depth levels from 01/01/1992 onward. The whole dataset is available as the BLKSEA_MULTIYEAR_PHY_007_004 product from the EU Copernicus Marine Service under a custom license granting permission to freely use the data in research under the condition of crediting the Copernicus Marine Environment Monitoring Service. The spatial resolution of the raw data is 1/27° × 1/36° on a regular mesh of size 395×215 covering the Black Sea. Of the resulting 84,925 grid points, about 40,000 contain actual sea data, and we subsample these valid points to create our sparse datasets. We mesh the chosen sample points with the Delaunay algorithm and filter acute mesh cells on the boundary as described in Appendix H. Additionally, we remove mesh cells that cover more land than water to avoid connecting nodes that are close by air distance but far apart through the sea. We use daily data from 01/01/2012 to 31/12/2017 for training, 01/01/2018 to 31/12/2018 for validation, and 01/01/2019 to 31/12/2019 for testing. As features, we choose the zonal velocities and the water temperature. Both features are taken at a depth of 12.54m, because the dynamics at the surface are strongly influenced by wind, sun, and cloud cover, which are not part of the data and thus make the dynamics non-deterministic from the perspective of the model.
For ScalarFlow, Eckert et al. (2019) captured the resulting flow on video with a calibrated multi-camera setup and reconstructed the velocity and density fields of each recording on a dense 100×178×100 grid for 150 timesteps with their proposed method. At each grid point, the dataset provides 3-dimensional velocity vectors and the current fog density. The resulting dataset represents high-resolution measurements of a dynamical physical system that is mostly driven by buoyancy and transport processes but also evokes turbulence. All plots of this dataset show the fog density. Of the 104 recordings, we use the first 64 for training, the next 20 for validation, and the final 20 as a test set. Before subsampling the grid points, we reduce the data to 2D by averaging over the depth (Kohl et al., 2020) and restrict the data to the central 60 points in x-direction and the bottom 140 points in y-direction, because the fog does not reach points outside of that central box in almost any of the recordings. We normalize the coordinates of the sample points as well as the features by their mean and standard deviation.

We trained all models on all datasets for at most 50 epochs or 24 hours. Most trainings were stopped early, either after no improvement in the validation score had been observed for 5 epochs, or because the ODE-based models ran out of memory when the learned dynamics required too many function evaluations to solve.

The three datasets present us with different circumstances regarding the basic properties of dynamics that would model them well. Both Black Sea and ScalarFlow are best served by a non-stationary model, because it is reasonable to assume that the dynamics of the sea depend on the position, and ScalarFlow has an inlet at the bottom with fixed position. On top of that, we further expect that the Black Sea dataset has a time dependence exceeding the global, yearly fluctuations that we subtract with our normalization scheme. Therefore, we make all models non-stationary and non-autonomous on Black Sea and just non-stationary on ScalarFlow. On CylinderFlow, the models are autonomous and stationary. The mechanism to make FEN and T-FEN non-stationary and non-autonomous is described in Section 3.1. For the baselines, we concatenate the time stamps and the node positions, respectively, to the node features.

ODE-based models such as FEN and CT-MPNN require many function evaluations when their dynamics change abruptly. One possible trigger for that is a jump in their inputs. This would, for example, occur if we represented the time of the year on Black Sea as a number between 0 and 1 to capture the yearly cyclicity of the dynamics; then we would get a discontinuity in the input at the turn of the year. To avoid this jump, we instead embed the time feature on Black Sea two-dimensionally as (sin(t̃), cos(t̃)), where t̃ maps the current time within a year to [−π, π].

For GWN, we use a batch size of 6 on all datasets and a batch size of 3 with PA-DGN. Due to an implementation detail of torchdiffeq as of the submission of this paper, batched solving of ODEs with adaptive-step solvers can introduce interference between independent ODEs, because internal error estimates in the solver are computed across ODEs and the dynamics are stepped jointly. Therefore, we train and evaluate the ODE-based models with a batch size of 1.
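As a sketch of the cyclic time embedding described above, assuming a day-of-year input mapped onto the full circle so that sin/cos are continuous across the turn of the year (the function name and normalization are ours):

```python
import numpy as np

def embed_time_of_year(day_of_year, days_in_year=365.0):
    # Map the position within the year onto [-pi, pi]; sin/cos then agree at
    # day 0 and day 365, removing the jump at the turn of the year.
    t_tilde = 2.0 * np.pi * (day_of_year / days_in_year) - np.pi
    return np.sin(t_tilde), np.cos(t_tilde)

print(embed_time_of_year(0.0))    # ~(0, -1)
print(embed_time_of_year(365.0))  # same embedding: continuous across years
```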
Both f_θ and g_θ are MLPs with 4 hidden layers and tanh non-linearities. In FEN, the layers of f_θ have a width of 128. For T-FEN, we reduce the width of the hidden layers of both f_θ and g_θ to 96, so that FEN and T-FEN have a comparable number of parameters and we can trace back any difference in performance to the model structure rather than capacity. See Table 2 for the parameter counts. We solve the feature dynamics with a dopri5 solver with an absolute and relative tolerance of 10^−6 on CylinderFlow and Black Sea, and 10^−7 on ScalarFlow. On initialization, we set the weights and bias of the last layers of both MLPs to 0 to start with constant dynamics as a reasonable initialization for training. For parameter learning, we use the Adam optimizer with a learning rate of 10^−3 (Kingma & Ba, 2015).

CT-MPNN. Compared to the configuration of Iakovlev et al. (2021), we increased the widths of the MLPs from 60 to 128 and the message dimension from 40 to 128 to make the model capacity comparable to the other baselines. Instead of an implicit Adams-Bashforth solver, we used the same dopri5 solver as for FEN and T-FEN with an absolute tolerance of 10^−6 and a relative tolerance of 10^−7. This both increased the model's performance and avoided frequent convergence issues in the step function of the Adams-Bashforth solver. Finally, we exchanged the RPROP optimizer for a standard Adam optimizer, and we compute the gradients in the same way as for FEN and T-FEN, by backpropagation through the ODE solver. We apply the same zero-initialization and increasing training sequence lengths that we use for our models to this baseline.

PA-DGN. For PA-DGN, we choose two graph network layers of two layers each with a hidden size of 64. We condition the model on 5 timesteps. To represent the structure of the domain, we pass the 4-nearest-neighbor graph of the nodes to the model, following the authors of the method (Seo et al., 2020).

GWN. For GWN, we use the improved configuration described by Shleifer et al. (2019) with 40 convolution channels, extra skip connections, and four of GWN's WaveNet-inspired blocks with two layers each. The model is conditioned on 12 time steps of data and predicts 10 steps ahead. For the long-term extrapolation in Appendix G, we apply the model auto-regressively to get predictions for more than 10 steps. We do not learn an adaptive adjacency matrix, because that comes with a quadratic memory cost in the number of nodes, and Wu et al. have shown that GWN only performs 1.2% worse on METR-LA without the adaptive adjacency matrix. For the computation of the forward and backward transition matrices, we pass in the adjacency matrix derived from the mesh of the domain, where the edges are weighted by the Gaussian kernel

w_ij = exp(−‖x^(i) − x^(j)‖₂² / σ²)

where σ is the standard deviation of ‖x^(i) − x^(j)‖₂.

We train our models with the Adam optimizer (Kingma & Ba, 2015) by minimizing the multi-step prediction error in terms of the mean L1 distance between predicted and actual node features,

L = (1/T) Σ_{j=1}^T (1/N) Σ_{i=1}^N ‖ŷ^(t_j,i) − y^(t_j,i)‖₁,

where ŷ^(t_j,·) is the prediction at time t_j. To accumulate gradients, we backpropagate through the operations of the ODE solver, also known as discretize-then-optimize, as it improves training time significantly compared to backpropagation with the adjoint equation (Onken & Ruthotto, 2020). We observed that training our models to predict complete sequences from the beginning can get them stuck in local minima that take many epochs to escape. To stabilize training, we begin by training on subsequences of length s = 3 and iteratively increase s by 1 per epoch until it reaches the maximum training sequence length.
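A sketch of the training loop under the choices above (mean L1 multi-step loss, dopri5, backpropagation through the solver, and the curriculum on sequence length). The names train_epoch and the data layout are assumptions; model stands for any callable f(t, y) such as a FEN dynamics function.

```python
import torch
from torchdiffeq import odeint

def train_epoch(model, sequences, times, optimizer, seq_len):
    # One epoch of multi-step training at the current curriculum length seq_len.
    for y in sequences:                              # y: (T, N, m) ground-truth features
        y0, target = y[0], y[1:seq_len]
        pred = odeint(model, y0, times[:seq_len],
                      method="dopri5", atol=1e-6, rtol=1e-6)[1:]
        loss = (pred - target).abs().mean()          # mean L1 over steps, nodes, features
        optimizer.zero_grad()
        loss.backward()                              # discretize-then-optimize: backprop through the solver
        optimizer.step()

# Curriculum: start at s = 3 and grow the subsequence length by one per epoch.
# for epoch in range(num_epochs):
#     train_epoch(model, train_sequences, times, optimizer,
#                 seq_len=min(3 + epoch, max_seq_len))
```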
For our super-resolution experiment, we subsampled both datasets 3 times for each fixed number of nodes and evaluated each of the 10 trained models per model class against each of them. Therefore, the means (denoted by the markers) and standard deviations (denoted by the error bars) in Fig. 3 and Fig. 8 have been computed over 30 evaluations. Inspecting the full results in Fig. 8 reveals two things. While T-FEN performs best in super-resolution on both datasets, GWN and PA-DGN also deteriorate only slightly on finer meshes. However, this is offset by the fact that these models did not achieve as good a fit to begin with and are still out-performed by T-FEN at every resolution. CT-MPNN achieved a lower prediction error than T-FEN on the original mesh granularity but generalizes markedly worse than both FEN and T-FEN to finer meshes. We suppose that this is due to the fact that CT-MPNN uses a general MPNN internally, while FEN and T-FEN have an inductive bias towards producing physically sensible predictions because of their structural connection to FEM and PDEs.

In Fig. 10 we show 60-step extrapolations of all models trained on 10-step forecasting. The first thing to note is that PA-DGN struggles with the directionality inherent in the ScalarFlow dataset. While its prediction evolves over time, it exhibits a mostly diffusive dynamic. From a visual standpoint, GWN produces the most realistic-looking forecasts with small-scale details even after many timesteps, especially compared to the ODE-based models that beat GWN in short-term forecasting. We see two reasons for that. First, ODE-based models that learn the data dynamics directly, i.e. all three of FEN, T-FEN, and CT-MPNN, have to work with a tight information bottleneck. Because these models carry information through time only in the features themselves and have no latent state, they can condition their prediction only on a single time step, and this bottleneck makes it difficult to preserve fine details over time. GWN, on the other hand, predicts 10 timesteps at a time and conditions each forecast on the past 12 timesteps. In this way, GWN can extract more information from the input data and also carry that information forward 10 steps at a time, whereas ODE-based models have to preserve the information through roughly 100 solver steps to reach t_10; see the number of function evaluations in Table 1. Second, FEN, T-FEN, and CT-MPNN are ODE models and as such are biased towards smooth predictions over time, and over long time frames this leads to a smoothing in space. Yet, on short time frames, before smoothing sets in, these models are able to model the physical processes more accurately, because ODEs are natural models for these data. For a discussion of the differences between the FEN, T-FEN, and CT-MPNN predictions, see Section 4.

Delaunay triangulation of a set of points can produce very long cells on the boundary, as in Fig. 11, when several points close to the boundary are almost co-linear. This can occur naturally in the data, but it can also happen during the generation of synthetic data. If some boundary nodes in synthetic data are exactly co-linear, such as in grid domains, they might lose the exact co-linearity due to the inexactness of floating-point numbers and operations if we, for example, rotate the domain. To ensure that these artifacts do not impact our models, we pre-process the triangulation with Algorithm 1 and an angle threshold of ε = 10π/180.
Figure 11: Delaunay triangulation can produce elongated, splinter-like cells when interior nodes are close to the boundary of the convex hull, as in this example. (a) The inside angles of the shaded boundary cell are below the threshold and it is removed. (b) The removal has made this (previously) interior cell into a boundary cell that itself falls below the angle threshold and gets removed as well. We post-process the Delaunay triangulations with Algorithm 1 to filter these cells out.

References

Numerical methods for partial differential equations.
Learning dynamical systems from partial observations. CoRR, abs/1902.11136.
Combining differentiable PDE solvers and graph neural networks for fluid flow prediction.
The mathematical theory of finite element methods.
Neural ordinary differential equations.
Fundamentals of numerical weather prediction.
Methods of Mathematical Physics: Partial Differential Equations.
Deep learning for physical processes: Incorporating prior scientific knowledge.
Fast graph representation learning with PyTorch Geometric.
Neural message passing for quantum chemistry.
scikit-fem: A Python package for finite element assembly.
Array programming with NumPy.
Matplotlib: A 2D graphics environment.
Learning continuous-time PDEs from sparse data with graph neural networks.
Computational Seismology: A Practical Introduction.
Inverse problems for partial differential equations.
Graph neural network for traffic forecasting: A survey. CoRR.
Examining COVID-19 forecasting using spatio-temporal graph neural networks. CoRR.
Adam: A method for stochastic optimization.
Learning similarity metrics for numerical simulations.
I. E. Lagaris, A. Likas, and D. I. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks.
Numerical solution of partial differential equations in science and engineering.
Diffusion convolutional recurrent neural network: Data-driven traffic forecasting.
PDE-Net: Learning PDEs from data.
PDE-Net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network.
Can we trust long-range weather forecasts? In: Management of weather and climate risk in the energy industry.
Cartopy: A cartographic Python library with a Matplotlib interface.
Hybrid FEM-NN models: Combining artificial neural networks with the finite element method. CoRR.
Discretize-optimize vs. optimize-discretize for time-series regression and continuous normalizing flows. CoRR.
PyTorch: An imperative style, high-performance deep learning library.
Learning mesh-based simulation with graph networks.
Graph neural ordinary differential equations.
Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.
Einops: Clear and reliable tensor manipulations with Einstein-like notation.
Physics-aware difference graph networks for sparsely-observed dynamics.
Machine learning for spatiotemporal sequence forecasting: A survey.
Convolutional LSTM network: A machine learning approach for precipitation nowcasting.
Incrementally improving Graph WaveNet performance on traffic prediction.
Simulating the spread of COVID-19 via a spatially-resolved susceptible-exposed-infected-recovered-deceased (SEIRD) model with heterogeneous diffusion.
Graph WaveNet for deep spatial-temporal graph modeling.
Hydra: A framework for elegantly configuring complex applications. GitHub.
Deep learning on traffic prediction: Methods, analysis and future directions.
Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting.
Deep spatio-temporal residual networks for citywide crowd flows prediction.

We thank Leon Hetzel for helpful discussions and suggestions. This research was supported by the Bavarian State Ministry for Science and the Arts within the framework of the Geothermal Alliance Bavaria project.

Algorithm 1: Filtering cells with small interior boundary-adjacent angles from a triangulation, given a threshold angle and a set of (d − 1)-dimensional boundary faces.

function FilterCells(triangulation T, threshold ε > 0, boundary faces B)
    repeat
        for all ∆ ∈ T do
            c_∆ ← |{faces of ∆ in B}|          ▷ count boundary faces for each cell
            if c_∆ ≠ 1 then                     ▷ skip cells at an edge or corner of the domain
                continue
            (β, ξ) ← split(∆)                   ▷ split ∆ into boundary face β and interior node ξ
            if MinBoundaryAngle(β, ξ) < ε then
                remove ∆ from T; replace β in B by the remaining faces of ∆
    until no cell was removed
    return T
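A self-contained 2-d sketch of this cell filter, assuming triangular cells: a cell with exactly one boundary edge is removed when an interior angle at that edge falls below the threshold, and the loop repeats because a removal can expose new sliver cells (Fig. 11b). The helper names are ours.

```python
import itertools
import numpy as np

def interior_angle(a, b, c):
    # Interior angle at vertex a of triangle (a, b, c).
    u, v = b - a, c - a
    cos = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def filter_cells(points, cells, boundary_edges, eps=10 * np.pi / 180):
    # Repeatedly remove boundary cells whose interior angles at their single
    # boundary edge fall below eps.
    cells = {tuple(sorted(c)) for c in cells}
    boundary = {tuple(sorted(e)) for e in boundary_edges}
    changed = True
    while changed:
        changed = False
        for cell in list(cells):
            edges = [e for e in itertools.combinations(cell, 2) if e in boundary]
            if len(edges) != 1:          # skip cells at an edge or corner of the domain
                continue
            (i, j), = edges
            (k,) = set(cell) - {i, j}    # interior node opposite the boundary edge
            angles = (interior_angle(points[i], points[j], points[k]),
                      interior_angle(points[j], points[i], points[k]))
            if min(angles) < eps:
                cells.remove(cell)
                boundary.discard((i, j))
                boundary.update(tuple(sorted(e)) for e in ((i, k), (j, k)))
                changed = True
    return cells
```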