key: cord-0487348-d9h84hle authors: Nikitin, Alexander; John, ST; Solin, Arno; Kaski, Samuel title: Non-separable Spatio-temporal Graph Kernels via SPDEs date: 2021-11-16 journal: nan DOI: nan sha: 2585f50befeaa77735f145f95a4cc40ac6815225 doc_id: 487348 cord_uid: d9h84hle Gaussian processes (GPs) provide a principled and direct approach for inference and learning on graphs. However, the lack of justified graph kernels for spatio-temporal modelling has held back their use in graph problems. We leverage an explicit link between stochastic partial differential equations (SPDEs) and GPs on graphs, introduce a framework for deriving graph kernels via SPDEs, and derive non-separable spatio-temporal graph kernels that capture interaction across space and time. We formulate the graph kernels for the stochastic heat equation and wave equation. We show that by providing novel tools for spatio-temporal GP modelling on graphs, we outperform pre-existing graph kernels in real-world applications that feature diffusion, oscillation, and other complicated interactions. Contemporary machine learning typically leverages structure within the modelling task or application. Often, the structure can be represented as a graph, where the objects and relations are represented as nodes and edges. Such tasks can be found, for example, within chemoinformatics (Tsuda, 2011), social network analysis (Campbell et al., 2013) , and anomaly detection (Akoglu et al., 2015) . The abundance of graph-structured problems has led to the invention of new methods for graph-structured data, including graph neural networks (GNNs, Scarselli et al., 2008) and spectral methods (Chung and Graham, 1997) for capturing structure across space and time. Working with the graph allows for direct modelling of spatiotemporal diffusion over the graph nodes as described by the SPDE. Figure 1: We start with an unknown process on a particular domain with an SPDE model, discretize it, and derive the GP covariance function from this SPDE. Finally, we use GP inference in the spatial or spatio-temporal model from discrete graph-structured observations. In this work, we focus on graph-based prediction of temporal signals on graphs. In other words, the goal is to approximate spatio-temporal processes on graphs. Many real-life problems belong to this class of tasks, including traffic prediction (Jiang and Luo, 2021) , epidemiological modelling (Keeling and Eames, 2005) , and analysis of dynamics in social networks (Carrington et al., 2005) . Recently, a number of approaches have been developed for temporal machine learning problems on graphs, including the works by Li et al. (2017) , Yu et al. (2018), and Zhao et al. (2019) . We approach these graph-based learning problems using Gaussian processes (GPs, Rasmussen and Williams, 2006) . GPs are widely used tools in statistical machine learning, where including prior knowledge is desired. Their probabilistic treatment allows for convenient inference machinery and enables capturing the uncertainties associated with predictions. However, it is not arXiv:2111.08524v2 [cs. LG] 22 Mar 2022 always apparent how to encode relevant prior knowledge into the GP prior on graphs. In the continuous domain, bringing in prior knowledge from stochastic partial differential equations (SPDEs) has enabled constructing useful kernels, including the Matérn family of kernels (Whittle, 1963; Lindgren et al., 2011) , resulting in improvements in quality and performance in many tasks. In this work, we ask whether analogous constructs can be made for graphs and whether they improve the performance of GPs on spatio-temporal graph problems. We hypothesize that because many physical processes follow differential equations, deriving kernels from their stochastic counterparts will allow incorporating physical priors in GP models. Apart from challenges in the model specification for graph-based GPs, the usual drawbacks also apply: closed-form inference is only possible for conjugate observation models, and a naïve treatment results in a cubic computational scaling with respect to the number of observations (Rasmussen and Williams, 2006) . This means that vanilla GPs are not particularly suitable for large-scale problems. However, a number of modern methods exist to overcome this issue, including sparse Gaussian processes using inducing points (see Liu et al., 2020 , for a recent review). The development of SPDE-based approaches to Gaussian processes has led to several practical solutions. One of those is a popular R package for approximate Bayesian inference: R-INLA (Lindgren et al., 2015) that uses SPDEs to sample from spatial models. The approach we propose in this paper allows extending this framework to graph problems, and it is illustrated in Fig. 1 . The contributions of this paper: (i) We extend methods from spatial statistics and, hence, link GPs to SPDEs for temporal signals on graphs. (ii) We apply this approach to spatio-temporal modelling on graphs and consider the analogs of well-known SPDEs from spatial statistics, which allows us to derive nonseparable spatio-temporal kernels on graphs. (iii) We empirically evaluate these kernels and show their effectiveness on both synthetic data sets and applied machine learning problems: prediction of the distribution of chickenpox and COVID-19 epidemic. The connection between Gaussian processes and stochastic (partial) differential equations was established by Whittle (1963) . It resulted in a number of works in spatial statistics, most notably the work by Lindgren et al. (2011) , the library R-INLA (Lindgren et al., 2015) , and by Solin (2016) in the spatio-temporal context. These methods have been applied in many domains, for example, animal movement prediction (Hooten et al., 2017) , geographical information system modelling (Burrough et al., 2015) , and geostatistical analysis (Moraga et al., 2017) . To apply SPDE approaches to graphs, we need to introduce spectral graph theory, which has a long history of research and applications (e.g., Chung and Graham, 1997; Von Luxburg, 2007) . These works typically consider the spectrum of the graph Laplacian and derive properties that can be helpful for different tasks on graphs. Belkin and Niyogi (2003) used spectral graph theory to develop invariant embedding maps. These embedding maps then showed good results in dimensionality reduction (Yan et al., 2007) . Spectral graph theory is linked to GPs on graphs by considering suitable spatial (pseudo-)differential operators and their associated kernels (covariance functions). A number of graph kernels were developed using the methods from spectral graph theory. Smola and Kondor (2003) proposed the construction of kernels on graphs by introducing random walk and heat kernels on graphs, evaluated their performance, and derived necessary properties of the kernels. Recent work on graph approaches to Matérn fields considers graph discretization and theoretically compares the covariances produced by discrete approximation and Matérn kernels (Sanz-Alonso and Yang, 2020). Graph representation learning has gained popularity because of its large number of applications (Chami et al., 2020) . Kernel methods, such as kernel ridge regression, for prediction on graphs were proposed by Romero et al. (2016) . This approach was extended to spatio-temporal signals on dynamic graphs (Romero et al., 2017) . Matérn graph kernels were presented by recent works: Matérn kernels on manifolds (Borovitskiy et al., 2020) , and, then, on graphs (Borovitskiy et al., 2021) . We extend and generalize this approach with the SPDE framework for graph kernels and then go beyond spatial graph kernels to the spatio-temporal domain. Gaussian processes are a non-parametric machine learning paradigm, in which we model the target function as a stochastic process, whose evaluation at any finite set of points has a joint Gaussian distribution (Rasmussen and Williams, 2006) . A Gaussian process f (x) ∼ GP(m(x), k(x, x )) is defined by its mean function m(x) and covariance function k(x, x ) (Bishop, 2006) . The covariance function is often called a 'kernel'. The kernel encapsulates prior knowledge, and defining a good kernel is one of the key ingredients and challenges of setting up the GP model (Duvenaud, 2014) . GPs can be extended to vector-valued functions using multioutput GPs (Álvarez et al., 2012). Kernels then become matrix-valued: K(x, x ) : R n × R n → R m×m , for n-dimensional inputs and m-dimensional outputs. Standard GP toolchains include various kernels on continuous domains (e.g., periodic, polynomial, and Matérn family). However, application of GPs to other domains is often restricted by the lack of principled kernels. In spatial statistics, Gaussian random fields can be represented as solutions to an SPDE. The physical sense of an underlying process can be included in GPs in a natural way by forming the prior covariance function as a solution to an SPDE (Särkkä, 2011) . Many widely used GP kernels can be derived from the corresponding SPDEs. For example, the Matérn kernel family for multidimensional GPs can be derived from underlying SPDEs (Whittle, 1963). Stochastic partial differential equations (SPDEs) generalize partial differential equations via introducing random forces (noise) to some terms and coefficients, similarly to how stochastic differential equations generalize ordinary differential equations (Oksendal, 2013; Gardiner, 2004; Särkkä and Solin, 2019) . SPDEs are applied in various fields, including physics, signal processing, and machine learning (Schottl, 1997) . Spectral graph theory allows us to derive graph properties by operating with the spectrum of its Laplacian matrix (a.k.a. graph Laplacian), which will allow us to define partial differential equations (PDEs) on graphs. Definition 3.1. The graph Laplacian L of a graph G = (V, E) is the matrix where W is the matrix of the edge weights and D W = diag(w i ) is the diagonal matrix of the accumulated weights w i = j:(i,j)∈E w ij . In the unweighted case, D W is the degree matrix and W is the adjacency matrix. The graph Laplacian can be normalized to have unit diagonal entries. All derivations in this article apply to both normalized and non-normalized Laplacians in weighted and unweighted graphs. In practice, the choice between normalized and unnormalized Laplacians depends on the graph structure and should be based on the particular modeling task. We consider a spatio-temporal prediction problem on a graph G = (V, E) given a data set ( where the domain D = V ×R describes graph vertex and time. We focus on the regression task, where y i are the targets from the real numbers R. We aim to approximate a function f such that f : D → R. The methods we propose are applicable for classification problems f : D → C for a finite set of labels C as well. Notation. If not stated otherwise, we denote f (x) : R n → R a continuous spatial function, v ∈ R |V | a spatial function on a graph (i.e., a vector whose components are the function values at the vertices), and u(t) : R → R |V | a spatio-temporal function on a graph. We will denote w(x) spatial white noise 1 , w a |V |dimensional white noise 2 , and dW t the differential of |V |-dimensional Brownian motion. dt is the formal temporal derivative of Brownian motion 3 , and W t is Brownian motion. Our framework for deriving graph kernels via SPDEs consists of the following high-level steps: (i) define an SPDE using prior knowledge about the underlying process, (ii) convert it to a graph counterpart, (iii) solve the graph counterpart, and (iv) derive corresponding mean and covariance function of GP on graph. Next, we will provide specific examples and use this framework to derive novel kernels. Many PDEs are defined using the Laplace operator ∆. In Euclidean space, it is defined as ∆ = n i=1 ∂ 2 ∂x 2 i . On graphs, we define as the corresponding operator the negative graph Laplacian −L. This is justified by considering a lattice approximation of the real plane and ensuring that this operator represents exactly the discretization of ∆ (Wardetzky, 2020). We illustrate the approach by considering the analogue of Laplace's equation on graphs. Laplace's equation, ∆f (x) = 0, is a second-order partial differential equation relevant across physics, for example, in fluid dynamics and electromagnetism. Its stochastic counterpart, known as Laplace's stochastic partial differential equation, is ∆f (x) = w(x). For the analogue of Laplace's SPDE on a graph, we replace ∆ with −L and arrive at − Lv = w. (2) Because the space we are considering is discrete and the graph Laplacian replaces the Laplace operator, the graph counterparts of SPDEs become multidimensional SDEs. For a formal derivation using discrete calculus, one can consider v as a 0-cochain, as in Grady and Polimeni (2010) . Within our work, it is enough to operate with v as a vector. Eq. (2) defines the notion of harmonic functions on graphs: in functions that are solutions to Eq. (2), the value at each node is given by the mean of adjacent nodes. The solution v of Eq. (2) can be described as a Gaussian process (which in the discrete case, by definition, is a multivariate normal distribution): where A + denotes the Moore-Penrose pseudoinverse of matrix A (Moore, 1920) . This defines the Laplacian kernel on graphs, K = (L L) + . Another example of the SPDE framework applied to graph kernels is the Matérn family of kernels. The Matérn kernel family is an essential kernel family for many applications of GPs. The isotropic (rotation and translation invariant) Matérn kernels are defined as (Rasmussen and Williams, 2006) k Matérn (r) = 2 1−ν γ ν K ν (γ)/Γ(ν), γ = √ 2νr/κ, (4) where r = x − x is the Euclidean distance, κ is the lengthscale, and ν controls smoothness (samples from a GP with this kernel are ( ν − 1)-times differentiable). The Matérn kernel in R d can be derived from this SPDE with the fractional Laplace operator (Lindgren et al., 2011) : This motivates us to use the fractional graph Laplacian: Definition 4.1. Fractional graph Laplacian L: where ν and κ are positive scalar parameters. The fractional graph Laplacian is more flexible than L (for ν = 2 and κ → ∞, it coincides with L). Additionally, for some combinations of hyperparameters, the fractional graph Laplacian possesses non-local properties (see Benzi et al. (2020) ). We can formulate Eq. (5) on graphs as For a self-adjoint Laplacian L the covariance matrix is and the solution of Eq. (7) is v ∼ N (0, K). These results were recently considered by Borovitskiy et al. (2021) , and here we showed how they fit into a broader SPDE framework. To obtain a kernel on the combination of two domains (here, graph and time), one can take kernels for each domain and combine them, commonly as a product: , where x and x belong to the spatial domain and t and t to the temporal domain. The separability of a kernel is a choice made for convenience. An underlying process is not necessarily separable; an example of spatial covariance structure changing over time can be easily constructed: for instance, in spatio-temporal epidemiological modelling, the structure of the spatial covariance changes over time (e.g., travel restriction or development of a vaccine), thus it is not enough to separate spatial and temporal structure, and they have to be considered jointly. We describe non-separable spatio-temporal graph kernels using the SPDE framework. Let us consider two widely used parabolic and hyperbolic equations: • Heat equation: ∂f ∂t = c∆f , • Wave equation: ∂ 2 f ∂t 2 = c 2 ∆f . We will define the stochastic counterparts of these equations on graphs, derive the corresponding covariance functions, and study them. The kernel derivations from these two SPDEs are formulated as propositions. The full derivations are in Appendices B and C. The heat equation is written as ∂f ∂t = c∆f . This PDE is called the heat equation because it is used in physics for modeling heat transfer on surfaces. If f is temperature, the Laplace operator on the equation's right-hand side determines the difference between the temperature at a certain point and its neighboring points on a surface. We consider the discretization of this equation using the graph Laplacian. Let u(t) be a temporal signal on a graph which obeys the heat equation which is now an ODE with the solution This solution is known as the heat kernel. This result plays an important role in graph theory. The positive semi-definite matrix e −cLt in this equation can be used as a kernel on the graph. This kernel is related to graph random walks, which are used widely in graph algorithms (Lovász, 1993) . To see this, consider a Taylor expansion of this kernel (assume c = 1 for simplicity): A is the random walk matrix of the graph. In fact, this kernel defines continuous-time random walks, where each jump occurs in Poisson(1) time, similarly to the continuous case. By adding temporal white noise, we can construct a heat SPDE on graphs: Proposition 1. The stochastic heat equation kernel (SHEK) on graphs can be defined by adding spatio-temporal white noise, or for convenient integration, as a formal derivative of the Wiener process The solution is given by a Gaussian process: Or, when the matrix L is self-adjoint (the graph is undirected), as Cov The kernel is parameterized by diffusivity c, variance σ, and parameters of the fractional Laplacian ν and κ. We extend this result to the case of matrix-scaled white noise: Proposition 2. SHEK with matrix-scaled noise. A scaled extension of SHEK can be derived from a stochastic heat equation on graphs with a matrix-scaled white noise: The solution for undirected graphs is defined by a Gaussian process with covariance matrix in the following form: where P is a unitary matrix: The matrix P exists because L is normal and positive definite. C(t, s) is defined for t ≥ s as: For a directed graph, the stationary covariance matrix C * can be found from the Lyapunov equation Spreading waveforms in physics, such as mechanical waves (e.g., sound waves or seismic waves) as well as light waves, are commonly described by the wave equation: This PDE can be considered on graphs as follows: The solution of this equation for undirected graphs has the form where P is defined by diagonalization of the Laplacian matrix: L = P −1 diag(λ 1 , . . . , λ |V | )P . The solution of the stochastic wave equation cannot be used directly as a GP kernel (as it was with the heat equation), but this result is necessary for the derivation of a GP kernel from stochastic wave equations on graphs. In this section, we consider a stochastic extension of the wave equation on graphs. Adding stochasticity allows us to derive a GP kernel that will include prior information about wave nature of the underlying process. Proposition 3. The stochastic wave equation kernel (SWEK) on undirected graphs is defined by the second-order matrix differential equation and a solution to this equation for undirected graphs can be expressed by the Gaussian process: where Θ = c L and P is defined by the diagonalization of the fractional Laplacian matrix: We can observe that eigenvalues of the kernel oscillate with different frequencies. It makes this kernel potentially useful for periodic processes on graphs or discretizations of continuous periodic processes with a graph lattice. The stochastic wave equation kernel can be extended to directed graphs similarly by solving the graph wave equation through the eigenvectors of L. In a single-vertex graph, this kernel coincides with the product of Brownian motion kernel and cosine kernel. Many methods for sampling from GPs are available. A common method involves a Cholesky decomposition of the covariance matrix. The computational cost of sampling is a sum of O(n 3 ) on Cholesky factorization and O(n 2 ) (matrix-vector multiplication) for each sample, where n is the number of points. Many other sampling approaches are based on the expansion of the random field (process) as a series of basis functions. Those methods include Karhunen-Loeve expansion using the eigenfunctions of a covariance function, circulant embedding methods based on Fourier basis (Graham et al., 2018) , and a hierarchical matrix approximation of the covariance matrix (Feischl et al., 2018) . The SPDE based approach (Lindgren et al., 2011) requires solving an SPDE using the finite element method, which approximates the GP with a Gaussian Markov random field (GMRF). The SPDE framework allows for using the numerical solvers of SPDEs, for example, the Ruge-Stüben solver (Ruge and Stüben, 1987) for the spatial case or Euler-Maruyama (Higham et al., 2002) for the spatiotemporal case. This idea especially shines in the graph domain because the Laplacian matrix allows considering these equations as matrix equations; thus, it allows using all of the existing tools for numerical solutions of matrix equations. Moreover, in principle, continuous domains can be discretized via graph representations, and then numerical solvers can be used for sampling from GPs over continuous domains. We consider this opportunity as one more justification for the SPDE framework and leave its development for future works. For a deeper understanding of the properties of the kernels and a more pictorial tour of their properties, we study visualizations of the kernels on a simple graph with different values for the kernel hyperparameters. In Fig. 2 , we show the visualizations of the proposed kernels on a toy model: a line graph of three vertices. The first row shows the temporal part of the covariance matrix (summed over the graph vertices at each timepoint). The following three rows show mean (black) and samples (colored lines) as a function of time at each of the nodes, conditioned on y(t = 0) = (0, 0, 10), for different values of the hyperparameter c (ν = 3/2, κ = 1). For the temporal component, SHEK shows the expected behavior for a diffusion process: covariance is high between nearby time points and decreases with the temporal gap, whereas SWEK shows periodic dependence between temporal components (periods of high covariance are alternating with periods of low covariance). Additionally, stochasticity accumulates with time resulting in higher covariance for later timestamps (analogously to Brownian motion). To highlight practical applicability of the proposed methodology and benchmark against alternative approaches, we evaluate the described kernels on graph spatio-temporal applications: heat distribution over a one-dimensional line (Sec. 5.1), spreading of COVID-19 across the United States (Sec. 5.2), and spreading of chickenpox over Hungarian counties (Sec. 5.3). We used Gaussian process regression with Gaussian likelihood as a method for fitting the models. We maximized the log marginal likelihood with respect to kernel hyperparameters (σ and c for SWEK and SHEK) with the L-BFGS optimizer (Liu and Nocedal, 1989) . All the experiments were implemented using the GPflow (Matthews et al., 2017) library in Python. Extrapolation. Our experiments were repeated for different training and testing intervals (we call them validation rounds) for a given data set using sliding window backtesting. If we denote number of training timepoints as N train , number of testing timepoints as N test , and by t r the starting time for the r-th validation round, then sliding backtesting uses time interval [t r , . . . , t r + N train ] as the training data, and [t r + N train + 1, . . . , t r + N train + N test ] as test data. Interpolation. To evaluate interpolation properties of the methods, we randomly selected ten percent of the time interval [t r , . . . , t r + N train + N test ] as testing data set and used the rest for training. Evaluation. In forecasting problems, modelling may result in a high variance of the generalization error across the validation rounds. Thus, the comparison of generalization error over the rounds by confidence intervals can be inaccurate. The generalization errors should be evaluated with a thorough point-wise comparison over each evaluation. To make statistically significant conclusions, we used the Diebold-Mariano test (Diebold and Mariano, 2002) . The variance of the generalization error is also an important model property that shows the method's robustness over different validation rounds. We tested both interpolation and extrapolation performance and report mean absolute error (MAE int and MAE ext ) or mean absolute percentage error (MAPE int and MAPE ext ). We also report 95% confidence intervals over validation rounds. As a simple spatio-temporal process, we considered heat transfer in 1D on a line. We parameterized the process with conductivity k, and considered a fundamental solution of the heat distribution process: Φ(x, t) = 5/(4πkt) exp(−x 2 /4kt). We modelled the heat transfer distribution for a one-dimensional segment and discretized it as a linear graph with 21 vertices. We compared separable and non-separable graph kernels for temporal signals. We used a continuous timeseparable kernel as a reference model. We used 50 timestamps for training data and 10 timestamps as test data; we performed ten iterations of backtesting validation with different random seeds and different starting time points. The results are presented in Table 1; we report extrapolation and interpolation MAE with 95% confidence intervals, marking the top re- sults of the graph kernels in bold. We observed that the non-separable stochastic heat kernel outperformed separable kernels on the extrapolation task, but the product graph Matérn kernel and RBF outperformed it on the interpolation task. As expected, exponential kernel over continuous space outperformed graph discretization for extrapolation but did not outperform the product of graph Matérn kernel and RBF over time. As a real-world use-case of the proposed method, we considered the prediction of COVID-19 distribution and spread across the USA. We used data about COVID-19 cases and deaths published by The New York Times (2020). We aggregated the number of cases by week for each state. Moreover, we handcrafted a graph that represents the geographical neighbors of the states in the US. Each node of the graph represents a state, and two nodes are connected with an undirected edge if the corresponding states share a common border. The approach can be extended to other graphs that arise from, for example, traffic links or information about the flights. As a modelling target for the extrapolation task, we aimed to predict how many cases there will be in each state during the next two weeks. For numerical stability, we predicted the natural logarithm of this value. We used 33 weeks as training data and estimated the number of cases for the subsequent two weeks. We ran all experiments ten times with a sliding starting point and different random seeds. The goal of the experiment was to show that the proposed kernels can achieve good results within the GP framework. We argue that GPs are particularly useful in this case because of their excellent interpolation properties and the possibility to assess the uncertainties, which is crucial for decision-making. We compared the results with a graph neural network (GNN), for which we used a diffusion convolutional recurrent neural network (DCRNN) layer (Li et al., 2017) followed by a fully connected layer. We used the implementation from the library 'PyTorch Geometric Temporal' (Rozemberczki et al., 2021a) . We also deepened the architecture but the results remained largely unaffected. For this reason, we report the results for the simplest architecture. The results are shown in Table 2 . We can see that the models converged to a relatively low MAPE, and SHEK showed the best performance. This shows that the proposed methods are useful for the task we consider. The results on the interpolation task showed better performance for SHEK (DM-test, p = 0.1). The results on extrapolation performance are equal for considered separable and non-separable cases. However, SHEK's performance was more stable across validation rounds on the extrapolation task, resulting in two times smaller 95% confidence intervals of generalization error. When the extrapolation period was increased to four and six weeks, SHEK performed consistently better than other kernels: four weeks period MAE of SHEK(ν = 1/2) was less than MAE of the product of Matérn(1/2) and RBF (DM-test, p < 0.01), as well as on six week period (DM-test, p < 0.05). Experiment with a flight graph. Additionally, We conducted an experiment to check whether non-spatial graph information can be included in the Covid-19 modeling task. We collected information about flights between the states and used this graph instead of geographical adjacencies. Results showed MAE statistically better for non-separable (p = 0.05, DM-test); MAPE showed no statistically significant difference between separable and non-separable kernels in this case. We evaluated our methods on a data set of chickenpox cases in Hungary (Rozemberczki et al., 2021b) . The data set contains the weekly aggregated numbers of cases of chickenpox in Hungarian counties and Budapest from 2005 until 2015. The spatial relation of counties is represented as a graph with each county as a node (in total 20 nodes) and unweighted edges that indicate adjacency of the counties (in total 61 edges). This data set has a different structure than the COVID-19: the epidemic has strong seasonality and does not have periods of fast exponential growth. We observed similar results to the COVID-19 case: GP approaches outperformed the GNN approach, SHEK has shown comparable performance with separable kernel on the extrapolation task with more stable performance across the validation rounds. On the long-term (six weeks) extrapolation task, DCRNN and SHEK outperformed other approaches. The separable kernel has shown better performance on the interpolation task than SHEK (DM-test, p = 0.05). We repeated the experiment for the longer extrapolation periods: four and six weeks and observed that SHEK performed more stable and consistently better than separable kernels (DM-test, p = 0.1 for four weeks and p = 0.05 for six weeks). We also experimented with other types of separable kernels and selected Matérn-3/2 plus RBF as a consistently accurate combination on interpolation and extrapolation tasks. We provide full results of the evaluation and point-wise comparison in Appendix E.2. We considered spatio-temporal problems on graphs. We showed how these can be addressed by productseparable kernels and, then, introduced a framework to derive spatial and spatio-temporal graph kernels based on SPDEs. As concrete instances, we considered the stochastic heat and wave equations on graphs, and solved them to arrive at novel types of graph kernels: the stochastic heat equation kernel (SHEK) and the stochastic wave equation kernel (SWEK). We compared GPs with the proposed kernels against DCRNNs, and demonstrated that our approach outperforms graph neural networks on some machine learning tasks. Following our method for deriving kernels from SPDEs, this work creates a possibility for direct use of additional prior information in graph problems by creating new flexible kernels on graphs. These methods can be applied to other types of equations resulting in either an analytical solution or an SPDE that can be numerically solved. It may lead to the more frequent use of probabilistic methods on graphs overcoming predictive power issues with the pre-existing kernels. Our work opens up new avenues of research, such as the solution of other types of equations on graphs or discretization of continuous kernels using graph approximation. Moreover, it gives machine learning tools that can be used for essential problems on graphs, such as epidemic modelling. As a side contribution of our work, we constructed a timely data set of COVID-19 cases over a graph that can be used to evaluate temporal machine learning problems on graphs. In future work, the scalability of the methods to larger graphs has to be explored. Possible approaches include state space models, numerical solvers for SPDEs, and sparse methods. In our proofs, we rely on the following theorem. Theorem A.1 (Itô isometry, Oksendal (2013)). where X t and Y t are stochastic processes adapted to natural filtration. This property will be used actively in derivations of covariances from SDEs. Lemma A.2 (Integration of the matrix exponential). where A is a nonsingular matrix. In this section, we introduce spatial kernels on graphs derived by the SPDE framework. Proposition B.1 (Laplace's equation in kernel form). Consider a signal (process) v ∈ R |V | on a graph G, where L is the Laplacian of G and the signal is described with the Laplace SPDE on the graph: −Lv = w. Then the solution v can be described as a Gaussian process: Proof. By writing out v = −L + w, (B.2) and noting that the covariance matrix of white noise w is I, the covariance of the random variable u is Proposition B.2 (Matérn kernels on graphs). Matérn kernels on graphs are described with the equation and have the form For the self-adjoint Laplacian L: Thus, the solution of Eq. , then v = A −1 w. Now, we can derive a covariance matrix, analogously as in Proposition B.1. Proof. The solution of the differential equation is similar to the homogeneous initial value problem in the continuous case and can be written in terms of a matrix exponential: For stochastic heat and wave equations, we will be using the pseudo-differential operator L instead of the graph Laplacian because this operator is more flexible than L (for ν = 1 and κ → ∞, it coincides with L), and because it possesses non-local properties as was shown by Benzi et al. (2020) . Proposition C.2 (Stochastic heat equation on graphs). The stochastic heat equation can be defined on a graph by adding spatio-temporal white noise, or for convenient integration, as the formal derivative of Wiener process The solution can be defined as a Gaussian process: Or, when the matrix L is self-adjoint (the graph is undirected), as Proof. Let Γ = c L. Equation (C.4) is a matrix differential equation and a solution can be written in the form: The solution in Eq. (C.7) can be expressed as a GP. We give the corresponding mean and covariance as follows. Using Itô isometry, we can derive the covariance: Or, in the case of self-adjoint L the expression can be simplified: Proposition C.3 (Stochastic heat equation on undirected graphs with matrix-scaled white noise). Let us consider the same equation as in the previous theorem but with Σ-scaled white noise on undirected connected graphs: The covariance can be derived as follows. Consider Γ = c L and a diagonalization of Γ: P LP * = P L T P + = diag(λ 1 , λ 2 , . . . , λ n ). (C.12) Then, for t ≥ s: Cov u(t), u(s) = P * C(t, s)P , (C.13) where C(t, s) is defined as: ). (C.14) Proof. Let us write covariance between u(t) and u(s) and apply Itô isometry Because the matrix L d is diagonal, we can write the C(t, s) ij (for t ≥ s): Proposition C.4 (Wave equation on an undirected graph). The wave kernel on undirected graphs is defined by the second order matrix differential equation (C.18) and the solution of this equation has a form: Proof. L is a symmetric matrix and, consequently, diagonalizable: Then the equation will take a form: .. Replacing the variable y := P u, we will got the following equation: .. y + c 2 L d y = 0. This is equivalent to n independent scalar ODEs: .. The solution of each of these equation is: Let the initial conditions of the system be given as y(0), . y(0). Then, for λ k = 0 c 1 and c 2 can be expressed by and c 1 = c 2 = 0 for λ k = 0. By doing the reverse substitution, we get the result for u: Proposition C.5. The Stochastic wave equation kernel (SWEK) on undirected graphs is defined by second order matrix differential equation C.26) and the solution to this equation can be expressed as a Gaussian process: Here, Θ = c L. Proof. The homogeneous solution for wave equation is given in Theorem C.4: Let us define Θ = c L for convenience. In order to solve the given SPDE we need to find the inhomogeneous part. It can be found by variation of the parameters. The Wronskian of the basis functions (here denoted as Wr to distinguish it from the Wiener process) is: for v 1 = cos(Θt) and v 2 = sin(Θt). The particular solution will have the form: Then the solutions for the stochastic wave equation on graphs are Figure S1 : Temporal visualizations of SHEK (heat, left) and SWEK (wave, right) on a linear three-node graph. The first row shows the temporal part of the covariance matrix (summed over the graph vertices at each timepoint). The following three rows show mean (black) and samples (colored lines) as a function of time at each of the nodes, conditioned on y(t = 0) = (0, 0, 10), for different values of the hyperparameter c. We ran the experiments on a NVIDIA Tesla P100-PCIE-16GB GPU. We repeated the evaluations for several validation rounds (from eight to 12 depending on the experiment) using sliding window backtesting. Sliding window backtesting is schematically visualized in Figure S3a and explained in the caption. For all measurements, we report 95% confidence intervals. (a) Sliding window backtesting visualization. At each step, we select a training time interval ( ) followed by a testing time interval ( ) and evaluate the performance. During the next iteration, we select time intervals that are shifted by a particular value. (b) Visualization of the graph of adjacent states. Each node is a state in the US, and each edge indicates the adjacency between two states. E.1 Synthetic Wave Experiments We generated a wave distribution process over a one-dimensional line and discretized it with a graph with 11 vertices. We performed 12 iterations of sliding window backtesting on interpolation and extrapolation tasks. For the interpolation task, we used ten percent of randomly selected measurements over 52 timepoints. For the extrapolation task, we used 50 timepoints for training followed by two timepoints where we measured generalization error. We compared SHEK(ν=5/2) and SWEK(ν=5/2). We make conclusion that MAE is better for SWEK in interpolation (DM-test, p < 0.01) and extrapolation (DM-test, p < 0.01) tasks. The results are presented in Table S1 . The difference between SHEK and SWEK can be seen visually in Figure S4 . SWEK allows extrapolating wave behavior beyond training data, and SHEK, in contrast, generalizes badly on the synthetic wave dataset. We performed sliding window backtesting for 12 iterations. We report point-wise graphs that illustrate the comparison of generalization error on the extrapolation tasks with extended extrapolation periods in Figure S5 . The statistical significance can be measured with the Diebold-Mariano test. For example, on four weeks extrapolation period, MAE of SHEK(ν = 1/2) is less than MAE of the product of Matérn(3/2) and RBF (DM-test, p < 0.1), as well as on six week extrapolation period (DM-test, p < 0.01). We also experimented with a larger number of separable kernels and compared them with each other and SHEK. For example, we provide results of four-weeks extrapolation in Figure S6 . Figure S6 : Visualization of evaluation with extended extrapolation periods of four and six weeks on the Hungarian chickenpox dataset (more separable kernels). The dataset for this use-case consisted of two parts: information about COVID-19 cases and deaths published by The New York Times The New York Times (2020), and a graph that was generated as follows. Each vertex represents a state, and two nodes v 1 and v 2 are connected if two states share a common border. The visualization of the graph is presented in Figure S3b . We evaluated the performance of the model using ten runs with a sliding window backtesting. We used 33 weeks as training data and estimated the number of cases for the following two weeks. As a metric, we used mean absolute error (MAE). We additionally report the results of extrapolation over four and six weeks periods of time in Figure S7 . We can observe that the MAE of SHEK modifications is lower than separable Matérn kernels. It can be quantitatively measured using the Diebold-Mariano test. For example, four weeks period MAE of SHEK(ν = 1/2) is less than MAE of the product of Matérn(1/2) and RBF (DM-test, p < 0.01), as well as on six week period (DM-test, p < 0.05). Next, we provide the visualizations of fitted GP using the proposed graph kernel for the three states with the largest population in the USA. In Figure S8 , we showed 97.5% confidence intervals provided by the GP that was trained on COVID-19 dataset. Working with the graph allows for direct modelling of spatio-temporal diffusion over the graph nodes as described by the SPDE. Graph based anomaly detection and description: a survey Kernels for vector-valued functions: A review Laplacian eigenmaps for dimensionality reduction and data representation Non-local network dynamics via fractional graph laplacians Pattern Recognition and Machine Learning Matérn Gaussian processes on graphs Matérn Gaussian processes on Riemannian manifolds Principles of Geographical Information Systems Social network analysis with content and graphs Models and Methods in Social Network Analysis Machine learning on graphs: A model and comprehensive taxonomy Spectral Graph Theory Comparing predictive accuracy Automatic Model Construction with Gaussian Processes Fast random field generation with H-matrices Handbook of Stochastic Methods for Physics Discrete calculus: Applied Analysis on Graphs for Computational Science Analysis of circulant embedding methods for sampling stationary random fields Strong convergence of euler-type methods for nonlinear stochastic differential equations Animal Movement: Statistical Models for Telemetry Data Graph neural network for traffic forecasting: A survey Networks and epidemic models Diffusion convolutional recurrent neural network: Data-driven traffic forecasting Bayesian spatial modelling with R-INLA An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach On the limited memory BFGS method for large scale optimization When Gaussian process meets big data: A review of scalable GPs Random walks on graphs. Combinatorics GPflow: A Gaussian process library using TensorFlow On the reciprocal of the general algebraic matrix A geostatistical model for combined analysis of point-level and area-level data using INLA and SPDE Stochastic Differential Equations: an Introduction with Applications Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning Kernel-based reconstruction of space-time functions on dynamic graphs Kernel-based reconstruction of graph signals Py-Torch Geometric Temporal: Spatiotemporal signal processing with neural machine learning models Chickenpox cases in hungary: a benchmark dataset for spatiotemporal signal processing with graph neural networks Algebraic multigrid The SPDE approach to Matérn fields: Graph representations Linear operators and stochastic partial differential equations in gaussian process regression Applied Stochastic Differential Equations The graph neural network model Stochastic partial differential equations: A modeling, white noise functional approach Kernels and regularization on graphs Stochastic Differential Equation Methods for Spatio-Temporal Gaussian Process Regression This work was supported by the Academy of Finland Flagship Programme Finnish Center for Artificial Intelligence FCAI, Academy of Finland grants 339730, 341763 and 324345, and UKRI Turing AI World-Leading Researcher Fellowship, EP/W002973/1. We also acknowledge the computational resources provided by the Aalto Science-IT Project from Computer Science IT. We thank Ivan Yashchuk and Ivan Yaroslavtsev for their helpful comments. Opening the brackets in C(s, t) and using Itô isometry, we will get a sum of the four following expressions:1.cos(Θt) cos(Θs) + − cos(Θt) cos(Θs) sin(2Θ min(t, s)) 4 + sin(Θt) sin(Θs) sin(2Θ min(t, s)) 4 + cos(Θt) sin(Θs) cos 2 (Θ min(t, s)) 2 − 1 2 + sin(Θt) cos(Θs) cos 2 (Θ min(t, s)) In Figure S1 , we can observe behavior expected for heat and wave processes that start from the third node on a line graph. We observed similar visualizations for other types of graphs.