Learning Time-Varying Graphs from Online Data
Alberto Natali, Elvin Isufi, Mario Coutino, Geert Leus
2021-10-21

This work proposes an algorithmic framework to learn time-varying graphs from online data. The generality offered by the framework renders it model-independent, i.e., it can be theoretically analyzed in its abstract formulation and then instantiated under a variety of model-dependent graph learning problems. This is possible by phrasing (time-varying) graph learning as a composite optimization problem, where different functions regulate different desiderata, e.g., data fidelity, sparsity or smoothness. Instrumental for the findings is recognizing that the dependence of the majority (if not all) data-driven graph learning algorithms on the data is exerted through the empirical covariance matrix, which represents a sufficient statistic for the estimation problem. Its user-defined recursive update enables the framework to work in non-stationary environments, while iterative algorithms building on novel time-varying optimization tools explicitly take the temporal dynamics into account, speeding up convergence and implicitly including a temporal regularization of the solution. We specialize the framework to three well-known graph learning models, namely, the Gaussian graphical model (GGM), the structural equation model (SEM), and the smoothness-based model (SBM), where we also introduce ad-hoc vectorization schemes for structured matrices (symmetric, hollow, etc.) which are crucial to perform correct gradient computations, besides enabling us to work in low-dimensional vector spaces and hence easing storage requirements. After discussing the theoretical guarantees of the proposed framework, we corroborate it with extensive numerical tests on synthetic and real data.

Learning network topologies from data is very appealing. On the interpretable side, the structure of a network reveals important descriptors of the network itself, providing humans with a prompt and explainable decision support system; on the operative side, it is a requirement for processing and learning architectures operating on graph data, such as graph filters [2]. When this structure is not readily available from the application, a fundamental question is how to learn it from data.

The class of problems and the associated techniques concerning the identification of a network structure (from data) are known as graph topology identification (GTI), graph learning, or network topology inference [3], [4]. While up to recent years the GTI problem has been focused on learning static networks, i.e., networks which do not change their structure over time, the pervasiveness of networks with a time-varying component has quickly demanded new learning paradigms. This is the case for biological networks [5], subject to changes due to genetic and environmental factors, or financial markets [6], subject to changes due to political factors, among others. In these scenarios, a static approach would fail to account for the temporal variability of the underlying structure, which is strategic to, e.g., detect anomalies or discover new emerging communities. In addition, prior (full) data availability should not be considered as a given.
In real-time applications, data need to be processed on-the-fly with low latency to, e.g., identify and block cyber-attacks in a communication infrastructure, or fraudulent transactions in a financial network. Thus, another learning component to take into account is the modality of data acquisition. Here, we consider the extreme case in which data are processed on-the-fly, i.e., a fully online scenario. It is then clear how the necessity of having algorithms to learn time-varying topologies from online data is motivated by physical scenarios. For clarity, we elaborate on the three keywords (identification, time-varying, and online) which constitute, other than the title of the present work, also its main pillars.
• Identification/learning: it refers to the (optimization) process of learning the graph topology.
• Time-varying/dynamic: it refers to the temporal variability of the graph in its edges, in opposition to the static case.
• Online/streaming: it refers to the modality in which the data arrive and/or are processed, in opposition to a batch approach which makes use of the entire bulk of data.
This emphasis on the terminology is important to understand the differences between the existing works, presented next.

Static GTI has been originally addressed from a statistical viewpoint and only in the past decade under a graph signal processing (GSP) framework [7], in which different assumptions are made on how the data are coupled with the unknown topology; see [3], [4] for a tutorial. Only recently, dynamic versions of the static counterparts have been proposed. For instance, [8], [9] learn a sequence of graphs by enforcing a prior (smoothness or sparsity) on the edges of consecutive graphs; similarly, the work in [10] extends the graphical Lasso [11] to account for the temporal variability, i.e., by estimating a sparse time-varying precision matrix. In addition to these works, the inference of causal relationships in the network structure, i.e., directed edges, has been considered in [12], [13]. See [14] for a review of dynamic topology inference approaches.

The mentioned approaches tackle the dynamic graph learning problem by means of a two-step approach: i) first, all the samples are collected and split into possibly overlapping windows; ii) only then the topology associated to each window is inferred from the data, possibly constrained to be similar to the adjacent ones. This modus operandi fails to address the online (data-streaming) setting, where data have to be processed on-the-fly either due to architectural (memory, processing) limitations or (low-latency) application requirements, such as real-time decision making. This line of work has been freshly investigated by [15], which considers signals evolving according to a heat diffusion process, and by [16], which assumes the data are graph stationary [17]. In [18], the authors consider a vector autoregressive model to learn causality graphs by exploiting the temporal dependencies, while [19] proposes an online task-dependent (classification) graph learning algorithm, in which class-specific graphs are learned from labeled signals (training phase) and then used to classify new unseen data. Differently from these works, our goal here is to provide a general (model-independent) algorithmic framework for time-varying GTI from online data that can be specialized to a variety of static graph learning problems.
In particular, the generalization given by the framework enables us to render a static graph learning problem into its time-varying counterpart and to solve it via novel time-varying optimization techniques [20], providing a trade-off between the solution accuracy and the speed of execution. We introduce ad-hoc vectorization schemes for structured matrices to solve graph learning problems in the context of the Gaussian graphical model, the structural equation model, and the smoothness-based model. All in all, a mature time-varying GTI framework for online data is yet to be conceived. This is our attempt to pave the way for a unified and general view of the problem, together with solutions to solve it.

This paper proposes a general-purpose algorithmic blueprint which unifies the theory of learning time-varying graphs from online data. The specific contributions of this general framework are:
a) it is model-independent, i.e., it can be analyzed in its abstract form and then specialized under different graph learning models. We show how to instantiate three such models, namely, the Gaussian graphical model (GGM), the structural equation model (SEM) and the smoothness-based model (SBM);
b) it operates in non-stationary environments, i.e., when the data statistics change over time. This is possible by expressing the considered models in terms of the sample covariance matrix, which can then be updated recursively for each new streaming sample with a user-defined function which discards past information;
c) it is accelerated through a prediction-correction strategy, which takes into account the time dimension. Its iterative nature enables a trade-off between following the optimal solution (accuracy) and an approximate solution (velocity). It also exhibits an implicit regularization of the cost function due to the limited iteration budget at each time instant, i.e., similar solutions at close time instants are obtained.

Notation: we use x(i) and X(i, j) to denote the i-th entry of the column vector x and the ij-th entry of the matrix X, respectively. Superscripts ⊤ and † denote the transpose and the pseudoinverse of a matrix, respectively, while the operators tr(·) and vec(·) denote the matrix trace and matrix vectorization, respectively. The vectors 0 and 1, and the matrix I, denote the all-zeros vector, the all-ones vector, and the identity matrix, with dimensions clarified by the context. The operators ⊗, ⊙, ⊘, and ∘ stand for the Kronecker product, Hadamard (entry-wise) product, Hadamard (entry-wise) division, and Hadamard (entry-wise) power, respectively. We have [·]_+ = max(0, ·), where the maximum operates in an entry-wise fashion. Also, ι_X(·) is the indicator function for the convex set X, for which ι_X(x) = 0 if x ∈ X and +∞ otherwise. Given two functions f(·) and g(·), f ∘ g(·) denotes their composition. A function f(·) with argument x ∈ R^N, which is parametrized by the time t, is denoted by f(x; t). The gradient of the function f(x; t) with respect to x at the point (x; t) is denoted by ∇_x f(x; t), while ∇_xx f(x; t) denotes the Hessian evaluated at the same point. The time derivative of the gradient, denoted by ∇_tx f(x; t), is the partial derivative of ∇_x f(x; t) with respect to the time t, i.e., the mixed first-order partial derivative vector of the objective. Finally, ‖·‖_p denotes the ℓ_p norm of a vector or, for a matrix, the ℓ_p norm of its vectorization. The Frobenius norm of a matrix is denoted by ‖·‖_F. Without any subscript, the norm ‖·‖ indicates the spectral norm.
In this section, we formalize the problem of learning graphs from data. In Section II-A, we introduce the static graph topology inference problem, where we also recall three well-known models from the literature. Then, in Section II-B we formulate the (online) dynamic graph topology inference problem.

We consider data living in a non-Euclidean domain described by a graph G = {V, E, S}, where V = {1, . . . , N} is the vertex set, E ⊆ V × V is the edge set, and S is an N × N matrix encoding the topology of the graph. The matrix S is referred to as the graph shift operator (GSO) and typical instantiations include the (weighted) adjacency matrix W [7] and the graph Laplacian L [21]. By associating to each node i ∈ V a scalar value x(i), we define x = [x(1), . . . , x(N)]⊤ ∈ R^N as a graph signal mapping the node set to the set of real numbers.

Consider now the matrix X = [x_1, . . . , x_T] that stacks over its columns T graph signals generated from an unknown graph-dependent process F(·), i.e., X = F(S). Then, a GTI algorithm aims to learn the graph topology, i.e., to solve the (not always well defined) "inverse" problem of recovering S from X = F(S). The function F(·) basically describes how the data are coupled with the graph, and its knowledge is crucial: the data and the graph alone are insufficient to cast a meaningful graph learning problem. On one side, we need to know how the data depend on the graph from which they are generated. On the other side, we have to enforce some prior knowledge on the graph we want to learn.

Graph-data models. The choice of a data model is the forerunner of any GTI technique and, together with the graph-data coupling priors (e.g., smoothness, bandlimitedness), differentiates the different approaches. Due to their relevance for this work, we recall three widely used topology identification methods, namely the Gaussian graphical model [22], the structural equation model [23], and the smoothness-based model [24].

1) Gaussian graphical model (GGM): assumes each graph signal x_t is drawn from a multivariate Gaussian distribution N(µ, Σ) with mean µ and positive-definite covariance matrix Σ. By setting the graph shift operator to be the precision matrix S = Σ^{-1}, graph learning in a GGM amounts to precision matrix estimation, which in a maximum likelihood (MLE) sense can be formulated as

minimize_{S ∈ S^N_{++}}  tr(Σ̂ S) − log det(S),    (2)

where Σ̂ = (1/T) X X⊤ is the sample covariance matrix and S^N_{++} is the convex cone of positive-definite matrices. In this context, the matrix S can be interpreted as an adjacency matrix (with self-loops), although the problem can also be solved under some additional constraints forcing S to be a Laplacian [25].

2) Structural equation model (SEM): neglecting possible external inputs, and assuming an undirected graph, the SEM poses a linear dependence between the signal value x_t(i) at node i and the signal values at some other nodes {x_t(j)}_{j≠i}, representing the endogenous variables, i.e.,

x_t(i) = Σ_{j≠i} S(i, j) x_t(j) + e_t(i),    (3)

where S(i, j) weights the influence that node j exerts on node i, and e_t(i) represents unmodeled effects. In this view, with S encoding the graph connectivity, model (3) considers each node to be influenced only by its one-hop neighbors. In vector form, we can write (3) as

x_t = S x_t + e_t,    (4)

with S(i, i) = 0 for i = 1, 2, . . . , N. Also, we consider e_t white noise with standard deviation σ_e. Graph learning under a SEM implies estimating the matrix S by solving problem (5), where S = {S | diag(S) = 0, S(i, j) = S(j, i), i ≠ j}, and g(S) is a regularizer enforcing S to have specific properties, e.g., sparsity.
In this context, the matrix S is usually interpreted as the adjacency matrix of the network (without self-loops). The first term of (5) can be equivalently rewritten as in (6), which highlights its dependence on Σ̂.

3) Smoothness-based model (SBM): assumes each graph signal x_t to be smooth over the graph G, where the notion of graph smoothness is formally captured by the Laplacian quadratic form

LQ_G(x_t) = x_t⊤ L x_t = (1/2) Σ_{i≠j} W(i, j) (x_t(i) − x_t(j))².    (7)

A low value of LQ_G(x_t) suggests that adjacent nodes i and j have similar values x_t(i) and x_t(j) when the edge weight W(i, j) is high. Thus, the quantity

(1/T) Σ_{t=1}^{T} LQ_G(x_t) = (1/T) tr(X⊤ L X)    (8)

represents the average signal smoothness on top of G, which can be rewritten as the graph-dependent function (9) with S = W. Building upon this quantity, graph learning under a graph smoothness prior can be cast as problem (10), where the term g(S) accommodates additional topological properties (e.g., sparsity) and also helps avoiding the trivial solution S = 0. The set S = {S | diag(S) = 0, S(i, j) = S(j, i) ≥ 0, i ≠ j} encodes the topological structure, which coincides with the set of hollow symmetric matrices (i.e., with zeros on the diagonal) with non-negative entries.

Remark 1. In [24], the authors express the smoothness quantity (8) in terms of the weighted adjacency matrix W and a matrix Z ∈ R^{N×N}_+ representing the row-wise (squared) Euclidean distance matrix of X, i.e., tr(X⊤ L X) = (1/2) tr(WZ) = (1/2) ‖W ⊙ Z‖_1. This formulation mainly brings the intuition that adding explicitly a sparsity term to the objective function would simply add a constant term to Z. We favour (9) as a measure of graph signal smoothness since it fits within our framework, as will be clear soon. We emphasize however that the two formulations are equivalent, since Σ̂ can be directly expressed as a function of Z.

When the graph topology changes over time, the changing interactions are represented by a sequence of graphs {G_t}_{t=1}^{∞}, where t ∈ N_+ is a discrete time index. This sequence of graphs, which is discrete in nature, can be interpreted as the sampling of some "virtual" continuous time-varying graph using the sampling period h = 1. To relate our expressions to the existing literature, we will make the parameter h explicit in the formulas, yet it is important to remember that h = 1. Together with the graph sequence {G_t}_{t=1}^{∞}, we consider also streaming graph signals {x_t}_{t=1}^{∞}, such that signal x_t is associated to graph G_t. At this point, we are ready to formalize the time-varying graph topology identification (TV-GTI) problem.

Problem statement. Given an online sequence of graph signals {x_t}_{t=1}^{∞} arising from an unknown time-varying network, the goal is to identify the time-varying graph topology {G_t}_{t=1}^{∞}, i.e., to learn the graph shift operator sequence {S_t}_{t=1}^{∞}, and, on top of this, to highlight the trade-off between the accuracy and the low latency of the algorithm's solution. Mathematically, our goal is to solve the sequence of time-invariant problems (11), where the function F(·; t) is a time-varying cost function that depends on the data model [cf. Section II-A], and the index t makes the dependence on time explicit, which is due to the arrival of new data. Although we can solve problem (11) for each t separately with (static) convex optimization tools, the need for a low-latency stream of solutions makes this strategy unappealing. This approach also fails to capture the inherent temporal structure of the problem, i.e., it does not exploit the prior time-dependent structure of the graph, which is necessary in time-critical applications.
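Before turning to the online machinery, the following minimal sketch illustrates how synthetic signals could be generated under the three static graph-data couplings above, mirroring the data-generation choices later used in the synthetic experiments of Section VI. It assumes numpy; the graph W, the noise level sigma_e, and the sample count T are illustrative placeholders, not values prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, sigma_e = 10, 200, 0.5

# an illustrative symmetric, hollow, non-negative adjacency matrix W
A = np.abs(rng.standard_normal((N, N)))
W = np.triu(A, k=1); W = W + W.T
L = np.diag(W.sum(axis=1)) - W                    # combinatorial Laplacian

# 1) GGM: x_t ~ N(0, S^{-1}), with S a positive-definite precision matrix
S_prec = L + 0.5 * np.eye(N)                      # shifted Laplacian, positive definite
X_ggm = rng.multivariate_normal(np.zeros(N), np.linalg.inv(S_prec), size=T).T

# 2) SEM: x_t = (I - S)^{-1} e_t, with S scaled so that (I - S) is invertible
S_sem = 0.9 * W / np.abs(np.linalg.eigvals(W)).max()
E = sigma_e * rng.standard_normal((N, T))
X_sem = np.linalg.solve(np.eye(N) - S_sem, E)

# 3) SBM: smooth signals, x_t ~ N(0, L^dagger + sigma_e^2 I), as in Section VI
cov_sbm = np.linalg.pinv(L) + sigma_e**2 * np.eye(N)
X_sbm = rng.multivariate_normal(np.zeros(N), cov_sbm, size=T).T

# the empirical covariance, i.e., the sufficient statistic used throughout the paper
Sigma_hat = X_sbm @ X_sbm.T / T
```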
To exploit also this temporal information, we build on recent advances of time-varying optimization [26], [20] and propose a general framework for TV-GTI suitable for non-stationary environments. The proposed approach operates on-the-fly and updates the solution as a new signal x_t becomes available. The generality of this formulation enables us to define a template for the TV-GTI problem, which can be specialized to a variety of static GTI methods. The only information required is the first-order (gradient) and possibly second-order (Hessian) terms of the function. In the next section, we lay down the mathematics of the proposed approach. The central idea is to follow the optimal time-varying solution of problem (11) with lightweight proximal operations [27], which can be additionally accelerated with a prediction-correction strategy. This strategy, differently from other adaptive optimization strategies such as least mean squares and recursive least squares, uses an evolution model to predict the solution, and observes new data to correct the predictions. The considerations of Section III will be specialized to the different data models of Section II-A in Section IV, further analyzed theoretically in Section V, and finally validated experimentally in Section VI.

To maintain our discussion general, we consider the composite time-varying function

F(S; t) := f(S; t) + λ g(S; t),    (12)

where f : R^{N×N} × N_+ → R is a smooth, strongly convex function [28] encoding a fidelity measure and g : R^{N×N} × N_+ → R is a closed, convex and proper function, potentially non-differentiable, representing possible regularization terms. For instance, the function f(·) can be the GGM objective function of (2), the SEM least-squares term of (5), or the SBM smoothness measure in (8). Solving a time-varying optimization problem implies solving the template problem

S_t := argmin_S f(S; t) + λ g(S; t),   for t = 1, 2, . . .    (13)

In other words, the goal is to find the sequence of optimal solutions {S_t}_{t=1}^{∞} of (13), which we will also call the optimal trajectory. However, solving problem (13) exactly in real time is infeasible because of computational and time constraints. The exact solution may also be unnecessary, since by itself it still approximates the true underlying time-varying graph. Under these considerations, an online algorithm that updates the approximate solution Ŝ_{t+1} of (13) at time t + 1, based on the former (approximate) solution Ŝ_t, is highly desirable for low complexity and fast execution.

Instrumental for the upcoming analysis is to observe that the number of independent variables of the graph representation matrix plays an important role in terms of storage requirements, processing complexity and, most importantly, in the correct computation of function derivatives with respect to those variables. Thus, when considering structured matrices, such as symmetric, hollow or diagonal ones, we need to take their structure into account. We achieve this by ad-hoc vectorization schemes through duplication and elimination matrices, inspired by [29]. Consider a matrix S ∈ R^{N×N} and its corresponding "standard" vectorization vec(S) ∈ R^{N²}. Depending on the specific structure of S, different reduction and vectorization schemes can be adopted, leading to a lift from a matrix space to a vector space. The following spaces are of interest.

h-space. If S is symmetric, the number of independent variables is k = N(N + 1)/2, i.e., the variables in its diagonal and its lower (equivalently, upper) triangular part.
We can isolate these variables by representing the matrix S with its half-vectorization form, which we denote as s = vech(S) ∈ R^k. This isolation is possible by introducing the elimination matrix E ∈ R^{k×N²} and the duplication matrix D ∈ R^{N²×k}, which respectively select the independent entries of S, i.e., E vec(S) = s, and duplicate the entries of s, i.e., Ds = vec(S). We refer to this vector space as the half-vectorization space (h-space).

hh-space. If S is symmetric and hollow, the number of independent variables is l = N(N − 1)/2, i.e., the variables on its strictly lower (equivalently, upper) triangular part. In this case, we can represent the matrix S in its hollow half-vectorization form, which we denote as s = vechh(S) ∈ R^l. This reduction is achieved by applying the hollow elimination and duplication matrices E_h ∈ R^{l×N²} and D_h ∈ R^{N²×l}, respectively, to the vectorization of S. In particular, E_h extracts the variables of the strictly lower triangular part of the matrix, i.e., s = E_h vec(S), while D_h duplicates the values and fills in zeros in the correct positions, i.e., vec(S) = D_h s. We refer to the associated vector space as the hollow half-vectorization space (hh-space).

With the above discussion in place, we can now illustrate the general framework in terms of vector-dependent functions f(s) for a vector s, in contrast to matrix-dependent functions f(S), simplifying exposition and notation. However, we underline that the information embodied in S and s is the same.

We develop a prediction-correction strategy for problem (13) that starts from an estimate ŝ_t at time instant t, and predicts how this solution will change in the next time step t + 1. This predicted topology is then corrected after a new datum x_{t+1} is available at time t + 1. More specifically, the scheme has the following two steps.

(1) Prediction: at time t, an approximate function F̂(s; t + 1) of the true yet unobserved function F(s; t + 1) is formed, using only information available at time t. Then, using this approximated cost, we derive an estimate s_{t+1|t} of how the topology will be at time t + 1, using only the information up to time t. This estimate is found by solving problem (14). To avoid solving (14) for each t, we find an estimate ŝ_{t+1|t} by applying P iterations of a problem-specific descent operator T̂ (e.g., gradient descent, proximal gradient) for which s_{t+1|t} = T̂ s_{t+1|t}, i.e., s_{t+1|t} is a fixed point of T̂. See Appendix A for possible instances of T̂. In other words, problem (14) is solved recursively as in (15), with ŝ^0 = ŝ_t. Once the P steps are performed, the predicted topology is set to ŝ_{t+1|t} = ŝ^P, which approximates the solution of (14) and, in turn, will be close to s_{t+1} at time t + 1.

For our framework, we consider a Taylor-expansion-based prediction to approximate the first term of F(·; t + 1), i.e., f(·; t + 1) [cf. (12)], leading to the quadratic function (16), where ∇_ss f(·) is the Hessian matrix of f(·) with respect to s and ∇_ts f(·) is the partial derivative of the gradient of f(·) with respect to the time t. To approximate the second term of F(·; t + 1), i.e., g(·; t + 1) [cf. (12)], we use a one-step-back prediction, i.e., ĝ(s; t + 1) = g(s; t). This implies that ĝ(·) does not depend on t, which in turn makes the constraint set and the regularization term independent of time, an assumption usually met in state-of-the-art topology identification [3]. Henceforth, we will omit this time dependency.
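As a concrete illustration of the vectorization machinery used above, the sketch below builds one valid (dense, small-N) choice of elimination and duplication matrices for symmetric and hollow symmetric matrices and verifies the round trip vec(S) = D s with s = E vec(S). It is a plain numpy construction for exposition, not an optimized routine.

```python
import numpy as np

def vec(S):
    # column-major vectorization, the usual convention for vec(.)
    return S.flatten(order="F")

def elimination_duplication(N, hollow=False):
    """Elimination matrix E (vec(S) -> independent entries) and duplication
    matrix D (independent entries -> vec(S)) for a symmetric (hollow) S."""
    rows, cols = np.tril_indices(N, k=-1 if hollow else 0)   # independent entries
    k = len(rows)
    E = np.zeros((k, N * N))
    D = np.zeros((N * N, k))
    for idx, (i, j) in enumerate(zip(rows, cols)):
        E[idx, i + j * N] = 1.0      # pick S(i, j) out of vec(S)
        D[i + j * N, idx] = 1.0      # place it back at position (i, j) ...
        D[j + i * N, idx] = 1.0      # ... and at the symmetric position (j, i)
    return E, D

# round-trip check in the hh-space on a random hollow symmetric matrix
N = 5
A = np.random.randn(N, N)
S = A + A.T
np.fill_diagonal(S, 0.0)
E_h, D_h = elimination_duplication(N, hollow=True)
s = E_h @ vec(S)                         # s = vechh(S), length N(N-1)/2
assert np.allclose(D_h @ s, vec(S))      # vec(S) = D_h s

# the h-space (diagonal included) works in the same way
E_, D_ = elimination_duplication(N, hollow=False)
assert np.allclose(D_ @ (E_ @ vec(S)), vec(S))
```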
Algorithm 1: General online TV-GTI framework
Require: a feasible Ŝ_0, f(S; t_0), P, C, operators T̂ and T
1: ŝ_0 ← ad-hoc vectorization of Ŝ_0
2: for t = 0, 1, . . . do
3:    // Prediction - time t
4:    initialize the predicted variable ŝ^0 = ŝ_t
5:    for p = 0, 1, . . . , P − 1 do
6:       predict ŝ^{p+1} with (15)
7:    end for
8:    set the predicted variable ŝ_{t+1|t} = ŝ^P
9:    // Correction - time t + 1: new data arrive
10:   initialize the corrected variable ŝ^0 = ŝ_{t+1|t}
11:   for c = 0, 1, . . . , C − 1 do
12:      correct ŝ^{c+1} with (18)
13:   end for
14:   set the corrected variable ŝ_{t+1} = ŝ^C
15: end for

(2) Correction: at time t + 1 the new datum x_{t+1}, and hence the cost function F(s; t + 1), becomes available. Thus, we correct the prediction ŝ_{t+1|t} by solving the correction problem (17). Also in this case, we solve (17) with iterative methods to obtain an approximate solution ŝ_{t+1} by applying C iterations of an operator T. In other words, the correction problem (17) is addressed through the recursion (18), with ŝ^0 = ŝ_{t+1|t}. Once the C steps are performed, the corrected graph ŝ_{t+1} is set to ŝ_{t+1} = ŝ^C, which approximates the solution s_{t+1} of (17). Algorithm 1 shows the pseudocode for the general online TV-GTI framework.

Remark 2. We point out that the framework can adopt different approximation schemes, such as extrapolation-based techniques, and can also include time-varying constraint sets. The choice of the approximation scheme depends on the properties of the problem itself along with the required prediction accuracy. For an in-depth theoretical discussion regarding different prediction approaches and the relative convergence results, refer to [30].

In this section, we specialize the proposed framework to the three static topology inference models discussed in Section II-A. Notice that the data dependency of data-driven graph learning algorithms is exerted via the empirical covariance matrix Σ̂ of the graph signals; we have already shown this for the three considered models of Section II-A. In other words, graph-dependent objective functions of the form F(S) can be explicitly expressed through their parametrized version F(S; Σ̂). This rather intuitive, yet crucial, observation is central to render the proposed framework model-independent and adaptive, as explained next.

Non-stationarity. Relying on the explicit dependence of the function F(·) on Σ̂ and envisioning non-stationary environments, we let the algorithm be adaptive by discarding past information. That is, the function F(S; t) in (12) can be written as F(S; Σ̂_t), with Σ̂_t the empirical covariance matrix up to time t, with past data gradually discarded. This makes the framework adaptive and model-independent. The adaptive behavior can be shaped by, e.g., the exponentially-weighted moving average (EWMA) of the covariance matrix in (19), where the forgetting factor γ ∈ (0, 1) downweighs (for γ → 0) or upweighs (for γ → 1) past data contributions. For stationary environments, an option is the infinite-memory covariance update, which weighs all samples equally.

The GGM problem (2), adapted to a time-varying setting following template (13), leads to (20). In this case g(·) encodes the constraint set of positive-definite matrices and the regularization parameter is λ = 1. Since S is symmetric, we use the half-vectorization s = vech(S) ∈ R^k to reduce the number of independent variables from N² to k = N(N + 1)/2. Then, the gradient and the Hessian of the function f(·) in the h-space are given by (21a) and (21b), respectively. Likewise, the discrete-time derivative of the gradient is given by the partial mixed-order derivative (22) [26]. Note that the Hessian term (21b) is time-independent, while the time derivative of the gradient (22) is graph-independent.
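Before particularizing the updates, here is a minimal sketch of the recursive covariance updates that drive the time variation of these quantities. The EWMA form Σ̂_t = γΣ̂_{t−1} + (1 − γ) x_t x_t⊤ used below is one common instance of the user-defined update (19), and the running-mean rule is one possible infinite-memory variant; the exact weighting is a design choice within the framework.

```python
import numpy as np

def ewma_covariance(Sigma_prev, x_t, gamma=0.75):
    """EWMA update with forgetting factor gamma: gamma close to 1 keeps a long
    memory, gamma close to 0 quickly forgets past data (non-stationary setting)."""
    x_t = x_t.reshape(-1, 1)
    return gamma * Sigma_prev + (1.0 - gamma) * (x_t @ x_t.T)

def infinite_memory_covariance(Sigma_prev, x_t, t):
    """Running-mean (infinite-memory) update: all samples weighted equally.
    Here t counts the samples already absorbed into Sigma_prev."""
    x_t = x_t.reshape(-1, 1)
    return (t * Sigma_prev + x_t @ x_t.T) / (t + 1)

# streaming usage: warm up Sigma_0 on a few samples, then update per new sample
rng = np.random.default_rng(1)
N, T_warm, T = 7, 20, 100
X = rng.standard_normal((N, T))
Sigma = X[:, :T_warm] @ X[:, :T_warm].T / T_warm           # Sigma_0
for t in range(T_warm, T):
    Sigma = ewma_covariance(Sigma, X[:, t], gamma=0.75)     # non-stationary choice
```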
Now, by defining ŝ_t := vech(Ŝ_t) ∈ R^k, we can particularize Algorithm 1 as follows.
• Prediction: with ŝ^0 initialized as ŝ^0 = ŝ_t, the prediction update (23), for p = 0, 1, . . . , P − 1 and a (time-varying) step size α_t, entails a descent step along the approximate function f̂(·; t + 1) in (16), followed by the projection onto the convex set S; see Appendix A for the definition of P_S(·). Then, the prediction is set to ŝ_{t+1|t} = ŝ^P.
• Correction: by setting ŝ^0 = ŝ_{t+1|t}, the correction update (24), for c = 0, 1, . . . , C − 1 and a (time-varying) step size β_t, entails a descent step along the true function f(·; t + 1), followed by the projection onto the set S. The correction is finally set to ŝ_{t+1} = ŝ^C.
The prediction step (23) instantiates (15) with a projected-gradient operator, and similarly the correction step (24) instantiates (18). The overall computational complexity of one PC iteration is dominated by the matrix inversion and the matrix multiplication, incurring a cost of O(N³). A correction-only algorithm would also incur a cost of O(N³) per iteration. See Appendix C for details.

The SEM problem (5), adapted to a time-varying setting with a sparsity-promoting regularizer, leads to the TV-SEM problem (25) [cf. (13)], where S = {S | diag(S) = 0, S(i, j) = S(j, i), i ≠ j} is the set of hollow symmetric matrices and ‖S‖_1 = ‖vec(S)‖_1. Since S is symmetric and hollow, we operate in the hh-space to make the problem unconstrained and to reduce the number of independent variables from N² to l = N(N − 1)/2, through the hollow half-vectorization form s = vechh(S) ∈ R^l. In the hh-space, equations (25a) and (25b) become (26a) and (26b). To solve the time-varying SEM (TV-SEM) problem, we derive the gradient and the Hessian of the function f(·) in the hh-space as in (27); notice how the Hessian is time-varying and independent of s, differently from the GGM case. The time derivative of the gradient is given by the partial mixed-order derivative (28).

Now, by defining ŝ_t := vechh(Ŝ_t) ∈ R^l, we can particularize Algorithm 1 as follows.
• Prediction: set ŝ^0 = ŝ_t. Then, the prediction is the proximal-gradient update (29), for p = 0, . . . , P − 1: equation (29a) entails a descent step along the approximate function f̂(·; t + 1) in (16), followed by the non-negative soft-thresholding operator in (29b), which sets to zero all the (negative) edge weights of the graph obtained after the gradient descent in (29a). See Appendix A for the formal definition of the proximal operator, leading to (29a) and (29b). The final prediction is set to ŝ_{t+1|t} = ŝ^P.
• Correction: set ŝ^0 = ŝ_{t+1|t}. Then, the correction is the proximal-gradient update (30), for c = 0, . . . , C − 1: equation (30a) entails a descent step along the true function f(·; t + 1), followed by the non-negative soft-thresholding operator in (30b). Finally, the correction is set to ŝ_{t+1} = ŝ^C.
The prediction step (29) instantiates (15) and, similarly, the correction step (30) instantiates (18), in both cases with a proximal-gradient operator. The overall computational complexity of one PC iteration is dominated by the computation of the matrix Q_t, incurring a cost of O(N³). A correction-only algorithm would also incur a cost of O(N³) per iteration. See Appendix C for details.

The SBM model (10), adapted to a time-varying setting, leads to (31), where S = {S ∈ S^N | diag(S) = 0, S(i, j) = S(j, i) ≥ 0, i ≠ j} is the set of hollow symmetric matrices with non-negative entries. The log-barrier term log(S1) is applied entry-wise and forces the node degree vector d = S1 to be positive while avoiding the trivial solution. The Frobenius norm term ‖S‖²_F controls the sparsity of the graph.
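To make this cost tangible, the sketch below evaluates the smoothness term both as tr(X⊤LX) and as (1/2)‖W ⊙ Z‖_1 (cf. Remark 1), and adds illustrative log-barrier and Frobenius terms. The weights alpha and beta are placeholders for the (unspecified here) regularization constants of the SBM problem, and scipy is assumed for the pairwise distances.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(2)
N, T = 8, 50
X = rng.standard_normal((N, T))          # rows: nodes, columns: graph signals

# an illustrative non-negative, hollow, symmetric weight matrix W
A = np.abs(rng.standard_normal((N, N)))
W = np.triu(A, 1); W = W + W.T
L = np.diag(W.sum(1)) - W                # combinatorial Laplacian

# smoothness: tr(X^T L X) equals (1/2) <W, Z>, with Z the squared-distance matrix
Z = squareform(pdist(X, metric="sqeuclidean"))    # Z[i, j] = ||X[i, :] - X[j, :]||^2
smooth_1 = np.trace(X.T @ L @ X)
smooth_2 = 0.5 * np.sum(W * Z)
assert np.isclose(smooth_1, smooth_2)

# SBM-style objective: smoothness, log-barrier on the degrees, Frobenius penalty
alpha, beta = 1.0, 0.5                   # illustrative regularization weights
d = W.sum(1)                             # node degrees, d = W 1
objective = smooth_1 - alpha * np.sum(np.log(d)) + beta * np.sum(W ** 2)
```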
By operating in the hh-space, equations (31a) and (31b) become (32a) and (32b); here we move the log-barrier and Frobenius norm terms of the g(·) function in (31b) into the f(·) function, to fit the structure of the general template. To apply the proposed framework to the time-varying SBM (TV-SBM) problem, we derive the gradient and the Hessian of the function f(·) in the hh-space as in (33), where ⊘ and ∘ represent the Hadamard division and power, respectively. The time derivative of the gradient is given by the partial mixed-order derivative (34), where z_t = Kσ d_t^{−2σ}.

Now, by defining ŝ_t := vechh(Ŝ_t) ∈ R^l, we can particularize Algorithm 1 as follows.
• Prediction: with ŝ^0 initialized as ŝ^0 = ŝ_t, the prediction update (35), for p = 0, 1, . . . , P − 1, entails a descent step along the approximate function f̂(·; t + 1) in (16), followed by the projection onto the non-negative orthant. Then, the prediction is set to ŝ_{t+1|t} = ŝ^P.
• Correction: by setting ŝ^0 = ŝ_{t+1|t}, the correction update (36), for c = 0, 1, . . . , C − 1, entails a descent step along the true function f(·; t + 1), followed by the projection onto the non-negative orthant. Finally, ŝ_{t+1} = ŝ^C.
The prediction step (35) instantiates (15) with T̂ = P_{s⪰0} ∘ (I − α_t ∇_s f̂)(·) and, similarly, the correction step (36) instantiates (18) with T = P_{s⪰0} ∘ (I − β_t ∇_s f)(·). The overall computational complexity per iteration is dominated by the computation of the gradient ∇_s f(s; t) (or of the Hessian, if P > 1), incurring a cost of O(N²) (O(N³) if P > 1); see Appendix C for details.

In this section, we first discuss the convergence of Algorithm 1 and the associated error bounds. As solver we consider the proximal gradient T̂ = T = prox_{g,ρ} ∘ (I − ρ∇_s f)(·) [31], [32]. Then, we show how the parameters of the three introduced models are involved in the bounds. To ease notation, we use s ∈ R^p to indicate the vectorization of the matrix variable S ∈ R^{N×N} [cf. Section III-A]. For this analysis, we need the following mild assumptions.

Assumption 1. The function f : R^p × N_+ → R is m-strongly convex and L-smooth uniformly in t, i.e., mI ⪯ ∇_ss f(s; t) ⪯ LI for all s and t, while the function g : R^p × N_+ → R ∪ {+∞} is closed, convex and proper, or g(·; t) = 0, for all t ∈ N_+. This guarantees that problem (13) admits a unique solution for each time instant, which in turn guarantees uniqueness of the solution trajectory {s_t}_{t=1}^{∞}.

Assumption 2. The gradient of the function f(·) has bounded time derivative, i.e., ‖∇_ts f(s; t)‖ ≤ C_0 for all s and t. This guarantees that the solution trajectory is Lipschitz in time.

Assumption 3. The predicted function f̂(·; t + 1) is m-strongly convex and L-smooth uniformly in t, and ĝ(·; t + 1) is closed, convex and proper. This implies that the prediction problem (14) belongs to the same class as the original problem, i.e., the functions of the two problems share the same strong convexity and Lipschitz constants m and L. Therefore, the same solver can be applied for the prediction and correction steps, i.e., T̂ = T.

Assumption 4. The matrix S of (12) has finite entries, i.e., −∞ < S(i, j) < +∞ for all i, j. This guarantees ‖S‖ < +∞, i.e., S is a bounded operator, and it holds in practical scenarios. In particular, it is known that (finite) weighted graphs exhibit bounded eigenvalues, see [33], [34]. Notably, if S is a normalized Laplacian, then ‖S‖ ≤ 2.

Similarly, Assumptions 1-3 are mild and hold for the considered models, as we show next.

Proposition 1. The three considered models of Section IV can be m-strongly convex and L-smooth uniformly in t, for some scalars m and L, as supported by the following claims.

Claim 1.
Denote by ξ > 0 and 0 < χ < ∞ the minimum and maximum admissible eigenvalues of the precision matrix S, respectively; i.e., consider the set S = {S ∈ S^N_{++} | ξI ⪯ S ⪯ χI}. Then the TV-GGM function f(·; t) in (20a) satisfies the bounds in (37), with strong convexity and smoothness constants governed by ξ and χ.

Claim 2. Denote by λ_min and λ_max the smallest and largest eigenvalues for the set of empirical covariance matrices obtained from graph signals obeying (4). Then the TV-SEM function f(·; t) in (26a) satisfies the bounds in (38), with constants governed by λ_min and λ_max.

Claim 3. Consider the TV-SBM function f(·; t) in (32a), and recall that the log-barrier term avoids isolated vertices, i.e., d > 0. Denote by d_min > 0 the minimum degree of the GSO search space. Under these assumptions, the bounds in (39) hold, with constants governed by the regularization weights and by d_min.

See Appendix B for a proof of Claims 1-3. Thus, Assumption 1 holds since the Hessian of f(·; t) is bounded over time and g(·; t) is closed, convex and proper by problem construction; Assumption 2 holds since ∇_ts f(s; t) is the difference between bounded vectors which involve covariance matrices not too different from each other (one is the rank-one update of the other), and is therefore finite as long as the graph signals are bounded, see (19) and, e.g., (34). Assumption 3 holds since f̂(·; t + 1) is a quadratic approximation of f(·; t) [cf. (16)] and ĝ(·; t + 1) = g(·; t), thus inheriting the properties of f(·; t) and g(·; t), which satisfy Assumption 1.

With this in place, we are now ready to show two different error bounds incurred during the prediction and correction steps performed by Algorithm 1, describing its sub-optimality as a function of the model and algorithm parameters. First, we show the error bound between the optimal prediction solution s_{t+1|t} and the associated optimal correction s_{t+1}, which solve problems (14) and (17), respectively.

Proposition 2. Let Assumptions 1-3 hold. Consider also the Taylor-expansion-based prediction (16) for f(·; t) and the one-step-back prediction for g(·; t). Then, the distance between the optimal prediction solution s_{t+1|t}, solving problem (14), and the associated optimal correction s_{t+1}, solving problem (17), is upper bounded as in (40), where ŝ_t is the approximate solution of the correction problem (17) at time t.

Proof. Follows from [30, Lemma 4.2], in which the constant D_0 = 0 by considering a static function g(·).

This bound enables us to measure how far the prediction is from the true corrected topology at time t + 1. It depends on the estimation error ‖ŝ_t − s_t‖ achieved at time t, the ratio L/m, and the variability of the function gradient ∇_ts f(s; t). The bound suggests that a small gap can be achieved if i) the ratio L/m is small, which for the three considered models translates into having a small condition number for the involved covariance matrices or GSOs; and ii) the time-gradient ∇_ts f(s; t) at consecutive time steps does not change significantly, which holds when the considered models have similar covariance matrices at adjacent time instants, i.e., the data statistics do not change too rapidly (see, e.g., (22) and (28)).

Finally, we bound the error sequence {‖ŝ_t − s_t‖_2, t = 1, 2, . . .} achieved by Algorithm 1 by means of the following non-asymptotic performance guarantee, which is an adaptation of [30, Proposition 5.1].

Theorem 4. Let Assumptions 1 and 3 hold, and consider two scalars {d_t, φ_t} ∈ R_+ satisfying the conditions in (41) for any t ∈ N_+, which quantify the temporal variability of the problem. Let also the prediction and correction steps use the same step sizes ρ_t = α_t = β_t.
Then, by employing P prediction and C correction steps with the proximal-gradient operator T = prox_{g,ρ_t} ∘ (I − ρ_t ∇_s f)(·), the sequence of iterates {ŝ_t} generated by Algorithm 1 satisfies the error bound (42).

Proof. Follows from [30, Proposition 5.1] and [30, Lemma 2.5], with variables λ = q_t and χ = β = 1.

The bound (42) states that the sequence of estimated graphs {ŝ_t}_{t∈N_+} hovers around the optimal trajectory {s_t}_{t∈N_+} with a distance depending on: i) the numbers P and C of iterations; ii) the estimation error ‖ŝ_t − s_t‖ achieved at the previous time instant; and iii) the quantities d_t and φ_t. Moreover, (42) is a contraction (i.e., q_t^{C+P} < 1) when ρ_t < 2/L; in this case the initial starting point ŝ_0 does not influence the error ‖ŝ_{t+1} − s_{t+1}‖ asymptotically, since the first term in (42) vanishes. However, the terms d_t and φ_t keep impacting the error also asymptotically, as long as the problem is time-varying; if the problem becomes static, i.e., the solution stops varying, then d_t = φ_t = 0, and the overall error asymptotically goes to zero.

In this section, we show with numerical results how Algorithm 1, specialized to the three models (TV-GGM, TV-SEM, TV-SBM), can track the offline solution of (13) obtained by the respective instantiations. For all the experiments, we initialize the empirical covariance matrix Σ̂_0 with some samples acquired prior to the analysis. We consider P = 1 prediction steps and C = 1 correction steps, which is the challenging setting of having the minimum iteration budget for streaming scenarios. We measure the convergence of Algorithm 1 via the normalized squared error (NSE) between the algorithm's estimate ŝ_t and the optimal (offline) solution s_t,

NSE(t) = ‖ŝ_t − s_t‖²_2 / ‖s_t‖²_2.    (43)

We use CVX [36] as solver for the offline computations, and report the required computational time in seconds achieved by Algorithm 1 and CVX.

We generate a synthetic (seed) random graph S_0 of N nodes using the GSP toolbox [37]. Then, the edges follow two different temporal evolution patterns: i) piecewise constant; and ii) smooth temporal variation. Finally, we generate the stream of data according to the three considered models [cf. Section IV] for T time instants.

Piecewise. For the piecewise constant scenario, we randomly select N/2 nodes of the initial graph S_0 and double the weight of their edges after T/2 samples. Then, for t = {1, . . . , T} we generate each graph signal x_t according to the three models: 1) for the TV-GGM, we use x_t ∼ N(0, Σ_t), where Σ_t = S_t^{-1}; 2) for the TV-SEM, we use x_t = (I − S_t)^{-1} e_t [cf. (4)], with noise variance σ²_e = 0.5; and 3) for the TV-SBM, we use x_t ∼ N(0, L_t^† + σ²_e I_N) as in [38], with σ²_e = 0.5.

Smooth. For the smooth scenario, starting from the initial graph S_0, the evolution pattern follows an edge-dependent behavior, S_t(i, j) = S_0(i, j)(1 + e^{−0.01ijt}) for t = {1, . . . , T}. This means that each edge follows an exponentially decaying behavior, with the decaying factor depending on the edge itself. The data are generated as in the piecewise constant scenario.

For the results, we will compare the following methods:
• Prediction-correction (PC), red curve: this is the proposed Algorithm 1 specialized to one of the three models, with P = C = 1.
• Correction-only (CO), cyan curve: this is a prediction-free algorithm which only considers the original problem (17) and applies C = 1 iteration of the recursion (18). It is equivalent to Algorithm 1 with P = 0, C = 1. We consider this algorithm to study the benefits of the prediction step performed by PC.
• Correction-correction (CC), blue curve: this is a prediction-free algorithm which only considers the original problem (17) and applies C = 2 iterations of the recursion (18). It is equivalent to Algorithm 1 with P = 0, C = 2. This is a fairer comparison than CO, since the number of iterations is the same as that of PC.
• Stochastic gradient descent (SGD), ochre curve: this is a prediction-free and memory-less version of the algorithm which only considers the last acquired graph signal. That is, the empirical covariance matrix Σ̂_t = x_t x_t⊤ in (19) is just a rank-one update, achieved by setting γ = 0. We consider this to show how much the temporal variability of the function, captured by the time derivative of the gradient in PC, affects the algorithm's convergence.
• Prediction-correction rank-one (PC-1), purple curve: this is a rank-one (stochastic) implementation of the PC algorithm, i.e., Σ̂_t = x_t x_t⊤ for the update in (19), and P = C = 1. Notice that, differently from SGD, it also uses the time derivative of the gradient, which in this case is the difference between two rank-one covariance matrices (thus the length of the memory is equal to one). We consider this algorithm to check the impact of the prediction step in a stochastic implementation of PC.
• Correction-correction rank-one (CC-1), orange curve: this is a rank-one (stochastic) implementation of the CC algorithm, i.e., it considers Σ̂_t = x_t x_t⊤ for the update in (19), and P = 0, C = 2. It can be seen as a two-step SGD, and we consider it to study whether the prediction step of PC-1 is beneficial for stochastic implementations.
In addition, for the piecewise constant scenario, we also report (green curve) the NSE between the PC solution and the batch solution obtained having all the relevant data in advance, i.e., the solution that would be obtained with a static graph learning algorithm on the intervals where the graph remains constant. In general, a fair comparison can be made within the rank-one implementations (SGD, PC-1 and CC-1) and within the memory-aware ones (PC, CO, CC).

Results. The NSE achieved by Algorithm 1 for the three models is shown in Fig. 1, for both the piecewise constant (top row) and smooth (bottom row) scenarios. We use fixed step sizes for all the experiments. Notice that the only effect of the functions' hyperparameters is to shape the batch solution s_t (and hence the time-varying trajectory ŝ_t at convergence). Thus, we run Algorithm 1 with different hyperparameters and manually select them by ensuring that the trivial and complete graphs are excluded; the selected ones are displayed, together with the other algorithm parameters, in the captions of Fig. 1.

GGM. Fig. 1a and Fig. 1d show the results for the piecewise constant and smooth scenarios, respectively. In both scenarios, the PC solution converges to the optimal offline counterpart and, for the piecewise constant case, also to the batch solution(s). This demonstrates the adaptive nature of Algorithm 1 and its ability to react to changes in the data statistics. While for the piecewise constant scenario PC and CC offer the same convergence speed (which is expected, as explained in "Does prediction help?"), for the smooth scenario the PC algorithm exhibits a faster convergence with respect to the prediction-free competitors CO and CC. This is because the temporal variability of the function (and of its gradient) is captured by the prediction step and exploited to speed up the convergence.

SEM.
Similar considerations hold for the TV-SEM, whose results are illustrated in Fig. 1b and Fig. 1e. In both scenarios, PC and CC offer the same convergence rate (and also converge to the batch solution in the piecewise constant scenario), faster than the CO and SGD implementations. Interestingly, after the triggering event at T/2, SGD can track the optimal solution faster than CO, with performance similar to PC and CC. A possible justification may be the memory-less nature of SGD, i.e., it only considers the last sample for the gradient evaluation, thus discarding past data. This renders SGD more reactive to sudden changes of the data statistics compared to the memory-aware alternatives, which however exhibit similar performance thanks to the extra iteration they can exploit.

SBM. Finally, the TV-SBM results are shown in Fig. 1c and Fig. 1f. Also in this case, the PC solution converges to the offline counterpart for the two scenarios, faster than the prediction-free versions of the algorithm (CC and CO). In particular, while in the piecewise constant scenario PC converges faster than CC and the rank-one implementations, in the smooth scenario the rank-one implementations exhibit a faster convergent behavior with respect to the non-stochastic implementations. Similar to what has been said for the TV-SEM results, a possible reason can be the memory-aware characteristics of the non-stochastic methods; that is, while the information present in past data can be beneficial in the static scenario, and thus help PC and CC to obtain a more reliable estimate of the true underlying (static) covariance matrix (and of the gradient), it may slow down the process in non-stationary environments with time-varying covariance matrices, as in the smooth scenario.

Required time. An important metric to consider in time-sensitive applications is the average time per iteration. We report this information in Table I for the PC step and CVX, relative to the three considered models and the settings in the top row of Fig. 1. Combining the information of the table and that of the plots in Fig. 1, it is clear how trading off the knowledge of the optimal solution for savings in terms of time is an excellent compromise: each prediction-correction step requires around three orders of magnitude less time than the CVX counterpart, while attaining an NSE below 10^{-1}.

Does prediction help? Notice how in the piecewise constant scenario the PC strategy does not seem to offer a major advantage with respect to the CC strategy. Although this behavior could be hypothesized (since the setting is static), it is here empirically confirmed. To gain more insights, we look at the structure of the prediction step (e.g., (23)), where the components playing a role in the descent direction are: the gradient ∇_s f(·); the Hessian ∇_ss f(·); and the time derivative of the gradient ∇_ts f(·). Since we use P = 1, i.e., only one prediction step, the term (ŝ^p − ŝ_t = 0) that multiplies the Hessian does not contribute to the descent step. The added value of the prediction step with respect to a general (correction) descent method, in this case, would only be provided by the time-gradient ∇_ts f(·) (since the gradient ∇_s f(·) is common to both the prediction and the correction step). In the piecewise constant scenario, however, the underlying (true) covariance matrix is time-invariant within the two stationary intervals, leading to a zero time derivative of the gradient (cf. (22)).
This means that in static scenarios, with P = 1, the prediction step boils down to a correction step. Differently, for P = 2, the contribution of the second-order information may speed up the convergence, as illustrated in Fig. 2a for the TV-GGM, with respect to a correction-only algorithm using C = 3. In the smooth scenario, the temporal variability of the gradient, captured by the time derivative ∇_ts f(·), plays a role in the prediction step, which can improve the convergence speed of the algorithm. The (bounded) norm of this vector over time is illustrated in Fig. 2b for the TV-GGM smooth scenario of Fig. 1d; this norm is linked to the constant C_0 introduced in Assumption 2 and to the error in (40).

All in all, the results indicate the convergence of Algorithm 1 to the optimal offline counterpart and its capability to track it in non-stationary environments. The algorithm also converges to the batch solutions of the two stationary intervals, obtained with all the relevant data. A defining characteristic of Algorithm 1 is its ability to naturally enforce similar solutions at each iteration, achieved with an early stopping of the descent steps, governed by the parameters P and C. That is, the algorithm adds an implicit temporal regularization to the problem, which needs to be explicitly added when working with the entire batch of data. Given these results and insights, we can outline a few principles that can be adopted when considering Algorithm 1 for learning problems:
• The prediction step with P = 1 can be beneficial when the underlying data statistics change over time, so that the time-variability of the gradient can be exploited. Otherwise, in a completely static scenario, it coincides with a correction step.
• Increasing P can improve the convergence speed when the approximated cost function is a good surrogate of the cost function at the next time instant.
• Memory-less (stochastic) variants of the algorithm can be suitable in fast-changing environments, due to their ability to discard past information and react quickly to changes in data statistics.
Being confident in the convergence of the algorithm, we now corroborate its performance by testing the three considered algorithms on real data. Among other indicators employed in the simulations to assess the performance of the algorithm, we use the graph temporal deviation TD(t) := ‖ŝ_t − ŝ_{t−1}‖_2, which measures the global variability of the edges of the graph across time instants. To gain further insights on the network evolution over time, we consider additional metrics (such as the number of edges and the temporal gradient norm) and visual analysis tools, which will be introduced in the application-specific scenario at hand. In this case, the hyperparameters of each function are chosen in such a way that the inferred graphs are neither trivial nor complete, and interpretable patterns consistent with real events are visible from the plots of the employed metrics.

TV-GGM for Stock Price Data Analysis. Data description: we collect historical stock (closing) prices relative to the S&P500 Index for seven pharmaceutical companies, including Moderna (MRNA), Novavax (NVAX), AstraZeneca (AZN) and Sanofi (SNY). Our goal is to leverage the TV-GGM in order to explore the relationships among these companies over time and observe the possible structural changes due to market instabilities.
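As a small illustration of the pre-processing and of the temporal-deviation indicator TD(t) used throughout these experiments, the following sketch assumes numpy and uses hypothetical placeholder arrays for the data matrix and for the sequence of (vectorized) graph estimates produced by Algorithm 1.

```python
import numpy as np

def standardize(X):
    """Center each variable (row) and divide by its empirical standard deviation,
    as done for all the real-data experiments."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def temporal_deviation(s_hat):
    """Graph temporal deviation TD(t) = ||s_hat_t - s_hat_{t-1}||_2 for a sequence
    of vectorized graph estimates stacked column-wise (one column per time step)."""
    return np.linalg.norm(np.diff(s_hat, axis=1), axis=0)

# illustrative usage with random placeholders for the data and the estimates
rng = np.random.default_rng(3)
X = standardize(rng.standard_normal((7, 504)))     # e.g., 7 stocks, 504 trading days
s_hat = rng.standard_normal((21, 504))             # 21 = 7*6/2 edge weights per day
TD = temporal_deviation(s_hat)                     # length 503, one value per step
```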
Results: we consider T = 504 measurements (working days in August 2019 - August 2021) as graph signals {x_t} for the N = 7 quantities of interest, which are further standardized, i.e., each variable is centered and divided by its empirical standard deviation; see Fig. 3a for a plot of the standardized time series. We run the TV-GGM algorithm for different values of the forgetting factor γ, and monitor the evolution of the metrics introduced earlier. The value γ = 0.75 yielded the results most consistent with the data behavior. It is clear from Fig. 3a and the TD indicator in Fig. 3b that around March 2020 and after January 2021 the market changed significantly, due to the instability generated by the pandemic and by the follow-up starting vaccination campaign. The sharp peaks in Fig. 3b around the same periods are a consequence of the dynamic inter-relationships among the companies; the inferred graph changes substantially in the two periods of interest and TD captures the market variability. To really enjoy the visualization potential offered by graphs as a tool, we show in Fig. 3c snapshots of the inferred time-varying graph at four different dates of interest. Common among the four graphs is the presence of the edge connecting MRNA and NVAX, and of the edge connecting AZN and SNY. The pharmaceutical companies associated to the endpoints of each of these two edges also show a similar trend in Fig. 3a. Notice moreover that, since the sparsity pattern of the precision matrix reveals conditional independence among the variables indexed by its zero entries, these graphs enable us to visually inspect such independence over time. Although the information encoded in these graphs may carry a financial significance, we leave this possible knowledge-discovery task out of this manuscript, to avoid misleading or erroneous interpretations.

TV-SEM for Temperature Monitoring. Data description: for this experiment we consider the publicly available weather dataset provided by the Irish Meteorological Service (available at https://www.met.ie/climate/available-data/historical-data), which contains hourly temperature (in °C) data from 25 stations across Ireland. We monitor the temperature evolution over the sensor network for the period January 2016 to May 2020, and leverage the TV-SEM to infer the time-varying features of the graph learned by the algorithm.

Results: for the analysis we consider T = 38713 measurements as graph signals {x_t} for the N = 25 stations under consideration, standardizing the data as done in the previous experiment; Fig. 4a depicts the standardized time series. It is interesting to notice the sinusoidal-like behavior of the aggregate time series, due to higher (lower) temperatures during the summer (winter) period, resulting in a smooth signal profile. Fig. 4b illustrates the sparsity pattern of the time-varying graph and the importance of the weights at every time instant. This learn-and-show feature offered by Algorithm 1 gives us the ability to visualize the learning behavior of the algorithm on-the-fly, a strength of low-cost iterative algorithms w.r.t. their batch counterparts. From the figure (and the observed almost-zero graph temporal deviation, which is not illustrated here) a consistent temporal homogeneity is visible, i.e., the graph does not change significantly over adjacent time instants. In other words, nodes influencing each other in a particular time instant are likely to influence each other in other time instants. A reasonable explanation is given by the smooth and regular pattern exhibited by the time series of Fig.
4a, which is a consequence of the meteorological similarity over time, and by their high correlation coefficient.

An interesting trend arises when observing the number of edges of the graph inferred over time, shown in Fig. 5a. Although in adjacent time instants the number does not change abruptly, a pattern can be identified over a longer time span. In particular, during winter and summer there is a sharp increment in the number of edges, compared to autumn and spring, where there is a significant reduction. To ease the visualization, the vertical red lines are placed in correspondence with the winter period of every year, while the blue lines correspond to the autumn period. A possible reason for this phenomenon is the reduced variability of the temperature among the stations during summer and winter, and a higher variability during spring and autumn, leading to different graphs. For the sake of visualization, we also report the inferred graphs for October 2016 (autumn) and January 2017 (winter). In line with our previous comments regarding Fig. 5a, a lower number of edges is visible in the autumn graph with respect to the winter graph; in particular, edges present in the autumn graph are also present in the winter one. Finally, notice how stations close in space tend to be connected, thus showing how stations close to each other exert a greater influence with respect to stations farther away in space.

TV-SBM for Epileptic Seizure Analysis. Data description: we use electrocorticography (ECoG) time series collected during an epilepsy study at the University of California, San Francisco (UCSF) Epilepsy Center, where an 8 × 8 grid of electrodes was implanted on the cortical brain surface of a 39-year-old woman with medically refractory complex partial seizures [40]. The grid was supplemented by two strips of six electrodes: one implanted deeper over the left suborbital frontal lobe and the other over the left hippocampal region, thus forming a network of 76 electrodes, all measuring the voltage level in the proximity of the electrode, which is an evidence of the local brain activity. The sampling rate is 400 Hz and the measured time series contains the 10-second interval preceding the seizure (pre-ictal interval) and the 10-second interval after the start of the seizure (ictal interval). Our goal is to leverage the TV-SBM in order to explore the dynamics among different brain areas at the seizure onset.

Results: for our analysis we consider T = 3200 time instants as graph signals {x_t} for the N = 76 electrodes, which are further filtered (over the temporal dimension) at {60, 180} Hz to remove the spurious power-line frequencies, and standardized as explained in the previous experiments. Fig. 6 shows the graph temporal deviation, where we observe an increasing and protracted variability of the TD shortly after the seizure onset (red vertical line), showing that TD serves as an indicator of network alteration suitable for time-varying scenarios. To visualize the on-the-fly learning behavior of the algorithm, in Fig. 7a we show the evolution of a fraction of the edge weights over time (for visualization, we show 500 random edges, recalling that the total number of possible edges in an undirected graph of N nodes is N(N − 1)/2). In the first half of the time horizon, we notice the presence of stronger edges with respect to the second half, where the graph is sparser. We show two snapshots of the time-varying graph in Fig.
Fig. 6 shows the graph temporal deviation, where we observe an increasing and protracted variability of the TD shortly after the seizure onset (red vertical line), showing that TD can serve as an indicator of network alteration suited to time-varying scenarios. To visualize the on-the-fly learning behavior of the algorithm, in Fig. 7a we show the evolution of a fraction of the edge weights over time (for visualization we show 500 randomly selected edges, recalling that the number of possible edges in an undirected graph with N nodes is N(N − 1)/2). In the first half of the time horizon, we notice the presence of stronger edges with respect to the second half, where the graph is sparser. We show two snapshots of the time-varying graph in Fig. 7b, for the time instants 1500 (pre-ictal) and 1800 (ictal), where we also report the closeness centrality of each node, which expresses how "close" a node is to all other nodes in the network (computed as the reciprocal of the average shortest-path distance from the node to every other node). During the ictal interval, the graph tends to be more disconnected and its nodes, especially in the lower part of the graph, exhibit lower closeness centrality values; both the number of (strong) edges and the closeness centrality drop in the ictal graph. This is consistent with the findings in [40] and indicates that, on average, signals in the pre-ictal interval behave more similarly to each other than signals in the ictal interval.

In this manuscript, we proposed an algorithmic template to learn time-varying graphs from streaming data. The abstract time-varying graph learning problem, in which the data influence is expressed through the empirical covariance matrix, is cast as a composite optimization problem with different terms regulating different desiderata. The framework, which works in non-stationary environments, builds upon novel iterative time-varying optimization algorithms, which on the one hand exhibit an implicit temporal regularization of the solution(s), and on the other hand accelerate convergence by taking the temporal variability into account. We specialize the framework to the Gaussian graphical model, the structural equation model, and the smoothness-based model, and we propose ad-hoc vectorization schemes for structured matrices, which are central to the gradient computations and also ease the storage requirements. The proposed approach is accompanied by theoretical guarantees on tracking the optimal time-varying solution, and is further validated with synthetic numerical results. Finally, we learn time-varying graphs in the context of stock markets, temperature monitoring, and epileptic seizure analysis. This line of work can be enriched by specializing the framework to other static graph learning methods present in the literature, possibly considering directed graphs, by implementing distributed versions of the optimization algorithms, and by applying the developed models to other real-world applications.

Appendix.

Consider the multi-valued function T : R^N → R^N, which we will refer to as an operator. Here, we briefly review some operator-theory concepts used in this manuscript; see [41].

Projection operator. Given a point x ∈ R^N, the projection of x onto the convex set C ⊆ R^N is defined as P_C(x) = arg min_{y ∈ C} ‖y − x‖_2.

Proximal operator. Consider the convex function g : R^N → R. The proximal operator of g(·), with penalty parameter ρ > 0, is defined as prox_{ρg}(x) = arg min_{y ∈ R^N} g(y) + (1/(2ρ)) ‖y − x‖_2^2. For some functions, the proximal operator admits a closed-form solution [35, Ch. 6]. In particular:
• if g(x) = ι_C(x), then prox_g(x) = P_C(x), i.e., the projection of x onto the convex set C;
• if g(x) = λ‖x‖_1, then prox_g(x) = sign(x) ⊙ [|x| − λ1]_+, i.e., the soft-thresholding operator.

Consider the convex minimization problem

min_{x ∈ R^N} f(x) + g(x),   (46)

with f, g : R^N → R convex. It can be shown that problem (46) admits at least one solution [42], which can be found through the fixed-point equation

x* = prox_{ρg}(x* − ρ ∇f(x*)).   (47)
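As an illustration of how the fixed point in (47) is typically approached in practice, the sketch below runs a plain forward-backward (proximal-gradient) iteration on a small ℓ1-regularized least-squares instance, using the soft-thresholding operator recalled above; the step size, problem data, and number of iterations are illustrative choices and not part of the proposed framework.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def forward_backward(grad_f, prox_g, x0, rho, n_iter=500):
    """Iterate x <- prox_g(x - rho * grad_f(x)), cf. the fixed point in (47)."""
    x = x0.copy()
    for _ in range(n_iter):
        x = prox_g(x - rho * grad_f(x))
    return x

# Toy instance: min_x 0.5 * ||A x - b||^2 + lam * ||x||_1.
rng = np.random.default_rng(2)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
lam = 0.5
rho = 1.0 / np.linalg.norm(A, 2) ** 2            # step size 1/L, with L = ||A||_2^2
x_hat = forward_backward(
    grad_f=lambda x: A.T @ (A @ x - b),
    prox_g=lambda x: soft_threshold(x, lam * rho),  # prox of lam*||.||_1 with penalty rho
    x0=np.zeros(10),
    rho=rho,
)
print(np.round(x_hat, 3))
```

A template of this form is what the operator-theoretic tools reviewed here are typically used for; in the proposed framework, the model-specific gradients and proximal operators would take the place of the toy objective used in this example.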
Proof of Claim 1: TV-GGM. Recall the expression of the Hessian in (21b), i.e., H(S) = D^⊤ (S ⊗ S)^{-1} D, and that the matrix S ∈ S is the precision matrix, with S = {S ∈ S^N_{++} : ξI ⪯ S ⪯ χI}. For the strong convexity, notice that since S ≻ 0, then also H(S) ≻ 0. Indeed, by exploiting the semi-orthogonality of the matrix D/√2, we have

λ_min(H(S)) = min_{‖x‖=1} x^⊤ D^⊤ (S ⊗ S)^{-1} D x,   (48)

which is bounded away from zero on S, since ξI ⪯ S ⪯ χI implies (S ⊗ S)^{-1} ⪰ χ^{-2} I. For the Lipschitz continuity of the gradient, the largest eigenvalue of H(S) is bounded analogously, using (S ⊗ S)^{-1} ⪯ ξ^{-2} I.

Proof of Claim 2: TV-SEM. Recall that the Hessian is Q_t = D_h^⊤ (Σ̂_t ⊗ I) D_h [cf. the TV-SEM complexity discussion below]. Exploiting the semi-orthogonality of the matrix D_h/√2, we have

λ_min(Q_t) = min_{‖x‖=1} x^⊤ D_h^⊤ (Σ̂_t ⊗ I) D_h x ≥ min_{‖y‖=1} y^⊤ (Σ̂_t ⊗ I) y = λ_min,

where λ_min is the smallest eigenvalue of Σ̂_t.

Proof of Claim 3: TV-SBM. For the strong convexity it suffices to notice that, for m > 0, f(s; t) − (m/2)‖s‖^2 = s^⊤ z_t − λ_2 1^⊤ log(Ks) + (λ_1 − m/2)‖s‖^2 is convex whenever λ_1 − m/2 ≥ 0. In turn, this implies that strong convexity of f(·; t) is guaranteed for 0 < m ≤ 2λ_1. For the Lipschitz continuity of the gradient, recall that the nodal degree vector satisfies d ≻ 0, and denote by d_min the minimum degree over the GSO search space. Also, recall the expression of the Hessian, H = K^⊤ Diag(1 ⊘ (Ks)^{∘2}) K, where ⊘ and (·)^{∘2} denote elementwise division and elementwise square. Then:

‖K^⊤ Diag(1 ⊘ (Ks)^{∘2}) K‖ ≤ ‖K‖^2 max(1 ⊘ (Ks)^{∘2}) = ‖K‖^2 d_min^{-2} = 2(N − 1) d_min^{-2},   (52)

where we made use of [19, Lemma 1] to bound ‖K‖^2.

The computational (arithmetic) complexity per iteration of Algorithm 1 is dominated by the rank-one covariance matrix update, which costs O(N^2), and by the method-specific gradient computations involved in the prediction and correction steps (and possibly the Hessian, if P > 1 [cf. Section VI, "Does prediction help?"]). These method-specific computational complexities are detailed next, together with a discussion of the costs incurred by the offline counterparts.

TV-GGM. The worst-case computational complexity of the gradient ∇_s f(s; t) in (21a) is O(N^3), which is due to the matrix inversion. This cost might be lowered by exploiting the sparsity pattern of the sparse triangular factor of S or, in our case, by exploiting the fact that S is a small perturbation of the previous iterate. The multiplication with the matrix D has a cost of O(N^2), since D ∈ R^{N^2 × N(N+1)/2} has at most two 1's in each column and exactly one 1 in each row. Explicitly forming the Hessian ∇_ss f(s; t) in (21b) would be more expensive. However, because the Hessian is only used in a matrix-vector multiplication [cf. (23)], its factorization leads to a cost for the prediction step of O(N^3). Indeed, exploiting the Kronecker product, the Hessian can be written as D^⊤ (S^{-1} ⊗ I_N)(I_N ⊗ S^{-1}) D; the multiplication of the Hessian by a vector then entails a succession of four sparse matrix-vector multiplications, each with a cost of at most O(N^3). The term ∇_ts f(s; t) in (22) has a computational complexity of O(N^2). Thus, the overall computational complexity per iteration is O(N^3).

TV-SEM. The overall cost is dominated by the computation of Q_t = D_h^⊤ (Σ̂_t ⊗ I) D_h, which appears in both the gradients and the Hessian. The matrix-matrix multiplications have a cost of O(N^3), since D_h ∈ R^{N^2 × N(N−1)/2} has at most two 1's in each column and exactly one 1 in each row. Thus, the overall computational complexity per iteration is O(N^3).

TV-SBM. Each column of K has exactly two non-zero entries (and each row has N − 1 non-zero entries); thus, computing Ks has a cost of 2|E|, with |E| the number of edges of the graph represented by S (in other words, ‖s‖_0). The operation K^⊤ (1 ⊘ (Ks)) has a cost of O(N^2). The computational complexity of the Hessian is O(N^3), since it is the weighted sum of N outer products of vectors which are (N − 1)-sparse in the same positions. Thus, the overall computational complexity per iteration is O(N^2) if P = 0 or P = 1, and O(N^3) if P > 1.
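To make the Kronecker-product argument used for the TV-GGM prediction step concrete, the sketch below applies (S^{-1} ⊗ I_N)(I_N ⊗ S^{-1}) to a vector without ever forming an N^2 × N^2 matrix, relying on the identity (A ⊗ B) vec(X) = vec(B X A^⊤); the dense linear solves are illustrative and ignore the additional sparse multiplications by D and D^⊤.

```python
import numpy as np

def vec(X):
    """Column-major (Fortran-order) vectorization, the convention under which
    (A kron B) vec(X) = vec(B X A^T) holds."""
    return X.reshape(-1, order="F")

def unvec(x, n):
    """Inverse of vec(.): reshape a length n*n vector into an n x n matrix."""
    return x.reshape(n, n, order="F")

def hess_core_matvec(S, w):
    """Apply (S^{-1} kron I)(I kron S^{-1}) to w without forming N^2 x N^2 matrices.

    Step 1: (I kron S^{-1}) vec(W) = vec(S^{-1} W)
    Step 2: (S^{-1} kron I) vec(Y) = vec(Y S^{-T}) = vec(Y S^{-1}), since S is symmetric.
    """
    N = S.shape[0]
    W = unvec(w, N)
    Y = np.linalg.solve(S, W)            # S^{-1} W  -- O(N^3) dense solve
    Z = np.linalg.solve(S.T, Y.T).T      # Y S^{-1}
    return vec(Z)

# Sanity check against the explicit Kronecker construction for a small N.
rng = np.random.default_rng(3)
N = 5
M = rng.standard_normal((N, N))
S = M @ M.T + N * np.eye(N)              # symmetric positive definite "precision" matrix
w = rng.standard_normal(N * N)
S_inv = np.linalg.inv(S)
explicit = np.kron(S_inv, np.eye(N)) @ (np.kron(np.eye(N), S_inv) @ w)
assert np.allclose(hess_core_matvec(S, w), explicit)
print("Kronecker-free Hessian-vector product matches the explicit construction.")
```

Each dense solve costs O(N^3) in the worst case, which is consistent with the per-iteration complexity discussed above.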
Offline. The computational complexity, for each time instant t, incurred by an offline solver tackling instances of problem (13) depends on its algorithm-specific implementation, which is closely related to the problem structure. The three problems we consider are (converted into) semidefinite programs (SDPs) and solved, in our case, by SDPT3, a Matlab implementation of infeasible primal-dual path-following algorithms, which involves the computation of second-order information. Since these computations are repeated, for a fixed time instant t, until the algorithm converges (say, after I iterations), a trivial lower bound on the cost of computing the offline solution for the three considered problems is Ω(I N^3). To this cost, one must also add the cost of other solver-specific steps, which we do not explicitly consider here.

References
Online time-varying topology identification via prediction-correction algorithms
Advances in distributed graph filtering
Connecting the dots: Identifying network structure via graph signal processing
Learning graphs from data: A signal representation perspective
Inference of dynamic networks using time-course data
Hierarchical structure in financial markets
Discrete signal processing on graphs
Learning time varying graphs
Time-varying graph learning with constraints on graph temporal variation
Network inference via the time-varying graphical lasso
Sparse inverse covariance estimation with the graphical lasso
Tracking switched dynamic network topologies from information cascades
Online non-linear topology identification from graph-connected time series
Topology identification and learning over graphs: Accounting for nonlinearities and dynamics
Online graph learning from sequential data
Online topology inference from streaming stationary graph signals with partial connectivity information
Stationary graph processes and spectral estimation
Online topology identification from vector autoregressive time series
Online discriminative graph learning from multi-class smooth signals
Time-varying convex optimization: Time-structured algorithms and applications
The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains
Covariance selection
Structural equation modeling
How to learn a graph from smooth signals
A unified framework for structured graph learning via spectral constraints
A class of prediction-correction methods for time-varying convex optimization
Régularisation d'inéquations variationnelles par approximations successives, Rev. Française Informat.
Strong and weak convexity of sets and functions
Matrix differential calculus with applications in statistics and econometrics
Primal and dual prediction-correction methods for time-varying convex optimization
Primer on monotone operator methods
Proximal splitting methods in signal processing, in Fixed-Point Algorithms for Inverse Problems in Science and Engineering
Extremal eigenvalues of real symmetric matrices with entries in an interval
A sharp upper bound on the spectral radius of weighted graphs
First-order methods in optimization
CVX: Matlab software for disciplined convex programming
GSPBOX: A toolbox for signal processing on graphs
Learning Laplacian matrix in smooth graph signal representations
Emergent network topology at seizure onset in humans
Convex analysis and monotone operator theory in Hilbert spaces
Signal recovery by proximal forward-backward splitting