Online Topology Inference from Streaming Stationary Graph Signals with Partial Connectivity Information

Rasoul Shafipour and Gonzalo Mateos

July 7, 2020

Abstract. We develop online graph learning algorithms from streaming network data. Our goal is to track the (possibly) time-varying network topology, and to effect memory and computational savings by processing the data on-the-fly as they are acquired. The setup entails observations modeled as stationary graph signals generated by local diffusion dynamics on the unknown network. Moreover, we may have a priori information on the presence or absence of a few edges, as in the link prediction problem. The stationarity assumption implies that the observations' covariance matrix and the so-called graph shift operator (GSO, a matrix encoding the graph topology) commute under mild requirements. This motivates formulating the topology inference task as an inverse problem, whereby one searches for a sparse GSO that is structurally admissible and approximately commutes with the observations' empirical covariance matrix. For streaming data said covariance can be updated recursively, and we show online proximal gradient iterations can be brought to bear to efficiently track the time-varying solution of the inverse problem with quantifiable guarantees. Specifically, we derive conditions under which the GSO recovery cost is strongly convex, and use this property to prove that the online algorithm converges to within a neighborhood of the optimal time-varying batch solution. Numerical tests illustrate the effectiveness of the proposed graph learning approach in adapting to streaming information and tracking changes in the sought dynamic network.

Network data supported on the vertices of a graph G representing pairwise interactions among entities are nowadays ubiquitous across disciplines spanning engineering as well as the social and bio-behavioral sciences; see e.g., [1, Ch. 1]. Such data can be conceptualized as graph signals, namely high-dimensional vectors with correlated entries indexed by the nodes of G. Graph-supported signals abound in real-world applications of complex systems, including vehicle congestion levels over road networks, neurological activity signals supported on brain connectivity networks, COVID-19 incidence in different geographical regions linked via mobility or social graphs, and fake news that spread on online social networks. Efficiently mining information from unprecedented volumes of network data promises to prevent or limit the spread of epidemics and diseases, identify trends in financial markets, reveal the dynamics of emergent social-computational systems, and protect critical infrastructure including the smart grid and the Internet's backbone network [2]. In this context, the goal of graph signal processing (GSP) is to develop information processing algorithms that fruitfully exploit the relational structure of said network data [3]. However, oftentimes G is not readily available, and a first key step is to use nodal observations (i.e., measurements of graph signals) to identify the underlying network structure, or a useful graph model that facilitates signal representations and downstream learning tasks; see [4, 5] for recent tutorials on graph learning with a signal processing flavor and [1, Ch. 7] for a statistical treatment.
Recognizing that many of these networks are not only unknown but also dynamic, there is a growing need to develop online graph learning algorithms that can process network information streams in an efficient fashion.

Consider a weighted, undirected graph G consisting of a node set N of cardinality N and a symmetric adjacency matrix $A \in \mathbb{R}_+^{N \times N}$, with entry $A_{ij} = A_{ji} \neq 0$ denoting the edge weight between nodes i and j. We assume that G contains no self-loops, i.e., $A_{ii} = 0$. The adjacency matrix encodes the topology of G. One could generically define a graph-shift operator (GSO) $S \in \mathbb{R}^{N \times N}$ as any matrix capturing the same sparsity pattern as A on its off-diagonal entries. Beyond A, common choices for S are the combinatorial Laplacian $L := \mathrm{diag}(A\mathbf{1}) - A$ as well as their normalized counterparts [3]. Henceforth we focus on S = A and aim to recover the sparse adjacency matrix of the unknown graph G. Other GSOs can be accommodated in a similar fashion. The graph can be dynamic with a slowly time-varying GSO $S_t$, t = 1, 2, ... (see also Section 3), but for now we omit any form of temporal dependency to simplify exposition.

Our goal is to develop an online framework to estimate sparse graphs that explain the structure of a class of streaming random signals. At some time instant, let $y = [y_1, \ldots, y_N]^T \in \mathbb{R}^N$ be a zero-mean graph signal in which the ith element $y_i$ denotes the signal value at node i of the unknown graph G with shift operator S. Further consider a zero-mean white signal x. We state that the graph S represents the structure of the signal $y \in \mathbb{R}^N$ if there exists a diffusion process in the GSO S that produces the signal y from the input signal x [6], that is

    y = \alpha_0 \prod_{l=1}^{\infty} (I_N - \alpha_l S)\, x = \sum_{l=0}^{\infty} \beta_l S^l x.    (1)

Under the assumption that $C_x = \mathbb{E}[xx^T] = I_N$ (identity matrix), (1) is equivalent to the stationarity of y in S; see e.g., [7, Def. 1], [8], [9]. The product and sum representations in (1) are common (and equivalent) models for the generation of random network processes. Indeed, any process that can be understood as the linear propagation of a white input through G can be written in the form in (1), which subsumes heat diffusion, consensus, and the classic DeGroot model of opinion dynamics as special cases. The justification for saying that S represents the structure of y is that we can think of the edges of G, i.e., those few non-zero entries in S, as direct (one-hop) relations between the elements of the signal. The diffusion in (1) modifies the original correlation by inducing indirect (multi-hop) relations in $C_y = \mathbb{E}[yy^T]$.

At a high level, the problem studied in this paper is the following [6, 10].

Problem. Recover the direct relations described by a sparse GSO S from a set $\{y_t\}_{t=1}^T$ of T independent samples of the random signal y adhering to (1).

We consider the challenging setup where we have no knowledge of the diffusion coefficients $\{\alpha_l\}$ (or $\{\beta_l\}$), nor do we get to observe the specific realizations of the driving inputs $\{x_t\}_{t=1}^T$. The stated inverse problem is severely underdetermined and nonconvex. It is underdetermined because for every observation y we have the same number of unknowns in the input x, on top of the unknown diffusion coefficients and the shift S, the latter being the quantity of interest. The problem is nonconvex because the observations depend on the product of our unknowns and, notably, on powers of S [cf. (1)]. Our main contribution is to develop novel algorithms to address these technical challenges in several previously unexplored settings.
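As a concrete illustration of (1), the following NumPy sketch generates stationary observations by diffusing white inputs through a low-order polynomial of a random GSO. Graph size, edge density, and filter coefficients are all illustrative choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 20, 10000

# Random sparse symmetric adjacency (the GSO S = A), no self-loops.
S = np.triu((rng.random((N, N)) < 0.2).astype(float), k=1)
S = S + S.T

# Finite-order instance of the sum representation in (1): H = sum_l beta_l S^l.
beta = rng.random(3)
H = beta[0] * np.eye(N) + beta[1] * S + beta[2] * (S @ S)

# White inputs x (C_x = I_N) diffused over the graph yield stationary y = H x.
X = rng.standard_normal((N, T))
Y = H @ X

# Empirical covariance; only Y (not X, beta, or S) is observed in practice.
C_hat = (Y @ Y.T) / T
```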
We tackle the network topology inference problem of the previous section in two fundamental settings. We start in Section 2 with the batch (offline) case, where the observations $\{y_t\}_{t=1}^T$ are all available for joint processing as in [6, 10]. Here, though, we exploit the implications of stationarity from a different angle, which leads to a formulation amenable to efficient solutions via proximal gradient (PG) algorithms; see e.g., [11, 12]. Specifically, as we show in Section 2.1, stationarity implies that the covariance matrix $C_y$ of the observations commutes with S under mild requirements; see also [7]. This motivates formulating the topology inference task as an inverse problem, whereby one searches for a sparse S that is structurally admissible and approximately commutes with the observations' empirical covariance matrix $\hat{C}_y$. In [6, 10], the algorithmic issues were not thoroughly studied, and the solutions relied mostly on off-the-shelf convex solvers whose scalability could be challenged. Unlike [13] but similar to link prediction problems [1, Ch. 7.2], [14], here we rely on a priori knowledge about the presence (or absence) of a few edges, conceivably leading to simpler algorithmic updates and better recovery performance. We may learn about edge status via limited questionnaires and experiments, or we could perform edge screening prior to topology inference [15]. The batch PG algorithm developed in Section 2.3 to recover the network topology is computationally more efficient than existing methods for this (and related) problems. It also serves as a fundamental building block to construct online iterations, which constitutes the second main contribution of this paper.

Focus in Section 3 shifts to online recovery of S from streaming signals $\{y_1, \ldots, y_t, y_{t+1}, \ldots\}$, each of them adhering to the generative model in (1). For streaming data the empirical covariance estimate $\hat{C}_{y,t}$ can be updated recursively, and in Section 3.1 we show that online PG iterations can be brought to bear to efficiently track the time-varying solution of the inverse problem with quantifiable (non-asymptotic) guarantees. Different from [13], the updates are simple and devoid of multiple inner loops to compute expensive projections. Moreover, we establish convergence to within a neighborhood of the optimal time-varying batch solution, as well as dynamic regret bounds (Section 3.2), by adapting results in [16]. The algorithm and analyses of this paper are valid even for dynamic networks, i.e., if the GSO $S_t$ in (1) is (slowly) time-varying. Indeed, we examine how the variability and eigenvectors of the underlying graph, as well as the diffusion filters' frequency response, influence the size of the convergence radius (or misadjustment, in the adaptive filtering parlance). Numerical tests in Section 4 corroborate the efficiency and effectiveness of the proposed algorithm in adapting to streaming information and tracking changes in the sought dynamic network. Concluding remarks are given in Section 5.

Early topology inference approaches in the statistics literature can be traced back to the problem of (undirected) graphical model selection [1, Ch. 7], [4, 5]. Under Gaussianity assumptions, this line of work has well-documented connections with covariance selection [17] and sparse precision matrix estimation [18-20], as well as neighborhood-based sparse linear regression [21].
Recent GSP-based network inference frameworks postulate that the network exists as a latent underlying structure, and that observations are generated as a result of a network process defined on such a graph [6, 10, 22-25]. Different from [22, 24, 26, 27], which infer structure from signals assumed to be smooth over the sought undirected graph, here the measurements are assumed related to the graph via filtering [cf. (1) and the opening discussion in Section 2]. A few works have recently built on this rationale to identify a symmetric GSO given its eigenvectors, either assuming that the input is white [6, 10], equivalently implying y is graph stationary [7-9], or colored [28, 29]. Unlike prior online algorithms developed based on the aforementioned graph spectral domain design [13, 14], here we estimate the (possibly) time-varying GSO directly (without tracking its eigenvectors) and derive quantifiable recovery guarantees; see Remark 2. Recent algorithms for identifying topologies of time-varying graphs [30, 31] operate in batch mode; they are non-recursive, and hence their computational complexity grows linearly with time. While we assume that the graph signals are stationary, the online graph learning scheme in [32] uses observations from a Laplacian-based, continuous-time graph process. Unlike [33], which relies on a single-pole graph filter [34], the filter structure underlying (1) can be arbitrary, but the focus here is on learning undirected graphs. Online PG methods were adopted for directed graph inference under dynamic structural equation models [35], but lacking a formal performance analysis. The interested reader is referred to [36] for a comprehensive survey about topology identification of dynamic graphs with directional links.

The recovery guarantees in Section 3.2 are adapted from the results in [16], obtained therein in the context of online sparse subspace clustering. Convergence and dynamic regret analysis techniques have been recently developed to study solutions of time-varying convex optimization problems; see [37] for a timely survey of this body of work. The impact of these optimization advances on dynamic network topology identification is yet to fully materialize, and this paper offers a first exploration in this direction. In closing, we note that information processing from streams of network data has been considered in recent work that falls under the umbrella of adaptive GSP [38, 39]. Most existing results pertain to online reconstruction of partially-observed streaming graph signals, which are assumed to be bandlimited with respect to a known graph. An exception is [40], which puts forth an online algorithm for joint inference of the dynamic network topology and processes from partial nodal observations.

Notation: The entries of a matrix X and a (column) vector x are denoted by $X_{ij}$ and $x_i$, respectively. Sets are represented by calligraphic capital letters, and $\mathbb{R}_+$ denotes the non-negative real numbers. The superscript $T$ stands for matrix or vector transposition; $\mathbf{0}$ and $\mathbf{1}$ refer to the all-zero and all-one vectors; and $I_N$ denotes the $N \times N$ identity matrix. For a vector x, $\mathrm{diag}(x)$ is a diagonal matrix whose ith diagonal entry is $x_i$. The operators $\otimes$, $\odot$, $\circ$, and $\mathrm{vec}(\cdot)$ stand for Kronecker product, Khatri-Rao (column-wise Kronecker) product, Hadamard (entry-wise) product, and matrix vectorization, respectively. The spectral radius of matrix X is denoted by $\lambda_{\max}(X)$, and $\|X\|_p$ stands for the $\ell_p$ norm of $\mathrm{vec}(X)$. Lastly, for a function f, denote $\|f\|_\infty = \sup_X |f(X)|$.
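Several of these operators combine in identities used later on; in particular, the feasibility analysis below relies on $\mathrm{vec}(V \Lambda V^T) = (V \odot V)\lambda$. The quick NumPy check that follows (sizes and the einsum-based Khatri-Rao construction are our own illustrative choices) confirms the identity under column-major vectorization.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 4
V = rng.standard_normal((N, N))
lam = rng.standard_normal(N)

# Khatri-Rao (column-wise Kronecker) product W = V (.) V: column k is v_k kron v_k.
W = np.einsum("ik,jk->ijk", V, V).reshape(N * N, N)

# Identity vec(V diag(lam) V^T) = (V (.) V) lam, with column-major vec(.).
Sym = V @ np.diag(lam) @ V.T
print(np.allclose(Sym.flatten(order="F"), W @ lam))  # True
```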
We start with batch (offline) estimation of a time-invariant network topology, as considered in [6]. This context is useful to better appreciate the alternative formulation in Section 2.1 and the proposed algorithm in Section 2.3. Together, these two elements will be instrumental in transitioning to the online setting studied in Section 3.

To state the problem, recall the symmetric GSO $S \in \mathbb{R}^{N \times N}$ associated with the undirected graph G [3, 34]. The Cayley-Hamilton theorem asserts that the model in (1) can be equivalently reparameterized as

    y = Hx, \quad \text{with} \quad H := \sum_{l=0}^{L-1} h_l S^l,    (2)

for some particular coefficients h and filter order $L \leq N$. Since S is a local (one-hop) diffusion operator, L specifies the (multi-hop) range of vertex interactions when generating y from x. It should now be clear that we are constraining ourselves to observations of signals generated via graph filtering. As we explain next, this assumption leads to a tight coupling between S and the second-order statistics of y.

The covariance matrix of y = Hx is given by [recall (2) and $C_x = I_N$]

    C_y = \mathbb{E}[yy^T] = H\,\mathbb{E}[xx^T]\,H^T = H H^T = H^2.    (3)

We relied on the symmetry of H to obtain the third equality, as H is a polynomial in the symmetric GSO S. Using the spectral decomposition $S = V \Lambda V^T$ to express the filter as $H = \sum_{l=0}^{L-1} h_l (V\Lambda V^T)^l = V(\sum_{l=0}^{L-1} h_l \Lambda^l)V^T$, we can readily diagonalize the covariance matrix in (3) as

    C_y = V \Big(\sum_{l=0}^{L-1} h_l \Lambda^l\Big)^2 V^T.    (4)

Such a covariance expression is the requirement for a graph signal to be stationary in S. In other words, if y is graph stationary (equivalently, if $C_x = I_N$), (4) shows that the eigenvectors of the shift S, the filter H, and the covariance $C_y$ are all the same. Thus, given a batch of observations $\{y_t\}_{t=1}^T$ comprising independent realizations of (4), the approach in [6] advocates: (i) forming the sample covariance $\hat{C}_y = \frac{1}{T}\sum_{t=1}^T y_t y_t^T$ and extracting its eigenvectors $\hat{V}$ as spectral templates of G; then (ii) recovering an S that is optimal in some sense by estimating its eigenvalues $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$. Namely, one solves the inverse problem

    \min_{S \in \mathcal{S},\, \Lambda}\; f(S) \quad \text{subject to} \quad d(S, \hat{V}\Lambda\hat{V}^T) \leq \epsilon,    (5)

which is convex for appropriate choices of the function f(S), the constraint set $\mathcal{S}$, and the matrix distance $d(\cdot,\cdot)$. The tuning parameter $\epsilon > 0$ accounts for the (finite) sample size-induced errors in estimating V, and could be set to zero if the eigenvectors were perfectly known. The formulation (5) entails a general class of network topology inference problems parameterized by the choices of f, d, and $\mathcal{S}$; see [6] for additional details on various application-dependent instances. Particular cases of (5) with $\epsilon = 0$ were independently studied in [10]. In this paper we consider a different formulation from (5), which is amenable to efficient algorithms. For concreteness, we also make specific choices of f, d, and $\mathcal{S}$, as described next.

As a different take on the shared eigenspace property, observe that stationarity of y also implies $C_y S = S C_y$, thus forcing the covariance $C_y$ to be a polynomial in S [cf. (3)]. This commutation identity holds under the pragmatic assumption that all the eigenvalues of S are simple and $\sum_{l=0}^{L-1} h_l \lambda_i^l \neq 0$ for $i = 1, \ldots, N$. It is also a natural characterization of (second-order) graph stationarity, requiring that second-order statistical information be shift invariant [7]. Since one can only estimate $\hat{C}_y$ from the data $\{y_t\}_{t=1}^T$, our idea to recover a sparse GSO is to solve [cf. (5)]

    \min_{S \in \mathcal{S}}\; \|S\|_1 \quad \text{subject to} \quad \|S\hat{C}_y - \hat{C}_y S\|_F^2 \leq \epsilon.    (6)

The inverse problem (6) is intuitive. The constraints are such that one searches for an admissible $S \in \mathcal{S}$ that approximately commutes with the observations' empirical covariance matrix $\hat{C}_y$.
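The commutation rationale behind (6) is easy to verify numerically. In the sketch below (a continuation of the toy setup used earlier, with illustrative sizes), the ensemble covariance $C_y = H^2$ commutes with S up to machine precision, while the empirical covariance only approximately commutes, which is precisely what the tolerance $\epsilon$ absorbs.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 20, 10000
S = np.triu((rng.random((N, N)) < 0.2).astype(float), k=1)
S = S + S.T
h = rng.random(3)
H = h[0] * np.eye(N) + h[1] * S + h[2] * (S @ S)

C_y = H @ H                        # ensemble covariance (3), since C_x = I_N
Y = H @ rng.standard_normal((N, T))
C_hat = (Y @ Y.T) / T              # empirical covariance

print(np.linalg.norm(S @ C_y - C_y @ S, "fro"))     # ~0: exact commutation
print(np.linalg.norm(S @ C_hat - C_hat @ S, "fro"))  # finite-sample residual
```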
To recover a sparse graph we minimize the $\ell_1$-norm of S, the workhorse convex proxy of the intractable cardinality function $\|S\|_0$, which counts the number of non-zero entries (edges) in the GSO. Different from (5) in [6], the formulation (6) circumvents computation of eigenvectors (an additional source of errors beyond covariance estimation) and reduces the number of optimization variables. More importantly, as we show in Section 2.3, it offers favorable structure to invoke a proximal gradient solver with convergence guarantees, even in an online setting where the signals $y_t$ arrive in a streaming fashion.

In closing, we elaborate on $\mathcal{S}$ as the means to enforce that the GSO is structurally admissible, as well as to incorporate a priori knowledge about S. Henceforth we let S = A represent the adjacency matrix of an undirected graph with non-negative weights ($S_{ij} = S_{ji} \geq 0$) and no self-loops ($S_{ii} = 0$). Unlike [6, 10], we investigate the effect of having additional prior information about the presence (or absence) of a few edges, or even their corresponding weights. This is well motivated in studies where one can afford to explicitly measure a few of the pairwise relationships among the objects in N. Examples include social network studies involving questionnaire-based random sampling designs [1, Ch. 5.3], or experimental testing of suspected regulatory interactions among selected pairs of genes [1, Ch. 7.3.4]. Since (6) is a sparse linear regression problem, one could also resort to so-termed variable screening techniques to drop edges prior to solving the optimization; see e.g., [15]. Either way, one ends up with extra constraints $S_{ij} = s_{ij}$ for a few vertex pairs (i, j) in the set $\Omega \subset N \times N$ of observed edge weights $s_{ij}$. Accordingly, we can write the convex set of admissible adjacency matrices as

    \mathcal{S} := \{ S \in \mathbb{R}^{N \times N} : S = S^T,\; S_{ii} = 0,\; S_{ij} \geq 0,\; S_{ij} = s_{ij} \text{ for } (i,j) \in \Omega \}.    (7)

Other GSOs can be accommodated using appropriate modifications.

Here we examine the feasibility set and the degrees of freedom of the GSO S under the assumption that the perfect output covariance $C_y$ is available and $S \in \mathcal{S}$ [cf. (7)]. This sheds light on how the feasibility set's structure and dimensionality (hence the difficulty of recovering S) depend on the number $|\Omega|$ of observed edges. This dependency can serve as a guideline for the number of edges we may seek to observe prior to graph inference. As we show next, the feasibility set may potentially reduce to a singleton (the graph $S \in \mathcal{S}$ is completely specified by $C_y$), or, more generally, to a low-dimensional subspace. In the latter (more interesting) case, or more pragmatically when we approximate $C_y$ with the observations' sample covariance $\hat{C}_y$, we solve the convex optimization problem (6) to search for a sparse and structurally admissible GSO.

Given the (ensemble) output covariance $C_y$, consider the identity $SC_y = C_y S$, which stems from the stationarity of y (cf. the opening discussion in Section 2.1). This identity implies that $C_y$ and S are simultaneously diagonalizable; thus the eigenvectors V of S are known and coincide with those of $C_y$. Given the GSO eigenvectors V, consider the mapping $S = V\Lambda V^T$ between S and $\Lambda$. This can be precisely rewritten as $\mathrm{vec}(S) = (V \odot V)\lambda = W\lambda$, where $\odot$ denotes the Khatri-Rao (column-wise Kronecker) product, $\lambda \in \mathbb{R}^N$ collects the diagonal entries of $\Lambda$, and $W := V \odot V \in \mathbb{R}^{N^2 \times N}$. Recall that $S \in \mathcal{S}$, so the entries of $\mathrm{vec}(S)$ corresponding to the diagonal of S are zero. Upon defining the index set $D := \{N(i-1)+i \mid i \in \{1, \ldots, N\}\}$ of those diagonal entries, we obtain the mapping $W_D\lambda = \mathbf{0}$ to the null diagonal entries of S, where $W_D \in \mathbb{R}^{N \times N}$ is the submatrix of W that contains the rows indexed by D.
Thus, it follows that $W_D$ is rank-deficient and $\lambda \in \ker(W_D)$, where $\ker(\cdot)$ denotes the null space of its matrix argument. In particular, assume that $\mathrm{rank}(W_D) = N - k$, $1 \leq k < N$, which implies that $\lambda$ lives in a k-dimensional subspace. As some of the entries of S are known according to $\mathcal{S}$, intuitively we expect that by observing $k = |\Omega|$ "sufficiently different" edges, the feasible set may boil down to a singleton, resulting in a unique feasible $S \in \mathcal{S}$.

To quantify the effect of the partial connectivity constraints on the size of the feasible set, let $M := \{N(j-1)+i \mid (i,j) \in \Omega\}$ correspond to the known entries of $\mathrm{vec}(S)$. Then, upon defining $U \in \mathbb{R}^{N \times k}$ comprising the basis vectors of $\ker(W_D)$, the condition $\mathrm{rank}(W_M U) = k$ is sufficient to determine S uniquely in the k-dimensional null space of $W_D$, as stated in the following proposition.

Proposition 1. Suppose that the GSO eigenvectors V are given and that $\mathrm{rank}(W_D) = N - k$ for some $1 \leq k < N$. If $\mathrm{rank}(W_M U) = k$, then $S \in \mathcal{S}$ is uniquely determined by $C_y$ and the observed edge weights $\{s_{ij}\}_{(i,j) \in \Omega}$.

Proposition 1 further implies that $\mathrm{rank}(W_M) \geq k$ under the assumption that $\mathrm{rank}(W_M U) = k$. This is due to the inequality $\mathrm{rank}(W_M U) \leq \min\{\mathrm{rank}(W_M), \mathrm{rank}(U)\}$. Observing k "sufficiently different" edges for unique recovery of S is the intuition behind the rank condition on $W_M$. In real-world graphs, we have observed that k is typically much smaller than N; see also Section 4 and [6, Section 3]. This makes it feasible to uniquely identify the graph given only its eigenvectors and the status of k edges. However, in practice we may not know the status of that many edges, or the output covariance $C_y$ may only be imperfectly estimated via the sample covariance matrix $\hat{C}_y$. This motivates searching for an optimal graph while accounting for the (finite sample size) approximation errors and the prescribed structural constraints, the subject dealt with next.
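The quantities in Proposition 1 are straightforward to compute for a given graph. The sketch below is our own illustrative check, with an arbitrary random graph and numerical rank thresholds: it forms $W = V \odot V$, estimates $k = N - \mathrm{rank}(W_D)$, and tests whether k observed edges satisfy the rank condition; for most draws the final check evaluates True.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10
S = np.triu((rng.random((N, N)) < 0.3).astype(float), k=1)
S = S + S.T

lam_true, V = np.linalg.eigh(S)
W = np.einsum("ik,jk->ijk", V, V).reshape(N * N, N)   # W = V (.) V

D = [i * N + i for i in range(N)]                     # vec-indices of the diagonal
_, sv, Vt = np.linalg.svd(W[D, :])
k = N - int(np.sum(sv > 1e-10))                       # dim ker(W_D); here k >= 1
U = Vt[N - k:, :].T                                   # basis of ker(W_D), N x k
print(np.allclose(W[D, :] @ lam_true, 0))             # True: lambda in ker(W_D)

# Observe k edges; Proposition 1 says rank(W_M U) = k pins S down uniquely.
edges = np.transpose(np.nonzero(np.triu(S, k=1)))[:k]
M = [i * N + j for i, j in edges]                     # vec-indices of known entries
print(np.linalg.matrix_rank(W[M, :] @ U) == k)        # True for "different" edges
```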
Exploiting the problem structure in (6), a batch proximal gradient (PG) algorithm is developed in this section to recover the network topology; see [11] for a comprehensive tutorial treatment of proximal methods. Based on this module, an online algorithm for tracking the (possibly dynamically-evolving) GSO from streaming graph signals is obtained in Section 3. PG methods have been popularized for $\ell_1$-norm regularized linear regression problems through the class of iterative shrinkage-thresholding algorithms (ISTA); see e.g., [41-43]. The main advantage of ISTA over off-the-shelf interior point methods is its computational simplicity. We show that this desirable feature permeates naturally to the topology identification context of this paper, addressing the open algorithmic questions in [6, Sec. IV].

To make the graph learning problem amenable to this optimization method, we dualize the constraint $\|S\hat{C}_y - \hat{C}_y S\|_F^2 \leq \epsilon$ in (6) and write the composite, non-smooth optimization

    \min_{S \in \mathcal{S}}\; F(S) := \|S\|_1 + g(S), \quad g(S) := \frac{\mu}{2}\|S\hat{C}_y - \hat{C}_y S\|_F^2.    (8)

The quadratic function $g(\cdot)$ is convex and M-smooth [i.e., $\nabla g(\cdot)$ is M-Lipschitz continuous], and $\mu > 0$ is a tuning parameter. Notice that (8) and (6) are equivalent optimization problems, since for each $\epsilon$ there exists a value of $\mu$ such that the respective minimizers coincide.

To derive the PG iterations, first notice that the gradient of g(S) in (8) has the simple form

    \nabla g(S) = \mu\big[(S\hat{C}_y - \hat{C}_y S)\hat{C}_y - \hat{C}_y(S\hat{C}_y - \hat{C}_y S)\big],    (9)

which is Lipschitz continuous with constant $M = 4\mu\lambda_{\max}^2(\hat{C}_y)$. With $\alpha > 0$ and $\mathcal{S}$ a convex set, introduce the proximal operator of a function $\alpha f(\cdot): \mathbb{R}^{N \times N} \to \mathbb{R}$ evaluated at the matrix $M \in \mathbb{R}^{N \times N}$ as

    \mathrm{prox}_{\alpha f, \mathcal{S}}(M) := \arg\min_{X \in \mathcal{S}} \Big\{ \alpha f(X) + \tfrac{1}{2}\|X - M\|_F^2 \Big\}.    (10)

With these definitions, the PG updates with fixed step size $\gamma < 2/M$ to solve the batch problem (8) are given by (k = 1, 2, ... denote iterations)

    S_{k+1} = \mathrm{prox}_{\gamma\|\cdot\|_1, \mathcal{S}}\big(S_k - \gamma\nabla g(S_k)\big).    (11)

It follows that GSO refinements are obtained through the composition of a gradient-descent step and a proximal operator. Instead of directly optimizing the composite cost in (8), the PG update rule in (11) can be interpreted as the result of minimizing a quadratic overestimator of F(S) at judiciously chosen points (here the current iterate $S_k$); see [12] for a detailed justification.

Algorithm 1: Topology inference via batch PG iterations.
Require: $\hat{C}_y$, $\mu > 0$.
1: Initialize $S_0 \neq 0$ as a sparse, random symmetric matrix, $\gamma = 1/[4\mu\lambda_{\max}^2(\hat{C}_y)]$, k = 0.
2: while not converged do
3:   Compute $\nabla g(S_k) = \mu[(S_k\hat{C}_y - \hat{C}_y S_k)\hat{C}_y - \hat{C}_y(S_k\hat{C}_y - \hat{C}_y S_k)]$.
4:   Take gradient descent step $D_k = S_k - \gamma\nabla g(S_k)$.
5:   Update $S_{k+1} = \mathrm{prox}_{\gamma\|\cdot\|_1, \mathcal{S}}(D_k)$ via the proximal operator in (12).
6:   k = k + 1.
7: end while
8: return $S^\star = S_k$.

Evaluating the proximal operator efficiently is key to the success of PG methods. For our specific case of sparse graph learning with partial connectivity information, i.e., $\mathcal{S}$ as defined in (7) and $f(S) = \|S\|_1$, the proximal operator $Z := \mathrm{prox}_{\gamma\|\cdot\|_1, \mathcal{S}}(M)$ in (10) has entries given by

    Z_{ij} = \begin{cases} 0, & i = j, \\ s_{ij}, & (i,j) \in \Omega, \\ \max(0, M_{ij} - \gamma), & \text{otherwise}. \end{cases}    (12)

The resulting entry-wise separable nonlinear map nulls the diagonal entries of $S_{k+1}$, sets the edge weights corresponding to $\Omega$ to the observed values $s_{ij}$, and applies a non-negative soft-thresholding operator to update the remaining entries. Note that for symmetric S, the gradient (9) is also a symmetric matrix. So if the algorithm is initialized with, say, a very sparse random symmetric matrix, then all subsequent iterates $S_k$, $k \geq 1$, will be symmetric without requiring extra projections.

The resulting iterations are tabulated under Algorithm 1, which will serve as the basis for the online algorithm in the next section. The computational complexity is dominated by the gradient evaluation in (9), incurring a cost of $O(\|S_k\|_0 N^2)$ per iteration k. The iterates $S_k$ tend to become (and remain) quite sparse at early stages of the algorithm by virtue of the soft-thresholding operations (a sparse initialization is useful to this end), hence reducing the complexity of the matrix products in (9). As $k \to \infty$, the sequence of iterates (11) provably converges to a minimizer $S^\star$ [cf. (8)]; see e.g., [44]. Moreover, $F(S_k) - F(S^\star) \to 0$ due to the continuity of the composite objective function $F(\cdot)$; a remark on the convergence rate is now in order. It is well known that convergence speedups can be obtained through the so-termed accelerated (A)PG algorithm; see [12] for specifics in the context of ISTA that are also applicable here. Without increasing the computational complexity of Algorithm 1, one can readily devise an accelerated variant with a (worst-case) convergence rate guarantee of $O(1/\sqrt{\varepsilon})$ iterations to return an $\varepsilon$-optimal solution measured by the objective value $F(\cdot)$ [cf. Algorithm 1 instead offering an $O(1/\varepsilon)$ rate].
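A compact NumPy rendition of Algorithm 1 follows. It is a sketch rather than the authors' reference implementation; the defaults for mu, the iteration budget, and the stopping tolerance are illustrative, and observed edges are passed as a dict mapping (i, j) pairs to weights $s_{ij}$.

```python
import numpy as np

def grad_g(S, C, mu):
    """Gradient (9) of g(S) = (mu/2) * ||S C - C S||_F^2."""
    R = S @ C - C @ S
    return mu * (R @ C - C @ R)

def prox_S(M, gamma, obs):
    """Proximal operator (12): non-negative soft-thresholding, zero diagonal,
    and observed edges in Omega clamped to their known weights."""
    Z = np.maximum(M - gamma, 0.0)
    np.fill_diagonal(Z, 0.0)
    for (i, j), s in obs.items():
        Z[i, j] = Z[j, i] = s
    return Z

def batch_pg(C_hat, obs, mu=1.0, n_iter=1000, tol=1e-6, seed=0):
    """Algorithm 1: batch PG iterations (11) for problem (8)."""
    N = C_hat.shape[0]
    rng = np.random.default_rng(seed)
    S = np.triu((rng.random((N, N)) < 0.05).astype(float), k=1)
    S = S + S.T                                    # sparse symmetric init
    lmax = np.max(np.abs(np.linalg.eigvalsh(C_hat)))
    gamma = 1.0 / (4.0 * mu * lmax**2)             # step size gamma < 2/M
    for _ in range(n_iter):
        S_next = prox_S(S - gamma * grad_g(S, C_hat, mu), gamma, obs)
        if np.linalg.norm(S_next - S, "fro") < tol:
            return S_next
        S = S_next
    return S
```

For instance, batch_pg(C_hat, obs={(0, 1): 1.0}, mu=10.0) would seek a sparse GSO consistent with one known edge; suitable values of mu are problem dependent.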
Additional challenges arise with real-time network data collection, where analytics must often be performed "on-the-fly" and without the opportunity to revisit past graph signal observations, due to e.g., staleness or storage constraints [2]. Consider now online estimation of S (or even tracking $S_t$ in a dynamic setting) from streaming data $\{y_1, \ldots, y_t, y_{t+1}, \ldots\}$. To this end, a viable approach is to solve at each time instant t = 1, 2, ... the composite, time-varying optimization problem [cf. (8)]

    S_t^\star \in \arg\min_{S \in \mathcal{S}}\; F_t(S) := \|S\|_1 + g_t(S), \quad g_t(S) := \frac{\mu}{2}\|S\hat{C}_{y,t} - \hat{C}_{y,t} S\|_F^2.    (13)

In writing $\hat{C}_{y,t}$ we make explicit that the covariance matrix is estimated with all signals acquired by time t. As data come in, the covariance estimate will fluctuate, and hence the objective function $F_t(S)$ varies over time through its smooth component $g_t$. Notice that even as t becomes large, the squared residuals in $g_t$ remain roughly of the same order due to data averaging (rather than accumulation) in $\hat{C}_{y,t}$. Thus a constant regularization parameter $\mu > 0$ tends to suffice, because $g_t(S)$ will not dwarf the $\ell_1$-norm term in (13).

The solution $S_t^\star$ of (13) is the batch network estimate at time t. Accordingly, a naive sequential estimation approach consists of solving (13) using Algorithm 1 repeatedly for each t. However, online operation in delay-sensitive applications may not tolerate running multiple inner PG iterations per time interval, so that convergence to $S_t^\star$ is attained for each t as required by Algorithm 1. If G is dynamic, it may not even be prudent to obtain $S_t^\star$ with high precision (hence incurring high delay and unnecessary computational cost), since at time t+1 a new datum arrives and the solution $S_{t+1}^\star$ may deviate significantly from the prior estimate. These reasons motivate devising an efficient online and recursive algorithm to solve the time-varying optimization problem (13). We face an interesting trade-off that emerges with time-constrained, data-intensive problems, where a high-quality answer that is obtained slowly can be less useful than a medium-quality answer that is obtained quickly.

Our algorithm construction entails two steps per time instant t = 1, 2, ..., where we: (i) recursively update the observations' covariance matrix $\hat{C}_{y,t}$ in $O(N^2)$ complexity; and then (ii) run a single iteration of the batch graph learning algorithm developed in Section 2.3 to solve (13) efficiently. Different from recent approaches that learn dynamic graphs from the observation of smooth signals [30, 31], the resulting algorithm's memory storage requirement and computational cost per data sample $y_t$ do not grow with t. A similar idea to the one outlined in (ii) was advocated in [16] to develop an online sparse subspace clustering algorithm; see also [35] for dynamic SEM estimation from traces of information cascades.

Step (i) is straightforward: the sample covariance $\hat{C}_{y,t-1}$ is recursively updated once $y_t$ becomes available, through the rank-one correction

    \hat{C}_{y,t} = \frac{1}{t}\big[(t-1)\hat{C}_{y,t-1} + y_t y_t^T\big].    (14)

This so-termed infinite-memory sample estimate is appropriate for time-invariant settings, i.e., when the graph topology remains fixed for all t. To track dynamic graphs, it is prudent to eventually forget about past observations [cf. (14), which incorporates all signals $\{y_\tau\}_{\tau=1}^t$]. This can be accomplished via exponentially-weighted empirical covariance estimators, or by using a (fixed-length) sliding window of data, both of which can be updated recursively via minor modifications to (14). Initialization of the covariance estimate $\hat{C}_{y,0}$ can be performed in practice via sample averaging of a few signals collected before the algorithm runs, or simply as a scaled identity matrix.
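Both the infinite-memory update (14) and an exponentially-weighted variant for tracking fit in a few lines; the forgetting factor below is an illustrative knob, not a value prescribed by the paper.

```python
import numpy as np

def update_cov(C_prev, y, t, beta=None):
    """Rank-one covariance update. With beta=None this is the infinite-memory
    recursion (14); with a forgetting factor beta in (0, 1) it becomes an
    exponentially-weighted estimator better suited to dynamic graphs."""
    yyT = np.outer(y, y)
    if beta is None:
        return ((t - 1) * C_prev + yyT) / t
    return beta * C_prev + (1.0 - beta) * yyT
```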
To solve (13) online by building on the insights gained from the batch solver [cf. step (ii)], we let the iterations k = 1, 2, ... in Algorithm 1 coincide with the instants t of data acquisition. In other words, at time t we run a single iteration of (11) to update $S_t$ before the new datum $y_{t+1}$ arrives at time t+1. Specifically, the online PG algorithm takes the form

    S_{t+1} = \mathrm{prox}_{\gamma_t\|\cdot\|_1, \mathcal{S}}\big(S_t - \gamma_t\nabla g_t(S_t)\big),    (15)

where the step size $\gamma_t$ is chosen such that $\gamma_t < 2/M_t = 1/[2\mu\lambda_{\max}^2(\hat{C}_{y,t})]$. Recall that the gradient $\nabla g_t(S_t)$ is given by (9), and it is a function of the updated covariance matrix $\hat{C}_{y,t}$ [cf. (14)]. The proximal operator for the $\ell_1$-norm entails the pointwise nonlinearity in (12), with a threshold $\gamma_t$ that will in general be time varying.

Algorithm 2: Online topology inference from streaming signals.
Require: $\{y_1, \ldots, y_t, y_{t+1}, \ldots\}$, $\hat{C}_{y,0}$, $\mu > 0$.
1: Initialize $S_1 \neq 0$ as a sparse, random symmetric matrix, $\gamma_0 = 1/[4\mu\lambda_{\max}^2(\hat{C}_{y,0})]$, k = 0.
2: for t = 1, 2, ... do
3:   Update $\hat{C}_{y,t} = \frac{1}{t}[(t-1)\hat{C}_{y,t-1} + y_t y_t^T]$ and $\gamma_t$.
4:   Compute $\nabla g_t(S_t) = \mu[(S_t\hat{C}_{y,t} - \hat{C}_{y,t}S_t)\hat{C}_{y,t} - \hat{C}_{y,t}(S_t\hat{C}_{y,t} - \hat{C}_{y,t}S_t)]$.
5:   Take gradient descent step $D_t = S_t - \gamma_t\nabla g_t(S_t)$.
6:   Update $S_{t+1} = \mathrm{prox}_{\gamma_t\|\cdot\|_1, \mathcal{S}}(D_t)$ via the proximal operator in (12).
7: end for
8: return $S_{t+1}$.

If the signals arrive faster, one can create a buffer and perform each iteration of the algorithm on a $\hat{C}_{y,t}$ updated with a sliding window of all newly observed signals (in a way akin to processing a minibatch of data). On the other hand, for a slower arrival rate, additional PG iterations during (t-1, t] would likely improve recovery performance; see also the discussion following Theorem 1. The proposed online scheme is tabulated under Algorithm 2; the update of $\hat{C}_{y,t}$ can be modified to accommodate tracking of dynamic graph topologies.

In conference precursors to this paper [13, 14], different algorithms were developed to track graph topologies from streaming stationary signals. The ADMM-based scheme in [13] minimizes an online criterion stemming from (5), and hence one needs to track sample covariance eigenvectors; the same is true for the online alternating-minimization algorithm in [14]. The merits of working with the inverse problem (6) are outlined in Section 2.1. Moreover, enforcing some of the admissibility constraints in $\mathcal{S}$ requires an inner ADMM loop to compute nontrivial projection operators, which could hinder applicability in delay-sensitive environments [13]. Neither of these schemes offers convergence guarantees to the solutions of the resulting time-varying optimization problems. For Algorithm 2, guarantees of this type are established in the ensuing section.
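Before moving on to the analysis, here is a sketch of Algorithm 2 as a Python generator. It assumes the helpers grad_g, prox_S, and update_cov from the earlier sketches are in scope, and the scaled-identity initialization of $\hat{C}_{y,0}$ is one of the options mentioned above.

```python
import numpy as np

def online_pg(stream, N, obs, mu=1.0, beta=None, seed=0):
    """Algorithm 2: one PG step (15) per incoming signal y_t.
    Set beta in (0, 1) to use the forgetting-factor covariance update."""
    rng = np.random.default_rng(seed)
    S = np.triu((rng.random((N, N)) < 0.05).astype(float), k=1)
    S = S + S.T                                  # sparse symmetric S_1
    C = np.eye(N)                                # scaled-identity C_hat_{y,0}
    for t, y in enumerate(stream, start=1):
        C = update_cov(C, y, t, beta)            # step (i): rank-one update (14)
        lmax = np.max(np.abs(np.linalg.eigvalsh(C)))
        gamma = 1.0 / (4.0 * mu * lmax**2)       # gamma_t < 2/M_t
        S = prox_S(S - gamma * grad_g(S, C, mu), gamma, obs)   # step (ii)
        yield S                                  # current estimate S_{t+1}
```

Iterating for S_t in online_pg(stream, N, obs) then yields a refreshed topology estimate after every new signal.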
The key difference between the batch algorithm (11) and its online counterpart (15) is the variability of $g_t$ per iteration in the latter. Ideally, we would like Algorithm 2 to closely track the sequence of minimizers $\{S_t^\star\}$ for large enough t, something we corroborate numerically in Section 4. Following closely the analysis in [16], we derive recovery (i.e., tracking error) bounds on $\|S_t - S_t^\star\|_F$ under the pragmatic assumption that $g_t$ is strongly convex and $S_t^\star$ is the unique minimizer of (13) for each t. Before stating the main result in Theorem 1, the following proposition (proved in Appendix A) offers a condition for strong convexity of $g_t$.

Proposition 2. Define $\Psi_t := \hat{C}_{y,t} \otimes I_N - I_N \otimes \hat{C}_{y,t}$, so that $\mathrm{vec}(S\hat{C}_{y,t} - \hat{C}_{y,t}S) = \Psi_t\,\mathrm{vec}(S)$. If $\Psi_{t,D^c} \in \mathbb{R}^{N^2 \times N(N-1)}$ (the submatrix of $\Psi_t$ that contains the columns indexed by the set $D^c$) is full column rank, then $g_t(S)$ in (13) is strongly convex with constant $m_t > 0$ given by the smallest (nonzero) singular value of $\Psi_{t,D^c}$.

Consider the eigendecomposition $\hat{C}_{y,t} = \hat{V}_t\hat{\Lambda}_t\hat{V}_t^T$ of the sample covariance matrix at time t, with $\hat{V}_t$ denoting the matrix of orthogonal eigenvectors and $\hat{\Lambda}_t$ the diagonal matrix of non-negative eigenvalues. Exploiting the structure of $\Psi_t$, it follows that the strong convexity condition stated in Proposition 2 is satisfied if (i) all eigenvalues of $\hat{C}_{y,t}$ are distinct; and (ii) the matrix $\hat{V}_t \circ \hat{V}_t$ is non-singular. Recalling the covariance expression in (4), the aforementioned conditions (i)-(ii) immediately translate to properties of the diffusion filter's (squared) frequency response and the GSO eigenvectors. In extensive simulations involving several synthetic and real-world graphs, we have indeed observed that $\Psi_{t,D^c}$ is typically full column rank, and thus $g_t$ is strongly convex.

Under the strong convexity assumption, we have the following (non-asymptotic) performance guarantee for Algorithm 2. The result is adapted from [16, Theorem 1]; see Appendix B for a proof.

Theorem 1. Let $\nu_t := \|S_{t+1}^\star - S_t^\star\|_F$ capture the variability of the optimal solution of (13). If $g_t$ in (13) is strongly convex with constant $m_t$, then for all $t \geq 1$ the iterates $S_t$ generated by Algorithm 2 satisfy

    \|S_{t+1} - S_{t+1}^\star\|_F \leq \Big(\prod_{\tau=1}^{t} L_\tau\Big)\|S_1 - S_1^\star\|_F + \sum_{\tau=1}^{t}\Big(\prod_{u=\tau+1}^{t} L_u\Big)\nu_\tau,    (16)

where $L_\tau := \max\{|1 - \gamma_\tau m_\tau|, |1 - \gamma_\tau M_\tau|\}$.

A corresponding guarantee in terms of the objective values $F_t(S_t) - F_t(S_t^\star)$ can be obtained along the lines of [43, Theorem 10.29]. As expected, Theorem 1 asserts that the higher the variability in the underlying graph, the higher the recovery performance penalty. Even if the graph G (and hence the GSO) is time invariant, $\nu_t$ will be non-zero, especially for small t, since the solution $S_t^\star$ may fluctuate due to insufficient data. Much like classic performance results in adaptive signal processing [46], here there is misadjustment due to the "dynamics-induced noise" $\nu_t$, and hence the online algorithm will only converge to within a neighborhood of $S_t^\star$.

To better distill what determines the size of the convergence radius, define $\bar{L}_t := \max_{\tau=0,\ldots,t} L_\tau$ and $\bar{\nu}_t := \max_{\tau=0,\ldots,t} \nu_\tau$, and sum the geometric series on the right-hand side of (16) to obtain the simplified bound

    \|S_{t+1} - S_{t+1}^\star\|_F \leq \bar{L}_t^{\,t}\,\|S_1 - S_1^\star\|_F + \frac{\bar{\nu}_t}{1 - \bar{L}_{t-1}}.    (17)

Further suppose $m_\tau \geq m$ and $M_\tau \leq M$, with the step size chosen as $\gamma_\tau = 2/(m_\tau + M_\tau)$ for all $\tau = 0, \ldots, t$. Then it follows that $\bar{L}_t \leq (M-m)/(M+m) < 1$. The misadjustment $\bar{\nu}_t/(1 - \bar{L}_{t-1})$ in (17) grows with $\bar{\nu}_t$ as expected, and will also increase when the problem is badly conditioned ($M \to \infty$ or $m \to 0$) because $\bar{L}_t \to 1$. If one could afford taking $i_t$ PG iterations (instead of a single one as in Algorithm 2) per time step t, the performance gains can be readily evaluated by substituting $\bar{L}_t = \prod_{\tau=0}^{t} L_\tau^{i_\tau}$ in Theorem 1 to use the bound (16).

In closing, we state dynamic regret bounds, which are weaker than the performance guarantees in (16)-(17) but do not require strong convexity assumptions. The proof of the following result can be found in [16] and is omitted here in the interest of brevity.

Theorem 2. Let $\hat{S}_t^\star := \frac{1}{t}\sum_{\tau=0}^{t-1} S_\tau^\star$ be the average trajectory of the optimal solutions of (13), and introduce $\rho_t(\tau) := \|\hat{S}_t^\star - S_\tau^\star\|_F$ to capture the deviation of the instantaneous solutions from this average trajectory. Define $\delta_t := \|g_{t+1} - g_t\|_\infty$ as a measure of the variability in $F_t$. Suppose $M_\tau \leq M$ and set the step size $\gamma_\tau = 1/M$ for all $\tau = 0, \ldots, t$. Then for all $t \geq 1$ the iterates $S_t$ generated by Algorithm 2 satisfy a dynamic regret bound (18), whereby the time-averaged optimality gap $\frac{1}{t}\sum_{\tau=0}^{t-1}[F_\tau(S_\tau) - F_\tau(S_\tau^\star)]$ is controlled by M, the network size N, the deviations $\rho_t(\tau)$, and the cost variability $\delta_\tau$.

To interpret the result of Theorem 2, define $\bar{\rho}_t := \max_{\tau=0,\ldots,t-1} \rho_t(\tau)$ and $\bar{\delta}_t := \max_{\tau=0,\ldots,t-2} \delta_\tau$. Using these definitions, a simplified regret bound is obtained immediately from (18): the time-averaged gap is bounded by a constant that depends on M, N, $\bar{\rho}_t$, and $\bar{\delta}_t$, plus a term that vanishes as t grows. If $\bar{\delta}_t$ and $\bar{\rho}_t$ are well-behaved (i.e., the cost function changes slowly and so does its optimal solution $S_t^\star$) then, on average, $F_t(S_t)$ hovers within a constant term of $F_t(S_t^\star)$. The network size N and the problem conditioning (as measured by the Lipschitz constant upper bound M) also play a natural role.
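The strong convexity condition of Proposition 2 is easy to probe numerically for a given covariance estimate. The sketch below is an illustrative check of our own, with an arbitrary rank tolerance; the returned singular value relates to $m_t$ up to the scaling by mu implicit in $g_t$.

```python
import numpy as np

def strong_convexity_check(C_hat, tol=1e-10):
    """Probe Proposition 2: form Psi = C (x) I - I (x) C (so that
    vec(S C - C S) = Psi vec(S)), drop the diagonal-index columns D,
    and test whether the remaining N(N-1) columns are independent."""
    N = C_hat.shape[0]
    I = np.eye(N)
    Psi = np.kron(C_hat, I) - np.kron(I, C_hat)
    D = {i * N + i for i in range(N)}
    keep = [c for c in range(N * N) if c not in D]     # columns in D^c
    sv = np.linalg.svd(Psi[:, keep], compute_uv=False)
    full_rank = int(np.sum(sv > tol)) == N * (N - 1)
    return full_rank, (sv[-1] if full_rank else 0.0)   # smallest singular value
```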
Here we assess the performance of the proposed algorithms in recovering sparse real-world graphs. To that end, we: (i) illustrate the scalability of Algorithm 1 using a relatively large-scale Facebook graph with thousands of nodes in a batch setup; (ii) evaluate the performance of the proposed online Algorithm 2 in settings with streaming signals; and (iii) demonstrate the effectiveness of Algorithm 2 in adapting to dynamic behavior of the network.

Throughout this section, we infer unweighted real-world networks from the observation of diffusion processes that are synthetically generated via graph filtering as in (2). For the graph shift S = A, the adjacency matrix of the sought network, we consider a second-order filter $H = \sum_{l=0}^{2} h_l S^l$, where the coefficients $\{h_l\}$ are drawn uniformly from [0, 1]. To measure edge-support recovery, we compute the F-measure, defined as the harmonic mean of edge precision and recall (precision is the percentage of correct edges in $\hat{S}$, and recall is the fraction of edges in S that are retrieved); a minimal implementation is sketched below.

To evaluate the scalability of Algorithm 1, consider a directed network of N = 2888 Facebook users, where the 2981 edges represent friendships among the users [47, 48]. More precisely, an edge from node i to node j exists if user i is a friend of user j. To make the graph amenable to our framework, we assume that the friendships are bilateral and ignore the directions. First, we notice that $\mathrm{rank}(W_D) = 2882$ (cf. Proposition 1). This means that knowing the GSO's eigenvectors and k = 6 suitably chosen edges as a priori information would lead to a singleton feasibility set. To test Algorithm 1 in recovering this reasonably large-scale graph, the performance is averaged over 10 experiments, wherein we assume that we know the existence of $|\Omega| = 5$ randomly sampled edges in each experiment. We then generate $T = 10^3$, $10^4$, $10^5$, or $10^6$ synthetic random signals $\{y_t\}_{t=1}^T$ adhering to the generative diffusion process in (2), where the entries of the inputs $\{x_t\}_{t=1}^T$ are drawn independently from the standard Gaussian distribution to yield stationary observations. We obtain $\hat{C}_y$ via a sample covariance estimate, and compute the F-measure of the GSO $\hat{S}$ recovered by Algorithm 1 for each value of T. As the number of observations increases, the estimate $\hat{C}_y$ becomes more reliable, which leads to better performance in recovering the underlying GSO; recall that perfect support recovery corresponds to an F-measure of 1. Moreover, the reported results corroborate the effectiveness of Algorithm 1 in recovering large-scale graphs. Off-the-shelf solvers utilized for related topology inference problems in [6] are not effective for graphs with more than a few hundred nodes.
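For completeness, the F-measure used throughout this section can be computed as follows (a small utility of our own; the support threshold is arbitrary):

```python
import numpy as np

def f_measure(S_true, S_hat, thresh=1e-3):
    """Harmonic mean of edge precision and recall over the upper triangle."""
    true_e = np.triu(np.abs(S_true) > thresh, k=1)
    pred_e = np.triu(np.abs(S_hat) > thresh, k=1)
    tp = np.sum(true_e & pred_e)
    precision = tp / max(np.sum(pred_e), 1)   # correct fraction of found edges
    recall = tp / max(np.sum(true_e), 1)      # fraction of true edges retrieved
    return 2 * precision * recall / max(precision + recall, 1e-12)
```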
Next, we consider the social network of Zachary's karate club [1, Ch. 1.2.2], represented by a graph G consisting of N = 34 nodes (members of the club) and 78 undirected edges symbolizing friendships among them. We seek to infer this graph from the observation of diffusion processes that are synthetically generated via graph filtering as in (2). The rank of $W_D$ (cf. Proposition 1) for this graph is 32. This implies that knowledge of the perfect output covariance $C_y$ leaves the GSO S in a 2-dimensional subspace, which can lead to a singleton feasibility set by observing only 2 different edges. However, here we work with a noisy sample covariance $\hat{C}_y$. We observe one of the 78 edges (chosen uniformly at random) and aim to infer the rest.

At each time step, 10 synthetic signals $\{y_p\}$ are generated through the diffusion process H, where the entries of the inputs $\{x_p\}$ are drawn independently from the standard Gaussian distribution. In the online case, upon sensing 10 signals at each time step, we first update the sample covariance $\hat{C}_{y,t}$ and then carry out $i_t = 10$ PG iterations of Algorithm 2. Also, to examine the tracking capability of the online estimator, after 5000 time steps we remove 10% of the existing edges and add the same number of edges elsewhere; this affects the graph filter H accordingly. To corroborate the assumption in Theorem 1, it is worth mentioning that throughout the process we observed that $\Psi_{t,D^c}$ was full column rank, and thus the cost in (13) was strongly convex; see Proposition 2.

Fig. 1-(a) depicts the running objective value $F_t(S_t)$ [cf. (13)], averaged over 10 experiments, as a function of the time steps, with a priori knowledge of 3 randomly chosen edges. We also superimpose on Fig. 1-(a) the optimal objective value $F_t(S_t^\star)$ at each time step. First, we notice that the objective value trajectory converges to a region above the optimal trajectory. Also, we observe that after time step 5000 the performance deteriorates at first, due to the sudden change of the network structure, but after observing enough new samples Algorithm 2 adapts and once more tracks the batch estimator. This demonstrates the effectiveness of the developed online algorithm when it comes to adapting to network perturbations. Finally, we study the quality of the online learned graph $S_t$ at iteration 5000. Fig. 1-(b) depicts the heat maps of the ground-truth and inferred adjacency matrices for different a priori information. Although the procedure results in a slight gap between $F_t(S_t^\star)$ and $F_t(S_t)$, it still reveals the underlying support of A with reasonable accuracy. Interestingly, we notice that an edge with lower betweenness centrality [e.g., (6, 17) and (15, 34) compared to (2, 4)] is more informative for identifying the network topology; see also [14].

We studied the problem of identifying the topology of an undirected network from streaming observations of stationary signals diffused on the unknown graph. This is an important problem because, as a general principle, network structural information can be used to understand, predict, and control the behavior of complex systems. The stationarity assumption implies that the observations' covariance matrix and the GSO commute under mild requirements. This motivates formulating the topology inference task as an inverse problem, whereby one searches for a (e.g., sparse) GSO that is structurally admissible and approximately commutes with the observations' empirical covariance matrix. For streaming data said covariance can be updated recursively, and we showed that online proximal gradient iterations can be brought to bear to efficiently track the time-varying solution of the inverse problem with quantifiable recovery guarantees. Ongoing work includes extensions of the online graph learning framework to observations of streaming signals that are smooth on the sought network.

Funding: Work in this paper was supported by the National Science Foundation awards CCF-1750428 and ECCS-1809356.

Appendix A. Proof of Proposition 2.
Strong convexity of $g_t(S)$ in (13) is equivalent to finding $m_t > 0$ such that

    \langle \nabla g_t(S_1) - \nabla g_t(S_2),\, S_1 - S_2 \rangle \geq m_t \|S_1 - S_2\|_F^2    (A1)

for all $S_1, S_2 \in \mathcal{S}$. Note that the Frobenius inner product of two real square matrices is defined as $\langle A, B \rangle = \sum_{i,j} A_{ij}B_{ij} = \mathrm{trace}(A^T B)$.
Substituting the time-varying counterpart of (9) in (A1) and using properties of the Khatri-Rao product, one can write the left-hand side (LHS) of (A1) as

    \langle \nabla g_t(S_1) - \nabla g_t(S_2),\, S_1 - S_2 \rangle = \mu \|\Psi_t\, \mathrm{vec}(S_1 - S_2)\|_2^2,

where $\mathrm{vec}(\cdot)$ stands for matrix vectorization and $\|\cdot\|_2$ denotes the Euclidean norm. Since GSOs are devoid of self-loops, we can discard the zero entries of $\mathrm{vec}(S_1 - S_2)$ corresponding to the diagonal entries of S, as well as the related columns in $\Psi_t$. This further simplifies the LHS of (A1), leading immediately to the result in Proposition 2. Also, notice that $\|\mathrm{vec}(S_1 - S_2)\|_2^2 = \|S_1 - S_2\|_F^2$.

Appendix B. Proof of Theorem 1.
For any $S, S' \in \mathcal{S}$, we have

    \|[S - \gamma_t\nabla g_t(S)] - [S' - \gamma_t\nabla g_t(S')]\|_F \leq \max\{|1 - \gamma_t m_t|, |1 - \gamma_t M_t|\}\,\|S - S'\|_F = L_t\|S - S'\|_F,    (A2)

where the inequality follows from the Lipschitz continuity of $\nabla g_t(\cdot)$ with constant $M_t = 4\mu\lambda_{\max}^2(\hat{C}_{y,t})$, along with the strong convexity condition (A1); note that $L_t < 1$ because $\gamma_t < 2/M_t$ and $M_t \geq m_t$. Noting that $S_t^\star$ is a fixed point of (15), we have

    \|S_{t+1} - S_t^\star\|_F = \|\mathrm{prox}_{\gamma_t\|\cdot\|_1,\mathcal{S}}(S_t - \gamma_t\nabla g_t(S_t)) - \mathrm{prox}_{\gamma_t\|\cdot\|_1,\mathcal{S}}(S_t^\star - \gamma_t\nabla g_t(S_t^\star))\|_F \leq \|[S_t - \gamma_t\nabla g_t(S_t)] - [S_t^\star - \gamma_t\nabla g_t(S_t^\star)]\|_F \leq L_t\|S_t - S_t^\star\|_F,

where the first inequality is due to the nonexpansiveness of the proximal operator and the second one follows from (A2). Leveraging the triangle inequality and the definition $\nu_t = \|S_{t+1}^\star - S_t^\star\|_F$ results in

    \|S_{t+1} - S_{t+1}^\star\|_F \leq \|S_{t+1} - S_t^\star\|_F + \nu_t \leq L_t\|S_t - S_t^\star\|_F + \nu_t.

Applying this recursion down to t = 1 yields the bound (16).

References
[1] Statistical Analysis of Network Data: Methods and Models.
[2] Modeling and optimization for big data analytics: (Statistical) learning tools for our era of data deluge.
[3] Graph signal processing: Overview, challenges and applications.
[4] Connecting the dots: Identifying network structure via graph signal processing.
[5] Learning graphs from data: A signal representation perspective.
[6] Network topology inference from spectral templates.
[7] Stationary graph processes and spectral estimation.
[8] Stationary signal processing on graphs.
[9] Stationary graph signals using an isometric graph translation.
[10] Characterization and inference of graph diffusion processes from observations of stationary signals.
[11] Proximal algorithms, Foundations and Trends in Optimization.
[12] A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
[13] Online topology inference from streaming stationary graph signals.
[14] Online network topology inference with partial connectivity information.
[15] Correlation-based ultra high-dimensional variable screening.
[16] Online sparse subspace clustering.
[17] Covariance selection.
[18] Sparse inverse covariance estimation with the graphical lasso.
[19] Discovering structure by learning sparse graphs.
[20] Graph learning from data under Laplacian and structural constraints.
[21] High-dimensional graphs and variable selection with the lasso.
[22] Learning Laplacian matrix in smooth graph signal representations.
[23] Signal processing on graphs: Causal modeling of unstructured data.
[24] How to learn a graph from smooth signals.
[25] Learning heat diffusion graphs.
[26] Learning sparse graphs under smoothness prior.
[27] Inferring sparse graphs from smooth signals with theoretical guarantees.
[28] Identifying the topology of undirected networks from diffused non-stationary graph signals.
[29] Network topology inference from non-stationary graph signals.
[30] Learning undirected graphs in financial markets.
[31] Learning time varying graphs.
[32] Online graph learning from sequential data.
[33] Tensor decompositions for identifying directed graph topologies and tracking dynamic networks.
[34] Autoregressive moving average graph filtering.
[35] Proximal-gradient algorithms for tracking cascades over social networks.
[36] Topology identification and learning over graphs: Accounting for nonlinearities and dynamics.
[37] Optimization and learning with information streams: Time-varying algorithms and applications.
[38] Distributed adaptive learning of graph signals.
[39] Adaptive graph signal processing: Algorithms and optimal sampling strategies.
[40] Semi-blind inference of topologies and dynamical processes over dynamic graphs.
[41] An iterative thresholding algorithm for linear inverse problems with a sparsity constraint.
[42] Sparse reconstruction by separable approximation.
[43] First-Order Methods in Optimization.
[44] Convex Analysis and Monotone Operator Theory in Hilbert Spaces.
[46] Adaptive Signal Processing Algorithms: Stability and Performance.
[47] The Koblenz Network Collection.
[48] Learning to discover social circles in ego networks.