key: cord-0574100-lhp2k48r authors: Saboksayr, Seyed Saman; Mateos, Gonzalo; Cetin, Mujdat title: Online Graph Learning under Smoothness Priors date: 2021-03-05 journal: nan DOI: nan sha: 77034338831cd16157b96c6b483dab711967f862 doc_id: 574100 cord_uid: lhp2k48r

The growing success of graph signal processing (GSP) approaches relies heavily on prior identification of a graph over which network data admit certain regularity. However, adaptation to increasingly dynamic environments as well as demands for real-time processing of streaming data pose major challenges to this end. In this context, we develop novel algorithms for online network topology inference given streaming observations assumed to be smooth on the sought graph. Unlike existing batch algorithms, our goal is to track the (possibly) time-varying network topology while keeping the memory and computational costs in check by processing graph signals sequentially in time. To recover the graph in an online fashion, we leverage proximal gradient (PG) methods to solve a judicious smoothness-regularized, time-varying optimization problem. Under mild technical conditions, we establish that the online graph learning algorithm converges to within a neighborhood of (i.e., it tracks) the optimal time-varying batch solution. Computer simulations using both synthetic and real financial market data illustrate the effectiveness of the proposed algorithm in adapting to streaming signals to track slowly-varying network connectivity.

Making sense of relational datasets from a network-centric perspective is essential to obtain actionable information in various fields of science and engineering. Graph signal processing (GSP) has played a key role to that end, underscoring the value of graphs as models of complex signals with irregular structures [1], [2]. However, said graphs are often not readily available, and a first crucial step is to use nodal observations (i.e., measurements of graph signals) to identify the network structure, or a useful graph model that facilitates signal representations and downstream learning tasks; see [3], [4] for tutorials on recent network topology inference advances. Acknowledging that many of these networks are also dynamic (e.g., in applications involving financial markets), there is a growing need to develop online graph learning algorithms that can process network data streams in an efficient manner [5]. Given a set of graph signal observations, the network topology inference problem is to find the graph (represented through, e.g., an adjacency or Laplacian matrix) that is optimal in some sense. The optimality criterion is usually dictated by the adopted network-dependent model for the measurements, often augmented by structural (e.g., edge sparsity) priors motivated by physical characteristics to effect statistical regularization or favor interpretability. Network data models are often given by probabilistic priors such as Gaussian Markov random fields (GMRFs), where the graph learning problem becomes one of graphical model (here covariance) selection [2, Ch. 7]. Other models are specified via signal parsimony-related properties with respect to the underlying graph, including stationarity [6], [7] and smoothness (i.e., bandlimitedness, linked to GMRF selection under Laplacian constraints) [8]-[10].

Work in this paper was supported by the NSF awards CCF-1750428, CCF-1934962 and ECCS-1809356. Author emails: ssaboksa@ur.rochester.edu, gmateosb@ece.rochester.edu, and mujdat.cetin@rochester.edu.
Contributions in context of related prior work. In this paper, we develop an online algorithmic framework to estimate (possibly dynamic) graphs under smoothness priors. Many real-world signals are smooth over judicious networks, including temperature recordings [11], movie ratings, and natural images [12], to name a few. Exploitation of this cardinal property is at the heart of several graph-based statistical learning tasks including nearest-neighbor prediction (also known as graph smoothing), denoising, semi-supervised learning, and spectral clustering [1], [2]. Revisiting the general graph learning framework in [8], [10] (but now with streaming data), we develop online proximal-gradient (PG) iterations to solve the resulting smoothness-regularized, time-varying optimization problem. There are noteworthy recent works on time-varying network topology inference from observations of smooth signals [12]-[14]. Unlike our online algorithm, those in [12]-[14] operate in batch mode and are non-recursive, hence their computational complexity and memory storage grow linearly with time. Borrowing techniques from [15], we establish that the online PG algorithm converges to within a neighborhood of (i.e., it tracks) the optimal time-varying batch solution. To the best of our knowledge, this is the first work that addresses the problem of online graph learning from streaming smooth signals with quantifiable performance guarantees. Different from our signal smoothness assumption, the online graph learning scheme in [16] uses observations from a Laplacian-based, continuous-time graph process, and [7] relies on stationarity. Numerical tests using both synthetic and real financial market data corroborate the efficiency and effectiveness of the proposed PG algorithm in adapting to streaming signals and tracking changes in the sought (slowly-varying) dynamic network.

Consider a weighted, undirected graph G(V, E, W), where V = {1, ..., N} is the set of nodes, E ⊆ V × V is the set of edges, and W ∈ R_+^{N×N} is the symmetric adjacency matrix collecting non-negative edge weights; for (i, j) ∉ E one has W_ij = 0. Graph G is devoid of self-loops, which implies W_ii = 0, ∀i ∈ V. We will henceforth assume the nodal degrees d := W1 are uniformly lower bounded away from zero, i.e., d ⪰ d_min 1 (entry-wise inequality) for some prescribed d_min > 0. Otherwise, if degrees become arbitrarily small, it is prudent to apply a threshold and remove the loosely connected nodes from G. We study graph signals x = [x_1, ..., x_N]⊤ ∈ R^N defined on G, where x_i is the signal value at node i ∈ V. Directed graphs could be useful models [17], but are beyond the scope of this paper. Complex-valued signals can be accommodated as well [1].

Signal smoothness with respect to G. The adjacency matrix W encodes the graph's topology. Beyond W, advances in spectral graph theory often motivate choosing the combinatorial graph Laplacian L := diag(d) − W. In particular, L plays a central role in defining a graph Fourier transform (GFT) [1] as well as a measure of signal variability with respect to G [18]. Focusing on the latter, the total variation (TV) of the graph signal x with respect to the Laplacian L (also known as the Dirichlet energy) is defined as the quadratic form

TV(x) := x⊤Lx = (1/2) Σ_{i≠j} W_ij (x_i − x_j)².  (1)

TV(x) is a smoothness measure, quantifying how much the graph signal x changes with respect to G's topology. Smaller values of TV(x) are indicative of limited signal variability, with TV(α1) = 0 as an extreme. More germane to the graph-learning theme of this paper is to use smoothness as the criterion to construct graphs on which network data admit certain regularity, the subject dealt with next.
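To make the smoothness measure in (1) concrete, here is a minimal Python sketch (our illustration, not code from the paper; the toy graph and signals are arbitrary) that evaluates TV(x) = x⊤Lx:

```python
import numpy as np

# Toy undirected graph on N = 4 nodes (symmetric adjacency, no self-loops).
W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
d = W.sum(axis=1)          # nodal degrees d = W1
L = np.diag(d) - W         # combinatorial Laplacian L = diag(d) - W

def tv(x, L):
    """Total variation (Dirichlet energy) of graph signal x: x^T L x."""
    return float(x @ L @ x)

print(tv(np.array([1.0, 1.1, 0.9, 1.0]), L))    # nearly constant signal: small TV
print(tv(np.array([1.0, -1.0, 1.0, -1.0]), L))  # sign-alternating signal: large TV
print(tv(np.ones(4), L))                        # constant signal: TV(alpha*1) = 0
```

The same quantity can be computed edge-wise as (1/2) Σ_{i≠j} W_ij (x_i − x_j)², which is the form exploited next.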
Consider the following network topology identification problem. Given a set X := {x_t}_{t=1}^T of possibly noisy graph signal observations acquired at times t, the goal is to learn an undirected graph G(V, E, W) with |V| = N nodes such that the observations in X are smooth on G. The graph can be dynamic with a slowly time-varying adjacency matrix W_t, t = 1, 2, ... (see Section IV), but for now we omit any form of temporal dependency to simplify exposition. In this section, we briefly review the general graph learning framework proposed in [8], [10], which we build on in the rest of the paper.

Graph learning under smoothness priors. Given X one can form the data matrix X = [x_1, ..., x_T] ∈ R^{N×T}, and let x̄_i ∈ R^{1×T} denote its i-th row collecting the T measurements at vertex i. The neat idea in [8] is to establish a link between smoothness and sparsity, namely

Σ_{t=1}^T TV(x_t) = (1/2) ‖W ∘ Z‖_1,  (2)

where ∘ stands for the Hadamard (element-wise) product and the Euclidean-distance matrix Z ∈ R_+^{N×N} has entries Z_ij := ‖x̄_i − x̄_j‖². The intuition is that when the given distances in Z come from a smooth manifold, the corresponding graph has a sparse edge set, with preference given to edges (i, j) associated with smaller distances Z_ij. Leveraging (2), a general graph-learning framework was put forth in [8], which advocates solving the convex smoothness-regularized inverse problem

min_{W ∈ W} ‖W ∘ Z‖_1 + g(W),  (3)

where W := {W ∈ R_+^{N×N} : W = W⊤, diag(W) = 0} is the set of admissible adjacency matrices. The convex objective function g(W) encodes additional structural properties of G. Several choices for g(W) have been proposed to, e.g., recover graphs based on the Gaussian kernel [19], accommodate time-varying graphs [12], or scale other related graph learning algorithms [9]. Identity (2) offers a favorable way of formulating the inverse problem (3), because the space of adjacency matrices can be described via simpler (meaning entry-wise decoupled) constraints relative to its Laplacian counterpart. As a result, (3) can be solved efficiently with complexity O(N²) per iteration by leveraging provably-convergent primal-dual solvers. Next, we present a different optimization approach based on PG methods [20], which offers an (equally) efficient alternative in the batch setting while lending itself naturally to online operation.

Batch proximal gradient algorithm. To make (3) amenable to the PG method, recall first that the adjacency matrix W is symmetric with diagonal elements equal to zero. Thus, the independent decision variables are effectively the upper-triangular entries of W, which we henceforth collect in the vector w ∈ R_+^{N(N−1)/2}. Second, it will prove convenient to enforce the non-negativity constraints via a penalty function augmenting the original objective. Just like [8], we add an indicator function I{w ⪰ 0} = 0 if w ⪰ 0, and I{w ⪰ 0} = ∞ otherwise. Given these definitions, we recast the objective in (3) as a function F(w) of the vector variable w, and write the equivalent composite, non-smooth optimization problem

min_w F(w) := 2w⊤z + I{w ⪰ 0} + g(w),  (4)

where z ∈ R_+^{N(N−1)/2} is a vector containing the upper-triangular entries of Z, and g(w) := −α1⊤ log(Sw) + 2β‖w‖₂², where α, β > 0 are tuning parameters [8] and S ∈ {0,1}^{N×N(N−1)/2} is the binary matrix mapping edge weights to nodal degrees, i.e., d = Sw. The logarithmic barrier on the nodal degree sequence Sw precludes the trivial solution w = 0. Moreover, it ensures the estimated graph is devoid of isolated vertices. The ℓ2-norm regularization on the adjacency matrix w controls the graph's edge sparsity pattern by penalizing larger edge weights (the sparsest graph is obtained for β = 0). The gradient of said g has the form

∇g(w) = −αS⊤(1/Sw) + 4βw,  (5)

where 1/Sw stands for element-wise division. Moreover, ∇g is a Lipschitz-continuous function with constant η = 4β + 2α(N−1)/d²_min; see [21] for the proof, which is omitted here due to lack of space.
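The pieces of (4)-(5) are easy to prototype. Below is a short Python sketch (our illustration; names like `degree_operator` are our own, and the Lipschitz constant uses the degree lower bound d ⪰ d_min·1 assumed earlier):

```python
import numpy as np
from itertools import combinations

def degree_operator(N):
    """Binary S in {0,1}^{N x N(N-1)/2} such that d = S @ w for upper-tri weights w."""
    edges = list(combinations(range(N), 2))
    S = np.zeros((N, len(edges)))
    for k, (i, j) in enumerate(edges):
        S[i, k] = S[j, k] = 1.0   # edge k = (i, j) contributes to both degrees
    return S

def g(w, S, alpha, beta):
    """Regularizer in (4): g(w) = -alpha * 1^T log(Sw) + 2*beta*||w||_2^2."""
    return -alpha * np.sum(np.log(S @ w)) + 2 * beta * np.sum(w ** 2)

def grad_g(w, S, alpha, beta):
    """Gradient (5): -alpha * S^T (1 / Sw) + 4*beta*w (1/Sw is element-wise)."""
    return -alpha * (S.T @ (1.0 / (S @ w))) + 4 * beta * w

def lipschitz_const(N, alpha, beta, d_min):
    """Lipschitz constant of grad_g: eta = 4*beta + 2*alpha*(N-1)/d_min**2."""
    return 4 * beta + 2 * alpha * (N - 1) / d_min ** 2

# Quick smoke test on strictly positive weights (so log(Sw) is finite).
N, alpha, beta = 5, 1.0, 0.5
S = degree_operator(N)
w = np.full(N * (N - 1) // 2, 0.3)
print(g(w, S, alpha, beta), grad_g(w, S, alpha, beta).shape)
```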
For constant step size μ < 2/η, the PG iterations to solve the batch graph learning problem (4) are given by (henceforth k = 0, 1, 2, ... denotes iterations)

w^{k+1} = prox_{μh}(w^k − μ∇g(w^k)),  (6)

where, for the non-smooth component h(w) := 2w⊤z + I{w ⪰ 0} in (4), the proximal operator is

prox_{μh}(v) = max(0, v − 2μz).  (7)

The non-negative soft-thresholding operator in (7) sets to zero all edge weights in w that fall below the data-dependent thresholds in the vector 2μz (the max operator is applied entry-wise). Inspection of (6) shows that graph estimate refinements are generated via the composition of a gradient-descent step and a proximal operator. All in all, (6) scales well to large graphs with thousands of nodes and is competitive with the state-of-the-art primal-dual solver in [8]. In terms of convergence, as k → ∞ the sequence of iterates (6) provably approaches a minimizer of the composite cost F in (4); see e.g., [20] for the technical details. Moreover, the worst-case convergence rate of PG algorithms is well documented (namely O(1/ε) iteration complexity to return an ε-optimal solution measured in terms of F values), and can be boosted to O(1/√ε) via Nesterov-type acceleration techniques. Building on recent advances in time-varying convex optimization [15], our main contribution is to develop novel online PG algorithms for time-varying graphs in the previously unexplored streaming setting.

We switch gears to online estimation of W (or even tracking W_t in a dynamic setting) from streaming data {x_1, ..., x_t, x_{t+1}, ...}. To this end, a viable approach is to solve at each time instant t = 1, 2, ..., the composite, time-varying optimization problem [cf. (4)]

w*_t := argmin_w F_t(w) := h_t(w) + g(w),  with h_t(w) := 2w⊤z̄_{1:t} + I{w ⪰ 0}.  (8)

In writing z̄_{1:t} we make explicit that the Euclidean-distance matrix is computed using all signals acquired by time t. As data come in, the edge-wise ℓ1-norm weights will fluctuate, explaining the time dependence of F_t(w) through its non-smooth component h_t.

Online algorithm construction. A naive sequential estimation approach consists of solving (8) repeatedly using the batch PG algorithm in Section III. However, (pseudo) real-time operation in delay-sensitive applications may not tolerate running multiple inner PG iterations per time interval, so that convergence to w*_t is attained for each t. For time-varying graphs it may not even be prudent to obtain w*_t with high precision (hence incurring high delay and unnecessary computational cost), since at time t + 1 a new datum arrives and the solution w*_{t+1} may be substantially off the prior estimate. These reasons motivate devising an efficient online and recursive algorithm to solve the time-varying optimization problem (8). Our approach entails two steps per time instant t = 1, 2, .... First, we recursively update the upper-triangular entries z̄_{1:t} of the Euclidean-distance matrix via an exponential moving average (EMA), namely

z̄_{1:t} = γz_t + (1 − γ)z̄_{1:t−1}.  (9)

The constant γ ∈ (0, 1) is a discount factor, which downweighs past data to facilitate tracking dynamic graphs in non-stationary environments. The larger the constant γ, the faster the EMA discounts past observations. Second, we run a single iteration of the batch graph learning algorithm developed in Section III to update w_{t+1}, namely

w_{t+1} = max(0, w_t − μ_t∇g(w_t) − 2μ_t z̄_{1:t}).  (10)

The resulting iterations are tabulated as Algorithm 1:

Algorithm 1: Online graph learning via PG
Input: parameters α, β, γ, stream z_1, z_2, ..., initial w_1, z̄_0.
for t = 1, 2, ... do
    Update z̄_{1:t} via the EMA (9).
    Update w_{t+1} via the PG step (10).
end for

The computational complexity is dominated by the gradient evaluation in (5), incurring a cost of O(N²) per instant t due to scaling and additions of vectors of length N(N − 1)/2.
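The following Python sketch renders Algorithm 1 end to end (our rendering, reusing the degree operator S from the earlier sketch; the initialization and the d_min floor inside the gradient are our own safeguards, the latter reflecting the standing assumption d ⪰ d_min·1):

```python
import numpy as np

def online_graph_learning(z_stream, S, alpha, beta, gamma, d_min):
    """Algorithm 1 sketch: one EMA update (9) and one PG step (10) per sample z_t.

    z_stream: iterable of length-N(N-1)/2 vectors (upper-tri entries of Z for x_t).
    Returns the list of successive edge-weight estimates w_t.
    """
    N, E = S.shape
    eta = 4 * beta + 2 * alpha * (N - 1) / d_min ** 2   # Lipschitz constant of grad g
    mu = 1.0 / eta                                      # step size mu_t = 1/eta_t
    w = np.full(E, 1.0 / N)                             # arbitrary positive w_1
    z_bar = np.zeros(E)                                 # z_bar_{1:0}
    estimates = []
    for z_t in z_stream:
        z_bar = gamma * z_t + (1.0 - gamma) * z_bar     # EMA update (9)
        deg = np.maximum(S @ w, d_min)                  # numerical floor: d >= d_min*1
        grad = -alpha * (S.T @ (1.0 / deg)) + 4 * beta * w      # gradient (5)
        w = np.maximum(0.0, w - mu * grad - 2.0 * mu * z_bar)   # PG step (10)
        estimates.append(w.copy())
    return estimates
```

Per sample, the cost is dominated by the two matrix-vector products with S (which has N(N − 1) nonzeros), consistent with the O(N²) complexity noted above.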
For sparse graphs, the iterates w_t tend to become (and remain) quite sparse at early stages of the algorithm by virtue of the soft-thresholding operations (a sparse initialization w_1 is preferable). It is thus possible to reduce the complexity further if Algorithm 1 is implemented carefully using sparse vector operations. Unlike recent approaches that learn dynamic graphs from observations of smooth signals [12]-[14], Algorithm 1's memory storage requirement and computational cost per data sample x_t do not grow with t.

Convergence analysis. Here we establish that Algorithm 1 can closely track the sequence of minimizers w*_t for large enough t; see also the simulations in Section V. Noting that g is 4β-strongly convex and its gradient is η-Lipschitz continuous, we can derive bounds for the tracking error ‖w_t − w*_t‖₂. To this end, let us define v_t := ‖w*_{t+1} − w*_t‖₂ to quantify the temporal variability of the optimal solution of (8). We have the following (non-asymptotic) performance guarantee for Algorithm 1, adapted from [15, Theorem 1].

Theorem 1: For all t ≥ 1, the sequence of iterates w_t generated by Algorithm 1 satisfies

‖w_{t+1} − w*_t‖₂ ≤ L̃_t ‖w_1 − w*_1‖₂ + Σ_{τ=1}^{t−1} (L̃_t / L̃_τ) v_τ,  (11)

where L_t = max{|1 − 4μ_tβ|, |1 − μ_tη_t|} and L̃_t = Π_{τ=1}^{t} L_τ. Moreover, for the sequence of objective values we can write F_t(w_t) − F_t(w*_t) ≤ (η_t/2)‖w_t − w*_t‖₂²; see [22, Theorem 10.29].

To gain further insights, let us define L̂_t := max_{τ=1,...,t} L_τ and v̂_t := max_{τ=1,...,t} v_τ. The sum of the geometric series on the right-hand side of (11) can be simplified to obtain

‖w_{t+1} − w*_t‖₂ ≤ (L̂_{t−1})^t ‖w_1 − w*_1‖₂ + v̂_t / (1 − L̂_{t−1}).

Accordingly, L̂_t = (η_t − 4β)/η_t < 1 since μ_t = η_t^{−1}. Therefore, (L̂_{t−1})^t → 0 and Algorithm 1 converges to the vicinity of the optimal solution with a misadjustment v̂_t/(1 − L̂_{t−1}). It follows that the tracking error increases with v̂_t (rapidly-varying graphs are more challenging) and also if the problem is badly conditioned (i.e., β → 0, in which case L̂_t → 1).

Here we test Algorithm 1 on synthetic and real-world financial market data. Throughout, we perform a grid search to determine the best regularization parameters α, β.

Synthetic data. To assess the performance of the proposed online graph learning algorithm, we test it on simulated streaming data. We generate a piecewise-constant sequence of two random Erdős-Rényi graphs (edge-formation probability p = 0.15) with N = 50 nodes. The initial graph switches after t = 4000 time samples; the final graph is obtained from the initial one by redrawing 40 percent of its edges. For each time instant t = 1, ..., 8000, we simulate i.i.d. signals that are smooth with respect to the time-varying underlying graph. The signals are drawn from a Gaussian distribution x_t ∼ N(0, L_t† + σ_e² I_N), where L_t† denotes the pseudo-inverse of the instantaneous Laplacian and σ_e represents the noise level; see e.g., [9]. Results are averaged over 10 independent Monte Carlo trials. The simulation is repeated for graphs with N = 100 nodes as well. As a baseline, we compute the time-varying batch solution w*_t of (8) by running the batch PG algorithm until convergence. Fig. 1 (top) shows that after around 1000 time samples (iterations), the objective value attained by the online Algorithm 1 reaches the optimal value of its time-varying batch counterpart. The observed oscillation depends on the step size μ_t in Algorithm 1. As expected, increasing the step size leads to faster convergence with larger oscillations; conversely, reducing the step size leads to smoother but slower convergence.
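A minimal sketch of the data-generation step (our reading of the setup; sampling from N(0, L† + σ_e²I_N) via NumPy's multivariate normal is one standard choice), together with the per-sample distance vector z_t that feeds Algorithm 1:

```python
import numpy as np
from itertools import combinations

def sample_smooth_signals(L, T, sigma_e, seed=0):
    """Draw T i.i.d. smooth signals x_t ~ N(0, pinv(L) + sigma_e^2 * I_N)."""
    rng = np.random.default_rng(seed)
    N = L.shape[0]
    cov = np.linalg.pinv(L) + sigma_e ** 2 * np.eye(N)
    return rng.multivariate_normal(np.zeros(N), cov, size=T)   # shape (T, N)

def distance_vector(x):
    """Upper-triangular entries of the instantaneous distances (x_i - x_j)^2."""
    N = len(x)
    return np.array([(x[i] - x[j]) ** 2 for i, j in combinations(range(N), 2)])

# Example usage: stream the per-sample distance vectors into the online learner.
# X = sample_smooth_signals(L, T=8000, sigma_e=0.1)
# z_stream = (distance_vector(x) for x in X)
```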
The other noteworthy observation is that the objective value markedly increases when the graph changes (after 4000 time samples), but the online algorithm can effectively track the dynamic graph after a sufficient number of samples has been acquired. A similar trend can be observed for the F-measure of the detected edges (defined as the harmonic mean of edge precision and recall); see Fig. 1 (bottom).

Financial market. Here, we perform a study involving real financial data. Since there is no ground-truth dynamic network, we endeavor to indirectly validate the proposed method by commenting on the intuitive structure observed in the learned sequence of graphs. We consider the daily stock prices of ten large American companies, including, e.g., Microsoft (MSFT), Apple (AAPL) and Amazon (AMZN). We collect their daily stock prices from Yahoo! Finance over the period from May 1, 2019 to August 1, 2020, which overlaps with the recent COVID-19 pandemic that led to significant market instabilities. Under normal circumstances, we would expect limited variations in the network describing the pairwise relationships between the chosen stock prices, since these large companies are well-established in the market [23]. However, events like COVID-19 can cause abrupt changes in said network. We run Algorithm 1 to estimate daily graphs in order to monitor sudden changes in the stock market. Following studies like [13], [23], we quantify the variation of the network via the relative temporal deviation ‖W_t − W_{t−1}‖_F / ‖W_{t−1}‖_F. We also learn networks from signals given by the relative temporal variation of the stock prices. In Fig. 2 (red) we plot the S&P 500 log-price as an indicator of the market's condition. The relative temporal variation of the learned graphs is illustrated in Fig. 2 (blue); these graphs are obtained by setting the regularization parameters to α = 0.316 and β = 0.05. Fig. 2 (red) shows that the impact of COVID-19 on the markets started in February 2020 and reached its worst point during March 2020. The ensuing sudden changes are apparent by inspection of the spikes in Fig. 2 (blue).

[Fig. 2: The S&P 500 log-price per day (red) and the daily relative temporal variation of the learned graphs (blue). The temporal variation exhibits sudden changes that can be attributed to, e.g., COVID-induced market tensions.]

We also learned networks from graph signals representing the relative temporal variation (single-day discrete gradients) of the stock prices, instead of the raw daily values considered so far. We set α = 3.162 and β = 0.088. Two instances of the learned graphs (Jan. 10, 2020, and Mar. 9, 2020) are depicted in Fig. 3. The stock prices dropped sharply during the economic crisis in March 2020; see Fig. 2 (red). This has an equalizing effect on the stock-price gradients, which suddenly become negative for all nodes (companies). Therefore, the Mar. 9 graph becomes more connected as a consequence of the imposed smoothness prior. However, during relatively normal operation of the market (e.g., during early January 2020), the temporal stock-price gradients are less tightly coupled (possibly fluctuating up or down depending on other latent effects). Accordingly, the learned graph tends to be relatively more disconnected, as seen in Fig. 3 (left).
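For completeness, the monitoring metric is a one-liner (our sketch; `vec_to_adj`, which maps the learned edge-weight vector back to a symmetric adjacency matrix, is our own helper):

```python
import numpy as np
from itertools import combinations

def vec_to_adj(w, N):
    """Rebuild the symmetric adjacency matrix W from upper-triangular weights w."""
    W = np.zeros((N, N))
    for k, (i, j) in enumerate(combinations(range(N), 2)):
        W[i, j] = W[j, i] = w[k]
    return W

def relative_temporal_deviation(W_prev, W_curr):
    """Network change score ||W_t - W_{t-1}||_F / ||W_{t-1}||_F."""
    return np.linalg.norm(W_curr - W_prev, 'fro') / np.linalg.norm(W_prev, 'fro')
```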
In this paper, we developed a novel online algorithm to learn graphs from streaming smooth signals. We exploited recent advances in time-varying convex optimization to derive efficient iterations that are guaranteed to track the (slowly) time-varying optimal solution of the batch estimator. The tracking ability of the proposed method is corroborated using synthetic and real-world financial market data.

[1] Graph signal processing: Overview, challenges, and applications.
[2] Statistical Analysis of Network Data: Methods and Models.
[3] Connecting the dots: Identifying network structure via graph signal processing.
[4] Learning graphs from data: A signal representation perspective.
[5] Topology identification and learning over graphs: Accounting for nonlinearities and dynamics.
[6] Network topology inference from spectral templates.
[7] Online topology inference from streaming stationary graph signals with partial connectivity information.
[8] How to learn a graph from smooth signals.
[9] Learning Laplacian matrix in smooth graph signal representations.
[10] Large scale graph learning from smooth signals.
[11] Learning sparse graphs under smoothness prior.
[12] Learning time varying graphs.
[13] Learning undirected graphs in financial markets.
[14] Time-varying graph learning with constraints on graph temporal variation.
[15] Online sparse subspace clustering.
[16] Online graph learning from sequential data.
[17] Signal processing on directed graphs.
[18] A regularization framework for learning from graph data.
[19] The Elements of Statistical Learning.
[20] Proximal algorithms.
[21] Online discriminative graph learning from multi-class smooth signals.
[22] First-order Methods in Optimization.
[23] Network inference via the time-varying Graphical Lasso.