key: cord-0605117-d57l7os0 authors: Cardoso, Jos'e Vin'icius de Miranda; Ying, Jiaxi; Palomar, Daniel Perez title: Algorithms for Learning Graphs in Financial Markets date: 2020-12-31 journal: nan DOI: nan sha: 61dd63ed8584722df447bb9882b0fc5acce6a41e doc_id: 605117 cord_uid: d57l7os0 In the past two decades, the field of applied finance has tremendously benefited from graph theory. As a result, novel methods ranging from asset network estimation to hierarchical asset selection and portfolio allocation are now part of practitioners' toolboxes. In this paper, we investigate the fundamental problem of learning undirected graphical models under Laplacian structural constraints from the point of view of financial market times series data. In particular, we present natural justifications, supported by empirical evidence, for the usage of the Laplacian matrix as a model for the precision matrix of financial assets, while also establishing a direct link that reveals how Laplacian constraints are coupled to meaningful physical interpretations related to the market index factor and to conditional correlations between stocks. Those interpretations lead to a set of guidelines that practitioners should be aware of when estimating graphs in financial markets. In addition, we design numerical algorithms based on the alternating direction method of multipliers to learn undirected, weighted graphs that take into account stylized facts that are intrinsic to financial data such as heavy tails and modularity. We illustrate how to leverage the learned graphs into practical scenarios such as stock time series clustering and foreign exchange network estimation. The proposed graph learning algorithms outperform the state-of-the-art methods in an extensive set of practical experiments. Furthermore, we obtain theoretical and empirical convergence results for the proposed algorithms. Along with the developed methodologies for graph learning in financial markets, we release an R package, called fingraph, accommodating the code and data to obtain all the experimental results. Graph learning from data has been a problem of critical importance for the statistical graph learning and graph signal processing fields (Friedman et al., 2008; Lake and Tenenbaum, 2010; Witten et al., 2011; Kalofolias, 2016; Egilmez et al., 2017; Pavez et al., 2018; Zhao et al., 2019) , with direct impact on applied areas such as unsupervised learning, clustering (Hsieh et al., 2012; Sun et al., 2014; Tan et al., 2015; Nie et al., 2016; Hao et al., 2018; Kumar et al., 2020 Kumar et al., , 2019a , applied finance (Mantegna, 1999; de Prado, 2016; Marti et al., 2017a) , network topology inference (Segarra et al., 2017; Mateos et al., 2019; Coutino et al., 2019; Shafipour and Mateos, 2020) , community detection (Fortunato, 2010; Li et al., 2018; , and graph neural nets (Wu et al., 2019; Pal et al., 2019) . The basic idea behind graph learning is to answer the following question: given a data matrix whose columns represent signals (observations) measured at the graph nodes, how can one design a graph representation that "best" fits such data matrix without possibly any (or with at most partial) knowledge of the underlying graph structure? By "graph representation" or "graph structure", it is often understood the Laplacian, adjacency, or incidence matrices of the graph, or even a more general graph shift operator (Marques et al., 2016) . 
In addition, the observed signals need not to live in regular, ordered spaces and can take arbitrary values, such as categorical and numerical, hence the probability distribution of the data can be highly unknown. Figure 1 illustrates such setting. (a) A graph signal. (b) A graph to be estimated on the basis of its graph signals. Figure 1 : Illustration of a hypothetical signal observed in a graph ( Figure 1a ). The gray squares represent the graph nodes, the thin black lines denote the graph edges, indicating the relationships among nodes, whereas the vertical red bars denote the signal intensities measured at each node. Graph learning techniques seek to estimate the underlying graph structure (edge weights W ij in Figure 1b ) through the graph signal measurements. Data derived from financial instruments such as equities and foreign exchanges, on the other hand, are defined on the well-known, ordered time domain (equally-spaced intraday 1 , daily, weekly, monthly, etc) and take on real values whose returns are often modeled by Gaussian processes. Then, the question is: How can graph learning be a useful tool for financial data analysis? In fact, Mantegna (1999) in his pioneer work showed that learning topological arrangements, such as graphs, in a stock market context, provides critical information that reveals economic factors that affect the price data evolution. The benefits that graph representations bring to applications on financial stocks are vastly discussed in the literature. A non-exhaustive, yet representative list of examples include: (i) identifying "business as usual" and "crash" periods via asset tree graphs (Onnela et al., 2003b,a) , (ii) understanding portfolio dynamics via the topological properties of simulated minimum spanning trees (Bonanno et al., 2003 (Bonanno et al., , 2004 Mantegna and Stanley, 2004) , (iii) constructing networks of companies based on graphs (Onnela et al., 2004) , (iv) understanding risks associated with a portfolio (Malevergne and Sornette, 2006) , (v) leveraging properties of the learned graph into follow-up tasks such as hierarchical portfolio designs (de Prado, 2016; Raffinot, 2018a,b) , (vi) mining the relationship structure among investors (Yang et al., 2020) , (vii) exploring graph properties such as degree centrality and eigenvector centrality for market crash detection and portfolio construction (Millington and Niranjan, 2020) , and (viii) community detection in financial stock markets (Ramakrishna et al., 2020; . Despite the plethora of applications, learning the structure of general graphical models is an NP-hard problem (Anandkumar et al., 2012) whose importance is critical towards visualizing, understanding, and leveraging the full potential contained in the data that live in such structures. Nonetheless, most existing techniques for learning graphs are often unable to impose a particular graph structure due to their inability to incorporate prior information in the learning process. More surprisingly, as it is shown in this work, state-of-the-art learning algorithms fall short when it comes to estimate graphs that posses certain properties such as k-components. Moreover, graph learning frameworks are designed with the operational assumption that the observed graph signals are Gaussian distributed (Friedman et al., 2008; Lake and Tenenbaum, 2010; Dong et al., 2016; Kalofolias, 2016; Egilmez et al., 2017; Zhao et al., 2019; Kumar et al., 2020; Ying et al., 2020b) , inherently neglecting situations where there may exist outliers. 
As a consequence, those methods may not succeed in fully capturing a meaningful representation of the underlying graph especially in data from financial instruments, which are known to be heavy-tailed and skewed (Gourieroux and Monfort, 1997; Cont, 2001; Tsay, 2010; Harvey, 2013; Feng and Palomar, 2015; Liu et al., 2019) . While estimators for connected graphs have been proposed (Dong et al., 2016; Kalofolias, 2016; Egilmez et al., 2017; Zhao et al., 2019) , some of its properties, such as sparsity, are yet being investigated (Ying et al., 2020b,a) , and only recently Kumar et al. (2020) have presented estimators for learning more general graphical structures such as k-component, bipartite, and k-component bipartite. However, one major shortcoming in (Kumar et al., 2020) is the lack of constraints on the degrees of the nodes. As we show in this work, the ability to control the degrees of the nodes is key to avoid trivial solutions while learning k-component graphs. Recently, Nie et al. (2016) ; Kumar et al. (2020 Kumar et al. ( , 2019a proposed optimization programs for learning the class of k-component graphs, as such class is an appealing model for clustering tasks due to the spectral properties of the Laplacian matrix. From a financial perspective, clustering financial time-series, such as stock return data, has been an active research topic (Mantegna, 1999; Dose and Cincotti, 2005; Marti et al., , 2017a . However, these works rely primarily on hierarchical clustering techniques and on the assumption that the underlying graph has a tree structure, which does bring advantages due to its hierarchical clustering properties, but also have been shown to be unstable (Carlsson and Mémoli, 2010; Lemieux et al., 2014; Marti et al., 2015) and not suitable when the data is not Gaussian distributed . In this work, on the contrary, we tackle the problem of clustering stocks from a probabilistic perspective, similarly to the approach layed out by Kumar et al. (2019a Kumar et al. ( , 2020 , where the Laplacian matrix of a k-component graph is assumed to model the pairwise conditional correlations between stocks. A crucial advantage of this approach is that we can consider more realistic probabilistic assumptions such as heavy tails. In practice, prior information about clusters of stocks is available via sector classification systems such as the Global Industry Classification Standard (GICS) (Standard & Poor's, 2006; Morgan Stanley Capital International and S&P Dow Jones, 2018) or the Industry Classification Benchmark (ICB) (Schreiner, 2019) . However, more often than not, stocks have impacts on multiple industries, e.g., the evident case of technology companies, such as Amazon, Apple, Google, and Facebook, whose influence on prices affect stocks not only in their own sector, but spans across multiple sectors. One reason for this phenomena is the myriad of services offered by those companies, resulting in challenges to precisely pin point which stock market sector they should belong to. Motivated by practical applications in finance such as clustering of financial instruments and network estimation, we investigate the problem of learning graph matrices whose structure follow that of a Laplacian matrix of an undirected weighted graph. The main contributions of our paper include: 1. As far as the authors are aware of, we for the first time provide interpretations for Laplacian constraints of graphs from the perspective of stock market data. 
Those interpretations naturally lead to meaningful and intuitive guidelines on the data pre-processing required for learning graphs from financial data. 2. We show that rank constraints alone, a practice often used by state-of-the-art methods, are not sufficient to learn non-trivial k-component graphs. We achieve learning of k-component graphs without isolated nodes by leveraging linear constraints on the node degrees of the graph. 3. We propose novel formulations to learn k-component graphs and heavy-tailed graphs, which are solved via carefully designed Alternating Direction Method of Multipliers (ADMM) algorithms. In addition, we establish theoretical convergence guarantees for the proposed algorithms along with experiments on their empirical convergence. The proposed algorithms can be easily extended to account for additional linear constraints on the graph weights. 4. We present extensive practical results that showcase the advantage of the operational assumptions used in the proposed algorithms when compared to state-of-the-art methods. Along with the methods proposed in this paper, we release an R package, called fingraph , containing fast, unit-tested code that implements the proposed algorithms and it is publicly available at: https://github.com/mirca/fingraph. The reals, nonnegative reals, and positive reals fields are denoted as R, R + , and R ++ , respectively. We use the abbreviation iff to denote "if and only if". Scalars and real-valued random variables are denoted by lowercase roman letters like x. Matrices (vectors) are denoted by bold, italic, capital (lowercase) roman letters like X, x. Vectors are assumed to be column vectors. The (i, j) element of a matrix X ∈ R n×p is denoted as X ij . The i-th row (column) of X is denoted as x i, * ∈ R p×1 (x * ,i ∈ R n×1 ). The i-th element of a vector x is denoted as x i . The transpose of X is denoted as X . The identity matrix of order p is denoted as I p . The Moore-Penrose inverse of a matrix X is denoted as X † . The trace of a square matrix, i.e., the sum of elements on the principal diagonal, is denoted as tr(X). The inner product between two matrices X, Y is denoted as X, Y tr X Y . The element-wise sum of the absolute values and the Frobenius norm of a matrix X are denoted as X 1 = ij |X ij | and X F = tr (X X), respectively. For x ∈ R p , x 2 stands for the usual Euclidean norm of x. If A is a symmetric matrix, λ max (A) denotes the maximum eigenvalue of A. Let A, B be two self-adjoint matrices. We write A B (A B) iff A − B is nonnegative (positive) definite. The symbols 1 and 0 denote the all ones and zeros vectors of appropriate dimension, respectively. The operator diag : R p×p → R p extracts the diagonal of a square matrix. The operator Diag : R p → R p×p creates a diagonal matrix with the elements of an input vector along its diagonal. (x) + denotes the projection of x onto the nonnegative orthant, i.e., the elementwise maximum between 0 and x. An undirected, weighted graph is usually denoted as a triple G = (V, E, W ), where V = {1, 2, . . . , p} is the vertex (or node) set, E ⊆ {{u, v} : u, v ∈ V} is the edge set, that is, a subset of the set of all possible unordered pairs of p nodes such that {u, v} ∈ E iff nodes u and v are connected. W ∈ R p×p + is the symmetric weighted adjacency matrix that satisfies W ii = 0, W ij > 0 iff {i, j} ∈ E and W ij = 0, otherwise. We denote a graph as a 4-tuple G = (V, E, W , f t ), where f t : V → {1, 2, . . . 
, t} is a function that associates a single type (label) to each vertex of the graph, where t is the number of possible types. This extension is necessary for computing certain graph properties of practical interest such as graph modularity. We denote the number of elements in E by |E|. The combinatorial, unnormalized graph Laplacian matrix L is defined, as usual, as L D − W , where D Diag(W 1) is the degree matrix. An Improper Gaussian Markov Random Field (IGMRF) (Rue and Held, 2005; Slawski and Hein, 2015) of rank p − k, k ≥ 1, is denoted as a p-dimensional, real-valued, Gaussian random variable x with mean vector E [x] µ and rank-deficient precision matrix The probability density function of x is then given as where det * (Ξ) is the pseudo (also known as generalized) determinant of Ξ, i.e., the product of its positive eigenvalues (Knill, 2014) . The data generating process is assumed to be a zero-mean, IGMRF x ∈ R p , such that x i is the random variable generating a signal measured at node i, whose rank-deficient precision matrix is modeled as a graph Laplacian matrix. This model is also known as Laplacian constrained Gaussian Markov Random Field (LGMRF) (Ying et al., 2020b) . Assume we are given n observations of x, i.e., X = x 1, * , x 2, * , . . . , x n, * , X ∈ R n×p , x i, * ∈ R p×1 . The goal of graph learning algorithms is to learn a Laplacian matrix, or equivalently an adjacency matrix, given only the data matrix X, i.e., often without any knowledge of E and f t . To that end, the classical penalized Maximum Likelihood Estimator (MLE) of the Laplacian-constrained precision matrix of x, on the basis of the observed data X, may be formulated as the following optimization program: where S is a similarity matrix, e.g., the sample covariance (or correlation) matrix S ∝ X X, and h α (L) is a regularization function, with hyperparameter vector α, to promote certain properties on L, such as sparsity or low-rankness. Problem (2) is a fundamental problem in the graph signal processing field that has served as a cornerstone for many extensions, primarily those involving the inclusion of structure onto L (Egilmez et al., 2017; Pavez et al., 2018; Kumar et al., 2019a Kumar et al., ,b, 2020 . Even though Problem (2) is convex, provided we assume a convex choice for h α (·), it is not adequate to be solved by Disciplined Convex Programming (DCP) languages, such as cvxpy (Diamond and Boyd, 2016) , particularly due to scalability issues related to the computation of the term log det * (L) (Egilmez et al., 2017; Zhao et al., 2019) . Indeed, recently, considerable efforts have been directed towards the design of scalable, iterative algorithms based on Block Coordinate Descent (BCD) (Wright, 2015) , Majorization-Minimization (MM) (Sun et al., 2017) , and ADMM (Boyd et al., 2011) to solve Problem (2) in an efficient fashion, e.g., (Egilmez et al., 2017) , (Zhao et al., 2019) , (Kumar et al., 2020) , and (Ying et al., 2020b) , just to name a few. To circumvent some of those scalability issues related to the computation of the term log det * (L), Lake and Tenenbaum (2010) proposed the following relaxed version with an 1 -norm penalization: In words, Problem (3) relaxes the original problem by forcing the precision matrix to be positive definite through the introduction of the term σI p , which bounds the minimum eigenvalue ofL to be at least σ, and thus the generalized determinant can be replaced by the usual determinant. 
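To make the role of the generalized determinant concrete, the following minimal sketch (in Python/NumPy, illustrative only and not part of the released fingraph package) builds a combinatorial Laplacian L = D − W from a small weighted adjacency matrix, verifies that L1 = 0, and evaluates log det*(L) as the sum of the logarithms of the positive eigenvalues, which is the quantity appearing in the penalized MLE; the shift tricks discussed here and later simply move the zero eigenvalue so that an ordinary determinant can be used.

```python
import numpy as np

# Small weighted adjacency matrix (symmetric, zero diagonal) of a connected graph.
W = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 0.0, 2.0],
              [0.5, 0.0, 0.0, 1.5],
              [0.0, 2.0, 1.5, 0.0]])

# Combinatorial graph Laplacian L = D - W.
L = np.diag(W.sum(axis=1)) - W

# Property (P1): the all-ones vector spans the nullspace of a connected graph Laplacian.
print(np.allclose(L @ np.ones(4), 0.0))            # True

# Generalized (pseudo) determinant: product of the positive eigenvalues of L.
eigvals = np.linalg.eigvalsh(L)
log_pdet = np.sum(np.log(eigvals[eigvals > 1e-9]))

# Equivalent trick used later in Problem (22): det*(L) = det(L + J), J = (1/p) 11^T,
# for a connected graph, so an ordinary (log-)determinant can be computed instead.
p = L.shape[0]
J = np.ones((p, p)) / p
log_det_shifted = np.linalg.slogdet(L + J)[1]
print(np.isclose(log_pdet, log_det_shifted))       # True
```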
Although the technical issues related to the generalized determinant have been seemingly dealt with (albeit in an indirect way), in this formulation, there are twice as many variables to be estimated, which turns out to be prohibitive when designing practical, scalable algorithms, and the applicability of cvx, like it was done in (Lake and Tenenbaum, 2010) , is only possible in small scale (p ≈ 50) scenarios. In order to solve this scalability issue, Hassan-Moghaddam et al. (2016) and Egilmez et al. (2017) proposed customized BCD algorithms (Shalev-Shwartz and Tewari, 2011; Saha and Tewari, 2013; Wright, 2015) to solve Problem (2) assuming an 1 -norm regularization, i.e., h α (L) α i =j |L ij | = −α i =j L ij , in order to promote sparsity on the resulting estimated Laplacian matrix. However, as recently shown by Ying et al. (2020b,a) , in contrast to common practices, the 1 -norm penalization surprisingly leads to denser graphs to the point that, for a large value of α, the resulting graph will be fully connected with uniformly distributed graph weights. On the other hand, due to such nuisances involved in dealing with the term log det * (L), several works departed from the LGMRF formulation altogether. Instead, they focused on the assumption that the underlying signals in a graph are smooth (Kalofolias, 2016; Dong et al., 2016; Chepuri et al., 2017) . In its simplest form, learning a smooth graph from a data matrix X ∈ R n×p is tantamount to finding an adjacency matrix W that minimizes the Dirichlet energy, i.e., Problem (4) can also be equivalently expressed in terms of the Laplacian matrix: In order for Problems (4) and (5) to be well-defined, i.e., to avoid the trivial solution W = 0 (L = 0), several constraints have been proposed in the literature. For instance, Dong et al. (2016) proposed one of the first estimators for graph Laplacian as the following nonconvex optimization program: where the constraint tr (L) = p is imposed to fix the sum of the degrees of the graph, and α and η are positive, real-valued hyperparameters that control the amount of sparsity in the estimated graph. More precisely, for a given fixed α, increasing (decreasing) η leads to sparser (denser) graphs. Dong et al. (2016) adopted an iterative alternating minimization scheme in order to find an optimal point of Problem (6), whereby at each iteration one of the variables is fixed while the solution is found for the other variable. However, the main shortcoming of formulation (6) is that it does not scale well for big data sets due to the update of the variable Y ∈ R n×p , which scales as a function of the number of observations n. In some graph learning problems in financial markets, the number of price recordings may be orders of magnitude larger than the number of nodes (financial assets), particularly in high frequency trading scenarios (Kirilenko et al., 2017 ). Yet following the smooth signal assumption, Kalofolias (2016) proposed a convex formulation as follows: where Z ij x * ,i − x * ,j 2 2 , α and η are positive, real-valued hyperparameters that control the amount of sparsity in the estimated graph, with the same interpretation as in Problem (6), and log(W 1), assumed to be evaluated element-wise, is a regularization term added to avoid the degrees of the graph from becoming zero. Problem (7) is convex and can be solved via primal-dual, ADMM-like algorithms (Komodakis and Pesquet, 2015) . It can be seen that the objective function in Problem (7) is actually an approximation of that of Problem (2). 
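The equivalence between the Laplacian form tr(X L X^T) and the adjacency form (1/2) tr(W Z) of the Dirichlet energy, which underlies the smooth-signal formulations (4)-(7), can be checked numerically with the short sketch below (illustrative only; variable names are ours).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 6                      # n observations of a signal on p nodes
X = rng.standard_normal((n, p))    # data matrix, columns = node signals

# Random symmetric adjacency with zero diagonal, and its Laplacian.
W = rng.uniform(0, 1, (p, p))
W = np.triu(W, 1); W = W + W.T
L = np.diag(W.sum(axis=1)) - W

# Z_ij = ||x_{*,i} - x_{*,j}||^2: squared Euclidean distances between node signals (columns of X).
sq_norms = (X ** 2).sum(axis=0)
Z = sq_norms[:, None] + sq_norms[None, :] - 2 * X.T @ X

# Dirichlet energy: tr(X L X^T) == 0.5 * tr(W Z) == 0.5 * sum_ij W_ij ||x_{*,i} - x_{*,j}||^2.
energy_laplacian = np.trace(X @ L @ X.T)
energy_adjacency = 0.5 * np.sum(W * Z)
print(np.isclose(energy_laplacian, energy_adjacency))   # True
```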
From Hadamard's inequality (Różański et al., 2017) , we have Therefore, Problem (7) can be thought of as an approximation of the penalized maximum likelihood estimator with a Frobenius norm regularization in order to bound the graph weights. The graph learning formulations previously discussed are only applicable to learn connected graphs. Learning graphs with a prior structure, e.g., k-component graphs, poses a considerably higher challenge, as the dimension of the nullspace of the Laplacian matrix L is equal to the number of components of the graph (Chung, 1997) . Therefore, algorithms have to ensure that the algebraic multiplicity of the the 0 eigenvalue is equal to k. However, the latter condition is not sufficient to rule out the space of "trivial" k-component graphs, i.e., graphs with isolated nodes. In addition to the rank (or nullity) constraint on L, it is necessary to specify a constraint on the degrees of the graph. Recent efforts have been made to introduce theoretical results from spectral graph theory (Chung, 1997) into practical optimization programs. For instance, a formulation to estimate k-component graphs based on the smooth signal approach was proposed in (Nie et al., 2016) . More precisely, they proposed the Constrained Laplacian-rank (CLR) algorithm, which works in two-stages. On the first stage it estimates a connected graph using, e.g., the solution to Problem (7), and then on the second stage it heuristically projects the graph onto the set of Laplacian matrices of dimension p with rank p − k, where k is the given number of graph components. This approach is summarized in the following two stages: 1. Obtain an initial affinity matrix A as the optimal value of: 2. Find a projection of A such that L = Diag( B +B 2 ) − B +B 2 has rank p − k: where k is the desired number of graph components. Spectral constraints on the Laplacian matrix are an intuitive way to recover k-component graphs as the multiplicity of its zero eigenvalue, i.e., the nullity of L, dictates the number of components of a graph. The first framework to impose structures on the estimated Laplacian matrix under the LGMRF model was proposed by Kumar et al. (Kumar et al., 2019a , 2020 , through the use of spectral constraints, as follows: where the term η 2 L − U Diag(λ)U 2 F , often called spectral regularization, is added as a penalty term to indirectly promote L to have the same rank as U Diag(λ)U , i.e., p − k, k is the number of components of the graph to be chosen a priori, and η > 0 is a hyperparameter that controls the penalization on the spectral factorization of L, and c 1 and c 2 are positive, real-valued constants employed to promote bounds on the eigenvalues of L. Note that Problem (11) learns a k-component graph without the need for a two-stage algorithm. However, a clear caveat of this formulation is that it does not control the degrees of the nodes in the graph, which may result in a trivial solution that contains isolated nodes, turning out not to be useful for clustering tasks especially when applied to noisy data sets or to data sets that are not significantly Gaussian distributed. In addition, choosing values for hyperparameters η, c 1 , and c 2 , is often an intricate task. In this section, we present novel interpretations and motivations for the Laplacian constraints from the point of view of graphs learned from financial markets data. Those interpretations lead to paramount guidelines that users may benefit from when applying graph learning algorithms in financial problems. 
In addition, we provide sound justifications for the usage of the Laplacian matrix as a model for the inverse correlation matrix of financial assets. Graphical representations of data are increasingly important tools in financial signal processing applied to uncover hidden relationships between variables (de Prado, 2016; Marti et al., 2017a) . In financial markets, one is generally interested in learning quantifiable dependencies among assets and how to leverage them into practical scenarios such as portfolio design and crisis forecasting. Arguably, one of the most successful methods to estimate sparse graphs is the Graphical Lasso (Friedman et al., 2008; Banerjee et al., 2008) , modeled as the solution to the following convex optimization problem: where S ∝ X X is an empirical covariance (or correlation) matrix. The solution to Problem (12) can be efficiently computed via the well-known glasso algorithm (Friedman et al., 2008; Sustik and Calderhead, 2012) . While this model has been extremely successful in numerous fields, imposing a Laplacian structure onto Σ −1 brings significant benefits in financial data settings. To see that, we empirically evaluate the out-of-sample log-likelihood using three models: 1. Graphical Lasso as defined in (12). 2. Multivariate Totally Positive of Order 2 (MTP 2 ), given as 3. Laplacian GMRF (LGMRF) as defined in (2), without regularization. 2 We collect log-returns data from p = 50 randomly chosen stocks from the S&P500 index during the period between Jan. 4th 2005 to Jul. 1st 2020 totalling n = 3900 observations. We subdivide the observations into 26 sequential datasets each of which containing 150 observations. For the i-th dataset, we estimate the models (12) and (13) for different values of the hyperparameter α, and compute their log-likelihood using the (i + 1)-th dataset. We then average the log-likelihood measurements over the datasets. In this fashion, we can infer how well these models generalize to unseen data. Figure 2 shows the log-likelihood measurements in this experiment. We can readily notice that not only the LGMRF model has the higher explanatory power among the considered models, but it is also the simplest of them, as it does not contain any hyperparameter. In addition, we plot one instance of the estimated networks from each of the models. Interestingly, Figure 3a reveals that Graphical Lasso estimates most conditional correlations as positive (blue edges), with only a few negative ones (red edges). In addition, both Graphical Lasso and MTP 2 (Figure 3b ) do not clearly uncover strong connections between clearly correlated stocks. The LGMRF model (Figure 3c ), on the other hand, displays vividly the interactions between evidently correlated nodes, e.g., {CAH and MCK} and {SIVB and PBCT}, which are companies in the health care industry and bank holdings, respectively. From the perspective of the LGMRF model, we would like to estimate a matrix L that enjoys the following two key properties: The first property states that the Laplacian matrix L is singular and the eigenvector associated with its zero eigenvalue is given by a1, a ∈ R, i.e., the eigenvector is constant along all its components. 
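Property (P1) can be checked empirically on log-return data, and the experiment described next carries out such a check on real S&P 500 data. The sketch below is an illustration of how it can be done (hypothetical variable names; returns is assumed to be an n × p matrix of log-returns): the condition number of the sample correlation matrix equals that of its inverse, and the eigenvector attached to the largest eigenvalue of the correlation matrix is the one attached to the (near-)zero eigenvalue of the inverse, i.e., the candidate market eigenvector.

```python
import numpy as np

def market_eigenvector_check(returns):
    """returns: n x p matrix of log-returns (hypothetical input).
    Returns the condition number of the sample correlation matrix and the
    entry-wise variance of its leading eigenvector and of the runner-up."""
    corr = np.corrcoef(returns, rowvar=False)       # p x p sample correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)         # ascending eigenvalues
    cond_number = eigvals[-1] / eigvals[0]
    return cond_number, np.var(eigvecs[:, -1]), np.var(eigvecs[:, -2])

# Usage sketch with simulated one-factor data standing in for real log-returns:
rng = np.random.default_rng(1)
n, p = 2000, 400
beta = rng.uniform(0.5, 1.5, p)
x_mkt = rng.standard_normal(n)
returns = np.outer(x_mkt, beta) + 0.5 * rng.standard_normal((n, p))
cond, var_market, var_next = market_eigenvector_check(returns)
print(cond, var_market, var_next)   # large condition number; var_market typically << var_next
```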
Figure 2: Average log-likelihood in out-of-sample data for different precision matrix estimation models as a function of the sparsity promoting hyperparameter α.

Based on the empirical and theoretical discussions about the spectrum of correlation matrices of stock time series (Plerou et al., 1999), we conduct an additional experiment to verify whether the sample inverse correlation matrix of stocks shares the aforementioned properties: we query data from p = 414 stocks belonging to the S&P500 index from January 4th 2005 to June 18th 2020, totalling n = 3869 observations. We then divide this dataset into 19 sequential overlapping datasets, each containing n = 2070 observations, such that n/p = 5. For each dataset, we compute two attributes of the inverse sample correlation matrix: (i) its condition number, defined as the ratio between its maximum and minimum eigenvalues, and (ii) the variance of each eigenvector. We observe that the smallest condition number across all datasets is of the order of 10^4, while the median of the condition numbers is of the order of 10^5, indicating that in fact the inverse sample correlation matrices are nearly singular. In addition, the average variance of the eigenvector associated with the zero eigenvalue is 2 · 10^−4, which is around an order of magnitude smaller than the average variance of any of the other eigenvectors, indicating that it is, in fact, a constant eigenvector. Figure 4 illustrates this phenomenon for the aforementioned dataset, where the constant nature of the market eigenvector (eigenvector#1) is clearly observable when compared to the variability of the next two eigenvectors (eigenvector#2, eigenvector#3).

Figure 4: Eigenvectors of the sample correlation matrix of 414 S&P500 stocks corresponding to the largest three eigenvalues over the period between Jan. 2005 and Jun. 2020. eigenvector#1 represents the market factor; eigenvector#2 and #3 are displayed for reference on the expected variability.

In practice, (P1) implies that signals living in a graph G have zero graph-mean, i.e., the sum of the graph signals, at a given time, is zero. From a stock market perspective, the vector of log-returns of a set of stocks, at a given time i, is often assumed to follow a linear factor model (Sharpe, 1964; Fama and French, 2004), i.e., x_{*,i} = β x_{mkt,i} + ε_i, where x_{*,i} ∈ R^{p×1} contains the log-returns of p stocks, x_{mkt,i} ∈ R is the log-return of the market factor, ε_i is the vector of idiosyncratic log-returns, often assumed to be a Gaussian process with zero mean vector and covariance matrix Ψ, and β is the vector of market factor loadings. Because G has zero graph-mean, a graph designed to accommodate stock signals, assumed to follow a linear market factor model, will automatically remove the market component from the learning process of the conditional dependencies among stocks. This is a crucial feature because the market component would likely be a confounding factor in the estimation of the conditional correlations due to its strong influence on all the stocks' log-returns. In addition, (P2) together with (P1) implies that L is positive semidefinite. The fact that the off-diagonal entries are symmetric and non-positive means that the Laplacian matrix only represents non-negative conditional dependencies (the correlation between any two nodes conditioned on the rest of the graph is given as −L_{ij}/√(L_{ii} L_{jj})). This assumption is often met for stock data, as assets are typically positively dependent (Plerou et al., 2002; Kazakov and Kalyagin, 2016; Agrawal et al., 2020; Soloff et al., 2020; Wang et al., 2020). These two properties, along with efficient learning frameworks, make the Laplacian-based graphical model a natural candidate for learning graphs of stock data. As a consequence of using the Laplacian model, we propose the following guidelines when estimating Laplacian matrices with stock market data:
• Correlation vs Covariance: Both the LGMRF and smooth signal approaches rely on the Dirichlet energy term tr(SL) ∝ tr(W Z), which quantifies the smoothness of the graph signals over the graph weights, where S is the sample covariance matrix. From the definition of Z in (7), we observe that two perfectly correlated stocks with a large Euclidean distance between them would be translated into nodes placed far apart on the graph. Hence, we advocate the use of the sample correlation matrix S̄ = Diag(S)^{−1/2} S Diag(S)^{−1/2} (or equivalently scaling the columns of X such that they have unit variance) in case we would like two highly correlated stocks to have a strong graph connection regardless of their individual variances.
• Removing the market trend: A widely used and tested model for the returns of the stocks is the linear factor model, which explicitly includes the dependency on the market factor: x_{*,i} = β x_{mkt,i} + ε_i. Assuming that most of the stocks are heavily dominated by the market index x_{mkt,i}, it may be convenient to remove that component if we seek to explore the structure of the residual cross-dependency among the stocks, ε_i. Thus, an alternative to using the full covariance matrix Σ is to use the covariance matrix Ψ of the idiosyncratic component. However, if one first normalizes each stock, whose variance is V(x_{*,i}) ≈ β_i^2, we have x̄_{*,i} = 1 x_{mkt,i} + ε̄_i, and it turns out that the market factor is automatically removed in the normalized squared distance matrix Z̄, since the common market term cancels in every pairwise difference of normalized stock signals used to build Z̄.
• Degree control: Enforcing a rank smaller than p − 1 on the Laplacian matrix will generate a k-component graph, which is one desired goal. However, one may get the undesired result of having isolated nodes. One possible strategy to avoid isolated nodes is to introduce constraints on the node degrees. The LGMRF formulation has the natural penalty term log det*(L) in the objective, but that does not help in controlling the degrees of the nodes. Instead, some of the graph learning formulations from smooth signals include degree control via the constraint W1 = 1, which fixes the degrees of all the nodes to 1. The regularization term 1^⊤ log(W1) also prevents any degree from becoming 0. Hence, any graph learning formulation that enforces a k-component graph (or low-rank Laplacian matrix) should also control the degrees of the nodes to avoid a trivial solution with isolated nodes.

In this section, we design iterative algorithms for numerous graph learning formulations to account for k-component structures and the heavy-tail nature of financial stock market data. The proposed algorithms are based on the ADMM (Boyd et al., 2011) and MM (Ortega and Rheinboldt, 2000; Sun et al., 2017) frameworks. We begin by briefly revisiting ADMM and MM.
ADMM is a primal-dual framework designed to solve the following class of optimization problems: minimize where x ∈ R n and z ∈ R m are the optimization variables; A ∈ R p×n , B ∈ R p×m , and, c ∈ R p are parameters; and f and g are convex, proper, closed, possibly non-differentiable functions. The central object in the ADMM framework is the augmented Lagrangian function, which is given as where ρ is a penalty parameter. The basic workflow of the ADMM algorithm is summarized in Algorithm 1. The convergence of ADMM algorithms is attained provided that the following conditions are met: both closed nonempty convex sets; 2. The unaugmented Lagrangian function L 0 has a saddle point. We refer readers to (Boyd et al., 2011) where elaborate convergence results are discussed. The MM framework seeks to solve the following general optimization problem: where here we consider f a smooth, possibly non-convex function. The general idea behind MM is to find a sequence of feasible points x i i∈N by minimizing a sequence of carefully constructed global upper-bounds of f . The popular expectationmaximization (EM) algorithm is a special case of MM (Wu and Lange, 2010) . At point x i , we design a continuous global upper-bound function g ·, Then, in the minimization step we update x as The global upper-bound function g(·, x i ) must satisfy the following conditions in order to guarantee convergence: A thorough discussion about MM, along with a significant number of its extensions, with practical examples, can be found in (Sun et al., 2017) . We formulate the graph learning problem from the LGMRF perspective as the following general optimization program: where C L is a set describing additional constraints onto the structure of the estimated Laplacian matrix, e.g., C L = {L : diag (L) = d1, d > 0} specifies the set of d-regular graphs. Now, to split the constraints in Problem (20), we introduce the following linear transformations: (a) L = Lw, w ∈ R p(p−1)/2 + , where L is the Laplacian operator (cf. Definition (62)) and w is the vector of edges weights; and (b) Θ = Lw. With this, we equivalently rewrite Problem (20) as minimize where C Θ and C w are sets describing additional constraints onto the structure of the estimated Laplacian matrix. For example, to estimate connected d-regular graphs we can use (64)). While Problem (21) can be convex for a limited family of graph structures, convex programming languages, such as cvxpy, have shown to perform poorly even for considerably small (p ≈ 50) graphs (Egilmez et al., 2017) . Hence, we develop scalable algorithms based on the ADMM and MM frameworks. We first specialize Problem (21) to the class of connected graphs. The rationale for that is twofold: (1) while this problem has been well studied, we propose a significantly different algorithm than previous works (Egilmez et al., 2017; Zhao et al., 2019; Ying et al., 2020b) by splitting the optimization variables whereby additional constraints can be easily introduced and handled via ADMM; (2) in addition, the mathematical developments described for this simple class of graphs will serve as building blocks when we tackle more elaborate classes of graphs such as k-component or heavy-tailed. 
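The linear operator L that maps a vector of p(p−1)/2 edge weights to a p × p Laplacian, and its adjoint L*, are used throughout the remaining derivations (cf. Definition (62), not reproduced in this excerpt). A minimal sketch of both maps, following a usual edge-ordering convention and therefore only an illustration of how such operators can be implemented, is given below; the adjoint is verified through the identity ⟨Lw, Y⟩ = ⟨w, L*Y⟩.

```python
import numpy as np
from itertools import combinations

def L_op(w, p):
    """Map an edge-weight vector w (length p(p-1)/2) to the p x p Laplacian Lw."""
    W = np.zeros((p, p))
    for k, (i, j) in enumerate(combinations(range(p), 2)):
        W[i, j] = W[j, i] = w[k]
    return np.diag(W.sum(axis=1)) - W

def L_adj(Y):
    """Adjoint operator: (L* Y)_k = Y_ii + Y_jj - Y_ij - Y_ji for edge k = (i, j)."""
    p = Y.shape[0]
    return np.array([Y[i, i] + Y[j, j] - Y[i, j] - Y[j, i]
                     for i, j in combinations(range(p), 2)])

# Numerical check of the adjoint identity <Lw, Y> = <w, L* Y>.
rng = np.random.default_rng(0)
p = 5
w = rng.uniform(0, 1, p * (p - 1) // 2)
Y = rng.standard_normal((p, p))
print(np.isclose(np.sum(L_op(w, p) * Y), np.dot(w, L_adj(Y))))   # True
```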
For connected graphs, we rely on the fact that det * (Θ) = det (Θ + J ) (Egilmez et al., 2017) , where J = 1 p 11 , to formulate the following convex optimization problem: The partial augmented Lagrangian function of Problem (22) can be written as where Y and y are the dual variables associated with the constraints Θ = Lw and dw = d, respectively. Note that we will deal with the constraints w ≥ 0 and Θ 0 directly, hence there are no dual variables associated with them. The subproblem for Θ can be written as Now, making the simple affine transformation Ω l+1 = Θ l+1 + J , we have which can be expressed as a proximal operator (Parikh and Boyd, 2014) , cf. Definition (69), whose closed-form solution is given by Lemma 1. The global minimizer of problem (26) is (Witten and Tibshirani, 2009; Danaher et al., 2014 ) where U ΓU is the eigenvalue decomposition of ρ Lw l + J − Y l . Hence the closed-form solution for (24) is Now, using the linear properties of adjoint operators, we have that tr (SLw) = w, L * S and Lw 2 F = tr (LwLw) = w L * Lw. Then, the subproblem for w can be written as which is a nonnegative, convex quadratic program. Lemma 2 Problem (29) is strictly convex. Proof It suffices to show that the matrix d * d + L * L is positive definite. For any To see that L * L is positive definite, we refer the readers to (Ying et al., 2020a, Lemma 5.3) . While the solution to Problem (29) might seem straightforward to obtain via quadratic programming solvers, it actually poses an insurmountable scalability issue: the dimension of the matrices d * d and L * L is p(p − 1)/2 × p(p − 1)/2, implying that the worst-case complexity of a convex QP solver for this problem is O(p 6 ) (Ye and Tse, 1989) , which is impractical. In addition, no closed-form solution is available. Given these difficulties, we resort to the MM method, whereby we construct an upperbound of the objective function of (29) at point w where f (·) is the objective function in the minimization in (29), µ = ρλ max (d * d + L * L), and the maximum eigenvalue of d * d + L * L is given by Lemma 3. The maximum eigenvalue of the matrix d * d + L * L is given as Proof The proof is deferred to Appendix C.1. Thus, we have the following approximate strictly convex subproblem for w, whose solution can be readily obtained via its KKT optimality conditions and its given as which is a projected gradient descent step with learning rate 2ρ(2p − 1). Thus, we iterate (35) in order to obtain the unique optimal point, w l+1 , of Problem (29). In practice, we observe that a few (≈ 5) iterations are sufficient for convergence. The dual variables Y and y are updated as and A practical implementation for the proposed ADMM estimation of connected graphs is summarized in Algorithm 2, whose convergence is stated in Theorem 4. Data: Similarity matrix S, initial estimate of the graph weights w 0 , desired degree vector d, penalty parameter ρ > 0, tolerance > 0 Result: Laplacian estimation: iterate (35) until convergence so as to obtain w l+1 6 update Y l+1 as in (36) 7 update y l+1 as in (37) Theorem 4 The sequence Θ l , w l , Y l , y l generated by Algorithm 2 converges to the optimal primal-dual solution of Problem (22). Proof The proof is deferred to Appendix C.2. As discussed in Section 3, in addition to a rank constraint, some form of control of the node degrees is necessary to learn meaningful k-component graphs. 
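Before moving to k-component graphs, the closed-form Θ-update of Algorithm 2 is worth illustrating. The sketch below implements the standard proximal step of −log det (Witten and Tibshirani, 2009; Danaher et al., 2014) under the scaling implied by (26); the constants follow our own derivation of the stationarity condition ρΩ − Ω^{-1} = ρM and should be read as a sketch rather than a verbatim transcription of the paper's expression (28).

```python
import numpy as np

def theta_update(Lw, J, Y, rho):
    """Solve  minimize_{Omega > 0}  -log det(Omega) + (rho/2) ||Omega - M||_F^2,
    with M = Lw + J - Y / rho, and return Theta = Omega - J.
    Diagonalizing rho*M = U Gamma U^T, stationarity gives
    Omega = U diag((gamma + sqrt(gamma^2 + 4*rho)) / (2*rho)) U^T."""
    gamma, U = np.linalg.eigh(rho * (Lw + J) - Y)
    omega = (gamma + np.sqrt(gamma ** 2 + 4.0 * rho)) / (2.0 * rho)
    Omega = (U * omega) @ U.T
    return Omega - J

# Sanity check of the optimality condition rho*Omega - Omega^{-1} = rho*M:
rng = np.random.default_rng(0)
p, rho = 6, 1.0
A = rng.standard_normal((p, p))
Lw = A @ A.T                      # stand-in symmetric matrix (any symmetric matrix works here)
J = np.ones((p, p)) / p
Y = np.zeros((p, p))
Omega = theta_update(Lw, J, Y, rho) + J
M = Lw + J - Y / rho
print(np.allclose(rho * Omega - np.linalg.inv(Omega), rho * M))   # True
```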
Here we choose which is translated to the framework of Problem (21) as Remark: Although the rank constraints on both variables Θ and w may seem redundant, we have observed that it greatly improves the empirical convergence of the algorithm. In addition, the rank constraint on Θ does not incur any additional computational cost, as will be shown in the numerical algorithmic derivations bellow. Thus, Problem (21) can be specialized for the task of learning a k-component graph as the following non-convex optimization program: However, unlike the rank constraint in the subproblem associated with Θ, the constraint rank(Lw) = p − k cannot be directly dealt with. An alternative is to move this constraint to the objective function by approximating it by noting that it is equivalent to having the sum of the k smallest eigenvalues of Lw equals zero, i.e., k i=1 λ i (Lw) = 0 (Nie et al., 2016) , assuming the sequence of eigenvalues {λ i (Lw)} p i=1 in increasing order. By Fan's theorem (Fan, 1949) Thus, moving (41) into the objective function of Problem (40), we have the following relaxed problem: where η > 0 is a hyperparameter that controls how much importance is given to the term tr V LwV , which indirectly promotes rank(Lw) = p − k. Therefore, via (41), we are able to incorporate the somewhat intractable constraint rank(Lw) = p − k as a simple term in the optimization program. The partial augmented Lagrangian function of Problem (42) can be written as The subproblem for Θ can be written as which is tantamount to that of (25). Its solution is also given as except that now U ΓU is the eigenvalue decomposition of ρLw l − Y l , with Γ having the largest p − k eigenvalues along its diagonal and U ∈ R p×(p−k) contains the corresponding eigenvectors. The update for w is carried out similarly to that of (35), i.e., except that the coefficient a i is given as We have the following subproblem for V : whose closed-form solution is given by the k eigenvectors associated with the k smallest eigenvalues of Lw l+1 (Horn and Johnson, 1985; Absil et al., 2007) . The updates for the dual variables Y and y are exactly the same as in (36) and (37), respectively. A practical implementation for the proposed ADMM estimation of k-component graphs is summarized in Algorithm 3, named kGL. Its complexity is bounded by the complexity of the eigenvalue decomposition in line 4. Its convergence is stated in Theorem 5. Theorem 5 Algorithm 3 subsequently converges for any sufficiently large ρ, that is, the sequence Θ l , w l , V l , Y l , y l generated by Algorithm 3 has at least one limit point, and each limit point is a stationary point of (43). Proof The proof is deferred to Appendix C.3. Following the LGMRF framework, Ying et al. (2020a,b) recently proposed non-convex regularizations so as to obtain sparse representations of the resulting estimated graphs. Enforcing sparsity is one possible way to remove spurious conditional correlations between nodes in the presence of data with outliers. However, we advocate that assuming a principled, heavy-tailed statistical distribution has more benefits for the financial data setting, rather than simply imposing arbitrary, non-convex regularizations, because they are often cumbersome to deal with from a theoretical perspective and, in practice, they bring the additional task of tunning hyperparameters, which is often repetitive. 
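Returning to the k-component machinery above: the V-update (48) simply takes the k eigenvectors associated with the k smallest eigenvalues of Lw (by Fan's theorem), and the nullity of the final Laplacian reveals the number of graph components. Both facts are illustrated in the sketch below (illustrative code, not the fingraph implementation).

```python
import numpy as np

def v_update(Lw, k):
    """Solve  minimize_V tr(V^T Lw V)  s.t. V^T V = I_k  (Fan's theorem):
    V is given by the k eigenvectors of the k smallest eigenvalues of Lw."""
    _, U = np.linalg.eigh(Lw)     # eigenvalues in ascending order
    return U[:, :k]

def number_of_components(Lw, tol=1e-8):
    """The multiplicity of the (near-)zero eigenvalue of a Laplacian equals
    the number of connected components of the graph."""
    eigvals = np.linalg.eigvalsh(Lw)
    return int(np.sum(eigvals < tol))

# Example: a block-diagonal Laplacian of two disjoint 2-node graphs has nullity 2,
# and tr(V^T Lw V) = 0 at the optimal V, so the relaxation term (41) vanishes.
Lw = np.array([[ 1., -1.,  0.,  0.],
               [-1.,  1.,  0.,  0.],
               [ 0.,  0.,  2., -2.],
               [ 0.,  0., -2.,  2.]])
V = v_update(Lw, k=2)
print(number_of_components(Lw), np.trace(V.T @ Lw @ V))   # 2, ~0.0
```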
Data: Similarity matrix S, initial estimate of the graph weights w 0 , desired number of graph components k, desired degree vector d, penalty parameter ρ > 0, tolerance > 0 Result: Laplacian estimation: (46) until convergence with a i given as in (47) so as to obtain w l+1 6 update V l+1 as in (48) 7 update Y l+1 as in (36) 8 update y l+1 as in (37) In order to address the inherently heavy-tailed nature of financial stock data (Resnick, 2007) , we consider the Student-t distribution under the Improper Markov Random Field assumption (Rue and Held, 2005) with Laplacian structural constraints, that is, we assume the data generating process to be modeled a multivariate zero-mean Student-t distribution, whose probability density function can be written as where Θ is a positive-semidefinite inverse scatter matrix modeled as a combinatorial graph Laplacian matrix. This results in a robustified version of the penalized MLE for connected graph learning, i.e., Problem (50) is non-convex due to the terms involving the log(·) function and hence it is difficult to be dealt with directly. To tackle this issue, we leverage the MM framework whereby the concave terms in (50) are linearized, which essentially results in a weighted Gaussian likelihood (Sun et al., 2016 (Sun et al., , 2017 Wald et al., 2019) . We start by following the exposition in the preceding sections, then the partial augmented Lagrangian function of Problem (50) is given as The subproblem for Θ is identical to that of (24). The subproblem for w can be written as which is similar to that of subproblem (29), except it contains the additional concave term Similarly to subproblem (29), we employ the MM framework to formulate an efficient iterative algorithm to obtain a stationary point of Problem (52). We proceed by constructing a global upper bound of Problem (52). Using the fact that the logarithm is globally upper-bounded by its first-order Taylor expansion, we have which results in the following upper bound: where By upper-bounding the objective function of Problem (52), at point w j = w l , with (54), the vector of graph weights w can then be updated by solving the following nonnegative, quadratic-constrained, strictly convex problem: The objective function of Problem (55) can be upper-bounded once again following the same steps as the ones taken for Problem (30), which results in a projected gradient descent step as in (35) with The dual variables Y and y are updated exactly as in (36) and (37), respectively. Algorithm 4, named tGL, summarizes the implementation to solve Problem (50). The complexity of Algorithm 4 is bounded by the complexity of the eigenvalue decomposition in line 4 and its convergence is stated by Theorem 6. Data: Data matrix X ∈ R n×p , initial estimate of the graph weights w 0 , desired degree vector d, penalty parameter ρ > 0, degrees of freedom ν, tolerance > 0 Result: Laplacian estimation: Remark: in practical code implementations, the rank-1 data matrices x i, * x i, * , i = 1, 2, ..., n, involved in the computation of (56), are only necessary through the terms L * x i, * x i, * , which can be readily pre-computed before the starting of the iterative process. Theorem 6 Algorithm 4 subsequently converges for any sufficiently large ρ, that is, the sequence Θ l , w l , Y l , y l generated by Algorithm 4 has at least one limit point, and each limit point is a stationary point of (50). Proof The proof is deferred to Appendix C.4. 
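The practical effect of the MM linearization in (53)-(54) is that each observation enters the w-update through a data-adaptive weight that shrinks the influence of outlying returns. A sketch of this weighting is given below; it is written for illustration, with the constant factors following our reading of the Student-t bound rather than the paper's exact expression.

```python
import numpy as np

def studentt_weighted_scatter(X, Lw, nu):
    """Weighted scatter matrix arising from linearizing the Student-t terms
    log(1 + x_i^T Lw x_i / nu): observation i receives weight
    (p + nu) / (nu + x_i^T Lw x_i), so heavy-tailed returns are down-weighted.
    The result plays the role of the similarity matrix S in the Gaussian w-update."""
    n, p = X.shape
    quad = np.einsum('ij,jk,ik->i', X, Lw, X)          # x_i^T Lw x_i for each observation
    weights = (p + nu) / (nu + quad)
    return (X * weights[:, None]).T @ X / n            # (1/n) sum_i weight_i x_i x_i^T

# Usage sketch: as nu -> infinity the weights tend to 1 and the Gaussian
# sample scatter X^T X / n is recovered.
rng = np.random.default_rng(0)
X = rng.standard_t(df=4, size=(500, 8))
Lw = np.eye(8)                                         # stand-in for the current Laplacian iterate
S_robust = studentt_weighted_scatter(X, Lw, nu=4.0)
S_gauss = X.T @ X / X.shape[0]
print(np.linalg.norm(S_robust - S_gauss) > 0)          # the weights differ from the Gaussian case
```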
Extending Problem (50) for k-component graphs follows the same strategy as in Problem (42), which results in the following optimization program Following the exposition in the preceding sections, the partial augmented Lagrangian function of Problem (57) is given as The subproblems for the variables Θ and V are identical to those of Problem (42), hence they follow the same update expressions. The subproblem for w is virtually the same as in (55), except for the additional term ηtr(LwV l V l ) = η w, L * V l V l . Hence, its update is also a projected gradient descent step, alike (35) where The dual variables Y and y are updated as in (36) and (37), respectively. Algorithm 5, named ktGL, summarizes the implementation to solve Problem (57). Theorem 7 Algorithm 5 subsequently converges for any sufficiently large ρ, that is, the sequence Θ l , w l , V l , Y l , y l generated by Algorithm 5 has at least one limit point, and each limit point is a stationary point of (57). Proof See Appendix C.5. We perform experiments with price data queried from S&P500 stocks. In such real-world data experiments, where the a ground-truth graph cannot possibly be known, we evaluate the performance of the learned graphs by visualizing the resulting estimated graph network and verifying whether it is aligned with prior, expert knowledge available, e.g., the GICS sector information of each stock 4 . In addition, we employ measures such as graph modularity (cf. Definition (68)) and density as an objective criterion to evaluate the quality of the estimated graphs. Algorithm 5: k-component Student-t graph learning (ktGL) Data: Data matrix X ∈ R n×p , initial estimate of the graph weights w 0 , desired number of graph components k, desired degree vector d, degrees of freedom ν, penalty parameter ρ > 0, tolerance > 0 Result: Laplacian estimation: Lw 1 initialize Y = 0, y = 0 2 l ← 0 3 while max |r l | > or max |s l | > do 4 update Θ l+1 via (28) 5 update w l+1 as in (35) with a j given as in (59) 6 update V l+1 as in (48) 7 update Y l+1 as in (36) 8 update y l+1 as in (37) Baseline algorithms: We compare the proposed algorithms (Table 1) with state-of-the-art, baseline algorithms, depending on the specific graph structure that they are suitable to estimate. In the existing literature, it is a common practice not to compare graph algorithms that adopt distinct operational assumptions, i.e., the LGMRF approach and the smooth signal approach. This separation is certainly useful from a theoretical perspective. In this work, however, we are mostly interested in the applicability of graph learning algorithms in practical scenarios and whether the estimated graphs are aligned with prior expert knowledge available irrespective of their underlying assumptions. Therefore, in our experimental analysis, we consider algorithms from both operational assumptions. A summary of the baseline algorithms along with their target graph structure is illustrated in Table 2 . For a fair comparison among algorithms, we set the degree vector d equal to 1 for the proposed algorithms, i.e., we do not consider any prior information on the degree of nodes. Initial graph: Because the algorithms proposed in this paper work in an iterative fashion, they naturally require an initial estimate of the graph. An appropriate initial estimation is critical to obtain a meaningful solution, especially in cases when the optimization problem is non-convex. 
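A minimal sketch of the initialization strategy discussed next, namely taking the nonnegative part of the upper-triangular entries of the pseudo-inverse of the sample correlation matrix (following the description in the text; sign conventions may differ in the released fingraph code), together with the log-return construction used in all experiments:

```python
import numpy as np
from itertools import combinations

def log_returns(prices):
    """prices: n x p matrix of closing prices P_{i,j}; returns X_{i,j} = log P_{i,j} - log P_{i-1,j}."""
    return np.diff(np.log(prices), axis=0)

def initial_weights(X):
    """Initial edge weights w0 = (w_tilde)_+ with w_tilde the upper-triangular part of the
    pseudo-inverse of the sample correlation matrix, as described in the text (assumption:
    the actual implementation may adopt a different sign convention)."""
    corr = np.corrcoef(X, rowvar=False)
    S_pinv = np.linalg.pinv(corr)
    p = S_pinv.shape[0]
    w_tilde = np.array([S_pinv[i, j] for i, j in combinations(range(p), 2)])
    return np.maximum(w_tilde, 0.0)

# Usage sketch with made-up prices:
rng = np.random.default_rng(0)
prices = 100 * np.exp(np.cumsum(0.01 * rng.standard_normal((300, 10)), axis=0))
X = log_returns(prices)
w0 = initial_weights(X)
print(X.shape, w0.shape)   # (299, 10) (45,)
```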
However, obtaining an initial estimate inherently involves a trade-off between computational efficiency and quality, the latter being measured by how far the initial point is from an actual optimal point. Since the computational complexity of the proposed algorithms is bounded below by the eigenvalue decomposition, O(p^3), we are interested in simple strategies. Here we consider the strategy used by Kumar et al. (2019a), where the initial graph weight vector w^0 is set as (w̃)_+, where w̃ is the upper triangular part of the pseudo-inverse S^† of the sample correlation matrix.

Table 1: Proposed algorithms, their target graph structure, operational assumption, and computational complexity.

Table 2: Baseline algorithms, their target graph structure, operational assumption, and computational complexity.
  Algorithm                          Graph Structure   Assumption       Complexity
  GL-SigRep (Dong et al., 2016)      connected         smooth signals   O(np^2)
  SSGL (Kalofolias, 2016)            connected         smooth signals   O(p^2)
  GLE-ADMM (Zhao et al., 2019)       connected         LGMRF            O(p^3)
  NGL-MCP (Ying et al., 2020a)       connected         LGMRF            O(p^3)
  SGL (Kumar et al., 2019a, 2020)    k-component       LGMRF

In the experiments that follow, we use daily price time series data of stocks belonging to the S&P500 index. We start by constructing the log-returns data matrix, i.e., a matrix X ∈ R^{n×p}, where n is the number of price observations and p is the number of stocks, such that the j-th column contains the time series of log-returns of the j-th stock, which can be computed as X_{i,j} = log P_{i,j} − log P_{i−1,j}, where P_{i,j} is the closing price of the j-th stock on the i-th day.

To illustrate the importance of controlling the node degrees while learning k-component graphs, we conduct a comparison between the spectral constraints algorithm proposed in (Kumar et al., 2020), denoted as SGL, and the proposed k-component graph learning (Algorithm 3) on the basis of the sample correlation matrix. To that end, we set up experiments with two datasets: (i) stocks from four sectors, namely, Health Care, Consumer Staples, Energy, and Financials, from the period starting from Jan. 1st 2014 to Jan. 1st 2018; this dataset results in n = 1006 stock price observations of p = 181 stocks; (ii) we expand the dataset by including two more sectors, namely, Industrials and Information Technology, and collect data from Jan. 1st 2010 to Jan. 1st 2018, resulting in p = 292 stocks and n = 2012 observations. Figure 5 shows the estimated financial stock networks with k = 4 (Figures 5a and 5b) and k = 6 (Figures 5c and 5d). Clearly, the absence of degree constraints in the graph learned by SGL (benchmark) (Kumar et al., 2020) shows evidence that the algorithm is unable to recover a non-trivial k-component solution, i.e., a graph without isolated nodes. In addition, the graphs learned by SGL present a high number of inter-cluster connections (grey-colored edges), which is not expected from prior expert knowledge of the sectors. The proposed algorithm not only avoids isolated nodes via graph degree constraints, but most importantly learns graphs with meaningful representations, i.e., they are aligned with the available sector information. The learned graph in Figure 5c, in particular, conveys little information due to the lack of control on the degrees, which allows the learning of trivial k-component graphs, i.e., those containing isolated nodes.

Removing the market factor prior to performing analysis on a set of stock prices is a common practice (Mantegna, 1999; Laloux et al., 2000).
The market factor is the component of stock signals associated with the strongest spectral coefficient. As we have argued in Section 3, removing the market when learning graph matrices is implicitly done via the constraint L1 = 0, for the estimation of the Laplacian matrix, or via the construction of the Z matrix for the estimation of the adjacency matrix. Therefore, it is not necessary to remove the market factor when learning graphs. Another guideline presented in Section 3 is that one should use the correlation matrix of the stock times series (or, equivalently, rescale the data such that each stock time series has unit variance) so as to obtain a meaningful cluster representation. In order to verify these claims in practice, we set up an experiment were we collected price data from Jan. 3rd 2014 to Dec. 29th 2017 (n = 1006 observations) of 82 selected stocks from three sectors: 28 from Utilities, 31 from Real Estate, and 23 from CommunicationServices. We then proceed to learn four graphs, using the proposed kGL algorithm, with the following settings for the input data: 1. No data scaling and with market signal removed (Figure 6a ). No data scaling and with market signal included, i.e., no data preprocessing (Figure 6b ). 3. Scaled data and with market signal removed (Figure 6c ). (Figure 6d ). The market signal is removed via eigenvalue decomposition of the sample correlation (covariance) matrix, where the largest eigenvalue is set to be zero. Figures 6a and 6b depict evidence that using the sample covariance matrix (or equivalently, not scaling the input data), regardless of whether the market signal has been removed, leads to a graph with possibly many spurious connections (grey edges) that is not in agreement with the GICS sector classification. Figures 6c and 6d show that using the sample correlation matrix (or equivalently, scaling the stock time series such that they have the same variance) prior to learning the graph clearly shows meaningful graphical representations from stocks belonging to three distinct sectors, regardless of whether the market signal has been removed. In addition, the relative error between the estimated graphs in Figures 6a and 6b is 0.09, whereas the relative error between the estimated graphs in Figures 6c and 6d is 4.6 · 10 −5 . Those relative error measurements further confirm that removing the market has little effect on the estimated graph due to the constraint L1 = 0, as explained in Section 3. In addition, it has been argued that the sample correlation matrix may not always be a good measure of dependency for highly noisy, often non-linear dependent signals such as log-returns of stocks (de Prado, 2020). In the proposed framework, other measures of similarities can be used in place of the sample correlation matrix. For instance, we learn a graph under the same settings as the aforementioned experiment, but using the normalized mutual information,Ī, between the log-return signals (assuming they follow a Gaussian distribution) as the input similarity matrix, which may be computed as whereS 2 ij is the sample correlation coefficient between the log-returns of stock i and j. Figure 7 depicts the graph structure learned using the normalized mutual information. 
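Two of the preprocessing variants used in this experiment can be sketched compactly: removing the market signal by zeroing out the largest eigenvalue of the sample correlation matrix, and building a mutual-information-based similarity under the Gaussian assumption. The mutual information expression below is the standard Gaussian formula −(1/2) log(1 − ρ²); the normalization applied in the paper's definition (whose equation was lost in extraction above) may differ, so this should be treated as an assumption.

```python
import numpy as np

def remove_market(corr):
    """Remove the market component by setting the largest eigenvalue of the
    sample correlation (or covariance) matrix to zero, as described in the text."""
    eigvals, eigvecs = np.linalg.eigh(corr)
    eigvals[-1] = 0.0
    return (eigvecs * eigvals) @ eigvecs.T

def gaussian_mi_similarity(corr):
    """Pairwise mutual information between jointly Gaussian log-returns:
    I_ij = -0.5 * log(1 - rho_ij^2). The paper additionally normalizes this
    quantity; that normalization is not reproduced here (assumption)."""
    rho2 = np.clip(corr ** 2, 0.0, 1.0 - 1e-12)
    mi = -0.5 * np.log(1.0 - rho2)
    np.fill_diagonal(mi, 0.0)      # self-information is not a useful edge similarity
    return mi

# Usage sketch on a sample correlation matrix of simulated data:
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 8))
corr = np.corrcoef(A, rowvar=False)
print(remove_market(corr).shape, gaussian_mi_similarity(corr).shape)
```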
As it can be observed, the structure of Figure 7 is very similar to that of Figure 6c: the F-score between the learned graphs is 0.91, while the relative error is 0.35, which may indicate that the normalized mutual information and the sample correlation matrix are equally acceptable inputs for the learning algorithm. Finally, we use the state-of-the-art, two-stage CLR algorithm (Nie et al., 2016) to learn a 3-component graph for the selected stocks on the basis of the scaled input data matrix. Figure 8 depicts the learned graph network. As it can be observed, unlike the proposed algorithm, CLR clusters together most of the stocks belonging to the Real Estate and Communication Services sectors, which is not expected from expert prior information such as GICS.

Figure 8: Graph learned with CLR (Nie et al., 2016) using the unit-variance, scaled log-return data matrix as input.

In this experiment, we would like to convey the advantages of using graph learning algorithms based on the assumption that the data are heavy-tailed. To that end, we compare three Laplacian-constrained models: (1) Gaussian (GLE-ADMM (Zhao et al., 2019)), (2) Gaussian with minimax concave sparsity penalty (NGL-MCP (Ying et al., 2020b)), and (3) Student-t (tGL, Algorithm 4). These models are investigated under two scenarios: (i) strong (ν ≈ 4) and (ii) weak (ν ≈ 10) presence of heavy tails. In both scenarios, we selected stocks belonging to five different sectors, namely: Consumer Staples, Consumer Discretionary, Industrials, Energy, and Information Technology.

Remark on hyperparameters: For the Gaussian model with minimax concave penalty, we tune the sparsity hyperparameter so that the estimated graph attains the highest modularity. While tuning a one-dimensional hyperparameter may not pose issues when performing post-event analysis, it does compromise the performance of real-world online systems, where the value of such a hyperparameter is often unknown and data-dependent. For the Student-t model, the degrees of freedom ν can be computed in a prior stage directly from the data using, e.g., the methods in (Liu et al., 2019), or in a sliding-window fashion for the case of real-time systems.

Strong heavy tails: for this experiment, we queried data from 222 stocks from Jan. 3rd 2008 to Dec. 31st 2009, which represents 504 data observations per stock, resulting in a sample-parameter size ratio of n/p ≈ 2.27. This particular time frame presents a high amount of volatility due to the 2008 US financial crisis. To quantify the extent of heavy tails in the data, we fit a multivariate Student-t distribution to the matrix of log-returns X, obtaining ν ≈ 4.06, which indeed indicates a high presence of heavy-tailed data points. In addition, we measured the average annualized volatility across all stocks and obtained volatility ≈ 0.53. Figure 9 provides a summary of this market scenario. Figure 9a shows the S&P500 log-returns time series, where the increase in volatility due to the global financial crisis in 2008 is clearly noticeable. Figure 9b shows a histogram of the S&P500 log-returns during the aforementioned time period, where the solid curve represents a Gaussian fit. It can be noticed that the tails of the Gaussian decay much faster than the tails of the empirical histogram, indicating the presence of heavy tails or outliers.

Weak heavy tails: in this scenario, we collected data from 204 stocks from Jan. 5th 2004 to Dec. 30th 2006, which represents 503 data points per stock, resulting in a sample-parameter size ratio of n/p ≈ 2.47.
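As a hedged sketch of how ν can be estimated in a prior stage, one could fit a multivariate Student-t distribution to the log-returns with, e.g., the fitHeavyTail R package; this particular package and call are our assumption for illustration, whereas the experiments above refer to the methods in (Liu et al., 2019).

# Sketch: estimating the Student-t degrees of freedom from the log-returns matrix X.
# Assumes the 'fitHeavyTail' package is available; any EM-type estimator of a
# multivariate Student-t would serve the same purpose.
fit <- fitHeavyTail::fit_mvt(X)
nu_hat <- fit$nu      # values around 4 indicate strong heavy tails, around 10 nearly Gaussian data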
During this time window, the market was operating under relatively normal conditions. By fitting a multivariate Student-t distribution to the matrix of log-returns, we obtain ν ≈ 10.11, which indicates little presence of outliers and that the data are nearly Gaussian. The average annualized volatility measured across all stocks is volatility ≈ 0.27, roughly half of the annualized volatility in the strong heavy-tails case. Figure 10 provides a summary of this market scenario. Figure 10a shows the S&P500 log-returns time series, where no noticeable volatility clustering event is present, while Figure 10b depicts its histogram along with a Gaussian fit that closely matches the empirical distribution.

Figure 11 depicts the learned stock graphs in these scenarios. In either scenario, it can be readily noticed that the graphs learned with the Student-t distribution are sparser than those learned under the Gaussian assumption, which results from the fact that the Gaussian distribution is more sensitive to outliers. As for the Gaussian graphs with sparsity, they present a significant improvement over their non-sparse counterpart. The Student-t graphs, on the other hand, present the highest degree of interpretability, as measured by their higher modularity value and by the ratio between the number of intra-sector edges and inter-sector edges (cf. Tables 3 and 4), which is the expected behavior from stock sector classification systems such as GICS. Among the learned Gaussian graphs (Figures 11a and 11d), it can be seen that the graph learned in the weak heavy-tailed scenario presents a cleaner graphical representation, by having fewer inter-sector and intra-sector edges, while also having a higher graph modularity (cf. Tables 3 and 4), than that of the Gaussian graph in the strong heavy-tailed scenario.

Figure 11: Learned graphs of stocks under the Gaussian (Figures 11a and 11d), Gaussian with sparsity (Figures 11b and 11e), and Student-t (Figures 11c and 11f) assumptions, for contrasting heavy-tail scenarios.

Strong heavy tails: for this experiment, we queried data from 347 stocks from Jan. 5th 2016 to Dec. 23rd 2020, which represents 1253 data observations per stock, resulting in a sample-parameter ratio of n/p ≈ 3.61. This particular time frame presents an extremely high amount of volatility around the beginning of 2020 due to the financial crisis caused by the COVID-19 pandemic. To quantify the amount of outliers in this time frame, we fit a multivariate Student-t distribution to the matrix of log-returns X, obtaining ν ≈ 4.15, which indeed indicates a high presence of heavy-tailed data points. In addition, we measured the average annualized volatility across all stocks and obtained volatility ≈ 0.34. Figure 12 provides a summary of this market scenario.

Figure 12: State of the US stock market, as captured by the S&P500 index, in the strong heavy-tails scenario, spanning from Jan. 5th 2016 until Jul. 20th 2020.

Figure 12a shows the S&P500 log-returns time series, where the increase in volatility due to the COVID-19 pandemic is prominent. Figure 12b illustrates the empirical distribution of the S&P500 log-returns along with its Gaussian fit. It can be noticed that events far beyond the Gaussian tail decay are present.

Moderate heavy tails: for this setting, we queried data from 332 stocks from Jan. 2nd 2013 to Jun. 29th 2018, which represents 1383 data observations per stock, resulting in a sample-parameter ratio of n/p ≈ 4.17. We fit a multivariate Student-t distribution to the matrix of log-returns X, obtaining ν ≈ 7.11.
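A minimal sketch of the average annualized volatility figures quoted above, assuming daily log-returns in the matrix X and roughly 252 trading days per year (the annualization constant is a standard convention, not stated in the text):

# Per-stock annualized volatility from daily log-returns, then averaged across stocks.
ann_vol_per_stock <- apply(X, 2, sd) * sqrt(252)
avg_ann_vol <- mean(ann_vol_per_stock)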
In addition, we measured the average annualized volatility across all stocks and obtained volatility ≈ 0.25. Figure 13 provides a summary of this market scenario. Figure 14 depicts the learned graphs in the aforementioned scenarios. Similarly to the previous experiment, it can be noticed that in either scenario the graphs learned with the Student-t assumption are indeed sparser, more modular, and hence more interpretable, than those learned under the Gaussian assumption (cf. Tables 5 and 6). The learned Gaussian graphs (Figures 14a and 14d) are very dense and present a high number of spurious connections (grey edges) among stocks that are arguably not related in practice. However, it can be noticed that the Gaussian graph learned in the moderate heavy-tailed scenario presents a cleaner graphical representation, by having fewer inter-sector edges, while also having a higher graph modularity (cf. Tables 5 and 6), than that of the Gaussian graph in the strong heavy-tailed scenario, which is consistent with the previous experiment. The usage of sparsity does improve the Gaussian graphs (Figures 14b and 14e), but only to a limited extent when compared to the graphs learned under the Student-t assumption. The Student-t graphs, on the other hand, present a high degree of modularity, where most of the sectors can easily be identified. Indeed, we measured that the Student-t assumption outputs the graphs (Figures 14c and 14f) with the highest graph modularity.

Figure 13: State of the US stock market, as captured by the S&P500 index, in the moderate heavy-tails scenario, spanning from Jan. 2nd 2013 until Jun. 29th 2018. Figure 13a shows the S&P500 log-returns time series, where a few significant heavy-tailed observations are noticeable, along with its histogram (Figure 13b), whose Gaussian fit indicates that the outlier data points are mostly concentrated around the turning points of the tails.

In order to verify the learning of heavy-tailed and k-component graphs jointly, we estimate graphs of stocks using the datasets described in Subsections 5.1 and 5.2 via the ktGL algorithm. Figure 15 depicts the learned graphs, where we can observe sparsity characteristics that agree with the connected graphs estimated in the preceding section. In addition, when compared to the Gaussian case (Figures 5b and 6c), the graphs estimated in Figures 15a and 15b reveal a finer, possibly more accurate description of the actual underlying stock market scenario.

In the experiments that follow, we focus on illustrating the impact of the COVID-19 economic crisis on the graphs learned from the S&P500 stock market and from the foreign exchange market. The COVID-19 pandemic affected the US stock market quite significantly, especially throughout the month of March 2020.

Figure 14: Learned graphs of stocks under the Gaussian (Figures 14a and 14d), Gaussian with sparsity (Figures 14b and 14e), and Student-t (Figures 14c and 14f) assumptions, for contrasting heavy-tail scenarios.

In this experiment, we investigate the effects of the financial crisis caused by the COVID-19 pandemic on the learned graphs of stocks. We start by selecting 97 stocks across all 11 sectors of the S&P500 and computing their log-returns during two time frames: (i) from Apr. 22nd 2019 to Dec. 31st 2019 and (ii) from Jan. 2nd 2020 to Jul. 31st 2020. Out of those stocks, nine showed growth in returns over the period of 24 days starting from Feb. 18th 2020 to Mar. 20th 2020. Their symbols, along with their monthly returns during this period, are summarized in Table 7.
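As a sketch of how the interpretability metrics discussed above can be computed from a learned graph, assume an estimated weighted adjacency matrix A (symmetric, nonnegative, zero diagonal) and a vector of GICS sector labels sectors (both names are ours, used here only for illustration); igraph, which the paper already relies on, provides the modularity computation.

library(igraph)

# Build a weighted, undirected graph from the learned adjacency matrix.
g <- graph_from_adjacency_matrix(A, mode = "undirected", weighted = TRUE, diag = FALSE)

# Graph modularity with respect to the sector labels (Newman, 2006).
Q <- modularity(g, membership = as.integer(factor(sectors)), weights = E(g)$weight)

# Ratio between the number of intra-sector and inter-sector edges.
ends_mat <- ends(g, E(g), names = FALSE)
intra <- sum(sectors[ends_mat[, 1]] == sectors[ends_mat[, 2]])
inter <- ecount(g) - intra
intra_inter_ratio <- intra / inter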
Figure 16 shows the log-returns of the selected stocks over the considered time periods. In Figure 16b, the economic crisis is noticeable from the increase in the spread of the log-returns, throughout the month of March 2020, caused by the COVID-19 pandemic. Figure 17 shows the networks learned using the proposed tGL algorithm (Algorithm 4) under the Student-t assumption. Figure 17a shows that the stocks with positive average linear return (blue squares) during the COVID-19 pandemic are not particularly correlated in the period before the pandemic. Figure 17b, on the other hand, shows that the learned graph is able to correctly cluster those stocks. In particular, it can be observed that the network of Figure 17b is objectively more modular than that of Figure 17a. This experiment shows that the proposed algorithm captures the change in the dependency structure of the stocks brought about by the crisis.

Figure 17: Graphs of stocks learned with data prior to and during the financial crisis caused by the COVID-19 pandemic in 2020. Figure 17a shows that, before the COVID-19 pandemic, the nodes in blue are somewhat independent among themselves. Figure 17b shows that, during the COVID-19 pandemic, the stocks highlighted in blue are strongly connected and separated from the rest of the graph, as in fact they showed a positive monthly return during the severe economic period between Feb. 18th and Mar. 20th, 2020.

In addition, we compare the proposed learned graphs in Figure 17 with graphs learned from algorithms that employ the smooth signal approach. Figure 18 shows the graphs learned by SSGL (Kalofolias, 2016) and GL-SigRep (Dong et al., 2016). As we can observe from both Figures 18a and 18b, the learned networks do not present a meaningful graph representation in this setting. While the clustering property of the stocks with positive monthly return is preserved in the network learned with GL-SigRep, it does not capture the fine dependencies between pairs of stocks like the ones shown by the proposed algorithm in Figure 17. In addition, tuning the hyperparameters in the smooth-signal algorithms is an involved task.

Figure 18: Learned graphs with existing smooth-signal-based algorithms from stock data during the financial crisis caused by COVID-19; GL-SigRep (Dong et al., 2016) was run with α = 10^−3, γ = 0.5.

We set up an experiment with data from the foreign exchange (FX) market, where, similarly to the previous experiment, we would like to investigate whether the learned graph is able to identify currencies that became more valuable with respect to the US dollar during the COVID-19 pandemic. To that end, we query FX data of the 34 most traded currencies as of 2019 in two time windows: (i) from Feb. 1st 2019 to May 1st 2019 and (ii) from Feb. 3rd 2020 to May 1st 2020. We then obtain the list of currencies for which the US dollar became less valuable during the period from Feb. 15th to Apr. 15th 2020. Table 8 shows the list of such currencies along with the annualized return of the ratio USD/CUR, where CUR is a given currency. We then use the proposed heavy-tail graph learning framework (tGL, Algorithm 4) to learn the graph networks of foreign exchange data for the two aforementioned time frames. Figure 19 depicts the learned graphs, where Figure 19b clearly shows that the currencies that had an increase in value during the pandemic are clustered together (red edges), except for the HKD, which was the currency that presented the smallest increase. On the other hand, prior to the pandemic, the FX market behaved in a somewhat random fashion, which is reflected in the smaller graph modularity value.
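A minimal sketch of the screening step described above (identifying the instruments whose price went up over a given window); the object names prices (an n × p matrix of closing prices) and dates (the corresponding Date vector) are hypothetical and introduced only for illustration.

# Total simple return of each instrument over a fixed window.
in_window <- dates >= as.Date("2020-02-18") & dates <= as.Date("2020-03-20")
window_prices <- prices[in_window, , drop = FALSE]
total_return <- window_prices[nrow(window_prices), ] / window_prices[1, ] - 1

# Instruments that gained value over the window (cf. Tables 7 and 8).
winners <- colnames(prices)[total_return > 0]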
Figure 19: Learned graphs from FX data. (a) Learned Student-t graph of FX data one year prior to the COVID-19 financial crisis, Q = 0.039. (b) Learned Student-t graph of FX data during the COVID-19 financial crisis, Q = 0.085.

This paper has presented novel interpretations of the Laplacian constraints of graphs from the perspective of financial data. Those interpretations serve as guidelines for users when it comes to applying graph learning algorithms to estimate networks of financial instruments such as stocks and currencies. We have also proposed novel algorithms based on the ADMM and MM frameworks for learning graphs from data. Those algorithms fill major gaps in the literature, especially concerning the learning of heavy-tailed and k-component graphs. In particular, the heavy-tail graph learning framework is paramount for financial data, and it substantially outperforms conventional state-of-the-art algorithms derived under the assumption that the input data is Gaussian. Another feature of heavy-tail graphs is that they are naturally sparse. State-of-the-art sparse graph learning frameworks, while useful in many contexts beyond finance, are cumbersome to tune due to their hyperparameters. We, on the other hand, advocate that the sparsity induced by heavy-tail distributions is a more principled way to obtain interpretable graph representations. In the case of k-component graphs, we proposed a principled, versatile framework that avoids isolated nodes via a simple linear constraint on the degrees of the nodes. This extension allows, for instance, the estimation of particular types of graphs such as regular graphs. Moreover, the proposed graph algorithms have shown significant potential to capture nuances in the data caused by, e.g., a financial crisis event. Finally, it is worth noting that the methods developed in this paper may be applicable to scenarios beyond financial markets; in particular, we envision benefits for practical applications where the data distribution significantly departs from a Gaussian one.

In this supplementary section, we illustrate the empirical convergence performance of the proposed algorithms for the experimental settings considered. All the experiments were carried out on a MacBook Pro 13 in. 2019 with an Intel Core i7 2.8 GHz and 16 GB of RAM. The quantities r^l = Θ^l − Lw^l and s^l = dw^l − d are the primal residuals, and v^l = ρ L*(Θ^l − Θ^{l−1}) is the dual residual. From Figures 20-25, we can observe that the norms of these residual quantities quickly approach zero after a transient phase typical of ADMM-like algorithms.

Definition 8 (Laplacian operator) The Laplacian operator (Kumar et al., 2019a) L : R_+^{p(p−1)/2} → R^{p×p}, which takes a nonnegative vector w and outputs a Laplacian matrix Lw, is defined entrywise as [Lw]_{ij} = −w_{i−j+(j−1)(2p−j)/2} for i > j, [Lw]_{ij} = [Lw]_{ji} for i < j, and [Lw]_{ii} = −Σ_{j≠i} [Lw]_{ij}.

Definition 9 (Adjacency operator) The adjacency operator (Kumar et al., 2020) A : R_+^{p(p−1)/2} → R^{p×p}, which takes a nonnegative vector w and outputs an adjacency matrix Aw, is defined entrywise as [Aw]_{ij} = w_{i+s(j)} for i > j, [Aw]_{ij} = [Aw]_{ji} for i < j, and [Aw]_{ii} = 0, where s(j) = ((j−1)/2)(2p − j) − j.
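As a sketch of the Laplacian operator just defined, together with its adjoint (Definition 11 below), a direct, non-optimized R implementation could look as follows; the function names are ours and the loops simply follow the entrywise definitions, using the column-major ordering of the strictly lower-triangular entries so that the weight indexing matches i − j + (j−1)(2p−j)/2.

# Laplacian operator: maps a nonnegative weight vector w (length p(p-1)/2)
# to a p x p Laplacian matrix Lw (Definition 8).
L_op <- function(w) {
  p <- (1 + sqrt(1 + 8 * length(w))) / 2
  Lw <- matrix(0, p, p)
  Lw[lower.tri(Lw)] <- -w          # fill the strictly lower triangle column by column
  Lw <- Lw + t(Lw)                 # symmetrize
  diag(Lw) <- -rowSums(Lw)         # degrees on the diagonal, so that Lw %*% 1 = 0
  Lw
}

# Adjoint of the Laplacian operator: maps a p x p matrix Y back to a vector of
# length p(p-1)/2 (Definition 11), satisfying <Lw, Y> = <w, L_adj(Y)>.
L_adj <- function(Y) {
  idx <- which(lower.tri(Y), arr.ind = TRUE)
  apply(idx, 1, function(k) {
    i <- k[1]; j <- k[2]
    Y[i, i] + Y[j, j] - Y[i, j] - Y[j, i]
  })
}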
Definition 10 (Degree operator) The degree operator d : R p(p−1)/2 → R p , which takes a nonnegative vector w and outputs the diagonal of a Degree matrix, is defined as Definition 11 (Adjoint of Laplacian operator) The adjoint of Laplacian operator (Kumar et al., 2019a) L * : R p×p → R p(p−1)/2 is defined as where Definition 12 (Adjoint of adjacency operator) The adjoint of adjacency operator (Kumar et al., 2019a) A * : R p×p → R p(p−1)/2 is defined as Definition 13 (Adjoint of degree operator) The adjoint of degree operator d * : R p → R p(p−1)/2 is given as where s(i, j) = i − j + j−1 2 (2p − j), i > j. An alternative expression for d * is d * y = L * Diag (y), where L * is the adjoint of the Laplacian operator. Definition 14 (Modularity) The modularity of a graph (Newman, 2006) is defined as Q : R p×p → [−1/2, 1]: where d i is the weighted degree of the i-th node, i.e. d i [d (w)] i , t i f t (i), i ∈ V, is the type of the i-th node, and 1(·) is the indicator function. Definition 15 (Proximal Operator) The proximal operator of a function f , f : R p×p → R, with parameter ρ, ρ ∈ R ++ , is defined as (Parikh and Boyd, 2014) prox ρ −1 f (V ) arg min Appendix C. Proofs Proof We define an index set Ω t : Then we have with equality if and only if x 1 = · · · = x p(p−1)/2 = 2 p(p−1)/2 or x 1 = · · · = x p(p−1)/2 = − 2 p(p−1)/2 . The last equality follows the fact that |Ω t | = p − 1. Similarly, we can obtain with equality if and only if x 1 = · · · = x p(p−1)/2 = 2 p(p−1)/2 or x 1 = · · · = x p(p−1)/2 = − 2 p(p−1)/2 . Finally, we have Note that the equality in (71) can be achieved because the eigenvectors of L * L and d * d associated with the maximum eigenvalue are the same. Therefore, we conclude that λ max (L * L + d * d) = 4p − 2, completing the proof. We can rewrite the updating of Θ, w, Y and y in a compact form: Now we can see that our ADMM algorithm satisfies the standard form with two blocks of primal variables Θ and w, and one block of dual variable (Y , y). Our ADMM approach splits the original problem into two blocks in (22). According to the existing convergence results of ADMM in (Boyd et al., 2011) , we can conclude that Algorithm 2 will converge to the optimal primal-dual solution for (22). Proof To prove Theorem 5, we first establish the boundedness of the sequence Θ l , w l , V l , Y l , y l generated by Algorithm 3 in Lemma 16, and the monotonicity of L ρ Θ l , w l , V l , Y l , y l in Lemma 17. The sequence Θ l , w l , V l , Y l , y l generated by Algorithm 3 is bounded. Proof Let w 0 , V 0 , Y 0 and y 0 be the initialization of the sequences w l , V l , Y l and y l , respectively, and w 0 , V 0 F , Y 0 F and y 0 are bounded. We prove the lemma by induction. Recall that the sequence Θ l is established by where Γ l−1 contains the largest p − k eigenvalues of ρLw l−1 − Y l−1 , and U l−1 contains the corresponding eigenvectors. When l = 1, Γ 0 F is bounded since both w 0 and Y 0 F are bounded. Therefore, we can conclude that Θ 1 F is bounded. The sequence w l is established by solving the subproblems where a l = L * S − Y l−1 − ρΘ l + d * y l−1 − ρd . We get that a 1 is bounded because of the boundedness of Y 0 F , y 0 , and Θ 1 F . By (Ying et al., 2020a) , we know that L * L is a positive definite matrix and the minimum eigenvalue λ min (L * L) = 2. On the other hand, d * d is a positive semi-definite matrix as follows, Therefore, we obtain that implying that f 1 (w) is coercive, i.e., f 1 (w) → +∞ for any w → +∞. Thus w 1 is bounded. 
According to (48), we can see that V is in a compact set, and thus V l F is bounded for any l ≥ 1. Finally, according to (36) and (37), one has and It is obvious that both Y 1 F and y 1 are bounded. Therefore, it holds for l = 1 that is bounded under 2 -norm or Frobenius norm. Following from (75), we can obtain that Θ l F is bounded. By (76), similarly, we can get that w l is bounded. Similar to (78) and (79), we can also obtain that Y l and y l are bounded, because of the boundedness of Θ l F , w l , Y l−1 and y l−1 . Thus, Θ l , w l , V l , Y l , y l is bounded, completing the induction. Therefore, we can conclude that the sequence Θ l , w l , V l , Y l , y l is bounded. Lemma 17 The sequence L ρ Θ l , w l , V l , Y l , y l generated by Algorithm 3 is lower bounded, and holds for any sufficiently large ρ. Proof According to (43), we have We can see that the lower boundedness of the sequence L ρ Θ l , w l , V l , Y l , y l can be established by the boundedness of Θ l , w l , V l , Y l , y l in Lemma 16. Now we first establish that We have Note that Θ l+1 minimizes the objective function Therefore holds for any l ∈ N + . One has The term I 1 can be written as Note that V l+1 is the optimal solution of the problem Thus the term I 1a ≥ 0, and we can obtain For the term I 2 , we have where the first equality is due to the updating of y l+1 as below For the term I 3 , similarly, we have where the first equality follows from Therefore, we can obtain Recall that w l+1 is the optimal solution of the problem Thus, w l+1 satisfies the KKT system of (93) as below w l+1 i ν i = 0, for i = 1, . . . , p(p − 1)/2; (95) w l+1 ≥ 0, ν ≥ 0; (96) According to (94), we have Together with (92) and (97), we obtain where the inequality is established by ν, w l − w l+1 ≥ 0, which follows from the fact that ν, w l+1 = 0 according to (95), and ν, w l ≥ 0 because ν ≥ 0 by (96) and w l ≥ 0. Plugging (87) and (98) into (85), by calculation we can obtain where the equality follows from (89) and (91). If ρ is sufficiently large such that holds with some constant c > 1, together with (84), we can conclude that for any l ∈ N + . Note that besides a fixed parameter ρ, an alternative strategy is to increase the ρ iteratively (Ying et al., 2017) until the condition (100) could be satisfied well. Now we are ready to prove Theorem 5. By Lemma 16, the sequence Θ l , w l , V l , Y l , y l is bounded. Therefore, there exists at least one convergent subsequence Θ ls , w ls , V ls , Y ls , y ls s∈N , which converges to a limit point denoted by Θ l∞ , w l∞ , V l∞ , Y l∞ , y l∞ . By Lemma 17, we obtain that L ρ Θ l , w l , V l , Y l , y l is monotonically decreasing and lower bounded, and thus is convergent. Note that the function log det * (Θ) is continuous over the set S = Θ ∈ S p + |rank(Θ) = p − k . We can get lim l→+∞ L ρ Θ l , w l , V l , Y l , y l = L ρ (Θ ∞ , w ∞ , V ∞ , Y ∞ , y ∞ ) = L ρ Θ l∞ , w l∞ , V l∞ , Y l∞ , y l∞ . The (99), (100) and (101) together yields Obviously, Lw ls − Θ ls F → 0 and dw ls − d 2 → 0 also hold for any subsequence as s → +∞, which implies that Y l∞ and y l∞ satisfy the condition of stationary point of L ρ (Θ, w, V , Y , y) with respect to Y and y, respectively. 
By (91) and (89) Recall that V l contains the k eigenvectors associated with the k smallest eigenvalues of Lw l , and thus it is easy to check that For the limit point Θ l∞ , w l∞ , V l∞ , Y l∞ , y l∞ of any subsequence Θ ls , w ls , V ls , Y ls , y ls s∈N , Θ l∞ minimizes the following subproblem By (103) and (104), we conclude that Θ l∞ satisfies the condition of stationary point of L ρ (Θ, w, V , Y , y) with respect to Θ. Similarly, w l∞ minimizes the subproblem w l∞ = arg min w≥0 ρ 2 w (d * d + L * L) w + w, d * y l∞−1 − ρd + w, L * S + ηV l∞−1 V l∞−1 − Y l∞−1 − ρΘ l∞ = arg min w≥0 ρ 2 w (d * d + L * L) w + w, d * y l∞ − ρd By (103), (104) and (105), w l∞ satisfies the condition of stationary point of L ρ (Θ, w, V , Y , y) with respect to w. V ∞ minimizes the subproblem V l∞ = arg min V ∈R p×k tr V Lw l∞ V , subject to V V = I, which implies that V l∞ satisfies the condition of stationary point of L ρ (Θ, w, V , Y , y) with respect to V . To sum up, we can conclude that any limit point Θ l∞ , w l∞ , V l∞ , Y l∞ , y l∞ of the sequence generated by Algorithm 3 is a stationary point of L ρ (Θ, w, V , Y , y). Proof Similar to the proof of Theorem 5, we establish the boundedness of the sequence Θ l , w l , Y l , y l generated by Algorithm 4 in Lemma 18, and the monotonicity of L ρ Θ l , w l , Y l , y l in Lemma 19. The sequence Θ l , w l , Y l , y l generated by Algorithm 4 is bounded. Proof Let w 0 , Y 0 and y 0 be the initialization of the sequences w l , Y l and y l , respectively, and w 0 , Y 0 F and y 0 are bounded. We prove the boundedness of the sequence by induction. Notice that the subproblem for Θ is the same with that in (75), and thus we can directly get the boundedness of Θ l F for l = 1. The sequence w l is established by solving the subproblems Let where a l = L * S − Y l−1 − ρΘ l + d * y l−1 − ρd . Note that a 1 is bounded because Y 0 F , y 0 , and Θ 1 F are bounded. In the proof of Lemma 16, we have shown that L * L is a positive definite matrix with the minimum eigenvalue λ min (L * L) = 2, and d * d is a positive semi-definite matrix. Since log 1 + x i, * Lwx i, * ν ≥ 0 for any w ≥ 0, we have lim w →+∞ g 1 (w) ≥ lim w →+∞ ρw w + w, a 1 = +∞. Thus g 1 (w) is coercive. Recall that we solve the optimization (106) by the MM framework. Hence, the objective function value is monotonically decreasing as a function of the iterations, and w l is a stationary point of (106). Then the coercivity of g 1 (w) yields the boundedness of w 1 . Finally, Y 1 F and y 1 are also bounded, because Y 1 and y 1 are updated as done in (78) and (79), respectively, and thus the proof is the same. Now we assume that Θ l−1 , w l−1 , Y l−1 , y l−1 is bounded for some l ≥ 1, and check the boundedness of Θ l , w l , Y l , y l . Similar to the proof in (75), we can prove that Θ l F is bounded. By (106), we can also obtain the boundedness of w l . We can also obtain that Y l and y l are bounded according to the boundedness of Θ l F , w l , Y l−1 and y l−1 . Thus, Θ l , w l , Y l , y l is bounded, completing the induction. Therefore, we establish the boundedness of the sequence Θ l , w l , Y l , y l . The sequence L ρ Θ l , w l , Y l , y l generated by Algorithm 4 is lower bounded, and L ρ Θ l+1 , w l+1 , Y l+1 , y l+1 ≤ L ρ Θ l , w l , Y l , y l , ∀ l ∈ N + , holds for any sufficiently large ρ. 
Proof According to (50), we have L ρ (Θ l , w l , Y l , y l ) = p + ν n n i=1 log 1 + x i, * Lw l x i, * ν − log det Θ l + J + y l , dw l − d We can see that the lower boundedness of the sequence L ρ Θ l , w l , Y l , y l can be established by the boundedness of Θ l , w l , Y l , y l in Lemma 18. By a similar argument in the proof of Lemma 17, we can also establish that L ρ Θ l+1 , w l , Y l , y l ≤ L ρ Θ l , w l , Y l , y l , ∀ l ∈ N + . One has where r(L) = p + ν n n i=1 log 1 + x i, * Lx i, * ν . According to (88) and (90), we obtain According to the convergence result of the majorization-minimization framework (Sun et al., 2017) , we know that any limit point of the sequence is a stationary point of the following problem min w≥0 p + ν n n i=1 log 1 + x i, * Lw l x i, * ν + ρ 2 w (d * d + L * L) w− w, L * Y l + ρΘ l − d * y l − ρd . (114) The set of the stationary points for the optimization (114) is defined by where g l (w) is the objective function in (114). The existence of the limit point can be guaranteed by the the coercivity of g l (w), which has been established in the proof of Lemma 18. Therefore, w l+1 is a stationary point. By taking z = w l and w = w l+1 in (115), we obtain L * ∇r Lw l+1 + ρ (d * d + L * L) w l+1 − L * Y l + ρΘ l + d * y l − ρd w l − w l+1 ≥ 0. Thus, we have d * y l − L * Y l , w l − w l+1 ≥ − ∇r Lw l+1 , Lw l − Lw l+1 Plugging (113) and (116) into (112), we obtain where the last inequality is due to the fact that r(L) is a concave function and has L r -Lipschitz continuous gradient, in which L r > 0 is a constant, thus we have r Lw l − r Lw l+1 − ∇r Lw l+1 , Lw l − Lw l+1 ≥ − L r 2 Lw l+1 − Lw l 2 F . By calculation, we obtain that if ρ is sufficiently large such that ρ ≥ max   Lr, max l L r Lw l+1 − Lw l 2 F + L 2 r Lw l+1 − Lw l 4 F + 8abc holds with some constant c > 1, where a = dw l+1 − dw l 2 2 + Lw l+1 − Lw l 2 F and b = y l+1 − y l 2 2 + Y l+1 − Y l 2 F , then, together with (111), we conclude that L ρ (Θ l , w l , Y l , y l ) ≥ L ρ (Θ l+1 , w l , Y l , y l ) ≥ L ρ (Θ l+1 , w l+1 , Y l+1 , y l+1 ), for any l ∈ N + . Note that for the case of Gaussian distribution in Section 4.4, the constant L r will be zero, and ρ in (119) will be consistent with that in (100). Now we are ready to prove Theorem 6. By Lemma 18, the sequence Θ l , w l , Y l , y l generated by Algorithm 4 is bounded. Therefore, there exists at least one convergent subsequence Θ ls , w ls , Y ls , y ls s∈N , which converges to the limit point denoted by Θ l∞ , w l∞ , Y l∞ , y l∞ . By Lemma 19, we obtain that the sequence L ρ Θ l , w l , Y l , y l dedfined in (110) is monotonically decreasing and lower bounded, implying that L ρ Θ l , w l , Y l , y l is convergent. We can get lim l→+∞ L ρ Θ l , w l , Y l , y l = L ρ Θ l∞ , w l∞ , Y l∞ , y l∞ . The (117), (119) and (120) together yields The convergence of L ρ Θ l , w l , Y l , y l yields Similar to the proof of Theorem 5, by (121), (122) and (123), we can prove that Θ l∞ , w l∞ , Y l∞ and y l∞ satisfy the condition of stationary point of L ρ (Θ, w, V , Y , y) with respect to Θ, w, Y and y, respectively. To sum up, we can conclude that any limit point Θ l∞ , w l∞ , Y l∞ , y l∞ of the sequence is a stationary point of L ρ (Θ, w, Y , y). Proof The proof of Theorem 7 is similar to the proof of Theorems 5 and 6. We can establish the boundedness of the sequence Θ l , w l , V l , Y l , y l generated by Algorithm 5 as done in Lemma 16 and 18. 
Similar to Lemmas 17 and 19, we can establish the monotonicity and boundedness of L ρ Θ l , w l , V l , Y l , y l . Therefore, we omit the details of the proof of Theorem 7 to avoid redundancy. Optimization Algorithms on Matrix Manifolds Covariance Matrix Estimation under Total Positivity for Portfolio Selection High-dimensional Gaussian graphical model selection: Walk summability and local separation criterion Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data Topology of correlation-based minimal spanning trees in real and model markets Networks of equities in financial markets Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning Characterization, stability and convergence of hierarchical clustering methods Supervised community detection with graph neural networks Learning sparse graphs under smoothness prior Spectral Graph Theory Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance State-space network topology identification from partial observations igraph: Network analysis and visualization Fast and elegant numerical linear algebra using the RcppEigen package The joint graphical lasso for inverse covariance estimation across multiple classes Learning undirected graphs in financial markets Building diversified portfolios that outperform out of sample Machine Learning for Asset Managers (Elements in Quantitative Finance) CVXPY: A Python-embedded modeling language for convex optimization Learning Laplacian matrix in smooth graph signal representations Toward a generic representation of random variables for machine learning Clustering of financial time series with application to index and enhanced index tracking portfolio Rcpp: Seamless R and C++ integration RcppArmadillo: Accelerating R with high-performance C++ linear algebra Graph learning from data under Laplacian and structural constraints On a theorem of Weyl concerning eigenvalues of linear transformations I A signal processing perspective on financial engineering Community detection in graphs Sparse inverse covariance estimation with the graphical lasso CVXR: An R package for disciplined convex optimization Time Series and Dynamic Models Simultaneous clustering and estimation of heterogeneous graphical models Dynamic models for volatility and heavy tails: with applications to financial and economic time series Topology identification of undirected consensus networks via sparse inverse covariance estimation Matrix Analysis A divide-and-conquer method for sparse inverse covariance estimation How to learn a graph from smooth signals Spectral properties of financial correlation matrices The flash crash: High-frequency trading in an electronic market Cauchy-Binet for pseudo-determinants Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems Structured graph learning via laplacian spectral constraints Bipartite structured Gaussian graphical modeling via adjacency spectral priors A unified framework for structured graph learning via spectral constraints Discovering structure by learning sparse graph Random matrix theory and financial correlations Clustering techniques and their effect on portfolio formation and risk analysis Community detection in attributed graphs: An embedding approach Parameter estimation of heavy-tailed ar model with missing data via stochastic em Extreme Financial Risks: 
From Dependence to Risk Management Hierarchical structure in financial markets An Introduction to Econophysics: Correlation and Complexity in Finance Sampling of graph signals with successive local aggregations A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series Clustering financial time series: How long is enough? A review of two decades of correlations, hierarchies, networks and clustering in financial markets On clustering financial time series: a need for distances between dependent random variables. Computational Information Geometry Connecting the dots: Identifying network structure via graph signal processing Partial correlation financial networks Revisions to the global industry classification standard (gics) structure Modularity and community structure in networks The constrained Laplacian rank algorithm for graph-based clustering Dynamic asset trees and black monday Dynamics of market correlations: Taxonomy and portfolio analysis Clustering and information in correlation based financial networks Iterative Solution of Nonlinear Equations in Several Variables Bayesian graph convolutional neural networks using non-parametric graph learning Proximal algorithms Learning graphs with monotone topology properties and multiple connected components Universal and nonuniversal properties of cross correlations in financial time series Random matrix approach to cross correlations in financial data Hierarchical clustering-based asset allocation The hierarchical equal risk contribution portfolio A user guide to low-pass graph signal processing and its applications Heavy-Tail Phenomena: Probabilistic and Statistical Modeling Gaussian Markov Random Fields: Theory And Applications More subtle versions of the Hadamard inequality On the nonasymptotic convergence of cyclic coordinate descent methods Equity Valuation Using Multiples: An Empirical Investigation Network topology inference from spectral templates Online topology inference from streaming stationary graph signals with partial connectivity information Stochastic methods for 1 -regularized loss minimization Capital asset prices: A theory of market equilibrium under conditions of risk Estimation of positive definite m-matrices and structure learning for attractive gaussian markov random fields Covariance estimation with nonnegative partial correlations Standard & Poor's. Global Industry Classification Standard (GICS) Adaptive variable clustering in Gaussian graphical models Robust estimation of structured covariance matrix for heavy-tailed elliptical distributions Majorization-minimization algorithms in signal processing, communications, and machine learning Glassofast: An efficient glasso implementation. The University of Texas at Austin The cluster graphical Lasso for improved estimation of Gaussian graphical models Analysis of Financial Time Series Globally optimal learning for structured elliptical losses Learning high-dimensional gaussian graphical models under total positivity without adjustment of tuning parameters Covariance-regularized regression and classification for high dimensional problems New insights and faster computations for the graphical lasso Coordinate descent algorithms. 
Mathematical Programming The MM alternative to EM A comprehensive survey on graph neural networks Novel fast networking approaches mining underlying structures from investment big data An extension of karmarkar's projective algorithm for convex quadratic programming Hankel matrix nuclear norm regularized tensor completion for n-dimensional exponential signals Does the 1 -norm Learn a Sparse Graph under Laplacian Constrained Graphical Models? arXiv e-prints Nonconvex Sparse Graph Learning under Laplacian-structured Graphical Model Optimization algorithms for graph laplacian estimation via ADMM and MM The numerical algorithms proposed in this work were implemented in the R language and made use of softwares such as CVXR (Fu et al., 2020) , Rcpp (Eddelbuettel and Francois, 2011) , RcppEigen (D. Bates, 2013) , RcppArmadillo (Eddelbuettel and Sanderson, 2014) , and igraph (Csárdi, 2019) . This work was supported by the Hong Kong GRF 16207019 research grant.