title: Cleaning large-dimensional covariance matrices for correlated samples
authors: Burda, Zdzislaw; Jarosz, Andrzej
date: 2021-07-03 journal: Phys Rev E DOI: 10.1103/physreve.105.034136
We elucidate the problem of estimating large-dimensional covariance matrices in the presence of correlations between samples. To this end, we generalize the Marcenko-Pastur equation and the Ledoit-Peche shrinkage estimator using methods of random matrix theory and free probability. We develop an efficient algorithm that implements the corresponding analytic formulas, based on the Ledoit-Wolf kernel estimation technique. We also provide an associated open-source Python library, called "shrinkage", with a user-friendly API to assist in practical tasks of estimating large covariance matrices. We present an example of its usage for synthetic data generated according to exponentially-decaying auto-correlations. In experimental research, when working with large datasets, one often faces a generic problem of determining correlations between multiple entities of interest, on the basis of observed data. This problem, in its minimalistic form, can be formulated as follows: One performs T measurements of a statistical system with N degrees of freedom, Y_i, i = 1, . . . , N, and collects the observations in a matrix Y = [y_it] of dimensions N × T, where y_it is the t-th measured value of the i-th entity. The challenge lies in estimating from this dataset the underlying two-point covariances, collected in the so-called population (a.k.a. true, or signal) covariance matrix, C = [C_ij]. (We assume throughout the article that the data have already been centered, Ȳ_i = 0.) This setting is very general, commonly encountered in physics, finance, genomics and bioinformatics, signal processing and acoustics, image recognition, speech recognition, cancer research, climatology, neuroscience, and many other areas. Below we sketch several such occurrences, in each case providing relevant orders of magnitude (marked with the symbol ∼) for N and T; we will see that, crucially, both these numbers are typically quite large and of comparable value. For instance, in lattice QCD, one may consider [1] neutron-antineutron two- and three-point correlation functions, described using N (∼ a few dozen) parameters related to the number of time steps of the dynamics. One simulates field configurations on the lattice, obtaining thereby T samples (in one version of the procedure, one sample per gauge field configuration, thus T ∼ a few dozen) for these correlation functions. These samples are used to estimate the statistical correlation matrix between the N degrees of freedom, necessary for a least-squares fit to the correlation functions. In cosmology [2], when investigating cosmic shear, i.e., the distortion of images of distant galaxies by weak gravitational lensing caused by the large-scale structure, the basic observable is the shear two-point correlation function, computed from products of ellipticities of galaxy pairs located within given angular bins, N ∼ a hundred to a thousand. This correlation function can be estimated from mock simulations, of which only T ∼ a few thousand are available, as they are computationally expensive. In finance, one needs to know correlations between N investment assets, e.g. S&P 500 stocks (N ∼ a few hundred), in order to construct optimal, well-diversified portfolios, in the spirit of Markowitz theory [3-5].
We estimate these correlations from T daily stock returns (T ∼ hundreds to thousands, due to stationarity requirements). In genomics [6] , one analyzes expression profiles of N genes (N ∼ thousands, from which we may select a subset of differentially expressed genes, say N ∼ a hundred) in order to describe similarities and functional groupings of them. Each gene is sampled T times in a microarray experiment (T ∼ a few observation times), and the resulting data matrix Y , having chosen some suitable gene expression measure, is called a gene expression matrix. In signal processing and acoustics [7] , one considers an array of N (∼ a dozen) directional sensors, and performs so-called beamforming, combining elements of an antenna array in such a way as to have constructive/destructive interference along given directions. But (frequency-domain) beamforming algorithms, such as the Capon beamformer, require the knowledge of the correlation matrix between the array outputs, which in turn is estimated from T (∼ a few dozen) samples. In anomaly detection in hyperspectral images [8] , one has a sensor "image" comprised of N spectral bands (∼ a hundred), i.e., collected narrow ranges of the electromagnetic spectrum. When the goal is detecting anomalous pixels in such an image, one needs the Mahalanobis distance between each (N -dimensional) pixel and their background mean; that in turn requires the correlation matrix between the bands, estimated from a local sliding window of pixels, of length T which is not too large in order to cover a homogeneous area of space (∼ hundreds). This problem of estimating the covariance matrix is thus of paramount importance in very diverse fields of science, and has accordingly inspired a large body of research. Actually, a situation that is even more typical than what the above examples show, is that of N T , meaning that one is able to collect many more samples than there are correlated entities. This is the textbook regime of classical multivariate statistics, N finite and T → ∞. ( In this setting, the usual sample covariance estimator, converges almost surely to C, making it a strongly consistent estimator of the population covariance matrix. In particular, the expected value of the Frobenius norm of the difference (E − C), i.e., the mean squared error (MSE), tends to zero in this limit (2) . From the point of view of estimation theory, L is a loss function, and measures the quality of an estimator, here E. In the classical statistical regime, the loss is zero for the sample estimator E (3). Yet quite often, like in the examples outlined above, one rather faces a different limit, that of both N and T large and of comparable magnitude, with q some positive constant. In this big data regime, the simultaneous estimation of ∼ N 2 elements of C from T ∼ N samples leads to substantial noise (variance) in the sample estimator E, rendering the whole process close to meaningless. To put it in a practical context of one of the examples, if a financial manager wanted to allocate assets according to a Markowitz-type procedure, and used E in the relevant formulas (which seems a very natural thing to do!), they would be optimizing their portfolio for noise! Clearly, more powerful statistical methods are required [3] . In mathematical terms, the MSE loss (4) is no longer zero in the big data regime. The challenge is then to construct other estimators Ξ of C which have a lower value of the loss as compared to the sample estimator E. 
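Since the display equations are referred to above only by number, it is worth spelling out the objects involved. The following is a compact reconstruction, consistent with the conventions used throughout the paper and with the in-text references to Eqs. (1)-(5), rather than a verbatim quote of the original displays.

```latex
% Population covariance of the centered data (1), and the classical regime (2):
C_{ij} = \langle Y_i\, Y_j \rangle , \qquad N \ \text{fixed}, \quad T \to \infty .

% Sample covariance estimator built from the N x T data matrix Y (3):
E = \frac{1}{T}\, Y Y^{\dagger} , \qquad E_{ij} = \frac{1}{T} \sum_{t=1}^{T} y_{it}\, y_{jt} .

% Mean squared error (Frobenius) loss of an estimator \Xi of C, up to an overall normalization (4):
L(\Xi, C) = \frac{1}{N}\, \Big\langle \mathrm{Tr}\, (\Xi - C)^2 \Big\rangle .

% Big data regime (5):
N \to \infty , \quad T \to \infty , \quad q \equiv \frac{N}{T} = \mathrm{const} > 0 .
```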
Such novel estimators would have some of the statistical noise cleaned, rendering them more useful in practice. (For instance, our portfolio manager could more safely use such Ξ in the Markowitz formulas.) Due to the prevalence of big data scenarios in scientific and industrial applications, and the generic nature of the estimation problem, this challenge has been met in recent years with considerable effort. We will now outline some key results, as well as add a new brick to the construction. One way to search for estimators that are optimal in the regime (5), i.e. at finite q, is through an approach generally known as shrinkage. The idea is to construct estimators with higher bias, yet much lower variance, so that the MSE (4) (which can be decomposed as bias 2 + variance) is in effect reduced ("shrunk"). This concept of shrinkage has been first introduced in the problem of estimating the mean µ of an Ndimensional normally-distributed variable Y with given covariance matrix σ 2 I, based on a single observation y. The standard least-squares (or maximum-likelihood) estimator is simply, However, James-Stein (JS) [9] demonstrated that a biased estimator of the mean, in dimension N ≥ 3, has in fact a lower MSE than (6), precisely due to reduced variance, which more than compensates for increased bias. The existence of the JS estimator is often expressed as Stein's paradox, that when estimating three or more parameters simultaneously, there exists a combined estimator with better MSE than any method handling them separately. The paradox's resolution lies in noticing that we are reducing the total MSE, not that of individual components, and so the JS estimator should really be used only when this total error is of interest rather than the individual errors. The JS estimator can be naturally understood within the empirical Bayes approach (cf. [4] for an extensive discussion): µ itself is considered a random variable, with a prior distribution assumed Gaussian of zero mean and some given variance. Minimizing the MSE (4), where now averaging is w.r.t. the joint distribution of Y and µ, leads to expression (7) . The choice of the prior is dictated by tractability: it is a conjugate prior w.r.t. the Gaussian likelihood, which means that the posterior distribution is analytically calculable; thus the minimization problem of the loss L can be carried out explicitly. Let's return now to our current problem of covariance estimation. It has been precisely an analogous Bayesian reasoning that led to the first successful attempt at constructing a shrinkage estimator of C. Indeed, assuming the likelihood P (Y |C) to be multivariate Gaussian with covariance C, which is a common starting point in many applications, the so-called inverse-Wishart prior P (C) turns out to be conjugate to it. Hence, loss (4) minimization can again be carried out analytically, and the result is the Ledoit-Wolf linear shrinkage estimator [10] , It interpolates (with some α s estimated from the data) between the sample estimator E and the null hypothesis, which here is the unit matrix; the null hypothesis can also be a more general prior matrix C 0 , encoding a more specific prior belief, achieved by a generalized version of the inverse-Wishart prior. This simple result (8) has been the workhorse of large-dimensional estimation, and can be found in all the applications mentioned in the introduction. 
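For concreteness, the estimators invoked above can be written out; the forms below are the standard textbook expressions, quoted as a plausible reading of Eqs. (6)-(8) rather than as the paper's own displays.

```latex
% Least-squares (maximum-likelihood) estimator of the mean from a single observation y (6):
\hat{\mu}_{\mathrm{LS}} = y .

% James-Stein estimator, for dimension N >= 3 and known covariance \sigma^2 I (7):
\hat{\mu}_{\mathrm{JS}} = \left( 1 - \frac{(N-2)\,\sigma^2}{\lVert y \rVert^2} \right) y .

% Ledoit-Wolf linear shrinkage toward the unit matrix (or, more generally, a prior matrix C_0) (8):
\Xi_{\mathrm{lin}} = \alpha_s\, E + (1 - \alpha_s)\, \mathbf{1} , \qquad \alpha_s \in [0, 1] .
```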
The choice of the inverse-Wishart conjugate pair is, however, dictated more by computational tractability than any insight gleaned from the data, and in fact the observed sample estimator E strongly constraints the prior distribution of C; indeed, through the Marčenko-Pastur law (41), which we discuss later on. The recent strand of research [4, [12] [13] [14] [15] , initiated by [11] , thus approaches the problem of constructing shrinkage estimators in greater generality, and with marked attention to data. This we are now going to outline. Consider the eigendecomposition of the sample estimator, As we have repeatedly stressed, in the big data regime (5) , the curse of dimensionality means that both the eigenvalues and eigenvectors have high variance, thus "concealing" behind statistical noise the underlying nature of the true C. Suppose, in the first approximation, that we have no prior belief on the eigenvectors of C, so that our goal is merely to "clean" the eigenvalues. In the absence of any preferred direction (bias) in the eigenvector space, the only available basis is that of E. In other words, we wish to seek a MSE-optimal estimator of C in the space of matrices with the eigenvectors |λ i , This form is called a rotationally-invariant estimator (RIE) [4] , because it can alternatively be obtained from a Bayesian argument with a prior on C invariant under orthogonal similarity transformations. Note that the linear shrinkage estimator (8) also has the same sample eigenvectors, so it belongs to the RIE class; its eigenvalues are simple linear functions of the sample eigenvalues, ξ i = α s λ i + 1 − α s . This Ledoit-Wolf estimator, despite being better in terms of the MSE (4) than the plain sample estimator, is nonetheless not optimal. Indeed, we will find out that the MSE-optimal shrunk eigenvalues ξ i are nonlinear functions of λ i , making (10) an example of nonlinear shrinkage. Minimizing the MSE in this space (10) is a quadratic problem, and so its solution is straightforward, where we introduce the eigendecomposition of the population covariance matrix, as well as the overlaps between the eigenvectors of E and C, Note that the scalar products λ i |c j are of order 1/ √ N , and so the factor of N makes the overlaps of order 1 in the large-N limit. The solution (11) can be rewritten as an integral, making use of the density of eigenvalues of C, where δ is the Dirac delta. This form is especially convenient in the big data limit (5) . The formal solution (11) is unfortunately not useful in practice as it depends on the knowledge of C, which is the very object we wish to estimate! After all, the purpose of this research is uncovering from data as much information about C as possible. Since it requires insight into the unknown, this solution is called the oracle estimator. It is however a documented phenomenon that a MSEoptimal estimator may in fact happen to be independent of the parameter being estimated. A canonical example is estimating the population variance σ 2 of a Gaussian random variable Y . If we restrict the search space to estimators of the form cV , where V = 1 T T t=1 (Y t −Ȳ ) 2 is the sample variance estimator, and we are looking for a constant c which minimizes the MSE, (cV − σ 2 ) 2 , then we find c = T /(T + 1), independently of the true σ 2 . And indeed, a similar miracle occurs for our oracle estimator (11) in the big data limit (5) . 
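In formulas (a reconstruction consistent with how these objects are used later in the text):

```latex
% Eigendecompositions of the sample estimator (9) and of the population matrix (12):
E = \sum_{i=1}^{N} \lambda_i\, |\lambda_i\rangle \langle \lambda_i| , \qquad
C = \sum_{j=1}^{N} c_j\, |c_j\rangle \langle c_j| .

% Rotationally-invariant estimator: sample eigenvectors, cleaned eigenvalues (10):
\Xi = \sum_{i=1}^{N} \xi_i\, |\lambda_i\rangle \langle \lambda_i| .

% Oracle (MSE-optimal) eigenvalues (11) and the rescaled eigenvector overlaps (13):
\xi_i^{\mathrm{or}} = \langle \lambda_i | C | \lambda_i \rangle
 = \frac{1}{N} \sum_{j=1}^{N} O(\lambda_i, c_j)\, c_j , \qquad
O(\lambda_i, c_j) \equiv N\, |\langle \lambda_i | c_j \rangle|^2 .

% Integral form (14), with \rho_C(c) = (1/N) \sum_j \delta(c - c_j):
\xi^{\mathrm{or}}(\lambda_i) = \int \mathrm{d}c\; \rho_C(c)\, O(\lambda_i, c)\, c .
```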
In this regime, it turns out that the right hand side can be entirely expressed as a function of the sample estimator E, which can of course be easily computed based on the data at hand, without any recourse to C. This has been first demonstrated for uncorrelated samples by Ledoit and Péché [11] , who derived an analytic expression (53) for the ξ i 's, dependent only on E rather than C. The purpose of the current paper is to extend this result to samples that are correlated. Our framework will be that of random matrix theory (RMT), as well as some ideas from the free probability calculus. In the next section, we will review the relevant notions of these theories. A. Random matrix theory transforms Let's start by recalling a few definitions and relations which are useful in the derivation below. For a selfadjoint random matrix, an example of which is the sample covariance estimator E, the fundamental object is the resolvent, also known as the Green function. It is a matrix-valued function of a complex argument z, To fix attention, think of E as coming from the ensemble of N × N matrices of the form (3), with Y a rectangular N × T random matrix. We will also assume Y to have normally-distributed entries. Our problem will then turn out to be analytically feasible; but at the same time it will be argued that the results are in fact representative of a wider class of matrices; hence, this assumption is less restrictive than it appears. Now, returning to the resolvent (16) in full generality, one may suppose that the randomness of E makes it a random function; however, the claim is that in the large-N limit it is self-averaging, that is independent of any realization of E, but converging to a deterministic matrix. With a slight abuse of notation, we will keep calling that deterministic function G E (z). In other words, the average of the resolvent over the ensemble of E can be replaced by the resolvent for a single realisation of E when N → ∞. Another important object is the normalized trace of the resolvent, In the large-N limit, it tends to the Stieltjes transform of the eigenvalue density ρ E (λ), and thereby encodes the information about the spectrum of E. Indeed, evaluating g E (z) near the real axis, that is at z = λ − i , with → 0 + , and applying the Sokhotski-Plemelj formula, where "p.v." stands for the Cauchy principal value, yields, where, is the Hilbert transform of the eigenvalue density. The imaginary part of the above expression, on the other hand, is simply the eigenvalue density ρ E (λ). A notion closely related to the resolvent is that of the moment generating function, also called the Mtransform. In its matrix form, analogous to (16) , it is defined as, Its normalized trace, has a useful property of generating, in the 1/z-expansion, the moments of E, where, In the free probability theory [16, 17] , one commonly uses instead of m E (z) another generating function, the ψ-transform, since it produces the moments as coefficients of the zexpansion, rather than 1/z as in (24). Let's remark at this point that we have introduced our objects in pairs: a matrix version and its normalized trace version. In free probability, only the latter is relevant, and that is because there we only deal with distributions of eigenvalues. Here we are however also concerned about eigenvectors, and thus we need to encode the rotational information, too, which we do in the more general matrix form. 
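Collecting the transforms just introduced in one place (with the conventions implied by the text: the moments appear in the 1/z expansion of m_E, and in the z expansion of ψ_E):

```latex
% Resolvent (Green function) (16) and its normalized trace, the Stieltjes transform (17)-(18):
G_E(z) = \left( z\, \mathbf{1}_N - E \right)^{-1} , \qquad
g_E(z) = \frac{1}{N}\, \mathrm{Tr}\, G_E(z) \;\xrightarrow[N \to \infty]{}\; \int \mathrm{d}\lambda\, \frac{\rho_E(\lambda)}{z - \lambda} .

% Sokhotski-Plemelj decomposition near the real axis (19)-(21):
g_E(\lambda - \mathrm{i}0^{+}) = h_E(\lambda) + \mathrm{i}\pi\, \rho_E(\lambda) , \qquad
h_E(\lambda) = \mathrm{p.v.} \int \mathrm{d}\lambda'\, \frac{\rho_E(\lambda')}{\lambda - \lambda'} .

% M-transform, matrix (22) and scalar (23)-(25), and the psi-transform (26):
M_E(z) = z\, G_E(z) - \mathbf{1}_N , \qquad
m_E(z) = z\, g_E(z) - 1 = \sum_{k \geq 1} \frac{m_{E,k}}{z^k} , \qquad
m_{E,k} = \frac{1}{N}\, \mathrm{Tr}\, E^k , \qquad
\psi_E(z) = m_E\!\left( \tfrac{1}{z} \right) .
```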
This is why we distinguish between scalar and matrix versions of the moment-generating functions and related objects. The normalized-trace transforms of free probability, especially the ψ-transform (26), play a central role in free multiplication [16, 17] . We are given two large (N → ∞) random matrices, A and B; in particular we know their respective eigenvalue densities, ρ A (λ) and ρ B (λ). We assume they are free with respect to each other; "freeness" is a matrix analog of statistical independence of random variables. In such a case, it turns out it is possible to derive the eigenvalue density of their matrix product, ρ AB (λ), based solely on the individual densities. To be more precise, A and B should be self-adjoint, so that their eigenvalues are real, and the densities are real-valued. But the product AB won't then typically be self-adjoint. If A is positive-definite, and B has trace different from zero, then the alternative product √ AB √ A has the same moments as AB, and is self-adjoint. This we will call the "free product". The prescription to do so makes use of a χ-transform, which is the functional inverse of the ψ-transform (26) , Furthermore, yet another useful object, the S-transform, is related to the above as, A justification for such a seemingly arbitrary construction is that in this language the law of free multiplication becomes very straightforward, namely the S-transforms are simply multiplicative under the free product of matrices, The logic here is that the known eigenvalue densities of the constituents A and B are used to calculate the respective Stieltjes transforms (18) of both matrices, then their ψ-transforms through the simple formula (26) , whereupon functional inversion gives the χ-transforms (27) , and finally the S-transforms (28) . After multiplying the two according to (29), we follow the above path backward to eventually find the eigenvalue density of the free product √ AB √ A. Remember that our present goal is to demonstrate that the oracle estimator (14) , with the eigenvector overlaps (13) , can be expressed entirely in terms of the observable E, rather than the true and unknown C. To this end, we need to somehow model the relationship between C and E, or equivalently, between C and the observed data Y . Such a model, simplistic as it may be, should be capable of capturing essential features of the data-generating process. Recall that Y = [y it ] is an N × T matrix of observations, all of which we now treat as random variables. The simplest possible model is that these entries are normallydistributed, in such a way that the samples of a given entity are uncorrelated among themselves, while any two entities i and j are correlated according to the population matrix C ij , This structure has been extensively studied. The topic of the current publication is to take it one step further in a very natural direction [18] [19] [20] : by replacing the Kronecker delta in (30) with an arbitrary matrix A ts , real symmetric and positive semi-definite, We assume also so that the decomposition on the right hand side is unambiguous. You may think of the dataset generated synthetically by a multivariate stochastic process (e.g., a multidimensional Ornstein-Uhlenbeck process, or a VARMA process), alternatively the dataset consisting of historic observations; A ts describes auto-correlations between samples. Note that, crucially, these auto-correlations are assumed identical for all the entities, i.e., decoupled from the cross-correlations. 
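The relations just described read, in the standard free-probability conventions (the S-transform normalization below is the one consistent with the replacement χ_A(u) = u S_A(u)/(1 + u) used later in the text; the normalization of A in (32) is inferred from the requirement a(0) = 1 stated below):

```latex
% chi-transform as the functional inverse of the psi-transform (27), and the S-transform (28):
\chi_A\!\left( \psi_A(z) \right) = z , \qquad
S_A(z) = \frac{1 + z}{z}\, \chi_A(z) .

% Law of free multiplication for the free product of A and B (29):
S_{\sqrt{A}\, B \sqrt{A}}(z) = S_A(z)\, S_B(z) .

% Correlation structure of the data: uncorrelated samples (30) vs. the sandwich model (31)-(32):
\langle y_{it}\, y_{js} \rangle = C_{ij}\, \delta_{ts}
\quad \longrightarrow \quad
\langle y_{it}\, y_{js} \rangle = C_{ij}\, A_{ts} , \qquad \frac{1}{T}\, \mathrm{Tr}\, A = 1 .
```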
It would be quite interesting to extend this work to a more general coupled structure, y it y js = C ijts . For a stationary stochastic process, the autocorrelation matrix A is a Toeplitz matrix, with a some function such that a(0) = 1 (to ensure normalized trace). A simple and often physically justified case is that of exponentially-decaying auto-correlations, with auto-correlation time τ . This model is a good starting point for systems exhibiting short-range correlations between samples. The model (31) is simple enough to allow for an analytical solution (below), yet, as in the central limit theorem, it is believed to describe in the large-dimensional limit a broader class of distributions (not necessarily Gaussian), only perhaps without fat tails. In fact, even fat tails may be accounted for: it is known that the maximumlikelihood estimator of correlations for fat-tailed random variables is the robust Maronna estimator ; and it turns out [21] it can be expressed in the form (31) with a proper choice of A. Yet another application [22] of the model (31) is that a proper choice of A allows us to describe the exponentiallyweighted moving average (EWMA) estimator of C. Essentially, instead of treating all the samples equally as in the sample estimator E (3), older samples are given lower weights than newer ones, thereby taking into account possible non-stationarity of the data. (We plan to address these cases in a separate publication.) An attentive reader may wonder at this point how it is that while C is unknown (and we are trying to estimate it), we seem to presume the knowledge of A. Indeed, in the latter two applications we mentioned (the Maronna and EWMA estimators), A plays more of a technical role: it turns out that the structures appearing in other contexts (covariance estimation under fat tails, or samples getting obsolete due to non-stationarity, respectively), can be translated to the language of (31) with a certain A. In the first example, though, we should genuinely try to estimate A (e.g., via estimating τ ) from the data. However, an alternative way of thinking is that τ is an effective parameter, giving our model an additional degree of freedom; that added flexibility can then be used to fit the data better. A thorough investigation of this topic is left for another paper; but already at this point we announce the logic of the matter: the resulting rotationally-invariant estimator, toward which we are working, will obviously depend on τ , and in a real-world scenario that would allow us to choose this effective auto-correlation time so that the loss function (4) is lower than for A = I. Interestingly, even though the MSE clearly depends on the unknown C, we will see a way of fitting τ even without the knowledge of C. To recapitulate, we assume a simple data-generating model, a Gaussian random matrix Y with the correlation structure (31). Since a linear combination of Gaussian variables remains Gaussian, it is easy to show that an equivalent formulation is that of a linear combination of standard normal variables, The square roots are well-defined as C and A are positive semi-definite.) The modeled relation between the sample estimator E and the true C is therefore, Its structure is that of a sandwich consisting of several layers of matrices multiplied together. We remind again that we are on a quest to express the oracle estimator (11) through E rather than C. 
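As a concrete illustration, here is a minimal numerical sketch of this data-generating model, Y = √C X √A with i.i.d. standard normal X and exponentially-decaying Toeplitz auto-correlations. The helper functions and names are written for this note and are not the shrinkage library's API.

```python
import numpy as np
from scipy.linalg import sqrtm, toeplitz

def exp_decay_autocorr(T, tau):
    """Toeplitz auto-correlation matrix A_ts = exp(-|t - s| / tau), normalized so that a(0) = 1."""
    return toeplitz(np.exp(-np.arange(T) / tau))

def sample_data(C, A, rng=None):
    """One draw of the N x T data matrix Y = sqrt(C) X sqrt(A), with X i.i.d. standard normal."""
    rng = np.random.default_rng(rng)
    N, T = C.shape[0], A.shape[0]
    X = rng.standard_normal((N, T))
    return sqrtm(C).real @ X @ sqrtm(A).real

# Example: N = 500 entities, true eigenvalues 1 and 3 in equal proportions, T = 1000, tau = 3.
N, T, tau = 500, 1000, 3.0
C_true = np.diag(np.repeat([1.0, 3.0], N // 2))
A_true = exp_decay_autocorr(T, tau)
Y = sample_data(C_true, A_true, rng=42)
E = Y @ Y.T / T   # sample estimator; its spectrum is smeared around the true eigenvalues 1 and 3
```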
The first step was to set up a model relating the two matrices, tractable on the one hand, and realistic enough on the other. The "sandwich model" (36) has been our choice, allowing not only for correlated samples (a phenomenon observed in practical applications), but also technical extensions such as the Maronna or EWMA estimators. The second step is to express this relationship mathematically through the random matrix theory transforms defined above. We will see that this language will allow us to rewrite the eigenvector overlaps (13) entirely in terms of E. There are two levels to this construction. First, we have already remarked that (36) is a product of matrices; in fact, we see that up to a certain reshuffling of factors it is a product of C, A, and the standardized Wishart random matrix (37). It turns out that from the point of view of free probability this reshuffling does not matter, as we elucidate in appendix A (it is essentially due to the cyclic property of trace), and so the free multiplication law (29) leads directly to the factorization (38), S_E(z) = S_C(z) S_A(qz) S_W(z) [19]. The last term is the S-transform of the Wishart matrix (37), S_W(z) = 1/(1 + qz). The additional factor q = N/T in the argument of S_A(qz) comes from the difference in dimensions between A and C. Using elementary algebra and relations between the various transforms, (38) can be rewritten in the suggestive form (39), m_E(z) = m_C(Z), with Z = z S_A(qm)/(1 + qm). In other words, the scalar (i.e., the normalized-trace version) moment-generating function (MGF) (23) of the sample estimator, evaluated at a complex argument z, which we denote for short m ≡ m_E(z), is equal to the scalar MGF of the population covariance matrix, but evaluated at a different complex argument Z, m_C(Z). This complex transformation z → Z depends on m, and does so through the S-transform of the auto-correlation matrix A. Note that we may equivalently express the right hand side through the χ-transform (27), Z/z = χ_A(qm)/(qm), which will in fact be more convenient for us. We will refer to (39) as the (scalar) generalized Marčenko-Pastur equation. Let us also mention that (39) is sometimes spelled out in the literature by explicitly writing the moment-generating function of C as an integral (18). In particular, this equation for A = I, i.e., with Z = z/(1 + qm), is the classical Marčenko-Pastur law of 1967 [23]; that is why we call (39) "generalized". We will however refrain from such a lengthy integral expression, preferring instead the brevity offered by the various transforms we have introduced. One way of thinking about (39) is the following: Suppose one is given the true matrices C and A; say, one generates the data from a stochastic process with these matrices as inputs. Consequently, the MGF of C and the S-transform of A can be (at least in principle) calculated. Then (39) is a system of equations with unknown m. In some cases this system is explicitly solvable; for instance, for C = I_N and A = I_T, we have m_C(u) = 1/(u − 1) and S_A(u) = 1, directly from the definitions, and hence (39) becomes a quadratic equation for m, qm² + (1 + q − z)m + 1 = 0. Solving this equation, and using the Sokhotski-Plemelj formula (20), leads to the density of eigenvalues of E (41), supported on the interval [λ_−, λ_+], with λ_± ≡ (1 ± √q)².
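In compact form, the chain of relations described above reads as follows (a reconstruction assembled from the pieces quoted in the text and in appendix A; the A = 1 special case is the standard Marčenko-Pastur result):

```latex
% Free multiplication for the sandwich model (38), and the equivalent "mapping" form (39):
S_E(z) = S_C(z)\, S_A(qz)\, S_W(z) , \qquad S_W(z) = \frac{1}{1 + qz} ;
\qquad
m_E(z) = m_C(Z) , \quad
\frac{Z}{z} = \frac{S_A(q\, m_E(z))}{1 + q\, m_E(z)} = \frac{\chi_A(q\, m_E(z))}{q\, m_E(z)} .

% For C = 1_N and A = 1_T:  m_C(u) = 1/(u - 1), S_A = 1, hence Z = z/(1 + qm) and
q\, m^2 + (1 + q - z)\, m + 1 = 0 ,

% whose solution, via Sokhotski-Plemelj, gives the Marcenko-Pastur density (41) (here for q <= 1):
\rho_E(\lambda) = \frac{\sqrt{ (\lambda_+ - \lambda)(\lambda - \lambda_-) }}{2 \pi q \lambda} , \qquad
\lambda_{\pm} = (1 \pm \sqrt{q})^2 , \quad \lambda \in [\lambda_-, \lambda_+] .
```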
This is the famous Marčenko-Pastur density [23] , and it beautifully demonstrates the very problem we are trying to solve in this paper: the true correlation eigenvalue is in this case c i = 1 (with multiplicity N ), but the observed sample eigenvalues λ i are scattered around 1, smeared by statistical noise, the more so the greater q = N/T is, i.e., the more we are into the big data regime (and irrespective of how many samples T we have collected!); the underlying eigenvalue 1 is hidden inside this "blob" of sample eigenvalues, [λ − , λ + ]. Assuming the knowledge of C and A, and calculating m ≡ m E (z) from them, is thus an useful perspective on the generalized Marčenko-Pastur equation (39), but this is not the problem we are trying to solve. Rather, we know m, which is straightforward to compute from the observed data through the sample estimator E. We also suppose we know A, which, as we have discussed above, is either a technical construct, or can be treated as a set of effective parameters. The goal is to find C, which will turn the oracle estimator (11) into an observable quantity. An attentive reader will now notice, though, that the oracle estimator depends on the overlaps (13) of the eigenvectors of C and E, which the scalar Marčenko-Pastur equation is not capable of capturing. Rather, we need an equation relating the matrix versions of the transforms in question. Interestingly enough, such an equation exists, and is in fact completely parallel to its scalar version (39), with complex numbers Z and z related in the exact same way as before. This is an N × N matrix equation, expressed through the matrix moment-generating functions (22) . Of course, the previous scalar version follows by taking normalized trace of both sides. We will call (42) the matrix generalized Marčenko-Pastur equation. It has first been derived in [18] using diagrammatic methods; [4] presents another derivation via the replica trick and the low-rank orthogonal Harish-Chandra-Itzykson-Zuber integral. A. Shrinkage estimator for the sandwich model We are now ready to return to the problem at hand, which is to express the oracle estimator entirely through E. The key point is to rewrite the eigenvector overlaps (13) in the language of the above transforms, and to apply the generalized Marčenko-Pastur equation (42), (39) to translate all the references to transforms of C to those of E only. To this end, write the matrix resolvent (16) of E through the sample eigenvalues and eigenvectors (9) , in an integral form weighted by the eigenvalue density, Evaluate this expression near the real axis, at z = λ−i0 + , and retain the imaginary part of the result; the Sokhotski-Plemelj formula (19) implies, Insert this equality inside a scalar product with an eigenvector of C, We have thereby expressed the eigenvector overlap through the matrix resolvent of E; still, however, there is an explicit reference to an eigenvector of C, which we will now remove in favor of some transform of C. Indeed, the matrix resolvent of E evaluated at any complex z is equivalent, thanks to the matrix Marčenko-Pastur equation (42), to the matrix resolvent of C at the transformed Z = Z(z) (39), which has in turn a simple behavior when surrounding it with an eigenvector of C, Inserting this to (45), In the oracle estimator's generic eigenvalue ξ (14) , the overlap appears multiplied by the corresponding eigenvalue c, and integrated over c with the proper eigenvalue density, ρ C (c). 
This operation, applied to the left hand side of (47), produces according to (23) , (18) the scalar moment-generating function of C evaluated at the complex argument Z, The final step is to replace in (48) the scalar MGF of C at argument Z by the scalar MGF of E at argument z, according to the scalar Marčenko-Pastur equation (39), And this is precisely what we set out to achieve! The oracle eigenvalue ξ, previously accessible only through the unknown C, is now expressed solely in terms of the observable E, through the scalar MGF, m = m E (λ − i0 + ), present in (49) either explicitly, or inside Z/z = S A (qm)/(1 + qm) = χ A (qm)/qm (39). Indeed, let's rewrite (49) even more appealingly. For any observed sample eigenvalue λ i , for i = 1, . . . , N , choose as a basic quantity, where since we are close to the real axis, the Sokhotski-Plemelj formula implies (23) , (20) , through the directly observable Hilbert transform and density of eigenvalues of the sample estimator E. In this notation, (49) becomes the main formula of this paper, the nonlinear shrinkage for correlated samples, where recall the χ-transform (27) of A. We note also that since the S-transform (28) is probably better known than the χ-transform, we may replace the latter in (52) by u i S A (u i )/(1 + u i ). We stress again the logic behind this formula: • We perform T observations of some N given entities, and collect these measurements in an N × T data matrix Y . • From this dataset, we calculate the standard sample estimator E (3), and diagonalize it, obtaining in particular a set of sample eigenvalues λ i , i = 1, . . . , N . As we have extensively discussed, in the big data regime (5), these eigenvalues have a crucial component of statistical noise, making them unreliable for any estimation purposes. • From this set of sample eigenvalues, we first estimate the Hilbert transform h E (λ) and density of eigenvalues ρ E (λ). We will discuss a relevant procedure below, based on the kernel method. • For any given sample eigenvalue λ i , we thus readily calculate the corresponding α i and β i (51), and so u i = α i + iβ i (50). • Having assumed some model of the autocorrelations A (see below), we know (at least in principle) the χ-(equivalently, S-) transform of A. Thus (52) gives us explicitly the cleaned (a.k.a. shrunk ) eigenvalue ξ i . This is an eigenvalue of an optimal, w.r.t. the MSE loss function (4), rotationally-invariant estimator (RIE) (10) . Since the left hand side of (52) is the ratio ξ i /λ i , the right hand side may be called a shrinkage factor. One sanity check is that for q → 0 we are moving from the big data regime (5) to the classical regime (2), in which case we should have Ξ = E, as the sample estimator is then optimal (and of course E belongs to the RIE class). Indeed, in this limit, u i → 0. An expansion of the S-transform around zero is known [24] , and when coupled with the assumed normalization condition of A (32), it implies S A (u i ) → 1, hence χ A (u i ) → u i , and so ξ i → λ i , as expected. Historically the first nonlinear shrinkage formula, and a cornerstone of the theory we are elaborating here, has been the Ledoit-Péché (LP) shrinkage [11] (cf. also [4] ), derived for the sandwich model in the case of no correlations between samples (30). To verify that our formula (52) reduces appropriately, set A = I. 
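For reference, the ingredients (50)-(52) can be collected in one place. The explicit forms below are reconstructed from the verbal description above; they reproduce the A = 1 Ledoit-Péché limit derived next, which serves as a consistency check, and should be read as a summary rather than a verbatim copy of the original displays.

```latex
% Basic observable, built from the sample spectrum only (50)-(51):
u_i = q\, m_E(\lambda_i - \mathrm{i}0^{+}) = \alpha_i + \mathrm{i}\beta_i , \qquad
\alpha_i = q\left( \lambda_i\, h_E(\lambda_i) - 1 \right) , \qquad
\beta_i = q\,\pi\,\lambda_i\, \rho_E(\lambda_i) .

% Nonlinear shrinkage for correlated samples (52):
\frac{\xi_i}{\lambda_i} = \frac{\operatorname{Im} \chi_A(u_i)}{\operatorname{Im} u_i}
= \frac{1}{\operatorname{Im} u_i}\, \operatorname{Im}\!\left[ \frac{u_i\, S_A(u_i)}{1 + u_i} \right] .

% Ledoit-Peche limit, A = 1, where \chi_A(u) = u/(1+u)  (53):
\frac{\xi_i^{\mathrm{LP}}}{\lambda_i}
= \frac{\operatorname{Im}\left[ u_i/(1+u_i) \right]}{\operatorname{Im} u_i}
= \frac{1}{|1 + u_i|^2}
= \frac{1}{\left| 1 - q + q\,\lambda_i\, g_E(\lambda_i - \mathrm{i}0^{+}) \right|^2} .
```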
Directly from the definitions (16), (22) , (23), (26) , the scalar ψ-transform of A is ψ A (z) = z/(1 − z), from which the χ-transform (27) follows by functional inversion, χ A (z) = z/(z + 1); note that the S-transform (28) is simply S A (z) = 1. The shrinkage factor (52) thus becomes Im (u i /(1 + u i ))/Im u i , that is, which is indeed the celebrated Ledoit-Péché formula. A non-trivial example of a model of correlations between the samples, often relevant for situations of short-ranged dependence between the steps of a sampling process, is that of an exponentially-decaying function (34), with auto-correlation time τ . It is a straightforward exercise [18, 19] to derive the χ-transform of A, with γ = coth(1/τ ). The shrinkage factor (52) for this case follows immediately. There are other interesting models of auto-correlations, such as those generated by general VARMA(r 1 , r 2 ) stochastic processes, with a matrix of i.i.d. standard Gaussian (or some other distribution, say Student-t) variables. Such a process has (r 2 + 1) "MA" (moving average) parameters a α , and r 1 "AR" (auto-regressive) parameters b β . Note that the exponential decay is a case of VAR(1) with a particular set of parameters, b 1 = e −1/τ and a 0 = 1 − b 2 1 . Moreover, as we have alluded to above, specific models of A appear for the Maronna or EWMA estimators. These issues are left for another publication. As mentioned before, our shrinkage formula (52) for a general sandwich model (31) is complete except for one important component, that is a numerical method of estimating from the observed dataset the Hilbert transform and density of eigenvalues of E, which are inputs to the basic variable u i (50), (51). The problem is that the eigenvalue density is approximated by a numerical histogram, and one cannot use it directly as an input for the Hilbert transform, which is an integral transform (21) . Several ideas have been proposed to alleviate this difficulty. A very simple procedure is to choose a small but finite in (50), and approximate u i ≈ qm E (λ i − i ). In particular, [4] uses this approach with = N −1/2 , and reports satisfactory behavior. In our numerical experiments, we have however found this algorithm very sensitive to the choice of , and generally unstable. For a time, a standard and comprehensive numerical solution (for uncorrelated samples) consisted of quite a complicated scheme termed inverse QuEST (= "Quantized Eigenvalues Sampling Transform") by Ledoit and Wolf [14] . In essence, it encompasses the following steps: First, suppose we know the true covariance eigenvalues c 1 , . . . , c N . The scalar Marčenko-Pastur relation (39), for z = λ − i0 + , can be manipulated to yield the sample eigenvalue density ρ E (λ) in a parametric form, dependent on the c i 's. (This is the same train of thought that led to the classical Marčenko-Pastur density formula (41).) Second, we integrate the density numerically to find the cumulative distribution function, CDF(λ) = λ 0 dλ ρ E (λ ). Third, we invert it numerically to find the quantile function, i.e., from CDF(λ) = p to λ = Q(p). Fourth, an assumption is made that the observed eigenvalues are distributed according to this quantile function, i.e., that we can approximate their positions byλ i = Q(i/N ). The above algorithm (already quite involved) thus estimates the sample eigenvalues based on known population eigenvalues,λ i (c 1 , . . . , c N ); this is the QuEST function. 
Now the "inverse" part of the prescription is the following: In reality, we do not know the c i 's, rather we observe the λ i 's. We estimate the former by minimizing the mean squared error between the QuEST estimates and the measured values, (56) Finally, we calculateλ i (c 1 , . . . ,c N ), from which u i = qm E (λ i ), without a regularizer. As we see, the Ledoit-Wolf inverse QuEST is a complicated algorithm (and we have not mentioned additional numerical intricacies that need to be addressed), but provides a very solid and stable solution. However, recently Ledoit and Wolf [15] proposed a kernel method that is computationally straightforward, stable, and exceptionally easy to use in practice. Both the Hilbert transform and the eigenvalue density are approximated as a sum of N kernels set up around each λ i . That is, for the density we write, Here r is a kernel function, assumed real, non-negative, normalized ( dx r(x) = 1), centered ( dx x r(x) = 0), and with unit width ( dx x 2 r(x) = 1). The parameters b i (bandwidths) are local scale factors, determining the width of the individual peaks. They should be chosen to make the neighbouring peaks overlap. Ledoit and Wolf suggest to select them identical for all the eigenvalues (i.e., independent of i), and equal to It is a purely heuristic choice, but we too have found it stable and accurate. The Hilbert transform (21) is a linear operation on the eigenvalue density, hence it transforms the sum (57) into another sum, where h is the Hilbert transform of r. The idea is to choose the latter such that its Hilbert transform is analytically tractable and easy to implement. A good candidate is the Epanechnikov kernel, essentially the non-negative part (denoted x + ≡ max(x, 0)) of a parabola, (60) Its Hilbert transform (21) reads, where the second term is understood to be zero for x = ± √ 5. To sum up, we have straightforward estimates of the sample eigenvalue density (57) and Hilbert transform (59), with the heuristic bandwidth (58), and the Epanechnikov kernel (60), (61). They are inserted directly into the inputs (51) to the main shrinkage formula (52). We have now completed the main derivation of the paper, with an essentially analytical formula for shrunk eigenvalues; it does require numerical components in the form of the Epanechnikov kernel estimation, as well as some further fitting procedure described below, needed in realistic situations, but the expression (52) itself is explicit. There exists, however, a purely numerical alternative construction that allows one to estimate the oracle eigenvalues directly from the dataset Y , with no complicated random matrix theory in the process, and with excellent performance! We will now outline the method, and discuss its applicability. It has been introduced in sec. 8.2. of [4] , and is a certain modification of the method by Bartz [25] . Recall that we have started our journey from the oracle estimator (11) , minimizing the Frobenius norm between the true C and the sought-for Ξ in the space of rotationally invariant estimators (10) . The solution is simple (as the problem is quadratic), ξ i = λ i |C|λ i , but depends on C, which is in principle unknown. The machinery of random matrix theory in the big data limit (5) is then invoked to show that the actual dependence is solely on observable quantities, the sample eigenvalues λ i , via one of the shrinkage formulas discussed in this paper. This oracle formula can however be estimated by a numerical procedure of a moving-window cross-validation. 
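Before turning to that cross-validation procedure, here is a minimal, self-contained sketch of the kernel pipeline just described. The Epanechnikov kernel and its Hilbert transform follow the formulas above; the bandwidth default and the function names are illustrative assumptions (the paper's Eq. (58) fixes the actual bandwidth choice), so the snippet is a sketch of the method, not the shrinkage library's API.

```python
import numpy as np

SQRT5 = np.sqrt(5.0)

def epanechnikov(x):
    """Epanechnikov kernel r(x) = 3/(4*sqrt(5)) * (1 - x^2/5)_+  (eq. 60)."""
    return 3.0 / (4.0 * SQRT5) * np.clip(1.0 - x**2 / 5.0, 0.0, None)

def epanechnikov_hilbert(x):
    """Hilbert transform of the Epanechnikov kernel, h_r(x) = p.v. int r(t)/(x - t) dt;
    the logarithmic term vanishes at x = +/- sqrt(5)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        log_term = (1.0 - x**2 / 5.0) * np.log(np.abs((SQRT5 + x) / (SQRT5 - x)))
    log_term = np.where(np.isfinite(log_term), log_term, 0.0)
    return 3.0 / (4.0 * SQRT5) * (2.0 * x / SQRT5 + log_term)

def kernel_density_and_hilbert(lams, bandwidth):
    """Kernel estimates of rho_E and h_E at the sample eigenvalues (57), (59), identical bandwidth."""
    x = (lams[:, None] - lams[None, :]) / bandwidth
    rho = epanechnikov(x).mean(axis=1) / bandwidth
    hil = epanechnikov_hilbert(x).mean(axis=1) / bandwidth
    return rho, hil

def nonlinear_shrinkage(lams, q, chi_A=lambda u: u / (1.0 + u), bandwidth=None):
    """Shrunk eigenvalues xi_i = lambda_i * Im chi_A(u_i) / Im u_i  (eqs. 50-52), returned in
    ascending eigenvalue order. The default chi_A corresponds to A = 1 (Ledoit-Peche)."""
    lams = np.sort(lams)
    if bandwidth is None:
        bandwidth = lams.mean() * len(lams) ** (-1.0 / 3.0)   # heuristic stand-in, not eq. (58)
    rho, hil = kernel_density_and_hilbert(lams, bandwidth)
    u = q * (lams * hil - 1.0) + 1j * q * np.pi * lams * rho  # u_i = alpha_i + i beta_i
    return lams * chi_A(u).imag / u.imag

# Usage with the synthetic data generated earlier:
# lams = np.linalg.eigvalsh(E);  xi = nonlinear_shrinkage(lams, q=N/T)
```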
First, imagine we have collected some more samples, and now have T total > T of them. Consider now a series of K pairs of consecutive moving windows. Each such pair consists of a "training" ("in-sample") window of length T , and a "testing" ("out-of-sample") window of some length T out . This pair of windows of length (T + T out ) we keep shifting K times through the whole dataset; this in particular implies, K = (T total −T )/T out . For the µ-th fold (µ = 0, . . . , K − 1) denote t µ = T + µT out + 1; following the convention used in the paper that the temporal index starts from 1, we see that the µ-th test fold has indices from t µ up to (t µ + T out − 1) (both inclusive), while the µ-th train fold stretches between (t µ − T ) and (t µ − 1). For each fold µ, we denote by E train,µ the sample estimator calculated on the current train fold, and in particular |λ train,µ i its eigenvectors. On the other hand, by E test,µ denote the sample estimator calculated on the current test fold. The claim then is that the oracle eigenvalues can be estimated by the mean over all the folds of the average test sample estimator in the state given by any train sample eigenvector, In other words, the numerical prescription consists of repeating K times a calculation of the (current) train sample estimator, its diagonalization, then calculation of the (current) test sample estimator, and finally taking the relevant scalar product. An intuition behind this formula is that the unknown C in the oracle formula can be approximated by the outof-sample, i.e., unknown from the point of view of an in-sample observer, estimator. As a side note, let us make the following remark: The test sample estimator is derived from a dataset of shape N × T out , i.e., with a different noise ratio than q = N/T . One might wonder if this mismatch leads then to correct estimation. An intuition why this should not matter is that the test sample estimator corresponds in the oracle formula to the true C, which after all knows nothing about q. Let us also stress the importance of performing crossvalidation that preserves the time ordering, i.e., by moving windows. This is crucial especially when we believe auto-correlations are present in the system. The method of Bartz, mentioned in the beginning, is exactly identical except that the train and test folds are chosen randomly from among the samples, thus leveling out any temporal dependence. The moving-window cross-validation-estimated oracle eigenvalues (62) turn out to have a significant variance; this will be visually clear in the figures below, where they form a broad cloud of scatter points. The idea of [22] is extraordinarily simple and effective: fit isotonic regression to the estimated oracle eigenvalues, i.e., a monotonic function closest to the observations. We shall see that such a simple estimator is very hard to beat! Nonetheless, we shall also demonstrate that our VARMA shrinkage may be superior, once a proper search for its parameters is executed. All the functionalities we have expounded on in this article, including the Epanechnikov kernel estimation of h E (λ) and ρ E (λ), and computing the shrunk/cleaned eigenvalues (52), under several sensible models of A, as well as the cross-validation and isotonic regression procedures for the oracle eigenvalues, have been collected in an open-source Python module called shrinkage, available at https://github.com/yedrek/shrinkage. It is versatile, extensible, and, we hope, easy to use [26] . 
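A sketch of the moving-window cross-validation oracle (62) and of the isotonic-regression fit, following the fold conventions described above; scikit-learn is used for the isotonic fit, and the helper names are hypothetical rather than the shrinkage library's API.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def cv_oracle_eigenvalues(Y, T, T_out):
    """Moving-window cross-validation estimate of the oracle eigenvalues (62).
    Y is N x T_total with samples in temporal order; each of the K = (T_total - T) // T_out
    folds uses a train window of length T followed by a test window of length T_out."""
    N, T_total = Y.shape
    K = (T_total - T) // T_out
    xi = np.zeros(N)
    for mu in range(K):
        t0 = T + mu * T_out                       # first test index of fold mu (0-based)
        Y_train, Y_test = Y[:, t0 - T:t0], Y[:, t0:t0 + T_out]
        E_train = Y_train @ Y_train.T / T
        E_test = Y_test @ Y_test.T / T_out
        lam, V = np.linalg.eigh(E_train)          # columns of V are the train eigenvectors
        xi += np.einsum("ij,jk,ki->i", V.T, E_test, V)   # <lambda_i| E_test |lambda_i>
    return xi / K

def isotonic_oracle(Y, T, T_out):
    """Monotonic (isotonic) fit through the noisy CV-oracle eigenvalues, ordered by the
    ascending train eigenvalues."""
    xi_cv = cv_oracle_eigenvalues(Y, T, T_out)
    ranks = np.arange(len(xi_cv))
    return IsotonicRegression(increasing=True).fit_transform(ranks, xi_cv)
```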
Below we present a few examples produced by this library, at version 1.1.0. We work here with synthetic data, that is, we randomly generate the data matrix Y of dimensions N = 500, T = 1000 according to some given models of C and A. Since the true C is known, we will be able to quantify the efficiency of various shrinkage methods. We will measure it by the mean squared error (4); more precisely, for a given shrinkage estimator Ξ we will calculate its Frobenius ratio, The lower it is, the better the cleaning scheme; in other words, Ξ should be "closer" to C than E is. The Frobenius ratios found in the analysis to follow are summarized in table I. First, we assume a multivariate Gaussian distribution of Y , with the underlying true covariance matrix C having two distinct eigenvalues, 1 and 3, in 50% proportions, and with the auto-correlation matrix A given by the exponential decay model (34), with the auto-correlation time τ = 3. This case encompasses the first group of figures 1. Figures 1a, 1b, 1c all show the same histogram of the sample eigenvalues λ i 's (peach), as well as the Epanechnikov-kernel-approximated density of eigenvalues ρ E (λ) (orange), enveloping the histogram. We see how broad this distribution is, completely obscuring behind statistical noise the true eigenvalues 1 and 3. We expect that shrinkage will narrow down this histogram, making it more localized about the true eigenvalues. The first shrinkage approach we choose is the simplest, but very powerful isotonic regression fitted to the crossvalidation-estimated oracle eigenvalues (62) (with K = 10 and T out = 50). Their histogram (aqua) and density (teal ) in figure 1a reveal two clear peaks close to 1 and 3. The very low Frobenius ratio of 12% further confirms how effective this shrinkage prescription is. In figure 1b we suppose for a moment that we are ignorant of the fact that there are auto-correlations in Y , and we choose to apply the Ledoit-Péché shrinkage (53) (which recall is valid for the sandwich model with no auto-correlations (30)). The histogram (pale pink ) and density (pale purple) of these ξ LP i is only slightly narrowed as compared to the sample spectrum. The Frobenius ratio of 53% is large. One might argue at this point [28] that the reason why the Ledoit-Péché formula does not shrink the eigenspectrum enough is that it underestimates the noise level q = N/T . Indeed, when auto-correlations with a characteristic time τ eff are present in the system, the effective number of samples is T eff = T (1 − e −1/τ eff ) ∼ T /τ eff , smaller than T . An interesting exercise would therefore be to find an effective number of samples (by varying τ eff ) such that the Ledoit-Péché estimator performs better in terms of the Frobenius ratio. A caveat here is that it would be illegal to minimize the Frobenius ratio itself; it is, after all, an unknown quantity, dependent on C. One can however employ the following trick: fit rather to the scatter points of the cross-validation-estimated oracle eigenvalues (by minimizing the MSE between the two). Indeed, we thus get T eff = 426, with the Frobenius ratio F = 33%. It is significantly lower than for the true-T Ledoit-Péché result, thus demonstrating the validity of this effective approach. Visually, the histogram (lilac) and density (violet) in figure 1b start slowly developing two peaks in the vicinity of 1 and 3. Finally, apply our VARMA shrinkage formula (52). 
Since the data-generating process is known, and in particular A is that of an exp-decay model with τ = 3, one could be tempted to use the same VARMA parameters in the shrinkage formula. But one should immediately object that this information is not available in any realworld scenario. However, just like with the effective T eff above, there is a way to fit τ to the data even without knowing C; indeed, to the cross-validation-estimated oracle eigenvalues. In this way, τ becomes an effec- tive parameter, giving our model greater flexibility than the Ledoit-Péché case. Figure 1c shows the histogram (mauve) and density (maroon) of the shrunk eigenvalues ξ i . The fit here is τ = 2.9 (probably by some numerical accident slightly different from 3), and we observe how the broad expansion of the sample eigenvalues is efficiently shrunk to two peaks close to 1 and 3, the true eigenvalues. The Frobenius ratio is much lower, too, at about 12%. This number is comparable to the (much simpler) isotonic regression estimator from figure 1a. B. Example 2: Student-t + VMA (1) To further simulate a real-world situation, generate now the data matrix Y from the Student-t distribution with µ = 5 degrees of freedom (i.e., with heavy tails), and with the auto-correlation matrix A of the VMA(1) (vector moving average) model with parameters a 0 = 0.8, a 1 = 0.5, cf. (55). In other words, the distribution is quite far from the Gaussian, nor the correlations between samples are exponentially decaying. (The model for C stays as before, with two peaks at 1 and 3.) Looking at figure 2 we see that now all the four shrinkage estimators perform quite similarly, with the best isotonic regression at 22% Frobenius ratio, the LP at 33%, the effective LP (with fitted T eff = 811) at 26%, while our exp-decay shrinkage (with fitted τ = 1.1) has 24%. Two peaks are visible in all the cases. To make the data distribution even more complicated, generate now Y from the Student-t distribution with 3 degrees of freedom (so, very heavy tails), C with eigenvalues following the inverse-Wishart distribution with κ = 2, and A that of a VARMA(1, 1) model (55) with a 0 = 0.8, a 1 = 0.5, b 1 = 0.4. Recall [4] that for C to have random eigenvalues from the inverse-Wishart distribution with a parameter κ, we first calculate a noise ratio q IW = 1/(1 + 2κ), and the corresponding number of samples, T IW = N/q IW . Generate an i.i.d. Gaussian random matrix R of shape N × T IW , and form from it a Wishart random matrix, W = 1 T IW RR , i.e., its sample estimator. Finally, C = (1 − q IW )W −1 . Figure 3 reveals the same set of graphs as in the previous two examples. There is one difference in the setup, though: Shown in figure 3c are the shrunk eigenvalues corresponding not to an exp-decay shrinkage with some optimal τ fitted to the cross-validation-estimated oracle eigenvalues, as we did before, but a more general threeparameter VARMA(1, 1) shrinkage. We fit a 0 , a 1 , b 1 by brute-force, simply iterating over a certain grid of values (quite coarse, as this is simply a proof-of-concept), and selecting those for which the MSE distance to the estimated oracle eigenvalues is the smallest. And the main finding from this analysis is that the Frobenius ratio of 27% of such an optimal fit outperforms the isotonic regression estimator, the latter with 34%. 
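For completeness, the inverse-Wishart construction of C quoted above, together with the Frobenius-ratio diagnostic used throughout this section, can be sketched as follows; the Frobenius-ratio convention (a plain ratio of Frobenius norms) is an assumption consistent with the description that Ξ should be "closer" to C than E is, and the helper names are illustrative.

```python
import numpy as np

def inverse_wishart_covariance(N, kappa, rng=None):
    """Population matrix C with inverse-Wishart eigenvalues, following the recipe above:
    q_IW = 1/(1 + 2*kappa), T_IW = N/q_IW, W = (1/T_IW) R R^T with i.i.d. Gaussian R,
    and C = (1 - q_IW) W^{-1}."""
    rng = np.random.default_rng(rng)
    q_iw = 1.0 / (1.0 + 2.0 * kappa)
    T_iw = int(round(N / q_iw))
    R = rng.standard_normal((N, T_iw))
    W = R @ R.T / T_iw
    return (1.0 - q_iw) * np.linalg.inv(W)

def frobenius_ratio(Xi, E, C):
    """F = ||Xi - C||_F / ||E - C||_F (assumed convention); values below 1 mean Xi improves on E."""
    return np.linalg.norm(Xi - C) / np.linalg.norm(E - C)

# Example 3 uses kappa = 2, i.e. q_IW = 0.2 and T_IW = 5 N:
C_example3 = inverse_wishart_covariance(N=500, kappa=2, rng=0)
```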
In other words, despite severe model mismatch, the mere fact of incorporating auto-correlations into the shrinkage formula vastly improves the quality of eigenvalue cleaning, sometimes even better than the numerical isotonic regression scheme. This bodes well for empirical, real-world applications, where the underlying datagenerating processes are genuinely unknown. The main result of this paper is an exact formula (50), (51), (52) for the rotationally-invariant estimator (RIE, that is, sharing the eigenbasis with the standard sample estimator (3), but with different eigenvalues) (10) of the true covariance matrix (1), which is optimal w.r.t. the mean squared error (4) . The result is crucial in the "big data" regime (5), when the number of correlated entities N is large and of comparable magnitude with the number of collected samples T , since then the classical sample estimator contains mostly statistical noise instead of any information about actual correlations. The main novelty of our formula is that it allows for correlations between samples, thus extending the method of Ledoit and Péché [11] . Such auto-correlations effectively decrease the number of samples T , thus making the estimation task even harder; a specifically tailored solution such as we present here is therefore of great importance. We couple our analytical expression with a straightforward numerical scheme based on the kernel estimation method of Ledoit and Wolf. This means a numerically stable and effective pipeline which leads directly from the observed dataset to the cleaned/shrunk eigenvalues. We implement this end-to-end in an open-source Python library, encouraging its use to practitioners of various fields of science and industry where the problem of covariance estimation in the big data regime appears. We leave for further work the following questions: One clear task is to apply our shrinkage formula in a realistic situation, for instance, a Markowitz-type portfolio optimization, on different markets, hopefully demonstrating that taking into account auto-correlations leads to better performance. Another question is to consider more general, but still tractable, models of auto-correlations, by which we mean that their χ-transform is analytically accessible, akin to (54). We have already checked that some loworder VARMA models, in particular VAR(1), VMA(1), VARMA(1, 1), VAR(2), VMA(2), belong to this class, and in fact, they are now available in the shrinkage library. In fact, we have used here a VARMA(1, 1) model; it would be very beneficial to devise better methods of fitting parameters of such higher-order models (instead of a brute-force evaluation over a given grid), and interesting to see how they perform, especially compared to the isotonic regression estimator. We have talked about how EWMA and Maronna estimators naturally fit within our framework of autocorrelations. One task would be to investigate the performance of an EWMA estimator in a situation of nonstationary data, such as on financial markets. On the other hand, the Maronna estimator has a much more involved form, and it would be interesting to see if it can be handled analytically. This whole work falls within the class of RIEs, in which the key assumption is that we possess no belief about the eigenvectors of C. Once such a bias is available, though, for example as a factor model, one could attempt to work out an appropriate optimal estimator. 
Finally, and more speculatively, one could look at non-symmetric correlations, such as between different entities at different time moments, say with a time lag of one. Would it be possible to extend the matrix Marčenko-Pastur equation to this case, working out a proper generalization of the replica trick and the HCIZ integral? Acknowledgments. We thank Artur Święch, Vincent W.C. Tan, and Christopher Wells for interesting discussions. Appendix A. In this appendix we sketch a derivation of the scalar Marčenko-Pastur equation (38). An important auxiliary result concerns multiplying rectangular random matrices. If W is N × T, while V is T × N, then WV is N × N, and VW is T × T. Since the moments (25) are normalized by the matrix dimension, we find, due to the cyclic property of trace, m_{VW,k} = (1/T) Tr (VW)^k = q (1/N) Tr (WV)^k = q m_{WV,k}. Thus, the scalar moment generating function (23) satisfies the same relation, m_{VW}(z) = q m_{WV}(z). A short manipulation leads then to the corresponding relation (A1) between the S-transforms (28) [19]. For brevity, denote the prefactor appearing there by f, as it won't be important in the following calculation; the important piece is the q in the argument on the right hand side. Consider now the S-transform of the sample estimator in the sandwich model (36). Since the S-transform depends on the matrix through its moments, we have, as a consequence of the cyclic property of trace, an expression in which the second equality comes from the free multiplication law (29). Now, in the first term, we move X to the front; this changes the dimension of the matrix, so we need to compensate according to (A1); again the second equality is the multiplication law. We now again move X to the front, in the first term, obtaining for the last expression the value of S_{(1/T)XX†}(z) S_A(qz), according to (A1). The first matrix here is the standard Wishart random matrix (37), and so we can write altogether, S_E(z) = S_C(z) S_A(qz) S_W(z), with the q in the argument of S_A reminiscent of (A1). The S-transform of the Wishart matrix is well-known [27], S_W(z) = 1/(1 + qz), which completes our proof of the scalar Marčenko-Pastur equation (38), based on the free multiplication law (29).
[Figure caption (fragments): (a) Shrinkage model: isotonic regression fitted to cross-validation-estimated oracle eigenvalues, with Frobenius ratio 34%. (b) Shrinkage model: Ledoit-Péché, with Frobenius ratio 60%. Ledoit-Péché with "effective" T_eff = 565, and Frobenius ratio 37%.]
Notes.
[26] A good entry point to the library is the User's Guide Jupyter notebook present in the repository, with detailed descriptions of its structure. For a quick look at the whole pipeline, the reader may consult its Sec. 4.3.6, in which an end-to-end problem is solved.
[28] We thank the anonymous referee for this suggestion.