Cross-Validated Loss-Based Covariance Matrix Estimator Selection in High Dimensions
Philippe Boileau, Nima S. Hejazi, Mark J. van der Laan, Sandrine Dudoit
2021-02-19

The covariance matrix plays a fundamental role in many modern exploratory and inferential statistical procedures, including dimensionality reduction, hypothesis testing, and regression. In low-dimensional regimes, where the number of observations far exceeds the number of variables, the optimality of the sample covariance matrix as an estimator of this parameter is well-established. High-dimensional regimes do not admit such a convenience, however. As such, a variety of estimators have been derived to overcome the shortcomings of the sample covariance matrix in these settings. Yet, the question of selecting an optimal estimator from among the plethora available remains largely unaddressed. Using the framework of cross-validated loss-based estimation, we develop the theoretical underpinnings of just such an estimator selection procedure. In particular, we propose a general class of loss functions for covariance matrix estimation and establish finite-sample risk bounds and conditions for the asymptotic optimality of the cross-validated estimator selector with respect to these loss functions. We evaluate our proposed approach via a comprehensive set of simulation experiments and demonstrate its practical benefits by application in the exploratory analysis of two single-cell transcriptome sequencing datasets. A free and open-source software implementation of the proposed methodology, the cvCovEst R package, is briefly introduced.

The covariance matrix underlies numerous exploratory and inferential statistical procedures central to analyses regularly performed in diverse fields. For instance, in computational biology, this statistical parameter serves as a key ingredient in many popular dimensionality reduction, clustering, and classification methods which are regularly relied upon in quality control assessments, exploratory data analysis, and, recently, the discovery and characterization of novel types of cells. Other important areas in which the covariance matrix is critical include financial economics, climate modeling and weather forecasting, imaging data processing and analysis, probabilistic graphical modeling, and text corpora compression and retrieval. Even more fundamentally, the covariance matrix plays a key role in assessing the strengths of linear relationships within multivariate data, in generating simultaneous confidence bands and regions, and in the construction and evaluation of hypothesis tests. Accurate estimation of this parameter is therefore essential. When the number of observations in a data set far exceeds the number of features, the estimator of choice for the covariance matrix is the sample covariance matrix: it is an efficient estimator under minimal regularity assumptions on the data-generating distribution (Anderson, 2003; Smith, 2005). In high-dimensional regimes, however, this simple estimator has undesirable properties. When the number of features outstrips the number of observations, the sample covariance matrix is singular. Even when the number of observations slightly exceeds the number of features, the sample covariance matrix is numerically unstable on account of an overly large condition number (Golub and Van Loan, 1996).
Its eigenvalues are also generally over-dispersed when compared to those of the population covariance matrix (Johnstone, 2001; Ledoit and Wolf, 2004) : the leading eigenvalues are positively biased, while the trailing eigenvalues are negatively biased (Marčenko and Pastur, 1967) . High-dimensional data have become increasingly widespread in myriad scientific domains, with many examples arising from challenges posed by cutting-edge biological sequencing technologies. Accordingly, researchers have turned to developing novel covariance matrix estimators to remediate the deficiencies of the sample covariance matrix. Under certain sparsity assumptions, Bickel and Levina (2008b,a) ; Rothman et al. (2009) ; Lam and Fan (2009) ; Cai et al. (2010) , and Cai and Liu (2011) , among others, demonstrated that the true covariance matrix can be estimated consistently under specific losses by applying elementwise thresholding or tapering functions to the sample covariance matrix. Another thread of the literature, which includes notable contributions by Stock and Watson (2002); Bai (2003) ; Fan et al. (2008 Fan et al. ( , 2013 Fan et al. ( , 2016 Fan et al. ( , 2019 , and Onatski (2012) , has championed methods employing factor models in covariance matrix estimation. Other popular proposals include the families of estimators inspired by the empirical Bayes framework (Robbins, 1964; Efron, 2012) , formulated by and Ledoit and Wolf (2004 , 2012 . We briefly review several of these estimator families in Section 4 of the Online Supplement. Despite the flexibility afforded by the apparent wealth of candidate estimators, this variety poses many practical issues. Namely, identifying the most appropriate estimator from among a collection of candidates is itself a significant challenge. A partial answer to this problem has come in the form of data-adaptive approaches designed to select the optimal estimator within a particular class (for example, Bickel and Levina, 2008b,a; Cai and Liu, 2011; Fan et al., 2013; Fang et al., 2016; Bartz, 2016) . Such approaches, however, are inherently limited by their focus on relatively narrow families of covariance matrix estimators. The successful application of such estimator selection frameworks requires, as a preliminary step, that the practitioner make a successful choice among estimator families, injecting a degree of subjectivity in their deployment. The broader question of selecting an optimal estimator from among a diverse library of candidates has remained unaddressed. We offer a general loss-based framework building upon the seminal work of van der Laan and Dudoit (2003) ; Dudoit and van der Laan (2005) ; van der Vaart et al. (2006) for asymptotically optimal covariance matrix estimator selection based upon cross-validation. In the cross-validated loss-based estimation framework, the parameter of interest is defined as the risk minimizer, with respect to the data-generating distribution, based on a loss function chosen to reflect the problem at hand. Candidate estimators may be generated in a variety of manners, including as empirical risk minimizers with respect to an empirical distribution over parameter subspaces corresponding to models for the data-generating distribution. One would ideally select as optimal estimator that which minimizes the "true" risk with respect to the data-generating distribution. However, as this distribution is typically unknown, one turns to cross-validation for risk estimation. 
van der Laan and Dudoit (2003) ; Dudoit and van der Laan (2005) ; van der Vaart et al. (2006) have shown that, under general conditions on the data-generating distribution and loss function, the cross-validated estimator selector is asymptotically optimal in the sense that it performs asymptotically as well in terms of risk as an optimal oracle selector based on the true, unknown data-generating distribution. These results extend prior work summarized by Györfi et al. (2002, Ch. 7-8) . Focusing specifically on the covariance matrix as the parameter of interest, we address the choice of loss function and candidate estimators, and derive new, high-dimensional asymptotic optimality results for multivariate cross-validated estimator selection procedures. Requiring generally nonrestrictive assumptions about the structure of the true covariance matrix, the proposed framework accommodates arbitrary families of covariance matrix estimators. The method therefore enables the objective selection of an optimal estimator while fully taking advantage of the plethora of available estimators. As such, it generalizes existing, but more limited, data-adaptive estimator selection frameworks where the library of candidate estimators is narrowed based on available subject matter knowledge, or, as is more commonly the case, for convenience's sake. Consider a data set X n×J = {X 1 , . . . , X n : X i ∈ R J }, comprising n independent and identically distributed (i.i.d.) random vectors, where n ≈ J or n < J. Let X i ∼ P 0 ∈ M, where P 0 denotes the true data-generating distribution and M the statistical model, that is, a collection of possible data-generating distributions P for X i . We assume a nonparametric statistical model M for P 0 , minimizing assumptions on the form of P 0 . We denote by P n the empirical distribution of the n random vectors forming X n×J . Letting E[X i ] = 0 without loss of generality and defining ψ 0 := Var[X i ], we take as our goal the estimation of the covariance matrix ψ 0 . This is accomplished by identifying the "optimal" estimator of ψ 0 from among a collection of candidates, where, as detailed below, optimality is defined in terms of risk. For any distribution P ∈ M, define its covariance matrix as ψ = Ψ(P ), where Ψ is a mapping from the model M to the set of J × J symmetric, positive semi-definite matrices. Furthermore, candidate estimators of the covariance matrix are defined asψ k :=Ψ k (P n ) for k = 1, . . . , K in terms of mappingsΨ k from the empirical distribution P n to Ψ := {ψ ∈ R J×J |ψ = ψ }. While this notation suggests that the number of candidate estimators K is fixed, and we treat it as such throughout, this framework may be extended such that K grows as a function of n and J. It also follows that {ψ = Ψ(P ) : P ∈ M} ⊂ Ψ; that is, the set of all covariance matrices corresponding to the data-generating distributions P belonging to the model M is a subset of Ψ. In order to assess the optimality of estimators in the set Ψ, we introduce a generic loss function L(X; ψ, η) characterizing a cost applicable to any ψ ∈ Ψ and X ∼ P ∈ M, and where η is a (possibly empty) nuisance parameter. Specific examples of loss functions for the covariance estimation setting are proposed in Section 3.1. Define H as the mapping from the model M to the nuisance parameter space H := {η = H(P ) : P ∈ M} and let η :=Ĥ(P n ) denote a generic nuisance parameter estimator, whereĤ is a mapping from P n to H. 
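To make the estimator-library abstraction concrete, a minimal sketch in Python/numpy follows. The paper's accompanying software is the cvCovEst R package, so every function name below is hypothetical and purely illustrative: each candidate Ψ̂_k is simply a mapping from the observed data matrix to a J × J symmetric matrix.

```python
import numpy as np

# A candidate covariance matrix estimator is any mapping from an (n x J) data
# matrix to a J x J symmetric matrix. Two simple, hypothetical members of such
# a library:

def sample_covariance(x: np.ndarray) -> np.ndarray:
    """Sample covariance matrix (observations assumed centered, as in the text)."""
    n = x.shape[0]
    return x.T @ x / n

def identity_shrinkage(x: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Convex combination of the sample covariance matrix and a scaled identity."""
    s = sample_covariance(x)
    j = s.shape[0]
    return (1.0 - lam) * s + lam * (np.trace(s) / j) * np.eye(j)

# A library of K candidates, indexed k = 1, ..., K as in the text.
candidate_library = {
    "sample": sample_covariance,
    "shrink_0.25": lambda x: identity_shrinkage(x, lam=0.25),
    "shrink_0.50": lambda x: identity_shrinkage(x, lam=0.50),
}
```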
Given any η ∈ H, the risk under P ∈ M for any ψ ∈ Ψ is defined as the expected value of L(X; ψ, η) with respect to P:
\[
\Theta(\psi, \eta, P) := \mathbb{E}_P\big[L(X; \psi, \eta)\big] = \int L(x; \psi, \eta)\, dP(x).
\]
Under the additional constraint on the loss function that a risk minimizer exists under the true data-generating distribution P_0, the minimizer is given by the parameter of interest
\[
\psi_0 := \Psi(P_0) \in \operatorname*{argmin}_{\psi \in \Psi} \Theta(\psi, \eta_0, P_0), \tag{1}
\]
where η_0 := H(P_0). The risk minimizer need not be unique. The optimal risk under P_0 is
\[
\theta_0 := \Theta(\psi_0, \eta_0, P_0) = \min_{\psi \in \Psi} \Theta(\psi, \eta_0, P_0),
\]
which is to say that a risk minimizer ψ_0 attains risk θ_0. For any given estimator ψ̂_k of ψ_0, its conditional risk given P_n with respect to the true data-generating distribution P_0 is
\[
\tilde{\theta}_n(k, \eta_0) := \Theta\big(\hat{\Psi}_k(P_n), \eta_0, P_0\big).
\]
Defining the risk difference of the k-th estimator as θ̃_n(k, η_0) − θ_0, the index of the estimator that achieves the minimal risk difference is
\[
\tilde{k}_n := \operatorname*{argmin}_{k \in \{1, \ldots, K\}} \big\{\tilde{\theta}_n(k, \eta_0) - \theta_0\big\}.
\]
The subscript n emphasizes that the risk and optimal estimator index are conditional on the empirical distribution P_n. They are therefore random variables.

Given the high-dimensional nature of the data, it is generally most convenient to study the performance of estimators of ψ_0 using Kolmogorov asymptotics, that is, in the setting in which both n → ∞ and J → ∞ such that J/n → m < ∞. Historically, estimators have been derived within this high-dimensional asymptotic regime to improve upon the finite-sample results of estimators brought about by traditional asymptotic arguments. After all, the sample covariance matrix retains its asymptotic optimality properties when J is fixed, even though it is known to perform poorly in high-dimensional settings.

Naturally, it would be desirable for an estimator selection procedure to select the estimator indexed by k̃_n; however, this quantity depends on the true, unknown data-generating distribution P_0. As a substitute for the candidates' true conditional risks, we employ instead the cross-validated estimators of these same conditional risks. Cross-validation (CV) consists of randomly, and possibly repeatedly, partitioning a data set into a training set and a validation set. The observations in the training set are fed to the candidate estimators and the observations in the validation set are used to evaluate the performance of these estimators (Breiman and Spector, 1992; Friedman et al., 2001). A range of CV schemes have been proposed and investigated, both theoretically and computationally; Dudoit and van der Laan (2005) provide a thorough review of popular CV schemes and their properties. Among the variety, V-fold CV stands out as an approach that has gained traction on account of its relative computational feasibility and good performance. Any CV scheme can be expressed in terms of a binary random vector B_n, which assigns observations into either the training or validation set. Observation i is said to lie in the training set when B_n(i) = 0 and in the validation set otherwise. The training set therefore contains Σ_i (1 − B_n(i)) = n(1 − p_n) observations and the validation set Σ_i B_n(i) = np_n observations, where p_n is the fixed validation set proportion corresponding to the chosen CV procedure. The empirical distributions of the training and validation sets are denoted by P^0_{n,B_n} and P^1_{n,B_n}, respectively, for any given realization of B_n. B_n, we emphasize, is independent of P_n. Using this general definition, the cross-validated estimator of a candidate Ψ̂_k's risk is
\[
\hat{\theta}_{p_n,n}\big(k, \hat{H}(P^0_{n,B_n})\big) := \mathbb{E}_{B_n}\Big[\Theta\big(\hat{\Psi}_k(P^0_{n,B_n}), \hat{H}(P^0_{n,B_n}), P^1_{n,B_n}\big)\Big]
\]
for a nuisance parameter estimator mapping Ĥ. Here, E_{B_n}[·] denotes the expectation with respect to B_n.
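A hedged Python/numpy sketch of the cross-validated risk estimator just defined, together with the corresponding selector introduced next, is given below. It is illustrative only and is not the cvCovEst R implementation: the `loss` argument stands for any observation-level loss L(X; ψ) (the observation-level Frobenius loss proposed in the following section is one such choice), and `candidate_library` is assumed to be a dictionary of estimator mappings like the hypothetical one sketched earlier.

```python
import numpy as np

def vfold_cv_risk(x, estimator, loss, v=5, seed=0):
    """V-fold cross-validated risk of a single candidate estimator.

    For each fold b, the candidate is fit on the training split (the analogue
    of P^0_{n,B_n}) and its loss is averaged over the validation split (the
    analogue of P^1_{n,B_n}); the V fold-specific risks are then averaged,
    mimicking the expectation over the split vector B_n.
    """
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x)) % v
    fold_risks = []
    for b in range(v):
        train, valid = x[folds != b], x[folds == b]
        psi_hat = estimator(train)
        fold_risks.append(np.mean([loss(xi, psi_hat) for xi in valid]))
    return float(np.mean(fold_risks))

def cv_select(x, candidate_library, loss, v=5):
    """Cross-validated selector: return the candidate with the smallest estimated risk."""
    risks = {name: vfold_cv_risk(x, est, loss, v=v)
             for name, est in candidate_library.items()}
    best = min(risks, key=risks.get)
    return best, risks
```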
The corresponding cross-validated selector is
\[
\hat{k}_{p_n,n} := \operatorname*{argmin}_{k \in \{1, \ldots, K\}} \hat{\theta}_{p_n,n}\big(k, \hat{H}(P^0_{n,B_n})\big).
\]
As a benchmark, the unknown cross-validated conditional risk of the k-th estimator is
\[
\tilde{\theta}_{p_n,n}(k, \eta_0) := \mathbb{E}_{B_n}\Big[\Theta\big(\hat{\Psi}_k(P^0_{n,B_n}), \eta_0, P_0\big)\Big].
\]
The cross-validated oracle selector is then
\[
\tilde{k}_{p_n,n} := \operatorname*{argmin}_{k \in \{1, \ldots, K\}} \tilde{\theta}_{p_n,n}(k, \eta_0).
\]
As before, the p_n and n subscripts highlight the dependence of these objects on the CV procedure and the empirical distribution P_n, respectively, thus making them random variables. Ideally, the cross-validated estimator selection procedure should identify a k̂_{p_n,n} that is asymptotically (in n, J, and possibly K) equivalent in terms of risk to the oracle k̃_{p_n,n}, under a set of nonrestrictive assumptions based on the choice of loss function, target parameter space, estimator ranges, and, if applicable, nuisance parameter space, in the sense that
\[
\frac{\tilde{\theta}_{p_n,n}(\hat{k}_{p_n,n}, \eta_0) - \theta_0}{\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}, \eta_0) - \theta_0} \overset{P}{\longrightarrow} 1 \quad \text{as } n, J \to \infty. \tag{2}
\]
That is, the estimator selected via CV is equivalent in terms of risk to the CV scheme's oracle estimator chosen from among all candidates. When Equation (2) holds, a further step may be taken by relating the performance of the cross-validated selector to that of the full-dataset oracle selector, k̃_n:
\[
\frac{\tilde{\theta}_n(\hat{k}_{p_n,n}, \eta_0) - \theta_0}{\tilde{\theta}_n(\tilde{k}_n, \eta_0) - \theta_0} \overset{P}{\longrightarrow} 1 \quad \text{as } n, J \to \infty. \tag{3}
\]
When the cross-validated selection procedure's full-dataset conditional risk difference converges in probability to that of the full-dataset oracle's, the chosen estimator is asymptotically optimal. In other words, the data-adaptively selected estimator performs asymptotically as well, with respect to the chosen loss, as the candidate that would be picked from the collection of estimators if the true data-generating distribution were known.

The choice of loss function should reflect the goals of the estimation task. While loss functions based on the sample covariance matrix and either the Frobenius or the spectral norms are often employed in the covariance matrix estimation literature, Dudoit and van der Laan (2005)'s estimator selection framework is more amenable to loss functions that operate over random vectors. Accordingly, we propose the observation-level Frobenius loss:
\[
L(X; \psi, \eta_0) := \left\lVert XX^\top - \psi \right\rVert^2_{F, \eta_0} := \sum_{j=1}^{J}\sum_{l=1}^{J} \eta_0^{(jl)}\big(X^{(j)}X^{(l)} - \psi^{(jl)}\big)^2, \tag{4}
\]
where X^{(j)} is the j-th element of a random vector X ∼ P ∈ M, ψ^{(jl)} is the entry in the j-th row and l-th column of an arbitrary covariance matrix ψ ∈ Ψ, and η_0 is a J × J matrix acting as a scaling factor, that is, a potential nuisance parameter. For an estimator η̂ of η_0, the cross-validated risk estimator of the k-th candidate estimator Ψ̂_k under the observation-level Frobenius loss is
\[
\hat{\theta}_{p_n,n}(k, \hat{\eta}) := \mathbb{E}_{B_n}\left[\int \left\lVert x x^\top - \hat{\Psi}_k(P^0_{n,B_n}) \right\rVert^2_{F, \hat{\eta}} \, dP^1_{n,B_n}(x)\right].
\]
A natural choice of constant scaling factor is η_0^{(jl)} = 1/J, ∀ j, l. This particular choice of scaling factor is such that, whatever the value of J, ‖I_{J×J}‖²_{F,η_0} = 1. With such a scaling factor, the loss function may be viewed as a relative loss whose yardstick is the J × J identity matrix. A similarly reasonable option for when the true covariance matrix is assumed to be dense is η_0^{(jl)} = 1/J². This weighting scheme effectively computes the average squared error across every entry of the covariance matrix; however, when the scaling factor is constant, it only impacts the interpretation of the loss. Constant scaling factors have no impact on our asymptotic analysis. Since a constant scaling factor need not be estimated, it is not a nuisance parameter in the conventional sense. When the scaling factor of Equation (4) is constant, the risk minimizers are identical for the cross-validated observation-level Frobenius risk and the common cross-validated Frobenius risk (used by, for example, Rothman et al., 2009; Fan et al., 2013; Fang et al., 2016).
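The observation-level Frobenius loss of Equation (4) is straightforward to compute. A small numpy sketch with the constant 1/J scaling factor follows; it is illustrative and not the cvCovEst implementation.

```python
import numpy as np

def obs_frobenius_loss(x_i, psi, eta=None):
    """Observation-level Frobenius loss for one centered observation x_i.

    Returns sum_{j,l} eta[j, l] * (x_i[j] * x_i[l] - psi[j, l])**2. When eta is
    omitted, the constant scaling factor eta[j, l] = 1/J is used, under which
    the J x J identity matrix has unit weighted Frobenius norm.
    """
    j = len(x_i)
    if eta is None:
        eta = np.full((j, j), 1.0 / j)
    resid = np.outer(x_i, x_i) - psi
    return float(np.sum(eta * resid ** 2))

# Example: the loss of the identity matrix, used as a candidate estimate, for a
# single three-dimensional observation.
loss_value = obs_frobenius_loss(np.array([1.0, -2.0, 0.5]), np.eye(3))
```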
Proposition 1. Define the cross-validated Frobenius risk for an estimator Ψ̂_k as
\[
\hat{R}_n(\hat{\Psi}_k, \eta_0) := \mathbb{E}_{B_n}\left[\left\lVert \hat{\Psi}_k(P^0_{n,B_n}) - S_n(P^1_{n,B_n}) \right\rVert^2_{F,\eta_0}\right], \tag{5}
\]
where S_n(P^1_{n,B_n}) is the sample covariance matrix computed over the validation set P^1_{n,B_n}, and η_0 is some constant scaling matrix. Then, R̂_n(Ψ̂_k, η_0) − θ̂_{p_n,n}(k, η_0) is constant with respect to Ψ̂_k(P^0_{n,B_n}), such that
\[
\hat{k}_{p_n,n} = \operatorname*{argmin}_{k \in \{1,\ldots,K\}} \hat{\theta}_{p_n,n}(k, \eta_0) = \operatorname*{argmin}_{k \in \{1,\ldots,K\}} \hat{R}_n(\hat{\Psi}_k, \eta_0).
\]

Note that the traditional Frobenius loss corresponds to the sum of the squared eigenvalues of the difference between the sample covariance matrix and the estimate. Proposition 1 therefore implies the existence of a similar relationship for our observation-level Frobenius loss. It may therefore serve as a surrogate for a loss based on the spectral norm.

We are not restricted to a constant scaling factor matrix. One might consider weighting the covariance matrix's off-diagonal elements' errors by their corresponding diagonal entries, which is especially useful when the random variables are of different scales. Such a scaling factor might offer a more equitable evaluation across all entries of the parameter, for example by weighting the error in the (j, l) entry by the corresponding diagonal entries of ψ_0. Here, η_0 = diag(ψ_0) is a genuine nuisance parameter which can be estimated via the diagonal entries of the sample covariance matrix.

Finally, the covariance matrix ψ_0 is the risk minimizer of the observation-level Frobenius loss if the integral with respect to X and the partial differential operators with respect to ψ are interchangeable.

Proposition 2. Let the integral with respect to X and the partial differential operators with respect to ψ be interchangeable, and let η be some fixed J × J matrix. Then
\[
\psi_0 = \Psi(P_0) \in \operatorname*{argmin}_{\psi \in \Psi} \Theta(\psi, \eta, P_0)
\]
for Θ(·) defined under the observation-level Frobenius loss.

The proof is provided in the Online Supplement. Our proposed loss therefore satisfies the condition of Equation (1). The main results of the paper, however, relate only to the constant scaling factor case. In a minor abuse of notation, we set η_0 = 1, and suppress dependence of the loss function on the scaling factor wherever possible throughout the remainder of the text.

Having defined a suitable loss function, we turn to a discussion of the theoretical properties of the cross-validated estimator selection procedure. Specifically, we present, in Theorem 1, sufficient conditions under which the method is asymptotically equivalent in terms of risk to the commensurate CV oracle selector (as per Equation (2)). This theorem extends the general framework of Dudoit and van der Laan (2005) for use in high-dimensional multivariate estimator selection. Adapting their existing theory to this setting requires a judicious choice of loss function, new assumptions, and updated proofs reflecting the use of high-dimensional asymptotics. Corollary 1 then builds on Theorem 1 and details conditions under which the procedure produces asymptotically optimal selections in the sense of Equation (3). All proofs are provided in Section 1 of the Online Supplement.

Theorem 1. Let X_1, . . . , X_n be a random sample of n i.i.d. random vectors of dimension J, such that X_i ∼ P_0 ∈ M, i = 1, . . . , n. Assume, without loss of generality, that E[X_i] = 0, and define ψ_0 := Var[X_i] = E[X_i X_i^⊤]. Next, define the observation-level Frobenius loss function as L(X; ψ) := ‖XX^⊤ − ψ‖²_{F,1}. Finally, designate p_n as the proportion of observations in any given cross-validated validation set. Consider the following assumptions:

Assumption 1. Each element of the random vector X is bounded almost surely; in particular, there exists a constant M_1 < ∞ such that |X^{(j)} X^{(l)}| ≤ M_1 almost surely for all 1 ≤ j, l ≤ J.

Assumption 2. There exists a constant M_2 < ∞ such that |ψ^{(jl)}| ≤ M_2 for all 1 ≤ j, l ≤ J, uniformly over the parameter space Ψ and over the set of possible estimates {Ψ̂_k(P^0_{n,B_n}) : k = 1, . . . , K}.

Finite-Sample Result. Under Assumptions 1 and 2, for any δ > 0,
\[
0 \le \mathbb{E}_{P_0}\big[\tilde{\theta}_{p_n,n}(\hat{k}_{p_n,n}) - \theta_0\big] \le (1 + 2\delta)\, \mathbb{E}_{P_0}\big[\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0\big] + 2\, c(\delta, M(J))\, \frac{1 + \log(K)}{n p_n}, \tag{6}
\]
where M(J) := 4(M_1 + M_2)² J² and c(δ, M(J)) := 2(1 + δ)² M(J)(1/δ + 1/3).

High-Dimensional Asymptotic Result.
The finite-sample result in Equation (6) has the following asymptotic implications. If c(δ, M(J))(1 + log(K))/(np_n E_{P_0}[θ̃_{p_n,n}(k̃_{p_n,n}) − θ_0]) → 0 and J/n → m < ∞ as n, J → ∞, then
\[
\frac{\mathbb{E}_{P_0}\big[\tilde{\theta}_{p_n,n}(\hat{k}_{p_n,n}) - \theta_0\big]}{\mathbb{E}_{P_0}\big[\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0\big]} \to 1. \tag{7}
\]
If, additionally, c(δ, M(J))(1 + log(K))/(np_n (θ̃_{p_n,n}(k̃_{p_n,n}) − θ_0)) → 0 in probability, then
\[
\frac{\tilde{\theta}_{p_n,n}(\hat{k}_{p_n,n}) - \theta_0}{\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0} \overset{P}{\longrightarrow} 1. \tag{8}
\]

The proof relies on special properties of the random variable Z_k := L(X; Ψ̂_k(P^0_{n,B_n})) − L(X; ψ_0) and on an application of Bernstein's inequality (Bennett, 1962). Together, they are used to show that 2c(δ, M(J))(1 + log(K))/(np_n) is a finite-sample bound for comparing the performance of the cross-validated selector, k̂_{p_n,n}, against that of the oracle selector over the training sets, k̃_{p_n,n}. Once this bound is established, the high-dimensional asymptotic results follow immediately.

Only a few sufficient conditions are required to provide finite-sample bounds on the expected risk difference of the estimator selected via our CV procedure: first, that each element of the random vector X be bounded, and, second, that the entries of all covariance matrices in the parameter space and the set of possible estimates be bounded. Together, these assumptions allow for the definition of M(J), the object permitting the extension of the loss-based estimation framework to the high-dimensional covariance matrix estimation problem. The first assumption is technical in nature: it makes the proofs tractable. While it may appear stringent, and, for instance, is not satisfied by Gaussian distributions, we believe it to be generally applicable. We stress that parametric data-generating distributions, like those exhibiting Gaussianity, rarely reflect reality; that is, they are merely mathematical conveniences.[1] Most random variables, or transformations thereof, arising in scientific practice are bounded by limitations of the physical, electronic, or biological measurement process; thus, our method remains widely applicable. For example, in microarray and next-generation sequencing experiments, the raw data are images on a 16-bit scale, constraining them to [0, 2^16). Similarly, the measurements of immunologic markers, of substantial interest in vaccine efficacy trials of HIV-1, COVID-19, and other infectious diseases, are bounded by the limits of detection and/or quantitation imposed by assay biotechnology.

[1] Anecdotally, one cannot help but find themself reminded that "Everyone is sure of this [that errors are normally distributed] . . . since the experimentalists believe that it is a mathematical theorem, and the mathematicians that it is an experimentally determined fact." (Poincaré, 1912, p. 171)

Verifying that the additional assumptions required by Theorem 1's asymptotic results hold proves to be more challenging. We write f(y) = O(g(y)) if |f| is bounded above by g, f(y) = Ω(g(y)) if f is bounded below by g, and f(y) = ω(g(y)) if f dominates g, all in asymptotics with respect to n and J. Further, a subscript "P" may be added to these bounds, denoting convergence in probability. Now, note that c(δ, M(J))(1 + log(K))/(np_n) = O(J) for fixed p_n and as J/n → m > 0. Then the conditions associated with Equation (7) and Equation (8) hold so long as E_{P_0}[θ̃_{p_n,n}(k̃_{p_n,n}) − θ_0] = ω(J) and θ̃_{p_n,n}(k̃_{p_n,n}) − θ_0 = ω_P(J), respectively. These requirements do not seem particularly restrictive given that the complexity of the problem generally increases as a function of the number of features: there are many more entries in the covariance matrix requiring estimation than there are observations. This intuition is corroborated by our extensive simulation study in the following section.
Consistent estimation in terms of the Frobenius risk is therefore not possible in high-dimensions without additional assumptions about the data-generating process. Some additional insight might be gained by identifying conditions under which these assumptions are unmet for popular structural beliefs about the true covariance matrix. In particular, we consider the sparse covariance matrices defined in and accompanying hard-thresholding estimators (see Section 4.1 of the Online Supplement): Proposition 3 In addition to Assumptions 1 and 2 of Theorem 1, assume that ψ 0 is a member of the following set of matrices: where s(J) is the row sparsity, that the hard-thresholding estimator is in the library of candidates, and that it uses a "sufficiently large" thresholding hyperparameter value in the sense of . Then, by Theorem 2 of , we have Proposition 3 states that the conditions for achieving the asymptotic results of Theorem 1 are not met if the proportion of non-zero elements in the covariance matrix's row with the most non-zero elements converges to zero faster than 1/ log J and the library of candidates possesses a hard-thresholding estimator whose thresholding hyperparameter is reasonable in the sense of Bickel and Levina (2008a)'s Theorem 2 and its subsequent discussion. Plainly, the true covariance matrix cannot be too sparse if the collection of considered estimators contains the hard-thresholding estimator with a correctly specified thresholding hyperparameter value. This implies that banded covariance matrices whose number of bands are fixed for J do not meet the criteria for our theory to apply, assuming that one of the candidate estimators correctly specifies the number of bands. Nevertheless, we observe empirically in Section 4 that our cross-validated procedure selects an optimal estimator when the true covariance matrix is banded or tapered more quickly in terms of n and J than any other type of true covariance matrix. These results are likely explained by the relatively low complexity of the estimation problem in this setting. High-dimensional asymptotic arguments are perhaps unnecessary when the proportion of entries needing to be estimated in the true covariance matrix quickly converges to zero. These limitations of our theory reflect stringent, and typically unverifiable, structural assumptions about the estimand. We reiterate that the conditions of Theorem 1 are generally satisfied. In situations where the true covariance matrix is known to possess this level sparsity, practitioners might instead appeal to Equation (39) of to support their use of a cross-validated estimator selection procedure. This result, coupled with that of Proposition 1, likely explains the aforementioned simulation findings of the banded and tapered covariance matrices. Now, Theorem 1's high-dimensional asymptotic results relate the performance of the cross-validated selector to that of the oracle selector for the CV scheme. As indicated by the expression in Equation (3), however, we would like our cross-validated procedure to be asymptotically equivalent to the oracle over the entire data set. The conditions to obtain this desired result are provided in Corollary 1, a minor adaptation of previous work by Dudoit and van der Laan (2005) . This extension accounts for increasing J, thereby permitting its use in high-dimensional asymptotics. 
Corollary 1. Building upon Assumptions 1 and 2 of Theorem 1, we introduce the additional assumptions that, as n, J → ∞: (i) p_n → 0; (ii) c(δ, M(J))(1 + log(K))/(np_n (θ̃_{p_n,n}(k̃_{p_n,n}) − θ_0)) → 0 in probability; and (iii)
\[
\frac{\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0}{\tilde{\theta}_n(\tilde{k}_n) - \theta_0} \overset{P}{\longrightarrow} 1. \tag{9}
\]
Under these assumptions, it follows that
\[
\frac{\tilde{\theta}_{p_n,n}(\hat{k}_{p_n,n}) - \theta_0}{\tilde{\theta}_n(\tilde{k}_n) - \theta_0} \overset{P}{\longrightarrow} 1. \tag{10}
\]

The proof is a direct application of the asymptotic results of Theorem 1. As before, the assumption that c(δ, M(J))(1 + log(K))/(np_n (θ̃_{p_n,n}(k̃_{p_n,n}) − θ_0)) →_P 0 remains difficult to verify, but essentially requires the estimation error of the oracle to increase quickly as the number of features grows; that is, np_n(θ̃_{p_n,n}(k̃_{p_n,n}) − θ_0) = ω_P(J). We posit that this condition is generally satisfied, similarly to the asymptotic results of Theorem 1. Now, a sufficient condition for Equation (9) is that there exists a γ > 0 such that
\[
\Big((n(1 - p_n))^{\gamma}\big(\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0\big),\; n^{\gamma}\big(\tilde{\theta}_n(\tilde{k}_n) - \theta_0\big)\Big) \overset{d}{\longrightarrow} (Z, Z) \tag{11}
\]
for a single random variable Z with P(Z > a) = 1 for some a > 0. For single-split validation, Equation (9) essentially requires that the (appropriately scaled) distributions of the cross-validated and full-dataset conditional risk differences of their respective oracle selections converge in distribution as p_n → 0. Again, this condition is unrestrictive. As p_n → 0, the composition of each training set becomes increasingly similar to that of the full dataset. The resulting estimates produced by each candidate estimator over the training sets and the full dataset will correspondingly converge. Naturally, so too will the cross-validated and full-dataset conditional risk difference distributions of their respective selections.

While the number of candidates in the estimator library K has been assumed to be fixed in this discussion of the proposed method's asymptotic results, it may be allowed to grow as a function of n and J. Of course, this will negatively impact the convergence rates. The sufficient conditions outlined in the asymptotic results of Theorem 1 are achieved so long as the library of candidates does not grow too aggressively. That is, we can make the additional assumptions that K = o(exp{E_{P_0}[θ̃_{p_n,n}(k̃_{p_n,n}) − θ_0]/J}) and K = o_P(exp{(θ̃_{p_n,n}(k̃_{p_n,n}) − θ_0)/J}) such that the results of Equations (7) and (8) are achieved, respectively.

Finally, we have assumed thus far that E_{P_0}[X] is known. This is generally not the case in practice. In place of a random vector centered at zero, we might instead consider the set of n demeaned random vectors X̃_{n×J}, where X̃_i := X_i − X̄ and X̄^{(j)} := (1/n) Σ_i X_i^{(j)}. It follows from the details given in Remark 1 of the Online Supplement that the asymptotic results of Theorem 1 and Corollary 1 apply to X̃_{n×J}.

4 Simulation Study

We conducted a series of simulation experiments using prominent covariance models to verify the theoretical results of our cross-validated estimator selection procedure. These models are described below.

Model 1: A dense covariance matrix.

Model 2: An AR(1) model, where ψ^{(jl)} = 0.7^{|j−l|}. This covariance matrix, corresponding to a common time-series model, is approximately sparse for large J, since the off-diagonal elements quickly shrink to zero.

Model 3: An MA(1) model, where ψ^{(jl)} = 0.7^{|j−l|} · I(|j − l| ≤ 1). This covariance matrix, corresponding to another common time-series model, is truly sparse. Only the diagonal, subdiagonal, and superdiagonal contain non-zero elements.

Model 4: An MA(2) model. This time-series model is similar to Model 3, but slightly less sparse.

Model 5: A random covariance matrix model. First, a J × J random matrix whose elements are i.i.d. Uniform(0, 1) is generated. Next, entries below 1/4 are set to 1, entries between 1/4 and 1/2 are set to −1, and the remaining entries are set to 0.
The square of this matrix is then computed and added to the identity matrix I J×J . Finally, the corresponding correlation matrix is computed and used as the model's covariance matrix. Model 6: A Toeplitz covariance matrix, where Like the AR(1) model, this covariance matrix is approximately sparse for large J. However, the off-diagonal entries decay less quickly as their distance from the diagonal increases. Model 7: A Toeplitz covariance matrix with alternating signs, where This model is almost identical to Model 6, though the signs of the covariance matrix's entries are alternating. We applied our estimator selection procedure, which we refer to as cvCovEst, using a 5fold CV scheme. The library of candidate estimators is provided in the Online Supplement's Table 1 , which includes details on these estimators' possible hyperparameters. Seventyfour estimators make up the library of candidates. We note that no penalty is attributed to estimators generating rank-deficient estimates, like the sample covariance matrix when J > n, though this limitation is generally of practical importance. When the situation dictates that the resulting estimate must be positive-definite, the library of candidates should be assembled accordingly. To examine empirically the optimality results of Theorem 1, we computed analytically, for each replication, the cross-validated conditional risk differences of the cross-validated selection,k pn,n , and the cross-validated oracle selection,k pn,n . The Monte Carlo expectations of the risk differences stratified by n, J/n, and the covariance model were computed from the cross-validated conditional risk differences. The ratios of the expected risk differences are presented in Figure 1A . These results make clear that, for virtually all models considered, the estimator chosen by our proposed cross-validated selection procedure has a risk difference asymptotically identical on average to that of the cross-validated oracle. A stronger result, corresponding to Equation (8) for Model 8, in which the covariance matrices are more difficult to estimate due to their dense structures, we find that our selector identifies the optimal estimator with probability tending to 1 for n ≥ 200 and J/n = 5. More impressive still are the results presented in Figure 1B that characterize the fulldataset conditional risk difference ratios. For all covariance matrix models considered, with the exception of Model 1, our procedure's selections attain near asymptotic optimality for moderate values of n and J/n. This suggests that our loss-based estimator selection approach's theoretical guarantee, as outlined in Corollary 1, is achievable in many practical settings. In addition to verifying our method's asymptotic behavior, we compared its estimates against those of competing approaches. We computed the Frobenius and spectral norms of each procedure's estimate against the corresponding true covariance matrix. The mean norms over all simulations were then computed for each covariance matrix estimation procedure, again stratified by n, J/n, and the covariance matrix model (Figures 3 and 4 of the Online Supplement). Our data-adaptive procedure is found to perform at least as well as the best alternative estimation strategy across all simulation scenarios under both norms. Single-cell transcriptome sequencing (scRNA-seq) allows researchers to study the gene expression profiles of individual cells. 
The fine-grained transcriptomic data that it provides have been used to identify rare cell populations and to elucidate the developmental relationships between diverse cellular states. Given that a typical scRNA-seq data set possesses tens of thousands of features (genes), most workflows prescribe a dimensionality reduction step. In addition to reducing the amount of computational resources needed to analyze the data, reducing the dimensions mitigates the effect of corrupting noise on interesting biological signal. The lower-dimensional embedding is then used in downstream analyses, like novel cell-type discovery via clustering. One of the most popular methods used to reduce the dimensionality of scRNA-seq data is uniform manifold approximation and projection (UMAP) (McInnes et al., 2018) . This method captures the most salient non-linear relationships among a high-dimensional data set's features and projects them onto a reduced-dimensional space. Instead of applying UMAP directly, the scRNA-seq data set's leading principal components (PCs) are often used as an initialization. This initial dimensionality reduction by PCA is believed to play a helpful role in denoising. However, PCA typically relies on the sample covariance matrix, and so when the data set is high-dimensional, the resulting principal components are known to be poor estimates of those of the population (Johnstone and Lu, 2009 ). We hence posit that our cross-validated estimator selection procedure could form a basis for an improved PCA. That is, we hope that the eigenvectors resulting from the eigendecomposition of our estimated covariance matrix could be used to generate a set of estimates closer to the true PCs in terms of risk. These PCs could then be fed to UMAP to produce an enhanced embedding. Indeed, additional simulation results provided in Section 3 of the Online Supplement suggest that cvCovEst produces estimates of the leading eigenvalue at least as well as those produced by the sample covariance matrix, in terms of the spectral norm. We applied our procedure to two scRNA-seq data sets for which the cell types are known a priori. These data were obtained from the scRNAseq Bioconductor R package (Risso and Cole, 2020), and prepared for analysis using a workflow outlined in Amezquita et al. (2020) . A 5-fold CV scheme was used; the library of candidate estimators is provided in the Online Supplement's Table 2 . We expect that cells of the same class will form tight, distinct clusters within the low-dimensional representations. The resulting embeddings, which we refer to as the cvCovEst-based embeddings, were then compared to those produced by UMAP using traditional PCA for initialization, which we refer to as the PCA-based embeddings. For each embedding, the 20 leading PCs were fed to UMAP. The first data set is a collection of 285 mouse visual cortex cells (Tasic et al., 2016) , and the second data set consists of 2,816 mouse brain cells (Zeisel et al., 2015) . The 1,000 most variable genes of each data set were used to compute the PCs of both embeddings. The resulting UMAP plots are presented in Figure 2 . Though the two embeddings generated for each data set are qualitatively similar, the low-dimensional representation relying on our loss-based approach is more refined in Figure 2A . Figure 3B provides further insight into the discrepancies between the UMAP results of Figure 2A : the sample covariance matrix likely over-estimates many of the leading eigenvalues. 
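The covariance-based PCA just described amounts to swapping the sample covariance matrix for an estimated one before the usual eigendecomposition and projection, and then handing the resulting scores to UMAP. The sketch below assumes the Python umap-learn package and a user-supplied `selected_estimator` standing in for whichever candidate the cross-validated procedure selects; the analyses in the paper were carried out with the cvCovEst R workflow, so this is only an illustration of the idea.

```python
import numpy as np
import umap  # assumes the umap-learn package is installed

def covariance_pcs(x, cov_estimator, n_pcs=20):
    """Principal component scores computed from an estimated covariance matrix.

    The columns of x are centered, the covariance matrix is estimated with the
    supplied candidate estimator, and the centered data are projected onto its
    leading eigenvectors (in place of those of the sample covariance matrix).
    """
    x_centered = x - x.mean(axis=0)
    psi_hat = cov_estimator(x_centered)
    _, eigvecs = np.linalg.eigh(psi_hat)        # eigenvalues in ascending order
    leading = eigvecs[:, ::-1][:, :n_pcs]       # top n_pcs eigenvectors
    return x_centered @ leading

def covariance_based_umap(x, selected_estimator, n_pcs=20):
    """Embed the covariance-based principal components with UMAP."""
    pcs = covariance_pcs(x, selected_estimator, n_pcs=n_pcs)
    return umap.UMAP(random_state=0).fit_transform(pcs)
```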
The embeddings in Figure 2B qualitatively identical, and so too are their average silhouette widths. This is expected, the Zeisel et al. (2015) is not truly high-dimensional. The sample covariance matrix likely is a reasonable estimator in this setting. Ideally, dataadaptive selection procedures should be cognizant of this. Indeed, cvCovEst, when applied to the Zeisel et al. (2015) data set, selects an estimator whose cross-validated empirical risk is only slightly smaller than that of the sample covariance matrix, and whose leading PCs are virtually identical ( Figure 2 of the Online Supplement). This work extends Dudoit and van der Laan (2005)'s framework for asymptotically optimal, data-adaptive estimator selection to the problem of covariance matrix estimation in high-dimensional settings. We provide sufficient conditions under which our cross-validated procedure is asymptotically optimal in terms of risk, and show that it generalizes the crossvalidated hyperparameter selection procedures employed by existing estimation approaches. Future work might derive analogous results for other loss functions, or perhaps even for other parameters like the precision matrix. The simulation study provides evidence that near-optimal results are achieved in data sets with relatively modest numbers of observations and many features across models indexed by diverse covariance matrix structures. These results also establish that our cross-validated procedure performs as well as the best bespoke estimation procedure in a variety of settings. Our scRNA-seq data examples further illustrate the utility of our approach in fields where high-dimensional data are collected routinely. Practitioners need no longer rely upon intuition alone when deciding which candidate estimator is best from among a library of diverse estimators. We expect that a variety of computational procedures relying upon the accurate estimation of the covariance matrix beyond the exploratory analyses considered here, like clustering and latent variable estimation, stand to benefit from the application of this framework. Proof. Proposition 1. Assume without loss of generality that E P 0 [X] = 0 and η 0 = 1. Then, where C 1 is constant with respect toΨ k (P 0 n,Bn ). From Equation (5) of the main text, notice that R n (Ψ k , 1) = E Bn Ψ k (P 0 n,Bn ) − S n (P 1 n,Bn ) 2 where C 2 is constant with respect toΨ k (P 0 n,Bn ). Thus,k pn,n = arg min k∈{1,...,K}θpn,n (k, 1) = arg min k∈{1,...,K}Rn (Ψ k , 1). For any a = 1, . . . , J and b = 1, . . . , J, we find that It follows that ψ 0 is a risk minimizer. Let the hard-thresholding estimator minimizing the expected cross-validated conditional risk under P 0 be indexed by k 0 . Then The first inequality follows from the definition of the cross-validated conditional risk under P 0 ; the last equality follows from Theorem 2 (and its subsequent discussion) of , since X is sub-Gaussian by the boundedness condition of Assumption 1 and J/n → m as n, J → ∞. Lemma 1. Let Z k := L(X;Ψ k (P 0 n,Bn )) − L(X; ψ 0 ). Then, where M (J) := 4(M 1 + M 2 ) 2 J 2 . Proof. where the second to last equality follows by noting that E P 0 [X (j) X (l) |P 0 n,Bn , B n ] = E P 0 [X (j) X (l) ] and that, by definition, E P 0 [X (j) X (l) ] = ψ (jl) 0 . Then, Here, the second inequality holds from the application of Jensen's Inequality to the square of the double sum, which is effectively an expectation of a discrete uniform random variable when scaled by J 2 . 
The final inequality results from Assumptions 1 and 2, and concludes the proof. The following lemma is a result taken directly from Dudoit and van der Laan (2005) . It is restated here for convenience. Lemma 2. Let Y 1 , Y 2 , . . . be a sequence of random variables. If E[|Y n |] = O(g(n)) for some positive function g(·), then Y n = O P (g(n)). Proof. We must show that, for each > 0, there exists an N and B > 0 such that P(|Y n |/g(n) > B) < for all n > N . By assumption, there exists an N and a C > 0 such that E[|Y n |]/g(n) < C for all n > N . By defining C/B = and making use of Markov's Inequality, we find that Having derived Lemma 1, the remainder of the proof for Theorem 1 closely follows the proof of the first theorem in Dudoit and van der Laan (2005) . Proof. Theorem 1, Finite-Sample Result. 0 ≤θ pn,n (k pn,n ) − θ 0 Given B n and P 0 n,Bn , let Z k,i , 1 ≤ i ≤ np n , denote the np n i.i.d. copies of Z k corresponding with the validation set, that is, with {X i : B n (i) = 1} (as defined in Lemma 1). Then,Ĥ k = i Z k,i /np n andH k = E P 0 [Z k,i | P 0 n,Bn , B n ]. Hence,H k −Ĥ k is an empirical mean of np n i.i.d. centered random variables. Further, by Assumptions 1 and 2, |Z k,i | < 2(M 1 + M 2 ) 2 J 2 a.s.. Next, we apply Bernstein's Inequality to the centered empirical meanH k −Ĥ k , using the property of the Z k,i 's derived in Lemma 1, to obtain a bound for the tail probabilities of R k,n (B n ) and T k,n (B n ): Then, for s > 0, Bernstein's Inequality yields Note that And so, for s > 0, where c(δ, M (J)) = 2(1 + δ) 2 M (J)( 1 δ + 1 3 ). The same bound applies for the marginal probabilities of P P 0 (R k,n (B n ) > s) since they hold for arbitrary choices of B n and P 0 n,Bn . The second inequality follows from K being larger than or equal to 1 by definition. Finally, for any u > 0, we have by the properties of expectations and the previously derived result that Since the expression on the right-hand side of the inequality above achieves its minimum value of c(δ, M (J))(1 + log(K)/(np n ) at u n = c(δ, M (J))log(K)/(np n ), then The same bound applies to E P 0 [Tk ,n ]. Therefore, taking the expected values of the inequality in Equation (2), we produce the desired finite-sample result in Equation (6) of the main text. Proof. Theorem 1, High-Dimensional Asymptotic Result. The expected risk differences ratio's convergence follows directly from the main text's Equation (6) for some δ > 0, so long as c(δ, M (J))(1 + log(K))/(np n E P 0 [θ pn,n (k pn,n ) − θ 0 ]) → 0 as n, J → ∞. Given the assumption in Kolmogorov asymptotics that J/n → m < ∞ as J, n → ∞, an equivalent condition is that m(M 1 +M 2 ) 2 J(1+log(K))/(p n E P 0 [θ pn,n (k pn,n )−θ 0 ]) → 0 as n, J → ∞. Convergence in probability then follows from Lemma 2. Though there are minor adaptations to the assumptions to reflect the use of high-dimensional asymptotics, the proof of Corollary 1 follows that of Corollary 1 in Dudoit and van der Laan (2005) . Proof. Corollary 1. The asymptotic statement of the main text's Equation (10) is an immediate result of Theorem 1's Equation (8) Letting Z 1,n := (n(1 − p n )) γ (θ pn,n (k pn,n ) − θ 0 ) and Z 2,n := n γ (θ n (k n ) − θ 0 ), and assuming that the sufficient condition in Equation (11) of the main text holds, we find that Z 1,n /Z 2,n d → 1 by the Continuous Mapping Theorem. Then, notice that which yields the desired sufficient condition when p n → 0. In the case of single-split validation, Z 2,n d = Z 1,n/(1−pn) , and so Z 2,n d → Z implies that (Z 1,n , Z 2,n ) d → (Z, Z). 
Remark 1. We assume throughout this work that E_{P_0}[X] = 0 without loss of generality. In practice, however, the mean vector is generally unknown. Consider the uniformly bounded random vector Y such that X = Y − E_{P_0}[Y]. We might therefore consider using the demeaned random vectors Ỹ_i := Y_i − Ȳ, where Ȳ := (1/n) Σ_i Y_i. Employing Ỹ in place of X in Lemma 1, and denoting Z̃_k := L(Ỹ; Ψ̂_k(P^0_{n,B_n})) − L(Ỹ; ψ_0), we find that the analogue of Lemma 1's bound holds for Z̃_k. It then follows that, as n → ∞, the asymptotic results of Theorem 1 and Corollary 1 remain achievable when E_{P_0}[Y] is unknown. The same cannot be said for the finite-sample result of Theorem 1: E_{P_0}[Z̃_k | P^0_{n,B_n}, B_n] is not strictly nonnegative. For large enough values of n, however, we do expect these finite-sample bounds to be approximately correct.

Tables

Table 1: Families of candidate estimators, and their hyperparameters, used in the simulation study.
  Sample covariance matrix: Not applicable
  Hard thresholding: Thresholds = {0.1, 0.2, . . . , 1.0}
  SCAD thresholding (Fan and Li, 2001; Rothman et al., 2009): Thresholds = {0.1, 0.2, . . . , 1.0}
  Adaptive LASSO: Thresholds = {0.1, 0.2, . . . , 0.5}; exponential weights = {0.1, 0.2, . . . , 0.5}
  Banding (Bickel and Levina, 2008b): Bands = {1, 2, . . . , 5}
  Tapering (Cai et al., 2010): Bands = {2, 4, . . . , 10}
  Linear shrinkage (Ledoit and Wolf, 2004): Not applicable
  Dense linear shrinkage: Not applicable
  Nonlinear shrinkage (Ledoit and Wolf, 2018): Not applicable
  POET (Fan et al., 2013)

Table 2: Families of candidate estimators used in single-cell transcriptomic data analyses.

Figure 1: Comparison of the cross-validated (k̂_{p_n,n}) and cross-validated oracle (k̃_{p_n,n}) selections' cross-validated conditional risk differences. The proposed cross-validated selection procedure achieves asymptotic equivalence in most settings for relatively small sample sizes and numbers of features.

We compared the estimates generated by our method against those of the individual candidate procedures using the simulated data sets. This was accomplished by computing the Frobenius norm of each estimate against the corresponding true covariance matrix. The mean norms over all simulations were then computed for each covariance matrix estimation procedure, again stratified by n, J/n, and the covariance matrix model (Figure 3). Our CV scheme was used to select hyperparameters of these competing approaches where necessary. As stated previously, our CV approach is generally equivalent to these estimators' hyperparameter selection procedures. The hyperparameters considered are provided in Table 3. Where appropriate, the competing methods' hyperparameters are more varied than those used by cvCovEst, reflecting more aggressive estimation procedures that one might employ when only using a single family of estimators. We repeated this benchmarking experiment using the spectral norm to assess the accuracy of our estimation procedure with respect to the leading eigenvalue of the covariance matrix. Recall that the spectral norm of a symmetric matrix is its largest absolute eigenvalue. Though our theoretical results do not relate to this norm, outcomes similar to those in Figure 3 are expected given the relationship between these two norms for reasons previously described. The results, presented in Figures 3 and 4, demonstrate that our estimator selection procedure performs at least as well as the best alternative estimation strategy.
This suggests that procedures dedicated to or relying upon the accurate estimation of leading eigenvalues and eigenvectors, like principal component analysis and latent variable estimation, might benefit from the integration of our cross-validated covariance matrix estimation framework. Table 3 : Families of candidate estimators compared against the cross-validated loss-based estimator selection procedure. Note that the library of candidate estimators used by the proposed method is provided in Table 1 4 Using the proposed cross-validated selection procedure effectively requires a large and diverse set of candidate covariance matrix estimators. In this spirit, we provide in the sequel a brief overview of select covariance matrix estimators that have proven to be effective in a variety of settings. Note, however, that the proposed selection framework need not be limited to those described here. Thorough reviews of estimators have been provided by Pourahmadi (2013), Fan et al. (2016) , and Ledoit and Wolf (2020) , to name only a few. An often natural, simplifying assumption about the true covariance matrix's structure is that it is sparse, that is, a non-negligible portion of its off-diagonal elements have a value near zero or equal to zero. This assumption is not altogether unreasonable: Given a system of numerous variables, it seems unlikely in many settings that a majority of these variables would depend on one another. The class of generalized thresholding estimators Rothman et al., 2009; Cai and Liu, 2011) is one collection of covariance matrix estimators based upon this structural assumption. Given the sample covariance matrix S n , the generalized thresholding estimator is defined asΨ gt (P n ; t) := t S (jl) n : j, l = 1, . . . , J , where t(·) is a thresholding function that often requires one or more hyperparameters dictating the amount of regularization. The hard, soft, smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001) , and adaptive LASSO functions are among the collection of suitable thresholding operators. As an example, the hard-thresholding function is defined as n I(S (jl) n > u) for some threshold u > 0. Cai and Liu (2011) also demonstrated that element-specific thresholding functions might be useful when the features' variances are highly variable. The hyperparameters for any specific thresholding function are often selected via CV. Regardless of the choice of t(·), these estimators preserve the symmetry of the sample covariance matrix and are invariant under permutations of the features' order. and Rothman et al. (2009) have shown that these estimators are consistent under the spectral norm (which is defined as a square matrix's largest absolute eigenvalue), assuming that log J/n → 0, that the observations' marginal distributions satisfy a tail condition, and that the true covariance matrix is a member of the class of matrices satisfying a particular notion of "approximate sparsity." Cai and Liu (2011) have derived similar results for their entry-specific thresholding estimator over an even broader parameter space of sparse covariance matrices. A family of estimators related to thresholding estimators are banding and tapering estimators (Bickel and Levina, 2008b; Cai et al., 2010) . Like thresholding estimators, these estimators rely on the assumption that the true covariance matrix is sparse. However, the structural assumption of these estimators on ψ 0 is much more rigid than that of thresholding estimators. 
Specifically, such estimators assume that the true covariance matrix is a band matrix, that is, a sparse matrix whose non-zero entries are concentrated about the diagonal. These estimators therefore require a natural ordering of the observations' features, operating under the assumption that "distant" variables are uncorrelated. Such structure is often encountered in longitudinal and time-series data. Given the sample covariance matrix S n , the banding estimator of Bickel and Levina (2008b) is defined asΨ where b is a hyperparameter that determines the number of bands to retain from the sample covariance matrix and is chosen via a CV procedure. For the class of "bandable" covariance matrices, i.e., the set of well-conditioned matrices whose elements not in the central bands of the matrix decay rapidly, this banding estimator has been shown to be uniformly consistent in the spectral norm so long as log(J)/n → 0. The tapering estimator of Cai et al. (2010) is the smooth generalization of the banding estimator, gradually shrinking the off-diagonal bands of the sample covariance matrix towards zero. It is defined asΨ tap (P n ; b) := W b • S n , for some weight matrix W b . Here, "•" denotes the Hadamard (element-wise) matrix product. Clearly, letting W (jl) b = I(|j − l| ≤ b) for some positive integer b results in the banding estimator. A popular weighting scheme derived by Cai et al. (2010) is which we use in our simulation study presented in Section 4 of the main text. Cai et al. (2010) also derived the optimal rates of convergence for this estimator under the Frobenius and spectral norms, considering a class of bandable covariance matrices that is more general than that considered by Bickel and Levina (2008b) : The smallest eigenvalue of the covariance matrices in this class can take on a value of zero. However, this estimator does not improve upon the bounds set by the banding estimator. We next consider the linear and non-linear shrinkage estimators inspired by Stein's work on empirical Bayesian methods. These estimators are rotation-equivariant, shrinking the eigenvalues of the sample covariance matrix towards a set of target values, whilst leaving its eigenvectors untouched. In doing so, the resultant estimators are better-conditioned than the sample covariance matrix in a manner guaranteeing that the resultant covariance matrix estimator be non-singular. Further, these estimators do not rely on sparsity assumptions about the true covariance matrix, setting them apart from those previously discussed. One of the first shrinkage estimators, the linear shrinkage estimator of the sample covariance matrix, was proposed by Ledoit and Wolf (2004) . This estimator is defined as the convex combination of the sample covariance matrix and the identity matrix. Hence, it represents a compromise between S n , an unbiased but highly variable estimator of ψ 0 in high dimensions, and I J×J , a woefully biased but fixed estimator. Ledoit and Wolf (2004) found that, under mild conditions, the asymptotically optimal shrinkage intensity with respect to the scaled (by J) Frobenius norm can be estimated consistently in high dimensions. This estimator is defined aŝ for m n = tr(S n )/J, d 2 n = S n − m n I 2 F,1/J ,b 2 n = n −2 i X i X i − S n 2 F,1/J , b 2 n = min(b 2 n , d 2 n ), and a 2 n = d 2 n − b 2 n . Bespoke shrinkage targets may be used in place of the identity. 
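To make a few of the families above concrete, here is a hedged numpy sketch of a hard-thresholding estimator, a tapering estimator, and a Ledoit-Wolf-style linear shrinkage estimator built from the m_n, d_n², b̄_n², and a_n² quantities just defined. The tapering weights use the commonly cited linear-decay scheme attributed above to Cai et al. (2010) and should be read as an assumption rather than a quotation of the paper's exact choice; none of this is the cvCovEst implementation.

```python
import numpy as np

def hard_threshold(s, u):
    """Hard-thresholding estimator: keep entries of S_n whose magnitude exceeds u."""
    return np.where(np.abs(s) > u, s, 0.0)

def taper(s, b):
    """Tapering estimator: Hadamard product of S_n with a banded weight matrix.

    Assumed weights: full weight within b/2 of the diagonal, linear decay out to
    b, and zero beyond (setting the decay region to zero recovers banding).
    """
    j = s.shape[0]
    dist = np.abs(np.subtract.outer(np.arange(j), np.arange(j)))
    w = np.clip(2.0 - 2.0 * dist / b, 0.0, 1.0)
    return w * s

def ledoit_wolf_linear(x):
    """Ledoit-Wolf (2004)-style linear shrinkage toward a scaled identity."""
    n, j = x.shape
    s = x.T @ x / n                                  # x assumed column-centered
    m = np.trace(s) / j
    d2 = np.sum((s - m * np.eye(j)) ** 2) / j
    bbar2 = np.mean([np.sum((np.outer(xi, xi) - s) ** 2) / j for xi in x]) / n
    b2 = min(bbar2, d2)
    a2 = d2 - b2
    return (b2 / d2) * m * np.eye(j) + (a2 / d2) * s
```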
For example, one might consider a dense matrix target whose diagonal elements are the average of the sample covariance matrix's diagonal elements and whose off-diagonal elements are equal to the average of all the sample covariance matrix's off-diagonal elements. For the sake of brevity, discussion of such estimators is omitted, but examples are provided by, among others, Ledoit and Wolf (2003) and , particularly for use in financial economics and statistical genomics applications, respectively. When assumptions about the true covariance matrix's structure are unfounded, it can become impossible to select an appropriate linear shrinkage target. Instead, one might consider generalizing these estimators to shrink the eigenvalues of the sample covariance matrix in a non-linear fashion. That is, an estimator that shrinks the sample covariance matrix's eigenvalues not by a common shrinkage factor (as with linear shrinkage estimators) but with shrinkage factors tailored to each sample eigenvalue. As with the aforementioned linear shrinkage estimators, such non-linear shrinkage estimators produce positive-definite estimates so long as the shrunken sample eigenvalues are positive and rotation-equivariant. These estimators belong to a class initially introduced by Stein (1986) and have since seen a resurgence in the work of Wolf (2012, 2015) . More recently, Ledoit and Wolf (2018) derived an analytical non-linear shrinkage estimator that is asymptotically optimal in high dimensions and more computationally efficient than their previously formulated estimators. Covariance matrix estimators based on factor models form another broad family of estimators that do not assume sparsity of the true covariance matrix. Often encountered in econometrics and finance, these estimators utilize the operating assumption that the dataset's observations are functions of a few common, often latent, factors. The factor model can be described as follows: where X J×1 represents a random observation, µ J×1 a mean vector, β J×L a matrix of factor coefficients, F L×1 a random vector of L common factors, and J×1 a random error vector. Assuming that F and are uncorrelated, the covariance matrix of X is given by where ψ is the covariance matrix of the random error. For a review on estimating the covariance matrix in the presence of observable factors, see Fan et al. (2016) . We now briefly discuss the estimation of ψ when the factors are unobservable. Notice that when ψ is not assumed to be diagonal, the decomposition of ψ in Equation (4) is not identifiable for fixed n and J, since the random vector X is the only observed component of the model. By letting J → ∞, and assuming that the eigenvalues of ψ are uniformly bounded or grow at a slow rate relative to J and that the eigenvalues of (1/J) β β are uniformly bounded away from zero and infinity, it can be shown that βCov(F )β is asymptotically identifiable (Cai et al., 2010) . It follows from these assumptions that the signal in the factors increases as the number of features increases, while the noise contributed by the error term remains constant. The eigenvalues associated with βCov(F )β therefore become easy to differentiate from those of ψ . Now, even with βCov(F )β being asymptotically identifiable, β and F cannot be distinguished. As a solution, the following constraint is imposed on F : Cov(F ) = I L×L . It then follows that ψ = ββ + ψ . Under the additional assumption that the columns of β be orthogonal, Fan et al. 
Under the additional assumption that the columns of $\beta$ are orthogonal, Fan et al. (2013) found that the leading $L$ eigenvalues of $\psi$ are spiked, meaning that they are bounded below by some constant (Johnstone, 2001), and grow at rate $O(J)$ as the dimension of $\psi$ increases. The remaining $J - L$ eigenvalues are then either bounded above or grow slowly. This implies that the latent factors and their loadings can be approximated via the eigenvalues and eigenvectors of $\psi$. Fan et al. (2013) therefore proposed the Principal Orthogonal complEment Thresholding (POET) estimator of $\psi$, which was motivated by the spectral decomposition of the sample covariance matrix:
$$S_n = \sum_{j=1}^{J} \lambda_j V_{\cdot,j} V_{\cdot,j}^\top \approx \sum_{j=1}^{L} \beta_{\cdot,j} \beta_{\cdot,j}^\top + \sum_{j=L+1}^{J} \lambda_j V_{\cdot,j} V_{\cdot,j}^\top,$$
where $\lambda_j$ and $V_{\cdot,j}$ are the $j^{\text{th}}$ eigenvalue and eigenvector of $S_n$, respectively, and $\beta_{\cdot,j}$ is the $j^{\text{th}}$ column of $\beta$. For ease of notation, we denote the second term by $S_\varepsilon$ and refer to this matrix as the principal orthogonal complement. The estimator for $L$ latent variables is then defined as
$$\hat{\Psi}_{\text{POET}}(P_n; L, s) := \sum_{j=1}^{L} \lambda_j V_{\cdot,j} V_{\cdot,j}^\top + T_{\varepsilon,s},$$
where $T_{\varepsilon,s}$ is the generalized thresholding matrix of $S_\varepsilon$,
$$T_{\varepsilon,s}^{(jl)} := \begin{cases} S_\varepsilon^{(jj)}, & j = l, \\ s\big(S_\varepsilon^{(jl)}\big), & \text{otherwise}, \end{cases}$$
for some generalized thresholding function $s$. Although this estimator is computationally efficient, the factor model assumptions under which it is derived imply that the eigenvalues associated with the latent factors grow with $J$, which results in a poor convergence rate under the spectral norm (Fan et al., 2016).
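A compact R sketch of the POET construction follows: it retains the leading L eigencomponents of the sample covariance matrix and applies an elementwise thresholding function to the off-diagonal entries of the principal orthogonal complement. The function name, the hard-thresholding rule, and the tuning arguments L and threshold are illustrative choices of ours, not the implementation or the thresholding function used in the main text.

```r
# Illustrative POET-style estimator: top-L eigencomponents of S plus a
# thresholded principal orthogonal complement. Hard thresholding is used
# here purely for illustration; other generalized thresholding functions
# (e.g., soft or adaptive thresholds) could be passed as `s`.
poet_estimator <- function(S, L, threshold,
                           s = function(x) x * (abs(x) > threshold)) {
  eig <- eigen(S, symmetric = TRUE)
  lambda <- eig$values
  V <- eig$vectors

  # Low-rank part: leading L eigencomponents of S.
  V_L <- V[, seq_len(L), drop = FALSE]
  low_rank <- V_L %*% diag(lambda[seq_len(L)], nrow = L) %*% t(V_L)

  # Principal orthogonal complement and its generalized thresholding.
  S_eps <- S - low_rank
  T_eps <- s(S_eps)
  diag(T_eps) <- diag(S_eps)   # diagonal entries are left unthresholded

  low_rank + T_eps
}
```

For instance, poet_estimator(S, L = 3, threshold = 0.1) keeps three eigencomponents and hard-thresholds the off-diagonal entries of the principal orthogonal complement at 0.1.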
References
Orchestrating single-cell analysis with Bioconductor
An introduction to multivariate statistical analysis
Inferential Theory for Factor Models of Large Dimensions
Cross-validation based Nonlinear Shrinkage
Probability inequalities for the sum of independent random variables
Covariance Regularization by Thresholding
Regularized Estimation of Large Covariance Matrices
cvCovEst: Cross-validated covariance matrix estimator selection and evaluation in R
Submodel selection and evaluation in regression. The X-random case
Sparsistency and rates of convergence in large covariance matrix estimation
A well-conditioned estimator for large-dimensional covariance matrices
Nonlinear shrinkage estimation of large-dimensional covariance matrices
Spectrum estimation: A unified framework for covariance matrix estimation and PCA in large dimensions
Analytical nonlinear shrinkage of large-dimensional covariance matrices
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
Asymptotics of the principal components estimator of large factor models with weakly influential factors
Calcul des probabilités
R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing
scRNAseq: Collection of Public Single-Cell RNA-Seq Datasets
The empirical Bayes approach to statistical decision problems
Generalized Thresholding of Large Covariance Matrices
A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics
Covariance, subspace, and intrinsic Cramér–Rao bounds
Forecasting Using Principal Components from a Large Number of Predictors
Adult mouse cortical cell taxonomy revealed by single cell transcriptomics
Unified Cross-Validation Methodology For Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples
Oracle inequalities for multi-fold cross validation
Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq
Adaptive Thresholding for Sparse Covariance Matrix Estimation
Optimal rates of convergence for covariance matrix estimation
Asymptotics of cross-validated risk estimation in estimator selection and performance assessment
Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties
An overview of the estimation of large covariance and precision matrices
Large covariance estimation by thresholding principal orthogonal complements
On the distribution of the largest eigenvalue in principal components analysis
Improved estimation of the covariance matrix of stock returns with an application to portfolio selection
The Power of (Non-)Linear Shrinking: A Review and Guide to Covariance Matrix Estimation
High-dimensional covariance estimation, Wiley series in probability and statistics
Lectures on the theory of estimation of many parameters

$$\begin{aligned}
0 &\leq \tilde{\theta}_{p_n,n}(\hat{k}_{p_n,n}) - \theta_0 \\
&= E_{B_n} \int \Big( L\big(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P^0_{n,B_n})\big) - L(x; \psi_0) \Big) dP_0(x)
 - (1 + \delta)\, E_{B_n} \int \Big( L\big(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P^0_{n,B_n})\big) - L(x; \psi_0) \Big) dP^1_{n,B_n}(x) \\
&\quad + (1 + \delta)\, E_{B_n} \int \Big( L\big(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P^0_{n,B_n})\big) - L(x; \psi_0) \Big) dP^1_{n,B_n}(x) \\
&\leq E_{B_n} \int \Big( L\big(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P^0_{n,B_n})\big) - L(x; \psi_0) \Big) dP_0(x)
 - (1 + \delta)\, E_{B_n} \int \Big( L\big(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P^0_{n,B_n})\big) - L(x; \psi_0) \Big) dP^1_{n,B_n}(x) \\
&\quad + (1 + \delta)\, E_{B_n} \int \Big( L\big(x; \hat{\Psi}_{\tilde{k}_{p_n,n}}(P^0_{n,B_n})\big) - L(x; \psi_0) \Big) dP^1_{n,B_n}(x) \\
&= E_{B_n} \int \Big( L\big(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P^0_{n,B_n})\big) - L(x; \psi_0) \Big) dP_0(x)
 - (1 + \delta)\, E_{B_n} \int \Big( L\big(x; \hat{\Psi}_{\hat{k}_{p_n,n}}(P^0_{n,B_n})\big) - L(x; \psi_0) \Big) dP^1_{n,B_n}(x) \\
&\quad + (1 + \delta)\, E_{B_n} \int \Big( L\big(x; \hat{\Psi}_{\tilde{k}_{p_n,n}}(P^0_{n,B_n})\big) - L(x; \psi_0) \Big) dP^1_{n,B_n}(x)
 - (1 + 2\delta)\, E_{B_n} \int \Big( L\big(x; \hat{\Psi}_{\tilde{k}_{p_n,n}}(P^0_{n,B_n})\big) - L(x; \psi_0) \Big) dP_0(x) \\
&\quad + (1 + 2\delta)\big(\tilde{\theta}_{p_n,n}(\tilde{k}_{p_n,n}) - \theta_0\big).
\end{aligned} \qquad (1)$$
The first inequality holds by assumption, and the second by the definition of $\hat{k}_{p_n,n}$, which is such that $\hat{\theta}_{p_n,n}(\hat{k}_{p_n,n}) \leq \hat{\theta}_{p_n,n}(k)$ for all $k$. For simplicity in the remainder of the proof, we replace $\hat{k}_{p_n,n}$ and $\tilde{k}_{p_n,n}$ with $\hat{k}$ and $\tilde{k}$, respectively, in a slight abuse of notation. Now, let the first two terms of the last expression in Equation (1) be denoted by $R_{\hat{k},n}$, and the third and fourth terms by $T_{\tilde{k},n}$. The last term is the cross-validated oracle's risk difference, $(1 + 2\delta)\big(\tilde{\theta}_{p_n,n}(\tilde{k}) - \theta_0\big)$.
Thus,
$$0 \leq \tilde{\theta}_{p_n,n}(\hat{k}) - \theta_0 \leq (1 + 2\delta)\big(\tilde{\theta}_{p_n,n}(\tilde{k}) - \theta_0\big) + R_{\hat{k},n} + T_{\tilde{k},n}.$$
We next show that $E_{P_0}[R_{\hat{k},n} + T_{\tilde{k},n}] \leq 2\, c(\delta, M(J))\, (1 + \log(K)) / (n p_n)$, where $c(\delta, M(J)) = 2 (1 + \delta)^2 M(J) (1/\delta + 1/3)$ for some $\delta > 0$. For convenience, let
$$\hat{H}_k := \int \Big( L\big(x; \hat{\Psi}_k(P^0_{n,B_n})\big) - L(x; \psi_0) \Big) \, dP^1_{n,B_n}(x), \qquad
\tilde{H}_k := \int \Big( L\big(x; \hat{\Psi}_k(P^0_{n,B_n})\big) - L(x; \psi_0) \Big) \, dP_0(x),$$
and define
$$R_{k,n}(B_n) := (1 + \delta)\big(\tilde{H}_k - \hat{H}_k\big) - \delta \tilde{H}_k, \qquad
T_{k,n}(B_n) := (1 + \delta)\big(\hat{H}_k - \tilde{H}_k\big) - \delta \tilde{H}_k,$$
so that $R_{k,n} = E_{B_n}[R_{k,n}(B_n)]$ and $T_{k,n} = E_{B_n}[T_{k,n}(B_n)]$.