title: Fast construction of optimal composite likelihoods
authors: Huang, Zhendong; Ferrari, Davide
date: 2021-06-09

A composite likelihood is a combination of low-dimensional likelihood objects, useful in applications where the data have complex structure. Although the construction of the composite likelihood is a crucial aspect influencing both the computational and the statistical properties of the resulting estimator, there does not currently seem to exist a universal rule for combining low-dimensional likelihood objects that is statistically justified and fast in execution. This paper develops a methodology to select and combine the most informative low-dimensional likelihoods from a large set of candidates while carrying out parameter estimation. The new procedure minimizes the distance between the composite likelihood and full likelihood scores, subject to a constraint representing the afforded computing cost. The selected composite likelihood is sparse in the sense that it contains a relatively small number of informative sub-likelihoods, while the noisy terms are dropped. The resulting estimator is found to have asymptotic variance close to that of the minimum-variance estimator constructed using all the low-dimensional likelihoods.

The likelihood function is central to many statistical analyses. There are a number of situations, however, where the full likelihood is computationally intractable or difficult to specify. These challenges have motivated the development of composite likelihood methods, which avoid intractable full likelihoods by combining a set of low-dimensional likelihood objects. Due to its flexible framework and computational advantages, composite likelihood inference has become popular in many areas of statistics; see Varin et al. (2011) for an overview and applications.

Let $X$ be a $d \times 1$ random vector with density in the family $f(x; \theta)$, where $\theta \in \Theta \subseteq \mathbb{R}^p$ is an unknown parameter, and let $\theta^*$ denote the true parameter. Suppose that the full $d$-dimensional distribution is difficult to specify or compute, but that it is possible to specify $m$ tractable pdfs $f_1(s_1; \theta), \dots, f_m(s_m; \theta)$ for sub-vectors $S_1, \dots, S_m$ of $X$, each of dimension smaller than $d$. For example, $S_1$ may represent a single element such as $X_1$, a variable pair such as $(X_1, X_2)$, or a conditional sub-vector such as $X_1 \mid X_2$. The total number of sub-models $m$ may grow quickly with $d$; for example, taking all variable pairs in $X$ yields $m = d(d-1)/2$. From one vector $X$ we form the composite log-likelihood

$$\ell(\theta, w; X) = \sum_{j=1}^m w_j \ell_j(\theta; X) = \sum_{j=1}^m w_j \log f_j(S_j; \theta),$$

where $w$ is an $m \times 1$ vector of constants to be determined as the solution to an optimality problem. For $n$ independent, identically distributed vectors $X^{(1)}, \dots, X^{(n)}$ we define $\ell(\theta, w; X^{(1)}, \dots, X^{(n)}) = \sum_{i=1}^n \ell(\theta, w; X^{(i)})$. The score functions are obtained in the usual way:

$$U(\theta, w; X) = \nabla \ell(\theta, w; X) = \sum_{j=1}^m w_j U_j(\theta; X), \qquad U_j(\theta; X) = \nabla \ell_j(\theta; X),$$

$$U(\theta, w; X^{(1)}, \dots, X^{(n)}) = \nabla \ell(\theta, w; X^{(1)}, \dots, X^{(n)}) = \sum_{j=1}^m w_j \sum_{i=1}^n U_j(\theta; X^{(i)}),$$

where $\nabla$ denotes the gradient with respect to $\theta$. The maximum composite likelihood estimator $\hat\theta(w)$ is defined as the solution to the estimating equation

$$U(\theta, w; X^{(1)}, \dots, X^{(n)}) = \sum_{j=1}^m w_j \sum_{i=1}^n U_j(\theta; X^{(i)}) = 0, \qquad (1.1)$$

for some appropriate choice of $w$.
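To make the construction concrete, the following minimal Python sketch (illustrative only, not the authors' code; all names are made up) builds Gaussian marginal sub-likelihood scores $U_j(\theta; X) = (X_j - \theta)/\sigma_j^2$ for a common mean $\theta$ and solves the weighted estimating equation (1.1) for a fixed weight vector $w$.

```python
import numpy as np

def marginal_scores(theta, X, sigma2):
    """U_j(theta; X^(i)) = (X_ij - theta) / sigma_j^2, returned as an (n, m) array."""
    return (X - theta) / sigma2

def composite_score(theta, w, X, sigma2):
    """Weighted composite score sum_j w_j sum_i U_j(theta; X^(i)), as in (1.1)."""
    return np.sum(w * np.sum(marginal_scores(theta, X, sigma2), axis=0))

def solve_estimating_equation(w, X, sigma2):
    """For this score, linear in theta, (1.1) has the closed-form solution
       theta_hat = sum_j w_j xbar_j / sigma_j^2  /  sum_j w_j / sigma_j^2."""
    xbar = X.mean(axis=0)
    return np.sum(w * xbar / sigma2) / np.sum(w / sigma2)

# toy data: n = 200 draws of an m = 5 dimensional vector with common mean 1.5
rng = np.random.default_rng(0)
n, m, theta_true = 200, 5, 1.5
sigma2 = np.arange(1, m + 1) ** 2.0          # heterogeneous marginal variances
X = theta_true + rng.normal(size=(n, m)) * np.sqrt(sigma2)

w_uniform = np.ones(m)                       # retain all sub-likelihoods
theta_hat = solve_estimating_equation(w_uniform, X, sigma2)
print(theta_hat)
print(composite_score(theta_hat, w_uniform, X, sigma2))  # ~0 at the solution
```

The sketch makes explicit that, once the sub-likelihood scores are available, the only remaining design choice is the weight vector $w$, which is the object studied in the rest of the paper.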
Besides computational advantages and modeling flexibility, one reason for the popularity of the composite likelihood estimator is that it enjoys properties analogous to those of maximum likelihood (Lindsay, 1988; Lindsay et al., 2011; Varin et al., 2011). Under typical regularity conditions, the maximum composite likelihood estimator is consistent and asymptotically normal,

$$\sqrt{n}\,\{\hat\theta(w) - \theta^*\} \to N_p\{0, G(\theta^*, w)^{-1}\}, \qquad G(\theta, w) = H(\theta, w)\, K(\theta, w)^{-1} H(\theta, w), \qquad (1.2)$$

where $G(\theta, w)$ is the so-called Godambe information matrix, and $H(\theta, w) = -E\{\nabla U(\theta, w; X)\}$ and $K(\theta, w) = \mathrm{cov}\{U(\theta, w; X)\}$ are the $p \times p$ sensitivity and variability matrices, respectively. Although the maximum composite likelihood estimator is consistent, $G(\theta^*, w)$ is generally different from the Fisher information $\mathrm{cov}\{\nabla \log f(X; \theta^*)\}$, with the two coinciding only in special cases where $H(\theta^*, w) = K(\theta^*, w)$.

The choice of $w$ determines both the statistical properties and the computational efficiency of the composite likelihood estimator (Lindsay et al., 2011; Xu and Reid, 2011; Huang et al., 2020). On one hand, the established theory of unbiased estimating equations suggests choosing $w$ so as to maximize $\mathrm{tr}\{G(\theta^*, w)\}$ (Heyde, 2008, Chapter 2). Although theoretically appealing, these optimal weights depend on the inversion of the score covariance matrix, whose estimates are often singular. Different selection strategies balancing the trade-off between statistical efficiency and computing cost have been explored in the literature. A common practice is to retain all feasible sub-likelihoods by setting $w_j = 1$ for all $j \ge 1$, but this is undesirable from both the computational-parsimony and the statistical-efficiency viewpoints, since the presence of too many correlated scores inflates the variability matrix $K$ (Cox and Reid, 2004; Ferrari et al., 2016). A smaller subset may be selected by setting some of the $w_j$ equal to zero, but determining a suitable subset remains challenging. Dillon and Lebanon (2010) and Ferrari et al. (2016) develop stochastic approaches in which sub-likelihoods are sampled according to a statistical efficiency criterion. Ad-hoc methods have also been developed for particular types of models; for example, in spatial data analysis it is often convenient to consider sub-likelihoods corresponding to nearby observations; see Heagerty and Lele (1998); Sang and Genton (2014); Bevilacqua and Gaetan (2015).

Motivated by this gap in the literature, the present paper develops a methodology to select sparse composite likelihoods in large problems by retaining only the most informative scores in the estimating equations (1.1), while dropping the noisy ones. To this end, we propose to minimize the distance between the maximum likelihood score and the composite likelihood score, subject to a constraint representing the overall computing cost.

The remainder of the paper is organized as follows. In Section 2 we describe the main sub-likelihood selection and combination methodology. In Section 3 we discuss the properties of our method: Theorem 2 shows that the proposed empirical composition rule is asymptotically equivalent to the optimal composition rule that uses all the available scores, while Theorems 3 and 4 give consistency and asymptotic normality of the resulting parameter estimator. In Section 4 we discuss illustrative examples related to common families of models. In Section 5 we apply our method to real Covid-19 epidemiological data. Finally, Section 6 concludes with final remarks.

We propose to solve equation (1.1) with weights $w = w_\lambda(\theta)$ selected by minimizing the penalized score distance

$$d_\lambda(\theta, w) = \frac{1}{2}\, E\big\| U_{ML}(\theta; X) - U(\theta, w; X) \big\|_2^2 + \lambda \sum_{j=1}^m |w_j|, \qquad (2.3)$$

where $U_{ML}(\theta; X) = \nabla \log f(X; \theta)$ is the maximum likelihood score, $\|\cdot\|_2$ denotes the $L_2$-norm, and $\lambda \ge 0$ is a regularization parameter.
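As a brief numerical aside on the asymptotic variance in (1.2) (this is an illustration, not part of the paper's method; all names and the toy covariance are made up), the Python sketch below estimates the Godambe information $G = H K^{-1} H$ by Monte Carlo for the common-mean Gaussian model with marginal scores, and compares the asymptotic variance obtained with a uniform rule and with a sparse rule that keeps only the least noisy sub-scores.

```python
import numpy as np

def godambe_variance(w, sigma2, Sigma, n_sim=100_000, rng=None):
    """Monte Carlo estimate of G(theta*, w) = H K^{-1} H for the common-mean model
    with marginal scores U_j = (X_j - theta*) / sigma_j^2 (here theta* = 0).
    H = sum_j w_j / sigma_j^2 is exact; K = var{sum_j w_j U_j} is simulated."""
    rng = rng or np.random.default_rng(1)
    m = len(sigma2)
    X = rng.multivariate_normal(np.zeros(m), Sigma, size=n_sim)
    U = (X / sigma2) @ w          # draws of the composite score (p = 1, so a scalar)
    H = np.sum(w / sigma2)        # sensitivity
    K = U.var()                   # variability
    return H * (1.0 / K) * H      # Godambe information

m = 5
sigma2 = np.arange(1, m + 1) ** 2.0
# correlated components: rank-one common factor plus heterogeneous noise
Sigma = 0.3 * np.sqrt(np.outer(sigma2, sigma2)) + np.diag(0.7 * sigma2)

w_uniform = np.ones(m)
w_sparse = (sigma2 <= sigma2[2]).astype(float)   # keep only the three least noisy sub-scores
for w in (w_uniform, w_sparse):
    print(1.0 / godambe_variance(w, sigma2, Sigma))   # asymptotic variance of theta_hat(w)
```

Minimizing the distance in (2.3) targets exactly this quantity: a composite score closer to the full-likelihood score yields a Godambe variance closer to the inverse Fisher information, while the penalty controls how many sub-scores are retained.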
The minimizer of (2.3), say $w_\lambda(\theta)$, is then used for parameter estimation by solving the composite likelihood estimating equation (1.1) in $\theta$ with $w = w_\lambda(\theta)$. The vector of coefficients minimizing (2.3) may contain positive, negative or zero values, and negative elements do not cause any specific concern in our method. The size of the coefficients is expected to be larger for those sub-likelihoods that are strongly correlated with the full likelihood. Thus, a negative coefficient associated with a given sub-likelihood score does not imply that the sub-likelihood is less informative, but simply that it is negatively correlated with the maximum likelihood score.

The first term in the objective (2.3) aims at improving statistical efficiency by finding a composite score close to the maximum likelihood score. Minimizing the first term alone (when $\lambda = 0$) corresponds to finding $O_F$-optimal (fixed-sample optimal) estimating equations. This criterion formalizes the idea of variance minimization in estimating equations; see Heyde (2008, Ch. 1). Lindsay et al. (2011) point out that this type of criterion is suitable in the context of composite likelihood estimation, although finding a general computational procedure to minimize such a criterion in large problems is still an open problem. The term $\lambda \sum_{j=1}^m |w_j|$ is a penalty discouraging overly complex scores. The geometric properties of the $L_1$-norm penalty ensure that several elements of the solution $w_\lambda(\theta)$ are exactly zero for sufficiently large $\lambda$, thus simplifying the resulting estimating equations. This is a key property of the proposed approach, which is exploited to reduce the computational burden.

The optimal solution $w_\lambda(\theta)$ may be interpreted as one that maximizes statistical accuracy subject to a given level of afforded computing. Alternatively, $w_\lambda(\theta)$ may be viewed as a composition rule that minimizes the likelihood complexity subject to some afforded efficiency loss compared to maximum likelihood. The constant $\lambda$ balances the trade-off between statistical efficiency and computational cost: $\lambda = 0$ is optimal in terms of asymptotic efficiency but offers no reduction in likelihood complexity, while for increasing $\lambda > 0$ informative data subsets and their scores might be tossed away.

Two difficulties related to the direct minimization of (2.3) are the presence of the intractable likelihood score $U_{ML}$ and the expectation depending on the unknown parameter $\theta$. Up to an additive term independent of $w$, the penalized score distance in (2.3) can be expressed as

$$\frac{1}{2}\, E\big\| U(\theta, w; X) \big\|_2^2 - E\big\{ U_{ML}(\theta; X)^\top U(\theta, w; X) \big\} + \lambda \sum_{j=1}^m |w_j|. \qquad (2.4)$$

Let $M(\theta; X) = (U_1(\theta; X), \dots, U_m(\theta; X))$ be the $p \times m$ matrix with columns given by the $p \times 1$ score vectors $U_j(\theta; X)$, $j = 1, \dots, m$. Then the first term of (2.4) may be expressed as $w^\top J(\theta) w / 2$, where $J(\theta) = E\{M(\theta; X)^\top M(\theta; X)\}$ is the $m \times m$ score covariance matrix. Note that at $\theta = \theta^*$, assuming unbiased scores with $E\{U_j(\theta^*; X)\} = 0$, $j = 1, \dots, m$, the second Bartlett identity gives $E\{U_{ML}(\theta^*; X) U_j(\theta^*; X)^\top\} = E\{U_j(\theta^*; X) U_j(\theta^*; X)^\top\} = -E\{\nabla U_j(\theta^*; X)\}$; this implies that the second term in (2.4) equals $-w^\top \mathrm{diag}\{J(\theta^*)\}$, where $\mathrm{diag}(A)$ denotes the vector of diagonal elements of the matrix $A$. Therefore, (2.4) may be approximated by the population criterion

$$d_\lambda(\theta, w) = \frac{1}{2}\, w^\top J(\theta) w - w^\top \mathrm{diag}\{J(\theta)\} + \lambda \sum_{j=1}^m |w_j|. \qquad (2.5)$$

For $n$ independent observations $X^{(1)}, \dots, X^{(n)}$ on $X$,
we obtain the empirical composition rule $\hat w_\lambda(\theta)$ by minimizing the empirical criterion

$$\hat d_\lambda(\theta, w) = \frac{1}{2}\, w^\top \hat J(\theta) w - w^\top \mathrm{diag}\{\hat J(\theta)\} + \lambda \sum_{j=1}^m |w_j|, \qquad (2.6)$$

where $\hat J(\theta) = n^{-1} \sum_{i=1}^n M(\theta; X^{(i)})^\top M(\theta; X^{(i)})$ is the empirical score covariance matrix. The final composite likelihood estimator may be found by replacing $w = \hat w_\lambda(\theta)$ in (1.1) and then solving the estimating equation

$$U(\theta, \hat w_\lambda(\theta); X^{(1)}, \dots, X^{(n)}) = 0 \qquad (2.7)$$

with respect to $\theta$. Although $\hat w_\lambda(\theta)$ is generally smooth in a neighborhood of $\theta^*$, it may exhibit a number of non-differentiable points on the parameter space $\Theta$. This means that convergence of standard gradient-based root-finding algorithms, such as the Newton-Raphson algorithm, is not guaranteed. To address this issue, we propose to take a preliminary root-$n$ consistent estimate $\tilde\theta$ and to find the final estimator $\hat\theta_\lambda$ by instead solving the estimating equation

$$U(\theta, \hat w_\lambda(\tilde\theta); X^{(1)}, \dots, X^{(n)}) = 0, \qquad (2.8)$$

where $\hat w_\lambda(\tilde\theta)$ is a quantity fully determined by the data. A preliminary estimate is often easy to obtain, for example by solving (1.1) with $w_j = 1$ for all $1 \le j \le m$. Alternatively, a computationally cheap root-$n$ consistent estimate may be obtained by setting $w_j = 1$ for $j$ in some random subset $S \subseteq \{1, \dots, m\}$ and $w_j = 0$ otherwise.

For the numerical examples in this paper, the following implementation is considered. We first compute the sparse composition rule $\hat w_\lambda(\tilde\theta)$ by minimizing the convex criterion (2.6) with $\theta = \tilde\theta$, a preliminary root-$n$ consistent estimator. Minimization of (2.6) is implemented through the least-angle regression algorithm (Efron et al., 2004). Finally, the estimator $\hat\theta_\lambda$ is obtained by a one-step Newton-Raphson update applied to (1.1), starting from $\theta = \tilde\theta$. See Chapter 5 in Van der Vaart (2000) for an introduction to one-step estimation. As preliminary estimator $\tilde\theta$, one may choose the composite likelihood estimator with uniform composition rule $w_j = 1$ for all $j \ge 1$. From a theoretical viewpoint, any root-$n$ consistent initial estimator $\tilde\theta$ leads to the same asymptotic results for the final estimate, and we find that the impact of the initial estimate is negligible in many situations. Analogous to the least-angle algorithm originally developed by Efron et al. (2004) in the context of sparse linear regression, each step of our implementation includes the score $U_j(\theta; X)$ having the largest correlation with the residual difference $U_j(\theta; X) - U(\theta, w; X)$, followed by an adjustment step on $w$. An alternative computing approach is to solve (2.7) with respect to $\theta$ with $w = \hat w_\lambda(\theta)$ using a Newton-Raphson algorithm, with $\hat w_\lambda(\theta)$ updated at each iteration through a coordinate-descent approach as in Wu et al. (2008).

Selection of $\lambda$ is an aspect of practical importance, since it balances the trade-off between statistical and computational efficiency. In many practical applications this choice ultimately depends on the available computing resources and the objective of the analysis. Although a universal approach for selection is not sought in this paper, the following heuristic strategy may be considered. Taking a grid $\Lambda$ of values for $\lambda$ corresponding to different numbers of selected scores, we consider $\hat\lambda = \max\{\lambda \in \Lambda : \phi(\lambda) > \tau\}$ for some user-specified constant $0 < \tau \le 1$, where $\phi(\lambda) = \mathrm{tr}\{\hat J_\lambda\}/\mathrm{tr}\{\hat J\}$. Here $\hat J_\lambda$ denotes the empirical covariance matrix of the selected partial scores evaluated at $\tilde\theta$, whilst $\hat J$ is the covariance of all the scores. Thus $\phi(\lambda)$ can be viewed as the approximate proportion of score variance explained by the selected scores. In practice, one may choose $\tau$ to be a sufficiently large value, such as $\tau = 0.75$ or $\tau = 0.90$.
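For a fixed $\theta$, the criterion (2.6) is a lasso-type quadratic program, so besides least-angle regression it can be minimized by simple cyclic coordinate descent with soft-thresholding, the alternative mentioned above. The Python sketch below (illustrative only, not the authors' implementation; all names are made up) minimizes $\hat d_\lambda(w) = \tfrac12 w^\top \hat J w - w^\top \mathrm{diag}(\hat J) + \lambda \|w\|_1$ for a given estimated score covariance matrix and also evaluates the heuristic $\phi(\lambda)$ used for choosing $\lambda$.

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_composition_rule(J_hat, lam, n_iter=500, tol=1e-10):
    """Cyclic coordinate descent for 0.5*w'Jw - w'diag(J) + lam*||w||_1."""
    b = np.diag(J_hat).copy()
    m = len(b)
    w = np.zeros(m)
    for _ in range(n_iter):
        w_old = w.copy()
        for j in range(m):
            # partial residual excluding coordinate j
            r_j = b[j] - J_hat[j, :] @ w + J_hat[j, j] * w[j]
            w[j] = soft_threshold(r_j, lam) / J_hat[j, j]
        if np.max(np.abs(w - w_old)) < tol:
            break
    return w

def phi(J_hat, w):
    """Proportion of score variance explained by the selected scores:
    phi(lambda) = tr(J_selected) / tr(J)."""
    selected = np.abs(w) > 0
    return np.trace(J_hat[np.ix_(selected, selected)]) / np.trace(J_hat)

# toy score covariance: a few informative scores plus many weakly informative ones
rng = np.random.default_rng(2)
m = 30
A = rng.normal(size=(m, m)) * 0.1
J_hat = A @ A.T + np.diag(np.r_[np.full(5, 4.0), np.full(m - 5, 0.2)])

for lam in (0.0, 0.5, 2.0):
    w_hat = sparse_composition_rule(J_hat, lam)
    print(lam, int(np.sum(w_hat != 0)), round(phi(J_hat, w_hat), 3))
```

Running the loop over a grid of $\lambda$ values and keeping the largest $\lambda$ with $\phi(\lambda) > \tau$ reproduces the selection rule $\hat\lambda = \max\{\lambda \in \Lambda : \phi(\lambda) > \tau\}$ described above.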
When the covariance matrix $\hat J$ for all the scores is difficult to obtain due to excessive computational burden, we propose to use an upper bound for $\phi(\lambda)$ instead. Let $\tilde\lambda \in \Lambda$ be the next value smaller than $\hat\lambda$; that is, set $\tilde\lambda = \hat\lambda$ if $\hat\lambda = \min\{\Lambda\}$, and $\tilde\lambda = \max\{\lambda \in \Lambda : \lambda < \hat\lambda\}$ otherwise. Note that $\phi(\hat\lambda) < \mathrm{tr}\{\hat J_{\hat\lambda}\}/\mathrm{tr}\{\hat J_{\tilde\lambda}\}$, where the right-hand side represents the proportion of score variance retained at $\hat\lambda$ relative to that obtained by reducing $\lambda$ from $\hat\lambda$ to $\tilde\lambda$. Thus, in practice, one may take $\hat\lambda$ such that $\mathrm{tr}\{\hat J_{\hat\lambda}\}/\mathrm{tr}\{\hat J_{\tilde\lambda}\} > \delta$ for some relative tolerance level $0 < \delta < 1$.

This section gives an explicit expression for the minimizer of the penalized score distance criterion and provides sufficient conditions for its uniqueness. The main requirement for uniqueness is that no partial score can be fully determined by a linear combination of the other scores. Specifically, we require the following condition.

Condition 1. For any $\lambda > 0$ and $\theta \in \Theta$, the random vectors $(U_1, U_1^\top U_1 \pm \lambda), \dots, (U_m, U_m^\top U_m \pm \lambda)$ are linearly independent.

We note that the condition is automatically satisfied unless some partial score is perfectly correlated with the others, which is rarely the case in real applications. For a vector $a \in \mathbb{R}^m$, we use $a_E$ to denote the sub-vector corresponding to an index set $E \subseteq \{1, \dots, m\}$, while $A_E$ denotes the sub-matrix of the square matrix $A$ formed by taking the rows and columns corresponding to $E$. The notation $\mathrm{sign}(w)$ is used for the vector sign function, with $j$th element taking values $-1$, $0$ and $1$ if $w_j < 0$, $w_j = 0$ and $w_j > 0$, respectively. Let $q$ denote the number of zero eigenvalues of $\hat J(\theta)$, and let $V(\theta)$ be the $m \times q$ matrix collecting the eigenvectors corresponding to the zero eigenvalues.

Theorem 1. Under Condition 1, for any $\theta \in \Theta$ and $\lambda > \eta$, the minimizer of the penalized distance $\hat d_\lambda(\theta, w)$ defined in (2.6) is unique with probability one and is given by the expression in (3.9), in which $\hat E$ denotes the index set of the selected scores and $\backslash \hat E$ denotes the complement index set $\{1, \dots, m\} \setminus \hat E$. Moreover, $\hat w_{\lambda, \hat E}(\theta)$ contains at most $np \wedge m$ non-zero elements.

When $\lambda = 0$, we have $\hat E = \{1, \dots, m\}$, meaning that the corresponding composition rule $\hat w_{\lambda, \hat E}(\theta)$ contains no zero elements. In this case, the empirical score covariance matrix $\hat J(\theta)$ is required to be non-singular, which is certainly violated when $np < m$. Even when $np > m$, $\hat J(\theta)$ may be singular due to the presence of highly correlated partial scores. On the other hand, setting $\lambda > \eta$ always gives a non-singular score covariance matrix and guarantees existence of $\hat w_{\lambda, \hat E}(\theta)$. For sufficiently large $\lambda$, a relatively small subset of scores is selected. The formula in (3.9) shows that the $j$th score is selected when it contributes enough information to the overall composite likelihood. In particular, the $j$th score is included if the estimated absolute difference between its Fisher information $E\{U_j(\theta; X)^\top U_j(\theta; X)\}$ and its covariance with the overall composite likelihood score, $E\{U_j(\theta; X)^\top U(\theta, w; X)\}$, is greater than $\lambda$.

The asymptotic behavior of the sparse composition rule $\hat w_\lambda(\theta)$ is investigated in this section as the sample size $n$ diverges. The main result is the convergence of $\hat w_\lambda(\theta)$ to the ideal composition rule $w_\lambda(\theta)$, defined as the minimizer of the population criterion $d_\lambda(\theta, w)$ specified in (2.5). Letting $\lambda \to 0$ as $n$ increases implies that the sparse rule $\hat w_\lambda(\theta)$ is asymptotically equivalent, in terms of the estimator's variance, to the optimal rule $w_0(\theta)$, with the latter, however, involving all $m$ scores.
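To see how the selection in Theorem 1 operates in the simplest setting, consider the special case where the estimated scores are uncorrelated, so that $\hat J(\theta)$ is diagonal. The following derivation is a supplementary sketch based on the Karush-Kuhn-Tucker conditions for (2.6); it is not a restatement of the expression in (3.9). The criterion then separates across coordinates,

$$\hat d_\lambda(\theta, w) = \sum_{j=1}^m \Big\{ \tfrac{1}{2}\,\hat J(\theta)_{jj}\, w_j^2 - \hat J(\theta)_{jj}\, w_j + \lambda |w_j| \Big\},$$

and stationarity requires, for each $j$,

$$\hat J(\theta)_{jj}\, w_j - \hat J(\theta)_{jj} + \lambda \gamma_j = 0, \qquad \gamma_j = \operatorname{sign}(w_j) \text{ if } w_j \neq 0, \quad \gamma_j \in [-1, 1] \text{ otherwise},$$

whose solution is the soft-threshold rule

$$\hat w_{\lambda, j}(\theta) = \Big(1 - \frac{\lambda}{\hat J(\theta)_{jj}}\Big)\, I\{\hat J(\theta)_{jj} > \lambda\}.$$

Thus, in this special case a sub-likelihood is dropped exactly when its estimated information $\hat J(\theta)_{jj}$ does not exceed $\lambda$, in line with the selection criterion discussed after Theorem 1 and with the meta-analysis example of Section 4.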
To establish this asymptotic equivalence, an additional technical requirement on the covariance between the sub-likelihood scores is introduced.

Condition 2. For all $j, k = 1, \dots, m$, $\sup_{\theta \in \Theta} |\hat J(\theta)_{jk} - J(\theta)_{jk}| \to 0$ in probability as $n \to \infty$. Moreover, each element of $J(\theta)$ is continuous and bounded, and the smallest eigenvalue of $J(\theta)$ is bounded away from zero uniformly for all $\theta \in \Theta$.

Theorem 2. Under Conditions 1 and 2, for any $\lambda > 0$, we have $\sup_{\theta \in \Theta} \|\hat w_\lambda(\theta) - w_\lambda(\theta)\|_1 \to 0$ in probability as $n \to \infty$.

Since the preliminary estimate $\tilde\theta$ is consistent, continuity of $w_\lambda(\theta)$ (shown in Lemma 2 in the Appendix) immediately implies that $\hat w_\lambda(\tilde\theta)$ converges to $w_\lambda(\theta^*)$; that is, the empirical composition rule converges to the ideal composition rule evaluated at the true parameter. Theorem 2 implies that the proposed sparse composite likelihood score is a suitable approximation of the optimal score involving all $m$ sub-likelihoods: under regularity conditions, the difference between the sparse composite likelihood score and the $O_F$-optimal score based on $w_0(\theta^*)$ vanishes, as stated in (3.10), in probability as $n \to \infty$ and $\lambda \to 0$. Note that the optimal composition rule $w_0(\theta^*)$ and the related Godambe information matrix $G\{\theta, w_0(\theta^*)\}$ are typically hard to compute, as they require inverting the $m \times m$ score covariance matrix $J(\theta)$. On the other hand, the sparse composition rule $\hat w_\lambda(\theta)$ and the implied Godambe information have the advantage of being computationally tractable, since they involve only a fraction of the scores.

The final estimator $\hat\theta_\lambda$ is an M-estimator solving estimating equations of the form

$$\frac{1}{n}\, U(\theta, \hat w_\lambda(\tilde\theta); X^{(1)}, \dots, X^{(n)}) = 0. \qquad (3.11)$$

Since the vector $\hat w_\lambda(\tilde\theta)$ converges to $w^*_\lambda = w_\lambda(\theta^*)$ by Theorem 2 and Lemma 2, the limit version of (3.11) may be written as

$$E\{U(\theta, w^*_\lambda; X)\} = 0. \qquad (3.12)$$

To show that $\hat\theta_\lambda$ is consistent for the solution of (3.12), we assume additional regularity conditions ensuring, in particular, a unique root of the ideal composite likelihood score and stochastic equicontinuity of each sub-likelihood score.

Condition 3. For all constants $c > 0$, $\inf_{\theta: \|\theta - \theta^*\|_1 \ge c} \|E\{U(\theta, w^*_\lambda; X)\}\|_1 > 0$. Moreover, for each $j$, $\sup_{\theta \in \Theta} \| n^{-1}\sum_{i=1}^n U_j(\theta; X^{(i)}) - E\{U_j(\theta; X)\} \|_1 \to 0$ in probability as $n \to \infty$.

Theorem 3. Under Conditions 1-3, $\hat\theta_\lambda$ converges in probability to $\theta^*$.

To obtain asymptotic normality of the final estimator $\hat\theta_\lambda$, we assume the following condition on the sub-likelihood scores.

Condition 4. For all $j \ge 1$, $\mathrm{var}\{U_j(\theta^*; X)\} < \infty$. In a neighborhood of $\theta^*$, each $U_j(\theta; x)$ is twice continuously differentiable in $\theta$, and the partial derivatives are dominated by fixed integrable functions depending only on $x$. Moreover, $H(\theta^*, w^*_\lambda)$ defined in (1.2) is nonsingular.

Theorem 4. Under Conditions 1-4, we have

$$\sqrt{n}\,(\hat\theta_\lambda - \theta^*) \to N_p\{0, G_\lambda(\theta^*)^{-1}\} \ \text{ in distribution as } n \to \infty, \qquad G_\lambda(\theta) = H(\theta, w^*_\lambda)\, K(\theta, w^*_\lambda)^{-1} H(\theta, w^*_\lambda). \qquad (3.13)$$

The $p \times p$ Godambe information matrix $G_\lambda(\theta)$ in (3.13) can be estimated by the sandwich estimator $\hat G_\lambda = \hat H_\lambda \hat K_\lambda^{-1} \hat H_\lambda$, where the $p \times p$ matrices $\hat H_\lambda$ and $\hat K_\lambda$ are obtained by replacing $\theta = \hat\theta_\lambda$ in the empirical versions of the sensitivity and variability matrices evaluated at $w = \hat w_\lambda(\tilde\theta)$.

Practical advantages of using the sparse composition rule $\hat w_\lambda(\theta)$ are the reduction in computational cost and the increased stability of the standard error calculations. Although the score variance matrix $\hat K_\lambda(\theta)$ may be difficult to obtain when $\lambda = 0$, owing to the potentially $O(m^2)$ covariance terms, choosing a sufficiently large value of $\lambda > 0$ avoids this situation by reducing the number of terms in the composite likelihood score.

Let $X \sim N_m(\theta 1_m, \Sigma)$, where the $m \times m$ covariance matrix $\Sigma$ has off-diagonal elements $\sigma_{jk}$ ($j \neq k$) and diagonal elements $\sigma_j^2$. Computing the maximum likelihood estimator of $\theta$ requires $\Sigma^{-1}$, and $\Sigma$ is usually estimated by the sample covariance $\hat\Sigma$. When $n < m$, however, the maximum likelihood estimator is not available, since the sample covariance $\hat\Sigma$ is singular; the composite likelihood estimator, on the other hand, remains feasible. The $j$th marginal score is $U_j(\theta; x) = (x_j - \theta)/\sigma_j^2$, and the composite likelihood estimating equation based on the sample $X^{(1)}, \dots, X^{(n)}$ is
$$\sum_{j=1}^m w_j \sum_{i=1}^n \frac{X_j^{(i)} - \theta}{\sigma_j^2} = 0. \qquad (4.14)$$

The population and empirical score covariances are $m \times m$ matrices with $(j,k)$th entries

$$J(\theta)_{jk} = \frac{E\{(X_j - \theta)(X_k - \theta)\}}{\sigma_j^2 \sigma_k^2}, \qquad \hat J(\theta)_{jk} = \frac{1}{n}\sum_{i=1}^n \frac{(X_j^{(i)} - \theta)(X_k^{(i)} - \theta)}{\sigma_j^2 \sigma_k^2},$$

respectively. It is useful to inspect the special case where $X$ has independent components ($\sigma_{jk} = 0$ for all $j \neq k$). This setting corresponds to the fixed-effect meta-analysis model, where estimators from $m$ independent studies are combined to improve accuracy. From Theorem 1, the empirical composition rule is

$$\hat w_{\lambda, j}(\theta) = \Big(1 - \frac{\lambda}{\hat J(\theta)_{jj}}\Big)\, I\{\hat J(\theta)_{jj} > \lambda\}, \qquad j = 1, \dots, m,$$

whilst the optimal population composition rule $w_{\lambda, j}(\theta)$ is given by the same expression with sample averages replaced by expectations. The composition rule $w_{\lambda, j}(\theta)$ evaluated at the true parameter is $w^*_{\lambda, j} = (1 - \lambda \sigma_j^2)\, I(\sigma_j^{-2} > \lambda)$, $j = 1, \dots, m$. This highlights that overly noisy data subsets with variance $\sigma_j^2 > \lambda^{-1}$ are dropped and thus do not influence the final estimator of $\theta$. In particular, the number of non-zero elements in $w^*_\lambda$ is $\sum_{j=1}^m I(\sigma_j^2 < \lambda^{-1})$. Finally, substituting $w_j = \hat w_{\lambda, j}(\theta)$ in (4.14) gives the following fixed-point equation for the estimator:

$$\hat\theta_\lambda = \frac{\sum_{j=1}^m \hat w_{\lambda, j}(\hat\theta_\lambda)\, \bar X_j / \sigma_j^2}{\sum_{j=1}^m \hat w_{\lambda, j}(\hat\theta_\lambda) / \sigma_j^2}, \qquad \bar X_j = \frac{1}{n}\sum_{i=1}^n X_j^{(i)}. \qquad (4.15)$$

When $\lambda = 0$, we have uniform weights $w^*_0 = (1, \dots, 1)$, and the corresponding estimator $\hat\theta_0$ is the usual optimal meta-analysis solution. Although the implied estimator $\hat\theta_0$ has minimum variance, it offers no control over the overall computational cost, since all $m$ sub-scores are selected. On the other hand, choosing $\lambda > 0$ judiciously may lead to a low computational burden with negligible efficiency loss for the resulting estimator. For instance, assuming $\sigma_j^2 = j^2$, a calculation gives the bound in (4.16), where $E = \{j : j^2 < \lambda^{-1}\}$ and $\theta$ here is the true parameter. Since the number of selected scores is $\sum_{j=1}^m I(j^2 < \lambda^{-1}) = \lfloor \lambda^{-1/2} \rfloor$, the bound converges to zero as $\lambda \to 0$; additional calculations also show that $\sum_{j \notin E} j^{-2} \to 0$ as $\lambda \to 0$. Hence the left-hand side of (4.16) goes to zero as long as $\lambda \to 0$. This suggests that the truncated composite likelihood score approximates the optimal score well, while containing a relatively small number of terms.

If the elements of $X$ are correlated, with $\sigma_{jk} \neq 0$ for $j \neq k$ (for instance, $\Sigma_{jj} = j$ and $\Sigma_{jk} = \rho (jk)^{1/2}$ for $j \neq k$, which gives constant pairwise correlation $\rho$), the partial scores contain overlapping information about $\theta$. In this case, tossing away highly correlated partial scores improves computing while maintaining satisfactory statistical efficiency for the final estimator.

Suppose now that $X$ follows a multivariate normal distribution with zero mean vector and covariance matrix $\Sigma(\theta)$ with unit diagonal entries, $\Sigma(\theta)_{jj} = 1$, and off-diagonal entries depending on $\theta$ through quantities $\delta_{jk}$. The quantity $\delta_{jk}$ may be regarded as the distance between spatial locations $j$ and $k$, and the case of equally distant points corresponds to covariance estimation for exchangeable variables, described in detail by Cox and Reid (2004). The maximum composite likelihood estimator solves

$$\sum_{j < k} w_{jk} \sum_{i=1}^n U_{jk}(\theta; X_j^{(i)}, X_k^{(i)}) = 0,$$

where $U_{jk}(\theta; x_j, x_k)$ is the score of the bivariate normal distribution for the pair $(X_j, X_k)$. Figure 2 shows the solution path of the optimal composition rule $w^*_\lambda$ for different values of $\lambda$, and the asymptotic relative efficiency of the estimator $\hat\theta_\lambda$ compared to the maximum likelihood estimator. Numbers of variable pairs ranging from $m = 45$ to $m = 1225$ are considered. When $\lambda = 0$, the proposed estimator has relatively high asymptotic efficiency. Interestingly, the efficiency stays at around 90% until only a few sub-likelihoods are left, suggesting that a very small proportion of the partial-likelihood components already contains the majority of the information about $\theta$. In such cases, the proposed selection procedure is useful in reducing the computing burden while retaining satisfactory efficiency for the final estimator.
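Returning to the independent-component (meta-analysis) case of the first example, the fixed-point equation (4.15) is easy to reproduce numerically. The Python sketch below is illustrative only (not the authors' code); the soft-threshold weight formula is the one derived in the diagonal-case sketch of Section 3 and is assumed here rather than quoted from (3.9).

```python
import numpy as np

def weights(theta, X, sigma2, lam):
    """Soft-threshold weights w_j = (1 - lam / Jhat_jj) * 1{Jhat_jj > lam},
    where Jhat_jj(theta) = mean_i (X_ij - theta)^2 / sigma_j^4."""
    J_diag = np.mean((X - theta) ** 2, axis=0) / sigma2**2
    return np.where(J_diag > lam, 1.0 - lam / J_diag, 0.0)

def fixed_point_estimate(X, sigma2, lam, n_iter=50):
    """Iterate theta <- sum_j w_j(theta) xbar_j / sigma_j^2 / sum_j w_j(theta) / sigma_j^2."""
    xbar = X.mean(axis=0)
    theta = np.mean(xbar)                      # crude starting value
    for _ in range(n_iter):
        w = weights(theta, X, sigma2, lam)
        theta = np.sum(w * xbar / sigma2) / np.sum(w / sigma2)
    return theta, w

rng = np.random.default_rng(3)
n, m, theta_true = 200, 50, 1.0
sigma2 = np.arange(1, m + 1) ** 2.0            # sigma_j^2 = j^2, as in the example
X = theta_true + rng.normal(size=(n, m)) * np.sqrt(sigma2)

for lam in (0.0, 0.05, 0.2):
    theta_hat, w = fixed_point_estimate(X, sigma2, lam)
    print(f"lambda={lam}: theta_hat={theta_hat:.3f}, scores kept={int(np.sum(w > 0))}")
```

With $\lambda = 0$ all $m$ marginal scores are retained and the usual inverse-variance-weighted meta-analysis estimate is recovered; for larger $\lambda$ only the least noisy studies survive the truncation, with little loss of accuracy.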
In the Covid-19 application, the pairwise likelihood approach of Section 4.2 is applied with $\delta_{jk} = t_{jk}/(m_j m_k)$, where $m_j$ denotes the population size (in millions) of the $j$th site and $t_{jk}$ is the distance between sites, computed as $t_{jk} = \{(\mathrm{lat}_j - \mathrm{lat}_k)^2 + (\mathrm{lon}_j - \mathrm{lon}_k)^2\}^{1/2}$, where $\mathrm{lat}_j$ and $\mathrm{lon}_j$ represent the latitude and longitude of the $j$th site. Each $X^{(i)} = (X_1^{(i)}, \dots, X_d^{(i)})$ may then be regarded as a set of $d$ observations on a Gaussian random field with the exponential spatial covariance function given in (5.17); see Bevilacqua and Gaetan (2015) for more details and comparisons on pairwise likelihood estimation for Gaussian random fields. Clearly, correlation depends on population density and on potential flows of people across pairs of provinces. To control for these aspects, both population size and distance between provinces are included in the gravity model specified in (5.17). This means that $\theta$ should be interpreted as a covariance parameter for new cases between provinces, conditional on population size and distance between provinces. Population size and distance are the two main factors commonly used in formulating distance-decay laws, such as gravity models, to represent spatial interactions related to disease-diffusion processes.

The main interest is in estimating the covariance parameter $\theta$ to monitor contagion across provinces. To this end, the data are first de-trended and normalized; estimates of the local trends $\hat\mu_j$ are obtained by a Nadaraya-Watson kernel smoother, implemented in the R function ksmooth. The error variances $\sigma_j^2$ are estimated by the average squared residuals, $\hat\sigma_j^2 = \sum_{i=1}^n (X_j^{(i)} - \hat\mu_j^{(i)})^2/n$. The covariance parameter $\theta$ is subsequently estimated using the pairwise likelihood approach of Section 4.2 with this choice of $\delta_{jk}$.

Table 1 shows estimates corresponding to a sequence of decreasing values of $\lambda$. As the number of pairwise likelihood terms increases, the estimator $\hat\theta_\lambda$ approaches a stable value and its standard error decreases. For instance, taking $\tau = 0.75$ as defined in Section 2.2 yields an estimate with only about 58 sub-likelihoods selected. In comparison, the uniform composite likelihood estimator using all pairs of sites with equal weights is $\hat\theta_{\mathrm{unif}} = 6.15 \times 10^{-2}$, with standard error $2.42 \times 10^{-3}$.

Composite likelihood inference plays an important role as a remedy to the drawbacks of traditional likelihood approaches, with advantages in terms of computing and modeling flexibility. Nevertheless, a universal procedure for constructing composite likelihoods that is statistically justified and fast to execute does not seem to exist (Lindsay et al., 2011). Motivated by this gap in the literature, this paper introduces a selection methodology resulting in composite likelihood estimating equations with good statistical properties. The selected equations are sparse for sufficiently large $\lambda$, meaning that they contain only the most informative sub-likelihood score terms. This sparsity-promoting mechanism is found to be useful in common situations where the sub-likelihood scores are heterogeneous in terms of their information, or when the ideal $O_F$-optimal score is difficult to obtain. Remarkably, the sparse score is shown to approximate the $O_F$-optimal score in large samples under reasonable conditions; see Theorem 2 and Equation (3.10). For implementation, we proposed a selection criterion for $\lambda$ that performs well in the examples, rather than providing a universal approach. In practice, alternative criteria for choosing $\lambda$ may be used, depending on the problem at hand.
For example, when the full score covariance is not available due to the computational burden, one may use the upper bound provided in Section 2.2, or choose $\lambda$ up to some given level of information gain, defined by the ratio of the smallest eigenvalue to the trace of the covariance of the currently selected scores, which is decreasing in $\lambda$ by the min-max theorem of linear algebra. As another idea, we note that, by the Karush-Kuhn-Tucker conditions of quadratic optimization, $\lambda$ represents the norm of the estimated covariance between the currently selected sub-score $U_j(\theta; X)$ and the residual $\{U(\theta, w; X) - U_j(\theta; X)\}$; one may therefore choose $\lambda$ so that this covariance is smaller than some pre-fixed value.

Building on the recent success of shrinkage methods for the full likelihood, many works have proposed the use of sparsity-inducing penalties in the composite likelihood framework; e.g., see Bradic et al. (2011); Xue et al. (2012); Gao and Carroll (2017). However, the spirit of our approach is entirely different from these methods, since our penalty acts on entire sub-likelihood functions rather than on elements of $\theta$. In contrast to the above approaches, our penalization strategy has the advantage of retaining asymptotically unbiased estimating equations, thus leading to desirable asymptotic properties of the related parameter estimator.

A number of developments of the present study may be pursued in the future, from either a theoretical or an applied viewpoint. Although the current paper focuses on the case where $p$ is finite, penalties able to deal with situations where both $m$ and $p$ are allowed to grow with $n$ may be useful for the analysis of high-dimensional data, as would efficient implementations of the convex criterion in such settings.

Throughout the proofs below, we write $U_j^{(i)}$ for $U_j(\theta; X^{(i)})$ and $U^{(i)}$ for $U(\theta, w; X^{(i)})$.

Proof of Theorem 1. We first note that, for any $\theta \in \Theta$, $\hat d_\lambda(\theta, w)$ is bounded below, so the minimizer exists. This follows by taking the eigendecomposition of the real symmetric matrix $\hat J(\theta)$ and re-organizing $\hat d_\lambda(\theta, w)$ as a sum of perfect-square terms corresponding to the nonzero eigenvalues, a non-negative first-order term corresponding to the zero eigenvalues, and a constant. We also note that $U^{(i)}$ is unique, due to the strict convexity of the first term of $\hat d_\lambda(\theta, w)$ with respect to $U^{(i)}$, the convexity of the remaining terms with respect to $w$, and the linearity of $U^{(i)}$ with respect to $w$. By the Karush-Kuhn-Tucker conditions for quadratic optimization, the solution must satisfy

$$\{\hat J(\theta) w\}_j - \hat J(\theta)_{jj} + \lambda \gamma_j = 0, \qquad j = 1, \dots, m, \qquad (6.18)$$

where $\gamma_j = \mathrm{sign}(w_j)$ if $w_j \neq 0$ and $\gamma_j \in [-1, 1]$ if $w_j = 0$. This implies that the index set $\hat E$ defined in (3.9) is unique. Note that the rank of $\hat J_{\hat E}$ is at most $\min(m, np)$. Next we show that $\hat J_{\hat E}$ has full rank. Otherwise, by the rank equality of the Gram matrix, there exist a subset $\bar E \subseteq \hat E$ with $|\bar E| \le \min(m, np + 1)$ and some $k \in \bar E$ such that $(U_k^{(1)}, \dots, U_k^{(n)})$ lies in the linear span of the remaining score vectors indexed by $\bar E$. Together with the Karush-Kuhn-Tucker condition, there then exist constants $a_j$, $j \in \bar E$ and $j \neq k$, not all zero, such that $U_k^{(i)} = \sum_{j \in \bar E, j \neq k} a_j U_j^{(i)}$ for all $i = 1, \dots, n$, together with the analogous relation for the remaining components appearing in Condition 1. This represents a linear system with $(np + 1)$ equations but only $|\bar E| - 1$ degrees of freedom, meaning that the rank of the $(np + 1) \times |\bar E|$ matrix whose columns are generated by the vectors in Condition 1 is at most $|\bar E| - 1$. Since $|\bar E| \le np + 1$, the $|\bar E|$ columns are linearly dependent. Under Condition 1, this event has probability zero, which is a contradiction. The statement of the theorem then follows by solving the Karush-Kuhn-Tucker equations in (6.18).

Lemma 1. Under Conditions 1 and 2, for any $\lambda > 0$, $\sup_{\theta \in \Theta} |\hat d_\lambda\{\theta, \hat w_\lambda(\theta)\} - d_\lambda\{\theta, \hat w_\lambda(\theta)\}| \to 0$ in probability, as $n \to \infty$.

Proof of Lemma 1.
By definition, it suffices to show that, for all $j, k \ge 1$, $\sup_{\theta \in \Theta} |\hat J(\theta)_{jk} - J(\theta)_{jk}| \to 0$ in probability as $n \to \infty$, and that $\|\hat w_\lambda(\theta)\|_1$ is uniformly bounded with probability tending to one. The first part is ensured by Condition 2. For the second part, by Theorem 1 it suffices to show that $\hat J_{\hat E}(\theta)$ and $\hat J_{\hat E}(\theta)^{-1}$ are uniformly bounded entry-wise with probability tending to one, which is guaranteed by the uniform convergence of $\hat J(\theta)$ in probability, the boundedness of each element of $J(\theta)$, and the invertibility of $\hat J_{\hat E}(\theta)$, according to the min-max theorem of linear algebra.

Proof of Theorem 2. Recall that $\hat w_\lambda(\theta) = \operatorname{argmin}_w \hat d_\lambda(\theta, w)$ and $w_\lambda(\theta) = \operatorname{argmin}_w d_\lambda(\theta, w)$. Let $\xi$ be the smallest eigenvalue of $J(\theta)$. By definition, $\sup_\theta \|\hat w_\lambda(\theta) - w_\lambda(\theta)\|_1$ can be bounded by a chain of inequalities in which the second inequality follows from the Karush-Kuhn-Tucker conditions of quadratic optimization, and the second-to-last inequality follows because $\hat w_\lambda(\theta)$ and $w_\lambda(\theta)$ are the corresponding minimizers. By Lemma 1, the first term of the final bound converges to zero in probability, and the same holds for the second term. Under Condition 2, $\xi > 0$. Since $m$ is fixed, this concludes the proof.

Lemma 2. Under Conditions 1 and 2, $w_\lambda(\theta)$ is continuous with respect to both $\lambda$ and $\theta$, for $\lambda \ge 0$ and $\theta \in \Theta$.

Proof of Lemma 2. For simplicity, we show the continuity of $w_\lambda(\theta)$ with respect to $\theta$; the proof of continuity with respect to $\lambda$ is the same and is therefore omitted. For any $c > 0$ and $\theta_1 \in \Theta$, it suffices to show that there exists some $\delta > 0$ such that $\|\theta - \theta_1\|_1 < \delta$ implies $\|w_\lambda(\theta) - w_\lambda(\theta_1)\| < c$. To find $\delta$, recall that $w_\lambda(\theta)$ is the minimizer of $d_\lambda(\theta, w)$ defined in (2.5). Under Condition 2, $d_\lambda(\theta, w)$ is strictly convex with respect to $w$. Thus there exists $c_1 = \inf_{\{w: \|w - w_\lambda(\theta_1)\|_1 = c\}} d_\lambda(\theta_1, w) > d_\lambda\{\theta_1, w_\lambda(\theta_1)\}$. Moreover, $d_\lambda(\theta, w)$ is uniformly continuous on the closed domain $\{w \in \mathbb{R}^m : \|w - w_\lambda(\theta_1)\|_1 \le c\} \times \{\theta \in \mathbb{R}^p : \|\theta - \theta_1\|_1 \le \delta_1\}$ for some $\delta_1 > 0$. Thus we can find $\delta \in (0, \delta_1)$ such that, for any $\theta$ with $\|\theta - \theta_1\|_1 < \delta$ and any $w$ with $\|w - w_\lambda(\theta_1)\|_1 \le c$, $|d_\lambda(\theta, w) - d_\lambda(\theta_1, w)| < [c_1 - d_\lambda\{\theta_1, w_\lambda(\theta_1)\}]/2$. This implies that, when $\|\theta - \theta_1\|_1 < \delta$, $d_\lambda(\theta, w) > d_\lambda\{\theta, w_\lambda(\theta_1)\}$ for all $w$ with $\|w - w_\lambda(\theta_1)\|_1 = c$. Since $d_\lambda(\theta, w)$ is strictly convex, we conclude that $\|w_\lambda(\theta) - w_\lambda(\theta_1)\| < c$.

Proof of Theorem 3. Note that $\hat\theta_\lambda$ and $\theta^*$ are the solutions of the estimating equations $U(\theta, w; X^{(1)}, \dots, X^{(n)})/n = 0$ and $E\{U(\theta, w; X)\} = 0$, with $w$ replaced by $\hat w_\lambda(\tilde\theta)$ and $w^*_\lambda$, respectively. By Theorem 2 and Lemma 2, we have $\|\hat w_\lambda(\tilde\theta) - w^*_\lambda\|_1 \to 0$ in probability as $n \to \infty$. Under Condition 2, $E\{U_j(\theta; X)\}$ is bounded. Moreover, under Condition 3, each sub-likelihood score satisfies $\sum_{i=1}^n U_j(\theta; X^{(i)})/n \to E\{U_j(\theta; X)\}$ uniformly with probability tending to one. Since $U\{\theta, \hat w_\lambda(\tilde\theta); X^{(1)}, \dots, X^{(n)}\}/n$ and $E\{U(\theta, w^*_\lambda; X)\}$ are products of $\hat w_\lambda(\tilde\theta)$, $w^*_\lambda$ and the sub-likelihood scores, we have

$$\sup_{\theta \in \Theta} \Big\| \frac{1}{n}\, U\{\theta, \hat w_\lambda(\tilde\theta); X^{(1)}, \dots, X^{(n)}\} - E\{U(\theta, w^*_\lambda; X)\} \Big\|_1 \to 0, \qquad (6.19)$$

in probability as $n \to \infty$. Moreover, by Condition 3, $\inf_{\{\theta: \|\theta - \theta^*\|_1 \ge c\}} \|E\{U(\theta, w^*_\lambda; X)\}\|_1 > 0$ for any constant $c > 0$. Thus, for any $c > 0$, there exists a $\delta > 0$ such that the event $\|\hat\theta_\lambda - \theta^*\|_1 \ge c$ implies the event $\|E\{U(\hat\theta_\lambda, w^*_\lambda; X)\}\|_1 > \delta$. We have

$$\Pr\{\|\hat\theta_\lambda - \theta^*\|_1 \ge c\} \le \Pr\big[\|E\{U(\hat\theta_\lambda, w^*_\lambda; X)\}\|_1 > \delta\big] = \Pr\Big[\Big\| E\{U(\hat\theta_\lambda, w^*_\lambda; X)\} - \frac{1}{n}\, U\{\hat\theta_\lambda, \hat w_\lambda(\tilde\theta); X^{(1)}, \dots, X^{(n)}\} \Big\|_1 > \delta\Big],$$
which converges to zero as $n \to \infty$ by (6.19). This completes the proof of Theorem 3.

References

Bevilacqua, M. and Gaetan, C. (2015). Comparing composite likelihood methods based on pairs for spatial Gaussian random fields.
Bradic, J., Fan, J. and Wang, W. (2011). Penalized composite quasi-likelihood for ultrahigh dimensional variable selection.
Cox, D. R. and Reid, N. (2004). A note on pseudolikelihood constructed from marginal densities.
Dillon, J. V. and Lebanon, G. (2010). Stochastic composite likelihood.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Annals of Statistics 32(2), 407-499.
Ferrari, D., Qian, G. and Hunter, T. (2016). Parsimonious and efficient likelihood composition by Gibbs sampling.
Gao, X. and Carroll, R. J. (2017). Data integration with high dimensionality.
Heagerty, P. J. and Lele, S. R. (1998). A composite likelihood approach to binary spatial data.
Heyde, C. C. (2008). Quasi-likelihood and its application: a general approach to optimal parameter estimation.
Huang et al. (2020). On specification tests for composite likelihood inference.
Lindsay, B. G. (1988). Composite likelihood methods. Contemporary Mathematics.
Lindsay, B. G., Yi, G. Y. and Sun, J. (2011). Issues and strategies in the selection of composite likelihoods.
Sang, H. and Genton, M. G. (2014). Tapered composite likelihood for spatial max-stable models.
Van der Vaart, A. W. (2000). Asymptotic statistics.
Varin, C., Reid, N. and Firth, D. (2011). An overview of composite likelihood methods.
Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression.