Bayesian cumulative shrinkage for infinite factorizations
Sirio Legramanti, Daniele Durante and David B. Dunson
2019-02-12

There are a variety of Bayesian models relying on representations in which the dimension of the parameter space is, itself, unknown. For example, in factor analysis the number of latent variables is, in general, not known and has to be inferred from the data. Although classical shrinkage priors are useful in these situations, incorporating cumulative shrinkage can provide a more effective option which progressively penalizes more complex expansions. A successful proposal within this setting is the multiplicative gamma process. However, such a process is limited in scope, and has some drawbacks in terms of shrinkage properties and interpretability. We overcome these issues via a novel class of convex mixtures of spike and slab distributions assigning increasing mass to the spike through an adaptive function which grows with model complexity. This prior has broader applicability, a simple interpretation, a parsimonious representation, and induces adaptive cumulative shrinkage of the terms associated with redundant, and potentially infinite, dimensions. Performance gains are illustrated in simulation studies.

1 Introduction

There has been considerable interest in designing shrinkage priors for high-dimensional parameters (Ishwaran and Rao, 2005; Carvalho et al., 2010; Polson and Scott, 2010), but most of the focus has been on regression, where there is no natural ordering in the coefficients. There are several settings, however, where an order is present or desirable. For example, in models relying on low-rank factorizations or basis expansions, such as factor models, probabilistic principal component analysis and tensor factorizations, it is natural to expect that additional terms play a progressively less important role in characterizing the data or model structure, and hence that the associated parameters have a stochastically decreasing effect. Motivated by the above discussion, Bhattacharya and Dunson (2011) proposed the multiplicative gamma process as a prior for the column-specific scale parameters of the loadings matrix in infinite factor models. In particular, they define zero-mean Gaussian priors for each factor loading λ_jh ∈ ℜ (j = 1, …, p; h = 1, …, ∞) and incorporate a column-specific variance parameter θ_h ∈ ℜ+ having prior

θ_h = ∏_{l=1}^{h} ϑ_l^{-1},  ϑ_1 ∼ Ga(a_1, 1),  ϑ_l ∼ Ga(a_2, 1)  (l = 2, …, ∞).  (1)

This prior has been widely employed in applications and generalizations (e.g., Sarkar et al., 2018) due to its ability to allow the amount of shrinkage to increase with model complexity. However, as discussed in Durante (2017), the increasing shrinkage induced by the multiplicative gamma process heavily depends on the hyper-parameters, and there is a tendency to sometimes over-shrink, thereby affecting posterior inference in factorizations or expansions having a higher rank. To overcome these issues, we propose a novel class of convex mixtures of spike and slab distributions assigning increasing mass to the spike through an adaptive function which grows with model dimension via the cumulative sum of stick-breaking probabilities (Sethuraman, 1994; Ishwaran and James, 2001). This process shrinks increasingly with model complexity and adaptively learns both low and high rank factorizations.
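To make the hyper-parameter sensitivity of (1) concrete, the following minimal Python sketch (ours, not code from the paper; the function and variable names are illustrative) simulates the multiplicative gamma process variances θ_h = ∏_{l=1}^h ϑ_l^{-1} under two settings of (a_1, a_2) and reports their medians across replicates: the same pair of hyper-parameters that fixes the decay rate also shapes the prior on the active terms.

    import numpy as np

    rng = np.random.default_rng(0)

    def mgp_variances(a1, a2, H, n_sims):
        # theta_h = prod_{l<=h} vartheta_l^{-1}, with vartheta_1 ~ Ga(a1, 1)
        # and vartheta_l ~ Ga(a2, 1) for l >= 2, as in equation (1)
        vartheta = np.hstack([rng.gamma(a1, 1.0, (n_sims, 1)),
                              rng.gamma(a2, 1.0, (n_sims, H - 1))])
        return 1.0 / np.cumprod(vartheta, axis=1)

    for a1, a2 in [(1.0, 2.0), (2.0, 6.0)]:
        theta = mgp_variances(a1, a2, H=8, n_sims=100000)
        # medians are reported since E(vartheta_1^{-1}) is infinite when a1 <= 1
        print((a1, a2), np.round(np.median(theta, axis=0), 4))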
As discussed in §2, the proposed prior has improved interpretability and provides a general formulation which can also be considered in a variety of non-Gaussian expansions; see, for example, Dunson and Xing (2009), Rousseau and Mengersen (2011) and Gopalan et al. (2014).

2 The cumulative shrinkage process

We focus here on the general case in which the effect of the h-th factor or basis function is controlled by a scalar parameter θ_h ∈ Θ ⊆ ℜ (h = 1, …, ∞), so that the deletion of redundant terms can be obtained by progressively shrinking the countable sequence θ = {θ_h ∈ Θ ⊆ ℜ : h = 1, …, ∞} towards an appropriate value. Although this scenario can be easily relaxed, such a setting is commonly found in several factorizations. For example, in infinite factor models, θ_h ∈ ℜ+ may denote the variance of the loadings associated with the h-th factor, for h = 1, …, ∞, and the goal is to define a prior on these parameters which stochastically shrinks redundant factors and flexibly adapts to the unknown rank. As discussed in Bhattacharya and Dunson (2011) and Rousseau and Mengersen (2011), combining infinite or over-complete representations with shrinkage priors is conceptually appealing in bypassing the estimation of the model dimension while avoiding over-fitting. Motivated by this reasoning, Definition 2.1 characterizes our proposed cumulative shrinkage process.

Definition 2.1 Let θ = {θ_h ∈ Θ ⊆ ℜ : h = 1, …, ∞} denote a countable sequence of parameters. We say that θ is distributed according to a cumulative shrinkage process with parameter α > 0, starting slab distribution P_0 and target spike δ_{θ_∞} if

(θ_h | π_h) ∼ P_h = (1 − π_h) P_0 + π_h δ_{θ_∞},  π_h = Σ_{l=1}^{h} ω_l,  ω_l = v_l ∏_{m=1}^{l−1} (1 − v_m),  (2)

independently for every h = 1, …, ∞, where v_1, …, v_∞ are independent Beta(1, α) variables and P_0 is a diffuse continuous distribution which can be specified depending on the type of factorization considered.

Equation (2) implies that the probability π_h assigned to the spike δ_{θ_∞} increases with the model dimension h, and that lim_{h→∞} π_h = 1 (Ishwaran and James, 2001). Hence, as complexity grows, P_h increasingly concentrates around θ_∞, which is specified to facilitate the deletion of redundant terms, while the slab P_0 allows flexible modeling of the active ones. Definition 2.1 can be naturally extended to sequences in ℜ^p, and the delta mass δ_{θ_∞} can be replaced with a continuous distribution, without affecting the basic properties of the prior. Proposition 2.2 provides a formal interpretation for each π_h (h = 1, …, ∞). All the proofs can be found in the Appendix.

Proposition 2.2 Representation (2) defines a probability measure-valued stochastic process {P_h : h = 1, …, ∞} travelling between P_0 and δ_{θ_∞}. The probability π_h can be interpreted as the proportion of the distance between P_0 and δ_{θ_∞} covered up to step h, in the sense that π_h = d_TV(P_0, P_h)/d_TV(P_0, δ_{θ_∞}), for h = 1, …, ∞, where d_TV(·, ·) denotes the total variation distance between two probability measures.

Hence, {π_1, …, π_∞} effectively controls the rate of increasing shrinkage. Using similar arguments, we can obtain analogous expressions for ω_h and v_h, which represent the proportions of the total path d_TV(P_0, δ_{θ_∞}) and of the remaining path d_TV(P_{h−1}, δ_{θ_∞}), respectively, travelled between steps h − 1 and h. Specifically, ω_h = d_TV(P_{h−1}, P_h)/d_TV(P_0, δ_{θ_∞}) and v_h = d_TV(P_{h−1}, P_h)/d_TV(P_{h−1}, δ_{θ_∞}) for any h = 1, …, ∞. The expectations of these quantities are explicitly available as

E(π_h) = 1 − {α/(1 + α)}^h,  E(ω_h) = {α/(1 + α)}^{h−1}/(1 + α),  E(v_h) = 1/(1 + α),  (3)

for each h = 1, …, ∞.
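As a numerical companion to Definition 2.1 and equation (3), the sketch below (ours, not from the paper) draws truncated cumulative shrinkage process sequences, using a Ga(2, 1) slab purely as an arbitrary illustrative choice for P_0, and checks the formula for E(π_h) by Monte Carlo.

    import numpy as np

    rng = np.random.default_rng(1)
    alpha, H, n_sims, theta_inf = 5.0, 10, 20000, 0.05

    v = rng.beta(1.0, alpha, (n_sims, H))          # stick-breaking variables
    omega = v * np.cumprod(np.hstack([np.ones((n_sims, 1)), 1 - v[:, :-1]]), axis=1)
    pi = np.cumsum(omega, axis=1)                  # pi_h = sum_{l<=h} omega_l
    spike = rng.random((n_sims, H)) < pi           # spike with probability pi_h
    slab = rng.gamma(2.0, 1.0, (n_sims, H))        # arbitrary diffuse choice for P0
    theta = np.where(spike, theta_inf, slab)       # theta_h ~ (1 - pi_h)P0 + pi_h delta

    h = np.arange(1, H + 1)
    print("Monte Carlo E(pi_h):", np.round(pi.mean(axis=0), 3))
    print("Equation (3):       ", np.round(1 - (alpha / (1 + alpha)) ** h, 3))

The two printed vectors agree up to Monte Carlo error.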
Combining (3) with Definition 2.1, the expectation of θ_h (h = 1, …, ∞) is, instead,

E(θ_h) = {α/(1 + α)}^h θ̄_0 + [1 − {α/(1 + α)}^h] θ_∞,  (4)

where θ̄_0 denotes the expectation under the slab P_0. Hence, as h grows, the prior expectation of θ_h collapses exponentially towards the spike location θ_∞. As stated in Lemma 2.3, this cumulative shrinkage property interestingly holds not only in expectation, but also in distribution, under our prior in (2).

Lemma 2.3 Let θ denote a countable sequence of parameters with prior (2). Then, for any h = 1, …, ∞ and ε > 0,

pr(|θ_h − θ_∞| > ε) = {α/(1 + α)}^h [1 − P_0{B_ε(θ_∞)}],  (5)

where B_ε(θ_∞) = {θ ∈ Θ : |θ − θ_∞| ≤ ε}, and, as a consequence, pr(|θ_{h+1} − θ_∞| ≤ ε) > pr(|θ_h − θ_∞| ≤ ε).

Equations (3)-(5) highlight how the rate of increasing shrinkage is controlled by α. In particular, lower α induces faster concentration around θ_∞ and, as a consequence, more rapid deletion of redundant terms. This control over the rate of increasing shrinkage via α is separated from the specification of the slab P_0, whose shape and location can be specified independently from α, thereby allowing flexible modelling of active terms. As discussed in Durante (2017), this is not possible for the multiplicative gamma process in (1), whose hyper-parameters (a_1, a_2) control both the rate of shrinkage and the prior for the active terms. This creates a trade-off between the need to maintain moderately diffuse priors for the active loadings and the attempt to adaptively shrink those corresponding to redundant factors. Such a trade-off is additionally complicated by the fact that the increasing shrinkage holds, in expectation, only for specific choices of (a_1, a_2), with these settings typically inducing over-shrinkage. As is clear from equations (3)-(5), although the cumulative shrinkage process in Definition 2.1 is similar to the multiplicative gamma process in inducing exponential shrinkage, such a prior achieves improved flexibility by ensuring increasing shrinkage in distribution for any α, without affecting the ability of the slab P_0 to model active terms.

Another advantage of our proposed process is that the shrinkage parameter α coincides with the prior expectation of the total number of active terms in θ modelled via the diffuse slab P_0. This can be shown after noticing that (θ_h | π_h) in (2) can alternatively be obtained by marginalizing out an augmented indicator c_h, with pr(c_h = 1 | π_h) = 1 − π_h, in (θ_h | c_h) ∼ c_h P_0 + (1 − c_h) δ_{θ_∞}. According to this representation, H* = Σ_{h=1}^{∞} c_h counts the number of active elements in θ, and its prior mean is

E(H*) = Σ_{h=1}^{∞} {1 − E(π_h)} = Σ_{h=1}^{∞} {α/(1 + α)}^h = α.

Hence, in prior specification, α > 0 should coincide with the expected total number of active components modeled via the slab P_0. This distribution P_0 should be, instead, sufficiently diffuse to capture active terms in the factorization considered, whereas θ_∞ has to be chosen to facilitate deletion of redundant terms. Before concluding our analysis of the probabilistic properties of the prior in Definition 2.1, let us study how a finite version of (2) for a truncated sequence {θ_h ∈ Θ ⊆ ℜ : h = 1, …, H} approximates the prior on the infinite sequence. Indeed, when performing posterior computation it is customary to rely on a finite-dimensional factorization which reasonably approximates the infinite representation through a conservative but finite number of components. Such a finite representation is typically implemented by substituting the complete sequence θ with the truncated sequence θ^(H) obtained by fixing to zero all the elements θ_h having index h = H + 1, …, ∞. For instance, in factor models, the deletion of factors beyond the H-th one can be obtained by setting to zero the variances θ_h of the zero-mean Gaussian priors for the loadings λ_jh (j = 1, …, p) for h = H + 1, …, ∞.
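The identity E(H*) = α can also be checked numerically. The sketch below (ours) truncates the infinite sum at a large H, which is harmless here since the neglected tail equals α{α/(1 + α)}^H and is negligible for the values used.

    import numpy as np

    rng = np.random.default_rng(2)
    alpha, H_big, n_sims = 5.0, 100, 20000   # tail beyond H_big is negligible

    v = rng.beta(1.0, alpha, (n_sims, H_big))
    omega = v * np.cumprod(np.hstack([np.ones((n_sims, 1)), 1 - v[:, :-1]]), axis=1)
    pi = np.cumsum(omega, axis=1)
    c = rng.random((n_sims, H_big)) < (1 - pi)   # c_h = 1 marks an active term
    print("Monte Carlo E(H*):", c.sum(axis=1).mean(), "  target alpha:", alpha)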
Theorem 2.4 provides an upper bound for the prior probability of the truncated sequence θ^(H) being arbitrarily far from the complete sequence θ, in sup-norm.

Theorem 2.4 If the infinite sequence θ has prior (2), then, for any truncation index H and ε > |θ_∞|,

pr(sup_{h>H} |θ_h| > ε) ≤ α {α/(1 + α)}^H [1 − P_0{B_ε(0)}].

Hence, the prior probability of θ^(H) being close to θ converges to one at a rate which is exponential in H, thus justifying posterior inference under finite sequences based on a conservative H. Although the above bound holds for ε > |θ_∞|, in general θ_∞ is set close to zero and hence Theorem 2.4 is valid for small ε.

3 Application to Bayesian infinite factor models

3.1 Cumulative shrinkage process for infinite factor models

The prior in §2 can be considered in many factorizations under appropriate choices of P_0 and δ_{θ_∞}. To facilitate comparison with the multiplicative gamma process we first focus on infinite factor models. Indeed, Bhattacharya and Dunson (2011) are motivated by Bayesian inference on the covariance matrix Ω of the data y_i = (y_i1, …, y_ip)ᵀ ∼ N_p(0, Ω) (i = 1, …, n) and rely on the expansion Ω = ΛΛᵀ + Σ, which arises by marginalizing out the infinite-dimensional vector η_i of latent variables η_ih ∼ N(0, 1) (h = 1, …, ∞) in the factor model y_i = Λη_i + ε_i (i = 1, …, n), where ε_i ∼ N_p(0, Σ) and Σ = diag(σ²_1, …, σ²_p). The authors perform Bayesian inference by assuming σ²_j ∼ InvGa(a_σ, b_σ) priors for each j = 1, …, p, and consider independent N(0, φ_jh θ_h) priors for each factor loading λ_jh (j = 1, …, p; h = 1, …, ∞). The local scales φ_jh are assigned InvGa(ν/2, ν/2) priors, whereas the global variances θ_h have the multiplicative gamma process prior in (1) to facilitate growing shrinkage towards low-rank factorizations. Motivated by the issues with this approach discussed in §2, we instead let (λ_jh | θ_h) ∼ N(0, θ_h) (j = 1, …, p; h = 1, …, ∞) and, adapting (2) to this setting, we assume

θ_h ∼ (1 − π_h) InvGa(a_θ, b_θ) + π_h δ_{θ_∞},  π_h = Σ_{l=1}^{h} ω_l,  ω_l = v_l ∏_{m=1}^{l−1} (1 − v_m),  (6)

independently for h = 1, …, ∞, where v_1, …, v_∞ are independent Beta(1, α) variables. Although additional flexibility can be incorporated by assuming hyper-priors for α > 0, θ_∞ ∈ ℜ+ and (a_θ, b_θ) ∈ ℜ²+, we leverage the results in §2 and consider fixed default values coherent with the interpretable role of these hyper-parameters in the cumulative shrinkage process. An additional result which can guide prior specification is that, integrating out θ_h from (λ_jh | θ_h) ∼ N(0, θ_h), the factor loadings λ_jh have marginal priors

λ_jh ∼ π_h N(0, θ_∞) + (1 − π_h) t_{2a_θ}(0, b_θ/a_θ)  (j = 1, …, p; h = 1, …, ∞),

where t_{2a_θ}(0, b_θ/a_θ) is a Student-t distribution with 2a_θ degrees of freedom, location 0 and scale b_θ/a_θ. Hence, θ_∞ should be set close to zero to allow effective shrinkage of redundant factors, whereas (a_θ, b_θ) should be specified to induce a moderately diffuse prior with scale b_θ/a_θ for the active loadings. Exploiting these marginal priors, it also follows that pr(|λ_j(h+1)| ≤ ε) > pr(|λ_jh| ≤ ε), for each j = 1, …, p, h = 1, …, ∞ and ε > 0, when b_θ/a_θ > θ_∞. This guarantees cumulative shrinkage in distribution also for the factor loadings.
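The marginal prior above is easy to sample from, which gives a direct check of the increasing concentration of the loadings. The sketch below is ours; plugging in E(π_h) is exact here because pr(|λ_jh| ≤ ε) is linear in π_h, and the Student-t draws are scaled by (b_θ/a_θ)^{1/2}, the usual normal-inverse-gamma marginal.

    import numpy as np

    rng = np.random.default_rng(3)
    alpha, a_t, b_t, theta_inf = 5.0, 2.0, 2.0, 0.05
    eps, n_sims = 0.1, 200000

    for h in range(1, 9):
        p_spike = 1 - (alpha / (1 + alpha)) ** h          # E(pi_h) from (3)
        spike = rng.random(n_sims) < p_spike
        lam = np.where(spike,
                       rng.normal(0.0, np.sqrt(theta_inf), n_sims),
                       np.sqrt(b_t / a_t) * rng.standard_t(2 * a_t, n_sims))
        print(f"h={h}: pr(|lambda_jh| <= {eps}) ~ {np.mean(np.abs(lam) <= eps):.3f}")

The printed probabilities increase with h, illustrating cumulative shrinkage in distribution for the loadings.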
3.2 Posterior computation

Posterior inference for the infinite factor model in §3.1, truncated at H terms, with σ²_j ∼ InvGa(a_σ, b_σ) (j = 1, …, p) and prior (6) for the factor loadings, proceeds via the Gibbs sampler in Algorithm 1. This routine incorporates a data augmentation step guaranteeing conjugate full-conditionals. In fact, a finite version of (6), with v_H = 1, can also be obtained by marginalizing out independent indicators z_1, …, z_H, having probabilities pr(z_h = l | ω_l) = ω_l for each l = 1, …, H, in the prior

(θ_h | z_h) ∼ 1(z_h ≤ h) δ_{θ_∞} + 1(z_h > h) InvGa(a_θ, b_θ),

where 1(z_h ≤ h) = 1 if z_h ≤ h and 0 otherwise. As is clear from Algorithm 1, conditioned on z_1, …, z_H it is possible to efficiently sample from the full-conditionals via classical conjugacy results, whereas the updating of the augmented data relies on the full-conditional distribution

pr(z_h = l | λ_h, ω_l) ∝ ω_l N_p(λ_h; 0, θ_∞ I_p)  (l = 1, …, h),
pr(z_h = l | λ_h, ω_l) ∝ ω_l t_{2a_θ}{λ_h; 0, (b_θ/a_θ) I_p}  (l = h + 1, …, H),  (7)

for l = 1, …, H, where N_p(λ_h; 0, θ_∞ I_p) and t_{2a_θ}{λ_h; 0, (b_θ/a_θ) I_p} are densities of p-variate Gaussian and Student-t distributions, respectively, evaluated at λ_h = (λ_1h, …, λ_ph)ᵀ.

[Algorithm 1. Gibbs sampler for Bayesian factor models with the cumulative shrinkage process. Only fragments of the algorithm box survive extraction: given (θ_1, …, θ_H), η = (η_1, …, η_n)ᵀ and y_j = (y_1j, …, y_nj)ᵀ, each cycle samples, for h from 1 to H, z_h from the categorical variable with probabilities (7), and outputs a sample from the posterior of Ω = ΛΛᵀ + Σ.]
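A direct implementation of the update in (7) is straightforward. The sketch below is ours, with hypothetical helper names; log densities are used for numerical stability, and the closing lines give a toy usage with hypothetical values.

    import numpy as np
    from scipy.special import gammaln, logsumexp

    def log_mvn_iso(x, var):
        # log N_p(x; 0, var * I_p)
        p = x.size
        return -0.5 * p * np.log(2 * np.pi * var) - 0.5 * (x @ x) / var

    def log_mvt_iso(x, df, scale2):
        # log t_df(x; 0, scale2 * I_p), scale2 playing the role of b_t/a_t in (7)
        p = x.size
        return (gammaln((df + p) / 2) - gammaln(df / 2)
                - 0.5 * p * np.log(df * np.pi * scale2)
                - 0.5 * (df + p) * np.log1p((x @ x) / (df * scale2)))

    def z_probabilities(h, lam_h, omega, a_t, b_t, theta_inf):
        # pr(z_h = l | -) from (7): spike density if l <= h, slab density if l > h
        logp = np.log(omega)
        for l in range(1, omega.size + 1):
            logp[l - 1] += (log_mvn_iso(lam_h, theta_inf) if l <= h
                            else log_mvt_iso(lam_h, 2 * a_t, b_t / a_t))
        return np.exp(logp - logsumexp(logp))

    rng = np.random.default_rng(4)
    H, p = 5, 20
    v = rng.beta(1.0, 5.0, H)
    v[-1] = 1.0                                   # v_H = 1 in the finite version
    omega = v * np.cumprod(np.concatenate(([1.0], 1 - v[:-1])))
    print(z_probabilities(3, rng.normal(0, 1, p), omega, 2.0, 2.0, 0.05).round(3))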
3.3 Simulation studies

We consider illustrative simulations to understand whether our prior provides improved performance in recovering the true underlying covariance matrix Ω_0, compared to the multiplicative gamma process. To address this goal, we focus on covariance matrices of dimension 20 × 20 factorized as Ω_0 = Λ_0Λ_0ᵀ + Σ_0, with Λ_0 ∈ ℜ^{20×H_0}, where H_0 ∈ {5, 10, 15} induces three different rank structures. The entries in Λ_0 are drawn independently from N(0, 9), whereas Σ_0 is set at 0·25 I_20. For each of the three values of H_0, we generate 25 datasets of n = 100 observations y_1, …, y_100 from N_20(0, Ω_0) and, for each of the 25 replicates, we perform posterior inference on Ω via the Bayesian factor model in §3.1, considering both prior (1) and prior (6). Motivated by the results in §2, we rely on a finite factor model with conservative truncation at H = p = 20, and facilitate collapsing of redundant dimensions along with flexible modeling of active loadings by setting α = 5, a_θ = b_θ = 2 and θ_∞ = 0·05. We considered these default choices in all simulations and found the results robust to several settings, although in practical applications additional care should be paid to the scale of the data, which can be normalized before inference, if necessary. This problem is, however, not specific to our cumulative shrinkage process, but common to any prior specification. For the multiplicative gamma process, we follow Durante (2017) in considering (a_1, a_2) = (1, 2), and set ν = 3 as done by Bhattacharya and Dunson (2011) in their simulation studies. Finally, (a_σ, b_σ), which are common to both models, are fixed at (1, 0·3) as in Bhattacharya and Dunson (2011).

Bayesian inference for both models relies on 40000 samples from the posterior of Ω, produced by Algorithm 1 and by the Gibbs sampler in Bhattacharya and Dunson (2011), respectively, with a burn-in of 20000 and thinning every five samples. Based on the trace-plots of the chains for the entries in Ω, these settings were sufficient for convergence and reasonable mixing. Leveraging the posterior samples of Ω from both models, we compute, for the 25 simulations within each scenario, a Monte Carlo estimate of the averaged mean square error Σ_{j=1}^{p} Σ_{q=j}^{p} E{(Ω_jq − Ω_0jq)²}/{p(p + 1)/2}, calculated with respect to the posterior of Ω. Table 1 displays the quartiles of these 25 averaged mean square errors for each scenario under both models. As is clear from Table 1, our prior is comparable to the multiplicative gamma process in low-rank settings, and outperforms it in higher-rank factorizations. Such a result is consistent with the tendency of the multiplicative gamma process to induce over-shrinkage, thus failing to account for higher-rank factorizations. Exploiting the data augmentation described in §3.2, it is further possible to quantify uncertainty in the number of active factors through the posterior distribution of H*. Approximating this posterior based on the samples of the augmented data z_1, …, z_H, we obtained concentration around the true value H_0 in each of the three scenarios, thus suggesting that H* may provide a useful proxy for learning the number of active factors. These empirical results motivate future theoretical studies on posterior consistency in recovering the true covariance matrix Ω_0 and its unknown rank H_0 when p grows with n. A possibility to address this goal is to adapt recent results in Pati et al. (2014).

4 Application to Bayesian infinite Poisson factorization

4.1 Cumulative shrinkage process for Poisson factorization

We now turn our attention to Bayesian infinite Poisson factorization (Gopalan et al., 2014). This model is particularly useful in recommendation systems based on count data y_ij ∈ N measuring, for example, the total purchases of item j = 1, …, p made by user i = 1, …, n. In this context, Poisson factorization assumes (y_ij | λ_ij) ∼ Pois(λ_ij), independently for every pair (i, j), and expresses each rate λ_ij = γ_iᵀψ_j ∈ ℜ+ as the inner product between user-specific preferences γ_i = (γ_i1, …, γ_iH)ᵀ ∈ ℜ+^H and item-specific attributes ψ_j = (ψ_j1, …, ψ_jH)ᵀ ∈ ℜ+^H. Setting H to ∞ leads to an infinite factorization of the n × p Poisson rates matrix Λ via Λ = ΓΨᵀ, where Γ and Ψ are matrices having rows γ_iᵀ (i = 1, …, n) and ψ_jᵀ (j = 1, …, p), respectively. Although there are different alternatives for Bayesian learning of Λ = ΓΨᵀ (Gopalan et al., 2014), a natural option, which exploits gamma-Poisson and gamma-gamma conjugacy, is to assume Ga(a, θ_h^{−1}) priors for each γ_ih and ψ_jh (i = 1, …, n; j = 1, …, p; h = 1, …, ∞) and induce cumulative shrinkage over the columns of Γ and Ψ using either the multiplicative gamma process (1) or the proposed cumulative shrinkage process (6) on the sequence of scales θ = {θ_h : h = 1, …, ∞}.

4.2 Posterior computation

Posterior inference for the infinite Poisson factorization model in §4.1, truncated at H terms with prior (6) on the column-specific scales θ_h, proceeds via the Gibbs sampler in Algorithm 2. This routine combines the conjugate full-conditional results for Poisson factorization outlined in Gopalan et al. (2014) with the same data augmentation step employed in Algorithm 1 and described in §3.2. Here, the updating of the augmented data z_h (h = 1, …, H) relies on the full-conditional distribution

pr(z_h = l | ξ_h, ω_l) ∝ ω_l ∏_{k=1}^{K} Ga(ξ_kh; a, θ_∞^{−1})  (l = 1, …, h),
pr(z_h = l | ξ_h, ω_l) ∝ ω_l {b_θ^{a_θ} Γ(a_θ + Ka)/[Γ(a_θ) Γ(a)^K]} (∏_{k=1}^{K} ξ_kh^{a−1}) (b_θ + Σ_{k=1}^{K} ξ_kh)^{−(a_θ + Ka)}  (l = h + 1, …, H),  (8)

for every l = 1, …, H, where K = n + p and ξ_h = (ξ_1h, …, ξ_Kh)ᵀ = (γ_1h, …, γ_nh, ψ_1h, …, ψ_ph)ᵀ. The expression for the case l > h in (8) can be easily obtained by marginalizing out θ_h in the joint distribution InvGa(θ_h; a_θ, b_θ) ∏_{k=1}^{K} Ga(ξ_kh; a, θ_h^{−1}).

[Algorithm 2. Gibbs sampler for Bayesian Poisson factorization with the cumulative shrinkage process. Only fragments of the algorithm box survive extraction: for i from 1 to n and j from 1 to p, sample the vector x_ij = (x_ij1, …, x_ijH)ᵀ ∈ N^H from Mult{y_ij, (γ_i ∘ ψ_j)/γ_iᵀψ_j}, where ∘ is the element-wise Hadamard product; sample z_h from the categorical variable with probabilities (8); output at the end of one cycle a sample from the posterior of Λ = ΓΨᵀ.]

Gibbs sampling under the multiplicative gamma process (1) requires a simple combination of steps 1-2 in Algorithm 2 and step 5 in Bhattacharya and Dunson (2011).
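Analogously, the update in (8) can be implemented in a few lines. The sketch below is ours, with hypothetical names; the spike term is a product of Ga(a, θ_∞^{−1}) densities and the slab term is the compound marginal obtained from the marginalization just described.

    import numpy as np
    from scipy.special import gammaln, logsumexp

    def log_gamma_prod(xi, a, theta):
        # sum_k log Ga(xi_k; a, rate 1/theta): the spike case l <= h in (8)
        return np.sum(-a * np.log(theta) + (a - 1) * np.log(xi)
                      - xi / theta - gammaln(a))

    def log_compound(xi, a, a_t, b_t):
        # log int InvGa(theta; a_t, b_t) prod_k Ga(xi_k; a, 1/theta) dtheta,
        # the slab case l > h in (8)
        K = xi.size
        return (a_t * np.log(b_t) + gammaln(a_t + K * a) - gammaln(a_t)
                - K * gammaln(a) + (a - 1) * np.sum(np.log(xi))
                - (a_t + K * a) * np.log(b_t + xi.sum()))

    def z_probabilities(h, xi_h, omega, a, a_t, b_t, theta_inf):
        logp = np.log(omega)
        for l in range(1, omega.size + 1):
            logp[l - 1] += (log_gamma_prod(xi_h, a, theta_inf) if l <= h
                            else log_compound(xi_h, a, a_t, b_t))
        return np.exp(logp - logsumexp(logp))

    # toy usage with hypothetical values
    rng = np.random.default_rng(5)
    H, K = 5, 10
    v = rng.beta(1.0, 5.0, H)
    v[-1] = 1.0
    omega = v * np.cumprod(np.concatenate(([1.0], 1 - v[:-1])))
    print(z_probabilities(2, rng.gamma(1.0, 2.0, K), omega, 1.0, 2.0, 2.0, 0.25).round(3))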
4.3 Simulation studies

As for the factor model, we consider illustrative simulation studies to understand whether our prior provides improved performance in recovering the true underlying Poisson rates matrix Λ_0, compared to the multiplicative gamma process. Consistent with this goal, we focus on Poisson rate matrices Λ_0 of dimension 80 × 40 factorized as Λ_0 = Γ_0Ψ_0ᵀ, with Γ_0 ∈ ℜ+^{80×H_0} and Ψ_0 ∈ ℜ+^{40×H_0}, where H_0 ∈ {5, 10, 15} induces three different rank structures. The entries in Γ_0 and Ψ_0 are drawn from Ga(1, 0·5). As done in §3.3, for each value of H_0 we generate 25 data matrices Y with entries y_ij from Pois(γ_0iᵀψ_0j) for every i = 1, …, n and j = 1, …, p. For each of the 25 replicates, we perform posterior inference on Λ via the Bayesian Poisson factorization model described in §4.1, based on our cumulative shrinkage process (6) and on the multiplicative gamma process (1). Motivated by the results in §2, we consider a finite Poisson factorization based on a conservative truncation at H = min(n, p) = 40. In specifying the hyper-parameters, we follow Gopalan et al. (2014) and define exponentially shaped gamma priors for the elements in Γ and Ψ by setting a = 1. The remaining hyper-parameters for the priors in (1) and (6) are set as in §3.3, with the only exception of θ_∞, which is set at 0·25 to obtain a marginal variance under the spike that is similar to the one for the factor loadings in §3.3. Moderate changes in θ_∞ did not substantially affect the final results, although excessively low θ_∞ values could lead to difficulties in attracting inactive components under the spike, thus making convergence of the algorithm slightly slower.
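For concreteness, a minimal data-generation sketch for the set-up just described follows (ours, not the paper's code; we read Ga(1, 0·5) as shape 1 and rate 0·5, which is an assumption about the intended parameterization).

    import numpy as np

    rng = np.random.default_rng(6)
    n, p, H0 = 80, 40, 10                          # one of the three rank settings
    Gamma0 = rng.gamma(1.0, 1.0 / 0.5, (n, H0))    # Ga(1, rate 0.5) entries
    Psi0 = rng.gamma(1.0, 1.0 / 0.5, (p, H0))
    Lambda0 = Gamma0 @ Psi0.T                      # true Poisson rates
    Y = rng.poisson(Lambda0)                       # one synthetic data matrix
    print(Y.shape, float(Y.mean()))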
Bayesian inference for both priors relies on 40000 posterior samples of Λ, with a burn-in of 20000 and thinning every five samples. Based on the trace-plots of the chains for the entries in Λ, these settings were sufficient for convergence and mixing, although slightly unstable chains were found for the posterior under the multiplicative gamma process. Leveraging the posterior samples of Λ from both models, we compute, for each of the 25 simulations within every scenario, a Monte Carlo estimate of the averaged mean square error Σ_{i=1}^{n} Σ_{j=1}^{p} E{(λ_ij − λ_0ij)²}/(np), calculated with respect to the posterior of Λ. The quartiles of these 25 averaged mean square errors for each of the three scenarios under both models are displayed in Table 2. As for the factor model, our prior is comparable to the multiplicative gamma process in low-rank settings, but achieves improved performance in higher-rank factorizations.

Appendix

Proof of Proposition 2.2. The proof adapts that of Theorem 1 in Canale et al. (2018). In fact, under the prior in Definition 2.1, the distance d_TV(P_0, P_h) on the Borel σ-algebra in ℜ is equal to sup_B |P_0(B) − (1 − π_h)P_0(B) − π_h δ_{θ_∞}(B)| = π_h sup_B |P_0(B) − δ_{θ_∞}(B)|. Hence, d_TV(P_0, P_h) = π_h d_TV(P_0, δ_{θ_∞}), thus proving Proposition 2.2.

Proof of Lemma 2.3. Conditionally on π_h, the probability of the event |θ_h − θ_∞| > ε is (1 − π_h)[1 − P_0{B_ε(θ_∞)}], since the spike δ_{θ_∞} assigns no mass outside B_ε(θ_∞) (h = 1, …, ∞). Hence, pr(|θ_h − θ_∞| > ε) = [1 − P_0{B_ε(θ_∞)}]{1 − E(π_h)}. Replacing E(π_h) with its expression in equation (3) leads to (5). To prove pr(|θ_{h+1} − θ_∞| ≤ ε) > pr(|θ_h − θ_∞| ≤ ε) it suffices to note that {α(1 + α)^{−1}}^{h+1} < {α(1 + α)^{−1}}^h.

Proof of Theorem 2.4. The proof follows after noting that pr(sup_{h>H} |θ_h| > ε) = pr{∪_{h>H} (|θ_h| > ε)} and that δ_{θ_∞} assigns no mass outside B_ε(0) for ε > |θ_∞|. Hence, adapting the proof of Lemma 2.3 and applying a union bound, we obtain pr(sup_{h>H} |θ_h| > ε) ≤ Σ_{h=H+1}^{∞} {α(1 + α)^{−1}}^h [1 − P_0{B_ε(0)}]. To conclude the proof, notice that Σ_{h=H+1}^{∞} {α(1 + α)^{−1}}^h = α{α(1 + α)^{−1}}^H.

References

Bhattacharya, A. and Dunson, D. B. (2011). Sparse Bayesian infinite factor models.
Bien, J. and Tibshirani, R. (2011). Sparse estimation of a covariance matrix.
Canale, A., Durante, D. and Dunson, D. B. (2018). Convex mixture regression for quantitative risk assessment.
Carvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q. and West, M. (2008). High-dimensional sparse factor modeling: applications in gene expression genomics.
Carvalho, C. M., Polson, N. G. and Scott, J. G. (2010). The horseshoe estimator for sparse signals.
Dunson, D. B. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data.
Durante, D. (2017). A note on the multiplicative gamma process.
Durante, D., Dunson, D. B. and Vogelstein, J. T. (2017). Nonparametric Bayes modeling of populations of networks.
Gopalan, P., Ruiz, F. J. R., Ranganath, R. and Blei, D. M. (2014). Bayesian nonparametric Poisson factorization for recommendation systems.
Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors.
Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: frequentist and Bayesian strategies.
Knowles, D. and Ghahramani, Z. (2011). Nonparametric Bayesian sparse factor models with application to gene expression modeling.
Pati, D., Bhattacharya, A., Pillai, N. S. and Dunson, D. B. (2014). Posterior contraction in sparse Bayesian factor models for massive covariance matrices.
Polson, N. G. and Scott, J. G. (2010). Shrink globally, act locally: sparse Bayesian regularization and prediction.
Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models.
Sarkar, A., Pati, D., Chakraborty, A., Mallick, B. K. and Carroll, R. J. (2018). Bayesian semiparametric multivariate density deconvolution.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors.
Zhao, S., Gao, C., Mukherjee, S. and Engelhardt, B. E. (2016). Bayesian group factor analysis with structured sparsity.
Zhu, H., Khondker, Z., Lu, Z. and Ibrahim, J. G. (2014). Bayesian generalized low rank regression models for neuroimaging phenotypes and genetic markers.
Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis.