title: Ultimate Pólya-Gamma Samplers -- Efficient MCMC for possibly imbalanced binary and categorical data
authors: Frühwirth-Schnatter, Sylvia; Zens, Gregor; Wagner, Helga
date: 2020-11-13

Modeling binary and categorical data is one of the most commonly encountered tasks of applied statisticians and econometricians. While Bayesian methods in this context have been available for decades now, they often require a high level of familiarity with Bayesian statistics or suffer from issues such as low sampling efficiency. To contribute to the accessibility of Bayesian models for binary and categorical data, we introduce novel latent variable representations based on Pólya-Gamma random variables for a range of commonly encountered discrete choice models. From these latent variable representations, new Gibbs sampling algorithms for binary, binomial and multinomial logistic regression models are derived. All models allow for a conditionally Gaussian likelihood representation, rendering extensions to more complex modeling frameworks such as state space models straightforward. However, sampling efficiency may still be an issue in these data augmentation based estimation frameworks. To counteract this, MCMC boosting strategies are developed and discussed in detail. The merits of our approach are illustrated through extensive simulations and a real data application.

Applied statisticians and econometricians commonly have to deal with modelling binary or categorical outcome variables. Widely used tools for analyzing such data include logit and probit models, binomial and multinomial logit (MNL) regression-type models, as well as more flexible models such as binary state space or random effects models. Bayesian approaches toward inference are very useful in this context, in particular for uncertainty quantification, but, as opposed to Gaussian outcome variables, their implementation is demanding from a computational viewpoint. One strategy to implement sampling-based inference relies on importance sampling (Zellner & Rossi 1984) or Metropolis-Hastings (MH) algorithms (Rossi et al. 2005), exploiting the non-Gaussian likelihood directly. However, these algorithms require careful tuning and substantial experience with Bayesian computation for more complex models such as state space models. Routine Bayesian computation for these types of data more often relies on Markov chain Monte Carlo (MCMC) algorithms based on data augmentation. As shown by the seminal paper of Albert & Chib (1993), the binary probit model admits a latent variable representation where the latent variable equation is linear in the unknown parameters with an error term following a standard normal distribution. As the latent variables are easily sampled for known parameters, the latent variable representation admits a straightforward Gibbs sampler using one level of data augmentation, where the unknown parameters are sampled from a conditionally Gaussian model. This strategy also works for more complex models such as probit state space or random effects models. However, MCMC estimation based on data augmentation is less straightforward for a logit model, which still admits a latent variable representation that is linear in the unknown parameters, but exhibits an error following a logistic distribution. Related latent variable representations with non-Gaussian errors exist for MNL models (see e.g.
Frühwirth-Schnatter & Frühwirth (2010)) and binomial models (Fussl et al. 2013). While the latent variables can be sampled straightforwardly for any of these latent variable representations, sampling the unknown parameters is far from straightforward due to the non-Gaussian errors. A common resolution relies on a scale-mixture representation of the non-Gaussian error distribution and introduces the scales as a second level of data augmentation. Very conveniently, the unknown model parameters are then sampled from a conditionally Gaussian model. Examples include a representation of the logistic distribution involving the Kolmogorov-Smirnov distribution (Holmes & Held 2006) and very accurate finite scale-mixture approximations (Frühwirth-Schnatter & Frühwirth 2007, 2010, Frühwirth-Schnatter et al. 2009). A seminal paper in this context which avoids any explicit latent variable representation and works instead with a single level of data augmentation is Polson et al. (2013). They derive the Pólya-Gamma sampler, using a mixture representation of the non-Gaussian likelihood of the marginal model involving the Pólya-Gamma distribution. In the present paper, we propose a new sampler involving the Pólya-Gamma distribution introduced by Polson et al. (2013). However, as opposed to working with the marginal model, we introduce a new mixture representation of the logistic distribution based on the Pólya-Gamma distribution in the latent variable representation of the logit model. Similarly to Holmes & Held (2006) and Frühwirth-Schnatter & Frühwirth (2010), we expand data augmentation and introduce the Pólya-Gamma mixing variables as a second set of latent variables. Our new Pólya-Gamma mixture representation of the logit model has the advantage that the joint (conditional) posterior distribution of all data-augmented variables is easy to sample from, as the Pólya-Gamma mixing variable conditional on the latent variable follows a tilted Pólya-Gamma distribution. Conditional on all data-augmented variables, this allows the unknown model parameters to be sampled from a conditionally Gaussian model, even for complex settings such as state space or random effects models. One challenge when working with any MCMC method based on data augmentation is poor mixing, in particular in the case of imbalanced data, where the success probability is either close to zero or one for the majority of the observed data points, see the excellent work of Johndrow et al. (2019). Neither the original Pólya-Gamma sampler of Polson et al. (2013) with a single level of data augmentation nor our new Pólya-Gamma sampler with two levels of data augmentation is an exception to this rule. In the present paper we introduce imbalanced marginal data augmentation (iMDA) as a boosting strategy to make not only our new sampler, but also the original probit sampler of Albert & Chib (1993), robust to possibly imbalanced data. This strategy is inspired by earlier work on marginal data augmentation for binary and categorical data (van Dyk & Meng 2001, Imai & van Dyk 2005, McCulloch et al. 2000). Starting from a latent variable representation of the binary model, we move to an unidentified version of the latent variable representation with the help of two unidentified parameters. One parameter is a global scale parameter for the latent variable which usually takes the value one for the identified model. Introducing such an unidentified parameter has been shown to improve mixing considerably, among others by Imai & van Dyk (2005).
However, this strategy does not resolve the situation of highly imbalanced data. To this aim, we introduce in addition an unknown threshold parameter which takes the value zero for the identified model. As will be shown, this parameter restores balance in the unidentified model and in this way improves mixing considerably for unbalanced data, both for the sampler of Albert & Chib (1993) as well as for our new Pólya-Gamma samplers. As iMDA only works in the context of a latent variable representation, this strategy cannot be applied to the original Pólya-Gamma sampler of Polson et al. (2013) due to the lack of such a representation. The new Pólya-Gamma representation of the logit model is very generic and is easily combined with iMDA, not only for binary regression models, but also for more flexible models such as binary state space models, as the resulting full conditional of the state process is Gaussian, which allows multi-move sampling of the entire state process (Frühwirth-Schnatter 1994, Carter & Kohn 1994). We name any sampling strategy that combines a Pólya-Gamma mixture representation with iMDA an ultimate Pólya-Gamma sampler due to its efficiency. A further contribution of the present paper is to show that such an ultimate Pólya-Gamma sampler can be derived for more general non-Gaussian problems, including multinomial modelling of categorical data and modelling binomial data. For the MNL model, a logit model based on a partial dRUM representation is usually applied to sample the category specific parameters, see e.g. Holmes & Held (2006), Frühwirth-Schnatter & Frühwirth (2010), Polson et al. (2013). In the present paper, we derive a new latent variable representation of the MNL model involving a partial dRUM representation with two latent variables instead of a single one. One latent variable equation is linear in the unknown parameters and involves a logistic error distribution. As for binary logit models, we use the Pólya-Gamma mixture representation of the logistic distribution and introduce the mixing variables as additional latent variables. For binomial models, a latent variable representation has been introduced earlier by Fussl et al. (2013), but this representation does not provide an explicit choice equation. However, since an explicit choice equation is needed to apply iMDA, we derive a new latent variable representation for binomial data involving two latent variables. Both latent variable equations are linear in the unknown parameters and involve a generalized logistic error distribution. Extending the strategy introduced for binary logit models, we use Pólya-Gamma mixture representations of the generalized logistic distribution and introduce the Pólya-Gamma mixing variables as additional latent variables. Both for MNL models as well as for binomial models, this data augmentation scheme leads to a conditionally Gaussian posterior and allows all unknowns to be sampled through highly efficient block moves. Again, we apply iMDA to derive an ultimate Pólya-Gamma sampler which is shown to mix well also for highly unbalanced data. The rest of the paper is structured as follows. Section 2 introduces the ultimate Pólya-Gamma sampler for binary data. This sampling strategy is extended to multinomial data in Section 3 and to binomial data in Section 4. For all three types of models, the ultimate Pólya-Gamma sampler is compared to alternative algorithms for simulated data in Section 5. Section 6 considers binary state space models. Section 7 concludes.
2 Ultimate Pólya-Gamma samplers for binary data

2.1 Latent variable representations for binary data

Binary models for a set of N binary data y = (y_1, ..., y_N) are defined by

Pr(y_i = 1 | λ_i) = F_ε(log λ_i),   (1)

where λ_i depends on exogenous variables and unknown parameters β, e.g. log λ_i = x_i β in a standard binary regression model. Choosing the cdf F_ε(ε) = Φ(ε) of the standard normal distribution leads to the probit model Pr(y_i = 1|λ_i) = Φ(log λ_i), whereas the cdf F_ε(ε) = e^ε/(1 + e^ε) of the logistic distribution leads to the logit model Pr(y_i = 1|λ_i) = λ_i/(1 + λ_i). A latent variable representation of model (1) involving the latent utility z_i is given by:

z_i = log λ_i + ε_i, ε_i ~ f_ε(ε), y_i = I{z_i > 0},   (2)

where f_ε(ε) = F'_ε(ε) is equal to the standard normal pdf φ(ε) for a probit model and equal to f_ε(ε) = e^ε/(1 + e^ε)² for a logit model. While MCMC estimation based on (2) is straightforward for the probit model using one level of data augmentation involving the latent utilities (z_1, ..., z_N) (Albert & Chib 1993), for the logit model a second level of data augmentation is required, based on a mixture representation of the logistic distribution. In the present paper, we apply a new mixture representation of the logistic distribution, ε_i|ω_i ~ N(0, 1/ω_i), where ω ~ PG(2, 0) follows a Pólya-Gamma distribution (Polson et al. 2013) with parameters b = 2 and κ = 0, see Appendix A.1.1 and A.1.2 for details. This representation is very convenient, as the conditional posterior of ω_i given ε_i is equal to a tilted Pólya-Gamma distribution, i.e. ω_i|ε_i ~ PG(2, |ε_i|), which is easy to sample from, see Polson et al. (2013). Hence, whenever log λ_i is linear in the unknown parameters β, such as log λ_i = x_i β for a binary regression model, this new representation allows the construction of a Pólya-Gamma sampler that (a) samples the unknown parameters β conditional on the latent variables (z_1, ..., z_N) and (ω_1, ..., ω_N) from a conditionally Gaussian model, and (b) samples z_i, ω_i|λ_i, y_i independently for each i by sampling the latent variable z_i from z_i|λ_i, y_i in the original latent variable model (2) in the usual manner (see Appendix A.3.1) and sampling the mixing parameter from ω_i|z_i, λ_i ~ PG(2, |z_i − log λ_i|). While the sampler outlined above is easy to implement, such data augmentation schemes are known to be slowly mixing for imbalanced data sets where only a few cases with y_i = 1 or y_i = 0 among the N data points are observed, see e.g. Johndrow et al. (2019). To deal with this issue, we combine our new Pólya-Gamma mixture representation with marginal data augmentation (van Dyk & Meng 2001, Imai & van Dyk 2005, McCulloch et al. 2000). Given a global scale parameter δ and a threshold parameter γ, we transform the latent variable z_i such that

z̃_i = √δ z_i + γ   (4)

to define the following expanded model:

z̃_i = log λ̃_i + ε̃_i, ε̃_i = √δ ε_i, y_i = I{z̃_i > γ}.   (5)

Transformation (4) implies that log λ̃_i = √δ log λ_i + γ. For a binary regression model with log λ_i = x_i β, this corresponds to transforming the parameters as β̃_j = √δ β_j + I{j = d}γ. The global scale parameter δ and the threshold parameter γ, however, are not identified, since the marginal version of the expanded model (5), where the latent variables (z̃_1, ..., z̃_N) are integrated out, is independent both of γ and δ (6). Various MCMC schemes based on the expanded model (5) are possible. We consider here marginal data augmentation, where the following prior p(γ, δ) = p(γ|δ)p(δ) is defined on the parameters γ and δ with fixed hyperparameters d_0, D_0 and G_0:

γ|δ ~ N(0, δG_0), δ ~ G⁻¹(d_0, D_0).   (7)

Note that the joint prior p(γ, δ) can also be written as p(δ|γ)p(γ), where p(γ) is a Student-t distribution and p(δ|γ) is again of inverted Gamma form (8). The prior for the model parameters is defined by specifying a prior for the parameter β̃ in the expanded model (5), which is allowed to depend on the global scale parameter δ.
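To make the two-level data augmentation concrete, the following sketch draws the latent utility z_i from its truncated logistic posterior and the mixing variable ω_i from the tilted conditional PG(2, |z_i − log λ_i|), as in step (b) above. This is a minimal illustration rather than the authors' implementation: the Pólya-Gamma draw uses the truncated infinite-sum-of-gammas representation of Polson et al. (2013) as an approximation, and all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_pg(b, c, n_terms=200):
    # Approximate PG(b, c) draw via the (truncated) infinite sum of gammas:
    # omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    # with g_k ~ Gamma(b, 1); exact samplers exist, this is a simple stand-in.
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(shape=b, scale=1.0, size=n_terms)
    return (g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))).sum() / (2 * np.pi ** 2)

def sample_z_omega(y, log_lam):
    # Step (b) of the text: draw z_i | lambda_i, y_i from the logistic posterior
    # truncated to the correct sign region via the inverse-cdf trick, then draw
    # the tilted Polya-Gamma mixing variable omega_i | z_i, lambda_i.
    pi = 1.0 / (1.0 + np.exp(-log_lam))     # pi_i = F_eps(log lambda_i)
    u = rng.uniform(size=y.shape)
    p = y + u * (1.0 - y - pi)              # cdf level inside the truncation region
    eps = np.log(p) - np.log1p(-p)          # logistic quantile F^{-1}(p)
    z = log_lam + eps
    omega = np.array([sample_pg(2, abs(e)) for e in eps])
    return z, omega
```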
Let z̃ summarize all latent variables in the model, i.e. z̃ = (z̃_1, ..., z̃_N) for a probit model and z̃ = (z̃_1, ..., z̃_N, ω_1, ..., ω_N) for a logit model. Marginal data augmentation is based on (a) sampling the latent variables z̃ from p(z̃|β, y) and (b) sampling the unknown parameters β from p(β|z̃, y), without conditioning on δ and γ. Step (a) is carried out by sampling from the augmented posterior p(z̃, δ, γ|β, y) = p(z̃|δ, γ, β, y)p(δ, γ|β, y) by first drawing (δ, γ) from the conditional prior p(δ, γ|β), then drawing z̃ conditional on (δ, γ), and finally discarding (δ, γ). To sample from z̃|δ, γ, β, y, we draw (z_i, ω_i) in the identified model (2) and define z̃_i according to transformation (4). In each sweep of the sampler, we are dealing with a different expanded model. This allows much larger moves than in conventional sampling, where δ ≡ 1 and γ ≡ 0 for all MCMC draws. Step (b) is carried out by joint sampling from p(β, δ, γ|z̃, y) and dropping δ and γ. To this aim, we sample (β̃, δ_new, γ_new) ~ p(β̃, δ_new, γ_new|z̃, y) in the expanded model (5) and match back to (2) through the inverse of (4), log λ_i = (log λ̃_i − γ_new)/√δ_new. Using representation (8) of p(δ, γ), the joint posterior p(β̃, δ, γ|z̃, y) factorizes for any binary model as p(β̃, δ, γ|z̃, y) ∝ p(y|γ, z̃)p(γ)p(δ|γ)p(z̃|β̃, δ)p(β̃|δ). Thus, we first sample γ from the marginal posterior p(γ|y, z̃) ∝ p(y|γ, z̃)p(γ), which is equal to the Student-t prior given in (8), however truncated to the region L(z̃) ≤ γ < O(z̃) depending on the latent variables z̃_i, where L(z̃) = max_{i: y_i = 0} z̃_i and O(z̃) = min_{i: y_i = 1} z̃_i, if 0 < #{i : y_i = 1} < N. If #{i : y_i = 0} = 0, then L(z̃) = −∞; if #{i : y_i = 1} = 0, then O(z̃) = +∞. Hence, γ is sampled by the inverse-cdf method as

γ_new = T⁻¹_ν( T_ν(L(z̃)) + U (T_ν(O(z̃)) − T_ν(L(z̃))) ),   (10)

where U ~ U[0, 1] and T_ν(·) and T⁻¹_ν(·) are, respectively, the cdf and the quantile function of an (appropriately scaled) t_ν-distribution. Conditional on γ, we sample (β̃, δ) from p(β̃, δ|γ, z̃) = p(β̃|z̃, δ)p(δ|z̃, γ). The specifics depend on the chosen model for λ_i. We consider now in detail a binary regression model, where log λ_i = x_i β and x_i,d ≡ 1 corresponds to the intercept. In the expanded model (5), the linear predictor is log λ̃_i = x_i β̃, where β̃_j = √δ β_j + I{j = d}γ (11). We specify the prior β̃|δ ~ N_d(0, δA_0) in the expanded model. Using the inverse of (4), the conditional prior p(β|γ, δ) can be derived (12). From (12) we obtain that p(δ, γ|β) ∝ p(β_d|δ, γ)p(γ|δ)p(δ) is proportional to p(γ|δ, β_d)p(δ), where p(δ) is given by (7) and p(γ|δ, β_d) is equal to the normal distribution N(√δ g_1, δG_1) given in (13). Hence, to sample from the conditional prior p(δ, γ|β), δ is sampled from G⁻¹(d_0, D_0) and γ|δ, β_d is sampled from N(√δ g_1, δG_1). Next, for each i the latent variables (z_i, ω_i) are sampled from the identified model (2) and transformation (4) yields the latent variables (z̃_i, ω_i) in the expanded model. Discarding γ and δ completes Step (a).

Algorithm 1 (The ultimate Pólya-Gamma sampler for binary data). Choose starting values for λ = (λ_1, ..., λ_N) and repeat the following steps:

(a-1) For each i = 1, ..., N, sample z_i = log λ_i + F⁻¹_ε(y_i + U_i(1 − y_i − π_i)) in model (2), where U_i ~ U[0, 1], π_i = F_ε(log λ_i), and F⁻¹_ε(p) = Φ⁻¹(p) for the probit and F⁻¹_ε(p) = log p − log(1 − p) for the logit model. For a logit model, sample ω_i|z_i, λ_i ~ PG(2, |z_i − log λ_i|).

(a-2) Sample δ' ~ G⁻¹(d_0, D_0) and γ' ~ γ|δ', λ. For a binary regression model, γ|δ', β_d is given in (13). Move to the unidentified model by defining z̃_i = √δ' z_i + γ' for i = 1, ..., N.

(b-1) Sample γ_new from the truncated t-prior p(γ) using (10).
Step (b) is completed conditional on γ: the marginal posterior δ|z̃, γ is inverted Gamma, while the conditional posterior β̃|δ, z̃, γ ~ N_d(b_N, δB_N) is multivariate normal. The posterior draw of β̃ is transformed back to β using (11). The ultimate Pólya-Gamma (UPG) sampler for binary data is summarized in Algorithm 1. Note that for the probit model the marginal data augmentation scheme of Imai & van Dyk (2005) would result if γ ≡ 0 were fixed to zero. Our data augmentation scheme is much more general than that. First of all, it works for any binary data model with a latent variable representation. Second, freeing the location of the threshold γ in model (5) restores balance, with γ being considerably larger than zero if only a few cases with y_i = 1 are present among the N binary observations, as will be illustrated in Section 2.3. The imbalanced marginal data augmentation (iMDA) scheme introduced in this section leads to an MCMC scheme that is well mixing, even in cases of extremely unbalanced data, see Section 5 for further illustration. As a first illustration of the potential merits of the UPG approach, we compare estimation efficiency of the popular Pólya-Gamma sampler from Polson et al. (2013) with UPG in an imbalanced data setting in Figure 1. It is clearly visible that, while both samplers simulate from the same posterior distribution, the UPG approach outperforms the conventional Pólya-Gamma sampler considerably in terms of efficiency. Note that this efficiency gain is realized in spite of an additional layer of data augmentation to sample the latent utilities z_i. While this second layer of data augmentation allows for straightforward model extensions due to a conditionally Gaussian likelihood representation, it usually increases autocorrelation in the posterior draws significantly. This is counteracted by our novel MCMC boosting strategy based on the working parameters γ and δ. The interplay of these two working parameters is illustrated for the case of a logistic regression model in Figure 2. In case of imbalanced data where only a few data points are successes, most of the latent utilities z_i will be negative. On the contrary, z_i will mostly be greater than zero if the majority of data points are successes. In the balanced case, z_i is concentrated around zero. The role of γ is to shift the mean of the latent utilities towards zero in order to restore balance in the unidentified model. At the same time, z_i is rescaled using δ. Ultimately, this results in an estimation setup where the boosted utilities z̃_i = √δ z_i + γ are re-centered and re-scaled such that the resulting distribution exhibits fat tails and a zero mean. This leads to an improvement in the signal-to-noise ratio in the boosted model when compared to the unboosted model, effectively allowing for more efficient MCMC estimation. In addition to this intuitive explanation of our boosting strategy, the role of δ and γ can be explored in a more technical context as well. Variants of the working parameter δ have been discussed, for instance, in Imai & van Dyk (2005). Put briefly, δ is used to ensure that the posterior of the boosted utilities z̃_i becomes more diffuse than the posterior of the unboosted utilities z_i. Through this diffusion, autocorrelation in the posterior draws will decrease as z̃_i becomes less dependent on the current values of the remaining model parameters (e.g. the regression coefficients β). This allows the sampler to move through the parameter space more quickly.
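A sketch of Step (b) for the binary regression case, assuming the marginal t prior for γ implied by γ|δ ~ N(0, δG_0) and δ ~ G⁻¹(d_0, D_0) (a t distribution with 2d_0 degrees of freedom and scale sqrt(G_0 D_0/d_0)); the truncation bounds L(z̃), O(z̃) and the Gaussian update for β̃ follow the formulas above. Function names are ours; for a probit model, set all ω_i = 1.

```python
import numpy as np
from scipy.stats import t as student_t

def sample_gamma_trunc(z_tilde, y, d0, D0, G0, rng):
    # gamma | y, z-tilde: the marginal Student-t prior truncated to [L, O),
    # with L = max_{y_i=0} z_i and O = min_{y_i=1} z_i, drawn by inverse cdf.
    L = z_tilde[y == 0].max() if np.any(y == 0) else -np.inf
    O = z_tilde[y == 1].min() if np.any(y == 1) else np.inf
    nu = 2.0 * d0
    scale = np.sqrt(G0 * D0 / d0)           # implied marginal t scale (assumption)
    lo, hi = student_t.cdf(L / scale, nu), student_t.cdf(O / scale, nu)
    return scale * student_t.ppf(rng.uniform(lo, hi), nu)

def sample_beta_tilde(z_tilde, omega, X, A0, delta, rng):
    # beta-tilde | z-tilde, omega, delta ~ N(b_N, delta * B_N), where
    # B_N = (X' Omega X + A0^{-1})^{-1} and b_N = B_N X' Omega z-tilde.
    B_inv = X.T @ (omega[:, None] * X) + np.linalg.inv(A0)
    B_N = np.linalg.inv(B_inv)
    b_N = B_N @ (X.T @ (omega * z_tilde))
    return rng.multivariate_normal(b_N, delta * B_N)
```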
A formal proof that this strategy improves convergence of data augmentation algorithms can be found in Meng & Van Dyk (1999). Theoretical results on MCMC sampling efficiency from the perspective of computational complexity are provided in Johndrow et al. (2019). One key insight is that data augmentation strategies in imbalanced categorical data sets result in a sampling behavior where the average step size of the sampler is too small to explore the whole posterior distribution efficiently. In our boosting strategy, this issue is effectively offset by restoring pseudo-balance through the working parameter γ, allowing the UPG sampler to efficiently explore the entire posterior distribution.

3 Ultimate Pólya-Gamma samplers for multinomial data

Let {y_i} be a sequence of categorical data, i = 1, ..., N, where y_i is equal to one of at least three unordered categories. The categories are labeled by L = {0, ..., m}, and for any k the set of all categories but k is denoted by L_{−k} = L \ {k}. We assume that the observations are mutually independent and that for each k ∈ L the probability of y_i taking the value k depends on covariates x_i in the following way:

Pr(y_i = k | β_0, ..., β_m) = exp(x_i β_k) / Σ_{l∈L} exp(x_i β_l),   (17)

where β_0, ..., β_m are category specific unknown parameters of dimension d. To make the model identifiable, the parameter β_{k_0} of a baseline category k_0 is set equal to 0. Thus the parameter β_k is interpreted in terms of the change in log-odds relative to the baseline category k_0. In the following, we assume without loss of generality that k_0 = 0. A more general version of the multinomial logit (MNL) model (again with baseline k_0 = 0) reads:

Pr(y_i = k | λ_1i, ..., λ_mi) = λ_ki / (1 + Σ_{l=1}^m λ_li),   (18)

where λ_1i, ..., λ_mi depend on unknown parameters β. For the standard MNL regression model (17), for instance, log λ_ki = x_i β_k for k = 1, ..., m. In this section, we consider a new random utility representation of a MNL model with at least 3 categories for the general formulation (18). The starting point is writing the MNL model as a random utility model (RUM), see McFadden (1974):

u_ki = log λ_ki + ε_ki, y_i = argmax_{l∈L} u_li.   (19)

Thus the observed category is equal to the category with maximal utility. If the errors ε_0i, ..., ε_mi in (19) are iid from an extreme value (EV) distribution, the MNL model results. To sample the category specific parameters in log λ_ki conditional on the remaining parameters, it is common practice to reduce the RUM (20) to a binary logit model with observations d_ki = I{y_i = k} (Frühwirth-Schnatter & Frühwirth 2010). However, the UPG sampler introduced in Section 2 cannot be applied, since the parameters for all categories in L_{−k} appear as an offset in the latent variable representation. Our new data augmentation scheme resolves this problem by reducing the data to three categories: category k, the baseline category, and a category which collapses the remaining categories A = L \ {0, k}. For all categories in A, we define an aggregated utility u_a,i = max_{l∈A} u_li as the maximal utility in A. It can be shown that exp(−u_a,i) = min_{l∈A} exp(−u_li) ~ E(λ_a,i), where λ_a,i = Σ_{l∈A} λ_li. Evidently, for m = 1, A is the empty set, as there are no alternative categories, and we obtain the binary logit model. For each specific category k = 1, ..., m, the full RUM (19) can be represented as the following aggregated RUM (21), based on the latent utilities (u_ki, u_0i, u_a,i), where the errors ε_ki, ε_0i, ε_a,i are iid from an EV distribution. To derive an efficient sampler to estimate the category specific parameters in log λ_ki conditional on u_a,i, it is useful to rewrite (21) in the following way:

u_ki = u_0i + log λ_ki + ε_ki,   (24)

where ε_ki ~ LO follows a logistic distribution, independent of ε_a,i, and the choice equation is the same as in (23). We term this model the aggregated RUM representation.
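The aggregated utilities are easy to simulate through the exponential-race view of the RUM: V_l = exp(−u_li) are independent E(λ_li) variables, the observed category has the minimal V, the minimum is E(λ_ki + 1 + λ_a,i), and the losers' excesses over the minimum are again exponential by memorylessness. The sketch below implements this construction for one observation; it is our illustration of the mechanics that Theorem 1 below formalizes, with hypothetical function names.

```python
import numpy as np

def sample_agg_utilities(winner, lam_k, lam_a, rng):
    # winner in {'k', '0', 'a'}: whether category k, the baseline 0, or the
    # aggregate a attained the maximal utility (i.e. the minimal V = exp(-u)).
    lams = {'k': lam_k, '0': 1.0, 'a': lam_a}   # lambda_0i = 1 for the baseline
    v_min = rng.exponential(1.0 / sum(lams.values()))
    v = {}
    for cat, lam in lams.items():
        if cat == winner:
            v[cat] = v_min                      # the winner sits at the minimum
        else:
            v[cat] = v_min + rng.exponential(1.0 / lam)   # memoryless excess
    return {cat: -np.log(val) for cat, val in v.items()}  # u_l = -log V_l

# usage: y_i = k maps to winner='k', y_i = 0 to winner='0', any l in A to 'a'
rng = np.random.default_rng(7)
u = sample_agg_utilities('k', lam_k=0.8, lam_a=1.7, rng=rng)
```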
Conditional on having observed y_i, the posterior distribution p(u_i|λ_ki, y_i) of the latent variables u_i = (u_ki, u_0i, u_a,i) is of closed form and easy to sample from, see Theorem 1, which is proven in Appendix A.2.

Theorem 1. Draws from the posterior distribution p(u_i|λ_ki, y_i) can be represented in closed form as given in (25).

For the standard MNL regression model, the parameter β_k corresponding to category k appears only in a linear regression model with logistic errors, given by (24). The term log λ_a,i in (22) depends only on the regression parameters β_{−k} of all other categories in A and, as opposed to the original partial dRUM, log λ_a,i does not appear as an offset in the latent equation (21), from which we estimate the category specific parameters β_k. Rather, the information of all other categories is sufficiently summarized in the latent variable u_a,i, which affects only the choice equation. Conditional on the aggregated utility u_a,i, the logit model (24) is independent of β_{−k} and we can proceed similarly as in Section 2 to derive an ultimate Pólya-Gamma sampler by exploiting the Pólya-Gamma mixture representation of the logistic distribution in (24) with latent variables ω_ki, i = 1, ..., N. To protect our sampler against imbalanced data, we apply iMDA with category specific parameters γ_k and δ_k, using the transformations in (26). Transformation (26) leads to the following unidentified model (27). The priors for the parameters δ_k and γ_k are chosen as for the binary model: the joint prior p(δ_k, γ_k) of γ_k and δ_k factors as in (7). Comparable to the binary case, we define a prior for β̃_k in the boosted model (27). For a MNL regression model, e.g., we assume the prior β̃_k|δ_k ~ N_d(0, δ_k A_0) in the boosted model defined by (27) and (24). Based on (27), Algorithm 1 can be extended in a fairly straightforward manner. The ultimate Pólya-Gamma sampler for multinomial data is summarized in Algorithm 2, with latent variables z̃_k = (ũ_1, ..., ũ_N, ω_k1, ..., ω_kN), where ũ_i = (ũ_ki, ũ_0i, ũ_a,i). The choice equation (28) implies the following constraint (30) for γ_k, conditional on z̃.

Algorithm 2 (The ultimate Pólya-Gamma sampler for multinomial data). Choose starting values for λ = (λ_1, ..., λ_m), λ_k = (λ_k1, ..., λ_kN), and repeat for k = 1, ..., m:

(a-1) For each i = 1, ..., N, sample the latent variables u_i = (u_ki, u_0i, u_a,i) ~ p(u_i|λ_ki, y_i) in the aggregated RUM model (21) of category k using Theorem 1. Sample ω_ki|λ_ki, u_i ~ PG(2, |u_ki − u_0i − log λ_ki|) in the model (24).

(a-2) Sample δ'_k and γ'_k from the conditional prior and define ũ_i, i = 1, ..., N, via transformation (26) to move to the unidentified model (27).

γ_k is drawn from its t-prior (29), however truncated to a region depending on z̃: given U ~ U[0, 1], γ_k is sampled by the inverse-cdf method as in Section 2. Conditional on γ_k, we sample from δ_k|z̃_k, γ_k and β̃_k|z̃_k, δ_k, see Step (b) in Algorithm 2.

4 Ultimate Pólya-Gamma samplers for binomial data

In this section, we consider models with binomial outcomes, i.e. models of the form y_i ~ BiNom(N_i, π_i) with π_i = λ_i/(1 + λ_i), and with log λ_i = x_i β for a binomial regression model. As shown by Johndrow et al. (2019), Bayesian inference for binomial regression models based on the Pólya-Gamma sampler (Polson et al. 2013) is sensitive to imbalanced data. A latent variable representation of binomial models introduced by Fussl et al. (2013) is also sensitive to unbalanced data, as shown in Section 5. As for a logit model, which results for N_i ≡ 1, iMDA would be an option to boost MCMC; however, Fussl et al. (2013) provide no explicit choice equation, which is needed for iMDA.
The goal of this section is to define an ultimate Pólya-Gamma sampler for binomial outcomes by combining a new latent variable representation of binomial models, based on a Pólya-Gamma representation of the generalized logistic distribution, with iMDA to protect the algorithm against imbalanced data. We introduce in Theorem 2 a new latent variable representation for binomial outcomes involving latent variables such that the latent variable equations are linear in log λ_i and the errors follow a generalized logistic distribution. An explicit choice equation is provided which relates the latent variables to the observations y_1, ..., y_N. We show in Theorem 3 that, conditional on y_i, the posterior distribution of the latent variables is of closed form and easy to sample from. See Appendix A.2 for a proof of Theorems 2 and 3.

Theorem 2 (Latent variable representation of a binomial model). For 0 < y_i < N_i, the binomial model has the random utility representation (34), involving two latent variables w_i = log λ_i + ε_w,i and v_i = log λ_i + ε_v,i with choice equation v_i < 0 < w_i, where ε_w,i ~ GL_II(y_i) and ε_v,i ~ GL_I(N_i − y_i), and GL_I(ν) and GL_II(ν) are, respectively, the generalized logistic distributions of type I and type II. For y_i = 0, the model reduces to a representation involving only v_i; for y_i = N_i, it reduces to a representation involving only w_i. For N_i = 1, the logistic model results, as both GL_I(ν) and GL_II(ν) reduce to a logistic distribution for ν = 1.

Theorem 3 provides closed-form representations (35) and (36) of the conditional draws w_i|y_i, λ_i and v_i|y_i, λ_i, where W_i and V_i are iid uniform random numbers. The following steps are the main building blocks for the ultimate Pólya-Gamma sampler for binomial data. We use a Gaussian mixture representation of the generalized logistic distribution based on the Pólya-Gamma distribution and apply iMDA to achieve robustness against imbalanced data. A random variable ε following the generalized logistic distribution of type I or type II can be represented as a mixture of normals using the Pólya-Gamma distribution as mixing measure, where ξ = κ/ω and κ = a − b/2, and ω ~ PG(b, 0). For y_i > 0, the type II generalized logistic distribution ε_w,i ~ GL_II(y_i) in (34) has such a representation with b_w,i = y_i + 1 and κ_w,i = (1 − y_i)/2, see (A.11). Similarly, for y_i < N_i, the type I generalized logistic distribution ε_v,i ~ GL_I(N_i − y_i) in (34) has such a representation with b_v,i = N_i − y_i + 1 and κ_v,i = (N_i − y_i − 1)/2, see (A.7). Note that κ_w,i = 0 for y_i = 1 and κ_v,i = 0 for y_i = N_i − 1. Hence, for N_i = 1, the Pólya-Gamma mixture representation of a logistic model involving PG(2, 0) results.

Algorithm 3 (The ultimate Pólya-Gamma sampler for binomial data). Choose starting values for λ = (λ_1, ..., λ_N) and repeat the following steps:

(a1) For each i = 1, ..., N, sample w_i|λ_i, (y_i > 0) and v_i|λ_i, (y_i < N_i) using Theorem 3, and sample ω_w,i|w_i, λ_i, (y_i > 0) and ω_v,i|v_i, λ_i, (y_i < N_i) using (38).

(a2) Sample δ' ~ G⁻¹(d_0, D_0) and γ' ~ γ|δ', λ. For a binomial regression model, the conditional prior γ|δ', β_d is given in the Appendix. Define w̃_i and ṽ_i, i = 1, ..., N, via transformation (39) to move to the unidentified model.

In Step (a1), holding all unknown parameters in log λ_i fixed, the latent variables w_i|λ_i, (y_i > 0) and v_i|λ_i, (y_i < N_i) are sampled from Theorem 3 without conditioning on ω_w,i and ω_v,i. Given the latent variables w_i and v_i, the parameters ω_w,i|w_i, (y_i > 0), λ_i and ω_v,i|v_i, (y_i < N_i), λ_i are independent and follow (tilted) Pólya-Gamma distributions (38). To protect our sampler against imbalanced data, we apply iMDA and transform the latent variables w_i and v_i as for a logit model,

w̃_i = √δ w_i + γ, ṽ_i = √δ v_i + γ,   (39)

which implies log λ̃_i = √δ log λ_i + γ. Transformation (39) leads to an unidentified, boosted version (40) of model (34) with a choice equation (41) involving γ. The parameters δ and γ have the same prior as in the logit model, see (7).
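As a cross-check of the aggregated representation, (w_i, v_i) can also be drawn by brute force: conditional on y_i successes, the winning utilities are logistic draws truncated to (0, ∞), the losing ones truncated to (−∞, 0), and w_i, v_i are the relevant order statistics (see the proof in Appendix A.2). The sketch below does this in O(N_i) time per observation; Theorem 3 replaces it with two direct draws via the generalized logistic quantile functions. Function names are ours.

```python
import numpy as np

def sample_wv_bruteforce(y, n_trials, log_lam, rng):
    # Draw (w_i, v_i) for one binomial observation: the y winning utilities are
    # logistic(location=log_lam) truncated to (0, inf), the n - y losing ones to
    # (-inf, 0); w is the smallest winner, v the largest loser.
    pi = 1.0 / (1.0 + np.exp(-log_lam))

    def trunc_logistic(lo_cdf, hi_cdf, size):
        # inverse-cdf draw from the truncated logistic utility distribution
        p = rng.uniform(lo_cdf, hi_cdf, size)
        return log_lam + np.log(p) - np.log1p(-p)

    w = trunc_logistic(1.0 - pi, 1.0, y).min() if y > 0 else None
    v = trunc_logistic(0.0, 1.0 - pi, n_trials - y).max() if y < n_trials else None
    return w, v

rng = np.random.default_rng(4)
w, v = sample_wv_bruteforce(y=3, n_trials=10, log_lam=-0.5, rng=rng)
```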
The choice equation (41) implies the following constraint for γ, conditional on z̃ = (z̃_1, ..., z̃_N), analogous to the binary case. However, due to the presence of the (fixed) location parameters ξ_w,i and ξ_v,i in the likelihood, the conditional posterior of δ is no longer of inverted Gamma form; δ is instead drawn by resampling from an auxiliary prior, as described in the Appendix.

5 Simulation study

In the main simulation results, we fix N = 1000 for probit, logit and multinomial logit models. For the multinomial case, we consider three categories. Finally, for binomial models, we fix N = 500 and N_i = 10. Estimating the spectral density at 0 is accomplished via the R package coda (Plummer et al. 2006) and is based on fitting an autoregressive process to the posterior draws. The results of the main simulation exercise are summarized in Figure 3. The results show that standard data augmentation techniques exhibit extremely inefficient sampling behavior when increasing data sparsity, as pointed out in Johndrow et al. (2019). While this general pattern can be observed for the UPG approach as well, our novel boosting strategy clearly alleviates this issue and allows for rather efficient estimation also in imbalanced data settings. A direct comparison between the UPG approach and the original Pólya-Gamma sampler shows that the former clearly outperforms the latter in terms of efficiency in settings with imbalanced data. That is, UPG samplers are the preferable data augmentation samplers for imbalanced data cases. However, in data sets that are very balanced, the original Pólya-Gamma sampler exhibits the highest observed efficiency levels. This is most likely due to the additional layer of latent variables that the UPG approach utilizes compared to P&S. Nevertheless, note that all data augmentation algorithms under consideration even surpass the performance of the efficiency "gold standard" AMH in balanced scenarios. In other words, these settings correspond to data sets where inefficient estimation plays a minor role to begin with. In summary, UPG as well as Metropolis-Hastings based estimation allow for efficient sampling across a wide range of settings when compared to standard data augmentation algorithms. This can also be seen through the extensive additional simulations that are summarized in subsection A.5. It is worth mentioning that the UPG approach has some benefits compared to Metropolis-Hastings based samplers. First, no tuning is necessary. Second, one of the major advantages of the UPG approach is that, for all models discussed in this paper, the conditional posterior of the regression effects is multivariate normal. This enables the researcher to readily rely on the large statistical toolbox of modifications and extensions of the linear regression model that has been developed over the past decades.

Interestingly, all samplers exhibit asymmetric inefficiency factors with respect to baseline sparsity in the MNL case. That is, one has to differentiate between cases with an increasingly sparse baseline (positive intercept range) and increasingly sparse non-baseline groups (negative intercept range). A sparse baseline seems to increase inefficiency in all observed cases. This is rather intuitive, given that intercepts are the log of the probability of a specific category divided by the log of the probability of the baseline/reference. Generally, the reference category should be a "common" rather than a "rare" category. However, this problem can be avoided in applied statistical modelling, as the baseline can be chosen by the researcher, effectively eliminating this issue.
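For reference, inefficiency factors of the kind reported in this section can be estimated from the posterior draws as the ratio of the spectral density at zero to the draw variance, IF = 1 + 2 Σ_k ρ_k. The sketch below uses a simple positive-autocorrelation cutoff instead of the AR-based spectral estimate used by coda, so the numbers will differ slightly; the function name is ours.

```python
import numpy as np

def inefficiency_factor(draws, max_lag=None):
    # IF = 1 + 2 * sum_k rho_k, truncating the sum at the first negative
    # empirical autocorrelation (coda fits an AR process instead).
    x = np.asarray(draws, dtype=float)
    x = x - x.mean()
    n = len(x)
    acov = np.correlate(x, x, mode='full')[n - 1:] / np.arange(n, 0, -1)
    rho = acov / acov[0]
    s = 0.0
    for r in rho[1:(max_lag or n // 2)]:
        if r < 0:
            break
        s += r
    return 1.0 + 2.0 * s
```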
We will proceed to demonstrate this convenient fact through applying the UPG approach to one of the most general modeling frameworks in Section 6, where we discuss binary state space modeling and present a brief real data application. Let {y t } be a time series of binary observations, observed for t = 1, . . . , T . Each y t is assumed to take one of two possible values, labelled by {0, 1}. The probability that y t takes the value 1 depends on covariates x t , including a constant (i.e. x td ≡ 1), through time-varying parameters β t in the following way: We assume that conditional on knowing β 1 , . . . , β T , the observations are mutually independent. A commonly used model for describing the time-variation of β t reads: where Q = Diag (θ 1 , . . . , θ d ). The expected initial values β = (β 1 , . . . , β d ) and the variances θ 1 , . . . , θ d are unknown. To complete the model specification, the distribution of the initial value β 0 has to be specified. We will discuss this choice in connection with our boosting strategy. We show in this section that Algorithm 1 can be quite easily extended to binary state space models, as Step (a) is generic conditional on knowing log λ t = x t β t . A latent utility z t of choosing category 1 is introduced for each y t : For a probit link, (45) is a conditionally Gaussian state space model (SSM), whereas for a logit link, ε t follows a logistic distribution and (45) is a conditionally non-Gaussian. However, using the Pólya Gamma mixture representation of the logistic distribution as in Section 2 yields a (high-dimensional) Gaussian posterior for the state process β = {β 0 , β 1 , . . . , β T }, which allows multi-move sampling of the entire state process using FFBS (Frühwirth-Schnatter 1994 , Carter & Kohn 1994 ) in a similar fashion as for a probit SSM. To achieve robustness again imbalance, we apply iMDA as introduced in Section 2: to transform the SSM (45) to the following unidentified SSM β jt =β j,t−1 + w jt , w jt ∼ N (0, δθ j ) , j = 1, . . . , d, where transformation (46) implies β j = √ δβ j + I{j = d}γ,β jt = √ δβ jt + I{j = d}γ, t = 0, . . . , T, j = 1, . . . , d. (48) The ultimate Pólya-Gamma sampler for binary SSMs is summarized in Appendix A.4, Algorithm 4. To illustrate the gains in sampling efficiency for binary state space models, we apply the framework outlined in the previous subsections to a dataset on global pandemics. The data covers 220 years from 1800 to 2020. We define a pandemic as a disease episode that can be q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q This leaves us with a total of eight pandemic events, starting with a bubonic plague outbreak between 1855 and 1860. With more than 12,000,000 deaths in China and India alone, this is considered the third major outbreak of the plague in human history, following prior pandemics in the 6th and 14th century. As second event in the 19th century, we consider a major influenza outbreak from 1889-1890, also known as "Russian flu" or "Asiatic flu". The beginning of the 20th century was hit especially hard by pandemics, starting with an outbreak of Encephalitis lethargica from 1915-1926. 
Within the same time period, the well known Spanish flu pandemic took place between 1918 and 1920. Being one of the deadliest recorded pandemics, the Spanish flu ultimately infected around 25% of the world's population. After the second world war, we identify four major global pandemic events. Still in the 20th century, two influenza A strains known as "Asian flu" (1957-1958) and "Hong Kong flu" (1968-1969) killed around 3,000,000 people. In more recent history, a H1N1 virus ("swine flu") spread globally in 2009 and 2010, leading to up to 575,000 deaths. Finally, the most recent event is the ongoing global outbreak of COVID-19, starting in 2019 and claiming around 1,294,000 victims as of November 13, 2020 (sourced from https://en.wikipedia.org/wiki/List_of_epidemics and sources therein). For years featuring a global pandemic, the binary indicator y_t equals 1. In years without major pandemic events, y_t = 0. A pandemic is observed in roughly 1 out of 8 years, with extremely high autocorrelation, rendering the data set relatively imbalanced. To illustrate the efficiency gains arising from our boosting algorithm, a logistic local level model is fitted to the data twice, once with and once without boosting, enabling us to directly compare these samplers in terms of efficiency. We iterate the Gibbs sampler 500,000 times after a burn-in period of 20,000 iterations. The resulting posterior distribution is shown in Figure 4. The resulting inefficiency factors are provided in Figure 5. Clearly, long periods of persistence in the binary outcome time series lead to an aggravated efficiency loss in the sampler. On the contrary, during periods where y_t is likely to change (corresponding to a predicted probability around 0.5), sampling efficiency is rather high. Summarizing, the boosted sampling algorithm described in Section 6.1 is able to drastically reduce sampling inefficiency, even during prolonged periods of no change in the outcome variable. Hence, the UPG approach can be considered a useful tool for efficient discrete choice analysis, not only in simple regression models, but also in more complex settings such as state space frameworks.

7 Conclusion

Due to a wide range of applications in many areas of applied science, much attention has been dedicated towards the development of estimation algorithms for generalized linear models. In the past decades, various data augmentation algorithms have been brought forward that have steadily increased the accessibility and popularity of Bayesian estimation techniques in the context of discrete choice models. In this article, we introduce a number of sampling algorithms based on the data augmentation algorithm of Polson et al. (2013). These algorithms may be used to estimate parameters in the most commonly encountered discrete choice models and are well suited to extensions such as state space and random effects frameworks.

A Appendix

A.1.1 The Pólya-Gamma distribution

For all latent variable representations derived in this paper for binary, multinomial or binomial data, the error term in the latent equations arises from a distribution ε ~ F_ε(ε), for which the density f_ε(ε) can be represented as a mixture of normals using the Pólya-Gamma distribution as mixing measure, where d is a constant not depending on ε. To simulate from a (tilted) Pólya-Gamma distribution PG(q, c), the following convolution property is exploited: to simulate Y ~ PG(q, c), use Y = Σ_{j=1}^q X_j, where X_j ~ PG(1, c) are q independent random variables from the PG(1, c) distribution.
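The convolution property translates directly into code: a PG(q, c) draw is the sum of q independent PG(1, c) draws. A minimal sketch, again based on the truncated sum-of-gammas representation (an approximation; dedicated Pólya-Gamma samplers are exact):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pg1(c, n_terms=200):
    # PG(1, c) via the truncated infinite-sum-of-gammas representation.
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(shape=1.0, scale=1.0, size=n_terms)
    return (g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))).sum() / (2 * np.pi ** 2)

def sample_pg(q, c):
    # Convolution property: Y ~ PG(q, c) equals the sum of q iid PG(1, c) draws.
    return sum(sample_pg1(c) for _ in range(q))
```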
A.1.2 The type I generalized logistic distribution

For the type I generalized logistic distribution ε ~ GL_I(ν) with parameter ν > 0, the density reads

f_ε(ε) = ν e^{νε} / (1 + e^ε)^{ν+1}.   (A.4)

GL_I(ν) reduces to the logistic distribution for ν = 1. The cdf of a type I generalized logistic distribution takes a simple form:

F_ε(ε) = (e^ε / (1 + e^ε))^ν.   (A.5)

Hence, the quantiles are available in closed form:

F⁻¹_ε(p) = log(p^{1/ν}) − log(1 − p^{1/ν}).   (A.6)

The type I generalized logistic distribution ε ~ GL_I(ν) can be represented as a mixture of normals with a Pólya-Gamma distribution serving as mixing measure, where ω ~ PG(ν + 1, 0) and κ = (ν − 1)/2 (A.7), see Section A.1.1. For the logistic distribution, ω ~ PG(2, 0) and κ = 0.

A.1.3 The type II generalized logistic distribution

For the type II generalized logistic distribution ε ~ GL_II(ν) with parameter ν > 0, the density reads

f_ε(ε) = ν e^ε / (1 + e^ε)^{ν+1}.   (A.8)

Also GL_II(ν) reduces to the logistic distribution for ν = 1. The cdf of a type II generalized logistic distribution takes a simple form:

F_ε(ε) = 1 − (1 + e^ε)^{−ν}.   (A.9)

Hence, the quantiles are available in closed form:

F⁻¹_ε(p) = log((1 − p)^{−1/ν} − 1).   (A.10)

The type II generalized logistic distribution ε ~ GL_II(ν) can be represented as a mixture of normals with a Pólya-Gamma distribution serving as mixing measure, where ω ~ PG(ν + 1, 0) and κ = (1 − ν)/2 (A.11), see Section A.1.1. Again, for the logistic distribution, ω ~ PG(2, 0) and κ = 0 results.

A.2 Proofs

Proof of Theorem 1. The proof of Theorem 1 is straightforward. Depending on the observed category y_i, the corresponding utility is the maximum among all latent utilities, and the posterior distribution p(u_ki, u_0i, u_a,i|λ_ki, y_i) is equal to one of three closed-form distributions, where λ̄_i = 1 + Σ_{l=1}^m λ_li. For efficient joint sampling of all utilities for all i = 1, ..., N, this can be rewritten as in (25).

Proof of Theorems 2 and 3. A binomial observation y_i can be regarded as the aggregated number of successes among N_i independent binary outcomes z_1i, ..., z_{N_i,i}, labelled {0, 1}, each following the binary logit model Pr(z_ni = 1|π_i) = π_i. For each individual binary observation z_ni, the logit model can be written as a RUM (McFadden 1974),

u_ni = log λ_i + ε_ni, z_ni = I{u_ni > 0},

involving a latent variable u_ni, where ε_ni are i.i.d. errors following a logistic distribution. Among the N_i binary experiments, y_i outcomes z_ni choose category 1, whereas the remaining N_i − y_i outcomes z_ni choose category 0. The challenge is to aggregate the latent variables u_ni to a few latent variables in such a way that an explicit choice equation is available. As it turns out, such an aggregation can be based on the order statistics u_(1),i ≤ ... ≤ u_(N_i),i of the latent utilities. Consider first the case that y_i = 0. Such an outcome is observed iff z_ni = 0 or, equivalently, the latent utility is negative (u_ni < 0) for all n = 1, ..., N_i. Hence, a necessary and sufficient condition for y_i = 0 is that the maximum of all utilities is negative, or equivalently, u_(N_i),i < 0 (A.12). Next, consider the case that y_i = N_i. Such an outcome is observed iff z_ni = 1 or, equivalently, the latent utility is positive (u_ni > 0) for all n = 1, ..., N_i. Hence, a necessary and sufficient condition for y_i = N_i is that the minimum of all utilities is positive, or equivalently, u_(1),i > 0 (A.13). Also for outcomes 0 < y_i < N_i, the order statistics u_(N_i−k),i and u_(N_i−k+1),i, with k = y_i, provide necessary and sufficient conditions:

u_(N_i−y_i),i < 0 < u_(N_i−y_i+1),i.   (A.14)

Note that (A.12)-(A.14) are choice equations involving either a single or two order statistics. Hence, we introduce the corresponding order statistics as aggregated latent variables, given by w_i = u_(N_i−y_i+1),i and v_i = u_(N_i−y_i),i, with obvious modifications for y_i = 0 and y_i = N_i. It remains to prove that the latent variables can be represented as in the aggregated model (34). Note that the order statistics can be represented for j = 1, ..., N_i as u_(j),i = log λ_i + ε_(j),i, involving the order statistics ε_(1),i, ..., ε_(N_i),i of N_i iid realisations ε_1,i, ..., ε_{N_i,i} of a logistic distribution.
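The closed-form cdfs and quantile functions (A.5)-(A.6) and (A.9)-(A.10) are what make the inverse-cdf draws in Theorem 3 and in the proof below cheap. A short sketch (our function names):

```python
import numpy as np

def q_gl1(p, nu):
    # Type I generalized logistic GL_I(nu): F(x) = (e^x / (1 + e^x))^nu,
    # so F^{-1}(p) = log(p^{1/nu}) - log(1 - p^{1/nu}).
    r = p ** (1.0 / nu)
    return np.log(r) - np.log1p(-r)

def q_gl2(p, nu):
    # Type II generalized logistic GL_II(nu): F(x) = 1 - (1 + e^x)^(-nu),
    # so F^{-1}(p) = log((1 - p)^(-1/nu) - 1).
    return np.log((1.0 - p) ** (-1.0 / nu) - 1.0)

# both reduce to the logistic quantile log(p) - log(1 - p) for nu = 1
assert np.isclose(q_gl1(0.3, 1.0), q_gl2(0.3, 1.0))
```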
Their distribution can be derived from the order statistics X_(1),i, ..., X_(N_i),i of N_i uniform random numbers X_1i, ..., X_{N_i,i} using

ε_(j),i = F⁻¹(X_(j),i),   (A.16)

where F is the cdf of the logistic distribution. First, for the special cases where y_i = 0 or y_i = N_i, we use that X_(j),i ~ B(j, N_i − j + 1). Using (A.16), we can derive the density of ε_(N_i),i, which is the density of a GL_I(N_i) distribution, see (A.4). Hence, for y_i = 0, the representation involving v_i = log λ_i + ε_v,i with ε_v,i ~ GL_I(N_i) results. Using (A.16), we can derive the density of ε_(1),i, which is the density of a GL_II(N_i) distribution, see (A.8). Hence, for y_i = N_i, the representation involving w_i = log λ_i + ε_w,i with ε_w,i ~ GL_II(N_i) results. For 0 < y_i = k < N_i, using that the vector of spacings of uniform order statistics follows a Dirichlet distribution, see e.g. Robert & Casella (1999), we obtain the joint density of the pair (ε_(N_i−k),i, ε_(N_i−k+1),i) by the transformations (A.16) and their inverse, where f is the pdf of the logistic distribution. This density can be expressed as the product of the density of a GL_II(k) distribution, see (A.8), a corresponding type I factor, and a normalising constant, which verifies the representation in Theorem 2. Knowing that y_i = k, with 0 < k < N_i, w_i|y_i and v_i|y_i are conditionally independent. Since both F_ε and F⁻¹_ε are available in closed form for both types of generalized logistic distributions, we obtain closed-form draws: w_i is sampled via the GL_II quantile function with a uniform random number W_i, see (A.10), which proves equation (35); v_i is sampled via the GL_I quantile function with a uniform random number V_i, see (A.6), which proves equation (36). For y_i = 0, only v_i|y_i is sampled; for y_i = N_i, only w_i|y_i is sampled. It is easy to verify that indeed w_i > 0 and v_i < 0.

A.3.1 Sampling the latent utilities z_i

Consider the latent variable representation of a binary model involving the latent variables z_i,

z_i = log λ_i + ε_i, ε_i ~ f_ε(ε), y_i = I{z_i > 0},   (A.21)

where f_ε(ε) is the pdf corresponding to the cdf F_ε(ε); f_ε(ε) = φ(ε) is the standard normal pdf for a probit model and f_ε(ε) = e^ε/(1 + e^ε)² for a logit model. Provided that the quantile function F⁻¹_ε(p) is available in closed form, it is easy to sample z_i from the marginalized density z_i|λ_i, y_i in the latent variable model (A.21). The latent variables z_i are conditionally independent, and the posterior of z_i is f_ε(z_i − log λ_i) truncated to (−∞, 0] if y_i = 0, and truncated to (0, ∞) if y_i = 1. Hence,

z_i = log λ_i + F⁻¹_ε(y_i + U_i(1 − y_i − π_i)),

where U_i ~ U[0, 1] and π_i = Pr(y_i = 1|λ_i) = F_ε(log λ_i), and F⁻¹_ε(p) = Φ⁻¹(p) for the probit model and F⁻¹_ε(p) = log p − log(1 − p) for the logit model. To simulate ε from a distribution F_ε(ε) truncated to [a, b], we simulate a uniform random number U ~ U[0, 1] and set ε = F⁻¹_ε(F_ε(a) + U(F_ε(b) − F_ε(a))).

Johndrow et al. (2019) suggest using a simplified adaptive MH algorithm based on the ideas outlined in ?. Corresponding to their sampling scheme, we update the probit/logit/multinomial coefficients θ using a time-inhomogeneous proposal for each component θ_i of θ, with a Gaussian proposal kernel where φ is the standard univariate Gaussian density and θ̄_ik is the average posterior draw across all previous Gibbs iterations. That is, they use the variance of all previous Gibbs iterations to construct on-the-fly proposal densities. For the first 100 iterations, we use a proposal density with unit variance.

The likelihood contribution for each y_i in the boosted Pólya-Gamma mixture representation of a binomial model (40) is given in (A.24). The likelihood contributions (A.24) have to be combined with the prior p(δ|γ) and with the prior distributions of the parameters in log λ̃_i.
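The generic inverse-cdf rule for truncated draws stated above applies verbatim to any distribution with a tractable cdf and quantile function. A short sketch, using the logistic distribution from scipy as an example (function name ours):

```python
import numpy as np
from scipy.stats import logistic

def rtrunc(cdf, quantile, a, b, rng):
    # Draw eps from a distribution truncated to [a, b]:
    # eps = F^{-1}(F(a) + U (F(b) - F(a))), with U ~ U[0, 1].
    u = rng.uniform()
    return quantile(cdf(a) + u * (cdf(b) - cdf(a)))

rng = np.random.default_rng(3)
# e.g. a standard logistic error truncated to (-inf, 0], as used for y_i = 0
draw = rtrunc(logistic.cdf, logistic.ppf, -np.inf, 0.0, rng)
```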
We provide details for a binomial regression model, where log λ̃_i = x_i β̃ and the parameter β̃ is defined exactly as for a binary regression model, β̃_j = √δ β_j + I{j = d}γ. As in the binary case, we assume the prior β̃|δ ~ N_d(0, δA_0) in the boosted model, where A_0,dd is possibly larger than A_0,jj for 1 ≤ j < d, which implies a joint prior of β|γ, δ in the identified model. This dependence on δ and γ has to be taken into account when we move from the identified to the unidentified model and sample δ' ~ G⁻¹(d_0, D_0) and γ'|δ', β_d from the corresponding conditional prior. The boosted model reads as in (40), with latent variables w̃_i and ṽ_i. To obtain the joint posterior p(β̃, δ|z̃, γ), we combine the likelihood contributions (A.24) with the conditional prior p(δ|γ) given in (8). It can be shown that the joint posterior p(β̃, δ|z̃, γ) factorizes into a conditionally Gaussian part for β̃ and a marginal part for δ. The last terms in D_I and B_I result from completing the squares in β̃. Note that the factor exp((1/(2δ)) m_N(δ)' B_N m_N(δ)) has to be added to (A.29) to compensate for completing the squares in f_N(β̃; b_N(δ), δB_N). Using this, we obtain the marginal density p(δ|γ, z̃, y). We choose an auxiliary prior π(δ) for resampling such that the mode and the curvature of π(δ) coincide with the mode δ_M and the curvature I_p of the posterior p(δ|γ, z̃, y). Resampling works as follows. L draws δ^(l) ~ π(δ), l = 1, ..., L, from the auxiliary prior are resampled using weights proportional to the "auxiliary likelihood" ℓ(δ) = p(δ|γ, z̃, y)/π(δ). The desired draw from p(δ|·) is given by δ^(l*), where l* ~ MulNom(1; w_1, ..., w_L) and the weights w_l ∝ ℓ(δ^(l)) are normalized to sum to 1. The auxiliary likelihood ℓ(δ) is expected to be rather flat over the support of π(δ), see Figure A.1 for an illustration. Hence, w_1, ..., w_L is expected to be close to a uniform distribution and L can be pretty small (L = 5 or L = 10 should be enough). The factorization of the posterior suggests two distribution families as auxiliary prior π(δ). First, the inverted Gamma prior δ ~ G⁻¹(d_I, D_I), with mode δ_IG and curvature I_IG of the pdf. Matching the mode, i.e. δ_IG = δ_M, and the curvature, i.e. I_IG = I_p, to the posterior yields the optimal choice for the parameters (d_I, D_I), and the log likelihood ratio is available in closed form. Second, provided that B_I > 0, the translated inverted Gamma prior √δ ~ G⁻¹(2b_I, B_I), with mode δ_SQ and curvature I_SQ of the pdf. (As follows from the law of transformation of densities, a random variable δ whose square root follows √δ ~ G⁻¹(2a, b) has a density of this form; to sample from such a density, we sample X ~ G⁻¹(2a, b) and take the square, i.e. δ = X².) Matching the mode, i.e. δ_SQ = B_I²/(4(b_I + 1)²) = δ_M, and the curvature, i.e. I_SQ = I_p, to the posterior yields the optimal choice for the parameters (b_I, B_I), again with a closed-form log likelihood ratio. See again Figure A.1 for an illustration.

A.4 More details on the UPG sampler for binary state space models

As for binary regression models, we use the priors γ|δ ~ N(0, δG_0), δ ~ G⁻¹(d_0, D_0), and specify the priors β̃_j|δ ~ N(0, δA_0,jj) and θ_j ~ G⁻¹(c_0, C_0) in the boosted SSM (47), where A_0,jj, j = 1, ..., d, as well as c_0 and C_0 are fixed hyperparameters. Marginalizing with respect to β̃_j yields the following prior for the initial value β̃_j0 in (47):

β̃_j0|δ, θ_j ~ N(0, δ(A_0,jj + P_0,jj θ_j)).   (A.35)

We observe the same prior dependence between γ, δ and β_d as for the simple binary regression model, see (A.37).
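Returning to the resampling step for δ described above, the scheme can be sketched generically: draw L candidates from the matched inverted Gamma auxiliary prior π(δ), weight them by ℓ(δ) = p(δ|·)/π(δ), and resample one candidate. The unnormalized log posterior is passed in as a callable, since its exact form depends on the model; everything else below is our illustrative scaffolding.

```python
import numpy as np

def resample_delta(log_post, d_I, D_I, L, rng):
    # candidates from the auxiliary prior pi(delta) = G^{-1}(d_I, D_I)
    cand = 1.0 / rng.gamma(shape=d_I, scale=1.0 / D_I, size=L)
    # log pi(delta) up to an additive constant
    log_pi = -(d_I + 1.0) * np.log(cand) - D_I / cand
    # auxiliary likelihood l(delta) = p(delta | .) / pi(delta)
    log_w = np.array([log_post(d) for d in cand]) - log_pi
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                      # weights should be near-uniform if pi fits
    return rng.choice(cand, p=w)      # resample one draw (L = 5 or 10 suffices)
```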
A.4.1 The Kalman filter with unknown scaling

β_t = F_t β_{t−1} + w_t, w_t ~ N(0, δQ_t),   (A.38)
y_t = H_t β_t + ε_t, ε_t ~ N(0, δR_t).   (A.39)

The key variables in these formulations are the state variable β_t and the observation variable y_t. The state variable β_t is a latent d-dimensional random vector which is observed only indirectly through the effect it has on the distribution of the r-dimensional time series y_t. Note that the system (A.38) and (A.39) is a linear Gaussian state space model where all (co)variances depend on the same unknown scaling factor δ. Let ϑ be the vector containing all unknown parameters in the system matrices F_t, H_t, Q_t and R_t apart from δ. In filtering, the goal is to estimate β_0, ..., β_T|y, δ, ϑ given data y = (y_1, ..., y_T) for fixed model parameters ϑ. For a linear Gaussian state space model with unknown scaling, the filter density p(β_t|δ, ϑ, y^t) at time t, where y^t = (y_1, ..., y_t), is equal to a normal distribution scaled with δ, i.e. β_t|δ, ϑ, y^t ~ N_d(β̂_t|t, δP_t|t), where β̂_t|t and P_t|t are determined recursively for t = 1, ..., T from β̂_t−1|t−1, P_t−1|t−1 and y_t. This result follows immediately from applying the Kalman filter to the system (A.38) and (A.39). From a computational point of view, this extended filter is a standard Kalman filter which is run with δ = 1, and where all covariances in the filter density β_t|δ, ϑ, y^t, the propagated density β_t|δ, ϑ, y^{t−1} and the predictive density y_t|δ, ϑ, y^{t−1} are multiplied by δ after running the filter. We provide further details for TVP models for univariate time series y_t, where r = 1, with β̂_0|0 = (β_1, ..., β_d) and P_0|0 = Diag(θ_1 P_0,11, ..., θ_d P_0,dd); the resulting filter recursions are given in (A.40). An advantage of writing the Kalman filter in this way is that the integrated likelihood p(y|δ, ϑ) is available in closed form. If r = 1, then exploiting (A.40) we obtain:

p(y|δ, ϑ) = Π_{t=1}^T f_N(y_t; ŷ_t|t−1, δ S_t|t−1),   (A.41)

where ŷ_t|t−1 and S_t|t−1 are the moments given in (A.40). This allows the scaling factor δ to be sampled jointly with the state process β_0, ..., β_T conditional on ϑ and y, (a) by sampling δ|ϑ, y from the inverted Gamma posterior that results from combining the likelihood (A.41) with the inverted Gamma prior δ ~ G⁻¹(d_0, D_0), and (b) by sampling β_0, ..., β_T|y, δ, ϑ conditional on δ using FFBS (Frühwirth-Schnatter 1994). As mentioned above, the moments ŷ_t|t−1 and S_t|t−1 in Step (a) are obtained from running a Kalman filter with δ = 1. Note that we need not rerun the Kalman filter in Step (b); all covariances are just multiplied by the scaling factor δ sampled in Step (a).

Application to boosting binary state space models. This result is useful for the boosted state space model (47), because the working parameter δ_new can be sampled without conditioning on the state process β̃_0, ..., β̃_T and the initial values β̃_1, ..., β̃_d. In this case, ϑ = (θ_1, ..., θ_d) and the Kalman filter described above is applied with β̂_0|0 = 0_d and P_0|0 = Diag(A_0,11 + θ_1 P_0,11, ..., A_0,dd + θ_d P_0,dd). The Kalman filter reads as in (A.43). Therefore, the likelihood (A.41) takes the corresponding form, with ẑ_t|t−1 and S_t|t−1 obtained from this filter run with δ = 1.
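A compact sketch of the machinery in this appendix for the univariate TVP model: a Kalman filter run with δ = 1 that returns the predictive moments needed for the integrated likelihood (A.41), the resulting inverted Gamma draw for δ, and FFBS for the states. This is our reconstruction under the stated random-walk dynamics, not the authors' code; sigma2 holds the heteroscedastic observation variances (1/ω_t for the logit SSM, 1 for the probit SSM).

```python
import numpy as np

def kalman_filter(z, X, theta, P0_diag, b0, sigma2):
    # Forward filter with delta = 1 for z_t = x_t' beta_t + e_t,
    # beta_t = beta_{t-1} + w_t, w_t ~ N(0, diag(theta)).
    T, d = X.shape
    Q = np.diag(theta)
    b, P = np.asarray(b0, float).copy(), np.diag(P0_diag).astype(float)
    bs, Ps = np.empty((T, d)), np.empty((T, d, d))
    zhat, S = np.empty(T), np.empty(T)
    for t in range(T):
        P = P + Q                          # propagate beta_{t-1} -> beta_t
        x = X[t]
        zhat[t] = x @ b                    # predictive mean zhat_{t|t-1}
        S[t] = x @ P @ x + sigma2[t]       # predictive variance S_{t|t-1}
        K = P @ x / S[t]                   # Kalman gain
        b = b + K * (z[t] - zhat[t])
        P = P - np.outer(K, x @ P)
        bs[t], Ps[t] = b, P
    return bs, Ps, zhat, S

def sample_delta(z, zhat, S, d0, D0, rng):
    # delta | z ~ G^{-1}(d0 + T/2, D0 + 0.5 * sum((z_t - zhat_t)^2 / S_t)),
    # combining the integrated likelihood (A.41) with delta ~ G^{-1}(d0, D0).
    dT = d0 + len(z) / 2.0
    DT = D0 + 0.5 * np.sum((z - zhat) ** 2 / S)
    return 1.0 / rng.gamma(shape=dT, scale=1.0 / DT)

def ffbs(bs, Ps, theta, delta, rng):
    # Backward sampler: covariances from the delta = 1 filter are rescaled by
    # delta, then beta_T, ..., beta_1 are drawn from the smoothing recursions.
    T, d = bs.shape
    Q = np.diag(theta)
    beta = np.empty((T, d))
    beta[-1] = rng.multivariate_normal(bs[-1], delta * Ps[-1])
    for t in range(T - 2, -1, -1):
        A = Ps[t] @ np.linalg.inv(Ps[t] + Q)
        m = bs[t] + A @ (beta[t + 1] - bs[t])
        V = Ps[t] - A @ Ps[t]
        V = delta * 0.5 * (V + V.T)        # rescale by delta, symmetrize
        beta[t] = rng.multivariate_normal(m, V)
    return beta
```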
A.5 Additional simulation results

This section provides some additional simulation results for probit, logit, multinomial and binomial regression models. We investigate a number of simulation scenarios. First, we vary the sample size N with only one instance of y_i = 1, implying increasing imbalance with a larger sample size. Second, we fix N = 1000 and vary the intercept in the data generating process, where large absolute intercepts imply greater imbalance than intercepts that are close to zero (corresponding to the fully balanced case). Selected numerical results are summarized in Table A.1. It is obvious that the commonly used standard data augmentation approach from A&C has severe efficiency problems with increasing sparsity in the dependent variable and, therefore, increasing imbalance of the data. Increasing inefficiency can also be observed for the I&vD algorithm, as seen in Figure A.2 and Table A.1. On the other hand, constantly high levels of efficiency are observed for UPG and the AMH algorithm. In the probit case, the extreme intercept values of −3 and 3 imply success counts close to 0 and close to N, respectively. In general, AMH and UPG exhibit efficient estimation behavior over the full range of intercepts. On the contrary, I&vD as well as A&C show increasing inefficiency with increasing imbalance. Selected numerical results for these models can be found in Table A.2 to Table A.5.

References

Albert, J. H. & Chib, S. (1993), 'Bayesian analysis of binary and polychotomous response data', Journal of the American Statistical Association.
Carter, C. K. & Kohn, R. (1994), 'On Gibbs sampling for state space models', Biometrika.
Frühwirth-Schnatter, S. (1994), 'Data augmentation and dynamic linear models', Journal of Time Series Analysis.
Frühwirth-Schnatter, S. & Frühwirth, R. (2007), 'Auxiliary mixture sampling with applications to logistic models', Computational Statistics & Data Analysis.
Frühwirth-Schnatter, S. & Frühwirth, R. (2010), 'Data augmentation and MCMC for binary and multinomial logit models', in Statistical Modelling and Regression Structures - Festschrift in Honour of Ludwig Fahrmeir, Physica-Verlag.
Frühwirth-Schnatter, S., Frühwirth, R., Held, L. & Rue, H. (2009), 'Improved auxiliary mixture sampling for hierarchical models of non-Gaussian data', Statistics and Computing.
Fussl, A., Frühwirth-Schnatter, S. & Frühwirth, R. (2013), 'Efficient MCMC for binomial logit models', ACM Transactions on Modeling and Computer Simulation.
Gamerman, D. (1998), 'Markov chain Monte Carlo for dynamic generalized linear models', Biometrika.
Holmes, C. C. & Held, L. (2006), 'Bayesian auxiliary variable models for binary and multinomial regression', Bayesian Analysis.
Imai, K. & van Dyk, D. A. (2005), 'A Bayesian analysis of the multinomial probit model using marginal data augmentation', Journal of Econometrics.
Johndrow, J. E., Smith, A., Pillai, N. & Dunson, D. B. (2019), 'MCMC for imbalanced categorical data', Journal of the American Statistical Association.
McCulloch, R. E., Polson, N. G. & Rossi, P. E. (2000), 'A Bayesian analysis of the multinomial probit model with fully identified parameters', Journal of Econometrics.
McFadden, D. (1974), 'Conditional logit analysis of qualitative choice behaviour', in Frontiers in Econometrics, Academic Press.
Meng, X.-L. & van Dyk, D. A. (1999), 'Seeking efficient data augmentation schemes via conditional and marginal augmentation', Biometrika.
Plummer, M., Best, N., Cowles, K. & Vines, K. (2006), 'CODA: Convergence diagnosis and output analysis for MCMC', R News.
Polson, N. G., Scott, J. G. & Windle, J. (2013), 'Bayesian inference for logistic models using Pólya-Gamma latent variables', Journal of the American Statistical Association.
Robert, C. P. & Casella, G. (1999), Monte Carlo Statistical Methods, Springer.
Rossi, P. E., Allenby, G. M. & McCulloch, R. (2005), Bayesian Statistics and Marketing, Wiley.
van Dyk, D. A. & Meng, X.-L. (2001), 'The art of data augmentation', Journal of Computational and Graphical Statistics.
Zellner, A. & Rossi, P. E. (1984), 'Bayesian analysis of dichotomous quantal response models', Journal of Econometrics.

Note: We use the UPG algorithm with γ deterministically fixed at 0, which corresponds to the marginal data augmentation algorithm in Imai & van Dyk (2005).

Algorithm 4 (The ultimate Pólya-Gamma sampler for binary state space models). Choose starting values for (θ_1, ..., θ_d) and λ_1, ..., λ_T and repeat the following steps:

(a-1) For each t = 1, ..., T, sample z_t|λ_t, y_t in model (45) as in Algorithm 1; sample ω_t|z_t, λ_t, y_t ~ PG(2, |z_t − log λ_t|) for the logit SSM.

(a-2) Sample δ' ~ G⁻¹(d_0, D_0) and γ'|δ' from (A.37). Move to the boosted SSM (47) using transformation (48) to define z̃_t, t = 1, ..., T.

(b) Sampling in the boosted conditionally Gaussian SSM (47):

(b-2a) Sample δ_new|z̃, θ_1, ..., θ_d ~ G⁻¹(d_T, D_T) jointly with β̃_0, ..., β̃_T conditional on θ_1, ..., θ_d, based on the marginal prior β̃_j0|δ, θ_j given in (A.35), using the methods described in Appendix A.4.1 with the process variances equal to δθ_j and heteroscedastic observation errors, where d_T and D_T are given by

d_T = d_0 + T/2, D_T = D_0 + (1/2) Σ_{t=1}^T (z̃_t − ẑ_t|t−1)² / S_t|t−1,

and ẑ_t|t−1 and S_t|t−1 are obtained from a Kalman filter with δ = 1, see (A.43).

(b-2b) Sample β̃_0, ..., β̃_T|δ_new, z̃, θ_1, ..., θ_d using forward-filtering-backward-sampling.

(b-3) Update (β̃_j, θ_j)|δ_new under the prior β̃_j0|β̃_j, θ_j, δ_new ~ N(β̃_j, δ_new P_0,jj θ_j) given by (A.34):

(b-3a) Sample β̃_j|θ_j, β̃_j0, δ_new, for j = 1, ..., d, from

β̃_j ~ N( A_0,jj β̃_j0 / (A_0,jj + θ_j P_0,jj), δ_new A_0,jj θ_j P_0,jj / (A_0,jj + θ_j P_0,jj) ).

(b-3b) Sample θ_j|δ_new, β̃_j, {β̃_jt} ~ G⁻¹(c_0 + (T + 1)/2, C_j) for j = 1, ..., d, where

C_j = C_0 + (1/(2δ_new)) ( Σ_{t=1}^T (β̃_jt − β̃_j,t−1)² + (β̃_j0 − β̃_j)²/P_0,jj ).