Multivariate Conway-Maxwell-Poisson Distribution: Sarmanov Method and Doubly-Intractable Bayesian Inference

Luiza S.C. Piancastelli, Nial Friel, Wagner Barreto-Souza, Hernando Ombao

2021-07-15

Abstract. In this paper, a multivariate count distribution with Conway-Maxwell (COM)-Poisson marginals is proposed. To do this, we develop a modification of the Sarmanov method for constructing multivariate distributions. Our multivariate COM-Poisson (MultCOMP) model has desirable features such as (i) it admits a flexible covariance matrix allowing for both negative and positive non-diagonal entries; (ii) it overcomes the limitation of the existing bivariate COM-Poisson distributions in the literature, which do not have COM-Poisson marginals; (iii) it allows for the analysis of multivariate counts and is not just limited to bivariate counts. Inferential challenges are presented by the likelihood specification as it depends on a number of intractable normalizing constants involving the model parameters. These obstacles motivate us to propose a Bayesian inferential approach where the resulting doubly-intractable posterior is dealt with via the exchange algorithm and the Grouped Independence Metropolis-Hastings algorithm. Numerical experiments based on simulations are presented to illustrate the proposed Bayesian approach. We analyze the potential of the MultCOMP model through a real data application on the numbers of goals scored by the home and away teams in the Premier League from 2018 to 2021. Here, our interest is to assess the effect of a lack of crowds during the COVID-19 pandemic on the well-known home team advantage. A MultCOMP model fit shows that there is evidence of a decreased number of goals scored by the home team, not accompanied by a reduced score from the opponent. Hence, our analysis suggests a smaller home team advantage in the absence of crowds, which agrees with the opinion of several football experts.

1 Introduction

The Conway-Maxwell-Poisson (COM-Poisson) distribution was introduced by Conway and Maxwell (1962) in the context of queuing systems and was later revived in the literature by Shmueli et al. (2005). A random variable X follows a COM-Poisson distribution if its probability function can be written as

P(X = x) = λ^x / {(x!)^ν Z(λ, ν)}, for x ∈ N_0, (1)

where

Z(λ, ν) = Σ_{j=0}^∞ λ^j / (j!)^ν (2)

is the normalising constant, with λ > 0 and ν > 0, the latter of which is responsible for controlling the dispersion. The COM-Poisson distribution is overdispersed when ν < 1 and underdispersed when ν > 1. The case ν = 1 corresponds to the Poisson distribution, which is equidispersed. The Bernoulli and geometric distributions can also be obtained from the COM-Poisson: the geometric special case corresponds to ν = 0 and λ < 1, while the Bernoulli(λ/(λ + 1)) distribution is a limiting case as ν → ∞. This distribution has received much attention in the literature after its revival in 2005. Properties of the COM-Poisson distribution are discussed, for instance, in Nadarajah (2009) and Daly and Gaunt (2016), and a regression model is introduced by Sellers and Shmueli (2010). Bayesian inference approaches for this model were proposed by Kadane et al. (2006) and Benson and Friel (2020). Approximations for the intractable normalising constant Z(λ, ν) given in (2) have been studied by Daly and Gaunt (2016) and Gaunt et al. (2019). Recent contributions on time series analysis and tree-based semi-varying coefficient models are due to Sellers et al. (2019) and Chatla and Shmueli (2020), respectively.
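To make (1)-(2) concrete, the following R sketch (ours, not the authors' implementation; the truncation point T = 1000 and the use of the log scale are our assumptions) evaluates the COM-Poisson probability function with a truncated normalising constant:

    # COM-Poisson probability function with a truncated normalising constant.
    # Working on the log scale via lgamma() avoids overflow of (x!)^nu.
    Z_trunc <- function(lambda, nu, T = 1000) {
      j <- 0:T
      sum(exp(j * log(lambda) - nu * lgamma(j + 1)))
    }
    dcomp <- function(x, lambda, nu, T = 1000) {
      exp(x * log(lambda) - nu * lgamma(x + 1)) / Z_trunc(lambda, nu, T)
    }
    # nu = 1 recovers the Poisson distribution (up to truncation error):
    all.equal(dcomp(0:20, lambda = 2, nu = 1), dpois(0:20, lambda = 2))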
For a recent account of the COM-Poisson model, we refer the reader to Sellers and Premeaux (2020).

A natural point of interest is the proposal of a multivariate count model with COM-Poisson marginals. A first bivariate attempt in this direction was made by Sellers et al. (2016), but the marginals of that proposed bivariate model are no longer COM-Poisson distributed. Moreover, the dispersion parameter is assumed to be the same for both marginals and the range of correlation depends on it, which limits its ability to account for dependency. For example, in the particular case of their model when the dispersion equals 1 (the bivariate Poisson case), the correlation is non-negative. The quantities involved in such a bivariate distribution are also very cumbersome; for instance, see the joint probability function given in equation (16) of Sellers et al. (2016), which depends on an infinite summation. Here we aim to address these issues through the construction of a multivariate COM-Poisson distribution. To do this, we develop a modification of the Sarmanov (1966) method for constructing multivariate distributions. Our multivariate COM-Poisson (MultCOMP) model (i) admits a flexible covariance matrix allowing for both negative and positive non-diagonal entries; (ii) has COM-Poisson marginals; and (iii) allows for the analysis of multivariate counts rather than being limited to bivariate counts. A challenging point that arises in our proposed multivariate COM-Poisson model is that the likelihood function depends on ratios of normalising constants arising from (2). We propose a Bayesian inference approach in which the resulting doubly-intractable posterior is dealt with via the exchange algorithm and the Grouped Independence Metropolis-Hastings (GIMH) algorithm. A recent related work is due to Ong et al. (2021), where a bivariate COM-Poisson distribution is proposed based on the Sarmanov method. The exponential kernel case discussed there is a particular case of our MultCOMP model when the dimension equals 2. Furthermore, the inference in that paper is performed via direct maximization of the log-likelihood function, without exploring the difficulties involving the appearance of the constants due to (2) and the parameter restrictions to be considered in the optimization. Such challenging points are carefully addressed in our paper under a Bayesian perspective.

The remainder of this paper is organized as follows. In Section 2, the Sarmanov construction of bivariate distributions is reviewed along with its multivariate extension by Lee (1996). A variation of the latter with a lower number of parameters and more tractable correlation bounds is developed in this paper and related properties are discussed. We propose a multivariate COM-Poisson distribution using our modified Sarmanov method in Section 3. We develop and compare methods to deal with the intractability of the proposed model likelihood in Section 4. Two MCMC strategies are developed in Section 5 and compared via simulation studies in Section 6. In Section 7, we apply the MultCOMP model to analyze the numbers of goals scored by the home and away teams in the Premier League from 2018 to 2021. Here, our interest is to assess the effect of the absence of crowds due to the COVID-19 pandemic on the well-known home team advantage. The analysis using the MultCOMP model reveals that the home team advantage in the Premier League was significantly diminished while no crowds were allowed at the games. Concluding remarks are given in Section 8.

2 Generalized Sarmanov method

Sarmanov (1966) proposed a method for constructing bivariate distributions with given marginals.
This method was extended by Lee (1996) in order to accommodate higher dimensions rather than only two. In this section, we propose a modification of the version by Lee (1996) that is more mathematically tractable, as explained in what follows. We begin by briefly reviewing the works by Sarmanov (1966) and Lee (1996).

Let f_1 and f_2 be two density functions with respect to measures µ_1 and µ_2 with supports S_1 and S_2, respectively. Commonly, these are either the counting or the Lebesgue measure, corresponding to the discrete and continuous cases. A joint density function f(·,·) with respect to the product measure µ_1 × µ_2 having marginals f_1 and f_2 is constructed using the Sarmanov method as

f(x_1, x_2) = f_1(x_1) f_2(x_2) {1 + δ φ_1(x_1) φ_2(x_2)}, (3)

where φ_1(·) and φ_2(·) are bounded functions satisfying ∫_{S_i} f_i(x) φ_i(x) dµ_i(x) = 0, for i = 1, 2, with S_1 and S_2 being the marginal supports. Furthermore, these functions and δ must satisfy

1 + δ φ_1(x_1) φ_2(x_2) ≥ 0 for all (x_1, x_2) ∈ S_1 × S_2, (4)

so that (3) is a proper joint density function. As discussed in Vernic (2020), a natural way to define these functions is φ_i(x) = u_i(x) − E{u_i(X_i)}, with X_i being a random variable having density function f_i, for i = 1, 2. Some possible choices for the function u_i(·) are: (i) u_i(x) = e^{−ωx}, which is known as the exponential kernel and will be the focus of our paper; (ii) u_i(x) = x^ω (assuming that the associated support is compact and that the required moments exist). For more details on the bivariate Sarmanov distributions, see Kotz et al. (2000) and Vernic (2020).

Let us consider the exponential kernel case, that is u_i(x) = e^{−ωx} with either S_i = N_0 or S_i = R_+ for i = 1, 2, so that φ_i(x) = e^{−ωx} − L_i with L_i ≡ M_{X_i}(−ω), where M_{X_i}(·) denotes the moment generating function of X_i. As mentioned above, in order to ensure that (3) is a proper density function, it is necessary that condition (4) holds. The range of the δ parameter yielding a valid joint density function for the above case was studied for instance in Lee (1996). From Corollary 2 of that paper, we obtain that the parameter space of δ ensuring (4) is given by

max{−[(1 − L_1)(1 − L_2)]^{−1}, −(L_1 L_2)^{−1}} ≤ δ ≤ min{[L_1(1 − L_2)]^{−1}, [(1 − L_1) L_2]^{−1}}. (5)

Let (X_1, X_2) be a bivariate vector following a bivariate Sarmanov distribution with the exponential kernel function as described above. Denote µ_1 ≡ E(X_1), µ_2 ≡ E(X_2), σ_1 ≡ Var(X_1) and σ_2 ≡ Var(X_2). The correlation between X_1 and X_2 is

Corr(X_1, X_2) = δ (Ψ_1 − µ_1 L_1)(Ψ_2 − µ_2 L_2) / √(σ_1 σ_2), (6)

where Ψ_i = M′_{X_i}(−ω) for i = 1, 2, with M′_X(·) denoting the first derivative of the marginal moment generating function of a random variable X. With the above results, the range of correlation between X_1 and X_2 is obtained by evaluating (6) at the lower and upper limits for δ in (5).

The Sarmanov method was extended by Lee (1996) in order to allow the construction of higher-order multivariate distributions rather than bivariate; see also Kotz et al. (2000). Let f_1, …, f_d be d ∈ N density functions with respect to the measures µ_1, …, µ_d with respective supports S_1, …, S_d. Then, a joint density function having marginal densities f_1, …, f_d can be constructed by

f(x_1, …, x_d) = ∏_{j=1}^d f_j(x_j) {1 + R_{φ,d}(x_1, …, x_d)},

where

R_{φ,d}(x_1, …, x_d) = Σ_{1≤j_1<j_2≤d} δ_{j_1 j_2} φ_{j_1}(x_{j_1}) φ_{j_2}(x_{j_2}) + … + δ_{12…d} ∏_{j=1}^d φ_j(x_j),

and Ω_d = {{δ_{j_1 j_2}}_{1≤j_1<j_2≤d}, …, δ_{12…d}} is the set of dependence parameters, which must be such that 1 + R_{φ,d}(x_1, …, x_d) ≥ 0 on S_1 × … × S_d. Our modification of this construction retains only the pairwise interaction terms in R_{φ,d} (all higher-order δ parameters are set to zero) and adopts the common exponential kernel u(x) = e^{−ωx}. This yields a model with d(d − 1)/2 dependence parameters and more tractable correlation bounds, with the validity condition reducing to

1 + Σ_{1≤j<k≤d} δ_{jk} φ_j(x_j) φ_k(x_k) ≥ 0 for all (x_1, …, x_d) ∈ S_1 × … × S_d. (10)

3 Multivariate COM-Poisson distribution

We now propose a multivariate distribution with COM-Poisson(λ_i, ν_i) (λ_i > 0, ν_i > 0; i = 1, …, d) marginals, which we call the multivariate COM-Poisson (MultCOMP) distribution. For this, we consider φ_i(x) = e^{−ωx} − r_i, with r_i ≡ Z(λ_i e^{−ω}, ν_i)/Z(λ_i, ν_i), for x ∈ N_0, ω > 0, and i = 1, …, d; note that E{e^{−ωX_i}} = r_i for X_i ∼ COM-Poisson(λ_i, ν_i), so that φ_i has zero mean as required.

Definition 3.1. We say that a random vector X = (X_1, …, X_d) follows a multivariate COM-Poisson (MultCOMP) distribution if its joint probability function P(X_1 = x_1, …, X_d = x_d) ≡ f(x_1, …, x_d|θ) assumes the form

f(x_1, …, x_d|θ) = ∏_{i=1}^d [λ_i^{x_i} / {(x_i!)^{ν_i} Z(λ_i, ν_i)}] × {1 + Σ_{1≤j<k≤d} δ_{jk} (e^{−ωx_j} − r_j)(e^{−ωx_k} − r_k)}, (12)

where θ = (λ, ν, δ, ω), with λ = (λ_1, …, λ_d), ν = (ν_1, …, ν_d), and δ = vec{δ_{jk} : j = 1, …, d−1, k = j+1, …, d} satisfying (10). From now on, assume that X = (X_1, …, X_d) is a random vector with joint probability function (12), which is denoted by X ∼ MultCOMP(λ, ν, δ, ω).
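As an illustration of Definition 3.1, the sketch below (reusing Z_trunc and dcomp from the earlier sketch) evaluates the d = 2 joint probability function (12) with truncated constants; the parameter values are those of the first simulation configuration of Section 6, for which the validity condition (10) holds:

    # Bivariate MultCOMP pmf via truncation (a sketch, not the authors' code).
    dmultcomp2 <- function(x1, x2, lambda, nu, delta, omega, T = 1000) {
      # r_i = Z(lambda_i e^{-omega}, nu_i) / Z(lambda_i, nu_i) = E[exp(-omega X_i)]
      r <- sapply(1:2, function(i)
        Z_trunc(lambda[i] * exp(-omega), nu[i], T) / Z_trunc(lambda[i], nu[i], T))
      kern <- 1 + delta * (exp(-omega * x1) - r[1]) * (exp(-omega * x2) - r[2])
      dcomp(x1, lambda[1], nu[1], T) * dcomp(x2, lambda[2], nu[2], T) * kern
    }
    # Sanity check: the probabilities should sum to (approximately) one.
    g <- expand.grid(x1 = 0:30, x2 = 0:30)
    sum(dmultcomp2(g$x1, g$x2, lambda = c(1, 1.5), nu = c(0.4, 0.8), delta = 3, omega = 2))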
For j ≠ k, we derive the correlation between X_j and X_k to be

Corr(X_j, X_k) = δ_{jk} A_{jk}, (13)

where

A_{jk} = (Ψ_j − µ_j r_j)(Ψ_k − µ_k r_k) / √(σ_j σ_k),

with Ψ_l ≡ E(X_l e^{−ωX_l}), µ_l ≡ E(X_l) and σ_l ≡ Var(X_l). Since each factor Ψ_l − µ_l r_l = Cov(X_l, e^{−ωX_l}) is negative, we have A_{jk} > 0, and hence the sign of the parameter δ_{jk} determines whether X_j and X_k are negatively or positively correlated, for δ_{jk} < 0 and δ_{jk} > 0, respectively. We have that X_j and X_k are independent if δ_{jk} = 0.

Remark 3.1. An illustration of the correlation supported under the bivariate case is provided in Figure 1, which is obtained as follows. For fixed configurations of λ and ν we vary ω and calculate the possible δ range according to (10). Given λ, ν and ω, there is a linear relationship between δ_{jk} and the dependency among components j and k. Hence, the lower and upper δ values yield the minimum and maximum correlation under the set configuration. Naturally, the calculation of (6) depends on intractable terms under COM-Poisson marginals. In this initial illustration, we replace the infinite summations in the model with truncated ones. More specifically, the correlation is calculated using Ẑ(λ, ν), which denotes replacing (2) with a finite summation over x = 0, …, T. A minimum T = 1000 is set, followed by an iterative procedure that increases T until the difference in successive terms is less than 10^{−5}. However, in Section 4 we will describe Monte Carlo approaches to handle this intractability. In Figure 1, λ is set to (1, 1) and the ν values vary so that we obtain two overdispersed marginals (I: ν = (0.7, 0.7)), two underdispersed marginals (II: ν = (1.5, 1.5)), and one of each (III: ν = (0.7, 1.5)).

Remark 3.2. This preliminary investigation allows us to conclude the following. The parameter ω plays the role of increasing the correlation range (with respect to the case ω = 1 commonly assumed in the literature) supported by the given λ, ν and δ, but this effect is non-linear. Further, there is evidence that the values associated with the minimum and maximum possible correlation, denoted ω_min and ω_max, are not equivalent and vary with the model configuration. For instance, ω_min = 0.9 and ω_max = 0.7 in I, while (ω_min, ω_max) are (1.6, 0.8) and (1.2, 0.8) under II and III, respectively.

Summary of the proposed model features. The MultCOMP distribution is a flexible model for analyzing dependent multivariate count data since (i) it is defined for an arbitrary dimension d ∈ N; (ii) it has a flexible covariance matrix allowing both negative and positive correlations, with the independence case included in the interior rather than at the boundary of the parameter space; (iii) it accommodates different degrees of overdispersion and underdispersion for the marginals; (iv) it also allows components to be Poisson, geometric or Bernoulli distributed, with the last two being limiting cases, as happens in the univariate COM-Poisson distribution. Marginal and conditional probability functions for the MultCOMP law are directly available from Proposition 2.3. In what follows, we focus our attention on how to conduct Bayesian inference for the proposed multivariate count model.
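The truncation scheme of Remark 3.1 can be sketched as follows (our illustration, not the authors' code; the factors of A_jk are computed as µ_l r_l − Ψ_l, whose product equals that of Ψ_l − µ_l r_l):

    # Correlation (13) evaluated by truncating the COM-Poisson sums.
    comp_moments <- function(lambda, nu, omega, T = 1000) {
      x <- 0:T
      p <- exp(x * log(lambda) - nu * lgamma(x + 1)); p <- p / sum(p)
      list(mu  = sum(x * p),
           sig = sum(x^2 * p) - sum(x * p)^2,          # variance
           r   = sum(exp(-omega * x) * p),             # E[e^{-omega X}]
           Psi = sum(x * exp(-omega * x) * p))         # E[X e^{-omega X}]
    }
    corr_multcomp <- function(delta, lambda, nu, omega) {
      mj <- comp_moments(lambda[1], nu[1], omega)
      mk <- comp_moments(lambda[2], nu[2], omega)
      Ajk <- (mj$mu * mj$r - mj$Psi) * (mk$mu * mk$r - mk$Psi) / sqrt(mj$sig * mk$sig)
      delta * Ajk   # linear in delta, with A_jk > 0
    }
    corr_multcomp(delta = 3, lambda = c(1, 1), nu = c(0.7, 0.7), omega = 1)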
4 Estimation of the ratios r_j

Inference and random variable generation for the proposed model depend on being able to evaluate the likelihood (12) pointwise. Our likelihood involves two types of intractable terms: the univariate reciprocal normalising constants 1/Z(λ_j, ν_j) and the ratios Z(λ_j e^{−ω}, ν_j)/Z(λ_j, ν_j). We shall denote these respectively by z_j^{−1} and r_j, where the dependency on the parameters is suppressed for simplification of notation. This section addresses the estimation of the latter, while the former is handled in Section 5.

Different methodologies to estimate ratios of normalizing constants of two probability distributions have been developed targeting problems in Bayesian statistics and statistical physics. Quantities of interest are, for example, the Bayes factor and the free energy difference of physical systems. A natural Monte Carlo method for doing this is via Importance Sampling (IS), where draws from the distribution associated with the ratio denominator are taken. The performance of the simple importance sampling scheme will depend on how close the two distributions are. The acceptance ratio, bridge sampling and thermodynamic integration (or path sampling) methods originating from physics were introduced into the statistical literature by Meng and Wong (1996) and Gelman and Meng (1998). These authors showcase how the methods are linked to the widely known importance sampling, evidencing how they are natural generalizations of it. Other developments are the ratio importance sampling (or umbrella sampling) of Torrie and Valleau (1977) and the annealed importance sampling of Neal (2001), among others. For a careful assessment of Monte Carlo methods for ratios of normalising constants, we refer the reader to Chen et al. (2000). In this section, importance sampling and thermodynamic integration estimators of the r_j's are presented and compared via simulation. With these at hand, an algorithm to simulate random draws from the proposed model will be developed.

An estimate of r ≡ Z(λe^{−ω}, ν)/Z(λ, ν) can be obtained by sampling from a COM-Poisson(λ, ν) distribution. Let x_1, …, x_N be N independent draws from a COM-Poisson(λ, ν) law. These can be efficiently obtained via the fast-rejection sampler proposed by Benson and Friel (2020). An Importance Sampling (IS) estimator of r is given by

r̂ = (1/N) Σ_{i=1}^N e^{−ω x_i}. (14)

The estimator in (14) is unbiased for the ratio of interest since

E(e^{−ωX}) = Σ_{x=0}^∞ e^{−ωx} λ^x / {(x!)^ν Z(λ, ν)} = Z(λe^{−ω}, ν)/Z(λ, ν) = r.
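A minimal sketch of (14), drawing for illustration from a truncated pmf rather than the fast-rejection sampler of Benson and Friel (2020):

    # IS estimator of r: a plain Monte Carlo average of exp(-omega * X)
    # over draws X ~ COM-Poisson(lambda, nu).
    rcomp_trunc <- function(N, lambda, nu, T = 1000) {
      x <- 0:T
      sample(x, N, replace = TRUE, prob = exp(x * log(lambda) - nu * lgamma(x + 1)))
    }
    r_is <- function(N, lambda, nu, omega) mean(exp(-omega * rcomp_trunc(N, lambda, nu)))
    set.seed(1)
    r_is(1e5, lambda = 1.5, nu = 1, omega = 3)
    # For nu = 1 the exact value is exp(lambda * (exp(-omega) - 1)) ~ 0.2404.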
The Thermodynamic INTegration (TINT) approach is a generalization of Importance Sampling that has proved successful for many statistical problems. For example, it is the basis of the power-posterior approach for computing Bayesian model evidence (Friel and Pettitt, 2008; Friel et al., 2014). Let q_0(x) and q_1(x) be two unnormalised densities with the same support χ, with normalised versions p(x|i) = q_i(x)/z(i) for i = 0, 1. Suppose that it is possible to introduce a class of densities in χ indexed by a continuous parameter α (with support on some closed interval, say [0, 1]), say p(x|α) = q(x|α)/z(α), that links the two densities, and that we are interested in computing r = z(1)/z(0). Thermodynamic integration is also known as path sampling because it relies on creating a path between q_0(x) and q_1(x). One option is the geometric path q(x|α) = q_0(x)^{1−α} q_1(x)^α. Having defined a path, we employ the basic identity

d/dα log z(α) = E{d/dα log q(x|α)},

where the expectation is taken with respect to p(x|α). Integrating over α yields the log-ratio of interest:

log r = log z(1) − log z(0) = ∫_0^1 E{d/dα log q(x|α)} dα.

The continuous parameter α is often called the inverse temperature and is defined such that the path gives us the unnormalised densities of interest at both extremes, with the log-ratio resulting from the defined integral. Different strategies to perform thermodynamic integration rely on (i) the definition of the path and (ii) how the integration is performed. Common choices for (i) are the geometric and harmonic paths. Regarding (ii), α can be seen as a random variable with a prior distribution, or numerical integration strategies can be adopted.

A thermodynamic integration estimator for the multivariate COM-Poisson intractable ratio is defined by introducing a probability function indexed by τ, namely p(x|τ) ∝ λ^x e^{−τx}/(x!)^ν with normalising constant z(τ) = Z(λe^{−τ}, ν) for τ ∈ [0, ω], so that integration over the inverse temperature τ yields the desired log-ratio log{z(ω)/z(0)}. Following the notation in Gelman and Meng (1998), the path sampling identity gives us that

d/dτ log z(τ) = −E(X|τ),

where the expectation is taken with respect to a COM-Poisson(λe^{−τ}, ν) distribution. If we define τ to be a random variable with density p(τ), a Monte Carlo estimator of log r is

log r̂ = −(1/N) Σ_{i=1}^N x_i / p(τ_i), with (x_i, τ_i) drawn from p(x, τ) = p(x|τ)p(τ).

For example, we can consider τ ∼ U[0, ω] and sample from p(x, τ) = p(x|τ)p(τ). A thermodynamic estimator that sets the inverse temperature to be a random variable can result in poor performance if values at the extremes of the interval are not sampled frequently enough under p(τ) (Friel et al., 2014). A popular alternative is to adopt numerical integration over a discretised range 0 = τ_1 < τ_2 < … < τ_{n_rungs} = ω, guaranteeing proper exploration of τ values. An estimator based on the trapezoid rule is given by

log r̂ = −Σ_{i=1}^{n_rungs − 1} (τ_{i+1} − τ_i) {Ê(X|τ_i) + Ê(X|τ_{i+1})}/2.

At each τ_i, a number of independent draws from a COM-Poisson(e^{−τ_i}λ, ν) distribution is used to estimate the expectations as Monte Carlo averages. A TINT estimator employing numerical integration relies additionally on (i) the number and schedule of the discretisation terms (also called rungs) and (ii) the number of simulated draws per rung. Finding an optimal form for (i) is a non-trivial problem for which the recommendation is to adopt a power fraction schedule (Oates et al., 2014). Under this approach, τ values are placed according to (i/n_rungs)^c with c > 1 and i = 1, …, n_rungs. Schedules of this form have commonly been adopted in the literature and have proved a successful choice (Friel and Pettitt, 2008). We adopt τ_i = ω(i/n_rungs)^5, with n_rungs depending on the interval length ω. Simulation studies are performed in the next subsection to compare the performance of the IS, TINT with prior distribution (TINT-prior) and numerical integration TINT (TINT-trapezium) estimators for r.

Estimators for r based on the IS and TINT methods are compared in this section via simulation. For this task, we consider the pairs (λ, ν) = (1.5, 1), (1, 0.5) and ω = 0.5, 3. For a fair comparison, approximately the same total number of draws (n_total) is used when computing r through the alternative estimators. Effectively, this means that IS and TINT-prior use n_total draws, and n_total is spread over the grid of τ values under TINT-trapezium. Two discretisations are assessed for the latter. The first takes n_rungs = ω/0.1 and the second n_rungs = ω/0.05, with the number of draws per rung being n_total/n_rungs. We explore a grid of n_total from 10K to 200K draws, where 200 replications are used to compute the Monte Carlo standard deviation of each estimator. Results for ω = 3, with r_1 = Z(e^{−3} 1.5, 1)/Z(1.5, 1) and r_2 = Z(e^{−3}, 0.5)/Z(1, 0.5), are reported in Figure 2, and those due to ω = 0.5 can be found in the Supplementary Material, which can be obtained from the authors upon request. In both experiments, the smallest variability is due to the simplest estimation procedure (IS) over the entire n_total range. This indicates that the importance density COM-Poisson(e^{−ω}λ, ν) is close enough to COM-Poisson(λ, ν) to yield a low variability even for high ω.
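A sketch of the TINT-trapezium estimator under the power schedule described above (reusing rcomp_trunc; including both endpoints of the τ grid is our assumption about the schedule):

    # log r = -int_0^omega E[X | tau] dtau, with E[X | tau] taken under
    # COM-Poisson(lambda e^{-tau}, nu), approximated by the trapezoid rule.
    log_r_tint <- function(lambda, nu, omega, n_rungs = 30, N = 2000) {
      tau <- omega * (seq(0, n_rungs) / n_rungs)^5   # power fraction schedule, c = 5
      Ex  <- sapply(tau, function(t) mean(rcomp_trunc(N, lambda * exp(-t), nu)))
      -sum(diff(tau) * (head(Ex, -1) + tail(Ex, -1)) / 2)
    }
    set.seed(1)
    exp(log_r_tint(1.5, 1, 3))   # compare with the IS estimate of r above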
It is now possible to introduce an algorithm to sample from the proposed distribution. Our strategy is to draw X_1 from its marginal and then to sample sequentially from the conditional distributions p(x_j|x_{j−1}, …, x_1), for j = 2, …, d. Up to a constant not depending on x_j, we obtain an unnormalised probability function that depends on r ≡ (r_1, …, r_d) but no longer on z^{−1} ≡ (z_1^{−1}, …, z_d^{−1}). An estimator for p(x_j|x_{j−1}, …, x_1) is obtained by plugging in r̂. If we evaluate p̂(x_j = k|x_{j−1}, …, x_1) for k ∈ {0, …, K} with sufficiently large K, a normalised probability function can be recovered by dividing each evaluation by the sum over k. In Algorithm 1, we provide pseudo-code to draw N independent samples following approximately a MultCOMP(λ, ν, δ, ω) distribution. The quality of the approximation depends on (i) how well we estimate r and (ii) the choice of K. From Section 4.3, we recommend the use of the importance sampling estimator with the number of draws above 130K, a region where the standard deviation seems to stabilize. For (ii), we adopt the strategy of setting a minimum value of K that is increased until the difference of successive probabilities is less than a pre-specified tolerance.

Algorithm 1: Sampling from the MultCOMP distribution. Estimate r_j, for j = 1, …, d. For each draw, simulate x_1 from the COM-Poisson(λ_1, ν_1) marginal; then, for j = 2, …, d, calculate p̂(x_j = k|x_{j−1}, …, x_1) for k ∈ {0, …, K}, where K is sufficiently large, normalise, and simulate x_j. Output: N draws following approximately a d-dimensional multivariate COM-Poisson distribution.
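For d = 2, Algorithm 1 reduces to drawing X_1 from its COM-Poisson marginal and X_2 from the conditional pmf f_2(x_2){1 + δ φ_1(x_1) φ_2(x_2)} renormalised on {0, …, K}. A sketch, reusing dcomp, rcomp_trunc and r_is from the earlier sketches:

    rmultcomp2 <- function(N, lambda, nu, delta, omega, K = 50) {
      r  <- sapply(1:2, function(i) r_is(1e5, lambda[i], nu[i], omega))  # plug-in estimates
      x1 <- rcomp_trunc(N, lambda[1], nu[1])
      k  <- 0:K
      f2 <- dcomp(k, lambda[2], nu[2])
      x2 <- vapply(x1, function(x) {
        w <- f2 * (1 + delta * (exp(-omega * x) - r[1]) * (exp(-omega * k) - r[2]))
        sample(k, 1, prob = w / sum(w))   # renormalise over the truncated support
      }, numeric(1))
      cbind(x1, x2)
    }
    set.seed(1)
    X <- rmultcomp2(500, lambda = c(1, 1.5), nu = c(0.4, 0.8), delta = 3, omega = 2)
    colMeans(X); cor(X[, 1], X[, 2])   # cf. the empirical summaries in Section 6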
5 MCMC for the doubly-intractable posterior

First coined by Murray et al. (2006), the term doubly-intractable refers to the posterior distribution of a model whose likelihood itself contains an intractable normalising term depending on the parameters. Alternatives to handle such models include composite likelihoods (Varin et al., 2011), also known as pseudo-likelihoods, or even likelihood-free methods such as Approximate Bayesian Computation (ABC) (Sisson et al., 2018). We focus on a class of MCMC algorithms that have been proposed for doubly-intractable problems. These can be classified into asymptotically exact and asymptotically inexact (or noisy) algorithms, depending on whether their stationary distribution is the target posterior exactly or approximately (Park and Haran, 2018; Alquier et al., 2016).

Assume that X|θ follows a MultCOMP model with joint probability function of the form (12) and parameter vector θ = (λ, ν, δ, ω), which is assumed to follow a prior distribution. Then, p(θ|X) ∝ p(X|θ)p(θ) is the posterior model, which is doubly-intractable. In this section, two MCMC methods based on auxiliary variables are developed to perform inference for the proposed multivariate count model.

We first consider the GIMH approach. Suppose that a likelihood estimator p̂(X|θ) is available given Y ≡ (y_1, …, y_N), independent draws from an auxiliary density g_θ(·), where the subscript denotes the dependence of the auxiliary density on θ. A general framework for conducting GIMH is given in Algorithm 2.

Algorithm 2: Grouped Independence Metropolis-Hastings (GIMH). Given the current state (θ^(t), Y^(t)), propose θ′ ∼ h(·|θ^(t)) and draw auxiliary variables Y′ from g_{θ′}(·) to compute p̂(X|θ′). The acceptance probability is

π = min{1, [h(θ^(t)|θ′) p̂(X|θ′) p(θ′)] / [h(θ′|θ^(t)) p̂(X|θ^(t)) p(θ^(t))]};

with probability π set θ^(t+1) = θ′ and Y^(t+1) = Y′, otherwise θ^(t+1) = θ^(t) and Y^(t+1) = Y^(t).

An unbiased likelihood estimator of the joint probability function given in (12) can be obtained by estimating unbiasedly and independently r_j and z_j^{−1}, for j = 1, …, d. We leverage the IS estimator introduced in Subsection 4.1 for the former, while the latter can be handled via the method proposed by Benson and Friel (2020) in the context of univariate COM-Poisson regression. By sampling N draws of a COM-Poisson(λ, ν) distribution through their fast-rejection sampler, an unbiased estimator of 1/Z(λ, ν) is given by

ẑ^{−1} = (1/B)(n_N/N),

where n_N/N is the ratio of n_N, the number of draws required for N acceptances, and N. The denominator B is the envelope's tractable bound, which is the normalising constant of a Poisson or geometric distribution in the respective cases ν ≥ 1 and ν < 1. Hence, p̂(X|θ) relies on drawing two sets of auxiliary variables from univariate COM-Poisson(λ_j, ν_j) distributions using the fast-rejection sampler, for j = 1, …, d. One set is used to estimate r_j via IS and the other yields ẑ_j^{−1} via the envelope's acceptance rate. Evidently, the computational cost could be reduced if the same draws were used to compute both ẑ_j^{−1} and r̂_j, but this would introduce dependency between the estimators. Consequently, it would be required to show that E(ẑ_j^{−1} r̂_j) = z_j^{−1} r_j, which is not straightforward and possibly not true. By simulating two separate sets, we are able to guarantee an unbiased estimator of p(X|θ), since ẑ_j^{−1} and r̂_j are independent and unbiased. Another advantage of this approach is that different numbers of draws for ẑ^{−1} and r̂, denoted N_z and N_r, can be set if distinct precisions are required. We investigate GIMH mixing and how it is affected by the different quantities being estimated in the next section. This will guide the choices of N_z and N_r.

First, a GIMH algorithm with single-site updates for the multivariate COM-Poisson posterior model is described given N_z and N_r. After specifying initial states for the parameters, we store the N_z × d and N_r × d matrices of auxiliary draws that are used to compute ẑ^{−1} and r̂. These are denoted by Y_z and Y_r, respectively. The update of one λ_j follows from proposing a new state λ′_j and drawing the auxiliary variables at the proposed parameter value to compute r̂′_j. In Algorithm 3, the notation Y_z^(t)[, j] ← Y′_{z,j} denotes the update of the j-th column of the current Y_z^(t) matrix with auxiliary draws taken at the proposed parameter value, Y′_{z,j} ∼ COM-Poisson(λ′_j, ν_j^(t)); this is similar for Y′_{r,j}. If the proposed state is accepted, Y_z and Y_r are updated to Y′_z and Y′_r. Algorithm 3 details the update of the location parameter vector λ, with that of ν following in a similar manner: for each j, a new state λ′_j is proposed from a log-normal kernel, the auxiliary draws Y′_{z,j} and Y′_{r,j} are used to calculate ẑ′^{−1}_j and r̂′_j, the acceptance probability π is computed, and with probability π we set λ_j^(t+1) = λ′_j. A validity check (Step 8 of Algorithm 3) is implemented to ensure that the current δ values are contained in the interval implied by the proposed parameter value (interval given in (10)). Updates of ω and δ differ on the usage of the auxiliary draws: since their distribution does not depend on ω or δ, no new draws are required for these moves. A new ω value is proposed from the same log-normal kernel, and the recycled draws are used to recompute all elements of r̂. Finally, each δ_{jk} is simulated from a truncated-normal(δ_{jk}^(t), σ_{δ_{jk}}) distribution on the interval (10), for j = 1, …, d−1, k = j+1, …, d.

This section is dedicated to investigating the effects of r̂ and ẑ^{−1} on the performance of the GIMH. This is crucial as the GIMH chain can get stuck in regions of the parameter space if the likelihood is substantially overestimated at any given iteration (Drovandi et al., 2018), resulting in poor mixing. Naturally, the precision to which the likelihood is estimated depends on how well we estimate r and z. In this section, we are able to investigate separately how the variability associated with r̂ and ẑ^{−1} affects the likelihood estimator, and hence the overall GIMH mixing. This is possible because the log-likelihood function of one single d-dimensional multivariate COM-Poisson observation X can be decomposed into the sum of two terms. The first is the contribution due to the marginal univariate COM-Poisson distributions and the second is the dependency part introduced by the Sarmanov method with exponential kernel. This is indicated in the next equation, where the marginal term depends solely on z^{−1} and the second term on r:

log f(x|θ) = Σ_{j=1}^d {x_j log λ_j − ν_j log(x_j!) + log z_j^{−1}} + log{1 + Σ_{1≤j<k≤d} δ_{jk}(e^{−ωx_j} − r_j)(e^{−ωx_k} − r_k)}.
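Given plug-in estimates ẑ^{−1} and r̂, the two contributions can be computed per observation as in the following sketch (delta_mat is an upper-triangular matrix holding the δ_{jk}'s; the parameter values mirror the trivariate experiment below, and the truncation-based "estimates" are for illustration only):

    loglik_parts <- function(x, lambda, nu, delta_mat, omega, z_inv_hat, r_hat) {
      marg <- sum(x * log(lambda) - nu * lgamma(x + 1) + log(z_inv_hat))
      phi  <- exp(-omega * x) - r_hat
      d <- length(x); kern <- 0
      for (j in 1:(d - 1)) for (k in (j + 1):d) kern <- kern + delta_mat[j, k] * phi[j] * phi[k]
      c(marginal = marg, kernel = log1p(kern))
    }
    lam <- c(1.5, 1, 0.5); nv <- c(1, 0.5, 0.8); om <- 3
    D <- matrix(0, 3, 3); D[1, 2] <- 3.5; D[1, 3] <- -2.5; D[2, 3] <- -3
    zhat <- sapply(1:3, function(j) 1 / Z_trunc(lam[j], nv[j]))
    rhat <- sapply(1:3, function(j) Z_trunc(lam[j] * exp(-om), nv[j]) / Z_trunc(lam[j], nv[j]))
    loglik_parts(c(2, 0, 1), lam, nv, D, om, zhat, rhat)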
When considering a sample of n observations, the contribution of z^{−1} to the log-likelihood is n Σ_{j=1}^d log z_j^{−1}; hence there is a multiplicative effect of the sample size for each z_j^{−1}, and the overall influence of this estimator also increases with d. We begin by fixing N_z = N_r = 10K and running GIMH while storing the marginal and kernel contributions at each iteration. The resulting trace plots give an idea of the GIMH mixing associated with each type of estimator. Results reported in this section are due to 50K iterations for a simulated data set of 500 tri-dimensional observations with λ = (1.5, 1, 0.5), ν = (1, 0.5, 0.8), ω = 3, δ_12 = 3.5, δ_13 = −2.5, and δ_23 = −3. Poor mixing is observed in the λ and ν chains, while those of δ and ω behave well. The part of the log-likelihood function proportional to δ and ω is given only by the kernel term, while those of λ and ν depend on both types of estimators. This suggests that N_r = 10K is sufficient for r̂ but that a higher precision is required for ẑ^{−1}. To confirm the hypothesis that the poor mixing in λ and ν is caused by ẑ^{−1}, we plot in Figure 4 each parameter's total proportional log-likelihood function and its contribution due to the marginal part. The kernel term cannot be decomposed in this manner, so it is the same for all λ and ν. Figure 4 shows that the largest part of the proportional log-likelihood function is given by the marginal contribution. Consequently, these parameters are largely influenced by ẑ^{−1}.

According to Doucet et al. (2015), for good performance of the GIMH, the log-likelihood function should be estimated with a standard deviation between 1 and 1.7. More specifically, for the case in which the efficiency of the Metropolis-Hastings algorithm using the exact likelihood is unknown, the suggested value is 1.2. Assumptions in this work are that the noise introduced by the log-likelihood estimator is Gaussian, with a variance inversely proportional to the number of samples used to construct the estimator and independent of the parameter value. We implement this strategy by setting an adaptation phase for N_z before running the MCMC chain. Naturally, it is also possible to calibrate each N_{z,j}, j = 1, …, d, but we find that taking N_{z,j} = N_z for all j is a simpler and more conservative choice. A fixed N_r = 10K for IS proved to work well in our experiments, but this can be ensured via a preliminary run where we assess N_r by monitoring the mixing of δ and ω. Additional simulation studies investigating how N_z increases with d and n are included in the supplementary material. In that study, 100 data sets are simulated with different configurations and sample sizes of the MultCOMP parameters. We adapt N_z under the true parameter values and report the resulting mean and standard deviation. The results evidence that N_z grows with the sample size, but how the two are associated seems to depend on the parameter configuration. There is also a positive impact of increasing the data dimension, which is suggested by comparing configurations that take the same parameter values and vary d.
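A sketch of such an adaptation phase; est_loglik is a hypothetical function returning one realisation of the estimated log-likelihood based on N_z auxiliary draws (here a stand-in whose noise has standard deviation 50/√N_z):

    adapt_Nz <- function(est_loglik, Nz = 1000, target = 1.2, nrep = 50, factor = 1.5) {
      repeat {
        s <- sd(replicate(nrep, est_loglik(Nz)))
        if (s <= target) return(Nz)
        Nz <- ceiling(factor * Nz)   # sd decreases roughly like 1/sqrt(Nz)
      }
    }
    est_loglik <- function(Nz) -1000 + rnorm(1, sd = 50 / sqrt(Nz))  # illustrative stand-in
    set.seed(1)
    adapt_Nz(est_loglik)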
In addition, higher COM-Poisson mean values relate to bigger N_z, since log z^{−1} will be higher in magnitude. For example, under λ = 1 and ν = 1.5, the log reciprocal normalising constant is approximately −0.89, while this is around −1.12 when λ = 1 and ν = 0.7. In these settings, the average N_z is 3878 and 6672 respectively for n = 200, which increases to 6672 and 17137 when n is 500. With these results, we are able to better understand the effect of N_z on the accuracy of the final log-likelihood estimator, which is crucial for a good performance of the GIMH.

The exchange algorithm by Murray et al. (2006) provides a framework for conducting MCMC for doubly-intractable problems, relying on the ability to simulate exactly from the likelihood. It assumes that it is possible to write the model's likelihood function p(X|θ) as the product of a tractable unnormalised term q(X|θ) and the reciprocal normalising constant 1/Z(θ), which is intractable. By augmenting the target density with an auxiliary variable X′ ∼ p(·|θ′), where θ′ is a proposed state for θ, a cancellation of the normalising terms is achieved in the Metropolis-Hastings acceptance ratio. Algorithm 4 gives the general formulation of an exchange algorithm where moves from the current state θ^(t) to θ′ are proposed according to h(θ′|θ^(t)).

Algorithm 4: Exchange algorithm. Input: initial state θ^(t). 1 Propose θ′ ∼ h(·|θ^(t)); 2 Simulate X′ from the likelihood at θ′, that is, X′ ∼ p(·|θ′); 3 Accept θ′ with probability

π = min{1, [q(X|θ′) q(X′|θ^(t)) p(θ′) h(θ^(t)|θ′)] / [q(X|θ^(t)) q(X′|θ′) p(θ^(t)) h(θ′|θ^(t))]}.

The last step illustrates how the cancellation of the normalising terms is achieved, leaving the acceptance ratio tractable. In this section, we investigate the application of the exchange algorithm for inferring the multivariate COM-Poisson model parameters. Our motivation in developing this alternative approach is to avoid the computation of ẑ^{−1}, which can require a substantial number of auxiliary draws depending on d and n. However, in our context, an algorithm to simulate from the likelihood function exactly is not available. Instead, we formulate a noisy exchange algorithm where Step 2 of Algorithm 4 is carried out via Algorithm 1. Additionally, q(X|θ) is replaced with q̂(X|θ, r̂), an estimator of the MultCOMP unnormalised probability function. In analogy to Algorithm 4, Z(θ) ≡ Z(λ, ν) under our model; more specifically, Z(λ, ν) = ∏_{j=1}^d Z(λ_j, ν_j) is the product of the marginal normalising constants. Under our formulation, there is inexactness in the cancellation of the normalising terms and in the approximation of q(X|θ). Bearing this in mind, our goal is to compare the noisy exchange approach with GIMH under controlled scenarios. We highlight that there are extensions of the algorithm by Murray et al. (2006) relaxing the perfect sampling requirement (Liang et al., 2016), but these still depend on an analytical calculation of q(X|θ).

Algorithm 5 describes single-site updates of the λ elements under our noisy exchange formulation. For increased computational efficiency, the ratio estimates are recycled throughout the iterations. This means that we propagate r̂(λ^(t), ν^(t), ω^(t)) and refresh r̂_k(λ_k, ν_k) whenever a new λ_k or ν_k is accepted, or all elements when a new ω is visited. For the update of one λ_j, a new state λ′_j is proposed; r̂′_j is estimated and r̂′ is set to r̂^(t) with its j-th element replaced by r̂′_j; we check whether λ′_j yields a valid joint probability function by ensuring that every component of δ satisfies (10), returning to the proposal step if this is not achieved; a pseudo data set X′ ∼ MultCOMP(θ′) is drawn using Algorithm 1; and, with probability π, we set λ_j^(t+1) = λ′_j and r̂^(t) = r̂′. As before, the ν update is carried out in a similar fashion to λ, and that of ω requires all elements of r̂ to be estimated. The final move, for δ, is simplified because the ratios r do not depend on this parameter. Similarly to GIMH, we employ a truncated normal proposal in this step and the same prior specification as before.
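On the log scale, the (noisy) exchange acceptance probability of Algorithm 4 can be sketched as follows, with q_unnorm returning the (estimated) unnormalised probability of each observation and log_h_ratio = log h(θ|θ′) − log h(θ′|θ), which is zero for a symmetric proposal:

    log_accept_exchange <- function(q_unnorm, log_prior, X, X_aux, theta, theta_prop,
                                    log_h_ratio = 0) {
      min(0,
          sum(log(q_unnorm(X, theta_prop))) + sum(log(q_unnorm(X_aux, theta))) +
          log_prior(theta_prop) + log_h_ratio -
          sum(log(q_unnorm(X, theta))) - sum(log(q_unnorm(X_aux, theta_prop))) -
          log_prior(theta))
    }
    # Accept the move if log(runif(1)) < log_accept_exchange(...).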
6 Simulation studies

This section is dedicated to comparing the proposed algorithms via simulation studies. The GIMH and noisy exchange approaches employing IS ratio estimators are applied to synthetic bivariate and trivariate COM-Poisson data. Under controlled settings, we can assess whether the inference is consistent with the true parameter values used to simulate the data and how the GIMH and noisy exchange results compare.

We simulate data from the bivariate COM-Poisson distribution under two configurations following the steps in Algorithm 1 with sample size n = 200. In the first scenario, the two components are positively correlated and overdispersed. More specifically, we set λ_1 = 1, λ_2 = 1.5, ν_1 = 0.4, ν_2 = 0.8, ω = 2, and δ = 3. The empirical means, variances, and Pearson's linear correlation for this synthetic data set are (1.745, 1.745), (2.915, 1.990), and 0.340, respectively. Configuration 2 sets λ_1 = 1, λ_2 = 2, ν_1 = 0.8, ν_2 = 1.5, ω = 1.5, and δ = −2, a scenario where the first component is overdispersed, the second is underdispersed, and the two are negatively correlated. Scatterplots with added jitter (small random noise) are displayed in Figure 5, illustrating the two bivariate synthetic data sets. This is done via geom_jitter from the R package ggplot2 to improve visualization by avoiding observations being plotted directly on top of each other.

For each data set, five parallel chains of the GIMH and noisy exchange algorithms are run for 30K iterations with the first 10K discarded as burn-in. Convergence of the multiple chains is assessed via the Brooks-Gelman-Rubin R statistic (Brooks and Gelman, 1998). If R is close to 1 for all model parameters, convergence is accepted and the posterior draws are combined for inference. Prior distributions of the location parameters are set to Gamma(2, 2), while a Gamma(1.5, 2) is adopted for the parameters controlling the dispersion. Those of ω and δ are Gamma(2, 0.8) and Truncated-Normal(0, 5), respectively. An adaptation period for N_z is employed before running GIMH, targeting a log-likelihood standard deviation of 1.2. This returned N_z ≈ 18K for the first configuration (minimum 17708 and maximum 18318) and ≈ 15K for the second (minimum 14648 and maximum 15489). Furthermore, N_r is fixed to 10K in both algorithms.

The results due to the first configuration are illustrated in Figure 6, where the posterior densities obtained under each method are displayed using different line types. These are produced from the 100K combined draws from the five parallel chains, as we found R values close to 1 for all parameters. Posterior density estimates from the two algorithms are very similar and mostly overlap. Among the model parameters, there is higher uncertainty in the posterior distribution of ω, which has a heavy right tail. This parameter plays the role of extending the correlation range supported by a given δ value. However, at some point a plateau is reached, in a way that there is no further increase in the correlation related to the increase in ω. In our view, this explains the heavy right tail of this parameter's posterior distribution. Table 1 provides numerical summaries of the posterior distributions. As expected, the posterior means are close to the parameter values used to generate the data, which are contained in the 95% percentile-based credible intervals in all cases.
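The convergence check described above can be sketched with the coda package (the chains below are illustrative placeholders for the five sets of post burn-in draws):

    library(coda)
    set.seed(1)
    chain_list <- lapply(1:5, function(i)
      mcmc(matrix(rnorm(20000 * 6), ncol = 6,
                  dimnames = list(NULL, c("lambda1", "lambda2", "nu1", "nu2", "omega", "delta")))))
    gelman.diag(mcmc.list(chain_list), autoburnin = FALSE)  # values near 1 -> combine draws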
The model fits for the second simulated data set are presented in Figure 7, with numerical summaries given in Table 2. As before, there is agreement between the MCMC methods, as evidenced by the proximity of the posterior curves obtained from each algorithm. The numerical summaries also show that the parameters used to generate the data are very likely under the posterior model, all covered by the 95% credible intervals.

Given the agreement between the algorithms compared in Subsection 6.1, we now present a tri-dimensional data example for which the fastest inference strategy is employed. Given that N_z depends on d and n, the preferred algorithm can vary depending on the data set. In this experiment, where d = 3 and n = 500, preliminary runs of both algorithms indicate GIMH to be preferred. The characteristics of the trivariate COM-Poisson simulated data set are as follows. The location and dispersion parameter vectors are respectively λ = (1.5, 1, 0.5) and ν = (1, 0.5, 0.8) (the first component is equidispersed and the others are overdispersed). The parameters controlling the correlation are ω = 3, δ_12 = 3.5, δ_13 = −2.5, δ_23 = −3, in a way that there are positively and negatively dependent pairs. The Pearson's correlation is 0.167 for the pair (X_1, X_2), −0.165 for (X_1, X_3), and −0.191 for (X_2, X_3). As before, five parallel chains of GIMH from random starting points are run for a total of 30K iterations. Adaptation of N_z resulted in N_z around 15K for all chains (minimum 14525 and maximum 157891), again targeting a 1.2 log-likelihood standard deviation. Numerical summaries are given in Table 3. The results are consistent with the parameters used to generate the data, which are likely under the parameters' posterior distributions.

7 Premier League data analysis

We now illustrate the usefulness of the proposed multivariate COM-Poisson model in modelling real-life correlated count data. The novel data analysis that we focus on concerns the number of goals scored by the home and the away team in Premier League matches. Our main goal is to assess the effect of the absence of crowds during the COVID-19 pandemic on the well-known home team advantage. Several papers have modelled soccer data demonstrating a positive effect of playing at home, such as Dixon and Coles (1997), Karlis and Ntzoufras (2000), and Karlis and Ntzoufras (2003). Very recently, there has also been great interest in evaluating whether this has changed due to the absence of crowds at games during the pandemic; for instance, see Tilp and Thaller (2020) and McCarrick et al. (2020). A preliminary comparison is provided in Table 4, where the proportions of home draws, losses and wins are compared for the two distinct time periods, pre- and during-pandemic. A reduction in the proportion of home team wins is observed for the pandemic matches relative to those with crowds present, with an associated increase in losses, supporting the hypothesis that the advantage of playing at home is decreased in the pandemic matches. The MultCOMP model allows us to account for these features of the data. Further, we adopt a regression structure on the count of home team goals, given that this seems to decrease for matches during the pandemic. We take X_1i and X_2i to be the number of goals scored by the home and the away team, respectively, with i = 1, …, 1140 indexing the Premier League matches. The following assumptions are made in our data analysis. The pair (X_1i, X_2i) is a multivariate observation from a MultCOMP distribution with match-specific location parameters. That is, (X_1i, X_2i) ∼ MultCOMP(λ_i, ν, δ, ω) with λ_i = (λ_1i, λ_2i).
Further, we assume independence among matches and the following regression structure on λ_i:

log λ_{1i} = γ_0 + γ_1 Home_i + γ_2 Home_i × Pandemic_i and log λ_{2i} = γ_0, (16)

for i = 1, …, 1140, where Home (= 1 for the home team and = 0 otherwise) and Pandemic (= 1 for matches realized during the pandemic and = 0 otherwise) are indicator covariates. Following Karlis and Ntzoufras (2000, 2003) and other references in the field, we adopt a common intercept γ_0 for the competing teams. Since the home team goals are assigned to the first component, the first indicator is always one (Home_i = 1 for all i) and represents a deviation from the overall average number of goals scored by a team in a Premier League match. This is kept explicit in the regression structure for clarity and implies that γ_1 measures the home team main effect during the pre-pandemic period. When Pandemic = 1, the difference between the home and away teams becomes γ_1 + γ_2, so we can interpret γ_2 as the parameter measuring the pandemic effect on the home team advantage. We shall denote the elicited model M_BCOMP. The Bayesian inferential procedures introduced in the paper hold for the iid case and are easily adapted to regression analysis involving categorical covariates, with the developed methodology being applied to each regressor level. Inference is carried out with 100K draws from the model posterior distributions resulting from the combination of five parallel chains of the GIMH algorithm. Adaptation of N_z, as described in Subsection 5.1.1, indicates that on average 170K auxiliary draws are necessary to ensure good mixing of the GIMH algorithm. Summaries of the posterior model parameter distributions are shown in Table 5, and the respective density plots can be found in the Supplementary Material.

The posterior distributions of the MultCOMP model parameters are in accordance with our preliminary analysis. The negative dependency between X_1i and X_2i is captured by the proposed model, as determined by the sign of δ. Moreover, the 99% credible interval of this parameter, (−2.472, −0.741), does not contain zero, the independence case. It is also seen that the marginal mean-variance relationships are well modelled, with both ν_1 and ν_2 below one, which captures the data overdispersion. In the absence of the pandemic, the home goals surplus is measured by γ_1, which we can interpret on a multiplicative scale by taking exp(γ_1). The posterior expectation of this quantity is 1.249, with an associated 95% credible interval of (1.055, 1.455). Hence, playing at home when crowds are present increases the goals scored, on average, by around 25%. Without the public, the home effect on the number of goals is exp{γ_1 + γ_2}, which has a posterior expectation of 1.146 and a 95% credible interval of (0.968, 1.336). The shift in home team advantage is illustrated in Figure 8, where the posterior densities of exp{γ_1} and exp{γ_1 + γ_2} are displayed. Since P(γ_2 < 0) = 0.97, we conclude that there is a high posterior probability that the home team advantage, here measured by the home goals surplus, has decreased with the absence of crowds in Premier League matches.

Alternative models were fitted and compared to M_BCOMP in terms of the Pareto-smoothed importance sampling leave-one-out cross-validation criterion (PSIS-LOO). We refer to the R package loo (Vehtari et al., 2020) to compute this criterion for each competing model, where the one with the minimum PSIS-LOO is expected to have the best predictive performance.
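A sketch of the PSIS-LOO computation with the loo package; log_lik is an S × n matrix of pointwise log-likelihood evaluations over the posterior draws (filled with placeholder values here), and r_eff = 1 assumes approximately independent draws:

    library(loo)
    set.seed(1)
    S <- 1000; n <- 1140
    log_lik <- matrix(rnorm(S * n, mean = -3, sd = 0.2), nrow = S)  # placeholders
    fit_loo <- loo(log_lik, r_eff = rep(1, n))
    print(fit_loo)   # looic is the quantity reported in the text (smaller is better)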
A more complex regression structure was also considered for the MultCOMP parameters, in a model which we denote by M_BCOMP,Full. In this version, the pandemic effect is component-specific and is also included in the model's dependency structure. This is done by setting log λ_{2i} = γ_0 + γ_3 Pandemic_i and by placing a regression structure on the dependence parameter, with α_1 being the parameter quantifying the pandemic effect on the correlation. A bivariate COM-Poisson model without the pandemic covariate (M_BCOMP,0) was also fitted to the data, as well as the bivariate Poisson special case (ν_1 = ν_2 = 1) with the regression structure (16). The latter is denoted by M_P, a tractable model that we fit with a Gibbs sampler. The expected PSIS-LOO for M_BCOMP is 6917.2 (32.6), with the estimated standard deviation in parentheses, while this is 6979.8 (63.8) for M_BCOMP,Full, 6923.1 (64.0) for M_BCOMP,0 and, finally, 6938.4 (71.6) for M_P. Although the smallest expected value is due to M_BCOMP, we cannot decisively choose between the models according to their predictive performances, given that the differences in PSIS-LOO are not high in comparison to their variability. We proceed by examining the posterior distributions of nested model fits, from which the following observations are made. The M_BCOMP model would reduce to M_P if equidispersion were a reasonable assumption for the marginal count distributions. From Table 5, we have that P(ν_1 < 1) and P(ν_2 < 1) are close to one under M_BCOMP, which motivates us to choose M_BCOMP over its Poisson special case. Regarding M_BCOMP,Full, it does not seem worthwhile to consider this more complex model, given that the posterior distributions of γ_2 and γ_3 are highly similar and that α_1 is concentrated at zero, as shown in the Supplementary Material. Finally, a comparison between M_BCOMP and the baseline model M_BCOMP,0 is carried out by drawing from their posterior predictive distributions. In this analysis, 100K data sets pre and during pandemic are simulated from each model and summarised according to relevant statistics. Our interest is in evaluating how well the alternative models capture characteristics of the original data, which is done by comparing summaries of the replicated data sets with their observed counterparts. Let (X_1, X_2)^{k,pre} and (X_1, X_2)^{k,pand} denote the k-th pre- and during-pandemic simulated data sets, either from M_BCOMP or M_BCOMP,0, for k = 1, …, 100K. In keeping with the Premier League data characteristics, and given that there is no pandemic effect under M_BCOMP,0, the difference between the two sets when simulating from this model is simply the sample size. If the data are replicated according to M_BCOMP, (X_1, X_2)^{k,pand} includes the effect of γ_2 in log λ_1. Given our interest in the shift of home advantage, we record the average number of goals scored by the home team pre and during the pandemic,

Home-goals_{k,pre} ≡ Σ_{i=1}^{668} X^{k,pre}_{i,1} / 668,

with Home-goals_{k,pand} defined analogously over the pandemic matches.

8 Concluding remarks

Careful and detailed Bayesian inferential procedures were developed to treat the doubly-intractable likelihood challenge due to our model construction. Different inferential strategies for the doubly-intractable posterior distribution of the multivariate COM-Poisson model were proposed in Section 5. The first option is the GIMH, a pseudo-marginal approach that relies on an unbiased likelihood estimator. Since the number of auxiliary draws required for the ẑ^{−1} estimators involved in this approximation increases with d and n, an alternative noisy exchange algorithm was proposed. Although this option is inexact, its advantage is in avoiding the computation of ẑ^{−1}. Simulation studies were conducted to investigate the two inference strategies.
Results from artificial bivariate data sets showed that the GIMH and noisy exchange algorithms produced very similar results, with negligible differences between them. Moreover, the posterior means were close to the parameter values used to generate the data, all contained in the 95% credible intervals. This demonstrates that both options provide sensible inference for the proposed model, and computational speed can guide the choice of algorithm to be used in each application. Finally, a trivariate example displaying positively and negatively related components was included to illustrate the d > 2 case. An empirical illustration investigating the impact of COVID-19 on the Premier League was presented based on the methodologies developed in this paper. We fitted a bivariate COM-Poisson regression model to the goals scored by the home and away teams in the Premier League from 2018 to 2021, also considering the effect of no crowds during the COVID pandemic. Our inferential analysis showed a potential decrease in the number of goals scored by the home team during the pandemic compared to the number of goals scored pre-pandemic. We also analysed a shunters' accident data set (Arbous and Kerrich, 1951) in the Supplementary Material, which is a well-known example that has been used to illustrate numerous count data models in the literature; for instance, see Aitchison and Ho (1989), Famoye and Consul (1995), Sellers et al. (2016), and Jones and Marchand (2019). Here we conduct a full Bayesian analysis employing our model, its Poisson special case (BP-S), the trivariate reduction bivariate Poisson (BP-T), and the bivariate negative binomial (BNB) distribution of Marshall and Olkin (1990). Results reported in the Supplementary Material showed that the proposed model is preferred to the Poisson alternatives and is at least competitive with respect to the BNB model.

References

Aitchison and Ho (1989). The multivariate Poisson-log normal distribution.
Alquier et al. (2016). Noisy Monte Carlo: Convergence of Markov chains with approximate transition kernels.
Andrieu and Roberts (2009). The pseudo-marginal approach for efficient Monte Carlo computations.
Arbous and Kerrich (1951). Accident statistics and the concept of accident-proneness.
Benson and Friel (2020). Bayesian inference, model selection and likelihood estimation using fast rejection sampling: The Conway-Maxwell-Poisson distribution.
Bolancé and Vernic (2019). Multivariate count data generalized linear models: Three approaches based on the Sarmanov distribution.
Brooks and Gelman (1998). General methods for monitoring convergence of iterative simulations.
Chatla and Shmueli (2020). A tree-based semi-varying coefficient model for the COM-Poisson distribution.
Chen et al. (2000). Estimating ratios of normalizing constants. In: Monte Carlo Methods in Bayesian Computation.
Conway and Maxwell (1962). A queueing model with state dependent service rate.
Daly and Gaunt (2016). The Conway-Maxwell-Poisson distribution: Distributional theory and approximation.
Dixon and Coles (1997). Modelling association football scores and inefficiencies in the football betting market.
Doucet et al. (2015). Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator.
Drovandi et al. (2018). Accelerating pseudo-marginal MCMC using Gaussian processes.
Famoye and Consul (1995). Bivariate generalized Poisson distribution with some applications.
Friel et al. (2014). Improving power posterior estimation of statistical evidence.
Friel and Pettitt (2008). Marginal likelihood estimation via power posteriors.
Gaunt et al. (2019). An asymptotic expansion for the normalizing constant of the Conway-Maxwell-Poisson distribution.
Gelman and Meng (1998). Simulating normalizing constants: From importance sampling to bridge sampling to path sampling.
Jones and Marchand (2019). Multivariate discrete distributions via sums and shares.
Kadane et al. (2006). Conjugate analysis of the Conway-Maxwell-Poisson distribution.
Karlis and Ntzoufras (2000). On modelling soccer data.
Karlis and Ntzoufras (2003). Analysis of sports data by using bivariate Poisson models.
Lee (1996). Properties and applications of the Sarmanov family of bivariate distributions.
Liang et al. (2016). An adaptive exchange algorithm for sampling from distributions with intractable normalizing constants.
Marshall and Olkin (1990). Multivariate distributions generated from mixtures of convolution and product families.
McCarrick et al. (2020). Home advantage during the COVID-19 pandemic in European football.
Meng and Wong (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration.
Murray et al. (2006). MCMC for doubly-intractable distributions.
Nadarajah (2009). Useful moment and cdf formulations for the COM-Poisson distribution.
Neal (2001). Annealed importance sampling.
Oates et al. (2014). The controlled thermodynamic integral for Bayesian model comparison.
Ong et al. (2021). Bivariate Conway-Maxwell Poisson distributions with given marginals and correlation.
Park and Haran (2018). Bayesian inference in the presence of intractable normalizing functions.
Sarmanov (1966). Generalized normal correlation and two-dimensional Fréchet classes.
Sellers et al. (2016). Bivariate Conway-Maxwell-Poisson distribution: Formulation, properties, and inference.
Sellers et al. (2019). A flexible univariate autoregressive time-series model for dispersed count data.
Sellers and Premeaux (2020). Conway-Maxwell-Poisson regression models for dispersed count data.
Sellers and Shmueli (2010). A flexible regression model for count data.
Shmueli et al. (2005). A useful distribution for fitting discrete data: Revival of the Conway-Maxwell-Poisson distribution.
Sisson et al. (2018). Handbook of Approximate Bayesian Computation.
Tilp and Thaller (2020). Covid-19 has turned home advantage into home disadvantage in the German soccer Bundesliga.
Torrie and Valleau (1977). Non-physical sampling distributions in Monte-Carlo free-energy estimation: Umbrella sampling.
Varin et al. (2011). An overview of composite likelihood methods.
Vehtari et al. (2020). loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models.
Vehtari et al. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.
Vernic (2020). On a class of bivariate mixed Sarmanov distributions.