title: Modeling a sequence of multinomial data with randomly varying probabilities
authors: Deb, Soudeep; Roy, Rishideep; Das, Shubhabrata
date: 2021-04-07

We consider a sequence of variables having multinomial distribution, with the number of trials corresponding to these variables being large and possibly different. The multinomial probabilities of the categories are assumed to vary randomly depending on batches. The proposed framework is interesting from the perspective of various applications in practice, such as predicting the winner of an election or forecasting the market share of different brands. In this work, we first derive sufficient conditions for asymptotic normality of the estimates of the multinomial cell probabilities, and of corresponding suitable transformations. Then, we consider a Bayesian setting to implement our model. We consider hierarchical priors using multivariate normal and inverse Wishart distributions, and establish posterior consistency. Based on this result, and following appropriate Gibbs sampling algorithms, we can infer about aggregate data. The methodology is illustrated in detail with two real life applications, in the contexts of a political election and of sales forecasting. Additional insights on effectiveness are derived through a simulation study. The multinomial distribution is the most common way to model categorical data. In this work, we study the behaviour of a collection of multinomial distributions with random fluctuations in cell (or 'category', as we use the two terms interchangeably) probabilities. We consider that the multinomial data are observed in batches, with the number of trials in each batch being possibly different. The cell probabilities for every multinomial random variable in the data are assumed to be additive combinations of a fixed component, which is constant across batches, and a random perturbation. The perturbations for different batches are taken to be independent. In this paper, we develop an appropriate modeling framework for the above setup under mild assumptions on the behaviour of the random perturbations. The model is implemented in a Bayesian setting. We provide relevant theoretical results for the proposed method, and describe how the model can be used to forecast and infer about different features of the entire data. One of the main motivations behind this study is an attractive problem related to political elections. Voter models have been of interest for a long time. With the strengthening of democracy and the growing role of the media around the world, statistical models for the prediction of elections have been on the rise. Elections are studied at different stages: starting with pre-poll predictions, going on to post-poll predictions, and also calling the election before the final result is declared, while the counting of ballots is in progress. In particular, the current work is motivated by the last context, where the counting results are publicly disclosed in batches (or rounds). Stemming from public interest, this often leads to a competition among the media houses "to call the election", i.e. to predict the outcome correctly and as early as possible. Using exit poll data has been a popular approach in this context. It, however, has its own pitfalls, as has been shown in many papers (see, e.g., Stiers and Dassonneville (2018)).
Because of the inadequacy of exit poll data, the Associated Press (AP) designed its own methodology for both survey and forecasting (Slodysko (2020)). Interestingly, they still do not call a closely contested race unless the trailing candidate cannot mathematically secure a victory. Another common approach is to extract relevant information from social media, such as Twitter or Google Trends, and to use that to predict the winner of an election. O'Connor et al. (2010) and Tumasjan et al. (2010) are two notable works in this regard. A major criticism of this approach is that intentional bias in social media posts often leads to error-prone conclusions in the forecasting problems, cf. Anuta et al. (2017). Further, Haq et al. (2020) provide a thorough review of the existing literature on how to call an election, and one can see that most of the methods are either based on data from social media, or are ad hoc and devoid of sound statistical principles. To that end, it is of paramount importance to develop a good forecasting methodology based only on the available information of votes secured by the candidates in the completed rounds of counting. This can be treated as a collection of multinomial data, where the number of categories is equal to the number of candidates and the random fluctuations in cell probabilities reflect the randomness of the vote shares in different rounds. The methodology proposed in this work can be applied in other domains as well, most notably in sales forecasting. Retail giants and companies track the sales of different products on a regular basis. This is useful from various perspectives. One of the key aspects is to project or assess the overall (annual) sales pattern from the data collated at the monthly, weekly or even daily level. It is of great value to a company to infer about its eventual market share for the period (year) as early as possible, and hence to validate whether its (possibly new) marketing plans are as effective as targeted. The retailer would find these inferences useful from the perspective of managing their supply chain, as well as in their own marketing. Other possible applications of the proposed methodology (possibly with necessary adjustments) include analyzing hospital-visits data, in-game sports prediction etc. To the best of our knowledge, the present work is the first attempt to introduce a randomly varying probability structure in a sequence of non-identical multinomial distributions. In the binomial setting (i.e. with two categories of outcome), Le Cam (1960) introduced the concept of independent but non-identical random indicators, which was later discussed and applied in a few other studies (see, e.g., Volkova (1996), Hong et al. (2009), Fernández and Williams (2010)). However, in all of these papers, the cell probabilities are considered to be fixed. Introducing randomness in the probabilities makes the framework more attractive and suitable for many real life applications, as discussed above. Thus, from a methodological perspective, this paper aims to bridge that gap. Given the above-mentioned motivating applications, we emphasize that the goal of this study is to forecast aspects of the final outcome rather than inference on the multinomial cell probabilities, although the latter can also be easily done through the proposed framework. The rest of the paper is arranged as follows.
In Section 2.1, we introduce the preliminary notations; in Section 2.2, we describe the model in a formal statistical framework and state the main results. The proofs of the theorems are provided in Appendix A. In Section 2.3, we discuss the estimation of the posterior distribution of the model parameters via Gibbs sampling. The prediction aspects of the model are described in Section 2.4. Section 3 discusses simulation results which demonstrate the suitability of the model in the broad context. The application of our proposed methodology to calling an election is demonstrated in Section 4.1, with data from the legislative assembly election of Bihar (a state in India) held in 2020. Section 4.2 deals with a real life application of the methodology in the sales forecasting context. We conclude with a summary, some related comments and ways to extend the current work in Section 5. In this section, we describe the notations and assumptions, and then, for better understanding, draw the necessary connections to the previously discussed examples. At the core of the framework, we have a sequence of independent trials, each of which results in one among $C$ possible categories of outcomes. The outcomes of the trials are recorded in aggregated form, which we refer to as batches (or rounds). We assume that the cell probabilities remain the same within a batch, but vary randomly across different batches. Hence, the observed data from an individual batch can be modelled by the standard multinomial distribution; collectively, we have a sequence of independent but non-identical multinomial random variables. We adopt the following notations. Altogether, the multinomial data is available in $K$ batches. For each batch, the number of trials is given by $n_j$, $1 \le j \le K$, and each $n_j$ is assumed to be large. Let $N_j = \sum_{i=1}^{j} n_i$ denote the cumulative number of trials up to the $j$-th batch. For simplicity, let $N = N_K$, the total number of trials in all the batches. We use $X_j = (X_{1j}, X_{2j}, \ldots, X_{Cj})^T$ to denote the observed counts for the $C$ different categories for the $j$-th multinomial variable in the data. Also, let $Y_j = (Y_{1j}, Y_{2j}, \ldots, Y_{Cj})^T$ be the cumulative counts up to the $j$-th batch. Clearly,
$$Y_{cj} = \sum_{i=1}^{j} X_{ci}, \quad 1 \le c \le C. \qquad (2.1)$$
Let $p_{1j}, p_{2j}, \ldots, p_{Cj}$ denote the probabilities of the $C$ categories for each of the independent trials in the $j$-th batch. We focus on $\{p_{1j}, p_{2j}, \ldots, p_{(C-1)j}\}$, since the probability of the last category is then automatically determined by the constraint $\sum_{c=1}^{C} p_{cj} = 1$, for $1 \le j \le K$. Our objective is to estimate the probability distribution, and subsequently various properties, of $(h(Y_l) \mid \mathcal{F}_j)$ for $l > j$, and for some suitable measurable function $h$, where $\mathcal{F}_j$ is the sigma-field generated by the data up to the $j$-th batch. In the voting context, $C$ denotes the number of candidates contesting in a constituency, and each voter casts a vote in favour of exactly one of these candidates. Here, $n_j$ represents the number of voters whose votes are counted in the $j$-th round, while $N_j$ denotes the cumulative number of votes counted up to the $j$-th round. Similarly, $X_j$ records the numbers of votes received by the candidates in the $j$-th round, and $Y_j$ the cumulative votes up to that round. We can use our method to find the probability of winning for any particular candidate, as well as the expected margin of victory, given the counted votes up to round $j$.
In this case, the measurable functions of interest are
$$h_{1,i}(Y_K) = I\big(Y_{iK} = Y_K^{(1)}\big), \qquad h_2(Y_K) = Y_K^{(1)} - Y_K^{(2)},$$
where $I(\cdot)$ denotes the indicator function, and $Y_K^{(1)}$ and $Y_K^{(2)}$ respectively denote the first and second order statistics of $Y_K$. Next, for the sales forecasting problem, $C$ denotes the number of competing brands in a certain product category. The customers exercise their choices in selecting one of the brands in individual purchases. The aggregate (e.g. weekly) sales data is tracked, with $X_j$ (resp. $Y_j$) showing the sales of the competing brands in (resp. cumulatively up to) the $j$-th week, equivalent to the $j$-th batch in our notations. Consider the problem of predicting the market share of the $i$-th brand up to the $l$-th week, based on the data up to the $j$-th week, for $l > j$. For example, in a standard setting, one might consider $l = 52$, representing the end of the year. In this case, the measurable function of interest is $h_{1,i}(Y_K) = Y_{iK}/N$. Further, one can take $h_{2,i}(Y_K) = I(Y_{iK} \ge \pi_0 N)$ to predict the probability of the brand reaching a target market share of $\pi_0$. We propose to consider the multinomial cell probabilities as random variables, having a fixed part and a randomly varying part. Thus, for $c = 1, 2, \ldots, C$, $p_{cj}$ is written as $p_c + \varepsilon_{cj}$, where $p_c$ is the constant preference component that remains the same across the batches for category $c$, and the $\varepsilon_{cj}$'s, for $c \in \{1, 2, \ldots, C\}$, are zero-mean random perturbations that model possible fluctuations in the preference probabilities for the categories across the batches. We enforce $\varepsilon_{Cj} = -\sum_{c=1}^{C-1} \varepsilon_{cj}$. In line with the earlier notations, we use $p = (p_1, p_2, \ldots, p_C)^T$, $\bar{p} = (p_1, p_2, \ldots, p_{C-1})^T$, $p_j = (p_{1j}, p_{2j}, \ldots, p_{Cj})^T$ and $\varepsilon_j = (\varepsilon_{1j}, \varepsilon_{2j}, \ldots, \varepsilon_{(C-1)j})^T$, for convenience. Note that the covariances between the components of the $\varepsilon_j$'s are likely to be negative due to the structure of the multinomial distribution. The randomness of the $\varepsilon_j$'s also needs to be carefully modelled so as to ensure that the category probabilities $p_{cj}$ lie in the interval $[0, 1]$. We use $S$ to denote the possible set of values of $\varepsilon_j$. Let $\mathrm{Cov}(\cdot, \cdot)$ denote the covariance matrix of two random vectors, and let $N_r(\theta, \Psi)$ denote an $r$-variate normal distribution with mean $\theta$ and dispersion matrix $\Psi$. Throughout this article, $0$ denotes a vector of all zeroes and $I$ denotes an identity matrix of appropriate order. The following is a critical assumption that we use throughout the paper.

Assumption 1. The random variables $(\varepsilon_j)_{1 \le j \le K}$, as defined above, are independent of each other and are distributed on $S$, with $E(\varepsilon_{cj}) = 0$ for $1 \le c \le C-1$. Further, the density of $\sqrt{n_j}\,\varepsilon_j$ converges uniformly to the density of $N_{C-1}(0, \Xi_\varepsilon)$.

Then, the proposed model can be written as
$$X_j \mid \varepsilon_j \sim \mathrm{Multinomial}\big(n_j;\ p_1 + \varepsilon_{1j}, \ldots, p_C + \varepsilon_{Cj}\big), \quad 1 \le j \le K, \qquad (2.2)$$
where the $(X_j)_{1 \le j \le K}$ are independent of each other and the $(\varepsilon_j)_{1 \le j \le K}$ satisfy Assumption 1.

Theorem 1. For the proposed model (eq. (2.2)), the following results are true if Assumption 1 is satisfied.
(a) The unconditional first and second order moments of $X_{cj}, X_{c'j}$ ($c, c' = 1, 2, \ldots, C$, $c \ne c'$) are given by
$$E(X_{cj}) = n_j p_c, \quad \mathrm{Var}(X_{cj}) = n_j p_c(1 - p_c) + n_j(n_j - 1)\mathrm{Var}(\varepsilon_{cj}),$$
$$\mathrm{Cov}(X_{cj}, X_{c'j}) = -n_j p_c p_{c'} + n_j(n_j - 1)\mathrm{Cov}(\varepsilon_{cj}, \varepsilon_{c'j}). \qquad (2.3)$$
(b) The unconditional first and second order moments of $Y_{cj}, Y_{c'j}$ ($c, c' = 1, 2, \ldots, C$, $c \ne c'$) follow by adding the corresponding batchwise terms of eq. (2.3) over $i = 1, \ldots, j$; in particular, $E(Y_{cj}) = N_j p_c$. (2.4)
(c) As $n_j \to \infty$, $\hat{p}_j = (\hat{p}_{1j}, \hat{p}_{2j}, \ldots, \hat{p}_{(C-1)j})^T = (X_{1j}/n_j, X_{2j}/n_j, \ldots, X_{(C-1)j}/n_j)^T$ satisfies the following:
$$\sqrt{n_j}\,(\hat{p}_j - \bar{p}) \xrightarrow{L} N_{C-1}(0, \Xi), \qquad (2.5)$$
where $\xrightarrow{L}$ denotes convergence in law, and $\Xi = \Xi_p + \Xi_\varepsilon$. Here, $\Xi_\varepsilon$ is positive definite with finite norm, and
$$\Xi_p = \mathrm{diag}(\bar{p}) - \bar{p}\,\bar{p}^T. \qquad (2.6)$$
Remark 1. In part (c) of Theorem 1, if the variances and covariances of $\varepsilon_{1j}, \ldots, \varepsilon_{(C-1)j}$ are $o(1/n_j)$, then as $n_j \to \infty$, $\sqrt{n_j}\,(\hat{p}_j - \bar{p}) \xrightarrow{L} N_{C-1}(0, \Xi_p)$. The above remark includes the scenario of independent and identically distributed multinomials and is therefore of special interest. The detailed proof of Theorem 1 is deferred to Appendix A. Note that the asymptotic variance is a function of $p$, thereby motivating us to adapt an appropriate adjustment in the form of a suitable variance stabilizing transformation. While the standard variance stabilizing transformation is given by the sine-inverse function (Anscombe (1948)), there are better variance stabilizing transformations, as discussed in Yu (2009). Based on that work, we define the following modified transformation, which has superior relative errors, skewness and kurtosis:
$$L_{cj} = \sin^{-1}\left(\frac{2(X_{cj} + a)}{n_j + 2a} - 1\right), \quad c = 1, 2, \ldots, C-1, \qquad (2.7)$$
where $a$ is a positive constant. Throughout this paper, we consider $a = 3/8$, one of the most popular choices in this regard, cf. Anscombe (1948) and Yu (2009).

Lemma 1. For every $1 \le j \le K$ in the framework of the proposed model (eq. (2.2)), under Assumption 1,
$$\sqrt{n_j}\,\big(L_j - g(p_1, p_2, \ldots, p_{C-1})\big) \xrightarrow{L} N_{C-1}(0, \Omega), \qquad (2.8)$$
where $L_j = (L_{1j}, \ldots, L_{(C-1)j})^T$ and $g$ is as defined in the proof below. In particular, if the variances and covariances of $\varepsilon_{1j}, \ldots, \varepsilon_{(C-1)j}$ are $o(1/n_j)$, then the limiting matrix $\Sigma$ is a variance-covariance matrix with diagonal entries equal to 1, and off-diagonal entries given by
$$\Sigma_{cc'} = -\sqrt{\frac{p_c\,p_{c'}}{(1 - p_c)(1 - p_{c'})}}.$$

Proof. For a function $g : \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ of the form $g(t_1, \ldots, t_{C-1}) = (g_1(t_1), \ldots, g_{C-1}(t_{C-1}))^T$, using eq. (2.5) and the multivariate delta theorem (see Cox (2005) and Ver Hoef (2012) for the theorem and related discussions), we get
$$\sqrt{n_j}\,\big(g(\hat{p}_{1j}, \hat{p}_{2j}, \ldots, \hat{p}_{(C-1)j}) - g(p_1, p_2, \ldots, p_{C-1})\big) \xrightarrow{L} N_{C-1}(0, \Omega), \qquad (2.9)$$
where the $(c, c')$-th entry of $\Omega$ is $g_c'(p_c)\,g_{c'}'(p_{c'})\,\Xi_{cc'}$. Using $g_1(t) = g_2(t) = \cdots = g_{C-1}(t) = \sin^{-1}\big((2t - 1)/(1 + 2a/n_j)\big)$ in the above, the required result follows. For the special case in the second part, the exact form of $\Xi_p$ from eq. (2.6) can be used.

Lemma 1 serves as a key component of our proposed method. Note that $g_1(t)$ is a monotonic function on $[0, 1]$, and therefore we can easily use the above result to make inference about $\bar{p}$, and subsequently about $(h(Y_l) \mid \mathcal{F}_j)$ for $l > j$. A quick numerical illustration of the transformation is given below.
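The following minimal sketch (ours, not from the paper) applies the transformation in eq. (2.7) to simulated binomial counts and checks numerically that the transformed values have approximately constant variance $1/n_j$ across different underlying probabilities; the helper name `vst` is hypothetical.

```python
import numpy as np

def vst(x, n, a=0.375):
    """Arcsine-type variance stabilizing transformation of eq. (2.7):
    maps counts x out of n trials to arcsin(2(x + a)/(n + 2a) - 1).
    With a = 3/8 (Anscombe (1948)), Var(vst) is roughly 1/n for all p."""
    return np.arcsin(2.0 * (np.asarray(x) + a) / (n + 2.0 * a) - 1.0)

# Numerical check: n * Var(L) stays close to 1 across different p.
rng = np.random.default_rng(0)
n = 5000
for p in (0.1, 0.3, 0.5):
    L = vst(rng.binomial(n, p, size=20000), n)
    print(p, round(n * L.var(), 3))
```

The near-constant value of $n \cdot \mathrm{Var}(L)$ across $p$ is precisely the property that allows the transformed data to enter the normal model below with a known scaling.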
In this paper, for implementation of the model, we adopt a Bayesian framework, which is advantageous from multiple perspectives. First, we find it to yield more realistic results in several real-life data contexts. Second, it helps in reducing the computational complexity, especially for a large dataset. Third, this approach has the flexibility to extend naturally to similar problems in the presence of covariates. Following eq. (2.8) and the notations discussed above, the proposed model (see eq. (2.2)) is equivalent, in an asymptotic sense, to the following:
$$L_j \mid \mu, \Sigma \sim N_{C-1}\big(\mu,\ \Sigma/(n_j + 0.5)\big), \quad 1 \le j \le K, \text{ independently}. \qquad (2.10)$$
Observe that both $\mu$ and $\Sigma$ are unknown parameters and are fixed across different batches. In the Bayesian framework, appropriate priors need to be assigned to these parameters to ensure "good" behaviour of the posterior distributions. To that end, we consider the following hierarchical structure of the prior distributions:
$$\mu \mid \Sigma_p \sim N_{C-1}(\alpha, \Sigma_p), \qquad \Sigma_p \sim \mathrm{IW}(\Psi_p, \nu_p), \qquad \Sigma \sim \mathrm{IW}(\Psi, \nu), \qquad (2.11)$$
where IW denotes the inverse Wishart distribution. The inverse Wishart prior is the most natural conjugate prior for a covariance matrix, cf. Chen (1979), Haff (1980), Barnard et al. (2000), Champion (2003). The conjugacy property facilitates amalgamation into Markov chain Monte Carlo (MCMC) methods based on Gibbs sampling. This leads to easier simulation from the posterior distribution, as well as from the posterior predictive distribution after each round. On this note, the inverse Wishart distribution, if imposed only on $\Sigma$ with $\Sigma_p$ known, restricts the flexibility in terms of modeling prior knowledge (Hsu et al. (2012)). To address this issue, we apply the above hierarchical modeling, as is most commonly used in Bayesian computations. This allows greater flexibility and stability than diffuse priors (Gelman (2006)). Studies by Kass et al. (2006), Gelman and Hill (2006) and Bouriga and Féron (2013) discuss in depth how hierarchical inverse Wishart priors can be used effectively for modeling variance-covariance matrices. Using the above prior specifications, we develop a Gibbs sampling algorithm to implement the proposed framework. The Gibbs sampling procedure is outlined in the next subsection. Before that, we present an important result which establishes the large sample property of the posterior means of $\mu$ and $\Sigma$.

Theorem 2. Let $\mu_0$ and $\Sigma_0$ be the true underlying mean and the true covariance matrix, respectively, and let $\mu_{PM}$ and $\Sigma_{PM}$ be the corresponding posterior means. Then, for the hierarchical prior distributions given by eq. (2.11), for every $\epsilon > 0$, the following are true:
$$\lim_{K \to \infty} P\big(\lVert \mu_{PM} - \mu_0 \rVert > \epsilon\big) = 0, \qquad (2.12)$$
$$\lim_{K \to \infty} P\big(\lVert \Sigma_{PM} - \Sigma_0 \rVert > \epsilon\big) = 0. \qquad (2.13)$$

In terms of implementation, it is usually complicated to find a closed form for the joint posterior distribution of the parameters. A common approach is to adopt Gibbs sampling, which reduces the computational burden significantly. It is an MCMC method for obtaining a sequence of realizations from a joint probability distribution, in which every parameter is updated iteratively using its conditional posterior distribution given the other parameters. We refer to Geman and Geman (1984) and Durbin and Koopman (2002) for more in-depth reading on Gibbs sampling. Recall that $\mathcal{F}_j$ denotes the sigma-field generated by the data up to the $j$-th instance. In addition, below, $\Gamma_k$ stands for the $k$-variate gamma function and $\mathrm{tr}(\cdot)$ for the trace of a matrix. For $j = 1$, we have $L_1 \mid \mu, \Sigma \sim N_{C-1}(\mu, \Sigma/(n_1 + 0.5))$. The joint posterior likelihood is therefore given by
$$\pi(\mu, \Sigma, \Sigma_p \mid L_1) \propto f(L_1 \mid \mu, \Sigma)\,\pi(\mu \mid \Sigma_p)\,\pi(\Sigma_p)\,\pi(\Sigma). \qquad (2.14)$$
In an exactly similar way, the posterior likelihood based on the data up to the $j$-th instance can be written as
$$\pi(\mu, \Sigma, \Sigma_p \mid \mathcal{F}_j) \propto \Big[\prod_{i=1}^{j} f(L_i \mid \mu, \Sigma)\Big]\,\pi(\mu \mid \Sigma_p)\,\pi(\Sigma_p)\,\pi(\Sigma). \qquad (2.15)$$
In the Gibbs sampler, we need the conditional posterior distributions of $\mu$, $\Sigma$ and $\Sigma_p$. Standard conjugacy arguments show that
$$\Sigma \mid \mu, \Sigma_p, \mathcal{F}_j \sim \mathrm{IW}\Big(\Psi + \sum_{i=1}^{j} (n_i + 0.5)(L_i - \mu)(L_i - \mu)^T,\ \nu + j\Big). \qquad (2.17)$$
In an identical fashion, it is possible to show that
$$\Sigma_p \mid \mu, \Sigma, \mathcal{F}_j \sim \mathrm{IW}\big(\Psi_p + (\mu - \alpha)(\mu - \alpha)^T,\ \nu_p + 1\big). \qquad (2.18)$$
For the conditional posterior distribution of $\mu$, comparing the posterior likelihood with the density of a multivariate normal distribution, and readjusting the terms as necessary, we can show that
$$\mu \mid \Sigma, \Sigma_p, \mathcal{F}_j \sim N_{C-1}(m_j, V_j), \quad V_j = \Big(\Sigma_p^{-1} + \textstyle\sum_{i=1}^{j}(n_i + 0.5)\,\Sigma^{-1}\Big)^{-1}, \quad m_j = V_j\Big(\Sigma_p^{-1}\alpha + \Sigma^{-1}\textstyle\sum_{i=1}^{j}(n_i + 0.5)\,L_i\Big). \qquad (2.20)$$
Existing theory on Bayesian inference ensures that if the above conditional posterior distributions are iterated many times, the chain converges to the true posterior distribution of the parameters. It is naturally of prime importance to make sure that convergence is achieved when we implement the algorithm and collect a posterior sample. In that regard, we use the Gelman-Rubin statistic, cf. Gelman et al. (1992). This is an efficient way to monitor the convergence of the Markov chains. Here, multiple parallel chains are initiated from different starting values. Following the standard theory of convergence, all of these chains eventually converge to the true posterior distribution and hence, after enough iterations, it should be impossible to distinguish between different chains. Leveraging this idea and applying an ANOVA-like technique, the Gelman-Rubin statistic compares the variation between the chains to the variation within the chains. Ideally, this statistic converges to 1, and a value close to 1 indicates convergence. In all our applications, we start multiple chains by randomly generating initial values for the parameters, and start collecting samples from the posterior distributions only when the value of the Gelman-Rubin statistic is below 1.1. A schematic implementation of the sampler is sketched below.
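The following sketch shows one possible Python implementation of this Gibbs sampler, iterating the conditional updates in eqs. (2.17), (2.18) and (2.20) as reconstructed above; the function name, initialization and iteration count are illustrative assumptions rather than prescriptions from the paper.

```python
import numpy as np
from scipy.stats import invwishart

def gibbs(L, n, alpha, Psi, nu, Psi_p, nu_p, iters=2000, seed=0):
    """Schematic Gibbs sampler for (mu, Sigma, Sigma_p) given the
    transformed batch data L (K x (C-1)) and batch sizes n (length K)."""
    rng = np.random.default_rng(seed)
    K, d = L.shape
    w = np.asarray(n) + 0.5              # per-batch precision weights
    mu = L.mean(axis=0)                  # arbitrary starting values
    Sigma, Sigma_p = np.eye(d), np.eye(d)
    draws = []
    for _ in range(iters):
        # mu | Sigma, Sigma_p, data: multivariate normal (eq. (2.20))
        Sp_inv, S_inv = np.linalg.inv(Sigma_p), np.linalg.inv(Sigma)
        V = np.linalg.inv(Sp_inv + w.sum() * S_inv)
        m = V @ (Sp_inv @ alpha + S_inv @ (w[:, None] * L).sum(axis=0))
        mu = rng.multivariate_normal(m, V)
        # Sigma | mu, data: inverse Wishart (eq. (2.17))
        R = L - mu
        Sigma = np.atleast_2d(invwishart.rvs(
            df=nu + K, scale=Psi + (w[:, None] * R).T @ R, random_state=rng))
        # Sigma_p | mu: inverse Wishart (eq. (2.18))
        d0 = (mu - alpha)[:, None]
        Sigma_p = np.atleast_2d(invwishart.rvs(
            df=nu_p + 1, scale=Psi_p + d0 @ d0.T, random_state=rng))
        draws.append((mu.copy(), Sigma.copy()))
    return draws
```

In practice, several such chains with different seeds would be run in parallel, with draws retained only after the Gelman-Rubin statistic drops below 1.1, as described above.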
We also take samples from iterations sufficiently far apart from each other, so as to ensure approximate independence of the realizations. In this section, we discuss the prediction procedure for the proposed model. Note that the posterior predictive distribution for the $l$-th instance based on the data up to the $j$-th instance ($l > j$) is, because of the conjugacy observed above, a normal distribution. The mean parameter of this distribution can be computed by the following equation:
$$E(L_l \mid \mathcal{F}_j) = E(\mu \mid \mathcal{F}_j). \qquad (2.21)$$
Similarly, using the property of conditional variance, the dispersion parameter of the posterior predictive distribution is
$$\mathrm{Var}(L_l \mid \mathcal{F}_j) = \frac{E(\Sigma \mid \mathcal{F}_j)}{n_l + 0.5} + \mathrm{Var}(\mu \mid \mathcal{F}_j). \qquad (2.22)$$
Once again, it is complicated to get closed form expressions for the above expectation and variance. Therefore, we make use of the Gibbs sampler to simulate realizations from the posterior predictive distribution. Based on the data up to the $j$-th instance, we generate $M$ samples (for large $M$) of the parameters from the posterior distributions; call them $\Sigma_{j,s}, \mu_{j,s}$ for $s = 1, 2, \ldots, M$. Then, we can approximate the mean of the posterior predictive distribution in eq. (2.21) by $\sum_{s=1}^{M} \mu_{j,s}/M$. Similarly, the first term on the right hand side of eq. (2.22) can be approximated by $\bar{\Sigma}_j/(n_l + 0.5)$, where $\bar{\Sigma}_j$ is the sample mean of the $\Sigma_{j,s}$, while the second term in that equation can be approximated by the sample dispersion matrix of the $\mu_{j,s}$. Next, we generate many samples (once again, say $M$) from the posterior predictive distribution with these estimated parameters. This sample from the posterior predictive distribution is then used to infer on various properties of $(h(Y_l) \mid \mathcal{F}_j)$, as discussed in Section 2.1. In this section, we consider a few toy examples and evaluate the effectiveness of our proposed methodology across various scenarios. We generate $K$ batches of multinomial data following eq. (2.2) for three different data generating processes (DGP). Each multinomial observation is assumed to have $C$ categories. The expected values of $n_j$ (the size of the $j$-th batch) are taken to be the same for all $j$; let $n$ denote this common expected value. We consider different values of $C$ (3 or 5), $K$ (25 or 50) and $n$ (100, 1000, 5000 or 50000) to understand the robustness of the method. Note that $C = 3$, $K = 25$, $n = 5000$ closely resembles the application to the election data (Section 4.1), while $C = 5$, $K = 50$, $n = 50000$ is similar to the real example of the sales data (Section 4.2). The DGPs we consider differ in the way the $\varepsilon_j$'s are generated. First, we take the simplest situation of $\varepsilon_{cj} = 0$ for all $c, j$, which corresponds to an iid collection of multinomial random variables. Second, $\varepsilon_j$ is simulated from a truncated multivariate normal distribution with mean $0$ and dispersion matrix $n_j^{-0.5} I$. Third, we impose dependence across the different coordinates of $\varepsilon_j$, and simulate it from a truncated multivariate normal distribution with mean $0$ and dispersion matrix $n_j^{-0.5} A$, where $A$ is a randomly generated positive definite matrix with finite norm. Observe that all three DGPs satisfy Assumption 1; a sketch of these simulation schemes is given below.
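The following sketch (ours) simulates data from the three DGPs. The paper fixes only the expected batch size and the truncation of $\varepsilon_j$ to $S$; drawing the batch sizes from a Poisson distribution and handling the truncation by resampling are our own implementation assumptions.

```python
import numpy as np

def simulate_dgp(K, C, n_exp, dgp, seed=0):
    """Simulate K batches of multinomial counts under DGP1-DGP3."""
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(np.ones(C))                 # fixed component
    A = np.eye(C - 1)
    if dgp == 3:                                  # random pos. def. matrix
        B = rng.normal(size=(C - 1, C - 1))
        A = B @ B.T + 0.1 * np.eye(C - 1)
    X = np.zeros((K, C), dtype=int)
    for j in range(K):
        nj = max(1, rng.poisson(n_exp))           # batch size, mean n_exp
        while True:                               # truncation by resampling
            eps = np.zeros(C)
            if dgp >= 2:                          # dispersion n_j^{-0.5} A
                e = rng.multivariate_normal(np.zeros(C - 1), A / np.sqrt(nj))
                eps = np.append(e, -e.sum())      # perturbations sum to 0
            pj = p + eps
            if pj.min() >= 0 and pj.max() <= 1:
                break
        X[j] = rng.multinomial(nj, pj / pj.sum())  # guard against fp drift
    return p, X
```

Setting dgp=1 reduces the scheme to the iid case, since the perturbations are identically zero.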
For every combination of $K$, $C$, $n$, each experiment is repeated many times, and we compute the average performance over these repetitions. In this simulation study, we explore two aspects in particular. First, we find out the accuracy of our methodology in terms of predicting the category with the maximum count at the end. Here, a decision is made when the maximal category is predicted with at least 99.5% probability, and when the predicted difference in counts between the top two categories is more than 5% of the remaining total counts. Second, we focus on the two above-mentioned specific cases which are similar to our real-life applications, and examine how well we can predict various properties of $h(Y_l)$, for different choices of $h(\cdot)$. In case of the first problem, for every iteration, the fixed values of $p$ are obtained from a Dirichlet distribution with equal parameters, to ensure a more general view of the performance of the proposed method. Detailed results are reported in Appendix B, in Tables B.1, B.2 and B.3. We note that the results are not too sensitive to the choices of $C$ and $K$, but the accuracy increases with $n$. Below, in Table 1, we look at the results for $C = 5$, $K = 50$ for the three different DGPs. It is evident that under the iid assumption (DGP1), the method makes a correct call almost every time. When the components of $\varepsilon_j$ are independent random variables (DGP2), a correct call is made more than 90% of the time for large enough samples ($n > 5000$). For the third DGP, where the components of $\varepsilon_j$ are simulated from a multivariate distribution with a general positive definite matrix as the covariance matrix, the accuracy of making a correct call drops to 70% for $n = 50000$. Another interesting observation is that the accuracy depends heavily on the average final margin between the top two categories. In other words, if the underlying probabilities of the top two categories are close (which directly relates to a lower margin of difference between the final counts), then the method achieves lower accuracy, and vice versa. We also observe that in those cases the method records no call (i.e. a decision cannot be made under the prescribed rules) more often than incorrect calls. Moving on to the particular choice of $C = 3$, $K = 25$, $n = 5000$, we aim to find out the effectiveness of the proposed approach in identifying the leading category for varying degrees of difference between the true cell probabilities of the two topmost categories. To that end, for $p = (p_1, p_2, p_3)$, without loss of generality, we assume $p_1 > p_2 > p_3$. Now, data are generated from the above three processes by fixing $p_1 - p_2 = \delta$, where $\delta \in \{0.01, 0.05, 0.1, 0.25\}$. For every choice of $\delta$ and for every DGP, we repeat the experiments and find the mean accuracy in predicting the top category. Additionally, we also find the average number of observations used by the method in making the prediction. These results are displayed in Table 2. It is evident that for DGP1 the method provides excellent predictions. Even when there is only a 1% difference between the probabilities of the top two categories, the method predicts the leading category correctly 96% of the time; for differences larger than 1%, it never fails to predict the leading category correctly. For DGP2, our method can predict the topmost category with more than 75% accuracy whenever the probabilities of the top two categories differ by at least 5%. This accuracy reaches nearly 90% for $\delta = 0.1$, and is 100% if the difference between those two cell probabilities is 0.25. For the third DGP, the prediction accuracy is about 70% for $\delta = 0.1$, and the accuracy improves steadily as $\delta$ increases.
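The calling rule used in these experiments (and later for the election data) operates on a posterior predictive sample of the final counts; a minimal sketch follows. The helper name and the use of medians of the predictive draws for the margin condition are our own implementation choices.

```python
import numpy as np

def call_race(Y_j, pred_final, conf=0.995, frac=0.05):
    """Decide whether to call the leading category, given current
    cumulative counts Y_j (length C) and an M x C posterior predictive
    sample pred_final of the final counts Y_K. Returns the called
    category index, or None if the race is too close to call."""
    M, C = pred_final.shape
    winners = np.bincount(pred_final.argmax(axis=1), minlength=C)
    top = winners.argmax()
    prob_top = winners[top] / M                    # P(category top wins)
    sorted_counts = np.sort(pred_final, axis=1)
    gap = sorted_counts[:, -1] - sorted_counts[:, -2]   # top-two margin
    remaining = pred_final.sum(axis=1) - np.sum(Y_j)    # counts still to come
    if prob_top >= conf and np.median(gap) > frac * np.median(remaining):
        return top
    return None
```

Repeating this check after each batch yields the round at which a call is first made, which is how the "average number of observations used" in Table 2 would be computed.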
In the context of a political election, one can say that the method will predict the winner with a high level of accuracy whenever there is a considerable difference between the $p_i$ values of the top two candidates. If that difference is minute, which corresponds to a very closely contested election, the accuracy will drop. Additionally, we observe that in the most extreme cases, about 20% of the time the method never declares a winner with the desired certainty, which is in line with what we should expect. Next, we focus on the second specific case of $C = 5$, $K = 50$, $n = 50000$, which is similar to forecasting the sales of different brands. For the different DGPs, the values of $p$ are generated randomly from a Dirichlet distribution, and we evaluate how well our method can predict the cumulative proportions of counts for the five categories. Experiments are repeated many times and we compute the overall root mean squared error (RMSE); see Table 3 for these results. Once again, for DGP1, the method achieves great accuracy very early. With only 15 rounds of data (approximately 30% of the total observations), the predictions are precise. At a similar stage, for DGP2, the RMSE is less than 1%, and it decreases steadily to fall below 0.5% by round 35. Finally, for the third DGP, an RMSE of less than 1% is also recorded after 35 rounds. These results show that our method can accurately predict the overall cumulative proportions of counts with about 35 rounds of data. Translating this to the sales forecasting problem, we hypothesize that the proposed method will be able to predict the annual market shares of different categories correctly at least 3 to 4 months before the year-end. We examine the effectiveness of our proposed methodology in the election calling context with real data from the Bihar Legislative Assembly election, held during October and November of 2020 for 243 constituencies (seats). Indian elections have a multi-party system, with a few coalitions of parties, formed before (sometimes after) the election, playing the major role in most contests. While these coalitions are at times temporary in nature and have no legal binding or legitimacy, typically the (pre-poll) alliance winning the majority (at least 50%) of the constituencies comes to power. In the 2020 Bihar election, the contest in most constituencies was largely limited to two such alliances: the National Democratic Alliance (NDA) and the Mahagathbandhan (MGB). The NDA comprised four parties (Janata Dal United, Bharatiya Janata Party, Hindustani Awam Morcha and Vikassheel Insaan Party), while the MGB comprised the Rashtriya Janata Dal, the Indian National Congress and the left parties. The Lok Janshakti Party (LJP), a traditional constituent member of the NDA, fought on its own in most constituencies; there were also a couple of other alliances, most notably the Grand Democratic Secular Front (GDSF). According to many political analysts, the LJP and the GDSF potentially had significant impact in several constituencies, even though they did not have a great chance of winning in most of them. The counting of votes began at 8am on November 10, 2020. Because of the Coronavirus pandemic, a greater number of polling booths were used in all the constituencies, and counting went on for many more rounds than usual. While election results typically become clear by noon, because of the close contests in several seats, in addition to the increased number of rounds, leads swung back and forth, and the final numbers could be declared only in the wee hours of the next day.
Eventually, the NDA won 125 seats, the MGB won 110 seats and the GDSF won 6 seats, while the LJP won 1 seat and the remaining seat was won by an independent candidate. The number of contesting candidates varied between 4 and 31, with the median (and mode) being 15. However, it was a two-way fight in almost all of the seats. The winning candidates got 43% of the votes on average, while the runners-up got 33.25% on average. However, there were constituencies where the winner got less than 25% of the votes, as well as constituencies where the runner-up got nearly 45% of the votes. The result of the election was downloaded from the official website of the Election Commission of India (2020). A brief summary of the data is provided in Table 4. All summary statistics in this table are calculated over the 243 constituencies in Bihar. (Our analysis ignores the postal votes, which are few in number and do not affect the results.) Note that the number of votes cast in a constituency varies between 119159 and 225767, with the average being 172480. The votes are counted in 25 to 51 rounds across the state, with the mode being 32. While a little over 5000 votes are counted in most rounds, some rounds, especially the last few, involve considerably fewer votes. The last three rows in the table provide the minimum, maximum and average numbers of votes counted per round over all the constituencies. We also point out that the final margins between the winner and the runner-up in these constituencies show a wide range: the lowest is recorded in Hilsa (13 votes), whereas the maximum is recorded in Balrampur (53078 votes). In the following analysis, we focus on the two dominant categories (essentially the two main alliances in every constituency), and the rest of the alliances or parties are collated into a third category. As our main objective is to predict the winner and the margin of victory, it is sensible to consider this three-category setup. The two primary categories here mostly turn out to be the NDA and the MGB. We also make the logical assumption that each vote is cast independently, thereby ensuring that in each round of counting we have an independent multinomial distribution. However, there is no information on how the counting happens sequentially in each round, and therefore we assume the probabilities of the three categories in different counting rounds to be random variables satisfying Assumption 1. Consequently, this is an exciting application of the proposed modeling framework. Recall the hierarchical prior distributions for the parameters in the model (eq. (2.11)). For $\alpha$, we use the previous election's data, namely the proportions of votes received by the corresponding alliances (scaled, if needed, to have $\sum_i \alpha_i = 1$). For $\Psi$ and $\Psi_p$, we use identity matrices of appropriate order. Both $\nu$ and $\nu_p$ are taken to be 5. We point out that the results are not too sensitive to these choices. Next, using these priors, we implement our method and estimate the probability of winning for every alliance after the counting of votes in each round. The corresponding margin of victory is also estimated in the process. Based on these values, we propose the following rules to call a race. First, we require at least 50% of the votes to be counted in order to make a prediction; otherwise it is considered too early to call.
On the other hand, akin to the procedure laid out by Lapinski et al. (2020), our decision is to call the race in a particular constituency when we are at least 99.5% confident of the winner, and when the predicted winning margin is at least 5% of the remaining votes. Unless these conditions are met, the race is termed too close to call. We start with a summary of the results obtained after fitting the model to the election data; complete results are provided in Table B.4 in Appendix B. In 227 out of the 243 constituencies (approximately 93.4%), the proposed method calls the race correctly. Our method considers the race always too close to call in Bakhri and Barbigha, where the eventual winning margins were 439 and 238 votes respectively. For 14 other constituencies (approximately 5.7%), the method calls the race incorrectly, as the winning candidate was significantly behind at the time of calling in those constituencies. We next take a detailed look at the prediction patterns for all constituencies, in different aspects, through Figure 1. We plot the remaining vote percentage at the time of calling against the final margin of victory for all the constituencies. It is interesting to observe that all red points, which correspond to incorrect calls, appear on the left hand side of the plot. Thus, our method performs perfectly for all constituencies but one where the final margins of victory are above 10000. In fact, the proposed method calls the races very early for all such constituencies. The only exception is the constituency Baisi, which is discussed in more detail below. Along a similar line, the cluster of points at the bottom left of the figure implies that the closely contested constituencies require more votes to be counted before the race can be called. We also note that the sizes of the points, which are proportional to the total numbers of votes in the constituencies, are spread across the whole plot. This tells us that the overall size of a constituency has a negligible effect on the performance of the method. To further explore the above point, the accuracy and the average percentage of votes counted before making a call, broken down by the final margin of victory, are displayed in Table 5. There are 52 constituencies which observed a very closely fought election (final margin less than 5000 votes). In 41 of them, our method correctly predicts the winner, and in 2 of them it cannot make a call. On average, around 70% of the votes are counted in these constituencies before we can make a call. Interestingly, this drops drastically for the other constituencies. On average, only about 60% of the votes are counted before we make a call in the constituencies where the eventual margin is between 5000 and 10000; in 4 out of 34 such constituencies, though, our method predicts inaccurately. For the constituencies with higher eventual margins (greater than 10000), the prediction turns out to be correct in 156 out of 157 constituencies. In these cases, only about 50 to 55 percent of the votes are counted on average before making a call. Finally, the last row of the table corroborates the earlier observation that the size of the constituency does not have a considerable effect on the prediction accuracy. Not only does the proposed approach predict the winner of an election, it also predicts the final margin of victory, along with a prediction interval. In Figure 2, we present the true margins and the predicted winning margins for all the 227 constituencies where correct calls are made. The corresponding prediction intervals are also shown in the same plot.
It can be observed that the predicted margins closely resemble the true margins. Overall, it is evident that the method performs really well in terms of calling an election race based only on the counting of votes in different rounds. It is imperative to point out that better accuracy can be achieved with more detailed information about the constituencies. In particular, information about the sequence in which the counting is carried out, and about the ethnic composition or socio-economic status of people in different areas, would be needed to build a more accurate procedure. Our proposed method works with only the multinomial assumption and achieves more than 93% accuracy in calling the race, with the requirement of at least 50% of the votes being counted. To that end, in order to gain more insight about the accuracy of the method, we rerun the experiment with different values of the minimum counting requirement. The numbers of constituencies with correct calls, incorrect calls and no calls are displayed against the minimum requirement in Figure 3. We see that the method achieves 90% accuracy (216 out of 243) even when only 30% of the votes are counted. Next, we look into the robustness of the method by focusing on four particular constituencies in more detail. These are Hilsa (minimum final margin), Balrampur (maximum final margin), Baisi (final margin of more than 10000, but our proposed methodology makes a wrong call with nearly 55% of the votes counted) and Barbigha (final margin of 238; the proposed method never calls the race in anyone's favour). Now, for each constituency, we create synthetic data by randomly switching the order of the individual rounds of counting, and implement our method on the resulting data. This exercise helps us understand whether the predictions would have been different if the data were available in a different order. In particular, it points towards objectively judging to what extent the success or failure of the proposed methodology on this real data can be attributed to luck. Note that the proposed modeling framework assumes independence across different rounds, and hence the method should perform similarly if the votes are actually counted in a random order. In Table 6 below, the results of these experiments are presented. We see that for Balrampur, even with randomly permuted rounds, the method calls the race correctly every time. For Baisi, the method yields a correct forecast 99.97% of the time, although with the original order of the data it makes an incorrect call. Meanwhile, Hilsa and Barbigha observed very closely contested elections. For these two constituencies, in the experiment with permuted data, the method records too-close-to-call decisions more commonly, 72.55% and 70.53% of the time respectively. These results suggest that, quite possibly, the sequence of round-wise counting of votes is not random. Thus, if more information were available about the individual rounds from which votes are counted, it would be possible to modify the current method to achieve greater accuracy. 4.2. Predicting the sales of different competing brands in a product category. As a second illustration, we implement our methodology in the sales forecasting context. In general, companies show a great deal of interest in such prediction exercises, even though some are reluctant to disclose the data in public.
Working with various datasets, we observe a great deal of similarity: the ranks of the competing brands seldom change, and hence are naturally of little or no interest (this is well established as such, and our method also infers it correctly very early from the data). On the contrary, the inference on market shares is of great value. On that note, our method generally works very well when there is relatively little fluctuation in the percentage market shares; we can label these 'easy problems'. In this section, we discuss one particular example where the company permitted us to disclose the real names and figures, and where the fluctuations in weekly market shares are higher than usual. Yara Fertilisers India Pvt. Ltd (Yara India, in short) sells various crop nutrition products all across India through different parties. Below, we focus on the NPK (nitrogen, phosphorus, and potassium) fertilizer sales in 2020. There are six competing products: YMC, DS191919, YMW, DS114211, DS140529 and DS050935 (arranged according to their popularity). YMC stands for YaraMila Complex, which is the undisputed leader in this market. YMW stands for YaraMila Winner, while the other four products are from Deltaspray (DS). All of the products are sold in different sizes, but for uniformity, throughout this section, all sales are converted to represent the number of 1 kg packets. For every week, using the counts of sales, one can compute the market shares of the different brands. As already noted, the primary objective in this type of problem is to predict the annual market share within an acceptable target accuracy and with the desired prediction confidence, based on the weekly sales numbers. Here, we follow the rule that we make a call when the predicted annual market share, with 99.5% probability, shows a margin of error of at most ±0.5%. In addition, we wait for at least 5 weeks (approximately 10% of the total) before considering the forecast. Next, it can also be of great value to predict, with high probability, whether a particular brand will achieve a targeted market share for the whole year. These problems are well within the scope of the proposed modeling framework, and are discussed in detail below. We assume that each buying customer chooses a brand independently of the other customers, thereby implying that the collective sales of the six brands in the $j$-th week can be treated as a multinomial random variable with parameters $(n_j, p_{1j}, \ldots, p_{6j})$. Here, $n_j$ is the total sales in the $j$-th week, for $j = 1, 2, \ldots, 53$. This is in line with our model defined in Section 2.2. The cell probabilities denote the market shares of the individual brands. While carrying out the aforementioned analysis, we use a flat prior for $\mu$, i.e. $\alpha_i$ is constant for all $i$ (eq. (2.11)). Table 7 displays the results of the first problem, where we predict the annual market share with the desired certainty and margin of error. It can be observed that only five weeks of data are sufficient to make a good prediction for the three brands lower down the order. For the top three brands, though, an acceptable call is made only towards the end of the year. For example, in the case of YMC, under the aforementioned criteria, our method predicts the annual market share with great accuracy only two weeks before the year-end. For DS191919 and YMW, the predictions are made about a month before the year-end. Let us focus a bit more on the top three brands.
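Both the weekly-updated annual-share forecasts and the target-share probabilities examined below can be read off the posterior predictive sample of the year-end counts. A minimal sketch follows, assuming an M x C array pred_final of predictive draws of the year-end cumulative sales is available from the Gibbs output of Section 2.4; the helper names are ours.

```python
import numpy as np

def annual_share_summary(pred_final, i, level=0.995):
    """Point forecast and credible interval for brand i's annual market
    share, from an M x C predictive sample of year-end cumulative sales."""
    shares = pred_final[:, i] / pred_final.sum(axis=1)
    lo, hi = np.quantile(shares, [(1 - level) / 2, (1 + level) / 2])
    return shares.mean(), (lo, hi)

def target_share_prob(pred_final, i, pi0):
    """Estimated P(brand i's annual share >= pi0), i.e. the posterior
    predictive mean of h_{2,i}(Y_K) = I(Y_iK >= pi0 * N)."""
    shares = pred_final[:, i] / pred_final.sum(axis=1)
    return float((shares >= pi0).mean())
```

Under the rule above, a call would be made once the half-width of the interval returned by annual_share_summary falls below 0.5%.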
In Figure 4 , we see the predicted market shares and the corresponding credible intervals, updated every week. While an acceptable call is made towards the end of the year, we can see that the predictions are actually very close (within 2%) to the true numbers from around the 20th week. Thus, one can rely on the predictions made after five months, which corroborates the earlier findings in the simulation studies. It establishes the efficacy of the proposed method in this context. To further support this, as a quantitative measure of accuracy, RMSE in predicting the year-end market share for all the brands are calculated. From Table 8 , we can see that the RMSE is consistently low in all cases. It is around or less than 2% at the 30th week and it starts to fall below 1% after the 40th week, which is approximately 9 months of data. As a second exercise, we use our method to provide probabilistic forecasts on whether YMC would be able to reach 80% market share in the whole year. Refer to Figure 5 . It is evident that our method would have predicted it with high probability from around 15th week that YMC would be able to meet the targeted market share of 80%. Because of high variation in the weekly market shares, the predicted probability drops a bit in the middle of the year, but it is always above 50% and is consistently above 90% from 40th week onward. Clearly, the methodology can be used to provide valuable insight in this aspect of sales forecasting as well. In this work, we consider a hierarchical Bayes model for batches of multinomial data with cell probabilities changing randomly across batches. The goal is to predict various properties for aggregate data correctly and as early as possible. We illustrate the effectiveness of the methodology in sales forecasting and poll prediction problem which motivated this study. The application in the sales forecasting context is seen to work with great precision. The performance of the methodology in the poll prediction context is good as well, especially considering the limited information. We plan to implement the method in future elections. A potential future direction in this specific context would be to improve the method when more information on the constituencies or the rounds of counting are available. It is likely that additional information on relevant covariates would improve forecast accuracy. It is possible to modify the proposed methodology to incorporate the covariates, but that is a more challenging and interesting problem and requires full treatment. We defer it for a future work. The proposed model accommodates randomness in the multinomial cell probabilities, depending on the batch. This is pragmatic given the actual data pattern in most practical situations, as otherwise early calls are made which end up being often wrong. This is also intuitively justifiable as, for example, different rounds of votes can have differential probabilities for the candidates, potentially owing to different locations or the changes in people's behaviour. Similarly, relative popularity of brands may fluctuate from one week to another because of advertisements or other external influences. This paper aims to tackle these situations appropriately. It is worth mention that a common practice in Bayesian analysis of sequences of multinomial data is to apply Dirichlet priors for the cell probabilities. This approach requires the iid assumption and hence do not work under the modeling framework of this study. 
In fact, if we assume iid behaviour for the multinomial data across the batches and use that model for the examples discussed above, then the performance is much worse than what we achieve. Further, we want to point out that even without the variance stabilizing transformation, the proposed Bayesian method, in conjunction with the results from Theorem 1, can be used in similar problems. The prediction accuracy of that model is comparable to our approach; we employ the transformation for its superior prediction performance, albeit marginal in some cases, across all scenarios. Finally, note that in some contexts it may be unrealistic to assume that the values of the future $n_j$'s are known. In that case, the decision can be taken by anticipating them to be at the level of the average of the $n_j$'s observed so far. It is intuitively clear from Section 2.4 that the effectiveness of the methodology will not alter to any appreciable extent as long as the sample sizes are large; this can also be backed up by relevant simulation studies.

Proof of Theorem 1. Following eq. (2.2), $E(X_{lj} \mid \varepsilon_j) = n_j(p_l + \varepsilon_{lj})$. Then, using Assumption 1 and applying the relationship between total and conditional expectation,
$$E(X_{lj}) = E\big(E(X_{lj} \mid \varepsilon_j)\big) = n_j p_l. \qquad (A.1)$$
We also know that $\mathrm{Var}(X_{lj} \mid \varepsilon_j) = n_j(p_l + \varepsilon_{lj})(1 - p_l - \varepsilon_{lj})$. Then, we can use the relationship $\mathrm{Var}(X_{lj}) = \mathrm{Var}\big(E(X_{lj} \mid \varepsilon_j)\big) + E\big(\mathrm{Var}(X_{lj} \mid \varepsilon_j)\big)$ to get the following:
$$\mathrm{Var}(X_{lj}) = \mathrm{Var}\big(n_j(p_l + \varepsilon_{lj})\big) + E\big[n_j(p_l + \varepsilon_{lj})(1 - p_l - \varepsilon_{lj})\big] = n_j p_l(1 - p_l) + n_j(n_j - 1)\mathrm{Var}(\varepsilon_{lj}). \qquad (A.2)$$
Finally, note that the covariance of $X_{1j}$ and $X_{2j}$, conditional on $\varepsilon_j$, is equal to $-n_j(p_1 + \varepsilon_{1j})(p_2 + \varepsilon_{2j})$. One can use this and apply a similar technique as before to complete the proof of part (a). Part (b) follows from (a), using the fact that the multinomial data in different batches are independent of each other. To prove part (c), we use the multinomial central limit theorem, which states that if $(D_1, D_2, \ldots, D_C) \sim \mathrm{Multinomial}(n, \pi_1, \pi_2, \ldots, \pi_C)$, then
$$\sqrt{n}\,\big((D_1/n, \ldots, D_{C-1}/n)^T - \bar{\pi}\big) \xrightarrow{L} N_{C-1}\big(0,\ \mathrm{diag}(\bar{\pi}) - \bar{\pi}\bar{\pi}^T\big), \quad \bar{\pi} = (\pi_1, \ldots, \pi_{C-1})^T. \qquad (A.3)$$
Let us define $V_{\varepsilon_j}$ to be the covariance matrix of $\sqrt{n_j}\,(\hat{p}_{1j}, \ldots, \hat{p}_{(C-1)j})^T$, conditional on $\varepsilon_j$. Then we have $V_{\varepsilon_j} = \mathrm{diag}(\bar{p} + \varepsilon_j) - (\bar{p} + \varepsilon_j)(\bar{p} + \varepsilon_j)^T$. Now, under Assumption 1 and from eq. (A.3), as $n_j \to \infty$, conditionally on $\varepsilon_j$,
$$V_{\varepsilon_j}^{-1/2}\,\sqrt{n_j}\,(\hat{p}_j - \bar{p} - \varepsilon_j) \xrightarrow{L} N_{C-1}(0, I). \qquad (A.4)$$
Using the assumptions in part (c) of Theorem 1, as $n_j \to \infty$, $V_{\varepsilon_j}$ converges in probability to $\Xi_p$. Hence, by an application of Slutsky's theorem, eq. (A.4) is equivalent to:
$$\sqrt{n_j}\,(\hat{p}_j - \bar{p} - \varepsilon_j) \xrightarrow{L} N_{C-1}(0, \Xi_p), \quad \text{conditionally on } \varepsilon_j. \qquad (A.5)$$
Let $f_\varepsilon(\cdot)$ denote the density function of $\sqrt{n_j}\,\varepsilon_j$, $\phi(\cdot, \Xi_\varepsilon)$ the density function of $N_{C-1}(0, \Xi_\varepsilon)$, and $\Phi(\cdot, \Xi_\varepsilon)$ the distribution function of $N_{C-1}(0, \Xi_\varepsilon)$. Note that the previous convergence in eq. (A.5) is conditional on $\varepsilon_j$, as $n_j \to \infty$. To find the unconditional distribution of $\hat{p}_j - \bar{p}$, we look at the following, for $z \in \mathbb{R}^{C-1}$:
$$P\big(\sqrt{n_j}(\hat{p}_j - \bar{p}) \le z\big) = \int P\big(\sqrt{n_j}(\hat{p}_j - \bar{p} - \varepsilon_j) \le z - u \,\big|\, \sqrt{n_j}\,\varepsilon_j = u\big)\, f_\varepsilon(u)\, du. \qquad (A.6)$$
From the assumptions in part (c) of Theorem 1, $\sqrt{n_j}\,\varepsilon_j$ converges in distribution to $N_{C-1}(0, \Xi_\varepsilon)$. Then, the integrand above converges pointwise to $\Phi(z - u, \Xi_p)\,\phi(u, \Xi_\varepsilon)$. (A.7) A simple application of the dominated convergence theorem then implies, as $n_j \to \infty$,
$$P\big(\sqrt{n_j}(\hat{p}_j - \bar{p}) \le z\big) \to \int \Phi(z - u, \Xi_p)\,\phi(u, \Xi_\varepsilon)\, du. \qquad (A.8)$$
Note that the term on the right of eq. (A.8) is the distribution function of $N_{C-1}(0, \Xi_p + \Xi_\varepsilon)$ at the point $z \in \mathbb{R}^{C-1}$. Hence we arrive at the required result, eq. (2.5).

Proof of Theorem 2. Recall that the sigma-field generated by the collection $\{L_1, L_2, \ldots, L_K\}$ is $\mathcal{F}_K$, and the posterior means of $\mu$ and $\Sigma$, conditional on $\mathcal{F}_K$, are $\mu_{PM}$ and $\Sigma_{PM}$ respectively. Then, eq. (2.12) and eq. (2.13) can be equivalently stated as:
$$E\big[I\big(\lVert \mu_{PM} - \mu_0 \rVert > \epsilon\big)\big] \to 0, \qquad (A.9)$$
$$E\big[I\big(\lVert \Sigma_{PM} - \Sigma_0 \rVert > \epsilon\big)\big] \to 0, \qquad (A.10)$$
as $K \to \infty$. We begin with the term
$$I\big(\lVert \mu_{PM} - \mu_0 \rVert > \epsilon\big), \qquad (A.11)$$
where the posterior mean $\mu_{PM}$ involves an expectation taken with respect to $\Sigma$ as well.
Since eq. (A.11) is either 1 or 0 depending on whether $\lVert \mu_{PM} - \mu_0 \rVert > \epsilon$ or not, on taking expectation over $\mathcal{F}_K$, we obtain
$$E\big[I\big(\lVert \mu_{PM} - \mu_0 \rVert > \epsilon\big)\big] = P\big(\lVert \mu_{PM} - \mu_0 \rVert > \epsilon\big). \qquad (A.12)$$
To show that eq. (A.12) goes to 0 as $K \to \infty$, we decompose $\mu_{PM} - \mu_0$ into two terms, one involving the prior mean and one involving the data average (eq. (A.13)). Applying the identity $(A + B)^{-1} = A^{-1} - A^{-1}(A^{-1} + B^{-1})^{-1}A^{-1}$ to the first term of the above, which goes to 0 almost everywhere by the properties of the inverse Wishart distribution and the assumptions on the prior parameters, we have that $E\big[(\Sigma_p^{-1} + N\Sigma^{-1})^{-1}\Sigma_p^{-1}(\alpha - \mu_0)\big]$ goes to 0, and hence we can argue that the contribution of the prior mean term vanishes in the limit (eq. (A.15)). For the second term in eq. (A.13), first note that $\sum_{i=1}^{K} n_i (L_i - \mu_0)$ is equivalent, in the distributional sense, to $\sqrt{N}\,Z$, where $Z \sim N(0, \Sigma_0)$. On the other hand, the corresponding scaling matrix satisfies eq. (A.16), where $C_K = O(1/N)$. Using the above results, we can bound the probability that the second term exceeds a fixed threshold in norm. In light of the fact that $\Sigma_p$ is finite and that $C_K = O(1/N)$, it is easy to see that this probability goes to 0. This, along with eq. (A.15), completes the proof that eq. (A.12) goes to 0 as $K \to \infty$. In order to prove eq. (A.10), we follow a similar idea as above. Note that $\Psi$ is a constant positive definite matrix and $\nu$ is a finite constant. Following eq. (2.17), for large $K$, straightforward calculations yield an expression for $\Sigma_{PM}$ (eq. (A.18)). Further, using the true value $\mu_0$ inside the term on the right hand side above, we get eq. (A.19), and subsequently eq. (A.20). Using similar arguments as in eq. (A.16), we can show that $E\big[N(\mu - \mu_0)(\mu - \mu_0)^T \mid \mathcal{F}_K\big]$ is bounded, and therefore the second term goes to 0. Next, taking a cue from the previous part regarding the consistency of $\mu$, it is also easy to see that for large $K$, $E\big[n_i(L_i - \mu_0)(\mu - \mu_0)^T \mid \mathcal{F}_K\big]$ is small enough. Hence, the third term in eq. (A.20) also goes to 0. Finally, for the first term in that inequality, observe that $\sqrt{n_i}(L_i - \mu_0) \sim N_{C-1}(0, \Sigma_0)$ for all $i$. Thus, $W = \sum_{i=1}^{K} n_i (L_i - \mu_0)(L_i - \mu_0)^T \sim \mathrm{Wishart}(\Sigma_0, K)$, and the law of large numbers implies that $W/K - \Sigma_0 \to 0$ in probability. Consequently, we get that $P(\lVert \Sigma_{PM} - \Sigma_0 \rVert > \epsilon) \to 0$, and that completes the proof.

Table B.1: Accuracy (in %) of predicting the category with maximum count (corresponding final margins, averaged over all repetitions, are given in parentheses) for different values of C, K, n, when data are generated from DGP1.

References
Anscombe (1948). The transformation of Poisson, binomial and negative-binomial data.
Anuta et al. (2017). Election bias: comparing polls and Twitter in the 2016 US election.
Barnard et al. (2000). Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage.
Bouriga and Féron (2013). Estimation of covariance matrices based on hierarchical inverse-Wishart priors.
Champion (2003). Empirical Bayesian estimation of normal variances and covariances.
Chen (1979). Bayesian inference for a normal dispersion matrix and its application to stochastic multiple regression analysis.
Cox (2005). Delta method. In Encyclopedia of Biostatistics, vol. 2.
Durbin and Koopman (2002). A simple and efficient simulation smoother for state space time series analysis.
Election Commission of India (2020). Data from Bihar Legislative Election 2020.
Fernández and Williams (2010). Closed-form expression for the Poisson-binomial probability density function.
Gelman and Hill (2006). Data analysis using regression and multilevel/hierarchical models.
Gelman et al. (1992). Inference from iterative simulation using multiple sequences.
Gelman (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper).
Geman and Geman (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.
Haff (1980). Empirical Bayes estimation of the multivariate normal covariance matrix.
Haq et al. (2020). A survey on computational politics.
Hong et al. (2009). Prediction of remaining life of power transformers based on left truncated and right censored lifetime data.
Hsu et al. (2012). Bayesian estimation of a covariance matrix with flexible prior specification.
Kass et al. (2006). A default conjugate prior for variance components in generalized linear mixed models (comment on article by Browne and Draper).

Table B.4: Detailed results from all constituencies in Bihar. For each constituency, the table reports the true data (Constituency, Rounds, Final margin) and the situation at the time of calling the race (Round, Votes left (%), Lead, Predicted margin, Decision).