key: cord-0780275-bzng1kcy
authors: Jammalamadaka, S. Rao; Bapat, Sudeep R.
title: Middle censoring in the multinomial distribution with applications
date: 2020-08-27
journal: Stat Probab Lett
DOI: 10.1016/j.spl.2020.108916
sha: 39a18846cd13f0b71d3ffd608afdc5bbeb1bb2e7
doc_id: 780275
cord_uid: bzng1kcy

In a multinomial set-up with $k$ possible outcomes, we develop estimation under the “middle censoring” paradigm of Jammalamadaka and Mangalam (2003). This problem has many special features because of the inter-dependent probabilities, which we explore here.

In this paper we discuss a "middle-censoring" scheme when the data come from a multinomial experiment. Middle censoring occurs when the actual value of a data point is not observed but is known to fall inside a specific interval. In our multinomial setup, some individuals choose exactly one of the $k$ possible categories, whereas others choose intervals covering several categories. Well-known censoring schemes such as right- and left-censoring can be seen as special cases of middle censoring, obtained by picking suitable censoring intervals.

Considerable ground has been covered on middle censoring problems over the last decade and a half. Jammalamadaka and Mangalam (2003) develop self-consistent and non-parametric maximum likelihood estimators (MLEs) of the unknown cumulative distribution function (CDF) for middle-censored data. Jammalamadaka and Iyer (2004) establish approximate self-consistency for middle-censored data. Iyer et al. (2008) consider a parametric middle censoring scheme using exponential lifetime data. Davarzani and Parsian (2011) discuss middle censoring in a discrete setup by taking observations from a geometric distribution. More recent references include Jammalamadaka and Leong (2015), who discuss a middle censoring scheme for geometric random variables in the presence of covariates, and Ahmadi et al.
(2017), who consider middle censoring in the context of competing risks.

An outline of the paper is as follows. In Section 2 we develop the likelihood function for middle-censored data from a multinomial model, in the most general setup. However, because of the complicated dependencies between the multinomial probabilities as well as the observed frequencies, explicit expressions for the MLEs of the individual probabilities and their large-sample variances are not easy to obtain in such a general setup, and may have to be computed numerically. To illustrate these ideas, we consider three scenarios covering the middle censoring scheme: one where just one interval is allowed, a second where there are 2 non-overlapping intervals, and a third that allows 2 intervals that overlap. Section 3 develops a Bayesian framework for estimating the required probabilities. The final Section 4 contains bootstrap estimates and variances of the unknown probability vector. This section also provides a simulation analysis comparing the Bayes estimates with the estimates obtained from the other methods proposed here. We also present an example using real data, in the form of ratings given by a group of students for their experience in using a widely used software for online meetings.

Consumers are constantly asked to rate products that they buy on a website like Amazon. Similarly, in market research, a company which plans to launch a new product wants to gauge the user response in terms of the preference ratings or "star ratings" the product gets, as part of a pilot study. Assume that the company contacts $n$ individuals, each of whom is asked to rate the product in terms of $\{1, 2, \ldots, k\}$ stars, according to his/her liking for the product. Let $f_j$ stand for the number (frequency) of people giving $j$ stars.
If we denote the true probabilities of giving $1, 2, \ldots, k$ stars by $p_1, p_2, \ldots, p_k$ respectively, we have the standard multinomial scheme with $\sum_{j=1}^{k} f_j = n$ and $\sum_{j=1}^{k} p_j = 1$, which is a classical and well-studied problem.

Alternatively, assume that some of these $n$ individuals hedge their bets and assign an "interval rating" to the item. To illustrate, suppose that a given number $f_{12}$ of people are undecided between the ratings 1 and 2, so that their rating falls in the interval $[1, 2]$ comprising both ratings. Such an individual would give either 1 or 2 stars but is not convinced of one particular rating between the two. This is what we shall refer to as an "interval rating" from now on. Given this new additional category, say with probability $p_{12}$, we now have $p_{12} + \sum_{j=1}^{k} p_j = 1$ and the total frequency $f_{12} + \sum_{j=1}^{k} f_j = n$. We are interested in determining how the estimated probabilities for each individual category change if the scheme also allows such interval ratings. In other words, we wish to find the estimated probabilities $\hat p_1, \hat p_2, \ldots, \hat p_k$ under this new scheme.

The maximum likelihood estimates and their properties, such as asymptotic variances, have been considered extensively in the literature for the standard multinomial setup. One may, for instance, refer to Alam (1979) or Kunte and Upadhya (1996), where the authors discuss both the MLEs and the uniformly minimum variance unbiased estimators (UMVUEs) under the classical multinomial setup.

First we consider the likelihood function under our general multinomial scheme which allows interval ratings, for which we introduce some notation. Let $I$ represent an interval (say, $j_1$ to $j_2$) of categories/scores, with corresponding probability $P_I = \sum_{j \in I} p_j$ for this interval.
When such interval scores are allowed, suppose that out of the $n$ individuals, $m$ ($\le n$) of them provide interval ratings that belong to the intervals $\{I_j;\ j = 1, 2, \ldots, m\}$, with $r$ of these intervals being distinct. The remaining $(n - m)$ individuals provide specific single ratings, of which let us say there are $k$. Then the probabilities satisfy

$$\sum_{i=1}^{k} p_i + \sum_{j=1}^{r} P_{I_j} = 1. \qquad (1.1)$$

Further assume that the frequency in the interval $I_j$ is $F_j$ and the frequencies in the $k$ individual categories are $f_1, \ldots, f_k$, so that

$$\sum_{i=1}^{k} f_i + \sum_{j=1}^{r} F_j = n. \qquad (1.2)$$

Then the likelihood for the vector $p$ given $m$, $n - m$, $\{f_i\}$, and $\{F_j\}$ is given by

$$L(p) \propto \prod_{i=1}^{k} p_i^{f_i} \prod_{j=1}^{r} P_{I_j}^{F_j},$$

subject to the conditions (1.1) and (1.2), with the corresponding log-likelihood (up to an additive constant)

$$\log L(p) = \sum_{i=1}^{k} f_i \log p_i + \sum_{j=1}^{r} F_j \log P_{I_j}.$$

This likelihood is similar to that in Iyer et al. (2008) or Eqn. (1) in Jammalamadaka and Leong (2015), except for the additional restrictions imposed by the conditions (1.1) and (1.2) due to the dependence among the categories and their frequencies. Estimation of the individual $p_i$'s, which is our main goal, becomes even more cumbersome when some of the intervals overlap. In such cases, analytical solutions may not be possible, but one can obtain estimates through numerical methods. To illustrate these ideas, we develop three successively more complex scenarios, labelled Cases 1, 2, and 3, and show how they can be handled. The following sections introduce the corresponding likelihood functions for these three cases, provide estimators for $p \equiv (p_1, p_2, \ldots, p_k)$ and discuss their asymptotic variances in each case.

We now propose three interesting scenarios with increasing levels of complexity and provide appropriate MLEs for the probability vector $p$. First, in "Case 1", we start by assuming that the individuals are allowed just one pre-specified interval rating besides the singleton ratings. Similarly, "Case 2" assumes that two such "non-overlapping" interval ratings are allowed besides the singleton ratings, whereas "Case 3" assumes that two such "overlapping" interval ratings are possible.
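All three cases share one log-likelihood structure: each singleton count $f_l$ contributes $f_l \log p_l$ and each interval count $F_j$ contributes $F_j \log P_{I_j}$, with the singleton and interval probabilities together summing to one. The following is a minimal sketch of how that log-likelihood could be evaluated and maximized numerically, including for overlapping intervals. The function names, the 0-based inclusive `(lo, hi)` interval encoding, and the example counts are illustrative assumptions. The EM-style fixed-point iteration is not taken from the paper; it is swapped in here as one common way to maximize a likelihood of this form.

```python
import math


def loglik(p, f, intervals, F):
    """Log-likelihood up to an additive constant.

    p         : singleton probabilities p_1..p_k
    f         : singleton counts f_1..f_k
    intervals : list of inclusive 0-based (lo, hi) index pairs
    F         : interval counts, one per interval
    """
    ll = sum(f_l * math.log(p_l) for p_l, f_l in zip(p, f) if f_l > 0)
    for (lo, hi), F_j in zip(intervals, F):
        if F_j > 0:
            ll += F_j * math.log(sum(p[lo:hi + 1]))
    return ll


def fit_numerically(f, intervals, F, iters=500):
    """EM-style iteration (an assumption of this sketch, not the paper's method).

    Writing c[l] = 1 + (number of intervals containing category l), the
    constraint sum(p) + sum(P_I) = 1 becomes sum(c[l] * p[l]) = 1, so
    q[l] = c[l] * p[l] lies on the probability simplex.
    """
    k = len(f)
    n = sum(f) + sum(F)
    c = [1 + sum(lo <= l <= hi for lo, hi in intervals) for l in range(k)]
    q = [1.0 / k] * k
    for _ in range(iters):
        expected = [float(f_l) for f_l in f]
        for (lo, hi), F_j in zip(intervals, F):
            if F_j == 0:
                continue
            # Split each interval count among its categories in proportion
            # to the current singleton probabilities q[l] / c[l].
            weights = [q[l] / c[l] for l in range(lo, hi + 1)]
            total = sum(weights)
            for offset, w in enumerate(weights):
                expected[lo + offset] += F_j * w / total
        q = [e / n for e in expected]
    return [q[l] / c[l] for l in range(k)]


# Hypothetical counts for k = 5 with overlapping ratings [1, 3] and [2, 4].
f_demo = [10, 12, 8, 15, 20]
F_demo = [15, 20]
overlapping = [(0, 2), (1, 3)]
p_fit = fit_numerically(f_demo, overlapping, F_demo)
print([round(x, 4) for x in p_fit], round(loglik(p_fit, f_demo, overlapping, F_demo), 3))
```

Because the log-likelihood is a sum of logarithms of linear functions of $p$ and the constraint is linear, the problem is concave, so any general-purpose constrained optimizer could be used in place of the iteration above.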
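More general scenarios are possible, and follow similar ideas.

Assume that we only have a single "interval rating", namely $[i, j]$, with $f_{ij}$ individuals opting for it. Clearly the probability of any individual giving that rating is $p_{ij} = p_i + p_{i+1} + \cdots + p_j$. Given that this is an additional category allowed in the multinomial scheme, we further have

$$p_{ij} + \sum_{l=1}^{k} p_l = 1. \qquad (2.1)$$

The likelihood function is then proportional to

$$L \propto \Big(\prod_{l=1}^{k} p_l^{f_l}\Big)\, p_{ij}^{f_{ij}}, \qquad (2.2)$$

with log-likelihood, up to an additive constant,

$$\log L = \sum_{l=1}^{k} f_l \log p_l + f_{ij} \log(p_i + p_{i+1} + \cdots + p_j). \qquad (2.3)$$

To obtain the MLEs, one needs to solve the simultaneous equations obtained by setting the partial derivatives of $\log L$ to zero subject to (2.1), which lead to the following MLEs:

$$\hat p_l = \frac{f_l}{2n}\Big(1 + \frac{f_{ij}}{\sum_{t=i}^{j} f_t}\Big), \quad l = i, (i+1), \ldots, j, \qquad (2.4)$$

$$\hat p_l = \frac{f_l}{n}, \quad l \notin [i, j]. \qquad (2.5)$$

Remark 1. If $f_{ij} = 0$, i.e. no one opts for the interval rating even after being given that choice, the MLEs for the $p_l$ in this interval shrink accordingly and reduce to $\hat p_l = f_l/2n$. This is justified in view of Equation (2.1).

Remark 2. If all the individual frequencies in the interval $[i, j]$ are zero except for one category, say only $f_i \neq 0$, then $\hat p_i = (f_i + f_{ij})/2n$, i.e. the $i$th category gets all the added benefit of the interval frequency $f_{ij}$. This is in agreement with Proposition 1 of Jammalamadaka and Mangalam (2003).

Now assume that we allow two disjoint "interval ratings", namely $[i_1, j_1]$ and $[i_2, j_2]$, with corresponding observed frequencies $f_{i_1 j_1}$ and $f_{i_2 j_2}$ and respective probabilities $p_{i_1 j_1}$ and $p_{i_2 j_2}$. The forms of $p_{i_1 j_1}$ and $p_{i_2 j_2}$ are similar to those given in Section 2.1. We further have

$$p_{i_1 j_1} + p_{i_2 j_2} + \sum_{l=1}^{k} p_l = 1.$$

The likelihood function will then be

$$L \propto \Big(\prod_{l=1}^{k} p_l^{f_l}\Big)\, p_{i_1 j_1}^{f_{i_1 j_1}}\, p_{i_2 j_2}^{f_{i_2 j_2}},$$

where $1 \le i_1 < j_1 < i_2 < j_2 \le k$. The log-likelihood function is clearly, up to an additive constant,

$$\log L = \sum_{l=1}^{k} f_l \log p_l + f_{i_1 j_1} \log p_{i_1 j_1} + f_{i_2 j_2} \log p_{i_2 j_2}.$$

To obtain the MLEs, one again solves the corresponding simultaneous likelihood equations, which lead to the following MLEs:

$$\hat p_{l_1} = \frac{f_{l_1}}{2n}\Big(1 + \frac{f_{i_1 j_1}}{\sum_{t=i_1}^{j_1} f_t}\Big), \qquad (2.8)$$

$$\hat p_{l_2} = \frac{f_{l_2}}{2n}\Big(1 + \frac{f_{i_2 j_2}}{\sum_{t=i_2}^{j_2} f_t}\Big), \qquad (2.9)$$

where $l_1 = i_1, (i_1 + 1), \ldots, j_1$ and $l_2 = i_2, (i_2 + 1), \ldots, j_2$, with $\hat p_l = f_l/n$ for the categories outside both intervals. Again, if $f_{i_1 j_1} = f_{i_2 j_2} = 0$, i.e.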
no individual opts for either of these interval ratings even after being given the option, then the MLEs become $\hat p_{l_1} = f_{l_1}/2n$ and $\hat p_{l_2} = f_{l_2}/2n$, where $l_1$, $l_2$ belong to the intervals given above.

Next we consider the large-sample variances of the MLEs given above. Let $I(\theta)$ denote the Fisher information matrix for the parameter vector $\theta = (\theta_1, \ldots, \theta_N)$. Then one can obtain the large-sample variances as $V(\hat\theta_i) = [I^{-1}(\theta)]_{ii}$, $i = 1, \ldots, N$. Deriving these asymptotic variances for a general scheme is not straightforward, and we provide derivations for some special cases in Appendices A.1 and A.2 (found in the supplement), corresponding to Cases 1 and 2 respectively.

If we now allow two overlapping "interval ratings", say $[i_1, j_1]$ and $[i_2, j_2]$ with an overlap of $[i_2, j_1]$, the likelihood function can be written along the same lines as in Case 2 and is given by

$$L \propto \Big(\prod_{l=1}^{k} p_l^{f_l}\Big)\,(p_{i_1} + \cdots + p_{j_1})^{f_{i_1 j_1}}\,(p_{i_2} + \cdots + p_{j_2})^{f_{i_2 j_2}}, \qquad (2.10)$$

where $1 \le i_1 < i_2 \le j_1 < j_2 \le k$. The log-likelihood function is clearly, up to an additive constant,

$$\log L = \sum_{l=1}^{k} f_l \log p_l + f_{i_1 j_1} \log(p_{i_1} + \cdots + p_{j_1}) + f_{i_2 j_2} \log(p_{i_2} + \cdots + p_{j_2}). \qquad (2.11)$$

However, because of this overlap, finding even the MLEs, let alone their asymptotic variances, becomes very cumbersome and easy analytical solutions do not exist. They can nevertheless be obtained numerically, as we demonstrate in Section 4.

Bayes estimation in a multinomial setup has been discussed by several authors; see e.g. Lehmann and Casella (1998) or Ferrie and Blume-Kohout (2016). In this section we adopt a Bayesian framework to estimate the unknown probability vector. Using the notation from Section 1.2, the unknown probability vector is $p \equiv (P_{I_1}, P_{I_2}, \ldots, P_{I_r}, p_1, p_2, \ldots, p_k)$. We now assume a prior distribution for $p$. A natural choice is the conjugate prior, namely the Dirichlet distribution. The setup is as follows: let $X = (X_1, X_2, \ldots, X_n)$ denote the choices of the $n$ individuals, with each $X_i$ taking either an interval or a specific score.
Hence we can assume $p \sim \mathrm{Dir}(\alpha)$, where $\mathrm{Dir}(\alpha)$ stands for a Dirichlet distribution with parameter vector $\alpha = (\alpha^*_1, \ldots, \alpha^*_r, \alpha_1, \alpha_2, \ldots, \alpha_k)$, and $\alpha^*_j$ ($> 0$) is the prior parameter on $P_{I_j}$, $j = 1, 2, \ldots, r$. The Dirichlet density is then given as

$$\pi(p) = \frac{\Gamma\big(\sum_{j=1}^{r} \alpha^*_j + \sum_{i=1}^{k} \alpha_i\big)}{\prod_{j=1}^{r} \Gamma(\alpha^*_j) \prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{j=1}^{r} P_{I_j}^{\alpha^*_j - 1} \prod_{i=1}^{k} p_i^{\alpha_i - 1}.$$

The likelihood function $L(\text{data} \mid p)$ is exactly the general likelihood of Section 1.2. Hence the posterior density is given by

$$\pi(p \mid \text{data}) = \frac{L(\text{data} \mid p)\, \pi(p)}{\int_{p} L(\text{data} \mid p)\, \pi(p)\, dp}. \qquad (3.2)$$

The numerator of (3.2) can be written as

$$L(\text{data} \mid p)\, \pi(p) \propto \prod_{j=1}^{r} P_{I_j}^{F_j + \alpha^*_j - 1} \prod_{i=1}^{k} p_i^{f_i + \alpha_i - 1}, \qquad (3.3)$$

from which the posterior density can be easily written down.

We now illustrate the ideas in a simple special case, namely $k = 5$ and $p = (p_1, p_2, p_3, p_4, p_5, p_{12})$, where there is a single "interval rating", viz. $[1, 2]$, with a frequency of $f_{12}$ ($> 0$) and a probability of $p_{12} = p_1 + p_2$. Hence also $p_{12} + \sum_{i=1}^{5} p_i = 1$. We now build the likelihood function along with the appropriate posterior distribution for $p$. As in previous sections, the likelihood function can be written as

$$L(\text{data} \mid p) \propto p_1^{f_1} p_2^{f_2} p_3^{f_3} p_4^{f_4} p_5^{f_5} (p_1 + p_2)^{f_{12}}.$$

Further, similar to (3.3),

$$L(\text{data} \mid p)\, \pi(p) \propto \Big(\prod_{i=1}^{5} p_i^{f_i + \alpha_i - 1}\Big) (p_1 + p_2)^{f^*}, \qquad (3.5)$$

where $f^* = f_{12} + \alpha^*_1 - 1$, which is $> 0$ since we assume $f_{12} \ge 1$ and $\alpha^*_1 > 0$. The expression in (3.5) can be thought of as a Dirichlet distribution with a different set of parameters. Further,

$$\int_{p} L(\text{data} \mid p)\, \pi(p)\, dp = \Gamma(f^* + 1) \cdots \qquad (3.6)$$

Combining (3.5) and (3.6) gives the posterior density, from which one can obtain the Bayes estimate of $p$. Under squared error loss, it is given by the mean of this posterior. In particular, the Bayes estimator $\hat p_i$ of $p_i$ is the posterior mean $E(p_i \mid \text{data})$.

4 Parametric Bootstrapping and Real Data Analysis

As an alternative to finding the MLEs and their asymptotic variances, which as we have seen gets complicated quickly, one might adopt a parametric bootstrap to get the estimates and the variances of the estimates for the 3 cases discussed in Sections 2.1, 2.2 and 2.3. These are obtained by first using the relative frequencies as the initial probabilities, and then bootstrapping (simulating) a large number of independent samples.
As an illustration, and as a demonstration that they provide similar results, we first present results of such bootstrapping for "Case-1", alongside the results for our Bayesian setup. Results for "Case-2" and "Case-3" can be derived similarly (see the Remark at the end of Section 4.1).

We consider the case given in Section 2.1, where we allow a single "interval rating". Let $i = 1$, $j = 2$ and $k = 5$. The likelihood and log-likelihood functions are as given in (2.2) and (2.3), and in particular take the following forms:

$$L \propto p_1^{f_1} p_2^{f_2} p_3^{f_3} p_4^{f_4} p_5^{f_5} (p_1 + p_2)^{f_{12}}$$

and

$$\log L = f_1 \log p_1 + f_2 \log p_2 + f_3 \log p_3 + f_4 \log p_4 + f_5 \log p_5 + f_{12} \log(p_1 + p_2),$$

where $p_5 = 1 - 2p_1 - 2p_2 - p_3 - p_4$ and $f_{12}$ denotes the frequency of the only "interval rating".

We first fix an observed vector of frequencies and assume it to come from a multinomial distribution with parameter vector $p = [p_1, p_2, p_3, p_4, p_5, p_{12}]$. We intentionally fix $f_{12}$ to be higher than both $f_1$ and $f_2$, since it is reasonable to assume that in any practical scenario, when given the option, more people will opt for an interval rating than for a single number. We then calculate the estimated probabilities using (2.4) and (2.5), which take the following forms:

$$\hat p_l = \frac{f_l}{2n}\Big(1 + \frac{f_{12}}{f_1 + f_2}\Big) \ \text{ for } l = 1, 2, \qquad \hat p_i = \frac{f_i}{n} \ \text{ for } i = 3, 4, 5.$$

Now assuming these estimates are the actual probabilities ($p \equiv \hat p$), we bootstrap a large number of samples (say $R = 10^3$) from a $\mathrm{Mult}(\hat p)$ distribution, and recalculate the probability estimates using (2.4) and (2.5). We then observe the pattern of the estimates by looking at their mean and standard errors over the $R$ bootstraps. Let these be denoted by $\hat p_R$ and $s(\hat p_R)$ respectively. The expressions for the asymptotic variances of $\hat p$ are derived in Appendix A.1 (found in the supplement). We also provide the Bayes estimates, using Dirichlet priors for the given data sets.
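The bootstrap procedure just described can be sketched as follows for $k = 5$ and the single interval rating $[1, 2]$. This is a minimal illustration under stated assumptions: the closed-form estimator is re-derived here from the log-likelihood and the constraint $p_5 = 1 - 2p_1 - 2p_2 - p_3 - p_4$, namely $\hat p_l = \frac{f_l}{2n}\big(1 + \frac{f_{12}}{f_1 + f_2}\big)$ for $l = 1, 2$ and $\hat p_i = f_i/n$ for $i = 3, 4, 5$, and should be checked against Eqns. (2.4) and (2.5). The function names and the observed counts are hypothetical and are not the frequencies behind Tables 1 and 2.

```python
import random
import statistics


def case1_mle(f, f12):
    """Estimates for k = 5 with one interval rating [1, 2].

    f   : singleton counts (f1, ..., f5)
    f12 : count of the interval rating [1, 2]
    Assumed constraint: p12 + p1 + ... + p5 = 1 with p12 = p1 + p2.
    """
    n = sum(f) + f12
    s = f[0] + f[1]
    if s == 0:
        raise ValueError("f1 + f2 must be positive to separate p1 and p2")
    scale = (1 + f12 / s) / (2 * n)
    return [f[0] * scale, f[1] * scale, f[2] / n, f[3] / n, f[4] / n]


def parametric_bootstrap(f, f12, R=1000, seed=0):
    """Resample from Mult(p_hat) and re-estimate R times."""
    rng = random.Random(seed)
    n = sum(f) + f12
    p_hat = case1_mle(f, f12)
    # Six cells: the five singleton ratings plus the interval rating.
    cells = p_hat + [p_hat[0] + p_hat[1]]
    replicates = []
    for _ in range(R):
        draws = rng.choices(range(6), weights=cells, k=n)
        counts = [draws.count(c) for c in range(6)]
        if counts[0] + counts[1] == 0:
            continue  # p1 and p2 cannot be separated in this resample
        replicates.append(case1_mle(counts[:5], counts[5]))
    means = [statistics.mean(col) for col in zip(*replicates)]
    std_errs = [statistics.stdev(col) for col in zip(*replicates)]
    return p_hat, means, std_errs


# Hypothetical observed frequencies with f12 larger than both f1 and f2.
f_obs, f12_obs = [10, 15, 20, 25, 10], 20
p_hat, boot_mean, boot_se = parametric_bootstrap(f_obs, f12_obs)
for name, est, m, se in zip(["p1", "p2", "p3", "p4", "p5"], p_hat, boot_mean, boot_se):
    print(f"{name}: MLE={est:.4f}  bootstrap mean={m:.4f}  s.e.={se:.4f}")
```

Fixing the seed makes the run reproducible. The spread of the replicates across the $R$ resamples plays the role of $s(\hat p_R)$.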
Tables 1 and 2 outline the results for two different values of $n$, for a given set of observed frequencies. For our Bayesian setup, as our first instance we fix $\alpha = (2, 3, 1, 2, 4, 4)$, whereas as a second instance we fix $\alpha = (4, 6, 1, 1, 2, 4)$.

Remark. The asymptotic variances of the estimated probabilities for "Case-2" are derived in Appendix A.2 (found in the supplement). Further, the likelihood and log-likelihood functions for "Case-3" can be written down along the lines of (2.10) and (2.11). However, the estimated probabilities do not have closed-form analytical solutions and require either numerical maximization or parametric bootstrapping, as demonstrated in this section for "Case 1".

We now present a real data example which demonstrates the applicability of the estimators discussed here. As the entire world suffers from the current COVID-19 pandemic, there has been an ever-increasing demand for software that lets a group conduct online meetings and live sessions. Many competitors have cropped up in the market. The following survey was conducted recently at the University of California, Santa Barbara by one of the authors, and asked a class of students about their overall experience with one such widely used software. They could give ratings of 1 through 5 (5 being the highest rating), along with a couple of "interval ratings", namely $[1, 2]$ and $[4, 5]$. Data were collected from a group of $n = 90$ students from the class. This falls in the paradigm of our current problem, in particular "Case-2" of Section 2.2, and the likelihood and log-likelihood functions take the following forms:

$$L = p_1^{f_1} p_2^{f_2} p_3^{f_3} p_4^{f_4} p_5^{f_5} (p_1 + p_2)^{f_{12}} (p_4 + p_5)^{f_{45}}$$

and

$$\log L = f_1 \log p_1 + \cdots + f_5 \log p_5 + f_{12} \log(p_1 + p_2) + f_{45} \log(p_4 + p_5),$$

where $p_3 = 1 - 2p_1 - 2p_2 - 2p_4 - 2p_5$. We then obtain the estimated probabilities (MLEs) from (2.8) and (2.9). Table 3 outlines results for these students' observed frequencies.
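As a hedged illustration of the "Case-2" computation for ratings 1 through 5 with interval ratings $[1, 2]$ and $[4, 5]$, the sketch below evaluates the log-likelihood above and a closed-form estimator. The closed form is re-derived here from that log-likelihood under the constraint $p_3 = 1 - 2p_1 - 2p_2 - 2p_4 - 2p_5$ and should be checked against Eqns. (2.8) and (2.9). The function names are assumptions, and the counts are hypothetical (they sum to 100, not to the survey's $n = 90$) and are not the data of Table 3.

```python
import math


def disjoint_interval_mle(f, intervals, F):
    """Estimates when the interval ratings do not overlap.

    f         : singleton counts f_1..f_k
    intervals : inclusive 0-based (lo, hi) index pairs, assumed disjoint
    F         : interval counts, one per interval
    """
    n = sum(f) + sum(F)
    # Categories outside every interval keep the usual relative frequency.
    p = [f_l / n for f_l in f]
    for (lo, hi), F_j in zip(intervals, F):
        s = sum(f[lo:hi + 1])
        if s == 0:
            raise ValueError("need a positive singleton count inside each interval")
        for l in range(lo, hi + 1):
            p[l] = (f[l] / (2 * n)) * (1 + F_j / s)
    return p


def log_likelihood(p, f, intervals, F):
    """Log-likelihood up to an additive constant."""
    ll = sum(f_l * math.log(p_l) for p_l, f_l in zip(p, f) if f_l > 0)
    for (lo, hi), F_j in zip(intervals, F):
        if F_j > 0:
            ll += F_j * math.log(sum(p[lo:hi + 1]))
    return ll


# Hypothetical counts (f1..f5) and interval counts (f12, f45).
f_obs = [8, 12, 20, 15, 10]
F_obs = [15, 20]
ratings = [(0, 1), (3, 4)]  # [1, 2] and [4, 5] in 0-based indexing

p_hat = disjoint_interval_mle(f_obs, ratings, F_obs)
print([round(x, 4) for x in p_hat])
print(round(log_likelihood(p_hat, f_obs, ratings, F_obs), 3))
```

A quick sanity check on any such estimate is that it satisfies the constraint on $p_3$ and that moving to a nearby feasible point lowers the log-likelihood.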
Note that all these results are from a single run (a single question in the survey).

In this paper we develop a middle-censoring scheme under a multinomial setup which allows outcomes to fall within intervals besides individual categories. Although the general framework has been presented for estimating the individual probabilities, analytical solutions quickly become onerous and numerical solutions may be needed. To illustrate the ideas, we consider special cases and demonstrate how the maximum likelihood estimators work out, and how estimation proceeds under a Bayesian setup. Also provided are the asymptotic variances of the estimates of the multinomial probability vector in these cases. A parametric bootstrap has been suggested for getting the estimates and their variances when the MLEs get complicated. A real data analysis is carried out illustrating the results derived.

References
Ahmadi et al. (2017). Statistical analysis of middle-censored competing risks data with exponential distribution.
Alam (1979). Estimation of Multinomial Probabilities.
Davarzani and Parsian (2011). Statistical inference for discrete middle-censored data.
Ferrie and Blume-Kohout (2016). Bayes estimator for multinomial parameters and Bhattacharyya distances.
Iyer et al. (2008). Analysis of middle censored data with exponential lifetime distributions.
Jammalamadaka and Iyer (2004). Approximate self consistency for middle censored data.
Jammalamadaka and Leong (2015). Analysis of discrete lifetime data under middle-censoring and in the presence of covariates.
Jammalamadaka and Mangalam (2003). Non-parametric estimation for middle-censored data.
Kunte and Upadhya (1996). Estimating Multinomial Probabilities.
Lehmann and Casella (1998). The Theory of Point Estimation.

Acknowledgements: We wish to thank Professor Yonathan Arbel of the University of Alabama Law School, who presented one of the authors with this problem, and the Referee for a careful reading and constructive suggestions.