key: cord-0076193-bshe4ybd
authors: Khruachalee, Krisada; Bodhisuwan, Winai; Volodin, Andrei
title: On the Partial-Geometric Distribution: Properties and Applications
date: 2022-03-28
journal: Lobachevskii J Math
DOI: 10.1134/s1995080222010103
sha: 40766726d4895ebb467294523191c5e03444c71a
doc_id: 76193
cord_uid: bshe4ybd

In this article we introduce a new two-parameter partial-geometric (PG) distribution that contains both the geometric and first success distributions as particular cases. Some probability and statistical properties of the proposed distribution are discussed, including the probability mass function, mean, variance, moment generating function, and probability generating function. We propose the method of maximum likelihood for estimating the model's parameters, and apply the PG distribution to two real datasets to illustrate the flexibility of the proposed distribution. We found that the PG distribution is more flexible than the geometric distribution in the sense that it can also be applied to under-dispersed data. The PG distribution also performs well under a goodness of fit test and some other model selection criteria for these two datasets. Thus, the PG distribution can be applied as an alternative model for the analysis of discrete data.

There is much interest in developing flexible probability distributions; many generalized classes of distributions have been developed and applied to describe various natural phenomena [1]. To explain a natural phenomenon, researchers construct a new generalized class of distributions and decide whether the underlying distribution should be regarded as discrete, continuous, or of a mixed type. Discrete distributions are very useful in many applications, especially when the observed counts are non-negative integers.
This happens when the number of occurrences of a discrete event is observed and examined in a specific area or period of time [2]. Examples include the number of trips per month that a person takes, the number of children a couple has, and the number of Prussian soldiers killed by horse kicks (a famous classical example related to the Poisson distribution); see [3], [4], [5]. In these situations, a continuous model is inappropriate for describing the count phenomenon. Accordingly, discrete models are as significant as continuous models.

Many researchers are interested in finding suitable probability distributions for situations in which a number of trials or experiments must occur before a predetermined number of successes is reached. Examples are the number of bills that must be proposed to a legislature before 10 bills are passed and, more recently, the number of Thai citizens who must be tested for COVID-19 before 100 are confirmed to be infected. In addition, problems of comparing the probabilities of success in Bernoulli trials, which mostly arise in medical and biological investigations, acceptance sampling in quality control, and modeling demand for a product, are considered. These real-life phenomena can be described by the geometric distribution. However, the geometric and first success distributions are sometimes treated as the same distribution when they are actually different. Because this confusion plays a very important role in this study, we introduce a new family of distributions called the partial-geometric (PG) distribution that contains both the geometric and first success distributions as particular cases.
The idea of combining the geometric and first success distributions as members of one distribution family seems very natural, but we have not seen it in the literature. The number of studies that propose modifications of the geometric distribution for various purposes is so large that we decided not to discuss them. The major advantage of the newly proposed PG distribution over the previously modified geometric distributions is its flexibility in applications to real-life data.

The remainder of the paper is organized as follows. In Section 2, we discuss the difference between the geometric and first success distributions. The PG distribution and some of its mathematical properties are defined in Section 3. Then, the maximum likelihood estimates of the PG distribution parameters are discussed in Section 3.3. Finally, some practical applications of the proposed distribution are illustrated by a goodness of fit with two real datasets in Section 4.

Based on the theoretical interpretation of the Bernoulli experiment, we note a confusion between two very simple and basic distributions, the geometric and the first success distributions. In some literature, these two distributions are considered the same. But, as mentioned in Gut (2009) [6], they are different and defined in the following way. Let $0 < p < 1$ and $q = 1 - p$. A random variable $X$ has a geometric distribution with parameter $p$, denoted by $X \sim G(p)$, if its probability mass function (pmf) is

$$P(X = k) = p q^{k}, \quad k = 0, 1, 2, \ldots,$$

and a random variable $Y$ has a first success distribution with parameter $p$, denoted by $Y \sim FS(p)$, if its pmf is

$$P(Y = k) = p q^{k-1}, \quad k = 1, 2, \ldots.$$

We can interpret the geometric distribution as the number of failures in Bernoulli experiments before the first success, while the first success distribution is the number of trials in Bernoulli experiments needed to reach the first success.

Referring to the properties of the probability generating function (pgf), the pgf of the geometric distribution $G(p)$ can be calculated in the following way:

$$g_X(t) = E(t^{X}) = \sum_{k=0}^{\infty} t^{k} p q^{k} = p \sum_{k=0}^{\infty} (qt)^{k} = \frac{p}{1 - qt},$$

where $|t| < 1/q$.
In the same manner, the pgf of the first success distribution $FS(p)$ can be calculated as

$$g_Y(t) = E(t^{Y}) = \sum_{k=1}^{\infty} t^{k} p q^{k-1} = pt \sum_{k=1}^{\infty} (qt)^{k-1} = \frac{pt}{1 - qt},$$

where again $|t| < 1/q$. Based on the pgfs of the geometric and first success distributions, we present Proposition 1, which illustrates the connection between these two distributions.

Proposition 1. If $X \sim G(p)$, then $X + 1 \sim FS(p)$. Conversely, if $Y \sim FS(p)$, then $Y - 1 \sim G(p)$.

Proof. If a random variable $X \sim G(p)$, then the pgf of the random variable $Y = X + 1$ can be written as

$$g_Y(t) = E(t^{X+1}) = t\,E(t^{X}) = \frac{pt}{1 - qt},$$

which is the pgf of $FS(p)$. The second part of the proposition can be shown in a similar way.

Adding an extra parameter $\alpha$ to this construction leads us to propose the PG distribution. A random variable $X$ has a partial-geometric distribution with parameters $0 < p < 1$ and $0 < \alpha \le p/q$, denoted by $X \sim PG(p, \alpha)$, if its pmf is

$$P(X = k) = \begin{cases} 1 - \dfrac{\alpha q}{p}, & k = 0, \\ \alpha q^{k}, & k = 1, 2, \ldots, \end{cases}$$

where $q = 1 - p$. The choice $\alpha = p$ gives the geometric distribution $G(p)$, and the choice $\alpha = p/q$ gives the first success distribution $FS(p)$.

In order to illustrate the appearance of the PG distribution, Figs. 1 and 2 show some pmf plots of the PG distribution for various values of the parameters $p$ (0.25, 0.50 and 0.75) and $\alpha$ ($p/(3q)$, $p/(2q)$, $2p/(3q)$ and $9p/(10q)$), where $q = 1 - p$. The scale of the PG distribution changes with the parameter $p$, whereas its shape varies with the parameter $\alpha$. The pmf decreases rapidly as $p$ increases. In addition, the pmf is clearly unimodal as $\alpha$ increases toward $p/q$. According to Figs. 1 and 2, we conclude that the PG distribution is right skewed and unimodal.

To measure the dispersion of the PG distribution, the index of dispersion (ID), the ratio of the variance to the mean, is computed for the parameter values of $p$ and $\alpha$ used in Figs. 1 and 2. The value of the ID indicates whether the distribution is over-dispersed (ID > 1) or under-dispersed (ID < 1) [7]. Table 1 illustrates that the PG distribution is more flexible than the geometric distribution in the sense that it can also be applied to under-dispersed data, whereas the geometric distribution is only suitable for over-dispersed data.
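The dispersion behaviour described above can be explored numerically. The following is a minimal Python sketch, not the authors' code. It assumes the PG pmf has the form P(X = 0) = 1 − αq/p and P(X = k) = αq^k for k ≥ 1 with 0 < α ≤ p/q, a form inferred here from the moment formulas of the paper and from the requirement that α = p and α = p/q recover the geometric and first success distributions. The function names `pg_pmf` and `pg_moments` and the truncation point `kmax` are illustrative choices.

```python
def pg_pmf(k, p, alpha):
    """Assumed PG(p, alpha) pmf: P(0) = 1 - alpha*q/p, P(k) = alpha*q**k for k >= 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in (0, 1)")
    q = 1.0 - p
    # Small relative tolerance so that alpha = p/q (first success case) is accepted.
    if not 0.0 < alpha <= (p / q) * (1.0 + 1e-12):
        raise ValueError("alpha must satisfy 0 < alpha <= p/q")
    if k < 0:
        return 0.0
    if k == 0:
        return max(0.0, 1.0 - alpha * q / p)
    return alpha * q ** k


def pg_moments(p, alpha, kmax=2000):
    """Mean, variance and index of dispersion by truncated summation over 0..kmax."""
    probs = [pg_pmf(k, p, alpha) for k in range(kmax + 1)]
    mean = sum(k * w for k, w in enumerate(probs))
    second = sum(k * k * w for k, w in enumerate(probs))
    var = second - mean ** 2
    return mean, var, var / mean


if __name__ == "__main__":
    # Parameter grid taken from the description of Figs. 1 and 2.
    for p in (0.25, 0.50, 0.75):
        q = 1.0 - p
        for label, alpha in (("p/3q", p / (3 * q)), ("p/2q", p / (2 * q)),
                             ("2p/3q", 2 * p / (3 * q)), ("9p/10q", 9 * p / (10 * q))):
            mean, var, idx = pg_moments(p, alpha)
            kind = "under" if idx < 1 else "over"
            print(f"p={p:.2f} alpha={label:>6}: mean={mean:.4f} var={var:.4f} "
                  f"ID={idx:.4f} ({kind}-dispersed)")
```

Under this assumed pmf, ID < 1 requires α > 2p, which is only compatible with α ≤ p/q when p > 1/2. The sketch makes it easy to check which parameter combinations fall on either side of that boundary.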
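Some probability properties of the PG distribution, in particular the mean, variance, moment generating function (mgf), and pgf, are provided in this section.

If $X \sim PG(p, \alpha)$, then the expectation of $X$ is

$$E(X) = \frac{\alpha(1-p)}{p^{2}}.$$

Proof. The expectation of the PG distribution can be obtained from

$$E(X) = \sum_{k=1}^{\infty} k\,\alpha(1-p)^{k} = \alpha(1-p)\sum_{k=1}^{\infty} k(1-p)^{k-1}.$$

Since $\sum_{k=1}^{\infty} k(1-p)^{k-1} = 1/p^{2}$ is the derivative of the geometric series, the expectation is $E(X) = \alpha(1-p)/p^{2}$.

Table 1. The mean, variance and index of dispersion (ID) values of the partial-geometric (PG) distribution for different values of $p$ and $\alpha$.

If $X \sim PG(p, \alpha)$, then the variance of $X$ is

$$\mathrm{Var}(X) = \frac{2\alpha(1-p)^{2}}{p^{3}} + \frac{\alpha(1-p)}{p^{2}} - \frac{\alpha^{2}(1-p)^{2}}{p^{4}}.$$

Proof. The variance of the PG distribution can be obtained from $\mathrm{Var}(X) = E(X(X-1)) + E(X) - [E(X)]^{2}$. By the definition of expectation,

$$E(X(X-1)) = \sum_{k=2}^{\infty} k(k-1)\,\alpha(1-p)^{k} = \alpha(1-p)^{2}\sum_{k=2}^{\infty} k(k-1)(1-p)^{k-2}.$$

Since $\sum_{k=2}^{\infty} k(k-1)(1-p)^{k-2} = 2/p^{3}$ is the second derivative of the geometric series, $E(X(X-1))$ is equal to $2\alpha(1-p)^{2}/p^{3}$. Therefore, the stated expression for $\mathrm{Var}(X)$ follows by substituting $E(X(X-1))$ and $E(X)$.

If $X \sim PG(p, \alpha)$, then the mgf of $X$ is

$$M_X(t) = 1 - \frac{\alpha(1-p)}{p} + \frac{\alpha(1-p)e^{t}}{1 - (1-p)e^{t}}, \quad t < -\ln(1-p).$$

Proof. The mgf of the PG distribution can be obtained from

$$M_X(t) = E(e^{tX}) = 1 - \frac{\alpha(1-p)}{p} + \alpha\sum_{k=1}^{\infty} \left[(1-p)e^{t}\right]^{k}.$$

Since $\sum_{k=1}^{\infty} r^{k} = r/(1-r)$ for $|r| < 1$ is the geometric series, the mgf has the stated form, where $t < -\ln(1-p)$.

If $X \sim PG(p, \alpha)$, then the pgf of $X$ is

$$g_X(t) = 1 - \frac{\alpha(1-p)}{p} + \frac{\alpha(1-p)t}{1 - (1-p)t}, \quad |t| < \frac{1}{1-p}.$$

Proof. The pgf of the PG distribution can be obtained from

$$g_X(t) = E(t^{X}) = 1 - \frac{\alpha(1-p)}{p} + \alpha\sum_{k=1}^{\infty} \left[(1-p)t\right]^{k}.$$

Since $\sum_{k=1}^{\infty} r^{k} = r/(1-r)$ for $|r| < 1$ is the geometric series, the pgf has the stated form, where $|t| < 1/(1-p)$.

We consider maximum likelihood estimation (MLE), which is the most commonly used method for parameter estimation. Let $X_1, X_2, \ldots, X_n$ be an independent and identically distributed random sample of size $n$ from the partial-geometric distribution $PG(p, \alpha)$, and let $x_1, x_2, \ldots, x_n$ be the observed sample values. For $k \ge 0$ denote the frequencies $f_k = \#\{i : x_i = k\}$, that is, $f_k$ is the count of observations that are equal to $k$. Note that $\sum_{k \ge 0} f_k = n$. The likelihood function can be written as

$$L(p, \alpha) = \left(1 - \frac{\alpha(1-p)}{p}\right)^{f_0} \prod_{k \ge 1} \left[\alpha(1-p)^{k}\right]^{f_k} = \left(1 - \frac{\alpha(1-p)}{p}\right)^{f_0} \alpha^{\,n - f_0} (1-p)^{\sum_{i=1}^{n} x_i}.$$

The appropriate distribution for fitting these datasets is evaluated with the Anderson-Darling (AD) goodness of fit test for discrete data [12]. The discrete AD test is obtained by applying the dgof package [13] in the R language. Moreover, other model selection criteria are also used to determine the best-fitting model: the minus log-likelihood (-LL), the Akaike information criterion (AIC), and the Bayesian information criterion (BIC). The results of fitting different distributions to these datasets are recorded in Tables 2 and 3.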
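The estimation and model selection steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation. It assumes the PG pmf P(X = 0) = 1 − α(1 − p)/p and P(X = k) = α(1 − p)^k for k ≥ 1, and uses the closed-form estimates p̂ = (n − f_0)/Σx_i and α̂ = p̂(n − f_0)/(n(1 − p̂)) obtained by setting the partial derivatives of the corresponding log-likelihood to zero. The function names and the toy sample are hypothetical; the sample is not one of the paper's datasets.

```python
import math


def pg_mle(data):
    """Closed-form ML estimates (p_hat, alpha_hat) under the assumed PG pmf."""
    n = len(data)
    f0 = sum(1 for x in data if x == 0)   # number of zero observations
    total = sum(data)                     # sum of all observations
    if total == 0:
        raise ValueError("all observations are zero; estimates are undefined")
    p_hat = (n - f0) / total
    if p_hat >= 1.0:
        raise ValueError("boundary case p_hat = 1 (every positive value equals 1)")
    alpha_hat = p_hat * (n - f0) / (n * (1.0 - p_hat))
    return p_hat, alpha_hat


def pg_loglik(data, p, alpha):
    """Log-likelihood f0*ln(1 - alpha*q/p) + (n - f0)*ln(alpha) + sum(x)*ln(q)."""
    q = 1.0 - p
    n = len(data)
    f0 = sum(1 for x in data if x == 0)
    ll = (n - f0) * math.log(alpha) + sum(data) * math.log(q)
    if f0 > 0:
        ll += f0 * math.log(1.0 - alpha * q / p)
    return ll


def geometric_loglik(data):
    """Maximised log-likelihood of G(p) with pmf p*q**k, using p_hat = n / (n + sum(x))."""
    n, total = len(data), sum(data)
    p_hat = n / (n + total)
    return n * math.log(p_hat) + total * math.log(1.0 - p_hat)


def information_criteria(loglik, n_params, n_obs):
    """Return (-LL, AIC, BIC) for a fitted model."""
    aic = 2 * n_params - 2 * loglik
    bic = n_params * math.log(n_obs) - 2 * loglik
    return -loglik, aic, bic


if __name__ == "__main__":
    sample = [0, 0, 0, 1, 1, 2, 1, 3, 0, 2]   # hypothetical toy counts
    p_hat, alpha_hat = pg_mle(sample)
    print("PG estimates:", p_hat, alpha_hat)
    print("PG  (-LL, AIC, BIC):",
          information_criteria(pg_loglik(sample, p_hat, alpha_hat), 2, len(sample)))
    print("Geo (-LL, AIC, BIC):",
          information_criteria(geometric_loglik(sample), 1, len(sample)))
```

Because the geometric distribution is the special case α = p, the maximised PG log-likelihood can never be smaller than the geometric one. AIC and BIC then decide whether the extra parameter is worth its penalty.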
The fitted distributions for the number of claims and the number of hospital stays presented in Tables 2 and 3 show that the PG distribution provides a good fit to the data: its p-value for the discrete AD test is the largest among the three fitted distributions. Moreover, the PG distribution provides the lowest values of -LL, AIC and BIC, and its expected frequencies are the closest to the observed frequencies. Therefore, the most appropriate distribution among these three for the number of claims and the number of hospital stays is the PG distribution, followed by the geometric and Poisson distributions, respectively. Figure 3 shows the fitted frequencies of the geometric, partial-geometric, and Poisson distributions together with the observed frequencies for the total number of claims of the automobile insurance policies and the number of hospital stays. It shows that the PG distribution provides the best fit to these two datasets among the three distributions. Thus, we conclude that, for these datasets, the PG distribution is more flexible than the geometric and Poisson distributions.

Confusion between the geometric and first success distributions led to the idea of developing a new family of distributions. By adding an extra parameter to an existing distribution in order to capture more of the variation in natural phenomena, we propose the partial-geometric distribution, which contains both the geometric and first success distributions as particular cases. We found that the PG distribution is right skewed and unimodal. Moreover, it can also be applied to model under-dispersed data. We also derived some essential probability properties, namely the probability mass function, mean, variance, moment generating function, and probability generating function. The maximum likelihood method is employed to estimate the parameters of the PG distribution.
In the applications to two real datasets, the PG distribution provides the highest p-value for the discrete AD test and the lowest values of -LL, AIC and BIC. Therefore, the PG distribution is useful as an alternative to other distributions for the analysis of discrete data.

[1] A new method for generating families of continuous distributions
[2] Distribution in Statistics: Discrete Distributions
[3] Das Gesetz der kleinen Zahlen, G. Teubner
[4] Quotations: Das Gesetz der kleinen Zahlen
[5] The Theory of Probability, Heinemann
[6] An Intermediate Course in Probability
[7] Discrete gamma distributions: Properties and parameter estimations
[8] On best practice optimization methods in R
[9] R: A Language and Environment for Statistical Computing
[10] Loss Models: From Data to Decisions
[11] Demand for medical care by the elderly: A finite mixture approach
[12] Cramer-von Mises statistics for discrete distributions
[13] Nonparametric goodness-of-fit tests for discrete null distributions

The authors would like to thank the Department of Statistics, Faculty of Science, Kasetsart University.

The log-likelihood function can be written as

$$\ell(p, \alpha) = f_0 \ln\left(1 - \frac{\alpha(1-p)}{p}\right) + (n - f_0)\ln\alpha + \ln(1-p)\sum_{i=1}^{n} x_i.$$

By taking partial derivatives with respect to the parameters, we obtain

$$\frac{\partial \ell}{\partial \alpha} = \frac{n - f_0}{\alpha} - \frac{f_0(1-p)}{p - \alpha(1-p)}, \qquad \frac{\partial \ell}{\partial p} = \frac{f_0\,\alpha}{p\,[p - \alpha(1-p)]} - \frac{1}{1-p}\sum_{i=1}^{n} x_i.$$

The maximum likelihood estimators are found by equating the partial derivatives to zero; that is,

$$\frac{f_0(1-p)}{p - \alpha(1-p)} = \frac{n - f_0}{\alpha}, \qquad \frac{f_0\,\alpha}{p\,[p - \alpha(1-p)]} = \frac{1}{1-p}\sum_{i=1}^{n} x_i.$$

By rewriting the second equation and substituting it into the first equation, we obtain the maximum likelihood estimates $\hat{\alpha}$ and $\hat{p}$ of the PG distribution as

$$\hat{p} = \frac{n - f_0}{\sum_{i=1}^{n} x_i}, \qquad \hat{\alpha} = \frac{\hat{p}\,(n - f_0)}{n(1 - \hat{p})}.$$

In order to evaluate the performance of the PG distribution, we fit two real datasets and compare the PG distribution with two competing distributions, the geometric and the Poisson. We do not consider the first success distribution as a competitor because the data contain zeros. The first dataset is accident data that gives the total number of claims for 9,461 automobile insurance policies [10]. The second dataset is the number of hospital stays of persons aged 66 and over, for which there are 4,406 observations.
These data were acquired from the National Medical Expenditure Survey, a survey of how Americans use and pay for health services that was conducted in 1987 and 1988 to give a comprehensive picture of medical expenditure [11].

The research of the author listed last was partially supported by the subsidy allocated to Kazan Federal University for the state assignment in the sphere of scientific activities, project 1.13556.2019/13.1.