key: cord-0969042-2j9pwy3l authors: Shanmugam, Ramalingam title: Spinned Poisson distribution with health management application date: 2011-04-26 journal: Health Care Manag Sci DOI: 10.1007/s10729-011-9157-8 sha: adbc8757255002df760c308276d92bc7bc969c15 doc_id: 969042 cord_uid: 2j9pwy3l
Consider a data collection setup during the spread of an infectious disease, such as severe acute respiratory syndrome (SARS) or influenza A virus H3N2. In such scenarios the health management quickly removes the infected cases before the data collection is completed. Estimating the incidence rate, or the chance for anyone to be immune, requires a correct probability distribution. The usual Poisson distribution is inappropriate because removing infected cases affects both the incidence rate and the chance of observing a new case. An appropriate new probability distribution is derived and named the spinned Poisson distribution. Its statistical properties are developed and illustrated using smallpox incidence in Abakaliki, Nigeria.
Let Y be a random number of cases in an observable set S of non-negative integers with unknown parameters Θ ∈ {θ, ρ} in a Poisson-type chance mechanism where the infected cases are quickly removed by the health management even before the completion of data collection. The data have to be analyzed carefully, with an appropriate underlying probability distribution. The usual Poisson distribution is inappropriate because of the impact of removing infected cases on the incidence rate and on the chance of observing a new case. The parameters θ>0 and ρ≥0 portray respectively the incidence rate and the impact level of removing infected cases. Consequently, both the random observation count Y and the incidence rate θ become size/length biased.
Published articles in the statistical literature deal only with the size/length bias on the observation Y in a marginal sense, excluding the impact on the parameter (see pages 149-150 and the reference section of Johnson et al. [3] for a full list of published articles on size/length biased sampling). There is no article in the literature discussing the sampling bias on the incidence rate θ. To model realistically the situation in which the infected cases are removed, the sampling bias on both the observation and the incidence rate should be dealt with simultaneously. For this purpose, assume that the observation Y is proportional to a sampling weight factor $w(y; \rho, \theta) = \frac{1+\rho y}{1+\rho\theta}$. An insight for this weight factor arises from the following reasons. The observation y is impacted by the removal bias ρ≥0. A shift by one in the weight factor is necessary to avoid degeneration of the probabilities to zero when ρ=0. The denominator (1+ρθ) in the weight factor echoes the impact of the removal bias on the incidence parameter θ, and it is necessary for the expression in (1) to be a bona fide probability distribution. In this weighted Poisson sampling framework, ρ=0 nullifies the impact of removing cases. When ρ=1, it is called size/length biased sampling, which is found in the count distribution literature. Since the seminal work of Cox [2], many articles have appeared about size/length biased sampling. These articles considered only the bias in the observation Y, but not the bias in the parameter. This article is the first one to suggest that the selection bias could influence the parameter(s) as much as the observation. A new Poisson distribution in (1) incorporates the chance mechanism of rare events in which the infected cases are removed. $p(y \mid \rho, \theta) = \frac{1+\rho y}{1+\rho\theta}\, e^{-\theta}\theta^{y}/y!; \quad y = 0, 1, 2, \ldots, \infty; \quad 0 < \theta < \infty; \quad 0 \le \rho < \infty \quad (1)$ The probability distribution in (1) is named the spinned Poisson distribution (SPD).
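As a minimal sketch, assuming only the form of (1), the spinned Poisson probabilities can be evaluated numerically. The function name `spd_pmf` and the truncation point of the check are illustrative assumptions and not part of the article; the parameter values 0.93 and 0.48 are the approximate estimates reported in the smallpox illustration.

```python
import math


def spd_pmf(y, theta, rho):
    """Spinned Poisson probability Pr[Y = y] as given in (1).

    theta > 0 is the incidence rate; rho >= 0 is the impact level of
    removing infected cases. The name spd_pmf is hypothetical.
    """
    if y < 0:
        return 0.0
    # Sampling weight factor w(y; rho, theta) = (1 + rho*y) / (1 + rho*theta).
    weight = (1.0 + rho * y) / (1.0 + rho * theta)
    # Poisson part exp(-theta) * theta**y / y!, computed on the log scale
    # so that large y does not overflow the factorial.
    log_poisson = -theta + y * math.log(theta) - math.lgamma(y + 1)
    return weight * math.exp(log_poisson)


# Numerical check that the probabilities sum to one
# (the infinite sum is truncated at y = 199, an arbitrary cut-off).
total = sum(spd_pmf(y, theta=0.93, rho=0.48) for y in range(200))
print(round(total, 10))
```

Setting `rho=0` in this sketch returns the usual Poisson probabilities, which mirrors the remark that ρ=0 nullifies the impact of removing cases.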
Is the expression in (1) a bona fide probability distribution? The answer is affirmative, and the following arguments prove it. For the specified parameter space 0 < θ < ∞; 0 ≤ ρ < ∞ and the sample space y=0,1,2,…,∞, the right side of the expression (1) is clearly non-negative. In addition, the sum of these non-negative values is equal to one, as proved below. The SPD is versatile enough to describe data from engineering, finance and economics in which the removal of cases occurs before completing the data collection. For example, in finance studies, a defaulted homeowner may be removed from a potential home mortgage insurance list. The SPD has several interesting statistical properties as explained below. The estimation of its parameters, assessment of its survival function, and the application of the hypothesis testing procedures for the SPD are of interest. Results are demonstrated, using Bailey's [1] data on Nigeria's smallpox incidences, in Section 3. The data appear in Table 1. Fig. 2 helps to compare the goodness of fit test results. In this section, several statistical properties are derived and interpreted. The expected value μ = E(Y) is obtained in a straightforward manner as stated in (2); an approximation follows by neglecting the fraction $\frac{1}{1+\rho\theta} < 1$. When the impact of removing infected cases is negligible (that is, ρ=0), the mean in (2) reduces to just the expected value θ of the usual Poisson distribution. Expression (2) also attests to a triangular relationship among the mean μ and the parameters θ and ρ. In a similar manner, the variance is obtained from the second factorial moment in expression (3.a). The expression (3.a) can be rewritten equivalently, for ρ≠0, as expression (3.b), which illustrates a mean-variance relationship like the one for the usual Poisson distribution, in which the mean and variance are equal (that is, σ² = μ).
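The normalization argument and the first two moments can be sketched directly from (1). The derivation below is a reconstruction under that assumption and uses only the standard moments of X ~ Poisson(θ), namely E[X] = θ, E[X(X−1)] = θ² and E[X(X−1)(X−2)] = θ³; its algebraic form may differ from the article's own expressions (2), (3.a) and (3.b).

```latex
\begin{align*}
\sum_{y=0}^{\infty} p(y \mid \rho,\theta)
  &= \frac{1}{1+\rho\theta}\sum_{y=0}^{\infty}(1+\rho y)\,
     \frac{e^{-\theta}\theta^{y}}{y!}
   = \frac{1+\rho\,E[X]}{1+\rho\theta} = 1, \\
\mu = E(Y)
  &= \frac{E[X]+\rho\,E[X^{2}]}{1+\rho\theta}
   = \frac{\theta+\rho(\theta+\theta^{2})}{1+\rho\theta}
   = \theta+\frac{\rho\theta}{1+\rho\theta}
   = \theta+1-\frac{1}{1+\rho\theta}, \\
E[Y(Y-1)]
  &= \frac{\theta^{2}+\rho(\theta^{3}+2\theta^{2})}{1+\rho\theta}
   = \theta^{2}+\frac{2\rho\theta^{2}}{1+\rho\theta}, \\
\sigma^{2}
  &= E[Y(Y-1)]+\mu-\mu^{2}
   = \theta+\frac{\rho\theta}{(1+\rho\theta)^{2}}
   = \mu-(\mu-\theta)^{2}.
\end{align*}
```

At ρ=0 these expressions reduce to μ = σ² = θ, the usual Poisson values. Dropping the fraction 1/(1+ρθ) in the mean gives μ ≈ θ + 1, which appears consistent with the approximation in (10).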
When there is negligible impact of removing the infected cases (that is, ρ=0), the results in (3.a) and (3.b) reduce respectively to the second factorial moment and variance of the usual Poisson distribution. Otherwise, there is an intrinsic relationship among the incidence rate, the impact of removing infected cases and the expected number of infected cases. This result confirms that ρ→0 when μ→θ, a property of the usual Poisson distribution. The probability $p(0 \mid \rho, \theta) = \frac{e^{-\theta}}{1+\rho\theta}$ is the threshold probability for the event Y=0. However, a recurrence relationship (4) exists among the spinned Poisson probabilities. The recurrence relationship in (4) and the relationships it implies point out that the data-collection mechanism that includes the removal of infected cases indeed imposes a sampling bias on both the observation and the incidence rate. When the impact of removing infected cases turns out to be negligible (that is, ρ→0), the SPD in (1) reduces to the usual Poisson distribution $p(y \mid \theta) = e^{-\theta}\theta^{y}/y!$. See Johnson et al. [3] for properties of the usual Poisson distribution. Now, other properties of the SPD are derived. First, note incidentally that the statistic $\frac{1}{1+\rho Y}$ is an unbiased estimate of the parametric function $\frac{1}{1+\rho\theta}$ because $E\left[\frac{1}{1+\rho Y}\right] = \sum_{y=0}^{\infty} \frac{1}{1+\rho y}\cdot\frac{1+\rho y}{1+\rho\theta}\, e^{-\theta}\theta^{y}/y! = \frac{1}{1+\rho\theta}$. Furthermore, the variance of the statistic $\frac{1}{1+\rho Y}$ is obtained using a formula for the variance of a ratio U/V (see Stuart and Ord [4] for details), where $\mu_j$ and $\sigma_j^2$ denote the mean and variance respectively of j = U or V. In this setup, U=1 and V=1+ρY, and the variance of the statistic $\frac{1}{1+\rho Y}$ follows from this formula. Consequently, an expression (6) for computing the correlation coefficient between the two sample statistics Y and $\frac{1}{1+\rho Y}$ is obtained and utilized to confirm whether the SPD matches well the given data in Section 3. The correlation coefficient is zero when ρ=0. Otherwise (that is, for ρ≠0), it is computed using (2) with the MLE of the dispersion σ² and the mean μ. Consider a random sample $y_1, y_2, \ldots, y_n$ from the SPD in Eq. 1. The maximum likelihood estimators (MLEs) are asymptotically efficient.
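The recurrence among successive spinned Poisson probabilities and the unbiasedness of 1/(1+ρY) noted above can be checked numerically. In this sketch the ratio of successive probabilities is obtained by dividing consecutive terms of (1); that it matches the form of the article's expression (4) is an assumption, and the function name is hypothetical.

```python
import math


def spd_probs_by_recurrence(theta, rho, y_max):
    """Return [Pr[Y=0], ..., Pr[Y=y_max]] for the spinned Poisson in (1).

    Starts from Pr[Y=0] = exp(-theta) / (1 + rho*theta) and applies
    Pr[Y=y+1] / Pr[Y=y] = theta/(y+1) * (1 + rho*(y+1)) / (1 + rho*y),
    a ratio derived here from (1) (hypothetical helper, not from the article).
    """
    probs = [math.exp(-theta) / (1.0 + rho * theta)]
    for y in range(y_max):
        ratio = theta / (y + 1) * (1.0 + rho * (y + 1)) / (1.0 + rho * y)
        probs.append(probs[-1] * ratio)
    return probs


theta, rho = 0.93, 0.48  # approximate estimates from the smallpox illustration
probs = spd_probs_by_recurrence(theta, rho, y_max=60)

# Unbiasedness check: E[1/(1+rho*Y)] should equal 1/(1+rho*theta).
lhs = sum(p / (1.0 + rho * y) for y, p in enumerate(probs))
rhs = 1.0 / (1.0 + rho * theta)
print(lhs, rhs)
```

The recurrence avoids recomputing factorials for every y, which is the usual reason for preferring it when a whole run of probabilities is needed.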
The log likelihood function is $\ln L = \sum_{i=1}^{n} \ln p(y_i \mid \rho, \theta)$. The first derivatives of the log likelihood function with respect to the parameters θ and ρ give their score functions. Equating the score functions $\partial_{\theta} \ln L$ and $\partial_{\rho} \ln L$ to zero gives the MLEs. The exact values of the MLEs $\hat{\theta}_{mle}$ and $\hat{\rho}_{mle}$ could be obtained by solving the resulting pair of equations iteratively. The process is tedious for the exact values, but it can be eased if approximate values are acceptable. For this purpose, the expression (9) is approximated using a Taylor series expansion, in which the function of $\hat{\rho}_{mle}$ being expanded and its derivative with respect to $\hat{\rho}_{mle}$ are evaluated at $\hat{\rho}_{mle} = 0$, and $\hat{\theta}_{mle}$ is the MLE of the incidence rate. Hence, the approximate values of the MLEs are $\hat{\theta}_{mle} \approx \bar{y} - 1$ (10) and the expression (11) for $\hat{\rho}_{mle}$, where $\bar{y}$ is the sample mean and s is the sample standard deviation. Next, an expression for the survival function of the SPD is derived. The cumulative distribution function of the usual Poisson distribution is expressed in terms of the cumulative chi-squared distribution function in Johnson et al. [3], and the latter is tabulated extensively in books. Following this line of thinking, a similar relationship between the cumulative distribution functions (CDF) of the SPD and the chi-squared distribution can be established. For this purpose, let $CDF(m; \theta, \rho) = \Pr[Y \le m]$ be the CDF of the SPD for a prespecified m in the sample space. Implicitly, the survival function $Sf(m+1; \theta, \rho) = 1 - CDF(m; \theta, \rho)$ of the spinned Poisson distribution is, after algebraic simplifications, expressed as (12), in which $CDF_{\chi^2_{2m\,df}}(2\theta)$ denotes the cumulative chi-squared distribution function with 2m degrees of freedom (df) evaluated at the percentile 2θ. An immediate removal means that the number of days to remove a next infected case is Y=0. The probability for an immediate removal to occur, with m=0, is given in (13). The scenario ρ=0 is indicative of no impact of removing the infected cases.
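A minimal sketch of the survival function, assuming only (1), is given below. It computes Sf(m+1; θ, ρ) = Pr[Y > m] by direct summation and compares it with an equivalent expression in terms of usual Poisson tail probabilities. That expression is derived here rather than taken from the article; because Pr[X ≥ m] for a Poisson variable X equals the chi-squared CDF with 2m df evaluated at 2θ, it is presumably close in spirit to the chi-squared form described in the text, but that correspondence is an assumption. All function names are hypothetical.

```python
import math


def poisson_pmf(y, theta):
    """Usual Poisson probability exp(-theta) * theta**y / y!."""
    return math.exp(-theta + y * math.log(theta) - math.lgamma(y + 1))


def poisson_cdf(m, theta):
    """Pr[X <= m] for X ~ Poisson(theta); equals zero when m < 0."""
    return sum(poisson_pmf(y, theta) for y in range(m + 1))


def spd_survival_direct(m, theta, rho):
    """Sf(m+1) = 1 - CDF(m), summing the probabilities in (1) up to m."""
    cdf = sum((1.0 + rho * y) / (1.0 + rho * theta) * poisson_pmf(y, theta)
              for y in range(m + 1))
    return 1.0 - cdf


def spd_survival_via_poisson(m, theta, rho):
    """Same quantity written with Poisson tails:
    (Pr[X > m] + rho*theta*Pr[X >= m]) / (1 + rho*theta)."""
    tail_gt = 1.0 - poisson_cdf(m, theta)       # Pr[X > m]
    tail_ge = 1.0 - poisson_cdf(m - 1, theta)   # Pr[X >= m]
    return (tail_gt + rho * theta * tail_ge) / (1.0 + rho * theta)


# Compare both routes at the approximate estimates from the illustration.
for m in range(4):
    print(m + 1,
          spd_survival_direct(m, 0.93, 0.48),
          spd_survival_via_poisson(m, 0.93, 0.48))
```

With `rho=0` both functions return the usual Poisson survival probability, and one minus the value at m=0 is the probability of an immediate removal, Pr[Y=0].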
Substituting ρ=0 in (12) and (13) reduces them to the results for the usual Poisson distribution. On the contrary, the scenario ρ=1 is indicative of the presence of impact on the observation count Y and on the incidence rate θ due to removing the infected cases. Otherwise (that is, when ρ>0), there is a significant impact on the observation and on the incidence rate due to removing infected cases. A hypothesis testing procedure about the impact parameter ρ is therefore worthwhile. The likelihood ratio concept can be used to test the null hypothesis $H_o: \rho = \rho_o$, where $\rho_o$ is zero, against a research/alternative hypothesis $H_1: \rho = \rho^{*} \ne \rho_o$. Note that the likelihood ratio is $\Lambda_{\rho_o} = \frac{L(y_1, y_2, \ldots, y_n; \hat{\theta}_{\rho_o=0}, \rho_o = 0)}{L(y_1, y_2, \ldots, y_n; \hat{\theta}_{\hat{\rho}_{mle}}, \hat{\rho}_{mle} \ne 0)}$ under the null hypothesis and is $\Lambda_{\rho^{*}} = \frac{L(y_1, y_2, \ldots, y_n; \hat{\theta}_{\rho^{*}}, \rho = \rho^{*})}{L(y_1, y_2, \ldots, y_n; \hat{\theta}_{\hat{\rho}_{mle}}, \hat{\rho}_{mle} \ne 0)}$ under a research/alternative hypothesis. The MLE of the incidence rate is $\hat{\theta}_{mle,\rho_o=0} = \bar{y}$ under the null hypothesis and $\hat{\theta}_{mle,\rho=\rho^{*}} \approx \bar{y} - 1$ under the alternative/research hypothesis. The MLE of the impact parameter is likewise obtained under the alternative/research hypothesis. Under the null hypothesis, the minus log likelihood ratio in (14) follows a non-central chi-squared distribution with one df and non-centrality parameter $\hat{\delta}_{\rho_o}$. See Wald [5] for the definition and properties of the non-central chi-squared distribution. A unique property of the MLE is that the MLE of a function is the function of the MLE of the parameters. In addition, the variance-covariance matrix of the MLE of the parameters is the inverse of the information matrix in (16), which is obtained using (2). The determinant of the information matrix in (16) is obtained using (3.b) and (10). The variance-covariance matrix of the MLE of the parameters is evaluated at the MLEs $\hat{\theta}_{mle}$ and $\hat{\rho}_{mle}$.
In addition, the non-central chi-squared distribution with one df and non-centrality parameter δ approximately follows $1 + \frac{\delta}{1+\delta}$ times a central chi-squared distribution with $\frac{(1+\delta)^2}{1+2\delta}$ df (see Stuart and Ord [4] for details of this equivalence). This means that the null hypothesis $H_o: \rho = \rho_o$, where $\rho_o$ is zero, will be rejected in favor of the research/alternative hypothesis according to an inequality whose right side is the critical value based on the 100(1−α)th percentile of the central chi-squared distribution with $\frac{(1+\hat{\delta}_{\rho_o})^2}{1+2\hat{\delta}_{\rho_o}}$ df and a significance level α ∈ (0,1). The statistical power of the test statistic in (14) can be examined with a selection of a specific value for ρ in the research/alternative hypothesis. Let $H_1: \rho = \rho^{*} \ne \rho_o$, in which $\rho_o$ is zero. The statistical power is the probability of rejecting the null hypothesis $H_o: \rho = \rho_o$ in favor of the research/alternative hypothesis $H_1: \rho = \rho^{*} \ne \rho_o$. Under the research hypothesis, the minus log likelihood ratio follows a non-central chi-squared distribution with one df and non-centrality parameter $\hat{\delta}_{\rho^{*}}$. This non-central chi-squared distribution approximately follows $1 + \frac{\hat{\delta}_{\rho^{*}}}{1+\hat{\delta}_{\rho^{*}}}$ times a central chi-squared distribution with $\frac{(1+\hat{\delta}_{\rho^{*}})^2}{1+2\hat{\delta}_{\rho^{*}}}$ df. The power is the probability of accepting the research/alternative hypothesis $H_1$ when ρ = ρ*, as expressed in Eq. 21. The results are applicable to studies in which some cases are removed from the study population due to a screening criterion. The results based on the spinned Poisson distribution could become the foundation for performing further reliability or validity analysis. These and other aspects are currently under investigation and the findings will be reported later.
[1] Bailey. The mathematical theory of infectious diseases and its applications.
[2] Cox. Some sampling problems in technology.
[3] Johnson et al. Univariate discrete distributions.
[4] Stuart and Ord. Kendall's advanced theory of statistics, volume I.
[5] Wald. Tests of statistical hypotheses concerning several parameters when the number of observations is large.
The author sincerely thanks all three reviewers and the editor for comments and suggestions, which helped to greatly improve the manuscript.
In this section, the smallpox data displayed in Bailey [1, p. 105] are considered to illustrate the results in Section 2. Bailey reported the number Y of days waited to remove a smallpox case in a closed community of 100 individuals in Abakaliki, Nigeria (as in Table 1 below). Two graphs and one table are included. The MLEs are $\hat{\rho}_{mle} \approx 0.48$ and $\hat{\theta}_{mle} \approx 0.93$ using (11) and (10) respectively. It is clear that there is an impact of removing infected smallpox cases on the incidence rate θ and on the observation y as well. The usual Poisson distribution $p(y \mid \theta) = \Pr[Y = y] = e^{-\theta}\theta^{y}/y!$ with the incidence rate θ is therefore inappropriate for these data. Under the usual Poisson distribution, the estimate of the incidence rate would have been $\hat{\theta}_{\rho=0} \approx \bar{y} = 1.39$. The graph of Y versus $\frac{1}{1+\rho Y}$ in Fig. 1 confirms the appropriateness of the SPD for the given data. Their linear relationship is captured by the correlation coefficient, which is estimated to be $\widehat{corr}\left(Y, \frac{1}{1+\rho Y}\right) = -0.79$ due to (6) and is quite different from zero. If the data were simple Poisson, with no sampling bias on y and no impact on the incidence rate due to the removal of infected cases, the configuration in Fig. 1 would have been horizontal with zero correlation.
But it did not happen. The cumulative distribution function, CDF(m, θ, ρ) = Pr[Y ≤ m], or implicitly its survival function, $Sf(m+1; \theta, \rho) = 1 - CDF(m; \theta, \rho)$, of the spinned Poisson distribution is evaluated, after algebraic simplifications, in terms of $CDF_{\chi^2_{2m\,df}}(1.86)$, the cumulative chi-squared distribution function with 2m df and percentile (1.86). Fig. 2 illustrates a graphical comparison of the survival functions of the usual Poisson, spinned Poisson and empirical distributions of the data in Table 1. The vertical and horizontal axes represent survival probability and m+1 respectively. The survival function of the spinned Poisson distribution, rather than that of the usual Poisson distribution, is closer to the empirical survival function. The immediate removal occurs when the number of waiting days to remove an infected case is zero, meaning that Y=0. The probability for an immediate removal situation to occur is obtained by setting m=0. The null hypothesis is $H_o: \rho = \rho_o = 0$ and the research/alternative hypothesis is $H_1: \rho = \rho^{*} = 1$. These are now tested using the likelihood ratio test. Under the null hypothesis, the minus log likelihood ratio is $-\ln \Lambda_{\hat{\rho}_{mle}} = 14.14$ and the non-centrality parameter is $\hat{\delta}_{\rho_0=0} = 0.008$. This non-central chi-squared value with one df and non-centrality parameter is approximately $1 + \frac{\hat{\delta}_{\rho_0}}{1+\hat{\delta}_{\rho_0}}$ times a central chi-squared value. Using the likelihood ratio criterion, the statistical power of rejecting the null hypothesis in favor of the research hypothesis is calculated using Eq. 21. That is, $Power = \Pr\left[\chi^2_{0.991\,df} < 0.34\right] = 0.55$ when the research hypothesis $H_1: \rho = \rho^{*} = 1$ is true. The results of this article are applicable to analyzing data from business, engineering, economics, and sociologic