Two more ways of spelling Gini Coefficient with Applications
Marta Milewska, Remco van der Hofstad, Bert Zwart
January 28, 2022

Abstract. In this paper, we draw attention to a promising yet slightly underestimated measure of variability: the Gini coefficient. We describe two new ways of defining and interpreting this parameter. Using our new representations, we compute the Gini index for a few probability distributions and describe it in more detail for the negative binomial distribution. We also suggest the latter as a tool to measure overdispersion in epidemiology.

1 Introduction

Variability is one of the most fundamental features in data analysis. Statistics provides many tools for measuring variability, the most popular of which is the variance. However, it can be argued that in some cases, for example when dealing with a distribution that deviates significantly from the Gaussian curve, the variance fails to convey some important information. Therefore, different metrics have been proposed, which also provide quantitative information when the variance is infinite. A particular metric of this kind is the Gini mean difference (abbreviated GMD), first proposed in 1912 by the Italian statistician Corrado Gini. Since then, the GMD and the parameters that can be derived from it (such as the Gini coefficient) have received quite some interest. However, they are mainly applied to quantify and compare income disparities among countries, and it has been suggested that this application does not reflect the entire potential of the GMD.

Out of the Gini family, the most well-known index is the Gini coefficient, again mostly used in the area of income inequality. It was developed independently of the GMD and defined with respect to the Lorenz curve (explained in more detail below), and this definition remains the most well-known formulation of the Gini coefficient to this day. In 1914, Corrado Gini discovered and documented the relationship between the Gini coefficient and the Gini mean difference, which turns out to be very simple: the Gini coefficient equals the GMD divided by twice the mean, assuming that the latter is positive. As a result, the Gini coefficient ranges from 0 to 1, where 0 corresponds to perfect equality and 1 to the situation where only one observation is positive. So far, at least 14 distinct alternative representations have been described, each of them leading to a different interpretation. Readers interested in investigating these various representations further should refer to [17], or to [20] for a more concise guide.

Our motivation for considering the Gini coefficient arises from the field of epidemiology, where it has been realized that, apart from the mean reproduction number, variability can also play an important role. Viruses such as SARS-CoV-2 are considered to be highly overdispersed (see [5] and [7]). So far, describing overdispersion has been restricted to a particular parametric model (see [10]). We want to make the case that the Gini coefficient provides a measure of overdispersion that is not limited to any particular probability distribution, and is consistent with the insights provided by existing parametric models. As computing the Gini coefficient from its classic definition might be arduous for some probability distributions, we first add two new representations: a probabilistic and an analytical one.
We then apply the newly developed methodology to compute the Gini coefficient for a few classic probability distributions. We link our findings with the potential application to measuring overdispersion by giving more attention to the negative binomial distribution, which is often used in epidemiology to model the number of secondary cases per infected individual. In particular, we examine its asymptotic behavior.

This paper is organised as follows: in Section 2 we derive two new representations of the Gini coefficient, a probabilistic and an analytical one. In Section 3 we apply the newly developed methodology to compute the Gini coefficient for a few classic probability distributions: the Poisson distribution in Section 3.1, the exponential distribution in Section 3.2, the geometric distribution in Section 3.3, the Pareto distribution in Section 3.4, the uniform distribution in Section 3.5 and the negative binomial distribution in Section 3.6. When possible, we refer to other examples of computing the Gini coefficient for these distributions. In Section 4 we focus on applications of the Gini coefficient for the negative binomial distribution, which is very often used in epidemiological modeling. We explain why the Gini coefficient should be considered as a measure of overdispersion in infectious diseases. In Theorems 5 and 6, we establish the asymptotic behavior of the Gini coefficient of the negative binomial distribution for small and large parameters, respectively. Lastly, in Section 5 we close with a conclusion.

2 Two new representations of the Gini coefficient

The Gini coefficient is originally defined in terms of the Lorenz curve, which plots the proportion of the total income of the population (y-axis) that is cumulatively earned by the bottom x% of the population. The line at 45 degrees thus represents perfect equality of incomes. The Gini coefficient can then be thought of as the ratio of the area that lies between the line of equality and the Lorenz curve to the total area under the line of equality. Before its link with the Gini mean difference was discovered, the coefficient was actually called a 'concentration ratio'. However, the graphical representation is not always that useful. In order to derive alternative expressions for the Gini index, we first recall its well-known [17, 20] definition in terms of the mean difference.

Definition 1. (Gini coefficient.) For a non-negative random variable X with 0 < E[X] < ∞,

  G(X) = E[|X_1 − X_2|] / (2 E[X]),   (1)

where X_1 and X_2 are independent copies of X.

Formula (1) is the most popular representation of the Gini coefficient. However, it turns out that for many probability distributions it yields a rather unpleasant form. Hence, we propose a more instructive representation in terms of a size-biased random variable:

Definition 2. (Size-biased random variable.) Let X be a non-negative random variable with 0 < E[X] < ∞. We define X*, the size-biased version of X, through its distribution function

  P(X* ≤ x) = (1/E[X]) ∫_0^x P(X > z) dz,  x ≥ 0.   (2)

Theorem 1. (Gini coefficient in terms of the size-biased variable.) For X ≥ 0 with 0 < E[X] < ∞,

  G(X) = P(X* ≥ X),   (3)

where X* is the size-biased random variable defined in (2), taken independent of X.

Proof. By the definition of the Gini coefficient stated in (1), and using the symmetry of the two copies,

  G(X) = (1/E[X]) ∫_0^∞ ∫_0^x (x − y) dF_X(y) dF_X(x).   (4)

We write x − y = ∫_y^x dz and, since the integrands are non-negative, we can apply Fubini's theorem to change the order of integration in the latter and rearrange (4) as

  G(X) = (1/E[X]) ∫_0^∞ P(X ≤ z) P(X > z) dz = ∫_0^∞ P(X ≤ z) f_{X*}(z) dz = P(X* ≥ X),

since, by (2), X* has density f_{X*}(z) = P(X > z)/E[X].

Theorem 1 is valid both for discrete and continuous random variables. However, for an N_0-valued random variable, X* in (3) can be replaced by a discrete version of the size-biased random variable, X*_d, with the probability distribution

  P(X*_d = j) = P(X > j)/E[X],  j ∈ N_0.
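To make the representation in Theorem 1 concrete, the following small Python sketch (not part of the paper; it assumes NumPy and SciPy are available) compares a Monte Carlo evaluation of definition (1) with the integral ∫_0^∞ P(X ≤ z) P(X > z) / E[X] dz appearing in the proof, for two textbook distributions.

import numpy as np
from scipy import integrate, stats

def gini_definition(sample):
    # Monte Carlo estimate of E|X1 - X2| / (2 E[X]) from an i.i.d. sample.
    x1, x2 = sample[::2], sample[1::2]
    return np.mean(np.abs(x1 - x2)) / (2 * np.mean(sample))

def gini_theorem1(dist):
    # G(X) = P(X* >= X) = int_0^inf P(X <= z) P(X > z) / E[X] dz  (Theorem 1).
    mean = dist.mean()
    integrand = lambda z: dist.cdf(z) * dist.sf(z) / mean
    value, _ = integrate.quad(integrand, 0, np.inf)
    return value

rng = np.random.default_rng(seed=1)
for dist in (stats.expon(), stats.lognorm(s=1.0)):
    sample = dist.rvs(size=1_000_000, random_state=rng)
    print(dist.dist.name, round(gini_definition(sample), 4), round(gini_theorem1(dist), 4))

For each distribution, the two numbers agree to Monte Carlo accuracy; for the exponential distribution both are close to 1/2.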
Proposition 1. Take a random variable X taking values in N_0 with 0 < E[X] < ∞. Then X* has the same distribution as X*_d + Y, where Y is uniform on [0, 1] and independent of X*_d, and consequently

  G(X) = P(X*_d ≥ X).

Proof. We differentiate (2) to obtain the density of X*:

  f_{X*}(x) = P(X > x)/E[X] = P(X*_d = j)  for x ∈ [j, j+1), j ∈ N_0.

Thus, X* has the same distribution as X*_d + Y. Next, observe that, for all N_0-valued X,

  G(X) = P(X* ≥ X) = P(X*_d + Y ≥ X).   (5)

Since we only consider x ∈ N_0, and Y < 1 almost surely, it is true that

  P(X*_d + Y ≥ x) = P(X*_d ≥ x)  for every x ∈ N_0.

Substituting this into (5) completes the proof of the second statement.

Since using X*_d to compute the Gini index simplifies some calculations, throughout the paper we will use this definition when working with N_0-valued random variables.

By noticing that the Gini coefficient is a convolution, evaluated at 0, of the distribution of X and the tail distribution of −X*, one can arrive at a different representation using Fourier theory. Since Fourier theory is different in the discrete and the continuous setting, we make a distinction between the two cases. The following theorem provides a representation of the Gini coefficient of a non-negative integer-valued random variable.

Theorem 2. (Fourier representation of the Gini coefficient in the discrete case.) Let X be a non-negative integer-valued random variable with 0 < E[X] < ∞, probability mass function p_X(j) = P(X = j) and Fourier transform p̂_X(θ) = Σ_{j=0}^∞ P(X = j) e^{iθj}. Then

  G(X) = (1/(4π E[X])) ∫_{−π}^{π} (1 − p̂_X(θ) p̂_X(−θ)) / (1 − cos θ) dθ,   (6)

which can be alternatively written as

  G(X) = (1/(4π E[X])) ∫_{−π}^{π} (1 − ||p̂_X(θ)||²) / (1 − cos θ) dθ.   (7)

Proof. Write

  G(X) = P(X*_d ≥ X) = Σ_{n=0}^∞ P(X = n) P(X*_d ≥ n).   (8)

Now, if we define the functions f_X and g_X as

  f_X(n) = P(X = n) for n ≥ 0,   g_X(n) = P(X*_d ≥ −n) for n ≤ 0,

and zero otherwise, then we notice that (8) is a convolution of these functions, say (f_X ⋆ g_X)(n), at n = 0, so G(X) = (f_X ⋆ g_X)(0), which can be rearranged with the help of Fourier theory. Applying the inverse Fourier transform and the convolution theorem we obtain

  (f_X ⋆ g_X)(n) = (1/2π) ∫_{−π}^{π} f̂_X(θ) ĝ_X(θ) e^{−iθn} dθ.   (9)

Substituting n = 0 in (9), we arrive at the new representation of G(X) as

  G(X) = (1/2π) ∫_{−π}^{π} f̂_X(θ) ĝ_X(θ) dθ.   (10)

It remains to compute the Fourier transforms f̂_X(θ) and ĝ_X(θ). It holds that

  f̂_X(θ) = p̂_X(θ).   (11)

Further,

  ĝ_X(θ) = Σ_{m=0}^∞ P(X*_d ≥ m) e^{−iθm} = Σ_{l=0}^∞ P(X*_d = l) Σ_{m=0}^{l} e^{−iθm} = (1 − e^{−iθ} p̂_{X*_d}(−θ)) / (1 − e^{−iθ}),   (12)

where the last equality follows from changing the order of summation, substituting the geometric sum Σ_{m=0}^{l} e^{−iθm}, extracting 1/(1 − e^{−iθ}) and splitting the sum. We now compute p̂_{X*_d}(θ) as

  p̂_{X*_d}(θ) = Σ_{l=0}^∞ P(X*_d = l) e^{iθl} = (1/E[X]) Σ_{j=1}^∞ P(X = j) Σ_{l=0}^{j−1} e^{iθl} = (p̂_X(θ) − 1) / (E[X](e^{iθ} − 1)),   (13)

where we again changed the order of summation and computed Σ_{l=0}^{j−1} e^{iθl}, after which we extracted 1/(e^{iθ} − 1) and split the sum. Substituting (13) into (12) yields

  ĝ_X(θ) = (1 − e^{−iθ}(p̂_X(−θ) − 1)/(E[X](e^{−iθ} − 1))) / (1 − e^{−iθ}).   (14)

We derive the final formula of Theorem 2 by substituting (11) and (14) into (10), rearranging, and noticing that ||e^{iθ} − 1||² = sin(θ)² + (cos(θ) − 1)² = 2[1 − cos(θ)].

The following theorem provides an alternative representation of the Gini coefficient of a continuous non-negative random variable.

Theorem 3. (Fourier representation of the Gini coefficient in the continuous case.) Let X be a non-negative random variable with 0 < E[X] < ∞, probability density function p_X(x) and Fourier transform p̂_X(θ) = ∫_0^∞ p_X(x) e^{iθx} dx. Then

  G(X) = (1/(2π E[X])) ∫_{−∞}^{∞} (1 − p̂_X(θ) p̂_X(−θ)) / θ² dθ.   (15)

Since p̂_X(−θ) is the complex conjugate of p̂_X(θ), the numerator can equivalently be written as 1 − ||p̂_X(θ)||².

Proof. The main idea in this proof is exactly the same as in the discrete case, though we need to define the functions f_X and g_X in a slightly different way. Consider the convolution of f_X(x) = p_X(x) and g_X(x) = P(X* ≥ −x), where g_X is defined for negative values of x. Applying analogous reasoning as in the proof of Theorem 2, we get

  G(X) = (f_X ⋆ g_X)(0) = (1/2π) ∫_{−∞}^{∞} f̂_X(θ) ĝ_X(θ) dθ.   (16)

We again compute the corresponding Fourier transforms. We have

  ĝ_X(θ) = ∫_0^∞ P(X* ≥ z) e^{−iθz} dz = (1/E[X]) ∫_0^∞ P(X > u) ∫_0^u e^{−iθz} dz du,

where the last equality follows after changing the order of integration and the inner integral equals (i/θ)(e^{−iθu} − 1). We substitute P(X > u) = ∫_u^∞ f_X(y) dy and change the order of integration again to obtain an expression in terms of p̂_X(−θ), which simplifies to

  ĝ_X(θ) = (1 − p̂_X(−θ)) / (θ² E[X]) − i/θ.   (17)

We obtain (15) by substituting (17) in (16) and noting that f̂_X(θ) = p̂_X(θ).
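As a quick numerical sanity check of the two Fourier representations, the sketch below (again illustrative only, assuming SciPy; not code from the paper) evaluates (7) for an integer-valued example and (15) for a continuous example, and compares the results with values obtained directly from definition (1).

import numpy as np
from scipy import integrate, stats

def gini_discrete(char_fn, mean):
    # Formula (7); the integrand is even in theta and has a finite limit at theta = 0,
    # so we integrate over (eps, pi] and double the result.
    f = lambda t: (1 - abs(char_fn(t)) ** 2) / (1 - np.cos(t))
    value, _ = integrate.quad(f, 1e-6, np.pi)
    return 2 * value / (4 * np.pi * mean)

def gini_continuous(char_fn, mean):
    # Formula (15), using the same symmetry trick.
    f = lambda t: (1 - abs(char_fn(t)) ** 2) / t ** 2
    value, _ = integrate.quad(f, 1e-6, np.inf)
    return 2 * value / (2 * np.pi * mean)

lam = 2.0
poisson_cf = lambda t: np.exp(lam * (np.exp(1j * t) - 1))
# direct evaluation of (1) for Poisson(2) via a truncated double sum over the pmf
j = np.arange(60)
pmf = stats.poisson.pmf(j, lam)
direct = np.einsum('i,ij,j->', pmf, np.abs(j[:, None] - j[None, :]), pmf) / (2 * lam)
print('Poisson(2):', gini_discrete(poisson_cf, lam), 'direct:', direct)

expon_cf = lambda t: 1 / (1 - 1j * t)   # Exp(1); the exact Gini coefficient is 1/2
print('Exp(1):    ', gini_continuous(expon_cf, 1.0))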
3 Examples

In this section, we apply our expressions to the computation of G for various distributions. In some cases the examples are new, and in other cases our expressions provide easier derivations.

3.1 The Poisson distribution

Take X ∼ Poisson(λ). The characteristic function equals p̂_X(θ) = exp(λ(e^{iθ} − 1)). Consequently, p̂_X(−θ) = exp(λ(e^{−iθ} − 1)), and we substitute these results into (6) to obtain

  G(X) = (1/(4πλ)) ∫_{−π}^{π} (1 − e^{λ(e^{iθ}−1)} e^{λ(e^{−iθ}−1)}) / (1 − cos θ) dθ.

Alternatively, we could use (7). We compute ||p̂_X(θ)||² = ||exp(λ(e^{iθ} − 1))||² = e^{−2λ} ||e^{λ(cos θ + i sin θ)}||² = e^{−2λ(1−cos θ)}. The size-biased formula G(X) = P(X*_d ≥ X) does not seem to be convenient in this example. An alternative representation of G(X) in terms of Bessel functions can be found in [16].

3.2 The exponential distribution

Take X ∼ Exp(λ). Since P(X > z)/E[X] = λe^{−λz}, the size-biased variable X* defined in (2) has the same distribution as X, and Theorem 1 immediately gives G(X) = P(X* ≥ X) = 1/2; compare [13], which studies estimators of the Gini index for the exponential distribution.

3.3 The geometric distribution

Take X ∼ Geom(p) with P(X = j) = p(1 − p)^j for j ∈ N_0. Then

  G(X) = 1/(2 − p).   (18)

Proof. Note that X*_d has the same distribution as X and apply Theorem 1 (in the discrete form of Proposition 1): G(X) = P(X*_d ≥ X) = Σ_{j≥0} p(1−p)^j (1−p)^j = 1/(2 − p).

In the case of the geometric distribution, the Fourier representation is also easy to compute and yields the same result as (18). Alternatively, one can consider a shifted version of the geometric distribution following [8]: take Y ∼ Geom'(p) with P(Y = j) = p(1 − p)^{j−1} supported on {1, 2, 3, ...}, for which the same mean absolute difference divided by the larger mean gives G(Y) = (1 − p)/(2 − p).

3.4 The Pareto distribution

Take X ∼ Pareto(α) with f_X(x) = α x_m^α / x^{α+1} for x ≥ x_m, so that E[X] = α x_m/(α − 1), and assume α > 1. We first compute P(X* ≥ x) for x ≥ x_m:

  P(X* ≥ x) = (1/E[X]) ∫_x^∞ P(X > z) dz = ((α − 1)/(α x_m)) ∫_x^∞ (x_m/z)^α dz = (1/α)(x_m/x)^{α−1}.

Hence, by Theorem 1,

  G(X) = P(X* ≥ X) = ∫_{x_m}^∞ (1/α)(x_m/x)^{α−1} f_X(x) dx = 1/(2α − 1).

This is not a new result, but its derivation is quite simple; compare e.g. [8, Example 1].

3.5 The uniform distribution

Take X ∼ Unif(a, b) with 0 ≤ a < b. We have not been able to find this explicit example in the literature. Applying (2), substituting E[X] = (a + b)/2 and changing the order of integration yields

  G(X) = (b − a) / (3(a + b)).

3.6 The negative binomial distribution

Take X ∼ NB(k, p) with

  P(X = j) = C(k+j−1, j) p^k (1 − p)^j,  j ∈ N_0,

so that E[X] = k(1 − p)/p. A classical closed-form expression for G(X), which we refer to as (19), can be obtained from [16]: formula (2.11) there gives the mean absolute difference of the negative binomial, which one divides by twice the mean. The complexity of this expression was in fact our main motivation to develop new representations. Because of its alternating signs, the formula is not very stable numerically and thus not handy to work with. The representation in terms of the size-biased random variable does not seem very helpful either: the computation leads to a rather non-intuitive formula containing three infinite sums. Instead, we apply Theorem 2 to obtain the following representation.

Theorem 4. Let X ∼ NB(k, p). Then

  G(X) = (p / (4π k (1 − p))) ∫_{−π}^{π} [1 − (p² / (p² + 2(1 − p)(1 − cos θ)))^k] / (1 − cos θ) dθ.   (20)

Remark 2. Note that since C(k+j−1, j) can be interpreted as Γ(k+j)/(Γ(j+1)Γ(k)), (20) also applies to non-integer k.

Proof. We want to apply Theorem 2 to the negative binomial variable X. Since

  p̂_X(θ) = (p / (1 − (1 − p)e^{iθ}))^k   (21)

and

  p̂_X(−θ) = (p / (1 − (1 − p)e^{−iθ}))^k,   (22)

substituting (21) and (22) together with E[X] = k(1 − p)/p into (6) yields

  G(X) = (p / (4π k (1 − p))) ∫_{−π}^{π} [1 − p̂_X(θ) p̂_X(−θ)] / (1 − cos θ) dθ.   (23)

We investigate the product of p̂_X(θ) and p̂_X(−θ). We have that

  (pe^{iθ} − e^{iθ} + 1)(e^{iθ} + p − 1) = pe^{2iθ} + p²e^{iθ} − 2pe^{iθ} − e^{2iθ} + 2e^{iθ} + p − 1 = e^{iθ}[p² − 2p + 2 + (p − 1)(e^{iθ} + e^{−iθ})] = e^{iθ}[p² + 2(p − 1)(cos θ − 1)].

Therefore

  p² e^{iθ} / ((pe^{iθ} − e^{iθ} + 1)(e^{iθ} + p − 1)) = p² / (p² + 2(p − 1)(cos θ − 1)),

so that p̂_X(θ) p̂_X(−θ) = (p² / (p² + 2(1 − p)(1 − cos θ)))^k. After plugging the above computation into (23) and rearranging, we arrive at the expression (20).

In the next section, we discuss the asymptotic behavior of the Gini coefficient of the negative binomial distribution as k → 0 and k → ∞.
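The integral in (20) is straightforward to evaluate numerically. The following sketch (assuming SciPy; an illustration, not the authors' code) evaluates it on a grid of k for a fixed p; the value at k = 1 can be compared with the geometric case (18), and the output previews the monotone behavior in k discussed in Section 4.

import numpy as np
from scipy import integrate

def gini_nb(k, p):
    # Representation (20); the integrand has the finite limit 2k(1-p)/p^2 at theta = 0,
    # so we integrate over (eps, pi] and use the symmetry of the integrand.
    def f(t):
        ratio = p ** 2 / (p ** 2 + 2 * (1 - p) * (1 - np.cos(t)))
        return (1 - ratio ** k) / (1 - np.cos(t))
    value, _ = integrate.quad(f, 1e-6, np.pi)
    return 2 * value * p / (4 * np.pi * k * (1 - p))

p = 0.3
print('k = 1:', gini_nb(1, p), ' geometric value 1/(2-p):', 1 / (2 - p))
for k in (0.05, 0.2, 1.0, 5.0, 50.0):
    print(f'k = {k:>5}: G = {gini_nb(k, p):.4f}')

The computed values decrease in k, in line with the behavior discussed in Section 4.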
4 Applications to epidemiology

In epidemiology, the basic reproduction number R_0 denotes the expected number of new infections directly generated by one case. If R_0 > 1, the epidemic is growing, and the epidemic is dying out when R_0 < 1. It has been determined that transmission patterns of SARS-CoV-2 and many other viruses are highly overdispersed (see [5] or [7]). Looking at R_0 alone might therefore be misleading. In the case of COVID-19 in particular, it is estimated that more than 80% of new infections resulted from the top 10% of most infectious individuals (see [1], [14] or [15]). This means that the vast majority of infected individuals will not pass the infection on to anyone. R_0 should therefore be complemented with another statistic that sheds light on possible overdispersion in the number of secondary cases. To this end, the authors in [10] assume a parametric setting, in which the number of secondary cases follows a negative binomial distribution with parameters k and R_0, which results in its variance being equal to R_0(1 + R_0/k). In such a setting, the smaller the value of k, the greater the heterogeneity in the distribution of secondary infections. Despite coming from a very particular model, this interpretation became so popular that multiple papers adopted the symbol k and referred to it as an 'overdispersion parameter' (see for instance [1], [4], [9] or [11]). Even the non-mathematical press followed this trend, calling k the key to overcoming the COVID-19 pandemic (see [18]).

In this section, we investigate the relation between k and the Gini index. We show numerically that the Gini index, when specialized to the negative binomial distribution, depends monotonically on k, and complement this with rigorous asymptotic estimates for small and large k. This indicates that the Gini index provides insights consistent with the ones obtained so far, without the need to rely on a parametric model. Thus, the Gini index can serve as a candidate to measure the variability in the infectiousness of a disease.

This section is organised as follows: in Section 4.1 we first enable a continuous interpretation of the parameter k of the negative binomial distribution by expressing it as a composition of Poisson and Gamma processes. Next we investigate the monotonicity of the Gini coefficient in k; to that end, we illustrate its behavior in plots and support them with statements about the asymptotics of the Gini coefficient for k → 0 and k → ∞. In Section 4.2 we show for a real-world data set that the Gini coefficient provides insights consistent with using a parametric model based on k.

4.1 Behavior of the Gini coefficient of the negative binomial distribution in k

In Section 3.6 we showed that the Gini index for the negative binomial distribution cannot easily be obtained from (1), nor from its size-biased representation P(X*_d ≥ X). In this section, we investigate the Gini index for small and large values of k. For that purpose it is useful to first rewrite the negative binomial distribution in terms of a composition of Poisson and Gamma processes, which has the added benefit that the resulting interpretation is natural for non-integer values of k. Let X follow a Poisson(ν) distribution, where the rate ν is itself a random variable following a Gamma(k, λ) distribution. To maintain a clear connection to epidemiological notation, we denote E[ν] = R_0.

Lemma 1. For X as above, it holds that X ∼ NB(k, p) with parameter k, corresponding to the number of 'successes', and parameter p = λ/(λ+1) denoting the probability of success.

The proof of Lemma 1 is straightforward and is hence omitted; indeed, integrating the Poisson probability mass function against the Gamma(k, λ) density gives P(X = j) = ∫_0^∞ (e^{−ν} ν^j / j!)(λ^k ν^{k−1} e^{−λν}/Γ(k)) dν = (Γ(k+j)/(j! Γ(k))) (λ/(λ+1))^k (1/(λ+1))^j for j ∈ N_0. Note that p = (1 + R_0/k)^{−1}. In the following, we will write Γ_k for a Gamma process evaluated so that Γ_k ∼ Gamma(k, λ), and N(Γ_k) to denote a subordinated Poisson process with rate following this Gamma process.

With this new, continuous interpretation in mind, we take a closer look at the behaviour of the Gini coefficient as k increases. To create some visual intuition, Figure 1 plots the Gini coefficient as a function of k for different values of p = λ/(λ+1). Figure 1 suggests that the Gini coefficient of the negative binomial distribution is decreasing in k, and that G(NB(k, p)) approaches 1 as k approaches 0 and approaches 0 as k approaches infinity. We formulate these results in the following two theorems.

Theorem 5. (Asymptotics of the Gini coefficient for k small.) Let X ∼ NB(k, p). Then, for k → 0,

  G(X) = 1 + ck + o(k),   (24)

with c = 2 log(λ/(λ+1)) − (λ+2) log((λ² + 2λ)/(λ+1)²) = 2 log(p) − ((2 − p)/(1 − p)) log(p(2 − p)).
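To illustrate Theorem 5 (in the spirit of Figure 2), one can compare the expansion (24) with the Gini coefficient computed directly from the negative binomial probability mass function. The sketch below assumes NumPy/SciPy, truncates the pmf at an arbitrary level, and is an illustration rather than the paper's own code.

import numpy as np
from scipy import stats

def gini_from_pmf(pmf):
    # Direct evaluation of definition (1) for an N_0-valued distribution.
    support = np.arange(len(pmf))
    mean = np.dot(support, pmf)
    diff = np.abs(support[:, None] - support[None, :])
    return np.einsum('i,ij,j->', pmf, diff, pmf) / (2 * mean)

def gini_small_k(k, p):
    # The expansion (24): G = 1 + c*k + o(k).
    lam = p / (1 - p)
    c = 2 * np.log(lam / (lam + 1)) - (lam + 2) * np.log((lam ** 2 + 2 * lam) / (lam + 1) ** 2)
    return 1 + c * k

p = 0.06                      # an illustrative choice of the success probability
support = np.arange(2000)     # truncation level; the neglected tail mass is negligible here
for k in (0.01, 0.05, 0.1, 0.2):
    pmf = stats.nbinom.pmf(support, k, p)
    print(f'k = {k:>4}: exact {gini_from_pmf(pmf):.4f}   1 + c*k {gini_small_k(k, p):.4f}')

As expected, the two values get closer as k decreases.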
Theorem 6. (Asymptotic behavior of the Gini coefficient for k large.) Let X ∼ NB(k, p). Then, for k → ∞,

  G(X) = (1 + o(1)) √((λ+1)/(π k)) = (1 + o(1)) / √(π k (1 − p)).   (25)

Proofs of these theorems are given in the Appendix. We validate the asymptotic estimates in the theorems by plotting the Gini coefficient and its limiting functions against each other; see Figure 2 for k → 0 and Figure 3 for k → ∞.

4.2 A real-world example

In this section we investigate how our formulas perform on a real-world example. Heterogeneity in the number of secondary cases has received significant attention during the ongoing COVID-19 pandemic. The study [11] aims to quantify overdispersion by estimating k. We use the data provided there to first compute the Gini coefficient explicitly (by (1)) and then calculate it using the formulas derived in this paper; for the latter, we substituted the parameters p and k obtained by the authors of [11]. Table 1 presents all results. These results show that our formulas come very close to the actual value of the Gini coefficient. It is also apparent that they reflect the dynamics of the epidemic: lower values of k, associated with higher overdispersion, correspond to higher values of the Gini coefficient.

  Location                              Jakarta-Depok   Batam
  R_0                                   6.79            2.47
  k                                     0.06            0.2
  p                                     0.008           0.06
  G(X) computed explicitly from data    0.9213411       0.83191721
  G(X) computed with (6)                0.9269144       0.8151066
  G(X) computed with (24)               0.9193061       0.7623808

  Table 1: Secondary cases data summary for Jakarta-Depok and Batam.
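For completeness, the row "G(X) computed explicitly from data" is simply the sample version of (1). A minimal sketch is given below; it uses the convention that pairs are formed with repetition, and the data vector is a hypothetical placeholder rather than the data of [11].

import numpy as np

def empirical_gini(counts):
    # Sample analogue of (1): mean absolute difference divided by twice the sample mean.
    x = np.asarray(counts, dtype=float)
    return np.abs(x[:, None] - x[None, :]).mean() / (2 * x.mean())

secondary_cases = np.array([0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 11, 0, 0, 3])   # placeholder counts
print(empirical_gini(secondary_cases))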
5 Conclusion

Models based on the negative binomial distribution of secondary infections have long been used to estimate overdispersion in the spread of infectious diseases. However, this approach necessitates parametric models suitable only for specific situations. We argue that it is preferable to use the Gini coefficient to measure overdispersion. It correctly captures the dynamics of an epidemic and is arguably more informative than the hitherto popular parameter k, as it is a genuine measure of variability and is not bound to any specific probability distribution. Thanks to the new representations presented in this paper, we can express the Gini coefficient as a simple function of the parameters of the distribution for multiple probability distributions, as long as they have a finite expected value, making the Gini index a versatile statistic that does not depend on parametric assumptions. In particular, it can be useful in the case of heavy-tailed distributions where the variance might be infinite; in [6] the authors present methods that allow estimating the Gini coefficient for heavy-tailed data with infinite variance. The applications of the newly developed representations need not be limited to epidemiology. The Gini coefficient has been advocated as an indicator in many other areas, and we hope our results can help widen its usage. See for example [3], where the Gini coefficient is suggested as a measure of heart rate variability to assess the level of mental stress, or [19], where the Gini coefficient is applied to explain inequalities in resource use around the globe.

Appendix

We begin by proving two auxiliary propositions. Throughout, X_k ∼ Poisson(Γ(k, λ)) and X*_k denotes the discrete size-biased version of X_k.

Proposition 4. (Asymptotic behavior of P(X_k = j).) When k → 0,

  P(X_k = 0) = 1 + k log(λ/(λ+1)) + o(k),   (26)
  P(X_k = j) = c_j k + o(k),  j ≥ 1,   (27)

where c_j = 1/(j(λ+1)^j). Moreover, E[X_k] = k/λ.

Proof. For X_k ∼ Poisson(Γ(k, λ)),

  P(X_k = j) = ∫_0^∞ (e^{−ν} ν^j / j!)(λ^k ν^{k−1} e^{−λν} / Γ(k)) dν = (λ^k / (λ+1)^{j+k}) Γ(j+k) / (j! Γ(k)).

Hence, for k → 0,

  P(X_k = 0) = (λ/(λ+1))^k = e^{−k log((λ+1)/λ)} = 1 + k log(λ/(λ+1)) + o(k),

where we apply e^{−ak} = 1 − ak + o(k) in the second equality. Further, applying the same asymptotic formula together with Γ(k) ∼ 1/k and Γ(j+k) = Γ(j) + O(k) (see [2, 6.1.35]), we obtain, for j ≥ 1,

  P(X_k = j) = (λ/(λ+1))^k (λ+1)^{−j} Γ(j+k)/(j! Γ(k)) = k/(j(λ+1)^j) + o(k),

which yields (27). Finally, E[X_k] = E[Γ_k] = k/λ.

Proposition 5. (Asymptotic behavior of P(X*_k ≥ j).) For k → 0 and j ≥ 1,

  P(X*_k ≥ j) → λ Σ_{m>j} (m − j) c_m.   (28)

Proof. We first compute, for l ≥ 1 and k → 0,

  P(X*_k = l) = P(X_k > l)/E[X_k] = (λ/k) Σ_{m>l} P(X_k = m) → λ Σ_{m>l} c_m,

where we have substituted (27). Hence, in a similar manner,

  P(X*_k ≥ j) = Σ_{l≥j} P(X*_k = l) → λ Σ_{l≥j} Σ_{m>l} c_m = λ Σ_{m>j} (m − j) c_m,

from which the statement follows.

Now we can proceed with the proof of the main theorem.

Proof of Theorem 5. Since P(X*_k ≥ 0) = 1 we write, for all N,

  G(X_k) = P(X*_k ≥ X_k) = P(X_k = 0) + Σ_{j=1}^{N} P(X_k = j) P(X*_k ≥ j) + Σ_{j>N} P(X_k = j) P(X*_k ≥ j),

and substitute (26) and (27) into the first two terms; taking N → ∞ then yields an upper bound for the lim sup of (G(X_k) − 1)/k as k → 0 (display (29)). On the other hand, for all N we obtain a matching lower bound for the lim inf, where we apply P(X_k = j) ≤ 1 and Markov's inequality to control the tail of the sum; letting N → ∞ gives (30). Combining (30) and (29), we can write

  G(X_k) = 1 + k [ log(λ/(λ+1)) + λ Σ_{j≥1} c_j Σ_{m>j} (m − j) c_m ] + o(k).   (31)

We now compute the double sum

  λ Σ_{j≥1} Σ_{m>j} (m − j) c_j c_m,   (32)

which, using (m − j)/(jm) = 1/j − 1/m, splits into two summands. The first summand in (32) is easily computable and equals −log((λ² + 2λ)/(λ+1)²). The second summand can be computed by interchanging the order of summation, and equals (λ+1) log((λ² + 2λ)/(λ+1)²) − log(λ/(λ+1)). Hence, (31) becomes

  G(X_k) = 1 + k [ 2 log(λ/(λ+1)) − (λ+2) log((λ² + 2λ)/(λ+1)²) ] + o(k),

which can also be written as G(X_k) = 1 + k [ 2 log(p) − ((2 − p)/(1 − p)) log(p(2 − p)) ] + o(k) when we take p = λ/(λ+1).

We first prove the following proposition.

Proposition 6. As k → ∞, λ(X_k − k/λ)/√(k(λ+1)) converges in distribution to a standard normal random variable.

Proof. We can write X_k − k/λ = (N(Γ_k) − Γ_k) + (Γ_k − k/λ). Denote Γ̃_k = (λΓ_k − k)/√k. From the central limit theorem, it follows that P(Γ̃_k ≥ x) → Φ̄(x) as k → ∞, so that the second term, divided by √k, converges in distribution to a centered normal random variable with variance 1/λ². Furthermore, by the strong law of large numbers, Γ_k/k → 1/λ almost surely, and thus √(Γ_k/k) → 1/√λ. Finally, Γ_k → ∞ a.s. as k → ∞ and, again by the central limit theorem, we know that for a Poisson process N(t) the quantity (N(t) − t)/√t converges in distribution to a standard normal random variable W_1 as t → ∞. We conclude that also (N(Γ_k) − Γ_k)/√(Γ_k) → W_1 in distribution as k → ∞, contributing variance k/λ after rescaling. Adding the two variance contributions, k/λ² + k/λ = k(λ+1)/λ², the claim follows.

Proof of Theorem 6. Recall X_k = N(Γ_k), let X'_k be an independent copy of X_k, and set X̃_k = λ(X_k − k/λ)/√(k(λ+1)), with X̃'_k defined analogously from X'_k. By (1) and E[X_k] = k/λ,

  G(X_k) = E[|X_k − X'_k|]/(2k/λ) = (1/2) √((λ+1)/k) E[|X̃_k − X̃'_k|].

Then, for fixed M, we split E[|X̃_k − X̃'_k|] into the contribution of the event {|X̃_k − X̃'_k| ≤ M} (part I) and that of the complementary event (part II). We study parts I and II separately, starting with I. Letting first k and then M tend to infinity, by Proposition 6 we arrive at E[|X̃ − X̃'|], where X̃ and X̃' are independent standard normal random variables. Write Y = X̃ − X̃'. Then Y ∼ N(0, 2) and E[|Y|] = 2/√π; this gives (33). For term II, one shows that its contribution vanishes as first k and then M tend to infinity, which gives (34). Combining (33) and (34) we obtain E[|X̃_k − X̃'_k|] → 2/√π, and therefore the desired claim (25) from Theorem 6.

Acknowledgement. The work of MM is supported by a Marie Skłodowska-Curie Action from the EC (Cofund grant no. 945045) and by the Netherlands Organisation for Scientific Research (NWO) through the Gravitation Networks grant 024.002.003. The work of RvdH is supported in part by the NWO through the Gravitation Networks grant 024.002.003.

References

[1] Estimating the overdispersion in COVID-19 transmission using outbreak sizes outside China.
[2] Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables.
[3] Introduction of Application of Gini Coefficient to Heart Rate Variability Spectrum for Mental Stress Evaluation.
[4] Superspreading and heterogeneity in transmission of SARS, MERS, and COVID-19: A systematic review.
[5] Tail Risk of Contagious Diseases.
[6] Gini estimation under infinite variance.
[7] Evidence that coronavirus superspreading is fat-tailed.
[8] A Formula for the Gini Coefficient.
[9] Lockdowns exert selection pressure on overdispersion of SARS-CoV-2 variants.
[10] Superspreading and the effect of individual variation on disease emergence.
[11] Superspreading in early transmissions of COVID-19 in Indonesia.
[12] Superspreading and the Gini Coefficient.
[13] Distributions and moments for estimators of Gini index in an exponential distribution.
[14] Full genome viral sequences inform patterns of SARS-CoV-2 spread into and within Israel.
[15] Superspreading of SARS-CoV-2 in the USA.
[16] The Mean Difference and the Mean Deviation of Some Discontinuous Distributions.
[18] This Overlooked Variable Is the Key to the Pandemic.
[19] Sharing resources: The global distribution of the Ecological Footprint.
[20] More Than a Dozen Alternative Ways of Spelling Gini.