Title: A Maximum Entropy Copula Model for Mixed Data: Representation, Estimation, and Applications
Author: Subhadeep Mukhopadhyay
Date: 2021-08-21

Abstract. A new nonparametric model of maximum-entropy (MaxEnt) copula density function is proposed, which offers the following advantages: (i) it is valid for mixed random vectors; by 'mixed' we mean that the method works for any combination of discrete and continuous variables in a fully automated manner; (ii) it yields a bona fide density estimate with interpretable parameters; by 'bona fide' we mean that the estimate is guaranteed to be a non-negative function that integrates to 1; and (iii) it plays a unifying role in our understanding of a large class of statistical methods.

A new class of nonparametric maximum-entropy (MaxEnt) copula models is introduced. The proposed statistical formalism has the following advantages. First, it yields a bona fide (smooth, non-negative, and integrating to 1) copula density estimate with interpretable parameters that provide insight into the nature of the association between the random variables $(X, Y)$. Second, the method is data-type agnostic, which is to say that it automatically adapts to mixed data types (any combination of discrete, continuous, or even categorical). Third, and most notably, our copula representation theory subsumes and unifies a wide range of statistical learning methods using a common mathematical notation, unlocking deep, surprising connections and insights that were previously unknown.

In this section, we introduce two new maximum-entropy (MaxEnt) copula density models. Before diving into the technical details, it will be instructive to review some basic definitions and concepts related to copulas.

Sklar's Copula Representation Theory (Sklar, 1959). The joint cumulative distribution function (cdf) of any pair of random variables $(X,Y)$,
$$F_{X,Y}(x,y) = \Pr(X \le x,\, Y \le y), \quad (x,y) \in \mathbb{R}^2,$$
can be decomposed as a function of the marginal cdfs $F_X$ and $F_Y$:
$$F_{X,Y}(x,y) = \mathrm{Cop}_{X,Y}\big(F_X(x),\, F_Y(y)\big), \quad (x,y) \in \mathbb{R}^2, \qquad (2.1)$$
where $\mathrm{Cop}_{X,Y}$ denotes a copula distribution function with uniform marginals.

To set the stage, we start with the case of continuous marginals, which will be generalized later to allow mixed $(X,Y)$. Taking derivatives of Eq. (2.1), we get
$$f_{X,Y}(x,y) = f_X(x)\, f_Y(y)\, \mathrm{cop}_{X,Y}\big(F_X(x),\, F_Y(y)\big), \quad (x,y) \in \mathbb{R}^2, \qquad (2.2)$$
which decouples the joint density into the marginals and the copula. One can rewrite Eq. (2.2) to represent the copula as a "normalized" joint density function
$$\mathrm{cop}_{X,Y}\big(F_X(x),\, F_Y(y)\big) := \mathrm{dep}_{X,Y}(x,y) = \frac{f_{X,Y}(x,y)}{f_X(x)\, f_Y(y)}, \qquad (2.3)$$
which is also known as the dependence function, pioneered by Hoeffding (1940). To make (2.3) a proper density function (i.e., one that integrates to one over the unit square), we perform the quantile transformation, substituting $F_X(x) = u$ and $F_Y(y) = v$:
$$\mathrm{cop}_{X,Y}(u,v) = \frac{f_{X,Y}\big(Q_X(u),\, Q_Y(v)\big)}{f_X\big(Q_X(u)\big)\, f_Y\big(Q_Y(v)\big)}, \quad (u,v) \in [0,1]^2, \qquad (2.4)$$
where $Q_X$ and $Q_Y$ denote the respective quantile functions. We are now ready to extend this copula density concept to the mixed $(X,Y)$ case.
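To make the quantile transformation concrete, here is a minimal R sketch (not from the paper) that maps a toy continuous sample to the copula scale via its empirical cdf; the simulated x and y are purely illustrative.

```r
# Quantile (rank) transform: u = F~_X(x), v = F~_Y(y), the copula scale.
set.seed(1)
x <- rnorm(200); y <- x + rnorm(200)   # toy continuous pair
u <- ecdf(x)(x)                        # u_i = F~_X(x_i), rescaled ranks
v <- ecdf(y)(y)                        # v_i = F~_Y(y_i)
plot(u, v, pch = 16, cex = 0.5,
     main = "Data on the copula (rank) scale")
```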
Pre-Copula: Conditional Comparison Density (Parzen and Mukhopadhyay, 2013). Before we introduce the generalized copula density, we need a new concept: the conditional comparison density (CCD). For a continuous random variable $X$, define
$$d(u;\, X,\, X|Y=y) = \frac{f_{X|Y}\big(F_X^{-1}(u) \,|\, y\big)}{f_X\big(F_X^{-1}(u)\big)}, \quad 0 < u < 1. \qquad (2.5)$$
For $Y$ discrete, we represent it using the probability mass function (pmf):
$$d(v;\, Y,\, Y|X=x) = \frac{\Pr\big(Y = Q_Y(v) \,|\, X=x\big)}{\Pr\big(Y = Q_Y(v)\big)}, \quad 0 < v < 1, \qquad (2.6)$$
where $Q_Y(v)$ is the quantile function of $Y$. It is easy to see that the CCDs (2.5) and (2.6) are proper densities, in the sense that
$$\int_0^1 d(u;\, X,\, X|Y=y)\, \mathrm{d}u \;=\; \int_0^1 d(v;\, Y,\, Y|X=x)\, \mathrm{d}v \;=\; 1.$$

Generalized Copula Representation Theory. For the mixed case, when $Y$ is discrete and $X$ is continuous, the joint density of (2.3) is defined by either side of the following identity:
$$\Pr(Y=y \,|\, X=x)\, f_X(x) = f_{X|Y}(x|y)\, \Pr(Y=y). \qquad (2.7)$$
This can be rewritten as the ratios of the conditionals to their respective marginals:
$$\text{Pre-Bayes' Rule:} \quad \frac{f_{X|Y}(x|y)}{f_X(x)} = \frac{\Pr(Y=y \,|\, X=x)}{\Pr(Y=y)}. \qquad (2.8)$$
This formula (2.8) can be interpreted as giving the slices of the mixed copula density, since
$$\mathrm{cop}_{X,Y}\big(F_X(x),\, F_Y(y)\big) = \frac{f_{X|Y}(x|y)}{f_X(x)} = \frac{\Pr(Y=y \,|\, X=x)}{\Pr(Y=y)}. \qquad (2.9)$$
Substituting $F_X(x)=u$ and $F_Y(y)=v$, we obtain the following definition of the generalized copula density in terms of conditional comparison densities:
$$\mathrm{cop}_{X,Y}(u,v) = d\big(u;\, X,\, X|Y = Q_Y(v)\big) = d\big(v;\, Y,\, Y|X = Q_X(u)\big), \quad 0 < u, v < 1. \qquad (2.10)$$
Bayes' theorem ensures the equality of the two CCDs, with the copula being their common value. Based on this foundation, we are now ready to develop the MaxEnt copula modeling theory.

An exponential Fourier series representation of the copula density function is given next. Entertaining an exponential model for the copula is motivated from two different perspectives.

The Problem of Unboundedness. One peculiar aspect of the copula density function is that it can be unbounded at the corners of the unit square. In fact, many common parametric copula families (Gaussian, Clayton, Gumbel, etc.) tend to infinity at the boundaries. So the question naturally arises: how can one develop approximation methods that accommodate a broad class of copula density shapes, including unbounded ones? The first key insight is that the logarithm of the copula density is far more convenient to approximate (owing to its well-behaved nature) than the density itself. We thus expand $\log \mathrm{cop}_{X,Y}$ in a Fourier series, instead of performing a canonical $L^2$ approximation that expands $\mathrm{cop}_{X,Y}$ directly in an orthogonal series (Mukhopadhyay and Parzen, 2020). For densities with rapidly changing tails, the 'log-Fourier' method leads to an improved estimate that is less wiggly and more parsimonious than the $L^2$-orthogonal series model. In addition, the resulting exponential form guarantees the non-negativity of the estimated density function.

Choice of Orthonormal Basis. We choose the nonparametric LP-polynomial basis (see Appendix A.1) to expand the log-copula density function. The appeal of the LP bases lies in their ability to approximate the quirky shapes of mixed-copula functions in a completely automated way; see Fig. 1. Consequently, they provide a unified way to develop nonparametric smoothing algorithms that simultaneously hold for mixed data types.

Definition 1. The exponential copula model admits the following LP-expansion:
$$\mathrm{cop}_\theta(u,v) = \frac{1}{Z_\theta} \exp\Big\{ \sum_{j,k} \theta_{jk}\, S_j(u;\, X)\, S_k(v;\, Y) \Big\}, \qquad (2.11)$$
where $Z_\theta$ is the normalization factor that ensures $\mathrm{cop}_\theta$ is a proper density:
$$Z_\theta = \iint_{[0,1]^2} \exp\Big\{ \sum_{j,k} \theta_{jk}\, S_j(u;\, X)\, S_k(v;\, Y) \Big\}\, \mathrm{d}u\, \mathrm{d}v. \qquad (2.12)$$
We refer to (2.11) as the log-bilinear copula model.
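As an illustration of Definition 1, the following R sketch evaluates the log-bilinear model (2.11) and computes $Z_\theta$ by brute-force numerical integration. It assumes continuous marginals, for which the unit LP bases $S_j(u)$ reduce to orthonormal shifted Legendre polynomials on $[0,1]$; all function names here are hypothetical, not the paper's implementation.

```r
# Unit LP bases for continuous marginals: orthonormal shifted Legendre polys.
S1 <- function(u) sqrt(12) * (u - 1/2)
S2 <- function(u) sqrt(5)  * (6*u^2 - 6*u + 1)

# Unnormalized log-density sum_jk theta[j,k] S_j(u) S_k(v)
log_cop_raw <- function(u, v, theta) {
  Su <- cbind(S1(u), S2(u)); Sv <- cbind(S1(v), S2(v))
  rowSums((Su %*% theta) * Sv)
}

# Normalizing constant Z_theta (2.12) by a grid average over [0,1]^2
Z_theta <- function(theta, ngrid = 200) {
  g  <- (1:ngrid - 0.5) / ngrid
  uv <- expand.grid(u = g, v = g)
  mean(exp(log_cop_raw(uv$u, uv$v, theta)))
}

theta <- matrix(c(-0.40, 0, 0, 0.18), 2, 2)  # kidney-data fit of Example 1
log(Z_theta(theta))   # compare with the additive constant -0.12 in (2.22)
```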
The Maximum Entropy Principle. Another justification for choosing the exponential model comes from the principle of maximum entropy (MaxEnt), pioneered by E. T. Jaynes (1957). The MaxEnt principle defines a unique probability distribution by maximizing the entropy $H(\mathrm{cop}) = -\iint \mathrm{cop}_{X,Y} \log \mathrm{cop}_{X,Y}$ under the normalization constraint $\iint \mathrm{cop}_{X,Y} = 1$ and the following LP-co-moment conditions:
$$\mathbb{E}\big[ S_j(U;\, X)\, S_k(V;\, Y) \big] = \mathrm{LP}_{jk}. \qquad (2.13)$$
The LP-co-means are orthogonal "moments" of the copula, which can be estimated by their sample counterparts
$$\widetilde{\mathrm{LP}}_{jk} = \frac{1}{n} \sum_{i=1}^n T_j(x_i;\, \widetilde F_X)\, T_k(y_i;\, \widetilde F_Y). \qquad (2.14)$$
Applying the calculus of variations, one can show that this MaxEnt constrained optimization problem leads to the exponential form (2.11). The usefulness of Jaynes' maximum entropy principle lies in providing a constructive mechanism for uniquely identifying a probability distribution that is maximally non-committal (flattest possible) with regard to all unspecified information beyond the given constraints.

Estimation. We fit a truncated exponential series estimator of the copula density with parameters $\{\theta_{jk}: j = 1,\ldots,m_1;\ k = 1,\ldots,m_2\}$. The task of finding the maximum likelihood estimate (MLE) of $\theta$ boils down to solving the following set of score equations for $j = 1,\ldots,m_1$ and $k = 1,\ldots,m_2$:
$$\frac{\partial}{\partial \theta_{jk}} \log Z_\theta = \widetilde{\mathrm{LP}}_{jk}. \qquad (2.15)$$
Note that the derivative of the log-partition function is equal to the expectation of the LP-co-mean functions:
$$\frac{\partial}{\partial \theta_{jk}} \log Z_\theta = \mathbb{E}_\theta\big[ S_j(U;\, X)\, S_k(V;\, Y) \big]. \qquad (2.16)$$
Substituting (2.16) and (2.14) into (2.15) shows that the MLE of the MaxEnt model coincides with the method-of-moments estimator satisfying the moment conditions
$$\mathbb{E}_{\hat\theta}\big[ S_j(U;\, X)\, S_k(V;\, Y) \big] = \widetilde{\mathrm{LP}}_{jk}. \qquad (2.17)$$
At this point, one can apply any convex optimization routine (e.g., Newton's method, gradient descent, stochastic gradient descent) to solve for $\hat\theta$.

Asymptotics. Let the sequence of $m_1$ and $m_2$ increase with the sample size at an appropriate rate, $(m_1 m_2)^3 / n \to 0$ as $n \to \infty$. Then, under suitable regularity conditions, the exponential $\mathrm{cop}_{\hat\theta}$ is a consistent estimate in the sense of Kullback-Leibler distance; see Barron and Sheu (1991) for details.

Determining Informative Constraints. Jaynes' maximum entropy principle assumes that a proper set of constraints (i.e., sufficient statistic functions) is given to the modeler, one that captures the phenomenon under study. This assumption may be legitimate when studying thermodynamic experiments in statistical mechanics or when specifying a prior distribution in Bayesian analysis, but certainly not when building empirical models. Which comes first: a parametric model or sufficient statistics? After all, the identification of significant components (sufficient statistics) is a prerequisite for constructing a legitimate probability model from the data. The question of how to judiciously design and select the constraints from data therefore seems inescapable for nonparametrically learning a MaxEnt copula density; see also Appendix A.3, which discusses the 'two cultures' of MaxEnt modeling. We address this issue as follows (an R sketch of these steps is given after this list):
(i) compute the $\widetilde{\mathrm{LP}}_{jk}$ using formula (2.14);
(ii) sort them in descending order of magnitude (absolute value);
(iii) compute the penalized ordered sum of squares
$$\mathrm{PenSum}(q) = \big(\text{sum of squares of the top } q \text{ LP-co-means}\big) \; - \; \frac{\gamma_n}{n}\, q.$$
For the AIC penalty choose $\gamma_n = 2$; for BIC choose $\gamma_n = \log n$; etc. Further details can be found in Mukhopadhyay and Parzen (2020, Sec. 4.3);
(iv) find the $q$ that maximizes $\mathrm{PenSum}(q)$, and store the selected indices $(j,k)$ in the set $\mathcal{I}$;
(v) carry out the MaxEnt optimization routine based only on the selected LP-sufficient-statistics constraints $\widetilde{\mathrm{LP}}_{jk}$, $(j,k) \in \mathcal{I}$.
This pruning strategy guards against overfitting. Finally, return the estimated reduced-order (with effective dimension $|\mathcal{I}|$) MaxEnt copula model.
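The selection rule above is easy to implement. Below is a minimal R sketch under assumed inputs: Tx and Ty are n x m matrices whose columns hold the empirical LP bases of x and y (Appendix A.1); the helper names are hypothetical.

```r
# Step (i): LP co-means (2.14) as a cross-moment matrix
lp_comeans <- function(Tx, Ty) crossprod(Tx, Ty) / nrow(Tx)

# Steps (ii)-(iv): penalized prefix selection; LP is the matrix of LP~_jk
select_constraints <- function(LP, n, gamma_n = log(n)) {   # BIC penalty
  lp  <- sort(abs(LP), decreasing = TRUE)                   # step (ii)
  pen <- cumsum(lp^2) - (gamma_n / n) * seq_along(lp)       # PenSum(q)
  q   <- which.max(pen)                                     # step (iv)
  # indices (j,k) forming the set I (ties may admit extra entries)
  which(abs(LP) >= lp[q], arr.ind = TRUE)
}
```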
Remark 1 (Nonparametric MaxEnt). The proposed nonparametric MaxEnt mechanism produces a copula density estimate that is flexible (it can adapt to the 'shape of the data' without making risky a priori assumptions) and yet possesses a compact analytical form.

We now provide a second parameterization of the copula density.

Definition 2. The log-linear orthogonal expansion of the LP-copula is given by
$$\mathrm{cop}_\mu(u,v) = \frac{1}{Z_\mu} \exp\Big\{ \sum_{k \ge 1} \mu_k\, \phi_k(u)\, \psi_k(v) \Big\}, \qquad (2.18)$$
with orthonormal spectral functions $\phi_k$ and $\psi_k$. We call the parameters $\mu_k$ of this model "log-linear LP-correlations"; they satisfy $\mu_1 \ge \mu_2 \ge \cdots \ge 0$.

Connection. The two fundamental representations, namely the log-bilinear (2.11) and log-linear (2.18) copula models, share some interesting connections. To see this, perform the singular value decomposition (SVD) of the $\Theta$-matrix whose $(j,k)$th entry is $\theta_{jk}$:
$$\Theta = U\, \Omega\, V^{T},$$
where $u_{ij}$ and $v_{ij}$ are the elements of the singular vectors, with singular values $\mu_1 \ge \mu_2 \ge \cdots \ge 0$. Then the spectral bases can be expressed as linear combinations of the LP-polynomials:
$$\phi_k(u) = \sum_j u_{jk}\, S_j(u;\, X), \qquad \psi_k(v) = \sum_j v_{jk}\, S_j(v;\, Y). \qquad (2.20\text{--}2.21)$$
Hence, the LP-spectral functions (2.20)-(2.21) satisfy the following orthonormality conditions:
$$\int_0^1 \phi_j(u)\, \phi_k(u)\, \mathrm{d}u \;=\; \int_0^1 \psi_j(v)\, \psi_k(v)\, \mathrm{d}v \;=\; \delta_{jk}.$$
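The following small R sketch, using the Example 1 coefficients, illustrates the Connection paragraph: an SVD of the fitted $\Theta$-matrix delivers the log-linear parameters $\mu_k$ and the combination weights that define the spectral functions in (2.20)-(2.21).

```r
# SVD of the fitted Theta turns the log-bilinear fit into log-linear form
theta <- matrix(c(-0.40, 0, 0, 0.18), 2, 2)   # kidney-data fit of Example 1
sv <- svd(theta)
mu <- sv$d            # intrinsic 'log-linear LP-correlations' mu_k (>= 0)
phi_coef <- sv$u      # phi_k(u) = sum_j phi_coef[j,k] S_j(u; X)
psi_coef <- sv$v      # psi_k(v) = sum_j psi_coef[j,k] S_j(v; Y)
mu                    # signs of theta are absorbed into the singular vectors
```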
We now demonstrate the flexibility of the LP-copula models using real data examples.

Example 1. Kidney fitness data (Efron and Hastie, 2016, Sec. 1.1). The data contain measurements on $n = 157$ healthy volunteers (potential donors). For each volunteer we have age (in years) and a composite measure "tot" of overall kidney function. To understand the relationship between age and tot, we estimate the copula
$$\widehat{\mathrm{cop}}_{X,Y}(u,v) = \exp\big\{ -0.40\, S_1(u;X) S_1(v;Y) + 0.18\, S_2(u;X) S_2(v;Y) - 0.12 \big\}, \qquad (2.22)$$
displayed in Fig. 1(a). At the global scale, the shape of the copula density indicates a prominent negative ($\hat\theta_{11} = -0.40$) association between age and tot. Moreover, at the local scale, significant heterogeneity in the strength of dependence is clearly visible, as captured by the nonlinear asymmetric copula: the correlation between age and tot is quite high for older donors (say, $> 70$) compared with younger ones. This allows us to gain refined insight into how kidney function declines with age.

Example 2. PLOS data (both marginals discrete). The data contain information on $n = 878$ journal articles published in PLOS Medicine between 2011 and 2015. For each article, two variables were extracted: the length of the title and the number of authors. The dataset is available in the R package dobson. The checkerboard-shaped estimated discrete copula
$$\widehat{\mathrm{cop}}_{X,Y}(u,v) = \exp\big\{ 0.42\, S_1(u;X) S_1(v;Y) + 0.10\, S_2(u;X) S_2(v;Y) - 0.07\, S_2(u;X) S_2(v;Y) - 0.12 \big\}$$
is shown in Fig. 1(b); it reveals a strong positive nonlinear association. In particular, the sharp lower tail around $(0,0)$ indicates that smaller values of $(X,Y)$ have a greater tendency to occur together than larger ones.

Example 3. Horseshoe crab data (mixed marginals). The study consists of $n = 173$ nesting horseshoe crabs (Agresti, 2013). For each female crab we have its carapace width (cm) and the number of male crabs residing near her nest. The goal of the study is to investigate whether carapace width affects the number of male satellites for female horseshoe crabs and, if so, how: what is the shape of the copula dependence function? The estimated copula, shown in Fig. 1(c), indicates a significant positive linear correlation between the width and the number of satellites of a female crab.

Example 4. 1986 Challenger shuttle O-ring data. On January 28, 1986, just seventy-three seconds into its flight, the Challenger space shuttle broke apart, killing all seven crew members on board. The purpose of this study is to investigate whether the ambient temperature during the launch was related to damage of the shuttle's O-rings. For this we have data on $n = 23$ previous shuttle missions, consisting of launch temperature (degrees F) and the number of damaged O-rings (out of 6). The estimated LP-MaxEnt copula density, displayed in Fig. 1(d), shows a strong negative association between the launch temperature and the number of damaged O-rings. Moreover, the sharp peak of the copula density around the corner $(0,1)$ further implies that cold temperatures can excessively increase the risk of O-ring failure.

The scope of the general theory of the preceding section goes far beyond a tool for nonparametric copula approximation. In this section, we show how one can derive a large class of applied statistical methods in a unified manner by suitably reformulating them in terms of the LP-MaxEnt copula model. In doing so, we also provide statistical interpretations of the LP-MaxEnt parameters under different data modeling tasks.

Categorical data analysis will be viewed through the lens of LP-copula modeling. Let $X$ and $Y$ denote two discrete categorical variables, $X$ with $I$ categories and $Y$ with $J$ categories. The data are summarized in an $I \times J$ contingency table $F$: $f_{kl}$ is the observed cell count in row $k$ and column $l$ of $F$, and $n = f_{++}$ is the total frequency. The row and column totals are denoted by $f_{k+}$ and $f_{+l}$. The observed joint probability $\Pr(X=k, Y=l)$ is denoted by $\tilde p_{kl} = f_{kl}/f_{++}$; the respective row and column marginals are $\tilde p_{k+}$ and $\tilde p_{+l}$.

LP Log-linear Model. We specialize our general copula model (2.18) to two-way contingency tables. The discrete LP-copula for the $I \times J$ table is given by
$$\mathrm{cop}_{X,Y}\big(\tilde F_X(k),\, \tilde F_Y(l)\big) = \exp\Big\{ \sum_{j=1}^{m} \mu_j\, \phi_{jk}\, \psi_{jl} - \log Z_\mu \Big\}, \qquad (3.1)$$
where we abbreviate the row and column scores $\phi_j(F_X(k)) = \phi_{jk}$ and $\psi_j(F_Y(l)) = \psi_{jl}$ for $k = 1,\ldots,I$ and $l = 1,\ldots,J$. The number of components satisfies $m \le M = \min(I-1,\, J-1)$; we call the log-linear model (3.1) 'saturated' (or 'dense') when $m = M$. The non-increasing sequence of model parameters $\mu_j$ are called "intrinsic association parameters," satisfying
$$\mu_1 \ge \mu_2 \ge \cdots \ge \mu_m \ge 0. \qquad (3.2)$$
Note that the discrete LP-row and column scores, by design, satisfy (for $j \ne j'$)
$$\sum_k \tilde p_{k+}\, \phi_{jk} = 0, \qquad \sum_k \tilde p_{k+}\, \phi_{jk}^2 = 1, \qquad \sum_k \tilde p_{k+}\, \phi_{jk}\, \phi_{j'k} = 0, \qquad (3.3\text{--}3.5)$$
and similarly for the column scores $\psi_{jl}$ with respect to $\tilde p_{+l}$.

Interpretation. It is clear from (3.2) that the parameters $\mu_j$ are fundamentally different from the standard Pearsonian-type correlations
$$\rho_j = \sum_{k,l} \tilde p_{kl}\, \phi_{jk}\, \psi_{jl}. \qquad (3.6)$$
The coefficients of the LP-MaxEnt copula expansion for contingency tables carry a special interpretation in terms of the log-odds-ratio. To see this, we start by examining the $2 \times 2$ case.

The $2 \times 2$ Contingency Table. Applying (3.1) to two-by-two tables (for which $M = 1$), we have
$$\mathrm{cop}_{X,Y}\big(\tilde F_X(k),\, \tilde F_Y(l)\big) = \exp\big\{ \mu_1\, \phi_{1k}\, \psi_{1l} - \log Z_\mu \big\}. \qquad (3.7)$$
Note that for dichotomous $X$, the LP-spectral basis $\phi_1(F_X(x))$ is equal to $T_1(x;\, F_X)$. Consequently, we have the following explicit formulas for $\phi_1$ and $\psi_1$:
$$\phi_{11} = -\sqrt{\tilde p_{2+}/\tilde p_{1+}}, \quad \phi_{12} = \sqrt{\tilde p_{1+}/\tilde p_{2+}}; \qquad \psi_{11} = -\sqrt{\tilde p_{+2}/\tilde p_{+1}}, \quad \psi_{12} = \sqrt{\tilde p_{+1}/\tilde p_{+2}}. \qquad (3.8\text{--}3.9)$$
Substituting these into (3.7) yields the following important result.

Theorem 1. For $2 \times 2$ contingency tables, the estimate of the statistical parameter $\mu_1$ of the MaxEnt LP-copula model can be expressed as
$$\hat\mu_1 = \sqrt{\tilde p_{1+}\, \tilde p_{2+}\, \tilde p_{+1}\, \tilde p_{+2}}\; \Big[ \log \frac{f_{11}\, f_{22}}{f_{12}\, f_{21}} \Big], \qquad (3.10)$$
where the part inside the square bracket is the sample log-odds-ratio.

Remark 2 (Significance of Theorem 1). We have derived the log-odds-ratio statistic from first principles using a copula-theoretic framework. To the best of our knowledge, no other study has discovered this connection; see also eq. (16) of Goodman (1991). In fact, one can view Theorem 1 as a special case of the much more general result described next.
Theorem 2. For an $I \times J$ table, consider a two-by-two subtable with rows $k$ and $k'$ and columns $l$ and $l'$. Then the logarithm of the odds-ratio $\vartheta_{kl,k'l'}$ is connected with the intrinsic association parameters $\mu_j$ in the following way:
$$\vartheta_{kl,k'l'} := \log \frac{p_{kl}\, p_{k'l'}}{p_{kl'}\, p_{k'l}} = \sum_{j=1}^{M} \mu_j\, (\phi_{jk} - \phi_{jk'})(\psi_{jl} - \psi_{jl'}). \qquad (3.11)$$
To deduce (3.10) from (3.11), verify the following identity using the LP-basis formulas (3.8)-(3.9):
$$(\phi_{11} - \phi_{12})(\psi_{11} - \psi_{12}) = \frac{1}{\sqrt{\tilde p_{1+}\, \tilde p_{2+}\, \tilde p_{+1}\, \tilde p_{+2}}}.$$

Reproducing Goodman's Association Model. Our discrete copula-based categorical data model (3.1) expresses the logarithm of the "dependence ratios"
$$\mathrm{dep}(k,l) = \frac{p_{kl}}{p_{k+}\, p_{+l}} \qquad (3.12)$$
as a linear combination of LP-orthonormal row and column scores satisfying (3.3)-(3.5). The copula-dependence ratio (3.12) measures the strength of association between the $k$-th row category and the $l$-th column category. To make the connection even more explicit, rewrite (3.1) for two-way contingency tables as
$$\log p_{kl} = \mu^R_k + \mu^C_l + \sum_{j=1}^{m} \mu_j\, \phi_{jk}\, \psi_{jl}, \qquad (3.13)$$
where $\mu^R_k$ denotes the logarithm of the row marginal, $\log p_{k+}$, and $\mu^C_l$ the logarithm of the column marginal, $\log p_{+l}$. Goodman (1991) called this a "weighted association model," where the weights are the marginal row and column proportions. He used the term "association model" (to distinguish it from the correlation-based model (3.6)) because it studies the relationship between rows and columns through odds-ratios.

Remark 3. Log-linear models are a powerful statistical tool for categorical data analysis (Agresti, 2013). Here we have provided a contemporary unified view of log-linear modeling for contingency tables from the discrete LP-copula viewpoint. This newfound connection might open up new avenues of research.

We next describe a graphical exploratory tool, the logratio biplot, which allows a quick visual understanding of the relationship between the categorical variables $X$ and $Y$. In the following, we describe the process of constructing the logratio biplot from the LP-copula model (3.1).

Copula-based Algorithm. Construct two scatter plots based on the top two dominant components of the LP-copula model: the first is associated with the row categories, formed by the points $(\mu_1 \phi_{1k},\, \mu_2 \phi_{2k})$ for $k = 1,\ldots,I$; the second is associated with the column categories, formed by the points $(\mu_1 \psi_{1l},\, \mu_2 \psi_{2l})$ for $l = 1,\ldots,J$. The logratio biplot is the two-dimensional display obtained by overlaying these two scatter plots; the prefix 'bi' refers to the fact that it shares a common set of axes for both the row and column categories. An R sketch of this construction is given after the interpretation below.

Interpretation. Here we offer an intuitive explanation of the logratio biplot from the copula perspective. We start by recalling the definition of the conditional comparison density (CCD; see eqs. 2.8-2.9) as a copula slice. For fixed $X = k$, the logratio-embedding coordinates $(\mu_j \phi_{jk})_j$ can be viewed as the LP-Fourier coefficients of $d(F_Y(y);\, Y,\, Y|X=k)$, since
$$\log d\big(\tilde F_Y(l);\, Y,\, Y|X=k\big) = \sum_j (\mu_j\, \phi_{jk})\, \psi_{jl} + \text{const}.$$
Similarly, the logratio coordinates $(\mu_j \psi_{jl})_j$ for fixed $Y = l$ can be interpreted as the LP-expansion coefficients of $d(F_X(x);\, X,\, X|Y=l)$. Hence, the logratio biplot can alternatively be viewed as follows: (i) estimate the discrete LP-copula density; (ii) extract the copula slice $\hat d(u;\, Y,\, Y|X=k)$ along with its LP-coefficients $(\mu_1 \phi_{1k},\, \mu_2 \phi_{2k})$; (iii) similarly, extract the estimated $\hat d(v;\, X,\, X|Y=l)$, the copula slice at $Y = \tilde F_Y(l)$, along with its LP-coefficients $(\mu_1 \psi_{1l},\, \mu_2 \psi_{2l})$; (iv) the logratio biplot (see Fig. 2(b)) then measures the association between the row and column categories $X=k$ and $Y=l$ by measuring the similarity between the 'shapes' of $\hat d(u;\, Y,\, Y|X=k)$ and $\hat d(v;\, X,\, X|Y=l)$ through their LP-Fourier coefficients.
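A minimal R sketch of the copula-based biplot algorithm follows; mu, phi, and psi stand for the fitted intrinsic association parameters and spectral row/column scores (hypothetical objects, assumed already estimated).

```r
# Overlay row points (mu1*phi_1k, mu2*phi_2k) and
# column points (mu1*psi_1l, mu2*psi_2l) on common axes.
biplot_lp <- function(mu, phi, psi, row_labs, col_labs) {
  rows <- cbind(mu[1] * phi[1, ], mu[2] * phi[2, ])
  cols <- cbind(mu[1] * psi[1, ], mu[2] * psi[2, ])
  plot(rbind(rows, cols), type = "n",
       xlab = "component 1", ylab = "component 2")
  text(rows, labels = row_labs, col = "blue")   # row categories
  text(cols, labels = col_labs, col = "red")    # column categories
  abline(h = 0, v = 0, lty = 3)
}
```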
Remark 4 (Historical Significance). The following remarks are pertinent: (i) The log-ratio map is traditionally taught and practiced using matrix algebra (Greenacre, 2018). This is in sharp contrast with our approach, which provides a statistical synthesis of the log-ratio biplot from a new copula-theoretic viewpoint. To the best of the author's knowledge, this is the first work to establish such a connection. (ii) The logratio biplot has some important differences from correspondence analysis, pioneered by the French statistician Jean-Paul Benzécri; for more details, see Goodman (1991) and Benzécri (1991). In practice, however, the two methods often lead to very similar conclusions (e.g., contrast Fig. 2(b) and Fig. 7).

Example 5. 1970 Vietnam War-era US draft lottery (Fienberg, 1971). All eligible American men aged 19 to 26 were drafted through a lottery system in 1970 to fill the needs of the country's armed forces. The results of the draft are given to us in the form of a $12 \times 3$ contingency table (see Table 3 in the appendix): the rows are the months of the year from January to December, and the columns denote three categories of risk of being drafted: high, medium, and low. The discrete 'staircase-shaped' LP-copula estimate is shown in Fig. 2(a); it is a two-component model with estimated intrinsic association parameters $\hat\mu_1 = 0.26$ and $\hat\mu_2 = 0.18$. We now overlay the scatter plots $(0.26\, \phi_{1k},\, 0.18\, \phi_{2k})$ for $k = 1,\ldots,12$ and $(0.26\, \psi_{1l},\, 0.18\, \psi_{2l})$ for $l = 1,2,3$ to construct the logratio biplot displayed in Fig. 2(b). This easy-to-interpret two-dimensional graph captures the essential dependence pattern between the row (month: blue dots) and column (risk category: red triangles) variables.

It has been known for a long time (Fienberg and Rinaldo, 2007) that classical maximum-likelihood-based log-linear models break down when applied to large sparse contingency tables with many zero cells. Here we discuss a new MaxEnt copula-based smooth method for fitting a parsimonious log-linear model to sparse contingency tables.

Example 6. Zelterman data. The dataset (Zelterman, 1987) cross-classifies salary against years of experience in a large, sparse $28 \times 26$ contingency table. Our LP-MaxEnt approach distills a compressed representation with a reduced number of parameters that yields a smooth estimate: only $m = 3$ components, which can be viewed as the intrinsic degrees of freedom (df), are required to capture the pattern in the data: a radical compression with negligible information loss! Contrast this with the dimension of the saturated log-linear model, $(28-1) \times (26-1) = 675$: a severely overparameterized non-smooth model with inflated degrees of freedom, which leads to an inaccurate goodness-of-fit test for checking independence between rows and columns. More on this in Sec. 3.5.1.

Smoothing Ordered Contingency Tables. The nonparametric maximum-likelihood cell-probability estimates $\tilde p_{X,Y}(k,l) = f_{kl}/n$ are very noisy and unreliable for sparse contingency tables. By sparse, we mean tables with a large number of cells relative to the number of observations. Using Sklar's representation theorem, one can simply estimate the joint probability $\hat p_{X,Y}(k,l)$ by multiplying the empirical product pmf $\tilde p_X(k)\, \tilde p_Y(l)$ by the smoothed LP-copula. In other words, the copula can be viewed as a data-adaptive bivariate discrete kernel function that smooths the independence product density to estimate the cell probabilities:
$$\mathrm{dKernel}(k,l) = \widehat{\mathrm{cop}}_{X,Y}\big(\tilde F_X(k),\, \tilde F_Y(l)\big), \quad k = 1,\ldots,I;\ l = 1,\ldots,J,$$
so that $\hat p_{X,Y}(k,l) = \tilde p_X(k)\, \tilde p_Y(l)\, \mathrm{dKernel}(k,l)$. This approach can be generalized to any bivariate discrete distribution; see the next section.
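The resulting smoother is a few lines of R. In the sketch below, cop_hat is assumed to be the fitted LP-copula density, vectorized over (u, v); everything else follows the dKernel formula above.

```r
# Copula-smoothed cell probabilities:
# p^(k,l) = p~_X(k) p~_Y(l) * cop^(F~_X(k), F~_Y(l))
smooth_cells <- function(tab, cop_hat) {        # tab: I x J table of counts
  n  <- sum(tab)
  px <- rowSums(tab) / n; py <- colSums(tab) / n  # empirical marginals
  u  <- cumsum(px);       v  <- cumsum(py)        # F~_X(k), F~_Y(l)
  K  <- outer(u, v, cop_hat)                      # discrete 'dKernel' weights
  outer(px, py) * K                               # smoothed p^(k,l)
}
```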
Remark 5. For a comprehensive review of traditional kernel-based nonparametric smoothing methods for sparse contingency tables, readers are encouraged to consult Simonoff (1985, 1995) and the references therein. The topic of nonparametric smoothing for multivariate discrete distributions has received far less attention than the continuous case. Two significant contributions in this direction are Aitchison and Aitken (1976) and Simonoff (1983).

In what follows, we discuss a new LP-copula-based procedure for modeling correlated discrete random variables.

Example 7. Shunter accident data (Arbous and Kerrich, 1951). As a motivating example, consider the following data: we are given the number of accidents incurred by $n = 122$ shunters in two consecutive periods, 1937-1942 and 1943-1947. To save space, we display the bivariate discrete data in contingency-table format; see Table 1.

Table 1:
                      1943-47
  1937-42     0    1    2    3    4    5    6
     0       21   18    8    2    1    0    0
     1       13   14   10    1    4    1    0
     2        4    5    4    2    1    0    1
     3        2    1    3    2

Algorithm. The main steps of our analysis are described below.

Step 1. Modeling the marginal distributions. We start by looking at the marginal distributions of $X$ and $Y$. As seen in Fig. 4, negative binomial distributions provide an excellent fit. To fix notation, by $G_{\mu,\phi} = \mathrm{NB}(y;\, \mu, \phi)$ we mean the probability distribution
$$\mathrm{NB}(y;\, \mu, \phi) = \binom{y+\phi-1}{y} \Big(\frac{\mu}{\mu+\phi}\Big)^{y} \Big(\frac{\phi}{\mu+\phi}\Big)^{\phi}, \quad y \in \mathbb{N}, \qquad (3.17)$$
with mean $\mu$ and variance $\mu + \mu^2/\phi$. Using maximum likelihood we get $X \sim G_1 = \mathrm{NB}(x;\, \hat\mu = 0.97, \hat\phi = 3.60)$ and $Y \sim G_2 = \mathrm{NB}(y;\, \hat\mu = 4.30, \hat\phi = 1.27)$.

Step 2. Generalized copula density. The probability of the bivariate distribution at $(x,y)$ can be written as follows (generalizing Sklar's theorem):
$$\Pr(X=x,\, Y=y) := p_{X,Y}(x,y) = g_1(x)\, g_2(y)\, \mathrm{cop}_{X,Y}\big(G_1(x),\, G_2(y)\big), \qquad (3.18)$$
where the generalized log-copula density admits the decomposition
$$\log \mathrm{cop}_{X,Y}\big(G_1(x),\, G_2(y)\big) = \sum_{j,k} \theta_{jk}\, T_j(x;\, G_1)\, T_k(y;\, G_2) - \log Z_\theta. \qquad (3.19)$$
It is important to note that the LP-basis functions $\{T_j(x;\, G_1)\}$ and $\{T_k(y;\, G_2)\}$ are specially designed for the parametric marginals $G_1$ and $G_2$. We call them the gLP-basis, to distinguish them from the earlier empirical LP-polynomial systems $\{T_j(x;\, \tilde F_X)\}$ and $\{T_k(y;\, \tilde F_Y)\}$.

Step 3. Exploratory goodness of fit. The estimated log-bilinear LP-copula is
$$\widehat{\mathrm{cop}}_{X,Y}(u,v) = \exp\big\{ 0.287\, S_1(u;\, G_1)\, S_1(v;\, G_2) - 0.043 \big\}. \qquad (3.20)$$
Three important conclusions can be drawn from this non-uniform copula density estimate. (i) Goodness-of-fit diagnostic: the independence model (product of parametric marginals) $g^{\perp}(x,y) = g_1(x)\, g_2(y)$ is not adequate for the data. (ii) Nature of the discrepancy: the single significant component, with a positive coefficient on $S_1(u;G_1)\, S_1(v;G_2)$, points to a positive linear (in rank) dependence between $X$ and $Y$. (iii) Nonparametric repair: how should the initial $g^{\perp}(x,y)$ be updated to construct a "better" model? Eq. (3.18) gives the general updating rule, which simply says: the copula provides the necessary bivariate correction function to close the gap between the misspecified starting model $g^{\perp}(x,y)$ and the true unknown distribution $p_{X,Y}(x,y)$.

Step 4. LP-smoothed probability estimation. The bottom panel of Fig. 4 shows the final smooth probability estimate $\hat p_{X,Y}(x,y)$, computed by substituting (3.20) into (3.18). Also compare Tables 4 and 5 of Appendix A.4.

Remark 6. Our procedure fits a 'hybrid' model: a nonparametrically corrected (through the copula) multivariate parametric density estimate. One can use any parametric distribution instead of the negative binomial; the algorithm remains fully automatic irrespective of the choice of parametric marginals $G_1$ and $G_2$, which makes it a universal procedure.
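Step 1 is routine in R; a hedged sketch using MASS::fitdistr (whose mu/size parameterization matches (3.17), with size playing the role of $\phi$) is given below. The raw accident counts x and y are not reproduced here, so the calls are shown as comments with the estimates reported in the text.

```r
# Fitting the negative binomial marginals of Step 1 (illustrative sketch)
library(MASS)
# g1 <- fitdistr(x, "negative binomial")   # reported: mu = 0.97, size = 3.60
# g2 <- fitdistr(y, "negative binomial")   # reported: mu = 4.30, size = 1.27
# Baseline independence model g1(x) g2(y), later corrected by the copula (3.18):
# p0 <- outer(dnbinom(0:6, mu = 0.97, size = 3.60),
#             dnbinom(0:6, mu = 4.30, size = 1.27))
```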
Mutual information (MI) is a fundamental quantity in statistics and machine learning, with wide-ranging applications from neuroscience to physics to biology. For continuous random variables $(X,Y)$, mutual information is defined as
$$\mathrm{MI}(X,Y) = \iint f_{X,Y}(x,y)\, \log \frac{f_{X,Y}(x,y)}{f_X(x)\, f_Y(y)}\, \mathrm{d}x\, \mathrm{d}y. \qquad (3.21)$$
Among nonparametric MI estimators, $k$-nearest-neighbor and kernel-density-based methods (Moon et al., 1995; Kraskov et al., 2004; Zeng et al., 2018) are undoubtedly the most popular. Here we are concerned with the slightly more general problem of developing a flexible MI estimation algorithm that is (D1) applicable to mixed $(X,Y)$; (D2) robust in the presence of noise; and (D3) invariant under monotone transformations. To achieve this goal, we start by rewriting MI (3.21) in terms of the copula:
$$\mathrm{MI}(X,Y) = \iint_{[0,1]^2} \mathrm{cop}(u,v)\, \log \mathrm{cop}(u,v)\, \mathrm{d}u\, \mathrm{d}v. \qquad (3.22)$$
The next theorem presents an elegant closed-form expression for MI in terms of the LP-copula parameters, which allows a fast and efficient estimation algorithm.

Theorem 3. Let $(X,Y)$ be a mixed pair of random variables. Under the LP log-bilinear copula model (2.11), the mutual information between $X$ and $Y$ has the following representation in terms of the LP-co-mean parameters $\mathrm{LP}_{jk}$ and the maximum-entropy coefficients $\theta_{jk}$:
$$\mathrm{MI}(X,Y) = \sum_{j,k} \theta_{jk}\, \mathrm{LP}_{jk} - \log Z_\theta.$$

Proof. Express the mutual information as
$$\mathrm{MI}(X,Y) = \mathbb{E}_{\mathrm{cop}}\big[\log \mathrm{cop}(U,V)\big] = \sum_{j,k} \theta_{jk}\, \mathbb{E}\big[S_j(U;X)\, S_k(V;Y)\big] - \log Z_\theta.$$
The first equality follows from (3.22) and the second from (2.11). Complete the proof by replacing $\mathbb{E}[S_j(U;X)\, S_k(V;Y)]$ with $\mathrm{LP}_{jk}$, by virtue of (2.13).

As a practical consequence, we have the following efficient and direct MI estimator, satisfying D1-D3:
$$\widehat{\mathrm{MI}}(X,Y) = \sum_{(j,k)} \hat\theta_{jk}\, \widetilde{\mathrm{LP}}_{jk} - \log Z_{\hat\theta}. \qquad (3.25)$$

Continuous $(X,Y)$ example. Consider the kidney fitness data discussed in Example 1. The LP-copula-density-based (2.22) MI estimate is $\widehat{\mathrm{MI}} = 0.20$, with a bootstrap p-value of almost zero.

Remark 7. MI (3.22) measures the departure of the copula density from uniformity. This is because MI can be viewed as the Kullback-Leibler (KL) divergence between the copula and the uniform density: $\mathrm{MI}(X,Y) = \mathrm{KL}(\mathrm{cop};\, \mathcal{U}_{[0,1]^2})$. A few immediate consequences: (i) MI is always non-negative, $\mathrm{MI}(X,Y) \ge 0$, with equality if and only if the variables are independent; moreover, the stronger the dependence between the two variables, the larger the MI. (ii) MI is invariant to the choice of marginals, i.e., to monotone transformations of $X$ and $Y$. Two additional applications of MI (to categorical data and to feature selection problems) are presented below.

Given $n$ independent samples from an $I \times J$ contingency table, the $G^2$ goodness-of-fit test, also known as the log-likelihood-ratio test, is defined as
$$G^2 = 2 \sum_{k,l} f_{kl}\, \log\Big( \frac{f_{kl}}{f_{k+}\, f_{+l}/n} \Big), \qquad (3.26)$$
which under the null hypothesis of independence has an asymptotic $\chi^2_{(I-1)(J-1)}$ distribution. From (3.26) one can immediately conclude the following.

Theorem 4. The $G^2$ log-likelihood-ratio statistic can be viewed as a raw nonparametric MI estimate:
$$G^2 = 2n \times \widetilde{\mathrm{MI}}(X,Y), \qquad (3.27)$$
where $\widetilde{\mathrm{MI}}(X,Y)$ is obtained by replacing the unknown distributions in (3.21) with their empirical estimates.

Example 8. Hellman's infant data (Yates, 1934). We verify the identity (3.27) on the following $2 \times 2$ table, which cross-tabulates $n = 42$ infants according to whether the infant was breast-fed or bottle-fed (Yates, 1934, p. 230):

  Breast-fed    4   16
  Bottle-fed    1   21

The estimated LP-copula density (shown in Fig. 10 of Appendix A.4) is
$$\widehat{\mathrm{cop}}(u,v;\, X, Y) = \exp\big\{ 0.234\, S_1(u;X)\, S_1(v;Y) - 0.03 \big\}. \qquad (3.28)$$
The empirical MI estimate gives $2n \times \widetilde{\mathrm{MI}}(X,Y) = 2.50$, with a p-value of $0.12$ computed from the asymptotic null distribution $\chi^2_1$. This exactly matches the $G^2$-statistic value; one may use the R function GTest.
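Theorem 4 can be checked numerically on Example 8 in a few lines of R; the GTest cross-check assumes the function the text mentions is the one in the DescTools package.

```r
# Verify G^2 = 2n * MI~ on Yates' 2x2 infant data
tab <- matrix(c(4, 1, 16, 21), 2, 2,
              dimnames = list(c("Breast-fed", "Bottle-fed"), NULL))
n  <- sum(tab)
p  <- tab / n
e  <- outer(rowSums(p), colSums(p))             # independence model
MI <- sum(p * log(p / e))                       # raw nonparametric MI estimate
G2 <- 2 * n * MI                                # = 2.50, as reported
pval <- pchisq(G2, df = 1, lower.tail = FALSE)  # ~ 0.12
c(G2 = G2, pvalue = pval)
# DescTools::GTest(tab) returns the same statistic (assumed package)
```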
A problem arises when we try to apply the $G^2$-test to large sparse tables, and it is not hard to see why: the adequacy of the asymptotic $\chi^2_{(I-1)(J-1)}$ distribution depends both on the sample size $n$ and on the number of cells $p = IJ$. Koehler (1986) showed that the approximation breaks down completely when $n/p < 5$, leading to erroneous statistical inference due to a significant loss of power. The following example demonstrates this.

Example 9. Zelterman data, continued. The log-likelihood-ratio $G^2$-test produces a p-value of 1, firmly concluding independence between salary and years of experience. This directly contradicts our analysis of Sec. 3.3, where we found a clear positive dependence between these two variables. Why was the $G^2$-test unable to detect that effect? Because it is based on a chi-square approximation with $(28-1) \times (26-1) = 675$ degrees of freedom; this inflated degrees of freedom completely ruins the power of the test. To address this problem, we recommend the following smoothed version:
$$\text{Smooth-}G^2 = 2n \times \widehat{\mathrm{MI}}(X,Y), \qquad (3.29)$$
where $\widehat{\mathrm{MI}}(X,Y)$ is computed from the LP-bilinear copula model. The Smooth-$G^2$ analysis yields a p-value of $2.48 \times 10^{-11}$, thereby successfully detecting the association. The crucial aspect of our approach lies in its ability to provide a reduced-dimensional parameterization of the copula density: for the Zelterman data we need just two components (i.e., the effective degrees of freedom is 2) to capture the pattern.

Remark 8 (Discrete variables with many categories). Discrete distributions over large domains routinely arise in large-scale biomedical data such as diagnosis codes, drug compounds, and genotypes (Seok and Kang, 2015). The method proposed here can be used to jointly model such random variables.

3.5.2 Application 2: $(X,Y)$ Mixed: Feature Importance Score

We consider the two-sample feature selection problem, where $Y$ is a binary response variable and $X$ is a predictor variable that can be either discrete or continuous.

Example 10. Chronic kidney disease data. The goal of this study is to investigate whether kidney function is related to red blood cell count (RBC). We have a sample of $n = 203$ participants, among whom 79 have chronic kidney disease (ckd) and the other 124 are non-ckd. $Y$ denotes kidney disease status and $X$ denotes the RBC measurement (in millions of cells per cubic millimeter of blood). The estimated LP-copula density,
$$\widehat{\mathrm{cop}}(u,v) = \exp\big\{ -0.76\, S_1(u;Y)\, S_1(v;X) + 0.18\, S_1(u;Y)\, S_2(v;X) - 0.19\, S_1(u;Y)\, S_4(v;X) - 0.33 \big\}, \qquad (3.30)$$
is shown in Fig. 5. We propose a mutual-information-based feature importance measure built on formula (3.25); this yields $\widehat{\mathrm{MI}}(Y,X) = 0.36$ with a p-value of almost zero, strongly indicating that RBC is an important risk factor for kidney dysfunction. What additional insights can we glean from this copula? To answer that question, let us focus on the copula slice for $u \in [0.61, 1]$. This segment of the copula is essentially the conditional comparison density $d(v;\, X,\, X|Y=1)$ (see Sec. 2.1), which can easily be derived by substituting $S_1(\tilde F_Y(1);\, Y) = \sqrt{(1-\tilde\mu)/\tilde\mu} = 1.25$ into (3.30), where $\tilde\mu = 79/203 = 0.389$:
$$\hat d(v;\, X,\, X|Y=1) = \exp\big\{ -0.95\, S_1(v;X) + 0.23\, S_2(v;X) - 0.24\, S_4(v;X) - 0.33 \big\}. \qquad (3.31)$$
A few remarks on the interpretation of this formula:
• Distributional effect size: bearing in mind eqs. (2.5) and (2.9), it is easy to see that $d(v;\, X,\, X|Y=1)$ compares two densities, $f_{X|Y=1}(x)$ against $f_X(x)$, thereby capturing the distributional difference.
This has an advantage over traditional two-sample feature importance statistics (e.g., Student's $t$ or the Wilcoxon statistic), which can only measure differences in location or mean.
• Explainability: the estimated $\hat d(v;\, X,\, X|Y=1)$ involves three significant LP-components of $X$: the first-order 'linear' $S_1(v;X)$ indicates a location difference; the second-order 'quadratic' $S_2(v;X)$ indicates a scale difference; and the fourth-order 'quartic' $S_4(v;X)$ indicates a tail difference between the two RBC distributions. In addition, the negative sign of the linear coefficient ($-0.95$) implies a reduced mean level of RBC in the ckd population. Medically, this makes complete sense, since a dysfunctional kidney cannot produce enough of the hormone erythropoietin (EPO), which causes the RBC to drop. The bottom panel of Fig. 5 shows the two-sample boxplots and the conditional comparison density (CCD) $d(v;\, X,\, X|Y=1)$. The piecewise-constant shape of the estimated CCD is not an aberration of our nonparametric approximation method; it reflects the inherent discreteness of the feature $X$ (red blood cell count).

We now describe a new copula-based nonparametric logistic regression model. The key result is given by the following theorem, which provides a first-principles derivation of a robust nonlinear generalization of the classical linear logistic regression model.

Theorem 5. Let $\mu = \Pr(Y=1)$ and $\mu(x) = \Pr(Y=1 \,|\, X=x)$. Then we have the following expression for the logit (log-odds) probability model:
$$\mathrm{logit}\, \mu(x) = \alpha_0 + \sum_{j} \alpha_j\, T_j(x;\, F_X), \qquad (3.32)$$
where $\alpha_0 = \mathrm{logit}(\mu)$ and $\alpha_j = \theta_{1j}\big/\sqrt{\mu(1-\mu)}$.

Proof. The proof consists of four main steps.

Step 1. To begin with, notice that for $Y$ binary and $X$ continuous, the general log-bilinear LP-copula density (2.11) reduces to
$$\mathrm{cop}_\theta(u,v) = \frac{1}{Z_\theta} \exp\Big\{ \sum_{j} \theta_{1j}\, S_1(u;\, Y)\, S_j(v;\, X) \Big\}, \qquad (3.33)$$
since we can construct at most $2-1 = 1$ LP-basis function for binary $Y$.

Step 2. Apply the copula-based Bayes theorem (Eq. 2.10) and express the conditional comparison densities as
$$d_1(x) := d\big(F_X(x);\, X,\, X|Y=1\big) = \frac{\mu(x)}{\mu}, \qquad (3.34)$$
and
$$d_0(x) := d\big(F_X(x);\, X,\, X|Y=0\big) = \frac{1-\mu(x)}{1-\mu}. \qquad (3.35)$$
Taking the logarithm of the ratio of (3.34) and (3.35), we get the important identity
$$\log \frac{d_1(x)}{d_0(x)} = \mathrm{logit}\, \mu(x) - \mathrm{logit}\, \mu. \qquad (3.36)$$

Step 3. From (3.33), one can deduce the following orthonormal expansions of the MaxEnt conditional copula slices $\log d_1$ and $\log d_0$:
$$\log d_1(x) = S_1\big(\tilde F_Y(1);\, Y\big) \sum_j \theta_{1j}\, T_j(x;\, F_X) + \text{const}, \qquad \log d_0(x) = S_1\big(\tilde F_Y(0);\, Y\big) \sum_j \theta_{1j}\, T_j(x;\, F_X) + \text{const}. \qquad (3.37\text{--}3.38)$$

Step 4. Substituting (3.37) and (3.38) into (3.36), we get
$$\mathrm{logit}\, \mu(x) = \mathrm{logit}(\mu) + \frac{1}{\sqrt{\mu(1-\mu)}} \sum_j \theta_{1j}\, T_j(x;\, F_X),$$
since for binary $Y$ we have (see Appendix A.1)
$$T_1(y;\, F_Y) = \frac{y - \mu}{\sqrt{\mu(1-\mu)}}, \quad y \in \{0,1\}, \qquad (3.39)$$
so that $S_1(\tilde F_Y(1);\, Y) - S_1(\tilde F_Y(0);\, Y) = 1/\sqrt{\mu(1-\mu)}$, with the slice normalization constants absorbed into the intercept. Substitute $\alpha_0 = \mathrm{logit}(\mu)$ and $\alpha_j = \theta_{1j}/\sqrt{\mu(1-\mu)}$ to complete the proof.

Generalization. The univariate copula-logistic regression model (3.32) extends to the high-dimensional case as follows:
$$\mathrm{logit}\, \mu(x) = \alpha_0 + \sum_{j=1}^{p} h_j(x_j), \qquad (3.40)$$
where we nonparametrically approximate each unknown smooth $h_j$ by an LP-polynomial series in $X_j$:
$$h_j(x_j) = \sum_{k=1}^{m} \alpha_{jk}\, T_k(x_j;\, F_{X_j}), \quad j = 1,\ldots,p. \qquad (3.41)$$

Remark 9 (Estimation and Computation). A sparse nonparametric LP-additive model of the form (3.40)-(3.41) can be estimated by penalized regression techniques (lasso, elastic net, etc.), whose R computation is remarkably easy using the glmnet function (Friedman et al., 2010): glmnet(TX, y, family = "binomial"), where TX is simply the column-wise stacked $[T_{X_1} | \cdots | T_{X_p}]$ LP-feature matrix.
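A minimal R sketch of this pipeline is given below. The lp_basis helper is a stand-in, not the paper's implementation: it approximates the empirical LP basis of Appendix A.1 by orthogonal polynomials of the mid-rank transform.

```r
# LP-additive lasso-logistic regression (illustrative sketch)
library(glmnet)
lp_basis <- function(x, m = 4) {
  u <- (rank(x) - 0.5) / length(x)   # mid-rank transform of the feature
  poly(u, degree = m)                # orthogonal polynomial columns
}
# X: n x p data frame of predictors; y: binary response (assumed available)
# TX  <- do.call(cbind, lapply(X, lp_basis))        # stacked [T_X1 | ... | T_Xp]
# fit <- cv.glmnet(as.matrix(TX), y, family = "binomial")
```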
Example 11. UCI credit card data. The dataset is available in the UCI Machine Learning Repository. It contains records of $n = 30{,}000$ cardholders from an important Taiwan-based bank. For each customer we have a response variable $Y$ denoting default payment status (Yes = 1, No = 0), along with $p = 23$ predictor variables (e.g., gender, education, age, history of past payment). We randomly partition the data into training and test sets with an 80-20 split, repeated 100 times. We measure prediction accuracy using the AUC (area under the ROC curve). Fig. 6 compares two kinds of lasso-logistic regression: (i) the usual version, based on the feature matrix $X$; and (ii) the LP-copula version, based on the feature matrix $T_X$. As we can see, the LP-copula-based additive logistic regression classifier significantly outperforms the classical logistic regression model.

To gain further insight into the nature of the impact of each variable, we plot the lasso-smoothed location and scale coefficients:
$$\text{LS-Feature plot:} \quad (\hat\alpha_{j1},\, \hat\alpha_{j2}), \quad j = 1,\ldots,p,$$
where L stands for location and S for scale. The purpose of the LS-feature plot is to characterize 'how' each feature impacts the classification task. We see a tight cluster around the origin $(0,0)$, which indicates that most of the variables are irrelevant for prediction (the sparsity assumption). It is also evident that the majority of the features carry either location or scale information (differences); the only exception is the variable pay_0, which denotes the repayment status of the last two months (-1 = pay duly, 1 = payment delay of one month, 2 = payment delay of two months, and so on). This is further illustrated using the three variables marked in the LS-plot, shown in the bottom panel of Fig. 6 together with the two-sample density estimates: each contains unique discriminatory information. pay_0 carries location as well as scale information, and hence appears at the top-right of the LS-plot; the variable limit_balance shows only a location difference, whereas the variable pay_6 shows contrasting scale in the two populations. In short, the LS-plot explains 'why and how' each variable is important using a compact diagram that is easy for researchers and practitioners to interpret.

This paper makes the following contributions: (i) We introduce modern statistical theory and principles for maximum-entropy copula density estimation that auto-adapts to mixed $(X,Y)$; this is described in Section 2. (ii) Our general copula-based formulation provides a unifying framework for data analysis from which one can systematically distill a number of statistical methods, revealing some completely unexpected connections between them. The importance of our theory for applied and theoretical statistics is highlighted in Section 3, with examples from different subfields of statistics: log-linear analysis of categorical data, the logratio biplot, smoothing of large sparse contingency tables, mutual information, the smooth-$G^2$ statistic, feature selection, and copula-based logistic regression. We hope that this new perspective will motivate new methodological developments for cohesive statistical data analysis.

This paper is dedicated to the birth centenary of E. T. Jaynes (1922-1998), the originator of the maximum entropy principle. I would also like to dedicate this paper to the memory of Leo Goodman (1928-2020), a transformative legend of categorical data analysis, who passed away on December 22, 2020, at the age of 92 due to COVID-19. This paper is inspired in part by the author's intention to demonstrate how these two modeling philosophies can be connected and united in some ways. This is achieved by employing a new nonparametric representation theory of generalized copula density.

Preliminaries.
For a random variable $X$ with associated probability distribution $F_X$, define the mid-distribution function as
$$F^{\mathrm{mid}}(x;\, F_X) = F_X(x) - \tfrac{1}{2}\, p(x;\, F_X),$$
where $p(x;\, F_X)$ is the probability mass function. The transform $F^{\mathrm{mid}}(X;\, F_X)$ has mean $\mathbb{E}[F^{\mathrm{mid}}(X;\, F_X)] = 0.5$ and variance $\mathrm{Var}[F^{\mathrm{mid}}(X;\, F_X)] = \frac{1}{12}\big(1 - \sum_x p^3(x;\, F_X)\big)$. Define the first-order basis function $T_1(X;\, F_X)$ by standardizing the $F^{\mathrm{mid}}$-transformed random variable:
$$T_1(x;\, F_X) = \frac{F^{\mathrm{mid}}(x;\, F_X) - 1/2}{\sqrt{\mathrm{Var}[F^{\mathrm{mid}}(X;\, F_X)]}}. \qquad (6.1)$$
Construct the higher-order LP-polynomial bases $\{T_j(X;\, F_X)\}_{1 \le j \le m}$ by Gram-Schmidt orthonormalization of $T_1^2, T_1^3, \ldots, T_1^m$. We call these specially designed polynomials of the mid-distribution transform the LP-basis; by construction, they satisfy the following orthonormality with respect to the measure $F_X$:
$$\int T_j(x;\, F_X)\, T_k(x;\, F_X)\, \mathrm{d}F_X(x) = \delta_{jk}, \qquad (6.2)$$
where $\delta_{jk}$ is the Kronecker delta. Two particular LP families of polynomials are given below.

• gLP-Polynomial Basis. Let $X \sim G$, where $G$ is a known distribution. Following the above recipe, construct the parametric gLP-basis $\{T_j(X;\, G)\}_{j \ge 1}$ for the given distribution $G$. For $X \sim \mathrm{Bernoulli}(p)$, one can show that $T_1(x;\, G) = \frac{x - p}{\sqrt{p(1-p)}}$, $x = 0, 1$; see Mukhopadhyay and Parzen (2020) for more details.

• eLP-Polynomial Basis. In practice, we only have access to a random sample $X_1, \ldots, X_n$ from an unknown distribution $F$. To perform statistical data analysis, it thus becomes necessary to nonparametrically 'learn' an appropriate basis that is orthogonal with respect to the (discrete) empirical measure $\tilde F_X$. To address this need, construct the LP-basis with respect to the empirical measure: $\{T_j(X;\, \tilde F_X)\}_{j \ge 1}$.

LP-unit Basis. We occasionally express the $T_j$'s in the quantile domain:
$$S_j(u;\, X) = T_j\big(Q_X(u);\, F_X\big), \quad 0 < u < 1. \qquad (6.3)$$
We call these $S$-functions the unit LP-bases. The '$S$-form' and the '$T$-form' are used interchangeably throughout the paper, depending on the context.

LP-product Bases and Copula Approximation. The LP-product bases $\{T_j(x;\, F_X)\, T_k(y;\, F_Y)\}$ can be used to expand any square-integrable function of the form $\Psi(F_X(x),\, F_Y(y))$. In particular, the logarithm of Hoeffding's dependence function (2.3) can be approximated by the LP-Fourier series
$$\log \mathrm{cop}_{X,Y}\big(F_X(x),\, F_Y(y)\big) = \sum_{j,k} \theta_{jk}\, T_j(x;\, F_X)\, T_k(y;\, F_Y). \qquad (6.4)$$
Representing (6.4) in the quantile domain, by substituting $x = Q_X(u)$ and $y = Q_Y(v)$, gives the copula density expression (2.11).

The form of the maximum-entropy exponential model depends directly on the form of the set of constraints, i.e., on the sufficient statistic functions; the MaxEnt distribution depends on the data only through the sample averages of these functions.

• Parametric MaxEnt modeling culture: The traditional practice of MaxEnt density modeling assumes that the appropriate sufficient statistic functions are known or given beforehand, which in turn restricts the possible 'shape' of the probability distribution. This parametric MaxEnt modeling culture was first established by Ludwig Boltzmann in 1877 and later popularized by E. T. Jaynes in the 1960s.

• Nonparametric MaxEnt modeling culture: This proceeds by identifying a small set of the most important sufficient statistic functions from the data. In the next step, we build the MaxEnt probability distribution that agrees with these specially designed, relevant constraints. We have used LP-orthogonal polynomials to systematically and robustly design the constraining functions; a sketch of the empirical LP-basis construction is given below.
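Here is a self-contained R sketch of the eLP recipe; it is an illustrative implementation (QR factorization in place of explicit Gram-Schmidt), not the author's code.

```r
# eLP basis: mid-distribution transform, first-order basis T1, then
# orthonormalization of the powers T1^2, ..., T1^m under the empirical measure.
eLP_basis <- function(x, m = 4) {
  n    <- length(x)
  pmf  <- table(x) / n
  Fmid <- ecdf(x)(x) - 0.5 * pmf[as.character(x)]     # F_mid(x; F~)
  T1   <- (Fmid - mean(Fmid)) / sd(Fmid)              # first-order basis
  P    <- outer(as.numeric(T1), 1:m, "^")             # T1, T1^2, ..., T1^m
  B    <- qr.Q(qr(cbind(1, P)))[, -1, drop = FALSE]   # orthogonal to constant
  sweep(B, 2, sqrt(colMeans(B^2)), "/")               # unit empirical norm
}
# Columns of eLP_basis(x) are empirically orthonormal (up to sign), as in (6.2).
```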
Vladimir Vapnik (2020) calls the sufficient statistic functions 'predicates' and the associated moment constraints 'statistical invariants.' In describing his learning theory, he acknowledged that "The only remaining question in the complete statistical learning theory is how to choose a (small) set of predicates. The choice of predicate functions reflects the intellectual part of the learning problem." The question of how to systematically design and search for informative predicates was previously raised by Mukhopadhyay et al. (2012) in the context of learning MaxEnt probability models from data; see also Section 2.2 of the main article, where we discussed some concrete strategies to address this problem. It is important to note, however, that Vapnik and Izmailov (2020) work entirely within the traditional least-squares ($L^2$ risk) setup, rather than the maximum-entropy framework.

Table 4: Accident data. The raw empirical probability estimates $\tilde p(x,y)$.

References

Agresti, A. (2013). Categorical Data Analysis.
Aitchison, J. and Aitken, C. G. G. (1976). Multivariate binary discrimination by the kernel method.
Arbous, A. G. and Kerrich, J. E. (1951). Accident statistics and the concept of accident-proneness.
Barron, A. R. and Sheu, C.-H. (1991). Approximation of density functions by sequences of exponential families.
Benzécri, J.-P. (1991). Comment on Goodman's "Measures, models, and graphical displays in the analysis of cross-classified data".
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference.
Fienberg, S. E. (1971). Randomization and social affairs: the 1970 draft lottery.
Fienberg, S. E. and Rinaldo, A. (2007). Three centuries of categorical data analysis: Log-linear models and maximum likelihood estimation.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent.
Gao, W., Kannan, S., Oh, S., and Viswanath, P. (2017). Estimating mutual information for discrete-continuous mixtures.
Goodman, L. A. (1991). Measures, models, and graphical displays in the analysis of cross-classified data (with discussion).
Goodman, L. A. A single general method for the analysis of cross-classified data: reconciliation and synthesis of some methods of Pearson, Yule, and Fisher, and also some methods of correspondence analysis and association analysis.
Greenacre, M. (2018). Compositional Data Analysis in Practice.
Hoeffding, W. (1940). Massstabinvariante Korrelationstheorie.
Jaynes, E. T. (1957). Information theory and statistical mechanics.
Koehler, K. J. (1986). Goodness-of-fit tests for log-linear models in sparse contingency tables.
Kraskov, A., Stögbauer, H., and Grassberger, P. (2004). Estimating mutual information.
Moon, Y.-I., Rajagopalan, B., and Lall, U. (1995). Estimation of mutual information using kernel density estimators.
Mukhopadhyay, S. Large-scale mode identification and data-driven sciences.
Applied Stochastic Models in Business and Industry, special issue on "From data to constraints".
Parzen, E. and Mukhopadhyay, S. (2013). United statistical algorithms, LP-comoment, copula density, nonparametric modeling. 59th ISI World Statistics Congress (WSC).
Seok, J. and Kang, Y. S. (2015). Mutual information between discrete variables with many categories using recursive adaptive partitioning.
Simonoff, J. S. (1983). A penalty function approach to smoothing large sparse contingency tables.
Simonoff, J. S. (1985). An improved goodness-of-fit statistic for sparse multinomials.
Simonoff, J. S. (1995). Smoothing categorical data.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges.
Yates, F. (1934). Contingency tables involving small numbers and the χ² test.
Zelterman, D. (1987). Goodness-of-fit tests for large sparse multinomial distributions.
Zeng, X., Xia, Y., and Tong, H. (2018). Jackknife approach to the estimation of mutual information.
Vapnik, V. and Izmailov, R. (2020). Complete statistical theory of learning: learning using statistical invariants. Applied Stochastic Models in Business and Industry, special issue on "From data to constraints".