Estimating Gaussian mixtures using sparse polynomial moment systems

Julia Lindberg, Carlos Améndola, Jose Israel Rodriguez

June 29, 2021

Abstract. The method of moments is a statistical technique for density estimation that solves a system of moment equations to estimate the parameters of an unknown distribution. A fundamental question critical to understanding identifiability asks how many moment equations are needed to get finitely many solutions and how many solutions there are. We answer this question for classes of Gaussian mixture models using the tools of polyhedral geometry. Using these results, we present an algorithm that performs parameter recovery, and therefore density estimation, for high dimensional Gaussian mixture models that scales linearly in the dimension.

A fundamental problem in statistics is to estimate the parameters of a density from samples. This problem is called density estimation and formally it asks, "Given n samples from an unknown distribution p, can we estimate p?" To have any hope of solving this problem we need to assume our density lives in a family of distributions. One family of densities known as Gaussian mixture models is a popular choice due to its broad expressive power.

Theorem 1.1. [26, Ch. 3] A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific nonzero amount of error by a Gaussian mixture model with enough components.

Theorem 1.1 motivates our study of Gaussian mixture models. These are ubiquitous in the literature, with applications in modeling geographic events [56], the spread of COVID-19 [48], the design of planar steel frame structures [22], speech recognition [44, 55, 46], image segmentation [10] and biometrics [29].

A Gaussian random variable X has probability density function
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big),$$
where µ ∈ R is the mean and σ ∈ R_{>0} is the standard deviation. In this case we write X ∼ N(µ, σ²). A random variable X is the mixture of k Gaussians if its probability density function is the convex combination of k Gaussian densities. Here we write X ∼ Σ_{ℓ=1}^k λ_ℓ N(µ_ℓ, σ_ℓ²), where µ_ℓ ∈ R and σ_ℓ ∈ R_{>0} for all ℓ ∈ [k] = {1, . . . , k}, and (λ_1, . . . , λ_k) ∈ ∆_{k−1} = {λ ∈ R^k_{>0} : Σ_{i=1}^k λ_i = 1}. Each λ_ℓ, ℓ ∈ [k], is the mixture weight of the ℓ-th component.

A Gaussian k-mixture model is a collection of mixtures of k Gaussian densities. Often one imposes constraints on the means, variances or weights to define such models. For example, one might assume that all variances are equal or that the mixture weights are all equal. The former is known as a homoscedastic mixture model, and the latter is called a uniform mixture model in the literature. In this paper we consider three classes of Gaussian mixture models.

(1) The λ-weighted model, where the mixture weights are fixed for λ ∈ ∆_{k−1}.
(2) The λ-weighted homoscedastic model, which is the λ-weighted model under the additional assumption that the variances are equal.
(3) The λ-weighted known variance model, where the weights and variances are assumed known.

We wish to do parameter recovery for these Gaussian mixture models, that is, we would like to solve the following problem.

Problem 1.2. Given samples y_1, . . . , y_N distributed as the mixture of k Gaussian densities, recover the parameters µ_i, σ_i², λ_i for i ∈ [k].
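To make the input to Problem 1.2 concrete, the following small sketch (our own illustration; the helper name sample_gaussian_mixture is not from the paper) generates exactly this kind of data: i.i.d. samples from a univariate Gaussian k-mixture, drawn by first choosing a latent component with probabilities λ and then sampling from the corresponding Gaussian.

```python
import numpy as np

def sample_gaussian_mixture(weights, means, variances, N, rng=None):
    """Draw N samples from the univariate mixture sum_l weights[l] * N(means[l], variances[l])."""
    rng = np.random.default_rng(rng)
    weights = np.asarray(weights, dtype=float)
    components = rng.choice(len(weights), size=N, p=weights)   # latent component labels
    return rng.normal(np.asarray(means)[components],
                      np.sqrt(np.asarray(variances)[components]))

# Example: a 2-mixture. Problem 1.2 asks to recover the weights, means and
# variances below from the samples y alone.
y = sample_gaussian_mixture(weights=[0.3, 0.7], means=[-1.0, 2.0],
                            variances=[0.5, 1.5], N=10_000, rng=0)
```

The recovery task is then to reproduce the weights, means and variances from y alone.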
It is important to distinguish parameter recovery from density estimation. For density estimation, one wishes to estimate a density that is close to the true density, with no restriction on how close each individual component is. In this paper we wish to do parameter recovery. Namely, we wish to recover accurate estimates of the mean, variance and mixing weight of each component. It is clear that density estimation follows trivially once all of the parameters are known. The important distinction between density estimation and parameter recovery is illustrated next. There are two Gaussian mixture densities with these eight moments. These densities are shown in Figure 1, where it is seen that they are almost indistinguishable.

Problem 1.2 is well-defined because Gaussian mixtures are identifiable [50]. Specifically, one can recover the mean, variance and weight of each component if given the full mixture density.

One idea to solve Problem 1.2 is to use maximum likelihood estimation. Maximum likelihood estimation aims to maximize the likelihood function by solving the following optimization problem:
$$\max_{\lambda \in \Delta_{k-1},\; \mu \in \mathbb{R}^k,\; \sigma^2 \in \mathbb{R}^k_{>0}} \;\sum_{j=1}^{N} \log\Big(\sum_{\ell=1}^{k} \frac{\lambda_\ell}{\sigma_\ell\sqrt{2\pi}} \exp\!\Big(-\frac{(y_j-\mu_\ell)^2}{2\sigma_\ell^2}\Big)\Big). \tag{1.1}$$
Unless k = 1, (1.1) is a nonconvex optimization problem and obtaining a global optimum is difficult or impossible. In general (1.1) is unbounded, so no global maximum exists. Iterative algorithms such as expectation maximization (EM) try to find the largest local maximum [19] (a minimal sketch of the EM update is given below). On top of being sensitive to the starting point, another downside of the EM algorithm is that it needs to access all data in each iteration, which is prohibitive for applications with large data sets. Even worse, there is no bound on the number of critical points of the likelihood function and in general these estimates are transcendental [1].

Recent work has analyzed the local behavior of (1.1) by considering maximum likelihood estimation for two Gaussians in R^n when λ_1 = λ_2 = 1/2 and all of the covariance matrices are known and equal; this is a special case of the models we consider. It has been shown that in this regime the EM algorithm converges to the global optimum in 10 iterations [18]. Other work has studied the global landscape of the EM algorithm and the structure of local optima in this setting [15, 53]. Further work has considered inference for Gaussian mixture models with known mixing coefficients and identity covariance matrices [40] and clustering analysis of the mixture of two Gaussians where the covariance matrices are equal and unknown [14]. When these covariance matrices are further assumed to be spherical, [45] gives polynomial time approximation schemes for (1.1). Recently, techniques from numerical algebraic geometry have been used to identify the number of components in a Gaussian mixture model [47]. Further progress has been made on giving optimal sampling bounds needed to learn a Gaussian mixture model [5].

A recent revolution in robust statistics has led to the development of algorithms that learn the parameters for mixtures of multivariate Gaussians when some of the samples are corrupted. Computationally efficient algorithms with dimension-independent error guarantees for agnostically learning several fundamental classes of high-dimensional distributions, including a single Gaussian and mixtures of spherical Gaussians, were given in [21]. Clustering algorithms and learning algorithms for Gaussians that are well separated were given in [6] and [20] respectively.
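For orientation, here is a minimal sketch of the EM iteration discussed above, for a univariate k-mixture. It is the textbook update, not the method developed in this paper, and the function name and initialization are our own illustrative choices; note how every iteration touches all N samples, which is the cost highlighted in the text.

```python
import numpy as np

def em_univariate_gmm(y, k, iters=100, seed=0):
    """Plain EM for a univariate k-component Gaussian mixture (illustrative baseline only)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    N = len(y)
    # crude initialization: k distinct samples as means, overall variance, uniform weights
    mu = rng.choice(y, size=k, replace=False).astype(float)
    var = np.full(k, np.var(y))
    lam = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities r[j, l] = P(component l | y_j); uses every sample
        dens = np.exp(-(y[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = lam * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances from the responsibilities
        Nl = r.sum(axis=0)
        lam = Nl / N
        mu = (r * y[:, None]).sum(axis=0) / Nl
        var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / Nl
    return lam, mu, var
```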
For a mixture of two Gaussians with equal mixture weights, [33] gives an algorithm to learn each mean and variance up to a distance that is poly( ) from the true density. Recent work gives a polynomialtime algorithm for robustly learning the parameters of a high-dimensional mixture of a constant number of Gaussians [39] . A downside to this algorithm is that it runs in time polynomial in the sample size which can be prohibitive for large data sets. Another idea for density estimation in this set up is to use the generalized method of moments. The generalized method of moments was proposed in [27] and aims to minimize the difference between the fitted moments and the sample moments. For Gaussian mixture models, this again cannot be solved in a way guaranteeing global optimality due to the nonconvexity of the moment equations. Recently this method has been remedied for Gaussian mixture models in one dimension with the same variance parameter, where the authors provably and efficiently find the global optimum of the generalized method of moments [52] . It is important to note that in many of the cases above, assumptions are made on the values that each Gaussian component can take. In this paper we are interested in identifying the means, variances and weights of each mixture from its moments up to a certain order. It has been shown that one dimensional Gaussian k mixture models can be uniquely recovered using the first 4k − 2 moments [32] and using the first 3k − 1 moments generically gives finitely many solutions [3] . Multivariate Gaussians are still identifiable [54] and there exists a finite number of moments that identify them [8] . In this paper we propose using the method of moments to estimate the density arising from the mixture of k multivariate Gaussians. This methodology was first proposed and resolved for the mixture of two univariate Gaussians by Karl Pearson [43] . Pearson reduced solving this system of 6 polynomial equations in the 6 unknown density parameters, µ i , σ 2 i , λ i , i = 1, 2, to understanding the roots of a degree nine polynomial with coefficients in the sample moments m i , i ∈ [5] . The method of moments for Gaussian mixture models was revisited in 2010 in a series of papers [31, 41] . The case of a k = 2 mixture model in n dimensions was handled in [31] where a polynomial time algorithm was presented. This approach was generalized in [41] where an algorithm for a general k mixture model in n dimensional space was presented that scales polynomially in n and the number of samples required scales polynomially in the desired accuracy. It is important to note that these algorithms focus on density estimation, not parameter recovery. In other words, with high probability they return a density that is close to the true density with no guarantees on how close the estimated components are to the true ones. 1.1. Contributions. The main result of this paper is to present an algorithm (Algorithm 1) for parameter recovery for a mixture of k Gaussians in R n that scales linearly in n. Our results are organized as follows. We give a fresh presentation of the statistics and algebraic background needed for the study of method of moments. The main take away in this section is that the method of moments can be thought of as solving a system of sparse polynomial equations. By studying this system we can make conclusions about parameter recovery and density estimation. 
For example, we show that for a wide range of models the corresponding moment system has finitely many solutions, which leads to identifiability results. Moreover, the proofs of our results give rise to effective algorithms that implement the method of moments for mixtures of univariate Gaussians. As a surprising application of our theorems in the univariate case, we show that density estimation for Gaussian mixture models in R^n can be done using homotopy continuation by tracking O(n) paths.

In this section, we present some statistics and algebraic geometry background.

2.1. Method of moments. This paper focuses on an approach for parameter recovery known as the method of moments. The method of moments for parameter estimation is based on the law of large numbers. This approach expresses the moments of a density as functions of its parameters. Let f : R → R be the probability density function of a random variable X. For i ≥ 0, the i-th moment of X is
$$m_i = \mathbb{E}[X^i] = \int_{\mathbb{R}} x^i f(x)\, dx.$$
We consider a statistical model with n unknown parameters, θ = (θ_1, . . . , θ_n), and consider the moments up to order n as functions of θ, g_1(θ), . . . , g_n(θ). Assume y_1, . . . , y_N are independent samples from the same distribution. The r-th sample moment is given by
$$m_r = \frac{1}{N}\sum_{j=1}^{N} y_j^r.$$
The number of samples, N, needed to accurately estimate m_r depends on the distribution. The method of moments works by using samples from the statistical model to calculate sample moments m_1, . . . , m_n, and then solving the corresponding system m_i = g_i(θ), i = 1, . . . , n, for the parameters θ_1, . . . , θ_n.

The moments of Gaussian distributions are polynomials in the variables µ, σ² and can be calculated recursively as M_0(µ, σ²) = 1, M_1(µ, σ²) = µ and
$$M_i(\mu, \sigma^2) = \mu\, M_{i-1}(\mu, \sigma^2) + (i-1)\,\sigma^2\, M_{i-2}(\mu, \sigma^2), \qquad i \ge 2.$$
We calculate the i-th moment of a mixture of k Gaussian densities as the convex combination of M_i(µ_1, σ_1²), . . . , M_i(µ_k, σ_k²). We are interested in solving the system
$$f_i^k(\mu, \sigma^2, \lambda) := \sum_{\ell=1}^{k} \lambda_\ell\, M_i(\mu_\ell, \sigma_\ell^2) - m_i = 0, \tag{2.2}$$
under assumptions on the parameters µ, σ², λ, where i varies over an appropriate index set (see the numerical sketch below).

As stated in the introduction, for a k-mixture model with generic sample moments, the first 3k − 1 moments are needed to have a polynomial system with finitely many solutions [3]. This shows that Gaussian mixture models in one dimension are algebraically identifiable using the moment equations f_0^k = · · · = f_{3k−1}^k = 0. In other words, for a generic choice of sample moments, the polynomial system (2.2) for i = 0, . . . , 3k − 1 has finitely many solutions.

Remark 2.1. The concept of generic behavior is prevalent in applied algebraic geometry. Throughout the rest of this paper we consider generic Gaussian mixture models and sample moments. Formally, we require that the parameters of the model lie on a dense Zariski open set. In particular this means that the non-generic behavior is restricted to a Lebesgue measure-zero set. Furthermore, this measure zero set is a lower dimensional algebraic variety. For an illustration of genericity, see Example 2.3.

2.2. Statistically meaningful solutions. We note that for any set of real-valued sample moments it is not guaranteed that the moment equations will give any statistically meaningful solutions. A statistically meaningful solution is a real valued solution with positive variances and mixing weights. In other words, it is a solution that corresponds to a true density. If the sample moments are inaccurate, it may happen that no solution obtained from the method of moments is statistically meaningful.
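The following sketch (helper names are ours, not the paper's) implements the moment computations of Section 2.1, assuming the recursion M_i = µ M_{i−1} + (i−1)σ² M_{i−2} stated above, together with the statistical-meaningfulness filter just described: real parameters, positive variances, positive weights.

```python
import numpy as np

def gaussian_moment(i, mu, sigma2):
    """i-th moment M_i(mu, sigma^2) of N(mu, sigma^2) via the recursion
    M_0 = 1, M_1 = mu, M_i = mu*M_{i-1} + (i-1)*sigma2*M_{i-2}."""
    if i == 0:
        return 1.0
    m_prev, m = 1.0, mu
    for j in range(2, i + 1):
        m_prev, m = m, mu * m + (j - 1) * sigma2 * m_prev
    return m

def mixture_moment(i, lam, mu, sigma2):
    """i-th moment of the mixture: convex combination of the component moments."""
    return sum(l * gaussian_moment(i, m, s2) for l, m, s2 in zip(lam, mu, sigma2))

def sample_moment(y, r):
    """r-th sample moment (1/N) * sum_j y_j^r."""
    return np.mean(np.asarray(y) ** r)

def is_statistically_meaningful(lam, mu, sigma2, tol=1e-8):
    """True iff the candidate solution corresponds to a true density:
    real values, positive variances, positive mixing weights."""
    params = np.concatenate([lam, mu, sigma2])
    return (np.all(np.abs(np.imag(params)) < tol)
            and np.all(np.real(np.asarray(sigma2)) > tol)
            and np.all(np.real(np.asarray(lam)) > tol))
```

For a k-mixture, equating mixture_moment(i, ·) to the sample moments for the appropriate range of i yields the polynomial system (2.2); solving it is the subject of Section 3.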
By the law of large numbers, as the number of samples goes to infinity the sample moments will converge to the true moments and the method of moments will return a consistent estimator [51, Theorem 9.6] . A property of parameterized polynomial systems is that the number of real solutions is constant in open Euclidean sets of the parameter space. Such open sets can be computed via cylindrical algebraic decomposition [16] . The constraints differing real solutions and statistically meaningful ones will further divide these cells. Therefore, in any of these open cells the number of statistically meaningful solutions will be constant. So long as the sample moments lie in a cell that has at least one statistically meaningful solution, the method of moments will return a true density. 2.3. Sparse polynomial systems. We now review concepts from algebraic geometry, but for an introduction to sparse polynomial systems see [49, Ch. 3] . A family of sparse polynomials is defined by its monomial support. For each α = (α 1 , ..., α n ) ∈ N n ≥0 , we have the monomial x α := x α 1 1 · · · x αn n with exponent α. A sparse polynomial is a linear combination of monomials. Let A • = (A 1 , . . . , A k ) denote a k-tuple of nonempty finite subsets of N n ≥0 . A generic sparse polynomial system of equations with support A • is given by where the scalar coefficients {c i,α } α∈A i ,i∈[k] are generic. The terminology "sparse" is used because the monomial support A • is typically much smaller compared to the number of monomials up to a specific degree. Example 2.3. Consider the sparse polynomial system The monomial supports of f 1 and f 2 are the same and given by A i = {(0, 0), (1, 0), (0, 1), (1, 1)} for i = 1, 2. Solving f 1 for x in terms of y and substituting it into f 2 , we see that for generic However, for certain non-generic coefficients this will not be the case. For example, if 0, there will be fewer than two distinct isolated solutions. For instance, when Consider two polytopes P, Q ∈ R n . The Minkowski sum of P and Q is defined as Given s convex polytopes in R n , K 1 , . . . , K s , we consider the standard n−dimensional Euclidean volume of a linear combination of these polytopes where the sum here refers to the Minkowski sum and the λ i refers to a scaling. The polynomial ν(λ 1 , . . . , λ s ) is homogeneous of degree n in λ 1 , . . . , λ s . The mixed volume of s convex polytopes K 1 , . . . , K s in R n is the coefficient of the λ 1 λ 2 · · · λ s term in ν(λ 1 , . . . , λ s ). It is denoted MVol(K 1 , . . . , K s ). More information on mixed volumes can be found in [24] . Example 2.4. Consider P = Conv{(0, 0), (0, 1), (1, 0), (1, 1)}, the unit square. We compute MVol(P, P ) by considering The coefficient in front of λ 1 λ 2 is 2, giving MVol(P, P ) = 2. Example 2.5. The mixed volume of K 1 , . . . , K n is easy to describe when K i is a line segment from the origin to the vertex v i ∈ Z n . The Minkowski sum λ 1 K 1 +· · ·+λ n K n is a parallelpiped. Hence, its volume is given by a determinant: So MVol(K 1 , . . . , K n ) equals the absolute value of the determinant of the matrix with the vertices v 1 , . . . , v n as its columns. A celebrated series of papers [9, 34, 35] gives the connection between sparse polynomial systems described in Section 2.3 and polyhedral geometry described in Section 2.4. Consider The Newton polytope of f is the convex hull of its exponent vectors α ∈ A, denoted Newt(f ). . . , f n ) where F ∈ L(A) and the number of complex solutions to F = 0 is finite. 
Then the number of complex solutions to F = 0 is less than or equal to MVol(P 1 , . . . , P n ). Moreover, for generic choices of F ∈ L(A), equality holds. We illustrate Theorem 2.6 on a basic example from statistics. Example 2.7. The method of moments can be used to determine the parameters of a Gaussian distribution N (µ, σ 2 ). For sample moments m 1 , m 2 , these parameters are determined by solving the sparse polynomial system The mixed volume MVol(Conv(A 1 ), Conv(A 2 )) and the number of solutions to the system are both one. This is no coincidence by the BKK bound. Remark 2.8. The BKK theorem is usually stated without the hypothesis 0 n ∈ A i . In this case, the BKK bound counts the number of complex solutions with nonzero coordinates. To get a bound on the number of complex solutions, the assumption 0 n ∈ A i was added in [37] . Remark 2.9. A special case of the BKK bound is the Bézout bound [28, p. 47] . The Bézout bound says that if a polynomial system F = (f 1 , . . . , f n ) has finitely many solutions, then the number of complex solutions with nonzero coordinates is bounded above by Theorem 2.6 states that for a generic sparse polynomial system in L(A), the number of complex solutions equals the mixed volume of the system. An obstacle to applying this theorem is that the mixed volume is #P-hard to compute [23] . An important property of mixed volumes is that they are monotonic. Namely, if P 1 ⊆ P 1 then MVol(P 1 , P 2 , . . . , P n ) ≤ MVol(P 1 , P 2 , . . . , P n ). Therefore by Example 2.5, taking line segments Conv , is an easy way to get a lower bound on MVol(P 1 , . . . , P n ). We would like conditions under which such a lower bound is tight. Definition 2.10. [42, Definition 7.29] Let P 1 , . . . , P m be convex polytopes in R n . We say P 1 , . . . , P m are dependent if there is a non-empty subset I ⊆ [m] such that dim( i∈I P i ) < |I|. Otherwise we say P 1 , . . . , P m are independent. This definition may be difficult to parse on a first read. But it is related to the usual definition of linear independence: if each P i is a line through the origin, then the two ideas of dependent agree. Moreover, the collection of empty sets is independent. Given a nonzero vector w ∈ R n and a convex polytope P ⊆ R n , we consider the function P → R, x → w, x with (w 1 , . . . , w n ), (x 1 , . . . , x n ) := w 1 x 1 + · · · w n x n denoting the standard inner product. We denote the minimum value w takes on P by val w (P ), and the points of P where w is minimized by Init w (P ). This set of points is a face of P , which we call the face exposed by w. Specifically, we have val w (P ) = min x∈P w, x and Init w (P ) = {x ∈ P : w, x ≤ w, y for all y ∈ P }. the initial polynomial of f . For more background on initial polynomials, see [17, Chapter 2] . The following are equivalent: (1) MVol(P 1 , . . . , P n ) = MVol(Q 1 , . . . , Q n ) (2) One of the following holds: (a) P 1 , . . . , P n are dependent i.e. MVol(P 1 , . . . , P n ) = 0 (b) For each w ∈ R n \{0}, the collection of polytopes Proposition 2.11 gives conditions under which is suffices to consider the (potentially much simpler) polytopes Q i ⊆ P i to compute the mixed volume of P 1 , . . . , P n . Example 2.12. 
Consider the triangles P 1 = P 2 = Conv({(0, 0), (1, 0), (0, 1)}), and the line segments Direct computation shows We can also use Example 2.5 to prove MVol(Q 1 , Q 2 ) = 1 and Proposition 2.11 to prove MVol(P 1 , P 2 ) = MVol(Q 1 , Q 2 ) since for any nonzero w ∈ Z 2 , the collection of polytopes {Init w (Q i ) : Q i ∩ Init w (P i ) = ∅}, i = 1, 2, contains a single point so the collection is dependent. This type of argument, where we use the dependence of polytopes and Proposition 2.11, is also used in the proofs of Theorem 3.3 and Theorem 3.7. The following lemma will be of use later to apply Proposition 2.11 in the proof of our main results. If {v 1 , . . . , v n } are linearly independent, then the polytopes Proof. Fix w ∈ W . By the definition of dependent, we need to show that Recalling Since {v 1 , . . . , v n } are linearly independent, the only w that satisfies this is w = 0 n ∈ W . We are now able to present our first set of results: efficiently finding all complex solutions stemming from the moment equations. 3.1. Mixed volume of λ-weighted models. First consider a λ-weighted model with k mixture components. We consider the moment system where λ are known mixing coefficients, and f k i , i ∈ [2k] is as defined in (2.2). In this set up the unknowns are µ , σ , ∈ [k]. First, we record the following fact about the moment functions M k . Proof. We verify both by induction. For both identities, the base case k = 1 is immediate. Suppose ∂ ∂µ M k−1 (σ 2 , µ) = (k − 1)M k−2 . By the recursive relationship Now we prove our first algebraic identifiability result. Proof. By [28, Ch. 1, Section 5] it is enough to show the Jacobian of (3.1) is full rank at a generic point. The Jacobian of (3.1) is a 2k × 2k matrix with rows indexed by equations and columns indexed by the variables µ 1 , σ 1 , . . . , µ k , σ k : whereD k is the diagonal matrix given by (λ 1 , λ 1 , λ 2 , λ 2 , . . . , λ k , λ k ). Note that for nonzero λ , ∈ [k], J k is full rank if and only ifJ k is full rank. We now show thatJ k is full rank by induction on k. When k = 1, Note that µ k , σ 2 k only appear in the last two columns ofJ k . Further, by Lemma 3.1, the nonzero entries of each row ofJ k have higher degree than the previous row. Doing Laplace's cofactor expansion along the last two columns ofJ k , we get det(J k ) = det(J k−1 ) · µ 2k−2 k σ 2k−2 k + lower order terms. By induction, det(J k−1 ) is nonzero at generic (µ 1 , σ 2 1 , . . . , µ k−1 , σ 2 k−1 ). This shows that det(J k ) is a nonzero bivariate polynomial in µ k , σ k with leading coefficient det(J k−1 ). At generic µ k , σ k , det(J k ) does not vanish so we concludeJ k , and hence J k , is invertible at a generic point (µ 1 , σ 2 1 , . . . , µ k , σ 2 k ). Proposition 3.2 shows that for generic sample moments, the moment equations yield finitely many solutions. We now use Theorem 2.6 and Proposition 2.11 to give an upper bound on the number of complex solutions to (3.1) . Recall that if N is odd, the double factorial is defined as Proof. Consider the moment equations f k 1 , . . . , f k 2k as defined in (3.1) with variable ordering (µ 1 , σ 2 1 , . . . , µ k , σ 2 k ). Denote P i = Newt(f k i ). Let Q ⊂ P be the line segment defined as: where 0 2k ∈ R 2k is the vector of all zeros and e ∈ R 2k is the th standard basis vector. By Example 2.5 we have MVol(Q 1 , . . . , Q 2k ) = (1 · 3 · 5 · · · (2k − 1)) · (1 · 2 · 3 · · · k) = (2k − 1)!!k!. We want to use the equivalence of (1) and (2) in Proposition 2.11 to show MVol(P 1 , . . . 
, P 2k ) = (2k − 1)!!k!. Theorem 2.6 then gives that the number of complex solutions to (3.1) is bounded above by (2k − 1)!!k!. For a nonzero vector w ∈ R 2k , let We will show for each w ∈ R n \{0}, the collection of polytopes is dependent. By Lemma 3.4, (which we postpone to after this proof) I w is nonempty. Since each Q i is a one dimensional line segment, it suffices to show that for any nonzero w ∈ R 2k , w minimizes some Q i at a unique point for i ∈ I w . This follows from the definition of dependent since each Q i is one dimensional, so if Init w (Q i ) is a single point for some i ∈ I w , then i∈Iw Init w (Q i ) < |I w |. We look at two cases. • First, consider when 2 ∈ I w for all ∈ [k]. Since the origin is in Q i , we have 0 2k ∈ Init w (P 2 ) for all ∈ [k]. Since P 2 ⊂ R 2k ≥0 , this means val w (P 2 ) < 0 for all ∈ [k]. Hence, w i < 0 for some i ∈ [2k]. Let i be the index of the smallest element of w. If i is odd, then So Init w (Q i ) = {ie i } for i ∈ I w and we are done. Now consider when i is even. Recall, . . , e 2k , 2 e 1 , 2 e 3 , . . . , 2 e 2k−1 } = · P 2 , and so Init w (P 2 ) = ·Init w (P 2 ) for all ∈ [k]. Therefore, 2 ∈ I w for all ∈ [k] implies that P 2 cannot be minimized at e j for any even j. This implies that if i is even, 0 > w i > 2w j for some odd j (otherwise P i would be minimized at i 2 e i ). Let j be the index of the smallest odd element of w. In this case, P j would be minimized at je j so j ∈ I w . Hence Init w (Q j ) is {je j }. • Second, suppose 2 ∈ I w for some ∈ [k]. If Init w (P 2 ) ∩ Q 2 is a point, then we are done. Otherwise, we may assume P 2 is minimized by w at a face containing the line segment Q 2 = Conv({0 2k , e 2 }). This means 0 = w T 0 2k ≤ w T a ∀a ∈ P 2 . So w i ≥ 0 for all i ∈ [2k] because the vertices of P 2 are scaled standard basis vectors. With the fact each P i is in the nonnegative orthant, this implies This shows that I w = [2k] so by Lemma 2.13, the collection of polytopes {Init w (Q 1 ), . . . , Init w (Q 2k )} is dependent. Lemma 3.4. The index set I w as defined in the proof of Theorem 3.3 is nonempty for any nonzero w ∈ R 2k . We consider three cases, depending on Init w (P 2 ). • If either 0 2k or e 2 is in Init w (P 2 ), then Q 2 ∩ Init w (P 2 ) = ∅ and hence 2 ∈ I w . • Now suppose e j ∈ Init w (P 2 ) for some even j > 2. This means 2w j ≤ w i for all odd i and w j ≤ w m for all even m. Consider Then j 2 e j ∈ Init w (P j ). Since Q j = Conv({0 2k , j 2 e j }), we have j ∈ I w . • Now suppose e i ∈ Init w (P 2 ) for some odd i ≥ 1. This means w i ≤ w j for all odd j and 2w i ≤ w m for all even m. Consider Remark 3.5. An instance of the previous theorem is when the mixing weights are all equal. In this case, there are (2k − 1)!! solutions up to the standard label-swapping symmetry. There exist special techniques in numerical nonlinear algebra that exploit this symmetry for computational speed up. One such technique known as monodromy was recently used for this problem with success [4, Section 4.1]. Mixed volume of λ-weighted homoscedastic models. We now consider the λ−weighted homoscedastic model. This is when the means are unknown and the variances are unknown but all equal. In this case, a k mixture model has k + 1 unknowns. We address the high dimensional version of this problem in Section 4 which is also considered in recent work, for example [14] . We consider the moment system where λ are known mixing coefficients, and f k i , i ∈ [k + 1] is as defined in (2.2). 
In this set up the unknowns are µ , ∈ [k] and σ 2 . Again, the first step is to prove that this model is algebraically identifiable. Proof. This argument is analogous to the one given in Proposition 3.2. Again, we consider the Jacobian of (3.2) with rows indexed by equations and columns indexed by the variables σ 2 , µ 1 , . . . , µ k . It suffices to show that for generic, σ 2 , µ 1 , . . . , µ k the Jacobian, J k , is full rank. We proceed by induction on k. When k = 1, is full rank for λ 1 = 0. Now consider the (k+1)×(k+1) matrix J k for any k. By cofactor expansion along the last column, det(J k ) = det(J k−1 )µ k+1 k whereJ k−1 is the upper left k × k block ofJ k . By induction det(J k−1 ) is nonzero at generic (σ 2 , µ 1 , . . . , µ k−1 ) so det(J k ) is a nonzero univariate polynomial in µ k , and by Lemma 3.1 it has leading coefficient det(J k−1 ). Therefore, det(J k ) is not the zero polynomial and for generic µ k it does not vanish. This shows that det(J k ) = 0 at a generic point (σ 2 , µ 1 , . . . , µ k ), hence the Jacobian is full rank, giving that the variety defined by (3.2) is zero dimensional, i.e., has finitely many solutions. We bound the number of solutions to (3.2) for a generic λ-weighted homoscedastic mixture model using mixed volumes. Proof. Let P i = Newt(f k i ) where f k i is as defined in (3.2) with variable ordering (µ 1 , . . . , µ k , σ 2 ) for i ∈ [k + 1]. Define Q i ⊂ P i as follows: where 0 k+1 ∈ R k+1 is the vector of all zeros and e i ∈ R k+1 is the ith standard basis vector. By Example 2.5, MVol(Q 1 , . . . , Q k+1 ) = 1 · 3 · 4 · · · · (k + 1) = (k + 1)! 2 . As in the proof of Theorem 3.3, we want to show that MVol(P 1 , . . . , P k+1 ) = MVol(Q 1 , . . . , Q k+1 ) by using the equivalence of (1) and (2) in Proposition 2.11. Let I w be the set of indices such that Q i has a vertex in Init w (P i ). Specifically, By Lemma 3.8, I w is nonempty. Now we want to show that for any w ∈ Z k+1 \{0} the nonempty set of polytopes is dependent. Since Q i is a one dimensional line segment, it suffices to show that for any w, there exists an i ∈ I w such that Q i is minimized at a single vertex. We consider 2 cases. • Suppose 2 ∈ I w . If Init w (P 2 ) is a single point, then we are done. Otherwise, assume P 2 is minimized at Q 2 = Conv({0 k+1 , e k+1 }). This means w k+1 = 0 and w j ≥ 0 for all j ∈ [k], giving val w (P i ) = 0 so 0 ∈ Init w (P i ) for all i ∈ [k]. This shows that I w = [k + 1]. By Lemma 2.13, the collection of polytopes {Init w (Q 1 ), . . . , Init w (Q k+1 )} is dependent. • Now suppose 2 ∈ I w . If 1 ∈ I w and Q 1 is minimized at a single vertex then {Init w (Q i ) : i ∈ I w } is dependent, so we are done. If Q 1 is not minimized at a single vertex, then w 1 = 0 and w j ≥ 0 for all 2 ≤ j ≤ k. Since 2 ∈ I w , this gives that w k+1 > 2w j for all j ∈ [k]. Therefore, val w (P j ) = 0 for all j ∈ [k + 1] which shows that 0 ∈ Init w (P 2 ). This contradicts 2 ∈ I w . On the other hand, if i ∈ I w where i ≥ 3, either Q i is minimized at a single vertex or w i−1 = 0 and w j ≥ 0 for all j ∈ [k + 1]\(i − 1). In the latter case, this shows val w (P i ) = 0 for all i ∈ [k + 1], contradicting 2 ∈ I w . Lemma 3.8. The index set I w as defined in the proof of Theorem 3.7 is nonempty for any nonzero w ∈ R k+1 . Proof. If w is in the nonnegative orthant, then w minimizes P i at the origin for all i ∈ [k + 1]. In this case I w = [k + 1] = ∅. Now let i be the index of the smallest element of w. If i = k + 1 then w minimizes P 2 and Q 2 at {e k+1 }. 
If i < k + 1 then w minimizes P i+1 and Q i+1 at {(i + 1)e i }, which shows I w is nonempty. Finding all solutions using homotopy continuation. To do parameter recovery for any of the set ups described in Section 3.1 and Section 3.2, it is not enough to know the number of complex solutions to the moment equations, we need a way to find all of them. Finding all complex solutions to a zero dimensional variety is a well-studied problem in numerical algebraic geometry. We outline the basics of homotopy continuation below but give [7] for a detailed reference. Consider the system of polynomial equations where the number of complex solutions to F (x) = 0 is finite and x = (x 1 , . . . , x n ). The main idea is to construct a homotopy [38] (3.4) such that these three conditions hold: (1) the solutions to H(x, 0) = G(x) = 0 are trivial to find, (2) there are no singularities along the path t ∈ [0, 1), and (3) all isolated solutions of H(x, 1) = F (x) = 0 can be reached. Using a random γ ∈ C ensures that (2) is met. Continuation methods known as predictor-corrector methods are used to track the solutions from G(x) = 0 to F (x) = 0 as t varies from 0 to 1 [13] . It is standard terminology to call G(x) = 0 the start system. There are many choices for constructing a start system. A total degree start system is of the form The number of solutions to G(x) = 0 is d 1 · · · d n , which is the Bézout bound of G(x). Using this start system, one must track d 1 · · · d n paths in order to find all solutions to F (x) = 0. If F (x) = 0 has close to d 1 · · · d n solutions this is a reasonable choice. If F (x) = 0 is a polynomial system with fewer solutions than the Bézout bound, then tracking d 1 · · · d n paths is unnecessary. Instead, one can leverage the system's Newton polytopes to construct a start system that has as many solutions as the mixed volume. One example of this is the polyhedral homotopy method [30] which tracks N paths where N is the BKK bound described in Theorem 2.6. This method constructs several binomial start systems, that is, systems formed by polynomials with precisely two terms. The main drawback to polyhedral methods is that there is often a computational bottleneck associated with computing the start system. Our related approach circumvents this bottleneck and relies on the following lemma. The collection of Newton polytopes of a polynomial system F = (f 1 , . . . , f n ) is denoted by Newt(F ) = (Newt(f 1 ), . . . , Newt(f n )). Lemma 3.9. Suppose G(x) = 0 is a general sparse binomial system such that MVol(Newt(G)) = MVol(Newt(F )) and Newt(G) ⊆ Newt(F ) element-wise. If the origin is in each Newton polytope of G, then the three items above hold for the homotopy (3.4). Proof. By Theorem 2.6 the number of solutions for G(x) = 0 equals the generic number of solutions for a polynomial system with Newton polytopes Newt(F ). Since Newt(G) ⊆ Newt(F ) and γ is generic, we have Newt(F ) = Newt(γ(1 − t)G + tF ) for t ∈ (0, 1]. So the mixed volume, and therefore the number of solutions, of (1−t)G+tF agrees with the mixed volume of Newt(F ). The fact that the total degree homotopy works is a special case of the previous lemma applied to polynomials with full monomial support. Combining Lemma 3.9 with Theorem 3.3 and Theorem 3.7 we get the following corollary. Corollary 3.10. The binomial system induced by the polytopes Q i in the proofs of Theorem 3.3 and Theorem 3.7 constructs an optimal homotopy continuation start system for the corresponding moment system. Proof. 
In this proof we construct the homotopy; give its start points; and show that it is optimal. We only show the details for the case of Theorem 3.3 because the other statement's proof is analogous. Consider the binomial system where a , b , c , d ∈ C * are generic for ∈ [k]. Since each g 2 −1 and g 2 is a univariate polynomial in distinct variables, multiplying the degrees we know that there are (2k − 1)!!k! solutions. This number agrees with the mixed volume of the respective moment system by Theorem 3.3. Moreover, the solutions are the start points of the homotopy i are defined as in (3.1). By Lemma 3.9 the result follows. Corollary 3.10 bypasses the computational bottleneck associated with polyhedral homotopy methods. Therefore, the proofs of each theorem give the number of complex solutions to the corresponding variety and provide an efficient way to find all of them. Example 3.11. Consider (3.1) when k = 2 and λ = (1/2, 1/2). Here we have In this case we consider the start system: . This gives six start solutions of the form (µ 1 , σ 2 1 , µ 2 , σ 2 2 ): (10, 12, η · 3, 2), (10, 12, η 2 · 3, 2), (10, 12, 3, 2) (10, 12, η · 3, −2), (10, 12, η 2 · 3, −2), (10, 12, 3, −2) where η is a primitive third root of unity. We chose integers as the coefficients for ease of exposition. In practice, random complex numbers with norm close to one are used as the coefficients. For the λ-weighted model discussed in Section 3.1, the Bézout bound of (3.1) is (2k)! but Theorem 3.3 showed that the mixed volume is (2k − 1)!!k!. The limit of (2k)! (2k − 1)!!k! tends to infinity as k → ∞, showing that for large k, our start system is significantly better than a total degree one. Remark 3.12. For the λ−weighted homoscedastic model discussed in Section 3.2, Theorem 3.7 shows that there are at most (k+1)! 2 solutions even though the Bézout bound is (k + 1)!. In this case using our start system, one would track half as many paths as a total degree start system would. A final case of interest is the λ-weighted, known variance model. This is where only the means are unknown. This set up was considered in high dimensions in [15, 18, 53] . We consider the moment system (3.6) f k 1 (µ, σ 2 1 , λ) = · · · = f k k (µ, σ 2 k , λ) = 0 where λ are known mixing coefficients, σ 2 is a known variance, and f k , ∈ [k] is as defined in (2.2) . In this set up, the unknowns are µ , ∈ [k]. Theorem 3.13 (Means unknown). A λ-weighted, known variance Gaussian k-mixture model is algebraically identifiable using moment system (3.6). Moreover, for generic λ , σ 2 , m , ∈ [k], the moment system (3.6) has k! solutions. Proof. First we observe that (3.6) generically has finitely many solutions by the same arguments in the proof of Proposition 3.2 and Proposition 3.6. This proves the first part of the theorem. It follows from the Bézout bound that there are at most k! complex solutions to (3.6) . We now show that this bound is generically achieved with equality by giving parameter values where there are k! solutions. Consider λ = 1 k , σ 2 = 1 and m = k . It is clear that this has a solution of the form µ = (1, 2, . . . , k) . Further, by the same induction argument involving the Jacobian of (3.6) referenced above, there are finitely many solutions for this set of parameters. We observe that in this case our solution set has the typical label-swapping symmetry. This shows that any action by the symmetric group on k letters, S k , on any solution is also a solution. Therefore, there are k! solutions to (3.6) in this case namely {σ · (1, 2, . . 
. , k) : σ ∈ S k }. Corollary 3.14 (Equal mixture weights and variances). A generic Gaussian k−mixture model with uniform mixing coefficients and known and equal variances is identifiable up to label-swapping symmetry using moments m 1 , . . . , m k . As a consequence of Theorem 3.13, when only the means are unknown the Bézout bound is equal to the BKK bound. In this case using polyhedral homotopy gives no advantage to total degree. As discussed in Corollary 3.14, when the mixture weights and variances are known and equal the standard label-swapping symmetry observed with mixture models gives only one solution up to symmetry. Tracking a single path from a total degree start system, this one solution is easy to find. Example 3.15. When k = 2, λ 1 = λ 2 = 1 2 and σ 2 1 = σ 2 2 = σ 2 is a known parameter, we can symbolically solve the corresponding moment system and see that up to symmetry This shows that so long as −m 2 1 + m 2 − σ 2 > 0 we are guaranteed to get something statistically meaningful. A picture of that region in R 3 is shown in Figure 4 . Section 3 gives upper bounds on the number of solutions to the moment equations for Gaussian mixture models where some parameters of the model are assumed known. Using homotopy continuation algorithms we can efficiently perform density estimation in these cases. We now use our results to do density estimation on Gaussian mixture models in high dimensions. 4.1. High-dimensional Gaussian mixture models. A random variable X ∈ R n is distributed as a multivariate Gaussian with mean µ ∈ R n and symmetric positive definite covariance matrix Σ ∈ R n×n , if it has density f X (x 1 , . . . , x n |µ, Σ) = ((2π) n det( We denote X ∼ N (µ, Σ). A random variable is distributed as the mixture of k multivariate Gaussians if it is the convex combination of k multivariate Gaussian densities. It has probability density function where (λ 1 , . . . , λ k ) ∈ ∆ k−1 , µ ∈ R n , and Σ ∈ R n×n is symmetric and positive definite for ∈ [k]. Here we write X ∼ k =1 λ N (µ , Σ ). Let f X : R n → R be the probability density function of a random vector X = (X 1 , . . . , X n ). The (i 1 , . . . , i n )−th moment of X is where i s ≥ 0 for all s ∈ [n]. The non-negative integer i 1 + . . . + i n = d is called the order of m i 1 ,...,in . We can get the explicit polynomials m i 1 ,...,in of a Gaussian mixture using the moment generating function. This gives the identity: Using Taylor's formula, we can expand the left side of (4.1) and equate coefficients of each side to get m i 1 ,...,in . Note that m 0,...,0 = 1. Here we have X ∼ λ 1 N (µ 1 , Observe that our convention is to use the first index to identify the component of the mixture. The moments up to order three are (4.2) m 00 = λ 1 + λ 2 m 10 = λ 1 µ 11 + λ 2 µ 21 m 01 = λ 1 µ 12 + λ 2 µ 22 m 20 = λ 1 (µ 2 11 + σ 111 ) + λ 2 (µ 2 21 + σ 211 ) m 11 = λ 1 (µ 11 µ 12 + σ 112 ) + λ 2 (µ 21 µ 22 + σ 212 ) m 02 = λ 1 (µ 2 12 + σ 122 ) + λ 2 (µ 2 22 + σ 222 ) m 30 = λ 1 (µ 3 11 + 3µ 11 σ 111 ) + λ 2 (µ 3 21 + 3µ 21 σ 211 ) m 21 = λ 1 (µ 2 11 µ 12 + 2µ 11 σ 112 + µ 12 σ 111 ) + λ 2 (µ 2 21 µ 22 + 2µ 21 σ 212 + µ 22 σ 211 ) m 12 = λ 1 (µ 11 µ 2 12 + µ 11 σ 122 + 2µ 12 σ 112 ) + λ 2 (µ 21 µ 2 22 + µ 21 σ 222 + 2µ 22 σ 212 ) m 03 = λ 1 (µ 3 12 + 3µ 12 σ 122 ) + λ 2 (µ 3 22 + 3µ 22 σ 222 ). The m 0,0,...,0,is,0,...0 −th moment is the same as the i s −th order moment for the univariate Gaussian mixture model k =1 λ N (µ s , σ ss ). 
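A quick numerical check of this marginalization property, with toy parameters of our own choosing (two components in R^3 with diagonal covariances for simplicity): the moment of order (0, . . . , i, . . . , 0) of the multivariate mixture matches the i-th moment of the corresponding univariate marginal mixture.

```python
import numpy as np

rng = np.random.default_rng(1)

# A 2-component mixture in R^3 with diagonal covariances (toy parameters, not from the paper).
lam   = np.array([0.4, 0.6])
mu    = np.array([[0.0,  1.0, -2.0],
                  [3.0, -1.0,  0.5]])   # mu[l] is the mean of component l
sigma = np.array([[1.0, 0.5, 2.0],
                  [0.3, 1.5, 1.0]])     # sigma[l, s] = variance of coordinate s in component l

# Draw samples from the multivariate mixture.
N = 200_000
labels = rng.choice(2, size=N, p=lam)
X = rng.normal(mu[labels], np.sqrt(sigma[labels]))

def univariate_moment(i, m, s2):
    """M_i(m, s2) via the recursion M_i = m*M_{i-1} + (i-1)*s2*M_{i-2}."""
    if i == 0:
        return 1.0
    a, b = 1.0, m
    for j in range(2, i + 1):
        a, b = b, m * b + (j - 1) * s2 * a
    return b

# The moment of order (0, i, 0) of X equals the i-th moment of the univariate mixture
# of the second coordinate, i.e. sum_l lam[l] * M_i(mu[l, 1], sigma[l, 1]).
s, i = 1, 3
empirical = np.mean(X[:, s] ** i)                                  # sample moment m_{i e_s}
model = sum(lam[l] * univariate_moment(i, mu[l, s], sigma[l, s]) for l in range(2))
print(empirical, model)   # close for large N
```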
This observation is key to our proposed density estimation algorithm in Section 4.2 and it follows from the property that marginal distributions of a Gaussian are Gaussian themselves. Dimension reduction and recovery algorithm. We propose an algorithm for density estimation of multivariate Gaussian densities using the method of moments. The main idea is that if we use higher order moment equations, density estimation for multivariate Gaussian mixture models reduces to multiple instances of density estimation for univariate Gaussian mixture models. The algorithm is described in Algorithm 1. = {m a 1 , . . . , m a N : a j ∈ N n } that are the moments to multivariate Gaussian mixture model: where N = k 2 (n 2 − n) and m 2 is any set of sample moments with polynomials where the off diagonal entries of Σ are linear for ∈ [k]. Output: Parameters λ ∈ R, µ ∈ R n , Σ 0 for ∈ [k] such that m 1 , m 2 are the moments of distribution k =1 λ N (µ , Σ ). 1 Solve the general univariate case using sample moments {m e 1 , m 2e 1 , . . . , m (3k−1)e 1 } to get parameters λ , µ ,1 and σ ,1,1 . 2 Select the statistically meaningful solution with sample moment m 3ke 1 . 3 Using the mixing coefficients λ and sample moments {m e i , . . . , m 2ke i } to solve (3.1) n − 1 times to obtain µ i and σ ii for ∈ [k], 2 ≤ i ≤ n 4 Select the statistically meaningful solution with sample moment m (2k+1)e i for 2 ≤ i ≤ n. 5 Using m 2 , solve the remaining system of N linear equations in N unknowns 6 Return λ , µ , Σ , ∈ [k]. Remark 4.2. There are many possible choices for the input m 2 . To avoid cumbersome and unnecessary notation for the reader, we do not explicitly list all such options. Our personal preference is to only use second and third order moments. In Step 1 we solve the general univariate case to obtain λ , µ 1 , σ 2 11 for = 1, 2. We find up to symmetry that there are two statistically meaningful solutions: The clear benefit to Algorithm 1 is that it avoids the curse of dimensionality. We observe that Step 1 is the most computationally prohibitive step since density estimation for univariate Gaussian mixture models with many components is difficult. Currently, the only known way to solve Step 1 is by finding all complex solutions then selecting the statistically meaningful one closest to the next sample moment. It has been shown, in [43, 2, 4] respectively, that for a mixture of k = 2, 3 and 4 Gaussian densities there are 9 · 2!, 225 · 3! and 10350 · 4! complex solutions. Since the number of complex solutions is known in each of these cases, state of the art polynomial system solvers can exploit the label swapping symmetry to find all solutions quickly. Also, unlike varying n, which just changes the dimension of the problem, changing k gives a new class of problems -for example, a one-mixture is qualitatively different from a two-mixture. Our algorithm hints that every k deserves its own independent analysis. Remark 4.4. Using the homotopy continuation methods discussed in Section 3.3, we can efficiently solve Step 3 of Algorithm 1. Using these continuation methods for fixed k, the number of homotopy paths we need to track is O(n). One observation for implementation is that Steps 3 and 4 can be performed in parallel. Now we establish preliminary results for the proof of correctness of Algorithm 1. Proof. When k = 1 the result is trivial. Lazard proved the result when k = 2 [36] . 
For k = 3, 4, we observe this statement is equivalent to saying the polynomial system where m 3k is considered as a variable has solutions where the m 3k coordinate has multiplicity k!. Since multiplicity is generic behavior, we find all solutions to this polynomial system when k = 3, 4 for generic m 1 , . . . , m 3k−1 ∈ C and verify the result numerically using HomotopyContinuation.jl [12] and the certificate described in [11] . We conjecture that Lemma 4.5 is true for all k. It has been shown that every Gaussian mixture model is uniquely identifiable by the first 4k − 2 moments [32] and that if all means are equal, this bound is tight [25] . A generic Gaussian mixture model does not have equal means, which distinguishes the conjectured generic bound of 3k moments from the proven upper bound of 4k − 2. We consider a lemma similar to Lemma 4.5. Lemma 4.6. A generic Gaussian k mixture model for k ≤ 4 with known mixing coefficients is uniquely identifiable up to symmetry using moments m 1 , . . . , m 2k+1 . Proof. When k = 1 the result is trivial. When k = 2, 3, 4 we provably find all solutions using Theorem 3.3 then compute the multiplicity of the coordinate m 2k+1 as in Lemma 4.5. Theorem 4.7. For a generic Gaussian k mixture model with k ≤ 4 in R n , with sets of moments m 1 and m 2 , Algorithm 1 is guaranteed to recover the parameters λ , µ , Σ , ∈ [k]. Proof. The proof of correctness follows from Lemma 4.5 and Lemma 4.6. The philosophy behind Algorithm 1 extends to any k. The main bottleneck is that state of the art numerical methods to solve Step 1 by finding all complex solutions are exhausted for k ≥ 5. The benefit of finding all complex solutions is that we are guaranteed to find all of the statistically meaningful ones. While the number of complex solutions to the moment equations seems to grow exponentially with k, our computational results show that typically there are few statistically meaningful ones. This motivates future work to efficiently find only the statistically meaningful solutions. Remark 4.8. In Step 2 and Step 4, strict identifiability results are needed to guarantee correctness. This is already settled if one identifies the desired statistically meaningful solution from sample moments m 3k , . . . , m 4k−2 instead of just m 3k as done in Algorithm 1. Remark 4.9. Although Algorithm 1 uses higher order moments, this order doesn't exceed what we would need for the univariate case. In addition, it has been shown that there exist algebraic dependencies among lower order moment equations complicating the choice of which moment system to consider [3] . Example 4.10. We perform numerical experiments by running Algorithm 1 on randomly generated Gaussian mixture models with diagonal covariance matrices. We use HomotopyContinuation.jl to do all of the polynomial system solving [12] . The average running time and error for k = 2 are given in Table 1 and for k = 3 in Table 2 . Overall, we see that the error incurred from doing homotopy continuation is negligible. where v ∈ R 4n+2 is a vector of the true parameters andv is a vector of the estimates. The normalized error is /(4n + 2). where v ∈ R 6n+3 is a vector of the true parameters andv is a vector of the estimates. The normalized error is /(6n + 3). Uniform mixtures with equal variances. We consider the special case of estimating the mixture of k Gaussians in R n where all of the means µ i ∈ R n are unknown, each λ i = 1 k and each covariance matrix Σ i ∈ R n×n is equal and known. This is the set up in [15, 18, 40, 53] . 
The fundamentals of Algorithm 1 can be applied with even greater success. Recall from Corollary 3.14 that in one dimension, there is generically a unique solution up to symmetry to (3.6) that can be found efficiently. We use this fact in Step 1 of Algorithm 2. In Step 2 there are other choices for sample moments that will give a square linear system. Some of these choices involve sample moments of lower order.

Observe that Step 1 of Algorithm 2 requires tracking a single homotopy path. This is in contrast to Step 1 of Algorithm 1, in which one needs to track many homotopy paths to obtain all complex solutions. Further, Algorithm 2 requires solving a k × k linear system n − 1 times (Step 2). This is again in contrast to Algorithm 1, where one needs to solve a nonlinear polynomial system that tracks (2k − 1)!!k! paths n − 1 times (Step 3). In both cases, we see that we need to solve n polynomial systems, where n is the dimension of the Gaussian mixture model. If we consider the tracking of a single homotopy path as unit cost, we consider the number of paths tracked as the complexity of the algorithm (as is customary in numerical algebraic geometry). With this choice, Algorithm 2 tracks n homotopy paths, while Algorithm 1 tracks (2k − 1)!!k!(n − 1) + N_k paths, where N_k is the number of homotopy paths needed to complete Step 1 in Algorithm 1. This highlights how Algorithm 2 is highly efficient and how the method of moments can be effective for large n and k.

Algorithm 2.
Output: Parameters µ_ℓ ∈ R^n, ℓ ∈ [k], such that m_i, i ∈ [n], are the moments of the distribution Σ_{ℓ=1}^k (1/k) N(µ_ℓ, Σ).
1. Using mixing coefficients λ_ℓ = 1/k for ℓ ∈ [k] and sample moments m_1, solve (3.6) to obtain µ_{ℓ1} ∈ R for ℓ ∈ [k].
2. Using sample moments m_i, solve the k × k linear system in µ_{1i}, . . . , µ_{ki} for 2 ≤ i ≤ n.
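The text above does not pin down which sample moments make Step 2 of Algorithm 2 linear. One choice that works when the known covariance Σ is diagonal is sketched below, using the cross moments E[X_1^j X_i] for j = 0, . . . , k − 1: with the first coordinates µ_{ℓ1} already recovered in Step 1, these moments are linear in the unknown coordinates µ_{ℓi}. This is an illustrative assumption on our part, not necessarily the moment set m_i the authors intend.

```python
import numpy as np

def univariate_moment(j, m, s2):
    """M_j(m, s2) via the recursion M_j = m*M_{j-1} + (j-1)*s2*M_{j-2}."""
    if j == 0:
        return 1.0
    a, b = 1.0, m
    for t in range(2, j + 1):
        a, b = b, m * b + (t - 1) * s2 * a
    return b

def recover_coordinate_i(X, i, mu1, Sigma_diag):
    """Recover the i-th coordinates of the k means of a uniform mixture with known
    diagonal covariance, given the first coordinates mu1 from Step 1.

    Uses the k cross moments E[X_1^j X_i], j = 0..k-1. Within each component the
    coordinates are independent, so E[X_1^j X_i | component l] equals
    M_j(mu1[l], Sigma_diag[0]) * mu_{l,i}; the moment equations are therefore
    linear in the unknowns mu_{l,i}.  (Illustrative sketch, not the paper's code.)"""
    k = len(mu1)
    A = np.array([[univariate_moment(j, mu1[l], Sigma_diag[0]) / k for l in range(k)]
                  for j in range(k)])
    b = np.array([np.mean(X[:, 0] ** j * X[:, i]) for j in range(k)])
    return np.linalg.solve(A, b)
```

For generic distinct values µ_{ℓ1} the coefficient matrix is invertible: M_j is a monic degree-j polynomial in the mean, so the matrix differs from a Vandermonde matrix by a triangular change of basis.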
References

[1] Maximum likelihood estimates for Gaussian mixtures are transcendental.
[2] Moment varieties of Gaussian mixtures.
[3] Algebraic identifiability of Gaussian mixtures.
[4] Solving parameterized polynomial systems with decomposable projections.
[5] Abbas Mehrabian, and Yaniv Plan. Near-optimal sample complexity bounds for robust learning of Gaussian mixtures via compression schemes.
[6] Outlier-robust clustering of Gaussians and other non-spherical mixtures.
[7] Numerically solving polynomial systems with Bertini.
[8] Polynomial learning of distribution families.
[9] The number of roots of a system of equations.
[10] A spatially constrained mixture model for image segmentation.
[11] Certifying zeros of polynomial systems using interval arithmetic.
[12] HomotopyContinuation.jl: A package for homotopy continuation in Julia.
[13] Numerical methods for ordinary differential equations.
[14] CHIME: clustering of high-dimensional Gaussian mixtures with EM algorithm and its optimality.
[15] Likelihood landscape and local minima structures of Gaussian mixture models.
[16] Quantifier elimination for real closed fields by cylindrical algebraic decomposition.
[17] An introduction to computational algebraic geometry and commutative algebra.
[18] Ten steps of EM suffice for mixtures of two Gaussians.
[19] Maximum likelihood from incomplete data via the EM algorithm.
[20] Robustly learning any clusterable mixture of Gaussians.
[21] Robust estimators in high-dimensions without the computational intractability.
[22] Gaussian mixture model for robust design optimization of planar steel frames.
[23] On the complexity of computing mixed volumes.
[24] Combinatorial convexity and algebraic geometry.
[25] Moment-based learning of mixture distributions.
[26] Deep Learning.
[27] Large sample properties of generalized method of moments estimators.
[28] Algebraic geometry.
[29] Gaussian mixture modeling of keystroke patterns for biometric applications. Systems, Man and Cybernetics.
[30] A polyhedral method for solving sparse polynomial systems.
[31] Efficiently learning mixtures of two Gaussians.
[32] Ankur Moitra, and Gregory Valiant. Disentangling Gaussians. Communications of the ACM.
[33] Robust learning of mixtures of Gaussians.
[34] Newton polyhedra, and the genus of complete intersections.
[35] Polyèdres de Newton et nombres de Milnor.
[36] Injectivity of real rational mappings: the case of a mixture of two Gaussian laws.
[37] The BKK root count in C^n.
[38] Numerical solution of multivariate polynomial systems by homotopy continuation methods.
[39] Settling the robust learnability of mixtures of Gaussians.
[40] The landscape of empirical risk for nonconvex losses.
[41] Settling the polynomial learnability of mixtures of Gaussians.
[42] How many zeroes? Counting the number of solutions of systems of polynomials via geometry at infinity.
[43] Contributions to the mathematical theory of evolution.
[44] A Gaussian mixture modeling approach to text-independent speaker identification.
[45] Learning mixtures of arbitrary Gaussians.
[46] Large margin Gaussian mixture modeling for phonetic classification and recognition.
[47] Identifying the number of components in Gaussian mixture models using numerical algebraic geometry.
[48] Modeling and prediction of COVID-19 pandemic using Gaussian mixture model.
[49] Solving systems of polynomial equations.
[50] Identifiability of finite mixtures. The Annals of Mathematical Statistics.
[51] All of Statistics. Springer Texts in Statistics.
[52] Optimal estimation of Gaussian mixtures via denoised method of moments.
[53] Global analysis of Expectation Maximization for mixtures of two Gaussians.
[54] On the identifiability of finite mixtures. The Annals of Mathematical Statistics.
[55] Using Gaussian mixture modeling in speech recognition.
[56] Early identification of an impending rockslide location via a spatially-aided Gaussian mixture model.