Estimating the Square Root of Probability Density Function on Riemannian Manifold

Xia Hong
Department of Computer Science, School of Mathematical and Physical Sciences, University of Reading, UK, e-mail: x.hong@reading.ac.uk

Junbin Gao
Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney, NSW 2006, Australia, e-mail: junbin.gao@sydney.edu.au

Abstract We propose that the square root of a probability density function can be represented as a linear combination of Gaussian kernels. It is shown that, if the Gaussian kernel centres and the kernel width are known, the maximum likelihood parameter estimator can be formulated as a Riemannian optimization problem on the sphere manifold. The first order Riemannian geometry of the sphere manifold and the vector transport are explored first, and then the well-known Riemannian conjugate gradient algorithm is used to estimate the model parameters. For completeness, the k-means clustering algorithm and a grid search are employed to determine the centres and the kernel width, respectively. Simulated examples demonstrate that the proposed approach is effective in constructing an estimate of the square root of a probability density function.

1 Introduction

The importance of the probability density function (PDF) is evidenced by intensive theoretical research and by many data analysis and pattern recognition applications [McLachlan and Peel, 2000, Silverman, 1986, Duda and Hart, 1973, Chen et al., 2010, Rutkowski, 2004, Yin and Allinson, 2001, Dempster et al., 1977, Parzen, 1962]. The Parzen window (PW) estimator and the finite mixture models, especially a mixture of Gaussians, are two widely researched and popular estimators.
The mixture of Gaussians model represents an underlying PDF as a linear combination of Gaussian kernels, and the maximum likelihood (ML) estimates of the mixture model parameters can be obtained using the expectation-maximization (EM) algorithm. Alternatively, the Parzen window (PW) estimator [Parzen, 1962] can be regarded as a special case of the finite mixture model [McLachlan and Peel, 2000] in which the number of mixture components equals the number of training data samples and all the mixing weights are equal, providing a simple and practical PDF estimator. However, if the number of training data samples is very large, the point density estimate produced by the PW estimator for a future data sample can be computationally expensive. The ML optimisation for a general finite Gaussian mixture is a highly nonlinear optimisation process requiring extensive computation, although the EM algorithm for Gaussian mixture models enjoys an explicit iterative form [Bilmes, 1998]. There is also considerable interest in sparse PDF estimation, which can be summarised into two categories. The first category is based on constrained optimization [Weston et al., 1999, Vapnik and Mukherjee, 2000, Girolami and He, 2003, Hong et al., 2015]. The second category of sparse kernel density estimators constructs the PDF estimator in a forward regression manner [Choudhury, 2002, Chen et al., 2004b, Chen et al., 2004a, Chen et al., 2008, Hong et al., 2013].

Riemannian optimisation algorithms have recently been studied on many types of matrix manifolds, such as the Stiefel manifold, the Grassmann manifold and the manifold of positive definite matrices, see Section 3.4 of [Absil et al., 2008]. Since Riemannian optimisation works directly on the curved manifold, constraints such as orthogonality can be eliminated to obtain an unconstrained optimisation problem that, by construction, only uses feasible points. This allows one to incorporate Riemannian geometry in the resulting optimisation problems, thus producing far more accurate numerical results. Riemannian optimisation has been successfully applied in machine learning, computer vision and data mining tasks, including fixed low rank optimisation [Mishra et al., 2013], Riemannian dictionary learning [Harandi et al., 2014], and computer vision [Lui, 2012]. Since the constraint set of the mixing coefficients of the finite mixture model is the multinomial manifold, we recently introduced a Riemannian trust-region (RTR) algorithm for the sparse finite mixture model based on the minimum integrated square error criterion [Hong et al., 2015].

Note that all the aforementioned PDF estimation algorithms aim at directly estimating the probability density function. In this work we investigate the less studied problem of estimating the square root of the probability density function [Pinheiro and Vidakovic, 1998], i.e., the square root of a PDF, rather than the PDF itself, is represented as a mixture of Gaussians. Clearly the resultant estimator can be used in many applications in the same way as a PDF estimator.
However, the estimation of the square root of the probability density function (PDF) poses a different problem, and a new estimation algorithm is required. In this paper, we introduce and extend a fast maximum likelihood estimator based on a linear combination of Gaussian kernels which represents the square root of the probability density function [Hong and Gao, 2016]. Initially we apply the k-means clustering algorithm and a grid search to determine the centers and the kernel width. It is shown that the underlying parameter estimation problem can be formulated as a Riemannian optimization on the sphere manifold. The first order Riemannian geometry of the sphere manifold and the vector transport are explored, and the well-known Riemannian conjugate gradient algorithm is used to estimate the model parameters. Numerical examples are employed to demonstrate that the proposed approach is effective in constructing the estimate of the square root of a probability density function.

2 Preliminary on Sphere Manifold

This section briefly introduces the concept of the sphere manifold and the necessary ingredients used in the retraction based framework of Riemannian optimization. As a reference, the main notation for Riemannian geometry on the sphere manifold used in this section is summarized in Table 1. We refer the readers to [Absil et al., 2008] for the general concepts of manifolds.

Table 1 Notation for the sphere manifold
{S^{M-1}, g}                        Sphere manifold for the parameter vector θ, together with the inner product g of the manifold
T_θ S^{M-1}                         Tangent space of the sphere manifold at θ
u_θ, v_θ                            Tangent vectors at θ
Proj_θ(z)                           Orthogonal projection of a vector z in the ambient space onto the tangent space at θ
grad F(θ)                           Riemannian gradient of F(θ) on the manifold S^{M-1}
Grad F(θ)                           Classical gradient of F(θ) as seen in Euclidean space
Exp_θ                               Retraction (exponential) mapping
T_{θ_{k+1} ← θ_k}(u_{θ_k})          Vector transport

The sphere manifold is the set of unit-norm vectors of size M, denoted as

    S^{M-1} = \{ \theta \in \mathbb{R}^M : \|\theta\|_2 = 1 \}.    (1)

It is endowed with a Riemannian manifold structure by considering it as a Riemannian submanifold of the embedding Euclidean space R^M equipped with the usual inner product

    g(u_\theta, v_\theta) = u_\theta^T v_\theta,    (2)

where u_θ, v_θ ∈ T_θ S^{M-1} ⊂ R^M are tangent vectors to S^{M-1} at θ. The inner product on S^{M-1} determines the geometry, such as distance, angle and curvature, on S^{M-1}. Note that the tangent space T_θ S^{M-1} at the element θ can be described by

    T_\theta S^{M-1} = \{ u_\theta : u_\theta^T \theta = 0 \}.    (3)

Riemannian gradient: Let the Riemannian gradient of a scalar function F(θ) on S^{M-1} be denoted by grad F(θ), and its classical gradient as seen in the Euclidean space by Grad F(θ). Then we have

    \operatorname{grad} F(\theta) = \operatorname{Proj}_\theta\big( \operatorname{Grad} F(\theta) \big),    (4)

where Proj_θ(z) is the orthogonal projection onto the tangent space, which can be computed as

    \operatorname{Proj}_\theta(z) = z - (\theta^T z)\, \theta,    (5)

in which z represents a vector in the ambient space.

Retraction mapping: An important concept in the recent retraction-based framework of Riemannian optimization is the retraction mapping, see Section 4.1 of [Absil et al., 2008]. The exponential map Exp_θ, defined by

    \operatorname{Exp}_\theta(\lambda u_\theta) = \cos\big( \|\lambda u_\theta\|_2 \big)\, \theta + \frac{\sin\big( \|\lambda u_\theta\|_2 \big)}{\|u_\theta\|_2}\, u_\theta,    (6)

is the canonical choice for the retraction mapping, where the scalar λ is a chosen step size.
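To make these ingredients concrete, the following sketch (our own NumPy illustration with hypothetical function names, not the authors' Manopt-based code) implements the tangent-space projection (5), the Riemannian gradient (4) and the exponential-map retraction (6) for the sphere manifold.

```python
import numpy as np

def project_to_tangent(theta, z):
    """Orthogonal projection (5): Proj_theta(z) = z - (theta^T z) theta."""
    return z - (theta @ z) * theta

def riemannian_grad(theta, euclidean_grad):
    """Riemannian gradient (4): project the Euclidean gradient onto the tangent space."""
    return project_to_tangent(theta, euclidean_grad)

def exp_map(theta, u, lam=1.0):
    """Exponential-map retraction (6): move from theta along lam*u and land back on the sphere."""
    v = lam * u
    nv = np.linalg.norm(v)
    if nv < 1e-16:                      # zero step: stay put
        return theta
    return np.cos(nv) * theta + np.sin(nv) * (v / nv)

# sanity checks: the projected gradient is tangent at theta, and the
# retraction returns a point of unit norm, i.e. it stays on the sphere
rng = np.random.default_rng(0)
theta = rng.standard_normal(5)
theta /= np.linalg.norm(theta)
u = riemannian_grad(theta, rng.standard_normal(5))
print(abs(theta @ u) < 1e-12)
print(abs(np.linalg.norm(exp_map(theta, u, 0.3)) - 1.0) < 1e-12)
```

The two printed checks confirm the defining properties used throughout the rest of the paper: tangency of the projected gradient and feasibility (unit norm) of every retracted iterate.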
The retraction mapping is used to locate the next iterate on the manifold along a specified tangent vector, such as a search direction in the line search of Newton's algorithm or the suboptimal tangent direction in the trust-region algorithm, see Chapter 7 of [Absil et al., 2008]. For example, the line search algorithm is simply given by

    \theta_{k+1} = \operatorname{Exp}_{\theta_k}\big( \lambda_k u_{\theta_k} \big),    (7)

where the search direction u_{θ_k} ∈ T_{θ_k} S^{M-1} and λ_k is a chosen step size at iteration step k.

Vector transport: In Riemannian optimization algorithms, the second derivatives can be approximated by comparing the first-order information (tangent vectors) at distinct points on the manifold. The notion of vector transport T_{θ_{k+1} ← θ_k}(u_{θ_k}) on a manifold, roughly speaking, specifies how to transport a tangent vector u_{θ_k} from a point θ_k to another point θ_{k+1} on the manifold. The vector transport for the sphere manifold is calculated as

    T_{\theta_{k+1} \leftarrow \theta_k}(u_{\theta_k}) = \operatorname{Proj}_{\theta_{k+1}}(u_{\theta_k}).    (8)

3 Proposed estimator for the square root of probability density function

Given the finite data set D_N = {x_j}_{j=1}^{N} consisting of N data samples, where the data x_j ∈ R^m follow an unknown PDF p(x), the Gaussian mixture model [McLachlan and Peel, 2000] has the form

    p_{GMM}(x) = \sum_{i=1}^{P} \frac{g_i}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\Big( -\tfrac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \Big),    (9)

where μ_i and Σ_i are the mean vector and the (positive definite) covariance matrix of the ith mixture component. The mixing coefficients satisfy 0 ≤ g_i ≤ 1 for all i, and

    \sum_{i=1}^{P} g_i = 1.    (10)

In this work we investigate the less studied problem of estimating the square root of the probability density function (PDF) [Pinheiro and Vidakovic, 1998]. The reason that the square root of the probability density function is estimated is that the problem can be formulated as a Riemannian optimisation one for computational advantage, and that the resultant estimator can be used as an alternative to a probability density function estimator. Specifically, the problem under study is to find an approximation of the square root of the PDF, ψ(x) = √p(x) > 0, using M component Gaussian kernels, given by

    \psi(x) = \sum_{i=1}^{M} \omega_i K_\sigma(x, c_i) = \omega^T k(x),    (11)

subject to

    \int \psi^2(x)\, dx = 1,    (12)

where

    K_\sigma(x, c_i) = \frac{1}{(2\pi\sigma^2)^{m/2}} \exp\Big( -\frac{\|x - c_i\|^2}{2\sigma^2} \Big),    (13)

in which c_i = [c_{i,1}, c_{i,2}, ..., c_{i,m}]^T is the center vector of the ith kernel, σ > 0 is the width parameter, and ω_i is the ith kernel weight. We write ω = [ω_1, ..., ω_M]^T and k(x) = [K_σ(x, c_1), ..., K_σ(x, c_M)]^T. Applying (11) to the constraint (12), we have

    \int \psi^2(x)\, dx = \sum_{i=1}^{M} \sum_{j=1}^{M} \omega_i \omega_j \int K_\sigma(x, c_i) K_\sigma(x, c_j)\, dx = \omega^T Q \omega = 1,    (14)

where Q = {q_{i,j}}, with q_{i,j} = K_{√2σ}(c_i, c_j), as shown in Appendix A. A small illustrative code sketch of k(x), Q and the normalisation (14) is given below.
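The sketch below (a NumPy illustration under our own naming, not code from the paper) builds the kernel vector k(x) of (11) and the overlap matrix Q with entries q_{i,j} = K_{√2σ}(c_i, c_j), and rescales a weight vector so that the normalisation constraint ω^T Q ω = 1 of (14) holds. The centres and width are arbitrary toy values.

```python
import numpy as np

def gauss_kernel(x, c, sigma):
    """Isotropic Gaussian kernel K_sigma(x, c) of (13) in m dimensions."""
    m = len(c)
    d2 = np.sum((np.asarray(x) - np.asarray(c)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2) ** (m / 2.0)

def kernel_vector(x, centers, sigma):
    """k(x) = [K_sigma(x, c_1), ..., K_sigma(x, c_M)]^T as in (11)."""
    return np.array([gauss_kernel(x, c, sigma) for c in centers])

def overlap_matrix(centers, sigma):
    """Q = {q_ij} with q_ij = K_{sqrt(2) sigma}(c_i, c_j), cf. (14) and Appendix A."""
    M = len(centers)
    Q = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            Q[i, j] = gauss_kernel(centers[i], centers[j], np.sqrt(2.0) * sigma)
    return Q

# toy usage: any positive weight vector rescaled so that w^T Q w = 1 satisfies (12)
rng = np.random.default_rng(1)
centers = rng.standard_normal((6, 2))          # M = 6 hypothetical centres in R^2
sigma = 0.8
Q = overlap_matrix(centers, sigma)
w = np.abs(rng.standard_normal(6))
w /= np.sqrt(w @ Q @ w)                        # enforce the normalisation (12)
print(np.isclose(w @ Q @ w, 1.0))
```

The closed form of q_{i,j} avoids any numerical integration: the constraint (12) becomes the simple quadratic condition checked in the last line.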
Clustering algorithms can be used to find a set of centers which accurately reflects the distribution of the data points. Given the N data points x_j, j = 1, ..., N, the k-means algorithm [Haykin, 2009] seeks to partition the data points into M disjoint subsets S_i, each containing N_i data points, so as to minimize the sum-of-squares clustering function

    J = \sum_{i=1}^{M} \sum_{x_j \in S_i} \|x_j - c_i\|^2,    (15)

where ∈ denotes set membership. J is minimized when

    c_i = \frac{1}{N_i} \sum_{x_j \in S_i} x_j.    (16)

The k-means algorithm is detailed in Algorithm 1.

Algorithm 1 The k-means clustering algorithm
Require: x_j, j = 1, ..., N, and a preset number of centers M.
Ensure: c_i, i = 1, ..., M, that yield the minimum of J = \sum_{i=1}^{M} \sum_{x_j \in S_i} \|x_j - c_i\|^2.
1: Randomly select M data points from D_N as the initial centers c_i^{old}.
2: Randomly draw a data point x_j from D_N.
3: Among all c_i^{old}, i = 1, ..., M, find the nearest center to x_j, denoted as c_k^{old}.
4: Update c_k^{new} = c_k^{old} + ε (x_j - c_k^{old}), where ε > 0 is a predetermined learning rate.
5: Set c_i^{old} = c_i^{new}, i = 1, ..., M (only the nearest center c_k changes).
6: Go to Step 2 until a sufficiently large number of iterations has been reached (for convergence).
7: Return c_i^{new} as c_i, i = 1, ..., M.

We now propose a new maximum likelihood estimator for the parameters ω_i, based on the assumption that the M kernel centers and a preset kernel width are given. First, denote the eigenvalue decomposition of Q as Q = UΣU^T, where U is an orthogonal matrix consisting of the eigenvectors and Σ is a diagonal matrix whose entries are the eigenvalues of Q. Letting θ = √Σ U^T ω, ψ(x) can be written as

    \psi(x) = \theta^T \bar{k}(x),    (17)

where k̄(x) = √Σ^{-1} U^T k(x). In order to satisfy ψ(x) > 0, we initialize all ω_i with the same positive number ξ, i.e., ω_{Ini} = ξ 1_M, where 1_M is a length-M vector with all elements equal to one. The corresponding θ_0 can be calculated as

    \theta_0 = \sqrt{\Sigma}\, U^T 1_M,    (18)

followed by a normalization step θ_0 = θ_0 / ∥θ_0∥ so that θ_0 is on the sphere manifold. We are now ready to formulate the maximum likelihood estimator as the following Riemannian optimization problem,

    \theta_{opt} = \arg\min_{\theta \in S^{M-1}} \Big\{ F(\theta) = -\sum_{j=1}^{N} \log\big( \theta^T \bar{k}(x_j) \big) \Big\},    (19)

from which the kernel weights are recovered via θ_{opt} = √Σ U^T ω_{opt}, i.e., ω_{opt} = U √Σ^{-1} θ_{opt}. For our objective function F(θ), it is easy to check that the Euclidean gradient Grad F(θ) can be calculated as

    \operatorname{Grad} F(\theta) = -\sum_{j=1}^{N} \frac{\bar{k}(x_j)}{\theta^T \bar{k}(x_j)}.    (20)

Based on Grad F(θ), the Riemannian gradient of the objective function F(θ) on the sphere manifold can be calculated according to (4) and (5). We opt for the Riemannian conjugate gradient algorithm to solve (19), which generalizes the classical conjugate gradient algorithm [Hager and Zhang, 2006] to optimization problems over Riemannian manifolds [Boumal et al., 2014]. Since the logarithm function acts as a natural barrier at zero and the Riemannian conjugate gradient algorithm is a local minimization algorithm, the constraint ψ(x) > 0 can be met.

With all the ingredients available, we form the algorithm for solving (19) as Algorithm 2, which is implemented in the Manifold Optimization Toolbox Manopt, http://www.manopt.org, see [Boumal et al., 2014]. We used the default parameter settings in Manopt, so that in Step 4 of Algorithm 2, α_k is based on the backtracking line search procedure described in Chapter 4 (p. 63) of [Absil et al., 2008], and in Step 5 of Algorithm 2, β_k is based on the default option, Hestenes-Stiefel's modified rule. Specifically, we have

    \beta_{k+1} = \max\Big\{ 0, \frac{\langle \operatorname{grad} F(\theta_{k+1}), \gamma_{k+1} \rangle}{\langle T_{\theta_{k+1} \leftarrow \theta_k}(\eta_k), \gamma_{k+1} \rangle} \Big\},    (21)

where ⟨·,·⟩ denotes the inner product, and

    \gamma_{k+1} = \operatorname{grad} F(\theta_{k+1}) - T_{\theta_{k+1} \leftarrow \theta_k}\big( \operatorname{grad} F(\theta_k) \big).    (22)

Algorithm 2 Riemannian conjugate gradient algorithm for solving (19)
Require: k̄(x_j), j = 1, ..., N; an initial point θ_0 on the sphere manifold S^{M-1}; and a default threshold ϖ, e.g., ϖ = 10^{-6} or ϖ = 10^{-5}.
Ensure: θ that yields the minimum of F(θ).
1: Set η_0 = -grad F(θ_0) and k = 0;
2: while ∥grad F(θ_k)∥_2 > ϖ do
3:   k = k + 1;
4:   Compute a step size α_k and set
         \theta_k = \operatorname{Exp}_{\theta_{k-1}}\big( \alpha_k \eta_{k-1} \big);    (23)
5:   Compute β_k and set
         \eta_k = -\operatorname{grad} F(\theta_k) + \beta_k T_{\theta_k \leftarrow \theta_{k-1}}(\eta_{k-1});    (24)
6: end while
7: Return θ = θ_k, F(θ), and the associated U, Σ.
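The estimation step of (17)-(20) can be sketched as follows. The code whitens the kernel vectors via the eigendecomposition Q = UΣU^T, initialises θ_0 as in (18) and minimises the negative log-likelihood (19) on the sphere. For brevity it uses Riemannian gradient descent with a simple backtracking line search and the retraction (6), rather than the Hestenes-Stiefel conjugate-gradient update (21)-(24) that the paper runs through Manopt; it is a minimal illustration of the formulation, with hypothetical function names, not the authors' implementation.

```python
import numpy as np

def gauss_kernel_matrix(X, centers, sigma):
    """Rows give k(x_j)^T of (11) for every sample x_j in X (an N x M matrix)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)        # N x M squared distances
    m = X.shape[1]
    return np.exp(-d2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2) ** (m / 2)

def fit_sqrt_pdf(X, centers, sigma, iters=200, step0=1.0):
    """Minimise F(theta) = -sum_j log(theta^T kbar(x_j)) over the sphere, cf. (19)."""
    N, m = X.shape
    M = len(centers)
    d2c = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Q = np.exp(-d2c / (4 * sigma ** 2)) / (4 * np.pi * sigma ** 2) ** (m / 2)  # (14), (32)
    s, U = np.linalg.eigh(Q)                       # Q = U diag(s) U^T
    s = np.clip(s, 1e-12, None)                    # guard against round-off
    Kbar = gauss_kernel_matrix(X, centers, sigma) @ U / np.sqrt(s)   # rows: kbar(x_j)^T
    theta = np.sqrt(s) * (U.T @ np.ones(M))        # initialisation (18)
    theta /= np.linalg.norm(theta)

    def F(th):
        v = Kbar @ th
        return np.inf if np.any(v <= 0) else -np.sum(np.log(v))

    for _ in range(iters):
        g = -Kbar.T @ (1.0 / (Kbar @ theta))       # Euclidean gradient (20)
        rg = g - (theta @ g) * theta               # Riemannian gradient via (4), (5)
        if np.linalg.norm(rg) < 1e-6:
            break
        step, f0 = step0, F(theta)
        while True:                                # simple backtracking line search
            v = -step * rg                         # descend along the negative gradient
            nv = np.linalg.norm(v)
            cand = np.cos(nv) * theta + np.sin(nv) * v / nv   # retraction (6)
            if F(cand) < f0 or step < 1e-12:
                break
            step *= 0.5
        theta = cand / np.linalg.norm(cand)        # re-normalise against round-off
    omega = U @ (theta / np.sqrt(s))               # recover omega from theta = sqrt(Sigma) U^T omega
    return theta, omega

# toy usage on synthetic 2-D data with hypothetical centres
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 2))
centers = X[rng.choice(200, size=10, replace=False)]
theta, omega = fit_sqrt_pdf(X, centers, sigma=0.8)
print(np.isclose(np.linalg.norm(theta), 1.0))      # theta stays on the sphere
```

This sketch is only meant to clarify the whitening and the role of the sphere constraint; the experiments reported below were obtained with the conjugate-gradient solver of Manopt, as described above.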
In summary, our proposed algorithm consists of two consecutive steps: (i) in the first stage, we determine the M kernel centres using the k-means clustering algorithm, as presented in Algorithm 1; (ii) for a grid search over the kernel width, we estimate the kernel parameters ω using the Riemannian conjugate gradient algorithm based on the maximum likelihood criterion. For completeness, the proposed estimator is detailed in Algorithm 3.

Algorithm 3 Proposed estimator for the square root of probability density function
Require: x_j, j = 1, ..., N, and a preset number of kernels M.
Ensure: ψ(x) that maximizes the likelihood.
1: Use the k-means clustering algorithm (Algorithm 1) to find c_i, i = 1, ..., M;
2: for i = 1, 2, ..., (Iter + 1) do
3:   σ = σ_{min} + \frac{i-1}{Iter}(σ_{max} - σ_{min}), where σ_{min}, σ_{max} and (Iter + 1) are preset and denote the minimum value, the maximum value and the total number of kernel widths on the grid search, respectively.
4:   Form Q and its eigen-decomposition Q = UΣU^T.
5:   Obtain k̄(x_j) = √Σ^{-1} U^T k(x_j), j = 1, ..., N.
6:   Find θ_0 according to (18) and normalize it.
7:   Apply the Riemannian conjugate gradient algorithm (Algorithm 2) to return F(θ).
8: end for
9: Find σ_{opt} that minimizes F(θ) over the grid search. Return the associated θ, U, Σ.
10: Return ω_{opt} = U √Σ^{-1} θ_{opt}.

The computational complexity consists of that of the k-means clustering, which is of the order Iter_{kmeans} · O(N), where Iter_{kmeans} is the number of iterations of the k-means clustering algorithm, and that of the Riemannian conjugate gradient algorithm, scaled by the number of grid search points Iter. The main cost in the Riemannian conjugate gradient algorithm is the function and derivative evaluation of (19) and (20), which is of the order O(MN). Hence the total cost of the proposed algorithm is evaluated as Iter_{kmeans} · O(N) + Iter · O(MN).

4 Illustrative examples

Example 1: A data set of N = 600 points was randomly drawn from a known distribution p(x) and used to construct the estimator of the square root of the probability density function, ψ(x), using the proposed approach. p(x) is given as a mixture of one Gaussian and one Laplacian, defined by

    p(x) = \frac{2}{3\pi} \exp\big( -2(x_1 - 1)^2 - 2(x_2 - 1)^2 \big) + \frac{0.35}{6} \exp\big( -0.7|x_1 + 1| - 0.5|x_2 + 1| \big).    (25)

We preset M = 40, and the k-means algorithm was applied to find the M = 40 centers c_i, i = 1, ..., M. The Riemannian conjugate gradient algorithm was applied to minimize the negative log-likelihood, with Q obtained over a range of kernel widths [0.5, 2] with a grid spacing of 0.1. The resulting log F(θ) versus the kernel width is shown in Figure 1. The convergence of the Riemannian conjugate gradient algorithm on the sphere manifold for the optimal kernel width is shown in Figure 2, showing that the algorithm converges rapidly. The performance of the proposed estimator is shown in Figure 3(a), (b) and (c), which plot ψ²(x), the true density p(x), and the estimation error e(x) = ψ²(x) - p(x), respectively, over a 41 x 41 meshed grid with a spacing of 0.2, ranging from -4 to 4 for both x_1 and x_2.

Fig. 1 log(F(θ)) versus the kernel width σ for Example 1.

Fig. 2 Convergence of the Riemannian conjugate gradient algorithm for Example 1 (log F(θ) versus iterations).

Fig. 3 The result of the proposed estimator for Example 1: (a) the resultant PDF estimate ψ²(x); (b) the true PDF p(x); and (c) the estimation error e(x).
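For readers who wish to reproduce a data set like that of Example 1, the following sketch samples from the mixture (25) and evaluates the true density on the 41 x 41 test mesh used later for evaluation. The component weights 1/3 (Gaussian) and 2/3 (Laplacian) and the component parameters are our reading of the normalising constants in (25), not values stated explicitly in the text.

```python
import numpy as np

def sample_example1(n, rng):
    """Draw n points from the mixture density (25).

    Our reading of (25): a Gaussian N(mean=(1,1), cov=0.25*I) with mixture
    weight 1/3, plus a product Laplace centred at (-1,-1) with per-dimension
    scales 1/0.7 and 1/0.5 and mixture weight 2/3; the weights follow from
    the normalising constants 2/(3*pi) and 0.35/6."""
    comp = rng.random(n) < 1.0 / 3.0
    gauss = rng.normal([1.0, 1.0], 0.5, (n, 2))
    lap = np.column_stack([rng.laplace(-1.0, 1.0 / 0.7, n),
                           rng.laplace(-1.0, 1.0 / 0.5, n)])
    return np.where(comp[:, None], gauss, lap)

def p_example1(x1, x2):
    """Evaluate the true density (25) pointwise, e.g. on a plotting mesh."""
    g = 2.0 / (3.0 * np.pi) * np.exp(-2.0 * (x1 - 1.0) ** 2 - 2.0 * (x2 - 1.0) ** 2)
    return g + 0.35 / 6.0 * np.exp(-0.7 * np.abs(x1 + 1.0) - 0.5 * np.abs(x2 + 1.0))

rng = np.random.default_rng(0)
X = sample_example1(600, rng)                     # N = 600 training points, as in Example 1
grid = np.linspace(-4.0, 4.0, 41)                 # 41 x 41 mesh with spacing 0.2
x1, x2 = np.meshgrid(grid, grid)
P = p_example1(x1, x2)
print(X.shape, P.shape)                           # (600, 2) (41, 41)
```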
The experiment was repeated for 100 different random runs in order to compare the performance of the proposed algorithm with the Gaussian mixture model (GMM) fitted using the EM algorithm [McLachlan and Peel, 2000]. A separate test data set of N_test = 1681 points was generated on a 41 x 41 mesh with a spacing of 0.2, ranging from -4 to 4 for both x_1 and x_2. These points were used for evaluation according to

    L_1 = \begin{cases} \frac{1}{N_{test}} \sum_{k=1}^{N_{test}} |p(x_k) - \psi^2(x_k)|, & \text{Proposed} \\ \frac{1}{N_{test}} \sum_{k=1}^{N_{test}} |p(x_k) - p_{GMM}(x_k)|, & \text{GMM} \end{cases}    (26)

where p_{GMM} is the resultant GMM model based on two Gaussian mixture components, with the constraint that the covariance matrices are diagonal. The GMM model fitting was implemented using the Matlab command gmdistribution.fit.m. The results are shown in Table 2, which demonstrates that the proposed algorithm outperforms the GMM fitted using the EM algorithm with two mixture components.

Table 2 Performance of the proposed density estimator in comparison with the GMM model for Example 1.
Method     L_1 test error (mean ± STD)
Proposed   (2.2 ± 0.3) x 10^{-4}
GMM        (3.2 ± 0.6) x 10^{-4}

Example 2: A data set of N = 1500 points was randomly drawn from a known distribution p(x), given as a mixture of one Gaussian and one Laplacian, defined by

    p(x) = \frac{1}{2.1\pi} \exp\big( -2(x_1 - 2)^2 - (x_2 - 2)^2/0.98 \big) + \frac{0.35}{6} \exp\big( -0.7|x_1 + 1| - 0.5|x_2 + 1| \big),    (27)

and the proposed approach was used to estimate ψ(x), the square root of p(x). The number of centers was empirically set to M = 40, and the k-means algorithm was implemented according to Algorithm 1, producing M centers c_i, i = 1, ..., M. Based on the centers, the kernel matrices Q were generated for each kernel width on a grid over [0.2, 2] with a spacing of 0.1. The Riemannian conjugate gradient algorithm was used to minimize the negative log-likelihood for this range of kernel widths. The resulting log F(θ) versus the kernel width is shown in Figure 4. The convergence of the Riemannian conjugate gradient algorithm on the sphere manifold for the optimal kernel width is shown in Figure 5, showing that the algorithm converges rapidly. The performance of the proposed estimator is shown in Figure 6(a), (b) and (c), which plot ψ²(x), the true density p(x), and the estimation error e(x) = ψ²(x) - p(x), respectively, over a 41 x 41 meshed grid with a spacing of 0.2, ranging from -4 to 4 for both x_1 and x_2.

Fig. 4 log(F(θ)) versus the kernel width for Example 2.

Fig. 5 Convergence of the Riemannian conjugate gradient algorithm for Example 2.

Fig. 6 The result of the proposed estimator for Example 2: (a) the resultant PDF estimate ψ²(x); (b) the true PDF p(x); and (c) the estimation error e(x).
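Both examples score estimators with the same L1 criterion, cf. (26) above and (28) below. The sketch that follows illustrates the GMM branch of that criterion for Example 2, using scikit-learn's GaussianMixture (two components, diagonal covariances) as a stand-in for the Matlab gmdistribution.fit routine used in the paper; the sampling weights 1/3 and 2/3 and the component parameters are again read off the constants of (27) and are our interpretation. The proposed estimator's branch would simply replace p_GMM(x_k) by ψ²(x_k).

```python
import numpy as np
from sklearn.mixture import GaussianMixture   # stand-in for Matlab's gmdistribution.fit

def p_true_ex2(X):
    """True density (27) of Example 2, evaluated row-wise on an (N, 2) array."""
    x1, x2 = X[:, 0], X[:, 1]
    g = 1.0 / (2.1 * np.pi) * np.exp(-2 * (x1 - 2) ** 2 - (x2 - 2) ** 2 / 0.98)
    return g + 0.35 / 6 * np.exp(-0.7 * np.abs(x1 + 1) - 0.5 * np.abs(x2 + 1))

# sample N = 1500 training points from (27): Gaussian at (2,2) with std devs
# (0.5, 0.7) and weight 1/3, product Laplace at (-1,-1) with weight 2/3
rng = np.random.default_rng(0)
n = 1500
comp = rng.random(n) < 1 / 3
gauss = rng.normal([2.0, 2.0], [0.5, 0.7], (n, 2))
lap = np.column_stack([rng.laplace(-1, 1 / 0.7, n), rng.laplace(-1, 1 / 0.5, n)])
X = np.where(comp[:, None], gauss, lap)

# 41 x 41 test mesh (N_test = 1681) with spacing 0.2 on [-4, 4]^2
grid = np.linspace(-4, 4, 41)
x1, x2 = np.meshgrid(grid, grid)
mesh = np.column_stack([x1.ravel(), x2.ravel()])

# two-component diagonal-covariance GMM fitted by EM, as in the comparison
gmm = GaussianMixture(n_components=2, covariance_type='diag', random_state=0).fit(X)
p_gmm = np.exp(gmm.score_samples(mesh))

# GMM branch of the L1 criterion (26)/(28)
print(np.mean(np.abs(p_true_ex2(mesh) - p_gmm)))
```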
The experiment was repeated for 100 different random runs in order to compare the performance of the proposed algorithm with the Gaussian mixture model (GMM) fitted using the EM algorithm [McLachlan and Peel, 2000]. A separate test data set of N_test = 1681 points was generated on a 41 x 41 mesh with a spacing of 0.2, ranging from -4 to 4 for both x_1 and x_2. These points were used for evaluation according to

    L_1 = \begin{cases} \frac{1}{N_{test}} \sum_{k=1}^{N_{test}} |p(x_k) - \psi^2(x_k)|, & \text{Proposed} \\ \frac{1}{N_{test}} \sum_{k=1}^{N_{test}} |p(x_k) - p_{GMM}(x_k)|, & \text{GMM} \end{cases}    (28)

where p_{GMM} is the resultant GMM model, with the constraint that the covariance matrices are diagonal. The GMM model fitting was implemented using the Matlab command gmdistribution.fit.m with two mixture components. The results are shown in Table 3, showing that the proposed method performs better than the Gaussian mixture model with two components.

Table 3 Performance of the proposed density estimator in comparison with the GMM model for Example 2.
Method     L_1 test error (mean ± STD)
Proposed   (1.7 ± 0.2) x 10^{-3}
GMM        (2.9 ± 0.1) x 10^{-3}

5 Conclusions

In this paper a new method has been introduced to estimate the square root of a probability density function from observational data using maximum likelihood. The proposed model takes the form of a linear combination of Gaussian kernels. Incorporating the k-means clustering algorithm to determine the kernel centers as well as a grid search to determine the kernel width, the model parameters are then estimated using Riemannian optimization on the sphere. The first order Riemannian geometry of the sphere manifold and the vector transport are explored. Two illustrative examples are employed to demonstrate that models obtained by the proposed approach compare favourably with the GMM model fitted using the EM algorithm.

In this research we have used the well-known k-means clustering algorithm, which relies on a predetermined number M of centers that may not be optimal. The choice of M is generally application dependent, and for the k-means clustering algorithm M should be set sufficiently large to perform well. Future research will focus on model structure optimization so that the model size can be learnt from the data.

Appendix A

Let x = [x_1, ..., x_m]^T. We have

    q_{i,j} = \frac{1}{(2\pi\sigma^2)^m} \int \cdots \int \exp\Big( -\frac{\|x - c_i\|^2}{2\sigma^2} - \frac{\|x - c_j\|^2}{2\sigma^2} \Big)\, dx_1 \cdots dx_m
            = \frac{1}{(2\pi\sigma^2)^m} \prod_{l=1}^{m} \int \exp\Big( -\frac{(x_l - c_{i,l})^2}{2\sigma^2} - \frac{(x_l - c_{j,l})^2}{2\sigma^2} \Big)\, dx_l,    (29)

in which

    \int \exp\Big( -\frac{(x_l - c_{i,l})^2}{2\sigma^2} - \frac{(x_l - c_{j,l})^2}{2\sigma^2} \Big)\, dx_l
      = \int \exp\Big( -\frac{x_l^2 - (c_{i,l} + c_{j,l}) x_l + (c_{i,l}^2 + c_{j,l}^2)/2}{\sigma^2} \Big)\, dx_l
      = \exp\Bigg( -\frac{\frac{c_{i,l}^2 + c_{j,l}^2}{2} - \big( \frac{c_{i,l} + c_{j,l}}{2} \big)^2}{\sigma^2} \Bigg) \int \exp\Bigg( -\frac{\big[ x_l - \frac{c_{i,l} + c_{j,l}}{2} \big]^2}{\sigma^2} \Bigg)\, dx_l.    (30)

By making use of \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\big( -\frac{(x_l - c)^2}{2\sigma^2} \big)\, dx_l = 1, i.e., a Gaussian density integrates to one, we have

    \int \exp\Big( -\frac{(x_l - c_{i,l})^2}{2\sigma^2} - \frac{(x_l - c_{j,l})^2}{2\sigma^2} \Big)\, dx_l
      = \sqrt{\pi\sigma^2}\, \exp\Bigg( -\frac{\frac{c_{i,l}^2 + c_{j,l}^2}{2} - \big( \frac{c_{i,l} + c_{j,l}}{2} \big)^2}{\sigma^2} \Bigg)
      = \sqrt{\pi\sigma^2}\, \exp\Big( -\frac{(c_{j,l} - c_{i,l})^2}{4\sigma^2} \Big).    (31)

Hence

    q_{i,j} = \frac{1}{(4\pi\sigma^2)^{m/2}} \exp\Big( -\frac{\|c_i - c_j\|^2}{4\sigma^2} \Big) = K_{\sqrt{2}\sigma}(c_i, c_j).    (32)

References

Absil, P.-A., Mahony, R., and Sepulchre, R. (2008). Optimization Algorithms on Matrix Manifolds. Princeton University Press.

Bilmes, J. A. (1998). A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021, University of California, Berkeley.

Boumal, N., Mishra, B., Absil, P.-A., and Sepulchre, R. (2014). Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15:1455-1459.
Chen, S., Hong, X., and Harris, C. J. (2004a). Sparse kernel density construction using orthogonal forward regression with leave-one-out test score and local regularization. IEEE Trans. on Systems, Man and Cybernetics, Part B: Cybernetics, 34(4):1708-1717.

Chen, S., Hong, X., and Harris, C. J. (2008). An orthogonal forward regression technique for sparse kernel density estimation. Neurocomputing, 71(4-6):925-943.

Chen, S., Hong, X., and Harris, C. J. (2010). Particle swarm optimization aided orthogonal forward regression for unified data modelling. IEEE Trans. Evolutionary Computation, 14:477-499.

Chen, S., Hong, X., Harris, C. J., and Sharkey, P. M. (2004b). Sparse modelling using orthogonal forward regression with PRESS statistic and regularization. IEEE Trans. on Systems, Man and Cybernetics, Part B: Cybernetics, 34(2):898-911.

Choudhury, A. (2002). Fast Machine Learning Algorithms for Large Data. PhD thesis, School of Engineering Sciences, University of Southampton.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society B, 39:1-38.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley: New York.

Girolami, M. and He, C. (2003). Probability density estimation from optimally condensed data samples. IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(10):1253-1264.

Hager, W. W. and Zhang, H. (2006). A survey of nonlinear conjugate gradient methods. Pacific J. Optimization, 2:35-58.

Harandi, M., Hartley, R., Shen, C., Lovell, B., and Sanderson, C. (2014). Extrinsic methods for coding and dictionary learning on Grassmann manifolds. arXiv:1401.8126.

Haykin, S. (2009). Neural Networks and Learning Machines. Pearson Education Inc.

Hong, X., Chen, S., Qatawneh, A., Daqrouq, K., Sheikh, M., and Morfeq, A. (2013). Sparse probability density function estimation using the minimum integrated square error. Neurocomputing, 115:122-129.

Hong, X., Gao, J., Chen, S., and Zia, T. (2015). Sparse density estimation on the multinomial manifold. IEEE Transactions on Neural Networks and Learning Systems, 26(11):2972-2977.

Hong, X. and Gao, J. B. (2016). A fast algorithm to estimate the square root of probability density function. In Proceedings of AI-2016, Thirty-sixth SGAI International Conference on Artificial Intelligence, Cambridge, UK.

Lui, Y. M. (2012). Advances in matrix manifolds for computer vision. Image and Vision Computing, 30:380-388.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley: New York.

Mishra, B., Meyer, G., Bach, F., and Sepulchre, R. (2013). Low-rank optimization with trace norm penalty. SIAM Journal on Optimization, 23:2124-2149.

Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1066-1076.

Pinheiro, A. and Vidakovic, B. (1998). Estimating the square root of density via compactly supported wavelets. Computational Statistics and Data Analysis, 25:399-415.

Rutkowski, L. (2004). Adaptive probabilistic neural networks for pattern classification in time-varying environment. IEEE Trans. Neural Networks, 15:811-827.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall: London.

Vapnik, V. and Mukherjee, S. (2000). Support vector machine for multivariate density estimation. In Solla, S. A., Leen, T. K., and Müller, K.-R., editors, Advances in Neural Information Processing Systems, pages 659-665. MIT Press.
Weston, J., Gammerman, A., Stitson, M., Vapnik, V., Vovk, V., and Watkins, C. (1999). Support vector density estimation. In Schölkopf, B., Burges, C. J. C., and Smola, A. J., editors, Advances in Kernel Methods, pages 293-306. MIT Press.

Yin, H. and Allinson, N. W. (2001). Self-organizing mixture networks for probability density estimation. IEEE Trans. Neural Networks, 12:405-411.