Estimating the Square Root of Probability Density Function on Riemannian Manifold

Xia Hong
Department of Computer Science, School of Mathematical and Physical Sciences, University of Reading, UK, e-mail: x.hong@reading.ac.uk

Junbin Gao
Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney, NSW 2006, Australia, e-mail: junbin.gao@sydney.edu.au

Abstract We propose that the square root of a probability density function can be represented as a linear combination of Gaussian kernels. It is shown that, if the Gaussian kernel centres and the kernel width are known, the maximum likelihood parameter estimator can be formulated as a Riemannian optimization problem on the sphere manifold. The first order Riemannian geometry of the sphere manifold and the vector transport are explored first, and then the well-known Riemannian conjugate gradient algorithm is used to estimate the model parameters. For completeness, the k-means clustering algorithm and a grid search are employed to determine the centres and the kernel width, respectively. Simulated examples demonstrate that the proposed approach is effective in constructing an estimate of the square root of a probability density function.

1 Introduction

The importance of the probability density function (PDF) is evidenced by intensive theoretical research and by many data analysis and pattern recognition applications [McLachlan and Peel, 2000, Silverman, 1986, Duda and Hart, 1973, Chen et al., 2010, Rutkowski, 2004, Yin and Allinson, 2001, Dempster et al., 1977, Parzen, 1962]. The Parzen window (PW) estimator and the finite mixture models, especially a mixture of Gaussians, are two widely researched and popular estimators.
The mixture of Gaussians model represents an underlying PDF as a linear combination of Gaussian kernels, and the maximum likelihood (ML) estimates of the mixture model parameters can be obtained using the expectation-maximization (EM) algorithm. Alternatively, the Parzen window (PW) estimator [Parzen, 1962] can be regarded as a special case of the finite mixture model [McLachlan and Peel, 2000] in which the number of mixture components equals the number of training data samples and all the mixing weights are equal, providing a simple and practical PDF estimator. However, if the number of training data samples is very large, the point density estimate produced by the PW estimator for a future data sample can be computationally expensive. The ML optimisation for a general finite Gaussian mixture is a highly nonlinear optimisation process requiring extensive computation, although the EM algorithm for Gaussian mixture models enjoys an explicit iterative form [Bilmes, 1998]. There is also considerable interest in sparse PDF estimation, which can be summarised into two categories. The first category is based on constrained optimization [Weston et al., 1999, Vapnik and Mukherjee, 2000, Girolami and He, 2003, Hong et al., 2015]. The second category of sparse kernel density estimators constructs the PDF estimator in a forward regression manner [Choudhury, 2002, Chen et al., 2004b, Chen et al., 2004a, Chen et al., 2008, Hong et al., 2013].

Riemannian optimisation algorithms have recently been studied on many types of matrix manifolds, such as the Stiefel manifold, the Grassmann manifold and the manifold of positive definite matrices, see Section 3.4 of [Absil et al., 2008]. Since Riemannian optimisation works directly on the curved manifold, constraints such as orthogonality can be eliminated to obtain an unconstrained optimisation problem that, by construction, only uses feasible points. This allows one to incorporate Riemannian geometry in the resulting optimisation problems, thus producing far more accurate numerical results. Riemannian optimisation has been successfully applied in machine learning, computer vision and data mining tasks, including fixed low rank optimisation [Mishra et al., 2013], Riemannian dictionary learning [Harandi et al., 2014], and computer vision [Lui, 2012]. Since the constraint set of the mixing coefficients of the finite mixture model is the multinomial manifold, we recently introduced a Riemannian trust-region (RTR) algorithm for the sparse finite mixture model based on the minimum integrated square error criterion [Hong et al., 2015].

Note that all the aforementioned PDF estimation algorithms aim at directly estimating the probability density function. In this work we investigate the less studied problem of estimating the square root of the probability density function [Pinheiro and Vidakovic, 1998], i.e., the square root of a PDF, rather than the PDF itself, is represented as a mixture of Gaussians. Clearly the resultant estimator can be used in many applications in the same way as a PDF estimator.
However, the estimation of the square root of the probability density function (PDF) poses a different problem, and a new estimation algorithm is required. In this paper, we introduce and extend a fast maximum likelihood estimator based on a linear combination of Gaussian kernels which represents the square root of the probability density function [Hong and Gao, 2016]. Initially we apply the k-means clustering algorithm and a grid search to determine the centers and the kernel width. It is shown that the underlying parameter estimation problem can be formulated as a Riemannian optimization on the sphere manifold. The first order Riemannian geometry of the sphere manifold and the vector transport are explored, and the well-known Riemannian conjugate gradient algorithm is used to estimate the model parameters. Numerical examples are employed to demonstrate that the proposed approach is effective in constructing the estimate of the square root of a probability density function.

2 Preliminary on Sphere Manifold

This section briefly introduces the concept of the sphere manifold and the necessary ingredients used in the retraction based framework of Riemannian optimization. As a reference, the main notation for Riemannian geometry on the sphere manifold used in this section is summarized in Table 1. We refer the readers to [Absil et al., 2008] for the general concepts of manifolds.

Table 1 Notation for the sphere manifold
{S^{M-1}, g}                        Sphere manifold for the parameter vector θ, together with the inner product g of the manifold
T_θ S^{M-1}                         Tangent space of the sphere manifold at θ
u_θ, v_θ                            Tangent vectors at θ
Proj_θ(z)                           Orthogonal projection of a vector z in the ambient space onto the tangent space at θ
grad F(θ)                           Riemannian gradient of F(θ) on the manifold S^{M-1}
Grad F(θ)                           Classical gradient of F(θ) as seen in Euclidean space
Exp_θ                               Retraction (exponential) mapping
T_{θ_{k+1} ← θ_k}(u_{θ_k})          Vector transport

The sphere manifold is the set of unit-norm vectors of size M, denoted as

    S^{M-1} = \{ \theta \in \mathbb{R}^M : \|\theta\|_2 = 1 \}.    (1)

It is endowed with a Riemannian manifold structure by considering it as a Riemannian submanifold of the embedding Euclidean space R^M equipped with the usual inner product

    g(u_\theta, v_\theta) = u_\theta^T v_\theta,    (2)

where u_θ, v_θ ∈ T_θ S^{M-1} ⊂ R^M are tangent vectors to S^{M-1} at θ. The inner product on S^{M-1} determines the geometry, such as distance, angle and curvature, on S^{M-1}. Note that the tangent space T_θ S^{M-1} at the element θ can be described by

    T_\theta S^{M-1} = \{ u_\theta : u_\theta^T \theta = 0 \}.    (3)

Riemannian gradient: Let the Riemannian gradient of a scalar function F(θ) on S^{M-1} be denoted by grad F(θ), and its classical gradient as seen in the Euclidean space by Grad F(θ). Then we have

    \operatorname{grad} F(\theta) = \operatorname{Proj}_\theta\big( \operatorname{Grad} F(\theta) \big),    (4)

where Proj_θ(z) is the orthogonal projection onto the tangent space, which can be computed as

    \operatorname{Proj}_\theta(z) = z - (\theta^T z)\, \theta,    (5)

in which z represents a vector in the ambient space.

Retraction mapping: An important concept in the recent retraction-based framework of Riemannian optimization is the retraction mapping, see Section 4.1 of [Absil et al., 2008]. The exponential map Exp_θ, defined by

    \operatorname{Exp}_\theta(\lambda u_\theta) = \cos\big( \|\lambda u_\theta\|_2 \big)\, \theta + \frac{\sin\big( \|\lambda u_\theta\|_2 \big)}{\|u_\theta\|_2}\, u_\theta,    (6)

is the canonical choice for the retraction mapping, where the scalar λ is a chosen step size.
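To make these ingredients concrete, the following sketch (our own NumPy illustration with hypothetical function names, not the authors' Manopt-based code) implements the tangent-space projection (5), the Riemannian gradient (4) and the exponential-map retraction (6) for the sphere manifold.

```python
import numpy as np

def project_to_tangent(theta, z):
    """Orthogonal projection (5): Proj_theta(z) = z - (theta^T z) theta."""
    return z - (theta @ z) * theta

def riemannian_grad(theta, euclidean_grad):
    """Riemannian gradient (4): project the Euclidean gradient onto the tangent space."""
    return project_to_tangent(theta, euclidean_grad)

def exp_map(theta, u, lam=1.0):
    """Exponential-map retraction (6): move from theta along lam*u and land back on the sphere."""
    v = lam * u
    nv = np.linalg.norm(v)
    if nv < 1e-16:                      # zero step: stay put
        return theta
    return np.cos(nv) * theta + np.sin(nv) * (v / nv)

# sanity checks: the projected gradient is tangent at theta, and the
# retraction returns a point of unit norm, i.e. it stays on the sphere
rng = np.random.default_rng(0)
theta = rng.standard_normal(5)
theta /= np.linalg.norm(theta)
u = riemannian_grad(theta, rng.standard_normal(5))
print(abs(theta @ u) < 1e-12)
print(abs(np.linalg.norm(exp_map(theta, u, 0.3)) - 1.0) < 1e-12)
```

The two printed checks confirm the defining properties used throughout the rest of the paper: tangency of the projected gradient and feasibility (unit norm) of every retracted iterate.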
The retraction mapping is used to locate the next iterate on the manifold along a specified tangent vector, such as a search direction in the line search of Newton's algorithm or the suboptimal tangent direction in the trust-region algorithm, see Chapter 7 of [Absil et al., 2008]. For example, the line search algorithm is simply given by

    \theta_{k+1} = \operatorname{Exp}_{\theta_k}\big( \lambda_k u_{\theta_k} \big),    (7)

where the search direction u_{θ_k} ∈ T_{θ_k} S^{M-1} and λ_k is a chosen step size at iteration step k.

Vector transport: In Riemannian optimization algorithms, the second derivatives can be approximated by comparing the first-order information (tangent vectors) at distinct points on the manifold. The notion of vector transport T_{θ_{k+1} ← θ_k}(u_{θ_k}) on a manifold, roughly speaking, specifies how to transport a tangent vector u_{θ_k} from a point θ_k to another point θ_{k+1} on the manifold. The vector transport for the sphere manifold is calculated as

    T_{\theta_{k+1} \leftarrow \theta_k}(u_{\theta_k}) = \operatorname{Proj}_{\theta_{k+1}}(u_{\theta_k}).    (8)

3 Proposed estimator for the square root of probability density function

Given the finite data set D_N = {x_j}_{j=1}^{N} consisting of N data samples, where the data x_j ∈ R^m follow an unknown PDF p(x), the Gaussian mixture model [McLachlan and Peel, 2000] has the form

    p_{GMM}(x) = \sum_{i=1}^{P} \frac{g_i}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\Big( -\tfrac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \Big),    (9)

where μ_i and Σ_i are the mean vector and the (positive definite) covariance matrix of the ith mixture component. The mixing coefficients satisfy 0 ≤ g_i ≤ 1 for all i, and

    \sum_{i=1}^{P} g_i = 1.    (10)

In this work we investigate the less studied problem of estimating the square root of the probability density function (PDF) [Pinheiro and Vidakovic, 1998]. The reason that the square root of the probability density function is estimated is that the problem can be formulated as a Riemannian optimisation one for computational advantage, and that the resultant estimator can be used as an alternative to a probability density function estimator. Specifically, the problem under study is to find an approximation of the square root of the PDF, ψ(x) = √p(x) > 0, using M component Gaussian kernels, given by

    \psi(x) = \sum_{i=1}^{M} \omega_i K_\sigma(x, c_i) = \omega^T k(x),    (11)

subject to

    \int \psi^2(x)\, dx = 1,    (12)

where

    K_\sigma(x, c_i) = \frac{1}{(2\pi\sigma^2)^{m/2}} \exp\Big( -\frac{\|x - c_i\|^2}{2\sigma^2} \Big),    (13)

in which c_i = [c_{i,1}, c_{i,2}, ..., c_{i,m}]^T is the center vector of the ith kernel, σ > 0 is the width parameter, and ω_i is the ith kernel weight. We write ω = [ω_1, ..., ω_M]^T and k(x) = [K_σ(x, c_1), ..., K_σ(x, c_M)]^T. Applying (11) to the constraint (12), we have

    \int \psi^2(x)\, dx = \sum_{i=1}^{M} \sum_{j=1}^{M} \omega_i \omega_j \int K_\sigma(x, c_i) K_\sigma(x, c_j)\, dx = \omega^T Q \omega = 1,    (14)

where Q = {q_{i,j}}, with q_{i,j} = K_{√2σ}(c_i, c_j), as shown in Appendix A. A small illustrative code sketch of k(x), Q and the normalisation (14) is given below.
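The sketch below (a NumPy illustration under our own naming, not code from the paper) builds the kernel vector k(x) of (11) and the overlap matrix Q with entries q_{i,j} = K_{√2σ}(c_i, c_j), and rescales a weight vector so that the normalisation constraint ω^T Q ω = 1 of (14) holds. The centres and width are arbitrary toy values.

```python
import numpy as np

def gauss_kernel(x, c, sigma):
    """Isotropic Gaussian kernel K_sigma(x, c) of (13) in m dimensions."""
    m = len(c)
    d2 = np.sum((np.asarray(x) - np.asarray(c)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2) ** (m / 2.0)

def kernel_vector(x, centers, sigma):
    """k(x) = [K_sigma(x, c_1), ..., K_sigma(x, c_M)]^T as in (11)."""
    return np.array([gauss_kernel(x, c, sigma) for c in centers])

def overlap_matrix(centers, sigma):
    """Q = {q_ij} with q_ij = K_{sqrt(2) sigma}(c_i, c_j), cf. (14) and Appendix A."""
    M = len(centers)
    Q = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            Q[i, j] = gauss_kernel(centers[i], centers[j], np.sqrt(2.0) * sigma)
    return Q

# toy usage: any positive weight vector rescaled so that w^T Q w = 1 satisfies (12)
rng = np.random.default_rng(1)
centers = rng.standard_normal((6, 2))          # M = 6 hypothetical centres in R^2
sigma = 0.8
Q = overlap_matrix(centers, sigma)
w = np.abs(rng.standard_normal(6))
w /= np.sqrt(w @ Q @ w)                        # enforce the normalisation (12)
print(np.isclose(w @ Q @ w, 1.0))
```

The closed form of q_{i,j} avoids any numerical integration: the constraint (12) becomes the simple quadratic condition checked in the last line.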
Clustering algorithms can be used to find a set of centers which accurately reflects the distribution of the data points. Given the N data points x_j, j = 1, ..., N, the k-means algorithm [Haykin, 2009] seeks to partition the data points into M disjoint subsets S_i, each containing N_i data points, so as to minimize the sum-of-squares clustering function

    J = \sum_{i=1}^{M} \sum_{x_j \in S_i} \|x_j - c_i\|^2,    (15)

where ∈ denotes set membership. J is minimized when

    c_i = \frac{1}{N_i} \sum_{x_j \in S_i} x_j.    (16)

The k-means algorithm is detailed in Algorithm 1.

Algorithm 1 The k-means clustering algorithm
Require: x_j, j = 1, ..., N, and a preset number of centers M.
Ensure: c_i, i = 1, ..., M, that yield the minimum of J = \sum_{i=1}^{M} \sum_{x_j \in S_i} \|x_j - c_i\|^2.
1: Randomly select M data points from D_N as the initial centers c_i^{old}.
2: Randomly draw a data point x_j from D_N.
3: Among all c_i^{old}, i = 1, ..., M, find the nearest center to x_j, denoted as c_k^{old}.
4: Update c_k^{new} = c_k^{old} + ε (x_j - c_k^{old}), where ε > 0 is a predetermined learning rate.
5: Set c_i^{old} = c_i^{new}, i = 1, ..., M (only the nearest center c_k changes).
6: Go to Step 2 until a sufficiently large number of iterations has been reached (for convergence).
7: Return c_i^{new} as c_i, i = 1, ..., M.

We now propose a new maximum likelihood estimator for the parameters ω_i, based on the assumption that the M kernel centers and a preset kernel width are given. First, denote the eigenvalue decomposition of Q as Q = UΣU^T, where U is an orthogonal matrix consisting of the eigenvectors and Σ is a diagonal matrix whose entries are the eigenvalues of Q. Letting θ = √Σ U^T ω, ψ(x) can be written as

    \psi(x) = \theta^T \bar{k}(x),    (17)

where k̄(x) = √Σ^{-1} U^T k(x). In order to satisfy ψ(x) > 0, we initialize all ω_i with the same positive number ξ, i.e., ω_{Ini} = ξ 1_M, where 1_M is a length-M vector with all elements equal to one. The corresponding θ_0 can be calculated as

    \theta_0 = \sqrt{\Sigma}\, U^T 1_M,    (18)

followed by a normalization step θ_0 = θ_0 / ∥θ_0∥ so that θ_0 is on the sphere manifold. We are now ready to formulate the maximum likelihood estimator as the following Riemannian optimization problem,

    \theta_{opt} = \arg\min_{\theta \in S^{M-1}} \Big\{ F(\theta) = -\sum_{j=1}^{N} \log\big( \theta^T \bar{k}(x_j) \big) \Big\},    (19)

from which the kernel weights are recovered via θ_{opt} = √Σ U^T ω_{opt}, i.e., ω_{opt} = U √Σ^{-1} θ_{opt}. For our objective function F(θ), it is easy to check that the Euclidean gradient Grad F(θ) can be calculated as

    \operatorname{Grad} F(\theta) = -\sum_{j=1}^{N} \frac{\bar{k}(x_j)}{\theta^T \bar{k}(x_j)}.    (20)

Based on Grad F(θ), the Riemannian gradient of the objective function F(θ) on the sphere manifold can be calculated according to (4) and (5). We opt for the Riemannian conjugate gradient algorithm to solve (19), which generalizes the classical conjugate gradient algorithm [Hager and Zhang, 2006] to optimization problems over Riemannian manifolds [Boumal et al., 2014]. Since the logarithm function acts as a natural barrier at zero and the Riemannian conjugate gradient algorithm is a local minimization algorithm, the constraint ψ(x) > 0 can be met.

With all the ingredients available, we form the algorithm for solving (19) as Algorithm 2, which is implemented in the Manifold Optimization Toolbox Manopt, http://www.manopt.org, see [Boumal et al., 2014]. We used the default parameter settings in Manopt, so that in Step 4 of Algorithm 2, α_k is based on the backtracking line search procedure described in Chapter 4 (p. 63) of [Absil et al., 2008], and in Step 5 of Algorithm 2, β_k is based on the default option, Hestenes-Stiefel's modified rule. Specifically, we have

    \beta_{k+1} = \max\Big\{ 0, \frac{\langle \operatorname{grad} F(\theta_{k+1}), \gamma_{k+1} \rangle}{\langle T_{\theta_{k+1} \leftarrow \theta_k}(\eta_k), \gamma_{k+1} \rangle} \Big\},    (21)

where ⟨·,·⟩ denotes the inner product, and

    \gamma_{k+1} = \operatorname{grad} F(\theta_{k+1}) - T_{\theta_{k+1} \leftarrow \theta_k}\big( \operatorname{grad} F(\theta_k) \big).    (22)

Algorithm 2 Riemannian conjugate gradient algorithm for solving (19)
Require: k̄(x_j), j = 1, ..., N; an initial point θ_0 on the sphere manifold S^{M-1}; and a default threshold ϖ, e.g., ϖ = 10^{-6} or ϖ = 10^{-5}.
Ensure: θ that yields the minimum of F(θ).
1: Set η_0 = -grad F(θ_0) and k = 0;
2: while ∥grad F(θ_k)∥_2 > ϖ do
3:   k = k + 1;
4:   Compute a step size α_k and set
         \theta_k = \operatorname{Exp}_{\theta_{k-1}}\big( \alpha_k \eta_{k-1} \big);    (23)
5:   Compute β_k and set
         \eta_k = -\operatorname{grad} F(\theta_k) + \beta_k T_{\theta_k \leftarrow \theta_{k-1}}(\eta_{k-1});    (24)
6: end while
7: Return θ = θ_k, F(θ), and the associated U, Σ.
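The estimation step of (17)-(20) can be sketched as follows. The code whitens the kernel vectors via the eigendecomposition Q = UΣU^T, initialises θ_0 as in (18) and minimises the negative log-likelihood (19) on the sphere. For brevity it uses Riemannian gradient descent with a simple backtracking line search and the retraction (6), rather than the Hestenes-Stiefel conjugate-gradient update (21)-(24) that the paper runs through Manopt; it is a minimal illustration of the formulation, with hypothetical function names, not the authors' implementation.

```python
import numpy as np

def gauss_kernel_matrix(X, centers, sigma):
    """Rows give k(x_j)^T of (11) for every sample x_j in X (an N x M matrix)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)        # N x M squared distances
    m = X.shape[1]
    return np.exp(-d2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2) ** (m / 2)

def fit_sqrt_pdf(X, centers, sigma, iters=200, step0=1.0):
    """Minimise F(theta) = -sum_j log(theta^T kbar(x_j)) over the sphere, cf. (19)."""
    N, m = X.shape
    M = len(centers)
    d2c = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Q = np.exp(-d2c / (4 * sigma ** 2)) / (4 * np.pi * sigma ** 2) ** (m / 2)  # (14), (32)
    s, U = np.linalg.eigh(Q)                       # Q = U diag(s) U^T
    s = np.clip(s, 1e-12, None)                    # guard against round-off
    Kbar = gauss_kernel_matrix(X, centers, sigma) @ U / np.sqrt(s)   # rows: kbar(x_j)^T
    theta = np.sqrt(s) * (U.T @ np.ones(M))        # initialisation (18)
    theta /= np.linalg.norm(theta)

    def F(th):
        v = Kbar @ th
        return np.inf if np.any(v <= 0) else -np.sum(np.log(v))

    for _ in range(iters):
        g = -Kbar.T @ (1.0 / (Kbar @ theta))       # Euclidean gradient (20)
        rg = g - (theta @ g) * theta               # Riemannian gradient via (4), (5)
        if np.linalg.norm(rg) < 1e-6:
            break
        step, f0 = step0, F(theta)
        while True:                                # simple backtracking line search
            v = -step * rg                         # descend along the negative gradient
            nv = np.linalg.norm(v)
            cand = np.cos(nv) * theta + np.sin(nv) * v / nv   # retraction (6)
            if F(cand) < f0 or step < 1e-12:
                break
            step *= 0.5
        theta = cand / np.linalg.norm(cand)        # re-normalise against round-off
    omega = U @ (theta / np.sqrt(s))               # recover omega from theta = sqrt(Sigma) U^T omega
    return theta, omega

# toy usage on synthetic 2-D data with hypothetical centres
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 2))
centers = X[rng.choice(200, size=10, replace=False)]
theta, omega = fit_sqrt_pdf(X, centers, sigma=0.8)
print(np.isclose(np.linalg.norm(theta), 1.0))      # theta stays on the sphere
```

This sketch is only meant to clarify the whitening and the role of the sphere constraint; the experiments reported below were obtained with the conjugate-gradient solver of Manopt, as described above.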
In summary, our proposed algorithm consists of two consecutive steps: (i) in the first stage, we determine the M kernel centres using the k-means clustering algorithm, as presented in Algorithm 1; (ii) for a grid search over the kernel width, we estimate the kernel parameters ω using the Riemannian conjugate gradient algorithm based on the maximum likelihood criterion. For completeness, the proposed estimator is detailed in Algorithm 3.

Algorithm 3 Proposed estimator for the square root of probability density function
Require: x_j, j = 1, ..., N, and a preset number of kernels M.
Ensure: ψ(x) that maximizes the likelihood.
1: Use the k-means clustering algorithm (Algorithm 1) to find c_i, i = 1, ..., M;
2: for i = 1, 2, ..., (Iter + 1) do
3:   σ = σ_{min} + \frac{i-1}{Iter}(σ_{max} - σ_{min}), where σ_{min}, σ_{max} and (Iter + 1) are preset and denote the minimum value, the maximum value and the total number of kernel widths on the grid search, respectively.
4:   Form Q and its eigen-decomposition Q = UΣU^T.
5:   Obtain k̄(x_j) = √Σ^{-1} U^T k(x_j), j = 1, ..., N.
6:   Find θ_0 according to (18) and normalize it.
7:   Apply the Riemannian conjugate gradient algorithm (Algorithm 2) to return F(θ).
8: end for
9: Find σ_{opt} that minimizes F(θ) over the grid search. Return the associated θ, U, Σ.
10: Return ω_{opt} = U √Σ^{-1} θ_{opt}.

The computational complexity consists of that of the k-means clustering, which is of the order Iter_{kmeans} · O(N), where Iter_{kmeans} is the number of iterations of the k-means clustering algorithm, and that of the Riemannian conjugate gradient algorithm, scaled by the number of grid search points Iter. The main cost in the Riemannian conjugate gradient algorithm is the function and derivative evaluation of (19) and (20), which is of the order O(MN). Hence the total cost of the proposed algorithm is evaluated as Iter_{kmeans} · O(N) + Iter · O(MN).

4 Illustrative examples

Example 1: A data set of N = 600 points was randomly drawn from a known distribution p(x) and used to construct the estimator of the square root of the probability density function, ψ(x), using the proposed approach. p(x) is given as a mixture of one Gaussian and one Laplacian, defined by

    p(x) = \frac{2}{3\pi} \exp\big( -2(x_1 - 1)^2 - 2(x_2 - 1)^2 \big) + \frac{0.35}{6} \exp\big( -0.7|x_1 + 1| - 0.5|x_2 + 1| \big).    (25)

We preset M = 40, and the k-means algorithm was applied to find the M = 40 centers c_i, i = 1, ..., M. The Riemannian conjugate gradient algorithm was applied to minimize the negative log-likelihood, with Q obtained over a range of kernel widths [0.5, 2] with a grid spacing of 0.1. The resulting log F(θ) versus the kernel width is shown in Figure 1. The convergence of the Riemannian conjugate gradient algorithm on the sphere manifold for the optimal kernel width is shown in Figure 2, showing that the algorithm converges rapidly. The performance of the proposed estimator is shown in Figure 3(a), (b) and (c), which plot ψ²(x), the true density p(x), and the estimation error e(x) = ψ²(x) - p(x), respectively, over a 41 x 41 meshed grid with a spacing of 0.2, ranging from -4 to 4 for both x_1 and x_2.

Fig. 1 log(F(θ)) versus the kernel width σ for Example 1.

Fig. 2 Convergence of the Riemannian conjugate gradient algorithm for Example 1 (log F(θ) versus iterations).

Fig. 3 The result of the proposed estimator for Example 1: (a) the resultant PDF estimate ψ²(x); (b) the true PDF p(x); and (c) the estimation error e(x).
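For readers who wish to reproduce a data set like that of Example 1, the following sketch samples from the mixture (25) and evaluates the true density on the 41 x 41 test mesh used later for evaluation. The component weights 1/3 (Gaussian) and 2/3 (Laplacian) and the component parameters are our reading of the normalising constants in (25), not values stated explicitly in the text.

```python
import numpy as np

def sample_example1(n, rng):
    """Draw n points from the mixture density (25).

    Our reading of (25): a Gaussian N(mean=(1,1), cov=0.25*I) with mixture
    weight 1/3, plus a product Laplace centred at (-1,-1) with per-dimension
    scales 1/0.7 and 1/0.5 and mixture weight 2/3; the weights follow from
    the normalising constants 2/(3*pi) and 0.35/6."""
    comp = rng.random(n) < 1.0 / 3.0
    gauss = rng.normal([1.0, 1.0], 0.5, (n, 2))
    lap = np.column_stack([rng.laplace(-1.0, 1.0 / 0.7, n),
                           rng.laplace(-1.0, 1.0 / 0.5, n)])
    return np.where(comp[:, None], gauss, lap)

def p_example1(x1, x2):
    """Evaluate the true density (25) pointwise, e.g. on a plotting mesh."""
    g = 2.0 / (3.0 * np.pi) * np.exp(-2.0 * (x1 - 1.0) ** 2 - 2.0 * (x2 - 1.0) ** 2)
    return g + 0.35 / 6.0 * np.exp(-0.7 * np.abs(x1 + 1.0) - 0.5 * np.abs(x2 + 1.0))

rng = np.random.default_rng(0)
X = sample_example1(600, rng)                     # N = 600 training points, as in Example 1
grid = np.linspace(-4.0, 4.0, 41)                 # 41 x 41 mesh with spacing 0.2
x1, x2 = np.meshgrid(grid, grid)
P = p_example1(x1, x2)
print(X.shape, P.shape)                           # (600, 2) (41, 41)
```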
The experiment was repeated for 100 different random runs in order to compare the performance of the proposed algorithm with the Gaussian mixture model (GMM) fitted using the EM algorithm [McLachlan and Peel, 2000]. A separate test data set of N_test = 1681 points was generated on a 41 x 41 mesh with a spacing of 0.2, ranging from -4 to 4 for both x_1 and x_2. These points were used for evaluation according to

    L_1 = \begin{cases} \frac{1}{N_{test}} \sum_{k=1}^{N_{test}} |p(x_k) - \psi^2(x_k)|, & \text{Proposed} \\ \frac{1}{N_{test}} \sum_{k=1}^{N_{test}} |p(x_k) - p_{GMM}(x_k)|, & \text{GMM} \end{cases}    (26)

where p_{GMM} is the resultant GMM model based on two Gaussian mixture components, with the constraint that the covariance matrices are diagonal. The GMM model fitting was implemented using the Matlab command gmdistribution.fit.m. The results are shown in Table 2, which demonstrates that the proposed algorithm outperforms the GMM fitted using the EM algorithm with two mixture components.

Table 2 Performance of the proposed density estimator in comparison with the GMM model for Example 1.
Method     L_1 test error (mean ± STD)
Proposed   (2.2 ± 0.3) x 10^{-4}
GMM        (3.2 ± 0.6) x 10^{-4}

Example 2: A data set of N = 1500 points was randomly drawn from a known distribution p(x), given as a mixture of one Gaussian and one Laplacian, defined by

    p(x) = \frac{1}{2.1\pi} \exp\big( -2(x_1 - 2)^2 - (x_2 - 2)^2/0.98 \big) + \frac{0.35}{6} \exp\big( -0.7|x_1 + 1| - 0.5|x_2 + 1| \big),    (27)

and the proposed approach was used to estimate ψ(x), the square root of p(x). The number of centers was empirically set to M = 40, and the k-means algorithm was implemented according to Algorithm 1, producing M centers c_i, i = 1, ..., M. Based on the centers, the kernel matrices Q were generated for each kernel width on a grid over [0.2, 2] with a spacing of 0.1. The Riemannian conjugate gradient algorithm was used to minimize the negative log-likelihood for this range of kernel widths. The resulting log F(θ) versus the kernel width is shown in Figure 4. The convergence of the Riemannian conjugate gradient algorithm on the sphere manifold for the optimal kernel width is shown in Figure 5, showing that the algorithm converges rapidly. The performance of the proposed estimator is shown in Figure 6(a), (b) and (c), which plot ψ²(x), the true density p(x), and the estimation error e(x) = ψ²(x) - p(x), respectively, over a 41 x 41 meshed grid with a spacing of 0.2, ranging from -4 to 4 for both x_1 and x_2.

Fig. 4 log(F(θ)) versus the kernel width for Example 2.

Fig. 5 Convergence of the Riemannian conjugate gradient algorithm for Example 2.

Fig. 6 The result of the proposed estimator for Example 2: (a) the resultant PDF estimate ψ²(x); (b) the true PDF p(x); and (c) the estimation error e(x).
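Both examples score estimators with the same L1 criterion, cf. (26) above and (28) below. The sketch that follows illustrates the GMM branch of that criterion for Example 2, using scikit-learn's GaussianMixture (two components, diagonal covariances) as a stand-in for the Matlab gmdistribution.fit routine used in the paper; the sampling weights 1/3 and 2/3 and the component parameters are again read off the constants of (27) and are our interpretation. The proposed estimator's branch would simply replace p_GMM(x_k) by ψ²(x_k).

```python
import numpy as np
from sklearn.mixture import GaussianMixture   # stand-in for Matlab's gmdistribution.fit

def p_true_ex2(X):
    """True density (27) of Example 2, evaluated row-wise on an (N, 2) array."""
    x1, x2 = X[:, 0], X[:, 1]
    g = 1.0 / (2.1 * np.pi) * np.exp(-2 * (x1 - 2) ** 2 - (x2 - 2) ** 2 / 0.98)
    return g + 0.35 / 6 * np.exp(-0.7 * np.abs(x1 + 1) - 0.5 * np.abs(x2 + 1))

# sample N = 1500 training points from (27): Gaussian at (2,2) with std devs
# (0.5, 0.7) and weight 1/3, product Laplace at (-1,-1) with weight 2/3
rng = np.random.default_rng(0)
n = 1500
comp = rng.random(n) < 1 / 3
gauss = rng.normal([2.0, 2.0], [0.5, 0.7], (n, 2))
lap = np.column_stack([rng.laplace(-1, 1 / 0.7, n), rng.laplace(-1, 1 / 0.5, n)])
X = np.where(comp[:, None], gauss, lap)

# 41 x 41 test mesh (N_test = 1681) with spacing 0.2 on [-4, 4]^2
grid = np.linspace(-4, 4, 41)
x1, x2 = np.meshgrid(grid, grid)
mesh = np.column_stack([x1.ravel(), x2.ravel()])

# two-component diagonal-covariance GMM fitted by EM, as in the comparison
gmm = GaussianMixture(n_components=2, covariance_type='diag', random_state=0).fit(X)
p_gmm = np.exp(gmm.score_samples(mesh))

# GMM branch of the L1 criterion (26)/(28)
print(np.mean(np.abs(p_true_ex2(mesh) - p_gmm)))
```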
The experiment was repeated for 100 different random runs in order to compare the performance of the proposed algorithm with the Gaussian mixture model (GMM) fitted using the EM algorithm [McLachlan and Peel, 2000]. A separate test data set of N_test = 1681 points was generated on a 41 x 41 mesh with a spacing of 0.2, ranging from -4 to 4 for both x_1 and x_2. These points were used for evaluation according to

    L_1 = \begin{cases} \frac{1}{N_{test}} \sum_{k=1}^{N_{test}} |p(x_k) - \psi^2(x_k)|, & \text{Proposed} \\ \frac{1}{N_{test}} \sum_{k=1}^{N_{test}} |p(x_k) - p_{GMM}(x_k)|, & \text{GMM} \end{cases}    (28)

where p_{GMM} is the resultant GMM model, with the constraint that the covariance matrices are diagonal. The GMM model fitting was implemented using the Matlab command gmdistribution.fit.m with two mixture components. The results are shown in Table 3, showing that the proposed method performs better than the Gaussian mixture model with two components.

Table 3 Performance of the proposed density estimator in comparison with the GMM model for Example 2.
Method     L_1 test error (mean ± STD)
Proposed   (1.7 ± 0.2) x 10^{-3}
GMM        (2.9 ± 0.1) x 10^{-3}

5 Conclusions

In this paper a new method has been introduced to estimate the square root of a probability density function from observational data using maximum likelihood. The proposed model takes the form of a linear combination of Gaussian kernels. Incorporating the k-means clustering algorithm to determine the kernel centers as well as a grid search to determine the kernel width, the model parameters are then estimated using Riemannian optimization on the sphere. The first order Riemannian geometry of the sphere manifold and the vector transport are explored. Two illustrative examples are employed to demonstrate that models obtained by the proposed approach compare favourably with the GMM model fitted using the EM algorithm.

In this research we have used the well-known k-means clustering algorithm, which relies on a predetermined number M of centers that may not be optimal. The choice of M is generally application dependent, and for the k-means clustering algorithm M should be set sufficiently large to perform well. Future research will focus on model structure optimization so that the model size can be learnt from the data.

Appendix A

Let x = [x_1, ..., x_m]^T. We have

    q_{i,j} = \frac{1}{(2\pi\sigma^2)^m} \int \cdots \int \exp\Big( -\frac{\|x - c_i\|^2}{2\sigma^2} - \frac{\|x - c_j\|^2}{2\sigma^2} \Big)\, dx_1 \cdots dx_m
            = \frac{1}{(2\pi\sigma^2)^m} \prod_{l=1}^{m} \int \exp\Big( -\frac{(x_l - c_{i,l})^2}{2\sigma^2} - \frac{(x_l - c_{j,l})^2}{2\sigma^2} \Big)\, dx_l,    (29)

in which

    \int \exp\Big( -\frac{(x_l - c_{i,l})^2}{2\sigma^2} - \frac{(x_l - c_{j,l})^2}{2\sigma^2} \Big)\, dx_l
      = \int \exp\Big( -\frac{x_l^2 - (c_{i,l} + c_{j,l}) x_l + (c_{i,l}^2 + c_{j,l}^2)/2}{\sigma^2} \Big)\, dx_l
      = \exp\Bigg( -\frac{\frac{c_{i,l}^2 + c_{j,l}^2}{2} - \big( \frac{c_{i,l} + c_{j,l}}{2} \big)^2}{\sigma^2} \Bigg) \int \exp\Bigg( -\frac{\big[ x_l - \frac{c_{i,l} + c_{j,l}}{2} \big]^2}{\sigma^2} \Bigg)\, dx_l.    (30)

By making use of \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\big( -\frac{(x_l - c)^2}{2\sigma^2} \big)\, dx_l = 1, i.e., a Gaussian density integrates to one, we have

    \int \exp\Big( -\frac{(x_l - c_{i,l})^2}{2\sigma^2} - \frac{(x_l - c_{j,l})^2}{2\sigma^2} \Big)\, dx_l
      = \sqrt{\pi\sigma^2}\, \exp\Bigg( -\frac{\frac{c_{i,l}^2 + c_{j,l}^2}{2} - \big( \frac{c_{i,l} + c_{j,l}}{2} \big)^2}{\sigma^2} \Bigg)
      = \sqrt{\pi\sigma^2}\, \exp\Big( -\frac{(c_{j,l} - c_{i,l})^2}{4\sigma^2} \Big).    (31)

Hence

    q_{i,j} = \frac{1}{(4\pi\sigma^2)^{m/2}} \exp\Big( -\frac{\|c_i - c_j\|^2}{4\sigma^2} \Big) = K_{\sqrt{2}\sigma}(c_i, c_j).    (32)

References

Absil, P.-A., Mahony, R., and Sepulchre, R. (2008). Optimization Algorithms on Matrix Manifolds. Princeton University Press.

Bilmes, J. A. (1998). A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021, University of California, Berkeley.

Boumal, N., Mishra, B., Absil, P.-A., and Sepulchre, R. (2014). Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15:1455-1459.
Chen, S., Hong, X., and Harris, C. J. (2004a). Sparse kernel density construction using orthogonal forward regression with leave-one-out test score and local regularization. IEEE Trans. on Systems, Man and Cybernetics, Part B: Cybernetics, 34(4):1708-1717.

Chen, S., Hong, X., and Harris, C. J. (2008). An orthogonal forward regression technique for sparse kernel density estimation. Neurocomputing, 71(4-6):925-943.

Chen, S., Hong, X., and Harris, C. J. (2010). Particle swarm optimization aided orthogonal forward regression for unified data modelling. IEEE Trans. Evolutionary Computation, 14:477-499.

Chen, S., Hong, X., Harris, C. J., and Sharkey, P. M. (2004b). Sparse modelling using orthogonal forward regression with PRESS statistic and regularization. IEEE Trans. on Systems, Man and Cybernetics, Part B: Cybernetics, 34(2):898-911.

Choudhury, A. (2002). Fast Machine Learning Algorithms for Large Data. PhD thesis, School of Engineering Sciences, University of Southampton.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society B, 39:1-38.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley: New York.

Girolami, M. and He, C. (2003). Probability density estimation from optimally condensed data samples. IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(10):1253-1264.

Hager, W. W. and Zhang, H. (2006). A survey of nonlinear conjugate gradient methods. Pacific J. Optimization, 2:35-58.

Harandi, M., Hartley, R., Shen, C., Lovell, B., and Sanderson, C. (2014). Extrinsic methods for coding and dictionary learning on Grassmann manifolds. arXiv:1401.8126.

Haykin, S. (2009). Neural Networks and Learning Machines. Pearson Education Inc.

Hong, X., Chen, S., Qatawneh, A., Daqrouq, K., Sheikh, M., and Morfeq, A. (2013). Sparse probability density function estimation using the minimum integrated square error. Neurocomputing, 115:122-129.

Hong, X., Gao, J., Chen, S., and Zia, T. (2015). Sparse density estimation on the multinomial manifold. IEEE Transactions on Neural Networks and Learning Systems, 26(11):2972-2977.

Hong, X. and Gao, J. B. (2016). A fast algorithm to estimate the square root of probability density function. In Proceedings of AI-2016, Thirty-sixth SGAI International Conference on Artificial Intelligence, Cambridge, UK.

Lui, Y. M. (2012). Advances in matrix manifolds for computer vision. Image and Vision Computing, 30:380-388.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley: New York.

Mishra, B., Meyer, G., Bach, F., and Sepulchre, R. (2013). Low-rank optimization with trace norm penalty. SIAM Journal on Optimization, 23:2124-2149.

Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1066-1076.

Pinheiro, A. and Vidakovic, B. (1998). Estimating the square root of density via compactly supported wavelets. Computational Statistics and Data Analysis, 25:399-415.

Rutkowski, L. (2004). Adaptive probabilistic neural networks for pattern classification in time-varying environment. IEEE Trans. Neural Networks, 15:811-827.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall: London.

Vapnik, V. and Mukherjee, S. (2000). Support vector machine for multivariate density estimation. In Solla, S. A., Leen, T. K., and Müller, K.-R., editors, Advances in Neural Information Processing Systems, pages 659-665. MIT Press.
Weston, J., Gammerman, A., Stitson, M., Vapnik, V., Vovk, V., and Watkins, C. (1999). Support vector density estimation. In Schölkopf, B., Burges, C. J. C., and Smola, A. J., editors, Advances in Kernel Methods, pages 293-306. MIT Press.

Yin, H. and Allinson, N. W. (2001). Self-organizing mixture networks for probability density estimation. IEEE Trans. Neural Networks, 12:405-411.