Human action recognition with sparse classification and multiple-view learning

Rodrigo Cilla, Miguel A. Patricio, Antonio Berlanga and José M. Molina
Applied Artificial Intelligence Group, Universidad Carlos III de Madrid, Madrid, Spain
E-mail: rodri.cilla@gmail.com; rcilla@inf.uc3m.es

Published in: Expert Systems (2014), 31(4), 354-364. DOI: http://dx.doi.org/10.1111/exsy.12040. © 2013 Wiley Publishing Ltd.

Abstract: Employing multiple camera viewpoints in the recognition of human actions increases performance. This paper presents a feature fusion approach to efficiently combine 2D observations extracted from different camera viewpoints. Multiple-view dimensionality reduction is employed to learn a common parameterization of 2D action descriptors computed for each one of the available viewpoints. Canonical correlation analysis and its variants are employed to obtain such parameterizations. A sparse sequence classifier based on L1 regularization is proposed to avoid the problem of having to choose the proper number of dimensions of the common parameterization. The proposed system is employed in the classification of the Inria Xmas Motion Acquisition Sequences (IXMAS) data set with successful results.

Keywords: Human Action Recognition, Multiple View Learning, L1 regularization

1. Introduction

The recognition of human actions has received increasing attention from the computer vision community during the last two decades (Lavee et al., 2009; Weinland et al., 2011). The objective is to build computational models to study how humans move and behave. A wide range of applications, such as ambient intelligence (Chaaraoui et al., 2012), video surveillance (Foresti et al., 2004) or human-computer interaction (Ren & Xu, 2002), have benefited from enhancements made in the field.

Whereas the first systems built for human action recognition were limited to a single camera (Cedras & Shah, 1995), advances made in visual sensor networks have motivated the inclusion of multiple cameras to enhance recognition robustness (Cilla et al., 2012). This way, wider scenes can be covered, and systems can deal with occlusions caused by walls and furniture that complicate recognition from a single camera view. Additionally, the perception of the motion from different viewpoints provides supplementary information that facilitates the recognition.

A current trend in human action recognition is the efficient combination of observations captured from different camera viewpoints. Most of the existing approaches are based on recovering a 3D model of the target of study exploiting multiple-view geometry. Visual hulls (Weinland et al., 2006) or 3D skeletons (Parameswaran & Chellappa, 2006) are examples of these models. However, they have some drawbacks: (1) the need for accurate camera calibration parameters to recover the 3D model and (2) large network bandwidth requirements, as large amounts of raw data must be sent to a central node where the reconstruction is performed. These facts make current 3D reconstruction methods inappropriate for implementation in visual sensor networks.

Experimental evidence has shown that common 2D high-dimensional cues employed in human action recognition, such as silhouettes or optical flow, might be parameterized in low-dimensional manifolds where most of their variance is preserved (Blackburn & Ribeiro, 2007; Wang & Suter, 2008).
At the same time, 3D descriptors recovered from multiple camera viewpoints, such as visual hulls or skeletons, are also parameterizable in low-dimensional manifolds (Turaga et al., 2008; Peng et al., 2009b). Both kinds of models contain information about the same real-world phenomena but at different depths, and both share the property of being parameterizable in low-dimensional manifolds preserving most of their variance. A question arising from this fact is whether it is possible to recover a low-dimensional manifold parameterization from multiple 2D descriptors with a performance similar to that recovered from 3D descriptors.

A design decision when learning low-dimensional manifold parameterizations is the number of dimensions that the learned low-dimensional space should have. There are different proposals to adjust it automatically. Variance preservation ratios (Jolliffe, 2002) or automatic relevance determination priors (Nounou et al., 2002) are some popular choices. They adjust manifold dimensionality to preserve the maximum variance employing a reasonable number of parameters. However, these methods do not ensure a good performance when the obtained low-dimensional parameterization is employed in a subsequent task, such as action class prediction. In practice, experimental cross-validation is employed to select the right number of dimensions. It has a high computational cost, as many set-ups have to be evaluated to find the best one. An alternative is to select the right number of dimensions in the subsequent task, incorporating feature selection into the model with a sparse prior (Tibshirani, 1996).

This paper presents a multi-camera human action recognition system taking into account the considerations presented earlier. A feature fusion algorithm is proposed to learn a common manifold representation of feature descriptors extracted from each camera viewpoint. The manifold is learned with a large number of dimensions. A sparse sequence classifier is proposed to select relevant features from the manifold parameterization. This way, the problem of choosing the right number of dimensions for the manifold vanishes.

1.1. Purpose and contributions

The purpose and contributions of this paper are summarized as follows:

• The aim of the paper is to provide a method to classify human actions using descriptors computed from different camera views.
• A feature fusion approach is proposed to combine action descriptors extracted from each camera view. Canonical correlation analysis (CCA) and kernel CCA (KCCA) are employed to obtain a common manifold parameterization of the motion descriptors.
• To avoid the problem of selecting the manifold dimensionality, a sparse hidden conditional random field (SHCRF) is introduced for action sequence classification. Sparsification is achieved by introducing an L1 penalty to the objective function optimized during training.
• An efficient online learning method is employed to train SHCRFs, reducing training time.
• The proposed system is evaluated in the recognition of the IXMAS data set. Experimental evidence shows that the proposed method has a performance similar to state-of-the-art proposals employing 3D models.

1.2. Paper organization

This paper is organized as follows: Section 2 reviews relevant work on human action recognition with a special focus on methods employing multiple camera views.
Section 3 introduces feature fusion algorithms to recover common manifold parameterizations for the action descriptors extracted from each one of the camera views. Section 4 introduces an SHCRF model employed to predict human actions while selecting relevant features from the manifold parameterization. Experimental results are presented and discussed in Section 5. Finally, Section 6 summarizes the contributions of this work.

2. Related work

2.1. Human action recognition

Multiple works have been developed to bridge the semantic gap from pixel intensity values in image sequences to descriptions of the human actions performed in them. Each work defines different steps to solve the correspondence, but, in general, the process might be split into two steps: (1) feature extraction and representation and (2) action class prediction.

The former step deals with the extraction and efficient encoding of features to describe motions of interest. Multiple features might be extracted for motion modelling. Parametric models have been fitted to the targets, and temporal moments of the recovered parameter values have been employed for recognition (Ribeiro & Santos-Victor, 2005). Recovering model parameters is a difficult task, so these approaches have been deprecated in favour of visual features. Appearance descriptors, such as silhouettes or skeletons, describe what the target looks like (Bobick & Davis, 2001). Local motion descriptors, such as optical flow, describe the apparent motion of the pixels, providing a very strong cue for action recognition (Efros et al., 2003). The main disadvantage of these methods is their lack of robustness towards partial occlusions of the target. Local feature descriptors have been proposed to overcome this limitation, encoding temporal and spatial variations in the neighbourhood of pixels with spatio-temporal saliency properties (Laptev, 2005).

The latter step deals with the transformation of features into semantic descriptions. Sequence models capture temporal correlations among feature values and employ them to select the appropriate action label. It is possible to apply exemplar-based models with different distance measures to select action labels (Efros et al., 2003; Blackburn & Ribeiro, 2007; Wang & Suter, 2008), but most of the systems employ probabilistic graphical models. The generative Hidden Markov Model (HMM) is the de facto standard for human action recognition (Rabiner, 1989; Piccardi & Perez, 2007). However, discriminative graphical models have shown a better performance in action class prediction. The HCRF (Quattoni et al., 2007) has outperformed HMMs in multiple action recognition tasks. However, there are many open problems related to the usage of HCRFs. This work deals with model and feature selection for the HCRF.

2.2. Multi-camera approaches to human action recognition

Dasarathy's (1997) input-output data fusion model provides a categorization framework for data fusion systems according to the level of abstraction where system inputs and outputs are defined (data, features and decisions). Here, it is employed to organize relevant works on human action recognition from multiple camera viewpoints. Data-based levels of the framework are not considered, as the information employed in human action recognition is only defined at the feature and decision levels.

Diverse methods have been defined at the feature-in feature-out data fusion level to perform human action recognition from multiple camera viewpoints.
It is possible to divide the works at this level into three different categories: (1) projection of 2D features to 3D; (2) feature fusion in a subspace; and (3) selection of the best view.

Different 3D representations might be obtained from projecting 2D features to 3D. A popular approach is to project 2D silhouettes to 3D to obtain a visual hull representing 3D appearance (Gkalelis et al., 2009; Peng et al., 2009a; Pehlivan & Duygulu, 2011). Visual hull reconstruction requires accurate silhouette segmentation from the different views. Recent works have proposed to project optical flow to 3D (Holte & Chakraborty, 2012) or to project local interest points (Holte et al., 2011). Other works recover 3D star skeletons from 2D skeleton feature correspondences (Chen et al., 2008). Action sketch correspondence across multiple views has also been proposed (Yan et al., 2008). The main drawback of 3D projection approaches is the need for accurate camera calibration parameters.

Other methods compute 2D features for each view and combine them by employing some simple scheme. Averaging of multiple features representing pose, global and local motion has been proposed, improving accuracy when compared with other alternatives (Määttä et al., 2010). A joint bag-of-words histogram might be built with local feature descriptors extracted from each one of the camera views (Wu et al., 2010), but a better performance is reported when other fusion strategies are employed. Projections maximizing cross-covariance have been learned to combine R-transform derivatives extracted from each camera view (Karthikeyan et al., 2011). Two-level linear discriminant analysis has been employed to learn silhouette projections maximizing action class separability (Iosifidis et al., 2012). All these methods provide more flexible solutions for the combination of the features obtained from multiple cameras. However, experimental evidence shows a performance lower than that reported by 3D projection methods.

The last class of methods is based on the computation of a quality measure for each camera view, to perform the recognition employing only the data from the best view. An estimation of the orientation of the human with respect to the camera (Shen et al., 2007) or different properties of the silhouette (Määttä & Aghajan, 2010) have been employed to obtain a quality measure. It has been proposed to select the camera with the highest number of detections (Wu et al., 2010) when methods based on interest point location are employed. There are different quality measures for studying saliency, concavity or variations in silhouette stacks (Rudoy & Zelnik-Manor, 2011). The main drawback of these approaches is that they do not exploit complementary and redundant information obtained from multiple camera views.

Methods defined at the feature-in decision-out level encode existing correlations between feature descriptors extracted from each camera view and action labels. The concatenation of input features is the most straightforward procedure to perform data fusion in this way (Määttä & Aghajan, 2010; Wu et al., 2010). The fused HMM (Wang et al., 2007) proposes modelling correlations among observations by coupling the values of the hidden state chains of parallel HMMs defined for each camera view. Histograms of local features are fused, rotating the ordering of the inputs to account for the variations in orientation (Srivastava et al., 2009a).
The main drawback of these works is their lack of flexibility, assuming that the camera configuration remains unchanged between the train and test steps. A procedure to align camera views when the configuration changes from the train to the test step is defined in Ramagiri et al. (2011), but it requires knowledge of the relative camera placement.

Decision-in decision-out is the highest level of abstraction. Action prediction is performed for each camera view, and the results obtained are later combined. Majority voting is the most common method for the fusion of decisions (Määttä & Aghajan, 2010). A weighted voting strategy is proposed in Zhu et al. (2012), correcting each vote according to the value of the observed feature. Errors produced at each camera view might be incorporated into the voting procedure as proposed by Cilla et al. (2012).

3. Feature fusion for human action recognition

This section presents the feature extraction and feature fusion methods employed in the proposed systems. The 2D human motion descriptor employed is first introduced. Then, CCA, a multiple-view dimensionality reduction method employed to find a common manifold parameterization of the motion descriptors, is described. This section finishes with a presentation of KCCA, a non-linear extension to CCA.

3.1. Feature extraction

Section 2 has shown that it is possible to employ different visual features in the recognition of human actions. This work employs the action descriptor proposed by Tran et al. (2008). It combines motion and appearance information. It has been selected because it has shown a high experimental performance on the data set that will be employed for system evaluation. It is extracted by normalizing the bounding box of the observed human to a square box, preserving the aspect ratio. Shape and optical flow are computed from the normalized box. The vertical and horizontal planes of the optical flow are split and blurred with a median filter. Thus, each box has three channels: silhouette, vertical flow and horizontal flow. The box is divided into four tiles, and a radial 18-bin histogram is computed from each tile and each channel. The obtained histograms are concatenated to obtain a 216-d vector. A Principal Component Analysis (PCA) reduction of the surrounding past, present and future vectors is appended to generate a descriptor of d_TRAN = 286 dimensions. Readers are referred to Tran et al. (2008) for more details.
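To make the descriptor computation concrete, the following sketch gives a minimal per-frame implementation in the spirit of Tran et al. (2008). It is not the authors' code: the tile grid, the angular definition of the radial bins, the median filter size and the function names (frame_descriptor, radial_histogram) are illustrative assumptions, and the PCA block over neighbouring frames that completes the 286-dimension descriptor is omitted.

```python
import numpy as np
from scipy.ndimage import median_filter

N_BINS = 18   # angular bins per tile
GRID = 2      # 2 x 2 grid -> 4 tiles per channel

def radial_histogram(tile, n_bins=N_BINS):
    """Accumulate the tile values into n_bins angular sectors around the tile centre."""
    h, w = tile.shape
    ys, xs = np.mgrid[0:h, 0:w]
    angles = np.arctan2(ys - (h - 1) / 2.0, xs - (w - 1) / 2.0)  # in [-pi, pi]
    hist, _ = np.histogram(angles.ravel(), bins=n_bins,
                           range=(-np.pi, np.pi), weights=tile.ravel())
    return hist

def frame_descriptor(silhouette, flow, n_bins=N_BINS, grid=GRID):
    """Per-frame descriptor sketch (assumed reading of Tran et al., 2008).

    silhouette : (S, S) binary mask, already normalized to a square box.
    flow       : (S, S, 2) optical flow (horizontal and vertical components).
    Returns a vector of length grid * grid * 3 * n_bins (216 by default).
    """
    fx = median_filter(flow[..., 0], size=3)   # blurred horizontal flow
    fy = median_filter(flow[..., 1], size=3)   # blurred vertical flow
    channels = [silhouette.astype(float), fx, fy]

    step = silhouette.shape[0] // grid
    feats = []
    for ch in channels:
        for i in range(grid):
            for j in range(grid):
                tile = ch[i * step:(i + 1) * step, j * step:(j + 1) * step]
                feats.append(radial_histogram(tile, n_bins))
    return np.concatenate(feats)   # 216-d frame descriptor
```

With a 2 x 2 grid, three channels and 18 bins, the concatenation yields the 216-dimension per-frame vector described above.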
3.2. Canonical correlation analysis

The objective of CCA (Hardoon et al., 2004) is to find a pair of linear projections maximizing the correlation in the projected space between a pair of multivariate random variables. Given the zero-mean random variables in the input space $x_1$ and $x_2$ with dimensions $d_1$ and $d_2$, CCA finds a pair of linear transformations $w_1$, $w_2$ such that one component within each set of transformed variables is correlated with a single component in the other set. The correlation between the corresponding components is called canonical correlation, and there can be at most $d = \min(d_1, d_2)$ canonical correlations. The first canonical correlation is defined as

$$\rho = \max_{w_1, w_2} \frac{\langle w_1^T x_1 \cdot w_2^T x_2 \rangle}{\sqrt{\langle \| w_1^T x_1 \|^2 \rangle \, \langle \| w_2^T x_2 \|^2 \rangle}} \qquad (1)$$

$$= \max_{w_1, w_2} \frac{w_1^T \langle x_1 x_2^T \rangle w_2}{\sqrt{w_1^T \langle x_1 x_1^T \rangle w_1 \; w_2^T \langle x_2 x_2^T \rangle w_2}} \qquad (2)$$

where $\langle x_1 x_1^T \rangle$, $\langle x_2 x_2^T \rangle$ and $\langle x_1 x_2^T \rangle$ are estimated as $\tilde{\Sigma}_{11}$, $\tilde{\Sigma}_{22}$ and $\tilde{\Sigma}_{12}$, respectively, that is, the different minors of the empirical covariance matrix

$$\tilde{\Sigma} = \begin{pmatrix} \tilde{\Sigma}_{11} & \tilde{\Sigma}_{12} \\ \tilde{\Sigma}_{21} & \tilde{\Sigma}_{22} \end{pmatrix}$$

of a set of training data $x = (x_1, x_2)$. The remaining canonical correlation directions are orthogonal to $w_1$ and $w_2$, respectively. They are solutions of the generalized eigenvalue problem:

$$\begin{pmatrix} \tilde{\Sigma}_{11} & \tilde{\Sigma}_{12} \\ \tilde{\Sigma}_{21} & \tilde{\Sigma}_{22} \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} = (1 + \rho) \begin{pmatrix} \tilde{\Sigma}_{11} & 0 \\ 0 & \tilde{\Sigma}_{22} \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}$$

The standard CCA model is defined for only two random variables $x_1$ and $x_2$. Bach and Jordan (2003) generalize CCA to $m$ random variables. The generalized eigenvalue problem to solve is defined as follows:

$$\begin{pmatrix} \tilde{\Sigma}_{11} & \cdots & \tilde{\Sigma}_{1m} \\ \vdots & & \vdots \\ \tilde{\Sigma}_{m1} & \cdots & \tilde{\Sigma}_{mm} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_m \end{pmatrix} = \lambda \begin{pmatrix} \tilde{\Sigma}_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \tilde{\Sigma}_{mm} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_m \end{pmatrix}$$

where the matrix on the left-hand side denotes the empirical covariance matrix of a set of training data $x = (x_1, \ldots, x_m)$.
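As an illustration, the two-view problem above can be posed directly as a generalized symmetric eigenvalue problem and solved with standard numerical routines. The sketch below is a minimal version under that reading, not the authors' implementation; the function name linear_cca, the small ridge added to keep the right-hand side positive definite and the final concatenation of the projected views are assumptions of the example (the default of 150 components mirrors the experimental set-up of Section 5.1).

```python
import numpy as np
from scipy.linalg import eigh

def linear_cca(X1, X2, n_components=150, ridge=1e-6):
    """Two-view CCA solved as the generalized eigenvalue problem above.

    X1 : (N, d1) and X2 : (N, d2) zero-mean views of the same N samples.
    Returns projection matrices W1 (d1, k) and W2 (d2, k) sorted by
    decreasing canonical correlation.
    """
    d1 = X1.shape[1]
    S = np.cov(np.hstack([X1, X2]), rowvar=False)   # empirical covariance of (x1, x2)
    d = S.shape[0]

    A = S                                            # full covariance (left-hand side)
    B = S.copy()                                     # block-diagonal right-hand side
    B[:d1, d1:] = 0.0
    B[d1:, :d1] = 0.0
    B += ridge * np.eye(d)                           # keep B positive definite (assumption)

    # eigh solves A v = lambda B v with eigenvalues in ascending order;
    # lambda = 1 + rho, so the last eigenvectors carry the highest correlations.
    vals, vecs = eigh(A, B)
    order = np.argsort(vals)[::-1][:n_components]
    W = vecs[:, order]
    return W[:d1], W[d1:]

# Usage sketch: project each (centred) view and stack the projections as the fused feature.
# W1, W2 = linear_cca(X1 - X1.mean(0), X2 - X2.mean(0))
# Z = np.hstack([X1 @ W1, X2 @ W2])
```

The kernel variant described next follows the same pattern, with Gram matrices taking the place of the covariance blocks.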
3.3. Kernel canonical correlation analysis

The main limitation of the CCA model is given by the linearity of the projections obtained. Kernel methods (Burges, 1999) provide a procedure to transform linear algorithms based on inner products of the input data into non-linear algorithms, mapping the input data $x$ to a high-dimensional feature space $\phi(x)$:

$$\phi : x = (x_1, \ldots, x_n) \rightarrow \phi(x) = (\phi_1(x), \ldots, \phi_N(x)) \quad (n < N)$$

The linear algorithm (CCA in this case) is applied in the transformed feature space. The mapping from the input to the feature space is not explicitly made. Instead, inner products performed by the linear algorithm in the input space are replaced by inner products in the feature space. Inner products in the feature space are computed by means of kernel functions. A kernel is a function $K$ such that for all $x, z \in X$,

$$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle \qquad (3)$$

In CCA, data are introduced into the algorithm through the empirical covariance matrix $\tilde{\Sigma}$. If the input data $X = [X_1 X_2]$ is centred, the minors of $\tilde{\Sigma}$ are computed as follows:

$$\tilde{\Sigma}_{11} = X_1^T X_1 \qquad (4)$$
$$\tilde{\Sigma}_{12} = X_1^T X_2 \qquad (5)$$

The projection directions $w_1$, $w_2$ can be rewritten as projections of the data onto directions $\alpha_1$ and $\alpha_2$:

$$w_1 = X_1^T \alpha_1 \qquad (6)$$
$$w_2 = X_2^T \alpha_2 \qquad (7)$$

Equation (2) can then be rewritten as

$$\rho = \max_{\alpha_1, \alpha_2} \frac{\alpha_1^T X_1 X_1^T X_2 X_2^T \alpha_2}{\sqrt{\alpha_1^T X_1 X_1^T X_1 X_1^T \alpha_1 \; \cdot \; \alpha_2^T X_2 X_2^T X_2 X_2^T \alpha_2}} \qquad (8)$$

Let $K_1 = X_1 X_1^T$ and $K_2 = X_2 X_2^T$ be the Gram matrices computed from the input data. Substituting into equation (8),

$$\rho = \max_{\alpha_1, \alpha_2} \frac{\alpha_1^T K_1 K_2 \alpha_2}{\sqrt{\alpha_1^T K_1^2 \alpha_1 \; \cdot \; \alpha_2^T K_2^2 \alpha_2}} \qquad (9)$$

With this transform, the input data are now introduced into the algorithm through the Gram matrices $K_1$ and $K_2$ instead of the empirical covariance matrix $\tilde{\Sigma}$. If the Gram matrices are obtained using a non-linear kernel function, the CCA algorithm is now non-linear. The generalized eigenproblem to solve in order to obtain the projection directions is

$$\begin{pmatrix} 0 & K_1 K_2 \\ K_2 K_1 & 0 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \rho \begin{pmatrix} K_1^2 & 0 \\ 0 & K_2^2 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}$$

This eigenproblem has a trivial solution where all the values of $\rho$ are equal to one. That solution is not useful, and it can be avoided with a regularized version of the problem (Bach & Jordan, 2003):

$$\begin{pmatrix} 0 & K_1 K_2 \\ K_2 K_1 & 0 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} = \rho \begin{pmatrix} \left( K_1 + \frac{N\kappa}{2} I \right)^2 & 0 \\ 0 & \left( K_2 + \frac{N\kappa}{2} I \right)^2 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}$$

where $N$ is the number of samples in the training set, $\kappa$ is a small regularization constant and $I$ is the $N \times N$ identity matrix. Finally, the generalization of KCCA to $m$ input variables is given by

$$\begin{pmatrix} \left( K_1 + \frac{N\kappa}{2} I \right)^2 & K_1 K_2 & \cdots & K_1 K_m \\ K_2 K_1 & \left( K_2 + \frac{N\kappa}{2} I \right)^2 & \cdots & K_2 K_m \\ \vdots & \vdots & \ddots & \vdots \\ K_m K_1 & K_m K_2 & \cdots & \left( K_m + \frac{N\kappa}{2} I \right)^2 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_m \end{pmatrix} = \lambda \begin{pmatrix} \left( K_1 + \frac{N\kappa}{2} I \right)^2 & 0 & \cdots & 0 \\ 0 & \left( K_2 + \frac{N\kappa}{2} I \right)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \left( K_m + \frac{N\kappa}{2} I \right)^2 \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_m \end{pmatrix}$$

where $K_m$ corresponds to the Gram matrix generated from the samples of the input variable $m$. The kernel function employed in this paper is the Gaussian radial basis kernel:

$$K(x, z) = e^{-\frac{1}{\sigma^2} \| x - z \|} \qquad (10)$$

where $\sigma$ is a parameter controlling the bandwidth of the Gaussian.

4. Sparse sequence classification

This section presents the sequence classification algorithm employed to test the performance of the feature fusion method introduced in the previous section. A sparse version of the HCRF model is developed to select relevant features from the result of the feature fusion algorithm. The HCRF model is introduced first, to later present the proposed sparse extension and the optimization algorithm employed to recover the optimal model parameters.

4.1. Hidden conditional random fields

The HCRF (Quattoni et al., 2007) extends the CRF (Lafferty et al., 2001) by introducing hidden state variables into the model. An HCRF is an undirected graphical model composed of three different sets of nodes, as shown in Figure 1. The node $y$ represents the class label for an input sequence. $X = x_1, \ldots, x_t$ is the set of nodes corresponding to the temporal observations in the input sequence. $H = h_1, \ldots, h_t$ is the set of hidden variables modelling the relationship between the observations $x_i$ and the class label $y$ and the temporal evolution of the sequence.

[Figure 1: Graphical model representation of the hidden conditional random field.]

The conditional probability of a sequence label $y$ and a set of hidden part assignments $h$ given a sequence of observations $x$ is defined using the Hammersley-Clifford theorem of Markov random fields:

$$P(y, h \mid x; \theta) = \frac{e^{\Psi(y, h, x; \theta)}}{\sum_{y'} \sum_{h} e^{\Psi(y', h, x; \theta)}} \qquad (11)$$

where $\theta$ is the vector of model parameters. HCRFs belong to the general class of log-linear models. The conditional probability of the class label $y$ given the observation sequence $x$ is obtained by marginalizing over all the possible value assignments to the hidden parts $h$:

$$P(y \mid x; \theta) = \frac{\sum_{h} e^{\Psi(y, h, x; \theta)}}{\sum_{y'} \sum_{h} e^{\Psi(y', h, x; \theta)}} \qquad (12)$$

The potential $\Psi(y, h, x; \theta)$ is a linear function of the input variables:

$$\Psi(y, h, x; \theta) = \sum_{i} f(x_i) \cdot \theta(h_i) + \sum_{i} \theta(y, h_i) + \sum_{(j,k) \in E} \theta(y, h_j, h_k) \qquad (13)$$

The first term, parameterized by $\theta(h_i)$, measures the compatibility of the observation at instant $x_i$ with the assignment to the hidden variable $h_i$. The second term measures the compatibility of the hidden part $h_i$ with the class label and is parameterized by $\theta(y, h_i)$.
Finally, the third term models sequence dynamics, measuring the compatibility of adjacent hidden parts $h_i$ and $h_j$ with the class $y$.

Values for the model parameters $\theta$ are estimated from training samples $\{x_i, y_i\}$ to maximize the L2-regularized conditional likelihood function of the model:

$$L(\theta) = \sum_{i=1}^{n} \log P(y_i \mid x_i; \theta) + \frac{1}{2\sigma^2} \|\theta\|_2^2 \qquad (14)$$

where the parameter $\sigma$ controls the amount of penalization induced by the L2 norm of the model parameters. Different convex optimization techniques have been proposed to find the optimal parameters $\theta^*$ maximizing the conditional likelihood function in equation (14). Among them, limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) (Liu & Nocedal, 1989) and conjugate gradient (Nocedal & Wright, 1999) are the most popular. Inference of the posterior probability distribution in equation (12), and of the auxiliary probability distributions needed for the estimation of the conditional likelihood gradient in equation (14), is made using belief propagation, as proposed in Quattoni et al. (2007).

4.2. L1-regularized hidden conditional random fields

The L2-regularized training of the HCRF in equation (14) produces solutions $\theta^*$ where all the components have a small value. This L2 regularization approach has some drawbacks from the computational learning perspective:

• There is no feature selection during training. Irrelevant features of the input sequences are given a non-zero weight because of the nature of the gradient of the L2 norm, producing model overfitting.
• No model selection is performed. Occam's razor principle of machine learning states that the best model is the one with the lowest complexity that best adapts to the training data. A consequence of not fulfilling this requirement is overfitting. Similarly to the previous case, parameters $\theta(y, h_i)$ and $\theta(y, h_j, h_k)$ corresponding to unnecessary hidden parts obtain a non-zero weight in the resulting HCRF. In practice, the right number of hidden parts is selected by means of cross-validation, requiring multiple configurations to be tested to find the best one.

A possible way to overcome these limitations is to replace the L2 penalty term in equation (14) by an L1 penalty term. This way, the objective function to minimize for parameter estimation has the form:

$$L(\theta) = \sum_{i=1}^{n} \log P(y_i \mid x_i; \theta) + \frac{1}{\sigma} \|\theta\|_1 \qquad (15)$$

The L1 norm causes components of $\theta$ to take a zero value if they are not needed. If a zero value is given to a parameter $\theta(h_i)$, then the corresponding input feature is not taken into account, producing a feature selection effect. If the zero value is given to a parameter $\theta(y, h_i)$ or $\theta(y, h_j, h_k)$, then model selection is performed, reducing the complexity of the model. Although L1 regularization has been employed to estimate the optimal parameters of log-linear models such as CRFs (Lafferty et al., 2001), this is, to the best of our knowledge, the first time that it has been employed to train HCRFs.

Unfortunately, the L1 norm has the property of non-smoothness at 0, and conventional convex optimization techniques are no longer valid to recover the optimal model parameters. Different approaches have been proposed for the optimization of L1-regularized log-linear models. A first approach is to reparameterize the model to transform the optimization problem into a smooth one (Vail et al., 2007). Model parameters are split into a pair of vectors $\theta = \theta^+ - \theta^-$, such that $\theta^+ \geq 0$ and $\theta^- \geq 0$. Standard convex optimization techniques are then applied to find the optimal parameters $\theta^*$.
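A minimal sketch of this reparameterization is shown below, assuming the negative log-likelihood and its gradient are available as black-box callables (here named nll and grad, both hypothetical); it illustrates the idea on a generic log-linear objective rather than reproducing the authors' training code.

```python
import numpy as np
from scipy.optimize import minimize

def l1_via_reparam(nll, grad, dim, reg, theta0=None):
    """Minimize nll(theta) + reg * ||theta||_1 through the smooth
    reparameterization theta = theta_plus - theta_minus, both non-negative."""

    def objective(z):
        tp, tm = z[:dim], z[dim:]
        # On the non-negative orthant, the L1 norm becomes the linear term sum(tp) + sum(tm).
        return nll(tp - tm) + reg * (tp.sum() + tm.sum())

    def objective_grad(z):
        tp, tm = z[:dim], z[dim:]
        g = grad(tp - tm)
        return np.concatenate([g + reg, -g + reg])

    z0 = (np.zeros(2 * dim) if theta0 is None else
          np.concatenate([np.clip(theta0, 0, None), np.clip(-theta0, 0, None)]))
    bounds = [(0.0, None)] * (2 * dim)            # keep both halves non-negative
    res = minimize(objective, z0, jac=objective_grad,
                   method="L-BFGS-B", bounds=bounds)
    return res.x[:dim] - res.x[dim:]              # recover theta
```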
Unfortunately, the convergence speed of this reparameterization approach is very slow, requiring many iterations until an optimal solution is found. An alternative is to perform stochastic gradient descent (Tsuruoka et al., 2009). Stochastic gradient descent updates the model parameters after presenting each sample. Although the obtained solutions are worse than those obtained using conventional methods, in practice, they are good enough to evaluate the performance of the feature fusion approaches proposed in the previous section. The algorithm performs the following updates:

$$\theta_i^{k+\frac{1}{2}} = \theta_i^{k} + \epsilon_k \frac{\partial L(j; \theta^k)}{\partial \theta_i^k} \qquad (16)$$

$$\theta_i^{k+1} = \begin{cases} \max\left(0, \; \theta_i^{k+\frac{1}{2}} - \left(u_k + q_i^{k-1}\right)\right) & \text{if } \theta_i^{k+\frac{1}{2}} > 0 \\[4pt] \min\left(0, \; \theta_i^{k+\frac{1}{2}} + \left(u_k - q_i^{k-1}\right)\right) & \text{if } \theta_i^{k+\frac{1}{2}} < 0 \end{cases} \qquad (17)$$

where $u_k$ is the maximum cumulative L1 penalty that any parameter could have received up to iteration $k$ (Tsuruoka et al., 2009) and $q_i^k$ is the total L1 penalty that $\theta_i$ has actually received up to that point:

$$q_i^k = \sum_{t=1}^{k} \left( \theta_i^{t+1} - \theta_i^{t+\frac{1}{2}} \right) \qquad (18)$$

$\epsilon_k$ controls the learning rate and is decreased at each iteration by an exponential decay:

$$\epsilon_k = \epsilon_0 \, a^{\frac{k}{N}} \qquad (19)$$

where $\epsilon_0$ and $a$ are constants and $N$ is the number of training samples.
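The sketch below organizes updates (16)-(19) for a generic parameter vector. It is an illustration rather than the trainer used in the paper: the per-sample gradient of the log-likelihood is assumed to be supplied by the caller (grad_sample), updates are applied per sample instead of per mini-batch, and the bookkeeping of the maximum cumulative penalty u follows Tsuruoka et al. (2009), since its exact form is not spelled out here. The default learning-rate constants mirror the experimental set-up of Section 5.1.

```python
import numpy as np

def sgd_cumulative_l1(grad_sample, dim, n_samples, reg,
                      eps0=0.01, decay=0.95, n_epochs=40, rng=None):
    """Stochastic gradient ascent with the cumulative L1 penalty of
    equations (16)-(19), after Tsuruoka et al. (2009).

    grad_sample(theta, j) returns the gradient of the per-sample
    log-likelihood L(j; theta); reg is the 1/sigma penalty weight.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(dim)
    q = np.zeros(dim)   # cumulative penalty actually applied to each theta_i
    u = 0.0             # maximum cumulative penalty available so far (assumed scaling)
    k = 0
    for _ in range(n_epochs):
        for j in rng.permutation(n_samples):
            eps_k = eps0 * decay ** (k / n_samples)        # equation (19)
            u += eps_k * reg / n_samples
            half = theta + eps_k * grad_sample(theta, j)   # equation (16)

            pos, neg = half > 0, half < 0
            new = np.zeros_like(theta)
            new[pos] = np.maximum(0.0, half[pos] - (u + q[pos]))   # equation (17)
            new[neg] = np.minimum(0.0, half[neg] + (u - q[neg]))

            q += new - half                                 # equation (18)
            theta = new
            k += 1
    return theta
```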
5. Experiments

5.1. Experimental set-up

The algorithms presented in the previous sections are evaluated in the prediction of the IXMAS data set (Weinland et al., 2006). It contains 11 actions performed by 10 different actors at least three times each. The actions are recorded from five different viewpoints. The actions are (1) check watch; (2) cross arms; (3) scratch head; (4) sit down; (5) get up; (6) turn around; (7) walk; (8) wave; (9) punch; (10) kick; and (11) pick up. A sample frame of the data set observed from the five viewpoints is presented in Figure 2.

[Figure 2: The kick action in the IXMAS data set from the five available views. (a) Camera 1; (b) Camera 2; (c) Camera 3; (d) Camera 4; (e) Camera 5.]

The protocol to evaluate system performance is leave-one-actor-out cross-validation. The data set is divided into 10 different sets according to the actor performing each sample. The system is trained using all sets except one used for testing. The procedure is repeated until every actor has been used for testing.

The CCA and KCCA are configured to provide 150 projection directions corresponding to the 150 highest canonical correlations. It should be emphasized again that a high value is given to this parameter because the SHCRF will select the relevant projections during training. KCCA is employed with a radial basis kernel of width σ = 0.5. The SHCRF has been set up with |H| = 22 hidden parts. A value σ = 0.2 is given to the sparsity penalty term. The training algorithm is run for 40 iterations with a batch size of 20 samples. The exponential decay of the learning rate ε is configured with an initial value ε0 = 0.01 and a decay rate a = 0.95.

A Monte Carlo scheme is employed to evaluate the action prediction performance, as the objective function employed to train the SHCRF is non-convex, producing different solutions depending on the initial value θ0. The accuracy obtained for each test set is averaged over 30 different runs of the SHCRF training algorithm to obtain a reliable estimate of the algorithm performance.

Finally, to measure the improvement produced by feature fusion in the predictive performance, a baseline model is employed. The descriptors computed for cameras 1 and 2 are independently projected using PCA to retain the 150 most significant dimensions. The SHCRF is evaluated on these projected sequences with the same procedure presented earlier.

5.2. Results and discussion

5.2.1. Feature fusion algorithms

To visualize the effect of the feature fusion algorithms, the first three most significant features obtained by them and by the PCA baseline are shown in Figures 3(a)-3(f). Projections have been obtained with the data of actors 2-10. It can be observed that fused features have a stronger class structure than features obtained by the PCA baselines, although the classes do not seem to be separated in any case. This is normal in action recognition domains, as different action sequences usually share common frames, their temporal evolution being the real discriminative factor for action class prediction. Another phenomenon observed in the plots of the linear models (Figures 3(a), 3(b), 3(c) and 3(e)) is that the main cause of class variation is produced by the direction and amount of movement. It can be seen that the variation of the actions sit down, get up and pick up, involving a vertical movement, seems to have a stronger structure than the others. However, in the features from the non-linear models presented in Figures 3(d) and 3(f), that behaviour is not present. Instead, class structure is apparent for most of the actions. It seems that non-linear algorithms model other hidden factors of variation distinct from the movement direction.

[Figure 3: Projections of the first three components obtained with the different dimensionality reduction algorithms. (a) Camera 1 PCA; (b) Camera 2 PCA; (c) CCA cameras 1-2; (d) KCCA cameras 1-2; (e) CCA cameras 1-2-3-4-5; (f) KCCA cameras 1-2-3-4-5. CCA, canonical correlation analysis; KCCA, kernel CCA.]

5.2.2. Action classification

Table 1 shows the accuracy obtained by the SHCRF predicting class labels for the IXMAS data set. For each evaluation fold, the worst, median and best results have been extracted. Results show that predicting action classes with fused features achieves a higher accuracy than predicting them with only one camera, although for CCA12 the worst case is worse than the classification with PCA2. Another interesting fact is that CCA and KCCA have almost the same accuracy when employing the data from the five camera viewpoints, whereas in the two-camera scenario, KCCA performs better than CCA. The feature fusion algorithms employed here extract common information in the data from each camera, so this saturation phenomenon seems reasonable. A non-linear algorithm extracts common information better than the linear algorithm, but as the number of data sources increases, the gap is reduced until they have a similar performance.

To better understand the behaviour of the presented algorithms, Figure 4 presents the box plots of accuracies obtained evaluating each actor following the Monte Carlo scheme.

[Figure 4: Results for the different evaluation actors, (a)-(j) corresponding to actors 1-10. A is the PCA from camera 1. B is the PCA from camera 2. C is the canonical correlation analysis (CCA) fusion of cameras 1 and 2. D is the kernel CCA (KCCA) fusion of cameras 1 and 2. E is the CCA fusion of the five cameras. F is the KCCA fusion of the five cameras.]

It should be noted that the feature fusion accuracy does not increase in all the cases reported. The most striking case is when evaluating actor 2 (Figure 4(b)). The disastrous performance of camera 1 makes the fusion perform worse in all the cases than when using only camera 2. The cause of this phenomenon is that the fusion algorithms assume that every data source has a similar quality, not weighting them according to some quality metric. The phenomenon also appears for other actors (Figures 4(h) and 4(i)), although there is not a big difference in the results using a single camera. That might be produced because the sources have complementary information that is not well represented in the transformed space.
Table 1: Worst, average and best case accuracy estimations of the different methods obtained using Monte Carlo leave-one-actor-out cross-validation

          PCA1    PCA2    CCA12   KCCA12  CCA     KCCA
Minimum   0.634   0.688   0.685   0.709   0.748   0.748
Median    0.754   0.778   0.799   0.814   0.850   0.850
Maximum   0.865   0.868   0.880   0.907   0.934   0.940

Finally, Table 2 compares the result of the presented proposal to others. The best case result performs as well as 3D proposals using visual hulls, whereas the average case improves on results from decision-in decision-out methods.

Table 2: Comparison of the accuracy of our method to others

Method                      Accuracy   Type
Srivastava et al. (2009b)   81.4       Decision-in decision-out
Our average                 85.00      2D feature-in 2D feature-out
Weinland et al. (2006)      93.33      2D feature-in 3D feature-out
Our best                    94.00      2D feature-in 2D feature-out
Peng et al. (2009b)         94.59      2D feature-in 3D feature-out

6. Conclusions

This work has proposed the usage of multiple-view learning as a feature fusion method for human action recognition from multiple cameras. It has been shown that the usage of the information shared by rich 2D motion descriptors computed from multiple camera viewpoints improves the predictive accuracy. The usage of an L1-regularized sequence classifier has avoided the manual choice of the number of dimensions of the projected space. Testing multiple configurations to find the best dimensionality has been avoided, saving computational time. The usefulness of the proposed system has been shown by predicting the IXMAS data set with an accuracy similar to that reported by state-of-the-art methods.

References

BACH, F.R. and M.I. JORDAN (2003) Kernel independent component analysis, Journal of Machine Learning Research, 3, 1–48.
BLACKBURN, J. and E. RIBEIRO (2007) Human motion recognition using isomap and dynamic time warping, in Proceedings of the 2nd Conference on Human Motion: Understanding, Modeling, Capture and Animation, Rio de Janeiro, Brazil: Springer-Verlag, 285–298.
BOBICK, A.F. and J.W. DAVIS (2001) The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence, 23, 257–267.
BURGES, C.J.C. (1999) Advances in Kernel Methods: Support Vector Learning, The MIT Press.
CEDRAS, C. and M. SHAH (1995) Motion-based recognition: a survey, Image and Vision Computing, 13, 129–155.
CHAARAOUI, A.A., P. CLIMENT-PÉREZ and F. FLÓREZ-REVUELTA (2012) A review on vision techniques applied to human behaviour analysis for ambient-assisted living, Expert Systems with Applications, 39(12), 10873–10888.
CHEN, D., P.C. CHOU, C.B. FOOKES and S. SRIDHARAN (2008) Multi-view human pose estimation using modified five-point skeleton model, in International Conference on Signal Processing and Communication Systems 2007, 17–19 Dec 2007, Gold Coast, Australia.
CILLA, R., M.A. PATRICIO, A. BERLANGA and J.M. MOLINA (2012) A probabilistic, discriminative and distributed system for the recognition of human actions from multiple views, Neurocomputing, 75(1), 78–87.
DASARATHY, B.V. (1997) Sensor fusion potential exploitation - innovative architectures and illustrative applications, Proceedings of the IEEE, 85, 24–38.
EFROS, A.A., A.C. BERG, G. MORI and J. MALIK (2003) Recognizing action at a distance, IEEE International Conference on Computer Vision, 2, 726–733.
FORESTI, G.L., C. MICHELONI and L. SNIDARO (2004) Event classification for automatic visual-based surveillance of parking lots, in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, vol. 3, 314–317, IEEE.
GKALELIS, N., H. KIM, A. HILTON, N. NIKOLAIDIS and I. PITAS (November 2009) The i3DPost multi-view and 3D human action/interaction database, 2009 Conference for Visual Media Production, 159–168.
HARDOON, D.R., S. SZEDMAK and J. SHAWE-TAYLOR (2004) Canonical correlation analysis: an overview with application to learning methods, Neural Computation, 16, 2639–2664.
HOLTE, M.B., B. CHAKRABORTY, J. GONZALEZ and T.B. MOESLUND (2012) A local 3D motion descriptor for multi-view human action recognition from 4D spatio-temporal interest points, Selected Topics in Signal Processing, IEEE Journal of, 6(5), 553–565.
HOLTE, M.B., T.B. MOESLUND, N. NIKOLAIDIS and I. PITAS (May 2011) 3D human action recognition for multi-view camera systems, 2011 International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission, 342–349.
IOSIFIDIS, A., A. TEFAS, N. NIKOLAIDIS and I. PITAS (2012) Multi-view human movement recognition based on fuzzy distances and linear discriminant analysis, Computer Vision and Image Understanding, 116, 347–360.
JOLLIFFE, I.T. (2002) Principal Component Analysis, Springer Verlag.
KARTHIKEYAN, S., U. GAUR, B.S. MANJUNATH and S. GRAFTON (November 2011) Probabilistic subspace-based learning of shape dynamics modes for multi-view action recognition, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 1282–1286.
LAFFERTY, J., A. MCCALLUM and F. PEREIRA (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data, in ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning, 282–289.
LAPTEV, I. (2005) On space-time interest points, International Journal of Computer Vision, 64, 107–123.
LAVEE, G., E. RIVLIN and M. RUDZSKY (2009) Understanding video events: a survey of methods for automatic interpretation of semantic occurrences in video, IEEE Transactions on Systems, Man and Cybernetics - Part C: Applications and Reviews, 39, 489–504.
LIU, D.C. and J. NOCEDAL (1989) On the limited memory BFGS method for large scale optimization, Mathematical Programming, 45, 503–528.
MÄÄTTÄ, T. and H. AGHAJAN (2010) On efficient use of multi-view data for activity recognition, 158–165.
MÄÄTTÄ, T., A. HÄRMÄ and H. AGHAJAN (2010) On efficient use of multi-view data for activity recognition, in Proceedings of the Fourth ACM/IEEE International Conference on Distributed Smart Cameras, 158–165, ACM.
NOCEDAL, J. and S.J. WRIGHT (1999) Numerical Optimization, Springer Verlag.
NOUNOU, M.N., B.R. BAKSHI, P.K. GOEL and X. SHEN (2002) Bayesian principal component analysis, Journal of Chemometrics, 16, 576–595.
PARAMESWARAN, V. and R. CHELLAPPA (2006) View invariance for human action recognition, International Journal of Computer Vision, 66, 83–101.
PEHLIVAN, S. and P. DUYGULU (2011) A new pose-based representation for recognizing actions from multiple cameras, Computer Vision and Image Understanding, 115, 140–151.
PENG, B., G. QIAN and S. RAJKO (August 2009a) View-invariant full-body gesture recognition via multilinear analysis of voxel data, 2009 Third ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC), 1–8.
PENG, B., G. QIAN and S. RAJKO (September 2009b) View-invariant full-body gesture recognition via multilinear analysis of voxel data, Third ACM/IEEE Conference on Distributed Smart Cameras.
PICCARDI, M. and O. PEREZ (2007) Hidden Markov models with kernel density estimation of emission probabilities and their use in activity recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), 1–8.
QUATTONI, A., S. WANG, L.-P. MORENCY, M. COLLINS and T. DARRELL (2007) Hidden conditional random fields, Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29, 1848–1852.
RABINER, L.R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, 77, 257–286.
RAMAGIRI, S., R. KAVI and V. KULATHUMANI (August 2011) Real-time multi-view human action recognition using a wireless camera network, 2011 Fifth ACM/IEEE International Conference on Distributed Smart Cameras, 1–6.
REN, H. and G. XU (2002) Human action recognition in smart classroom, in Automatic Face and Gesture Recognition, 2002. Proceedings of the Fifth IEEE International Conference on, 417–422, IEEE.
RIBEIRO, P.C. and J. SANTOS-VICTOR (2005) Human activity recognition from video: modeling, feature selection and classification architecture, in International Workshop on Human Activity Recognition and Modeling (HAREM).
RUDOY, D. and L. ZELNIK-MANOR (2011) Viewpoint selection for human actions, International Journal of Computer Vision, 97, 243–254.
SHEN, C., C. ZHANG and S. FELS (2007) A multi-camera surveillance system that estimates quality-of-view measurement, 2007 IEEE International Conference on Image Processing, III-193–III-196.
SRIVASTAVA, G., H. IWAKI, J. PARK and A.C. KAK (August 2009a) Distributed and lightweight multi-camera human activity classification, 2009 Third ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC), 1–8.
SRIVASTAVA, G., H. IWAKI, J. PARK and A.C. KAK (September 2009b) Distributed and lightweight multi-camera human activity classification, in Third ACM/IEEE Conference on Distributed Smart Cameras, 1–8.
TIBSHIRANI, R. (1996) Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.
TRAN, D., A. SOROKIN and D. FORSYTH (2008) Human activity recognition with metric learning, in Proceedings of the 10th European Conference on Computer Vision: Part I, Berlin Heidelberg: Springer-Verlag, 561.
TSURUOKA, Y., J. TSUJII and S. ANANIADOU (2009) Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty, in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, Association for Computational Linguistics, 477–485.
TURAGA, P., A. VEERARAGHAVAN and R. CHELLAPPA (2008) Statistical analysis on Stiefel and Grassmann manifolds with applications in computer vision, in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, 1–8, IEEE.
VAIL, D.L., J.D. LAFFERTY and M.M. VELOSO (2007) Feature selection in conditional random fields for activity recognition, in Intelligent Robots and Systems, 2007. IROS 2007. IEEE/RSJ International Conference on, 3379–3384, IEEE.
WANG, L. and D. SUTER (2008) Visual learning and recognition of sequential data manifolds with applications to human movement analysis, Computer Vision and Image Understanding, 110, 153–172.
WANG, Y., K. HUANG and T. TAN (2007) Multi-view gymnastic activity recognition with fused HMM, in Computer Vision - ACCV 2007, 667–677, Berlin Heidelberg: Springer.
WEINLAND, D., R. RONFARD and E. BOYER (2006) Free viewpoint action recognition using motion history volumes, Computer Vision and Image Understanding, 104, 249–257.
WEINLAND, D., R. RONFARD and E. BOYER (2011) A survey of vision-based methods for action representation, segmentation and recognition, Computer Vision and Image Understanding, 115, 224–241.
WU, C., A.H. KHALILI and H. AGHAJAN (2010) Multiview activity recognition in smart homes with spatio-temporal features, Proceedings of the Fourth ACM/IEEE International Conference on Distributed Smart Cameras - ICDSC'10, 142.
YAN, P., S.M. KHAN and M. SHAH (June 2008) Learning 4D action feature models for arbitrary view action recognition, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 1–7.
ZHU, F., L. SHAO and M. LIN (2012) Multi-view action recognition using local similarity random forests and sensor fusion, Pattern Recognition Letters, 34(1), 20–24.

The authors

Rodrigo Cilla

Rodrigo Cilla received his BS, MS and PhD degrees in computer science from the Universidad Carlos III de Madrid in 2007, 2008 and 2012, respectively. His research interests include, among others, human action recognition, multiple target tracking, high-dimensional data analysis, data fusion and biological image processing.

Miguel A. Patricio

Miguel A. Patricio received his BS and MS degrees in computer science and his PhD degree in artificial intelligence from the Universidad Politécnica de Madrid in 1991, 1995 and 2002, respectively. He has held an administrative position at the Computer Science Department of the Universidad Politécnica de Madrid since 1993. He is currently an associate professor at the Escuela Politécnica Superior of the Universidad Carlos III de Madrid and a research fellow of the Applied Artificial Intelligence Group (GIAA). He has carried out a number of research projects and consulting activities in the areas of automatic visual inspection systems, texture recognition, neural networks and industrial applications.

Antonio Berlanga

Antonio Berlanga received the BA degree in physics from the Universidad Autónoma de Madrid, Spain, and the PhD degree in computer engineering from the Universidad Carlos III de Madrid in 1995 and 2000, respectively. He is currently an associate professor with the Department of Computer Science, Universidad Carlos III de Madrid. His current research interests include evolutionary computation for multi-objective optimization, machine learning and data mining.

José M. Molina

José M. Molina is a full professor at the Universidad Carlos III de Madrid. He joined the Computer Science Department of the Universidad Carlos III de Madrid in 1993. Currently, he leads the Applied Artificial Intelligence Group (GIAA). His current research focuses on the application of soft computing techniques (neural networks, evolutionary computation, fuzzy logic and multi-agent systems) to radar data processing, air traffic management and e-commerce.
He is the author of more than 20 journal papers and 80 conference papers. He received his BS and PhD degrees in telecommunications engineering from the Universidad Politécnica de Madrid in 1993 and 1997, respectively.