Keybook: Unbias object recognition using keywords

Wai Lam Hoo, Chern Hong Lim, Chee Seng Chan (corresponding author)
Centre of Image and Signal Processing, Faculty of Computer Science & Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia
E-mail addresses: wailam88@siswa.um.edu.my (W.L. Hoo), ch_lim@siswa.um.edu.my (C.H. Lim), cs.chan@um.edu.my (C.S. Chan)

Expert Systems with Applications 42 (2015) 3991–3999. Available online 17 January 2015.

Keywords: Dataset bias; Object recognition; Codebook generation; Bag-of-Words model

Abstract

The presence of bias in existing object recognition datasets is now a well-known problem in the computer vision community. In this paper, we propose an improved codebook representation for the Bag-of-Words (BoW) approach by generating a Keybook. Specifically, our Keybook is composed of keywords that significantly represent the object classes, and it is extracted using the concept of mutual information. The intuition is to perform feature selection by maximizing the mutual information between the features and the object classes, while minimizing the mutual information between the features and the domains. As a result, the Keybook is not biased towards any domain and consists of keywords that are valuable across the object classes. The proposed method is tested on four public datasets to evaluate the classification performance on seen and unseen datasets. Experimental results show the effectiveness of the proposed method in undoing the dataset bias problem.

1. Introduction

Object recognition has been an active research area for decades due to the demands of many real-world applications, for example object tracking (Choi & Christensen, 2012; Serratosa, Alquézar, & Amézquita, 2012), visual surveillance (Guo, Xia, & Xiaofei, 2014; Lim, Tang, & Chan, 2014; Szpak & Tapamo, 2011), and human-robot interaction (Stasse, Foissotte, & Kheddar, 2008; Rudinac, Kootstra, Kragic, & Jonker, 2012; Bielecki, Buratowski, & Śmigielski, 2013). Despite these successful attempts, Torralba and Efros (2011) raised an important question: "how well does a typical object detector trained on one dataset generalize when tested on a representative set of other datasets, compared with its performance on the native test set?", or in other words, whether it is unbiased. Unfortunately, the extensive experiments in Torralba and Efros (2011) found various types of strong built-in bias (e.g. selection bias, capture bias, and negative set bias) in popular object recognition datasets (e.g. Caltech101, ImageNet, SUN09, LabelMe). These biases are embedded in the datasets unintentionally during the data collection stage. This situation is critical because an object representation and recognition algorithm will only work well on the specific dataset one chooses to use, and the recognition rate will be poor when another dataset is selected. We visualize this problem in Fig. 1(a).
This has led to several solutions (Duan, Xu, Tsang, & Luo, 2012; Kulis, Saenko, & Darrell, 2011; Li, Shi, Liu, Hauptmann, & Xiong, 2012; Saenko, Kulis, Fritz, & Darrell, 2010) (in this case, one dataset is treated as one domain), where the main objective is to achieve a good recognition rate across all the datasets, as depicted in Fig. 1(b). Khosla, Zhou, Malisiewicz, Efros, and Torralba (2012) proposed to learn a visual world SVM model to undo the bias factor. This approach learns the bias information for each dataset from a set of BoW representations of the different datasets. However, there is room for improvement, as the current approach is not feasible for rapid dataset augmentation in the learning process. This is because their method needs to learn a bias weight for each unseen dataset to achieve better performance; that is, the approach needs to re-learn the bias weights whenever a new set of images arrives. To cope with this, our idea is to undo the bias during the codebook generation stage, and we name this Keybook generation.

From our investigation, biases exist because of a strong preference for certain information or features that causes an inclination towards a specific dataset. For example, Fig. 2 shows a set of images of the car class from two different datasets. Dataset 1 (Fig. 2(a)) has a strong preference for cars with rectangular windows (represented by red boxes), while a sports car with a slanted front (represented by red boxes) is expected in dataset 2 (Fig. 2(b)) in order for an object to be recognized as a car. These features are not significant representations of a car, and they are biased towards the corresponding training dataset. Thus, they deteriorate the recognition performance when another dataset is used in the testing stage. A better solution is to discover features for an object class that generalize across all datasets. For example, the wheel (represented by a green box) can be a choice for the car class, as illustrated in Fig. 2.

Fig. 1. (a) The recognition accuracy of a classifier that is trained on a specific domain of car images and then tested on all other domains. Generally, a high accuracy is obtained when the training and testing images originate from the same domain. However, the accuracy drops drastically when other domains are used as the testing sets because of the dataset bias problem, as reported in Torralba and Efros (2011). (b) The main objective of this work is to generate a trained model that is able to achieve consistent accuracy when tested across all the domains.

Fig. 2. Good features vs bad features. Red boxes indicate features that are biased to a particular dataset, while green boxes indicate features that are shared across datasets. One can notice that the windows in (a) and the car front in (b) are features that are biased to the respective datasets. Car tyres, instead, are common features that exist in both datasets. This image is best viewed in color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
To achieve this, we extend the work of Khosla et al. (2012) by proposing Keybook generation, with the aim of discovering keywords from the codebooks learned from the different datasets. In general, keywords are the visual features that significantly represent their respective object classes. We utilize mutual information (MI) to compute the semantic correlation between the visual features and the object classes and datasets. According to Zitnick and Parikh (2013), MI is a good choice for discovering the subset of image-specific information that is semantically meaningful. In this context, MI is a measurement on the visual features (codewords in our case) that gives the relative degree to which they are required by object classes and datasets for semantic understanding. Then, by maximizing the MI with respect to the object classes and minimizing the MI with respect to the datasets, the bias towards the datasets can be undone and the keyword candidates are preserved to form the Keybook. Taking the example in Fig. 2, with the proposed Keybook idea, the green box features that generalize over the object class will be retained, while the red box features that are biased towards a particular dataset will be discarded.

The contributions of our work can be highlighted as follows. We propose Keybook generation to discover keywords from the conventional codebook model. The collected Keybook, which generalizes over the datasets, greatly enhances the object recognition performance. For this purpose, MI is utilized to undo the bias of the objects towards the datasets. This approach is in contrast with the state-of-the-art solution (Khosla et al., 2012) in that our proposed step is performed in the initial training stage, which allows the framework to learn the model in an unbiased manner as early as the codebook generation stage.

Fig. 3. Overall framework for our proposed Keybook generation. The Keybook is generated by preserving the significant codewords of each object class and removing codewords that show a strong preference towards a dataset. This is made possible using mutual information.

The rest of the paper is organized as follows. Section 2 covers the related work in domain adaptation. Our proposed framework is detailed in Section 3. Sections 4 and 5 discuss the experimental settings and results on seen and unseen datasets. Finally, we conclude our work in Section 6.

2. Related work

Detecting and recognizing objects is, to date, one of the most important research areas in the computer vision community. However, Khosla et al. (2012) showed that conventional object recognition methods perform poorly when cross-dataset testing is employed. Since then, several solutions based on transfer learning, semi-supervised learning, and unsupervised learning have been applied in this context.

To begin with, in transfer learning, the notable work is Prasath Elango, Tommasi, and Caputo (2012), who studied the learning of visual perceptual tasks such as recognizing a place or an object, taking advantage of what a peer system had learned beforehand. However, the work is unable to handle the case where the system is exposed to a new environment, or to deal with a similar object in a different domain, as they did not account for the bias in different domains. Instead, they only transfer the discriminative model for this problem to the second system.
Among semi-supervised approaches, one of the pioneering works is Daumé and Marcu (2006). The work modeled the domain adaptation problem as one where there are abundant labels from the source domain and scarce labels from the target domain. It then formulated the problem in terms of a simple mixture model and applied it in the context of conditional classification models and conditional linear-chain sequence labeling models. As such, inference can be solved efficiently using the conditional expectation maximization algorithm. In more advanced work, Blitzer, Crammer, Kulesza, Pereira, and Wortman (2007) proposed to explicitly model the inherent trade-off between training on a large but inaccurate source dataset and a small but accurate target training set. Their approach achieves a much lower target error compared to the standard empirical risk minimization method. Unfortunately, these approaches still require a minimum amount of labels in the target domain. This is not feasible for domain adaptation, as the target dataset and labels will not be available during the training process.

The above issue leads to unsupervised domain adaptation solutions, as such approaches do not require labeled target data. Bergamo and Torresani (2010) proposed transductive SVMs, while Bruzzone and Marconcini (2010) iteratively relabel the target domain. Though successful, these works suffer from extensive computational cost, as their approaches depend very much on tuning parameters during the SVM training stage. A resolution was proposed by Gong, Shi, Sha, and Grauman (2012), which is inspired by the idea of geodesic flows (Gopalan, Li, & Chellappa, 2011) to derive intermediate subspaces that interpolate between the source and target domains. Gong et al. (2012) also contributed by eliminating the need to tune the many parameters required in previous works, and achieved improvements in terms of computational complexity.

Closest to our work is bias learning across different domains, for instance Khosla et al. (2012). Their idea was inspired by Torralba and Efros (2011), who found that bias exists in popular object classification datasets. Khosla et al. (2012) exploited the dataset bias during the training stage and learnt two sets of weights: (1) bias vectors associated with each individual dataset, and (2) visual world weights that are common to all datasets, which are learned by undoing the associated bias from each dataset. Their approach outperforms SVM classification that does not account for the presence of the bias. However, the requirement of prior knowledge of the unseen domain in their work is impractical for real-world applications.

Although the aforementioned methods achieve good accuracy, none of them considers undoing the bias and collecting the significant features at an early stage (e.g. during the codebook generation stage). Note that one cannot always train on an augmented dataset to achieve a better recognition rate. Instead, the idea of the Keybook is useful for better classifying unseen datasets, as well as providing a more feasible solution for rapid dataset augmentation. From our study, there has been a recent surge of interest in understanding how the semantic meaning of an image depends on the presence of objects, their features, and their relations to the scene properties (Zitnick & Parikh, 2013).

Fig. 4. Keybook generation pipeline. First, objects from different datasets d are obtained. Then, we build V_d for each d. A BoW is built based on all the V_d generated. Based on the BoW, we calculate MI_c and MI_d based on the class label c and the dataset label d. Finally, by maximizing MI_c and minimizing MI_d, the Keybook V_key is generated.
The semantic importance of various scene properties and features can be measured quantitatively using mutual information (MI). Inspired by this, our work employs a similar concept to understand the importance of the presence of features with respect to class and dataset. For instance, if the MI between a feature and a class is large, it indicates that the feature provides important information about that object class. On the other hand, if the MI between a feature and a specific dataset is high, this shows that the particular feature may be biased towards that dataset. From this, one can undo the bias by maximizing the MI between features and object class labels, while minimizing the MI between features and dataset labels. With this, the significant differences between our proposed solution and the conventional solutions are, firstly, that our proposed solution does not require labels from the target domain during the training step, in contrast to the semi-supervised methods. Secondly, our proposed method undoes the dataset bias during the codebook generation stage, in contrast to the solution in Khosla et al. (2012).

3. Proposed framework

In this section, we first detail how a visual codebook is generated, as illustrated in Fig. 3. Then, we elaborate our proposed method, Keybook generation, which uses the MI computation. Finally, we explain the classification method, which is based on Khosla et al. (2012).

3.1. Codebook generation

The BoW model, which was initially employed in Natural Language Processing (NLP), has now established its contribution in the field of computer vision (Fei-Fei & Perona, 2005; Niebles, Wang, & Fei-Fei, 2008; Yang, Jin, Sukthankar, & Jurie, 2008), particularly in the area of object recognition. In general, a BoW is a sparse vector of occurrence counts of codewords, that is, a sparse histogram over the vocabulary. In NLP, the vocabulary is a database of words such as the popular WordNet (Miller, 1995), which covers almost every word available and is accessible online. In computer vision, however, we do not yet have a vocabulary of all the objects in the world, and the best representation of an object is still being discovered. Codebook generation, a vital step in the BoW approach, is therefore still a milestone towards achieving that.

To begin the codebook generation, we represent an image as I = \{x_i, y_c, y_d\}, where the x_i are the image features extracted using a feature extraction algorithm (for instance SIFT, DSIFT, HOG, etc.), and each x_i is associated with a class label y_c and a domain label y_d. A codebook V = \{w_1, \ldots, w_k\} is obtained by finding the cluster centers w of the features x_i extracted from the training images I_train. Conventionally, we build V_d from the training set of domain d. Then, each I is represented by the BoW model based on the obtained V via a quantization process:

BoW(w) = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} 1, & \text{if } w = \arg\min_{w \in V} D(w, x_i) \\ 0, & \text{otherwise} \end{cases}    (1)

where n is the number of features in I, and D(w, x_i) is the distance between codeword w and feature x_i.
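To make the codebook construction and the quantization of Eq. (1) concrete, the following is a minimal sketch in Python. It assumes the local descriptors (e.g. dense SIFT) have already been extracted; the function names, the use of k-means from scikit-learn, and the Euclidean distance for D are our own illustrative choices and not the authors' released implementation.

# Minimal sketch of codebook generation and hard-assignment BoW (Eq. (1)).
# Descriptor extraction is assumed done; names and the k-means choice are illustrative.
import numpy as np
from sklearn.cluster import KMeans


def build_codebook(descriptors, k=256, seed=0):
    """Cluster training descriptors (num_features x dim) into k codewords."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(descriptors)
    return km.cluster_centers_                      # V = {w_1, ..., w_k}


def bow_histogram(image_descriptors, codebook):
    """Eq. (1): assign each feature x_i to its nearest codeword and normalize by n."""
    # Euclidean distance D(w, x_i) between every feature and every codeword.
    dists = np.linalg.norm(
        image_descriptors[:, None, :] - codebook[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)                  # arg min over w in V
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / max(len(image_descriptors), 1)    # 1/n normalization


# Usage with random stand-in descriptors:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_descs = rng.normal(size=(5000, 128))      # e.g. SIFT descriptors from I_train
    V = build_codebook(train_descs, k=256)
    img_descs = rng.normal(size=(300, 128))         # descriptors of one image I
    bow = bow_histogram(img_descs, V)
    print(bow.shape, bow.sum())                     # (256,) and approximately 1.0

In practice one codebook V_d would be built per training domain d, exactly as described above, before the Keybook selection of the next subsection.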
Table 1. Experiments on seen datasets (i.e. the train set of the dataset used for testing is available during training), where all four datasets are used in training and only one of the datasets is used for testing at a time. PAS, LAB, CAL, SUN refer to the four datasets Pascal VOC2007, LabelMe, Caltech101 and SUN09, respectively. We compare both results (i.e. without V_key and with V_key), where the better results are in bold. Overall, our proposed method (i.e. with V_key) outperforms in 19 out of 25 cases. The underlined results indicate that these implementations are similar to a conventional classification implementation (i.e. train and test on the same dataset).

Without V_key
Train    Test    W_PAS    W_LAB    W_CAL    W_SUN    W_vw
All      PAS     16.89    27.19    33.40    28.19    25.70
All      LAB     51.01    67.82    65.68    69.13    68.24
All      CAL      5.75    37.86    96.68    25.92    59.60
All      SUN     16.90    25.46    27.68    50.97    40.31
Average          22.63    39.68    55.86    43.55    48.46

With V_key
Train    Test    W_PAS    W_LAB    W_CAL    W_SUN    W_vw
All      PAS     17.36    27.46    32.67    27.96    25.81
All      LAB     52.53    67.96    65.16    68.72    67.64
All      CAL      8.37    41.96    96.99    26.81    62.59
All      SUN     18.70    26.58    28.53    51.43    41.76
Average          24.24    40.99    55.84    43.73    49.45

3.2. Keybook generation

Unfortunately, conventional codebook generation induces the domain bias problem, as illustrated in Fig. 2. This motivates us to propose the Keybook, which consists of keywords that are significant representations of the object classes. The principle of the Keybook is to discover the underlying unique features of an object class that are shared across all domains, while rejecting those features that cause bias in each specific domain. In order to achieve that, we obtain the Keybook V_key from a set of V_d from different d by utilizing the computation of MI. In general, MI measures the correlation between two random variables A and B:

MI(A, B) = \sum_{a \in A} \sum_{b \in B} p(a, b) \log\left(\frac{p(a, b)}{p(a) p(b)}\right).    (2)

In our paper, A is the BoW generated from a set of V_d, and B is a label, either the class label c or the dataset label d. In the proposed framework, V_key contains the image features that are shared among the classes c across the datasets d, and discards those image features that are biased to d. To build V_key, we choose keywords w^* \in V_key discovered from all the V_d (denoted V_all) by selecting the w that maximizes the mutual information with respect to the object classes, MI_c, and minimizes the mutual information with respect to the domains, MI_d:

w^* = \arg\max_{w \in V_all} (MI_c - MI_d),    (3)

where MI_c = MI(BoW(w), y_c) and MI_d = MI(BoW(w), y_d). Then, we build a new BoW representation based on V_key and learn a classifier from these BoW for the object recognition task. Fig. 4 shows the step-by-step process of how we obtain the Keybook.

Fig. 5. Sample images of five different classes in Caltech101, LabelMe, SUN09 and Pascal VOC07.
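As a concrete illustration of Eqs. (2) and (3), the sketch below selects keywords from a pooled codebook given per-image BoW histograms, class labels and dataset labels. Since Eq. (2) is defined over discrete variables, this sketch binarizes each codeword's BoW response (present or absent in an image) before computing MI; this binarization, the helper names, and the fixed Keybook size are our own assumptions for illustration, not necessarily the authors' exact implementation.

# Illustrative sketch of Keybook selection (Eqs. (2)-(3)). Binarizing BoW(w)
# and the helper names are assumptions, not the authors' exact procedure.
import numpy as np


def mutual_information(x, y):
    """Discrete MI(X, Y) from the empirical joint distribution (Eq. (2))."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))
            if p_ab > 0:
                p_a, p_b = np.mean(x == a), np.mean(y == b)
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi


def select_keybook(bow, class_labels, dataset_labels, n_keywords=256):
    """Keep the codewords w that maximize MI_c - MI_d (Eq. (3)).

    bow: (num_images, num_codewords) BoW histograms built on the pooled V_all.
    Returns the indices of the selected keywords, i.e. the Keybook V_key.
    """
    scores = np.empty(bow.shape[1])
    for j in range(bow.shape[1]):
        presence = (bow[:, j] > 0).astype(int)           # binarized BoW(w_j) per image
        mi_c = mutual_information(presence, class_labels)    # MI_c = MI(BoW(w), y_c)
        mi_d = mutual_information(presence, dataset_labels)  # MI_d = MI(BoW(w), y_d)
        scores[j] = mi_c - mi_d
    return np.argsort(scores)[::-1][:n_keywords]         # top-scoring codewords


# Usage with random stand-in data: 4 domains, 5 classes, 1024 pooled codewords.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bow = rng.random((400, 1024))
    y_c = rng.integers(0, 5, size=400)
    y_d = rng.integers(0, 4, size=400)
    keybook_idx = select_keybook(bow, y_c, y_d, n_keywords=256)
    reduced_bow = bow[:, keybook_idx]                    # BoW re-encoded on V_key

The reduced BoW representation built on V_key is what is passed to the classification stage described next.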
3.3. Classification

Since our proposed framework is similar to the conventional codebook generation method, we adopt the visual world SVM model proposed in Khosla et al. (2012) for the classification stage. The visual world weight w_vw and the bias vectors M_d are taken into account and learned as follows for the different domains d:

\min_{w_{vw}, M_d, \epsilon, \phi} \; \frac{1}{2}\|w_{vw}\|^2 + \frac{\lambda}{2}\sum_{d=1}^{n}\|M_d\|^2 + C_1 \sum_{d=1}^{n}\sum_{j=1}^{s_d}\epsilon_j^d + C_2 \sum_{d=1}^{n}\sum_{j=1}^{s_d}\phi_j^d    (4)

subject to

w_d = w_{vw} + M_d    (5)

y_j^i \, (w_{vw} \cdot x_j^i) \geq 1 - \epsilon_j^i, \quad i = 1 \ldots n, \; j = 1 \ldots s_d    (6)

y_j^i \, (w_i \cdot x_j^i) \geq 1 - \phi_j^i, \quad i = 1 \ldots n, \; j = 1 \ldots s_d    (7)

\epsilon_j^i \geq 0, \; \phi_j^i \geq 0, \quad i = 1 \ldots n, \; j = 1 \ldots s_d    (8)

Eqs. (5)-(8) define the constraints of the visual world SVM model, while C_1 and C_2 are hyperparameters employed to balance the learning terms in the objective function, i.e. to control the relative importance of the constraints. \lambda weighs the independent weights of the different datasets against the common weight shared by all datasets. \epsilon represents the losses across datasets when using the visual world weights w_vw, which need to be minimized because w_vw is expected to generalize across all datasets. \phi represents the losses when images are wrongly classified by the biased weights. Further details can be found in Khosla et al. (2012). We summarize the proposed framework in Algorithm 1.

Algorithm 1. Unbiased framework by learning the Keybook (V_key) and the visual world SVM model
Require: A set of training images with {x_i, y_c, y_d}.
Ensure: All parameters are set: the total number of codewords for each dataset codebook V_d, and the total number of keywords for the Keybook V_key.
1. Learn V_d for each dataset d using Eq. (1).
2. Find w^* for V_key using mutual information (MI) as described in Eqs. (2) and (3).
3. Learn the visual world SVM model using V_key by optimizing Eq. (4) subject to the constraints in Eqs. (5)-(8).

4. Experiments

We tested our proposed framework on four different datasets that share common object classes but have huge variations in terms of orientation, viewpoint and environment, namely: Caltech101 (CAL) (Fei-Fei, Fergus, & Perona, 2004), LabelMe (LAB) (Russell, Torralba, Murphy, & Freeman, 2008), SUN09 (SUN) (Choi, Lim, Torralba, & Willsky, 2010), and Pascal VOC07 (VOC) (Everingham, Van Gool, Williams, Winn & Zisserman). The five object classes are 'bird', 'car', 'chair', 'dog', and 'person', and some example images are shown in Fig. 5.

Fig. 6. The graphs show the improvement in percent average precision (%AP) of classification on unseen datasets using the proposed method over the baseline approach (Khosla et al. (2012)) on the 4 datasets. The labels on the x-axis 'P', 'L', 'C', and 'S' represent the datasets Pascal VOC2007, LabelMe, Caltech101 and SUN09 respectively, while 'M' represents the mean AP (mAP) increment over all datasets. The y-axis is the %AP increment over the baseline Khosla et al. (2012). Overall, our algorithm outperforms the baseline in 20 out of 25 cases, with an overall improvement of 5% mAP.

4.1. Implementation details

We employed the same experimental settings as Khosla et al. (2012), except that we sample SIFT features on all training sets using a patch size of 8 and a step size of 8 for simplicity and efficiency. We build V_d and V_key to consist of 256 codewords and keywords, respectively, to ensure a fair comparison. Then, we use LLC coding (Wang et al., 2010) with a 3-level spatial pyramid to pool the SIFT features based on V_key. The evaluations are performed on both seen and unseen datasets.

4.2. Testing on seen datasets

Table 1 shows the experimental results when we train the model using training data from all four datasets and test on one dataset at a time. Since the training data contain images from the same dataset as the testing data, we consider this a seen-dataset classification task. Table 1 (top: without V_key) shows the baseline approach based on the visual world SVM model (Khosla et al., 2012) without the V_key implementation, while Table 1 (bottom: with V_key) shows our proposed method that combines the benefits of V_key and the visual world SVM model.

In general, the visual world SVM model that uses V_key shows around 1% mean average precision (mAP) improvement compared to the model that uses the conventional codebook. This shows that V_key enhances the model for the cross-domain classification task. This is intuitively possible as the Keybook attempts to find a more general representation of the objects of the same classes, and therefore discards visual codewords that are biased to a particular dataset.
Note that the usage of a simpler feature space for the feature representation causes some variation from the results published in Khosla et al. (2012). However, the comparison remains fair as both methods use the same features during testing. Besides that, the underlined results indicate results obtained using the conventional SVM model, where w_d is used to classify d instead of w_vw. Comparing both results using the conventional SVM model, our proposed method achieves a 0.35 mAP improvement over all four datasets compared to the method without V_key.

Fig. 7. Qualitative analysis of the experiments that classify unseen datasets. Blue regions indicate correct classifications, and red indicates false classifications. For (a) and (b), first row: Pascal VOC07, second row: SUN09, third row: LabelMe, fourth row: Caltech101. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4.3. Testing on unseen datasets

We tested our proposed method on unseen datasets, meaning that in each testing iteration we select only three of the four datasets for training and perform testing on the dataset that was left out. For example, if we train on the Caltech101, LabelMe and SUN09 datasets, then we test on the Pascal VOC07 dataset. Here, the proposed method employs w_vw during the classification stage.

Fig. 6 shows that the average precision (AP) increases for most unseen datasets, except for Caltech101. This is because, as reported in Khosla et al. (2012), Caltech101 already has the best AP improvement, while the Keybook generation yields a more generalized model across the datasets. This generalized model might have slightly affected the performance on Caltech101. This also indicates a bottleneck of the unbiasing algorithm: further improving the classification result would be hard, although comparable results can be maintained. In spite of that, we achieve a significant improvement of over 5% mAP in the overall performance.

5. Discussion

Fig. 7 shows testing images that are correctly classified and misclassified by our proposed method on unseen datasets. From our visual inspection, one can notice that the correctly classified images are those cars whose wheels are visible. For the misclassified images, the car wheels are either occluded or missing. This reflects our claim in the proposed method: the car wheels could be among the keywords found in V_key, and we gain improved classification results for the car class when the car wheels are visible. In contrast, our model works poorly if the keywords (e.g. wheels) are occluded or missing. Besides, with the help of V_key, the proposed method can correct images that were wrongly classified by the system without V_key. Note that this observation is somehow not applicable to the Caltech101 samples. This may be because the Caltech101 samples possess a strong capture bias problem, as indicated in Torralba and Efros (2011).
6. Conclusion

Domain adaptation has risen as an important issue in object recognition research, and the recent trend towards a solution has focused on the study of domain bias. Failing to handle the domain bias problem will deteriorate the overall object recognition system performance, especially when the trained model is used in different, unseen domains. This is because the trained model is tailored to learn a specific domain. In this paper, we proposed a framework that undoes the domain bias by finding the most significant features of an object class, which generalize across all the domains. Specifically, we employed the mutual information (MI) computation to analyse the correlation between the visual features and the object classes and datasets, and generated the Keybook by keyword selection. The proposed approach was tested on four publicly available datasets and achieved better performance in both the seen and unseen dataset evaluations.

Although our proposed method achieves better overall performance, it still has several limitations. First, the recognition depends very much on the significant features of the object in an image, so an occlusion of a significant feature will deteriorate the overall performance. In addition, deciding the number of codewords for the Keybook remains a key issue that determines the performance of the final classification.

The proposed framework will be very useful in applications such as visual surveillance, ambient assisted living, and robotic vision that require object recognition. Such systems, which are employed to assist or protect humans in their daily lives, are required to perform consistently in different environments. With our proposed solution, since the significant features of the object class are retained, the system is able to cope with environment changes.

There are a few potential future works to enhance the current approach. First, the discriminative characteristic (i.e. hard assignment) of the current work can be relaxed by integrating a soft-assignment codebook representation (Hoo, Kim, Pei, & Chan, 2014). Secondly, alternatives to MI that might be more useful for the Keybook during the generation of the dataset codebooks could be investigated. Other possible research directions are performing the unbiasing attempt at the feature level (i.e. when features are extracted from images) and at the dataset collection level (i.e. how researchers could collect bias-free datasets). Finally, the work could be extended to the space-time domain, such as human motion analysis (Lim, Vats, & Chan, in press), or to fine-grained classification problems (Yao, Khosla, & Fei-Fei, 2011).

Acknowledgments

This research is supported by the High Impact Research MoE Grant UM.C/625/1/HIR/MoE/FCSIT/08, H-22001-00-B0008 from the Ministry of Education Malaysia.

References

Bergamo, A., & Torresani, L. (2010). Exploiting weakly-labeled web images to improve object classification: A domain adaptation approach. In Advances in neural information processing systems (NIPS) (pp. 181–189).
Bielecki, A., Buratowski, T., & Śmigielski, P. (2013). Recognition of two-dimensional representation of urban environment for autonomous flying agents. Expert Systems with Applications, 40, 3623–3633.
Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Wortman, J. (2007). Learning bounds for domain adaptation. In Advances in neural information processing systems (NIPS) (Vol. 20, pp. 129–136).
Bruzzone, L., & Marconcini, M. (2010). Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 770–787.
Choi, C., & Christensen, H. I. (2012). 3D textureless object detection and tracking: An edge-based approach. In IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 3877–3884).
Choi, M. J., Lim, J. J., Torralba, A., & Willsky, A. S. (2010). Exploiting hierarchical context on a large database of object categories. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 129–136).
Daumé, H., III, & Marcu, D. (2006). Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26, 101–126.
Duan, L., Xu, D., Tsang, I. W.-H., & Luo, J. (2012). Visual event recognition in videos by learning from web data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34, 1667–1680.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. The PASCAL visual object classes challenge 2007 (VOC2007) results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
Fei-Fei, L., & Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In IEEE conference on computer vision and pattern recognition (CVPR) (Vol. 2, pp. 524–531).
Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In IEEE computer vision and pattern recognition workshop (CVPRW) (pp. 178–178).
Gong, B., Shi, Y., Sha, F., & Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2066–2073).
Gopalan, R., Li, R., & Chellappa, R. (2011). Domain adaptation for object recognition: An unsupervised approach. In IEEE international conference on computer vision (ICCV) (pp. 999–1006).
Guo, W., Xia, X., & Xiaofei, W. (2014). A remote sensing ship recognition method based on dynamic probability generative model. Expert Systems with Applications, 41, 6446–6458.
Hoo, W. L., Kim, T., Pei, Y., & Chan, C. S. (2014). Enhanced random forest with image/patch-level learning for image understanding. In 22nd international conference on pattern recognition (ICPR) (pp. 3434–3439).
Khosla, A., Zhou, T., Malisiewicz, T., Efros, A., & Torralba, A. (2012). Undoing the damage of dataset bias. In European conference on computer vision (ECCV) (pp. 158–171).
Kulis, B., Saenko, K., & Darrell, T. (2011). What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1785–1792).
Lim, C. H., Vats, E., & Chan, C. S. (in press). Fuzzy human motion analysis: A review. Pattern Recognition. http://dx.doi.org/10.1016/j.patcog.2014.11.016.
Lim, M. K., Tang, S., & Chan, C. S. (2014). iSurveillance: Intelligent framework for multiple events detection in surveillance videos. Expert Systems with Applications, 41, 4704–4715.
Li, H., Shi, Y., Liu, Y., Hauptmann, A. G., & Xiong, Z. (2012). Cross-domain video concept detection: A joint discriminative and generative active learning approach. Expert Systems with Applications, 39, 12220–12228.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38, 39–41.
Niebles, J. C., Wang, H., & Fei-Fei, L. (2008). Unsupervised learning of human action categories using spatial–temporal words. International Journal of Computer Vision, 79, 299–318.
Prasath Elango, S., Tommasi, T., & Caputo, B. (2012). Transfer learning of visual concepts across robots: A discriminative approach. Idiap Research Report Idiap-RR-06-2012, Idiap.
Rudinac, M., Kootstra, G., Kragic, D., & Jonker, P. P. (2012). Learning and recognition of objects inspired by early cognition. In IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 4177–4184).
Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77, 157–173.
Saenko, K., Kulis, B., Fritz, M., & Darrell, T. (2010). Adapting visual category models to new domains. In European conference on computer vision (ECCV) (pp. 213–226).
Serratosa, F., Alquézar, R., & Amézquita, N. (2012). A probabilistic integrated object recognition and tracking framework. Expert Systems with Applications, 39, 7302–7318.
Stasse, O., Foissotte, T., & Kheddar, A. (2008). Treasure hunting for humanoids robot. In IEEE RAS/RSJ international conference on humanoids robots, workshop on cognitive humanoid vision.
Szpak, Z. L., & Tapamo, J. R. (2011). Maritime surveillance: Tracking ships inside a dynamic background using a fast level-set. Expert Systems with Applications, 38, 6669–6680.
Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1521–1528).
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., & Gong, Y. (2010). Locality-constrained linear coding for image classification. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3360–3367).
Yang, L., Jin, R., Sukthankar, R., & Jurie, F. (2008). Unifying discriminative visual codebook generation with classifier training for object category recognition. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1–8).
Yao, B., Khosla, A., & Fei-Fei, L. (2011). Combining randomization and discrimination for fine-grained image categorization. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1577–1584).
Zitnick, C. L., & Parikh, D. (2013). Bringing semantics into focus using visual abstraction. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3009–3016).