A knowledge-based multi-layered image annotation system

Marina Ivasic-Kos a,∗, Ivo Ipsic a, Slobodan Ribaric b

a Department of Informatics, University of Rijeka, Rijeka, Croatia
b Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia

Expert Systems With Applications 42 (2015) 9539–9553

∗ Corresponding author. Tel.: +38551584710. E-mail addresses: marinai@uniri.hr (M. Ivasic-Kos), ivoi@uniri.hr (I. Ipsic), slobodan@zemris.fer.hr (S. Ribaric).

Keywords: Image annotation; Multi-layered image annotation; Knowledge representation; Fuzzy Petri Net; Fuzzy inference engine

Abstract

A major challenge in automatic image annotation is bridging the semantic gap between the computable low-level image features and the human-like interpretation of images. The interpretation includes concepts on different levels of abstraction that cannot simply be mapped to features but require additional reasoning with general and domain-specific knowledge. The problem is even more complex since knowledge in the context of image interpretation is often incomplete, imprecise, uncertain and ambiguous in nature. Thus, in this paper we propose a fuzzy-knowledge based intelligent system for image annotation, which is able to deal with uncertain and ambiguous knowledge and can annotate images with concepts on different levels of abstraction, which makes the annotation more human-like. The main contributions are associated with an original approach of using a fuzzy knowledge-representation scheme based on the Fuzzy Petri Net (KRFPN) formalism. The acquisition of knowledge is facilitated in such a way that, besides the general knowledge provided by the expert, the computable facts and rules about the concepts, as well as their reliability, are produced automatically from data. The reasoning capability of the fuzzy inference engine of the KRFPN is used in a novel way for inconsistency checking of the classified image segments, automatic scene recognition, and the inference of generalized and derived classes. The results of image interpretation of Corel images belonging to the domain of outdoor scenes achieved by the proposed system outperform the published results obtained on the same image base in terms of average precision and recall. Owing to the fuzzy-knowledge representation scheme, the obtained image interpretation is enriched with new, more general and abstract concepts that are close to the concepts people use to interpret these images.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Digital images have become unavoidable in the professional and private lives of modern people. In recent years, the frequent use of digital images has become necessary in different fields like medicine, insurance and security systems, geo-informatics, advertising and commerce, as well as in other business areas. In private life, digital images are used for documenting people close to us, pets, sights and events such as birthdays, parties, trips, excursions and sporting activities. This widespread use has caused a rapid increase in the number of digital images that, today, on specialized websites, can be counted in the millions. However, a large number of images leads to problems with searching and retrieval, as well as with organizing and storing.

As the majority of images are barely documented, it is believed that we could retrieve and arrange images simply if they were
automatically annotated and described with words that are used in an intuitive image search. However, the task of mapping image features that can be extracted from raw image data to the words that users normally use for articulating their requirements is not a trivial one. For example, it seems natural to use a destination name when retrieving holiday images, or some terms that describe a scene, such as the coast or mountains, or activities like diving, skiing, etc. A major research challenge is bridging the semantic gap between the low-level image features available to a computer and the interpretation of the images in the way that humans do (Smeulders, Worring, Santini, Gupta, & Jain, 2000). In addition, one should take into account that image interpretation inherent to humans includes concepts associated with the content of the image on different levels of abstraction. This is referred to as the multi-layered interpretation of image content. To systematically describe the visual content of an image and its semantics, we have defined a knowledge-based image representation model consisting of multiple layers of image representation. Layers are organized according to the amount of knowledge needed to automatically interpret the image using inference about concepts belonging to the layer.

According to the defined image representation model, an intelligent system for multi-layered image annotation is proposed. The first layer of the image interpretation contains concepts obtained by the classification of image segments using a conventional supervised classification method. Higher levels of image interpretation involve concepts that are more abstract. These concepts are difficult to infer directly based on low-level features and without knowledge relevant to the problem domain. Therefore, we have defined a fuzzy knowledge-representation scheme based on the fuzzy Petri net (KRFPN) formalism to represent knowledge about concepts that can appear in an image. Fuzzy Petri nets combine fuzzy set theory and Petri net theory to provide the representation of knowledge, which is, in the context of image interpretation, often incomplete, imprecise, uncertain and ambiguous in nature.

The KRFPN formalism is originally supported with a fuzzy inference engine that deals with approximate reasoning. The reasoning capability of the inference engine was used in an original way to draw conclusions about classes of image scenes and more abstract classes. The system can handle the ambiguity and uncertainty about concepts and relations, so decisions about more abstract concepts can be made even when the input information about the concepts present in an image is imprecise and vague.
To reduce the propagation of errors through the hierarchical structure of concepts and to increase the reliability of conclusions, as well as to improve the precision of image annotation, a consistency-checking procedure is proposed.

The acquisition of knowledge used by the inference engine is facilitated in such a way that all the facts and rules about the composition and distribution of concepts, as well as their reliability, are produced automatically from data. Both new relationships and new concepts with an appropriate measure of reliability are stored into the knowledge base and used by the inference engine.

The paper is organized as follows. First, in Section 2, different approaches to image-content interpretation are explained and a detailed overview of related work is given. The layers of the multi-layered image representation with respect to the amount of knowledge needed for the image interpretation are given in Section 3. A system for the multi-layered image annotation is proposed in Section 4. A fuzzy-knowledge representation scheme adapted for the outdoor image domain is presented in Section 5. Inputs to the scheme are concepts obtained as the results of an image-segments classification using a Bayesian classifier. The application of the fuzzy inference engine for checking the consistency of the obtained results of the image-segment classification and for the recognition of scene context is given in Sections 6 and 7, respectively. The fuzzy inference algorithm used to derive more abstract concepts associated with the image is described in Section 8. The experimental results of the image interpretation at the layer that corresponds to automatic image annotation are given and compared to previously reported methods in Section 9. Additionally, in Section 9, an improvement to the results of the automatic image annotation after checking the inconsistency of the concepts obtained during the image-segments classification is presented and discussed.

2. Related work

Image interpretation is a complex task that strongly depends on the purpose of annotation. Moreover, human interpretation is limited by the knowledge, culture, experience and point of view of the person. Therefore, in the development of an automatic image annotation system, the types of concepts that will be used for image interpretation should be decided first, depending on the purpose of the annotation.

Among the oldest models for image annotation is Shatford's image-content classification of general-purpose images, drawing on theory from art history, which classifies image content into general, specific and abstract concepts (Shatford, 1986). Additionally, the contents of an image are associated with aspects of objects, with spatial and temporal aspects, and with aspects of activities or events. In (Eakins & Graham, 2000), a multilayer interpretation of the image content is considered in the context of image search. The authors defined three semantic layers of image interpretation. At the first level, image interpretation is based on the presence of certain combinations of features, such as color, texture or shape, while at the second level, image interpretation deals with the presence and distribution of certain types of objects. At the third level, image interpretation includes a description of specific types of events or activities, locations and emotions that one can associate with the image.
The authors of (Hare, Lewis, Enser, & Sandom, 2006) provide a simplified hierarchical view between the two extremes, the image itself and its full semantic interpretation. At the lowest level are the image and its "raw" data. The second level consists of low-level features related to a part of an image or to the whole image. A combination of prototype feature vectors is part of the third level. If these image parts can be associated with the corresponding objects, then this makes the fourth level. The top level of image interpretation, referred to as full semantics, includes concepts that describe the events, actions, emotions and a broader context of the image. This model, particularly in the layers related to visual image content, mostly influenced the image representation model that we propose. The main difference is in the higher layers used to model the image semantics.

There are two major approaches widely used for image annotation, one using statistical methods and the other mostly using knowledge-based methods belonging to the field of artificial intelligence. Both approaches are used in our system: the statistical approach in the first layer of the image interpretation and the knowledge-based approach in the higher layers.

In the statistical approach, most methods can be grouped into translation or classification models. In the translation model of (Duygulu, Barnard, de Freitas, & Forsyth, 2002), the co-occurrence of image regions and annotation words is used to model the relationship between annotation words and images or image regions. In classification methods, such as (Barnard et al., 2003; Li & Wang, 2003; Hu & Lam, 2013), the words used for image annotation correspond to class labels for which classifiers are trained. Due to the intra-class variability and inter-class similarity, class labels usually correspond to objects in an image, but they can correspond to scenes as well. In (Fei-Fei & Perona, 2005), natural scenes were learned by a Bayesian hierarchical model in an unsupervised way from local image regions. In (Yin, Jiao, Chai, & Wang, 2015), discriminant scene features were learned using a single-layer sparse autoencoder (SAE), and then an SVM classifier is used for scene classification.

Some methods use multi-label learning for solving the problem of annotating images with more than one word (Feng & Xu, 2010). To improve the accuracy of the multi-label classification algorithm, in (Yu, Pedrycz, & Miao, 2014) the correlation among the labels and the uncertainty of classification between feature space and label space have been considered, and in (Hong et al., 2014) a selection of discriminative features has been proposed. Lately, deep neural networks have been examined for the task of multi-label image annotation. In (Chengjian, Zhu, & Shi, 2015), a multimodal deep neural network pre-trained with convolutional neural networks is proposed.

Such statistical methods commonly use quite simple vocabularies that can be large but are generally not structured, because no relations are defined between the concepts in the vocabulary. On the other hand, methods that rely on knowledge bases use sophisticated, structured vocabularies in which geometrical, hierarchical or other relations between concepts are established (Tousch, Herbin, & Audibert, 2012). We have defined a vocabulary of this kind that is suitable for image retrieval to be used in our system.
A few approaches have explored the dependence of words on image regions (Blei & Jordan, 2003) or exploit the ontological relationships between annotation words, demonstrating their effect on automatic image annotation and retrieval (Maillot, 2005).

Fig. 1. Examples of images and their annotation at different levels of abstraction: (a) objects: sand, sea, sky; scene: Coast Scene; more general (abstract) concepts: Natural scene, Outdoor; derived concepts: Beach, SeaShore, Tallinn, Estonia, Meeting; (b) objects: plane, sky, trees, building; scene: Plane Scene; more general concepts: Vehicle, Man-made object, Outdoor; derived concepts: Transportation; (c) objects: snow, polar bear; scene: Polar bear; more general concepts: Wildlife, Mammal, Outdoor, Natural scene; derived concepts: Arctic.

A comprehensive survey of research made in the field of statistical automatic image annotation methods can be found in (Liu, Zhang, Lu, & Ma, 2007; Datta, Joshi, & Li, 2008; Zhang, Islam, & Lu, 2012).

For multi-layered image annotation, several approaches that use models for knowledge representation and reasoning have been proposed. The authors of (Benitez, Smith, & Chang, 2000) described a semantic network to represent the semantics of multimedia content (images, video, audio, graphics and text). The basic components of the semantic network are concepts that correspond to real-world objects and the relations among them, such as generalization, aggregation and perceptual relationships based on the similarities of their low-level features.

The authors of (Marques & Barman, 2003) propose a model with three levels. The lowest level contains vectors of low-level features extracted from images. The feature vectors are classified into concepts from a flat vocabulary using Bayesian networks. On the highest level is the RDF ontology that contains knowledge about the keywords and information about the relations between different concepts.

The authors of (Srikanth, Varner, Bowden, & Moldovan, 2005) proposed using a hierarchical dependency between annotation words to improve translation-based automatic image annotation and retrieval. The hierarchy is derived from the text ontology WordNet and represents the various levels of generality of the concepts expressed in image regions and words. To predict the likelihood of assigning a class label given an image, statistical language models defined on a visual vocabulary of blobs, represented by region feature vectors, are used.

In (Ivasic-Kos, Ribarić, & Ipsic, 2010), an image content analysis framework based on a Fuzzy Petri Net is proposed for the classification of image segments into objects. Also, a formal description of hierarchical and spatial relationships among concepts from the outdoor image domain is given. Fuzzy formalism was also applied in (Nezamabadi-pour & Kabir, 2009), where a fuzzy k-NN classifier with relevance feedback was used to assign semantic labels to database images.

In (Athanasiadis et al., 2009; Simou, Athanasiadis, Stoilos, & Kollias, 2008), an ontology and the inference engine FIRE (Fuzzy Inference Reasoning Engine) (Stoilos, Stamou, Tzouvaras, Pan, & Horrocks, 2005) were used for analyzing image content belonging to the beach domain. Later, the same group of authors (Papadopoulos et al., 2011) compared different approaches attempting to use spatial information for semantic image analysis.
Unlike the approaches described above, we propose a model of a knowledge-based multi-layered image annotation system. We have merged the statistical approach for the classification of image segments into objects and the knowledge-based approach to infer concepts that are more abstract. We took advantage of statistical methods to facilitate the knowledge acquisition, so that computable facts and rules about the concepts, as well as their reliability, are automatically generated from data.

The key components of the proposed multi-layered image annotation system are the KRFPN scheme based on the Fuzzy Petri Net formalism and the integrated fuzzy inference engine. We have exploited the capability of the KRFPN inference engine for reasoning with uncertainty to infer scenes and concepts that cannot be mapped to images without using the domain knowledge. In addition, to refine the image annotation and to reduce the propagation of errors through the hierarchical structure of concepts, we have included a novel, knowledge-based, consistency-checking procedure in the proposed system. For image annotation refinement, the correlation between annotated keywords has been used in previous research, and lately graph-based algorithms for image analysis have been investigated as well. A comprehensive survey of image annotation refinement techniques is given in (Dong, 2014).

3. Multi-layered image representation

An image representation includes the visual content and the annotation of an image. The visual content of an image refers to the information that may be collected by analyzing low-level image features, while the image annotation includes concepts that may describe both the content and the context of an image. The task of automatic image annotation is challenging because the number of possible concepts that one can use to describe most images is large and highly dependent on the application, the user's knowledge, needs, cultural background, etc., and it is hard to choose the right type of concepts that would be universally appropriate. For instance, to annotate the images in Fig. 1, one can use concepts that are related to the objects that appear in the image (sand, sea, sky, snow), concepts that represent the scene (beach, coast, coastline, shore, seashore), more general scene concepts (wildlife, outdoor, natural scene) or activities (walking, get wet feet). If the user is familiar with the context of an image, its description will be more subjective and will probably include the name of a place (e.g. Tallinn, Estonia for Fig. 1a), the names of the people appearing in it, a description of the relevant event (e.g. Meeting for Fig. 1a), evoked emotions, etc.

Although different people will most likely use different concepts to annotate the same image, the used concepts can be organized according to the amount of knowledge needed to reach each abstraction level of image interpretation (Ivasic-Kos, Pavlic, & Pobar, 2009). Therefore, we propose a multi-layered image representation model in which the layers correspond to concepts at different levels of abstraction. The layers reflect the increase of the amount of knowledge included in the automatic image annotation (Fig. 2) from the lower to the higher layers, where the lower layers (V1-V2) represent the visual content and the layers MI1-MI4 represent the image semantics.

Fig. 2. Layers of image representation in relation to the knowledge level.
The initial layer of an image representation is the layer V0, and it represents the raw image. The image is usually segmented (layer V1) for analysis, and the low-level features are extracted from the image segments (layer V2). The amount of knowledge required for segmentation (layer V1) and feature extraction (layer V2) is low.

It is assumed that a multi-layered image annotation includes concepts ranging from elementary classes EC (layer MI1), into which image segments are classified, and scene classes SC (layer MI2) that describe the scene, to generalized classes GC (layer MI3) and derived classes DC (layer MI4). For instance, the proposed multi-layered image annotation related to Fig. 1c is EC = {snow, polar bear}; SC = {Scene-Polarbear}; GC = {Wildlife, Mammal, Outdoor, Natural scene}; DC = {Arctic}.

Elementary classes are obtained as the results of image-segments classification and are used as a flat vocabulary for automatic image annotation. It is assumed that instances of elementary classes correspond to objects in the real world. Spatial relations, spatial locations and co-occurrence relations can be defined for elementary classes, like EC1 is-above EC2, EC1 is-on-top, or EC1 occurs-with EC3.

Scene classes are used to represent the context or semantics of the whole image, according to common sense and expert knowledge. A part-of relation or its inverse, the relation consists-of, can be defined between an elementary class and a scene class, e.g. EC1 is-part-of SC2 or SC2 consists-of EC1.

Generalized classes are defined as a generalization of scene classes. The is-a relation can be defined between a scene class and a generalized class, e.g. SC2 is-a GC1. There can be multiple levels of generalization, so the relation is-a can be defined between generalized classes too, e.g. GC1 is-a GC3 is-a GC5. Derived classes include abstract concepts, activities, events or emotions that can be associated with an image. Different types of relations, such as the associate-to or is-synonym-of relation, can be defined between derived classes and generalized or scene classes.
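To make the layered representation concrete, the following minimal Python sketch (our own illustration, not code from the paper) encodes the multi-layered annotation of Fig. 1c; the dictionary keys mirror the layers MI1-MI4 defined above.

```python
# A minimal sketch of the four semantic layers for the image in Fig. 1c.
annotation = {
    "MI1": {"snow", "polar bear"},                              # elementary classes (EC)
    "MI2": {"Scene-Polarbear"},                                 # scene classes (SC)
    "MI3": {"Wildlife", "Mammal", "Outdoor", "Natural scene"},  # generalized classes (GC)
    "MI4": {"Arctic"},                                          # derived classes (DC)
}

# The multi-layered interpretation of the image is the union of all layers.
interpretation = set().union(*annotation.values())
```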
4. A multi-layered image annotation system

The architecture of our intelligent multi-layered image annotation system (MIAS) is depicted in Fig. 3. The system deals with all the layers of image representation given in Fig. 2, ranging from the segmented image at layer V1 to the multilayer image interpretation at layer MI4. The input to the system is an image belonging to the V0 layer of the image representation, and the system output is a multi-layered interpretation of the image that consists of concepts obtained from four layers of image interpretation, i.e., layers MI1, MI2, MI3 and MI4.

A raw image I at layer V0 is first segmented with a normalized-cuts algorithm (Shi & Malik, 2000). The segmented image corresponds to the V1 layer of the image representation. Formally, the relationship between the raw image I and the image segments s_i, i = 1, ..., m, may be written as V1(I) = {s1, s2, ..., sm}. From each image segment, low-level features are extracted (such as size, position, height, width, colour, shape, etc.), which should represent the geometric and photometric properties of a segment. Each image segment is then represented by the k-component feature vector x = (x1, x2, ..., xk)^T. Accordingly, an image at the V2 layer of the image representation is described with as many feature vectors as there are image segments. Thus, the relationship between the raw image I and the feature vectors x_i, i = 1, ..., m, obtained from the image segments s_i, i = 1, ..., m, is given as V2(I) = {x1, x2, ..., xm}.

Each image segment is then classified using the Bayes classifier into one of the elementary classes EC_i ∈ EC according to the maximum posterior probability (c_MAP). The Bayes classifier was trained on a training set of image segments annotated with labels corresponding to natural and artificial objects. For each occurrence of the feature vector x, the classification is based on the Bayes theorem:

c_{MAP} = \arg\max_{EC_i \in EC} \frac{P(x|EC_i) P(EC_i)}{P(x)}.  (1)

The conditional probability P(x|EC_i) of a feature vector x for the given elementary classes EC_i ∈ EC and the prior probability P(EC_i), ∀EC_i ∈ EC, are estimated according to the data in a training set. It is taken into account that the evidence factor P(x) is a scale factor that does not influence the classification results.
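The classification step of Eq. (1) can be sketched in a few lines of Python (our own illustration, assuming the class-conditional densities and priors have already been estimated from the training set; the function names are hypothetical):

```python
def classify_segment(x, likelihood, prior):
    """MAP classification of a segment feature vector x (Eq. (1)).

    likelihood: {EC_i: callable returning the estimate of P(x | EC_i)}
    prior:      {EC_i: P(EC_i)}, both estimated from the training set.
    The evidence P(x) is a common scale factor and is omitted from the argmax.
    """
    return max(likelihood, key=lambda ec: likelihood[ec](x) * prior[ec])

def annotate_mi1(v2, likelihood, prior):
    """Layer MI1 is the union of the predicted elementary classes over
    all feature vectors of the segmented image (layer V2)."""
    return {classify_segment(x, likelihood, prior) for x in v2}
```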
The result of the image-segments classification is m annotated segments of the image I, in such a manner that each one is annotated with one of the elementary classes. The union of the elementary classes obtained by the classification of the image segments forms an automatic image interpretation at layer MI1, often referred to as automatic image annotation. The classes or elements of the interpretation set MI1(I) ⊆ EC are also called labels, annotation words, or keywords.

A knowledge-representation scheme based on the Fuzzy Petri Net formalism (Ribarić & Pavešić, 2009) and the fuzzy inference engine are the key components of the proposed multi-layered image annotation system MIAS. The defined fuzzy knowledge-representation scheme represents the knowledge about concepts and relations in the image domain, which is often uncertain, imprecise and ambiguous.

Fig. 3. Architecture of a multi-layered image-interpretation system (MIAS).

The fuzzy knowledge base contains the following main components: fuzzy relationships between elementary classes, fuzzy relationships between elementary classes and scene classes, and fuzzy relationships between scene classes and generalized or derived classes. The fuzzy relationships are defined using the training set and expert knowledge. One of the components of the system MIAS is an inference engine (IE) used for image interpretation on the layers MI1-MI4. The inference engine supports the fuzzy inheritance and fuzzy recognition procedures. The fuzzy inheritance is used for inconsistency checking and for class generalization, and the fuzzy recognition is applied for scene recognition.

The facts in the fuzzy knowledge base, particularly those related to relationships among elementary classes, are used to check the consistency of the set MI1(I). An elementary class for which it is concluded that it does not belong to a likely context, obtained e.g. due to inaccurate segmentation, can be discarded or replaced with another elementary class that has similar properties and fits the context. The elementary classes of an image that have passed the inconsistency checking are the inputs into the MI2 image-interpretation layer for scene recognition. Each scene in the knowledge base is defined, based on a training set, as an aggregation of typical elementary classes. Thus, it is possible to conclude which scene is the most likely one from the elementary classes from the set MI1(I). The recognised scene class makes the image interpretation at the layer MI2, MI2(I) ⊆ SC.

Based on the scene class from the set MI2(I), more abstract generalized classes are inferred by the inference engine (see Section 5.1) using generalization relationships from the fuzzy knowledge base. Generalization relationships, as well as heuristics about this particular domain, are explicitly specified by a human expert. Once determined, the generalized classes can be further generalized to a more abstract generalized class. The inferred generalized classes form the image interpretation at the layer MI3, so for a given image I, MI3(I) ⊆ GC. The analogous inference procedure can be applied on generalized and scene classes to obtain derived classes related to a given image I, MI4(I) ⊆ DC.

The outputs from the proposed system are classes at different levels of abstraction that include elementary classes, scene classes and generalized classes, as well as derived classes.

The defined KRFPN scheme can be independently used, modified and connected with other KRFPN schemes in a hierarchical structure to expand the knowledge base with new concepts.

5. A knowledge-representation scheme

To model objects and their relationships in an image, some knowledge-representation formalism has to be used and domain knowledge needs to be included. However, considering that image segmentation is often imprecise and subject to errors, and that knowledge about the concepts is often incomplete, an ability to draw conclusions from imprecise, fuzzy knowledge is necessary. For this purpose, a knowledge-representation scheme based on the KRFPN formalism (Ribarić & Pavešić, 2009) is defined for a multi-layered annotation of images.

5.1. Definition of the KRFPN scheme for the multi-layered image annotation

We have defined the KRFPN scheme to present the elements of the knowledge base used for inferring concepts on the higher layers of image interpretation. The KRFPN scheme for multi-layered image annotation is defined as a 13-tuple:

KRFPN = (P, T, I, O, M, Ω, μ, f, c, α, β, λ, Con),  (2)

where the first ten components are those of the marked Fuzzy Petri Net (FPN) (Li & Lara-Rosano, 2000):

P = {p1, p2, ..., pn}, n ∈ N, is a set of places; a function α: P → D maps a place from the set P to a concept from a set D used for multi-layered image annotation. It is set that D = EC ∪ SC ∪ GC ∪ DC, where the subset EC includes 28 elementary classes such as {Airplane, Train, Shuttle, Ground, Cloud, Sky, Coral, Dolphin, Bird, Lion, Mountain, etc.}, the subset SC includes 20 scene classes such as {Seaside, Inland, Sea, Space, Airplane Scene, Train Scene, Tiger Scene, Lion Scene, etc.}, the subset GC includes generalized classes such as {Outdoor Scenes, Natural Scenes, Man-made Objects, Landscape, Vehicles, Wildlife, etc.}, and the subset DC includes {Savannah, Africa, Safari, Vacation, etc.}.
T = {t1, t2, ..., tm}, m ∈ N, is a set of transitions; a function β: T → Σ maps a transition from the set T to a relationship from a set Σ defined according to expert knowledge. The set Σ includes the relationship occurs_with between elementary classes, which models the common occurrence of elementary classes in the image, and its negation not_occurs_with; the aggregation relationship consists_of, defined between a scene class that has the role of the aggregation and elementary classes that have the role of the components of the aggregation; the generalization relationship is_a, defined either between a scene class and a generalized class or between generalized classes or derived classes; and, in addition, an is_synonym_of relation defined between synonyms of concepts. For the relationship consists_of, an inverse relationship -(consists_of) = is_part_of is defined.

I: T → P∞ is an input function, while O: T → P∞ is an output function for a transition. In our scheme, the co-domain of the input and output functions is the set P instead of a bag P∞ as defined in (Peterson, 1981).

M = {m1, m2, ..., mr}, 1 ≤ r < ∞, is a set of tokens used by the inference engine. The inference procedure is based on the dynamic properties of the Petri Net, i.e. on the firing of the transitions (Peterson, 1981). The tokens' distribution within the places is given as Ω(p) ∈ P(M), where P(M) is the power set of M. The initial distribution of tokens defines the initial marking vector μ0 = (μ1, μ2, ..., μn), with μi = μ(pi) ∈ {0, 1}, i.e. in the initial marking a place can have no token or at most one token. In the case of scene recognition, μ0 corresponds to the elementary classes obtained at the layer MI1.

c: M → [0, 1] is an association function that gives a token value corresponding to the degree of truth of the concept mapped to the place marked with that token. The value of a token in an initial distribution can be set to the estimated posterior probability of the concept associated with that marked place, or set to 1.

f: T → [0, 1] is an association function that gives a transition value corresponding to the degree of truth of the relationship mapped to a transition. The measure of truthfulness of the relationship depends on the kind of relationship; it is computed using the data in the training set in the case of pseudo-spatial and spatial relationships based on the co-occurrence of elementary classes in images. The function f can also be defined by an expert in the case of more abstract classes (SC, GC and DC).

λ ∈ [0, 1] is a threshold value related to the firing of transitions. If the threshold value λ is set, the truth value c(m1) of each token must exceed the value of λ if the transition is to be enabled.

Con ⊆ (Σ × Σ) is in this scheme defined as a set of pairs of mutually contradictory relations. It is defined on the set of relations occurs_with and not_occurs_with between elementary classes. It can also be defined between concepts if necessary.
The KRFPN scheme can be visualized by a directed graph containing two types of nodes: places and transitions. Graphically, the places pi ∈ P are represented by circles and the transitions tj ∈ T by bars. The directed arcs between the places and transitions, and between the transitions and places, represent the transition input I(tj) ⊆ P and output O(tj) ⊆ P functions, respectively (Fig. 4). In a semantic sense, each place from the set P corresponds to a concept di ∈ D and any transition from the set T to a relation rk ∈ Σ. A dot within a place represents a token m1 ∈ M.

Fig. 4. A generic form of a chunk of knowledge in the Fuzzy Petri Net formalism.

To a token at the input place pi ∈ I(tj) and to the transition tj ∈ T, the values c(m1) and f(tj) are assigned, respectively. The assigned values implement uncertainty and fuzziness in the scheme and can be expressed on truth scales, where 0 means "not true" and 1 "always true". Semantically, the value c(m1) expresses the degree of uncertainty of a concept di ∈ D mapped to a particular place pi ∈ P, and the value f(tj) corresponds to the degree of uncertainty of a relationship ri ∈ Σ mapped to a transition tj ∈ T.

A place that contains one or more tokens is called a marked place. The tokens give dynamic properties to the Petri Net and define its execution by the firing of an enabled transition. A transition is enabled when every input place of the transition is marked, i.e., if each of the input places of the transition has at least one token and if each token value exceeds the threshold value λ.

An enabled transition tj can be fired. By firing, a token moves from all its input places pi ∈ I(tj) to the corresponding output places pk ∈ O(tj). In Fig. 4, there is only one input place for the transition tj, I(tj) = pi, and only one output place, O(tj) = pk. After the transition firing, a new token value c(m2) at the output place is obtained as c(m2) = c(m1)·f(tj) (Fig. 5).

Fig. 5. A new token value is obtained in the output place after firing.

The dynamic properties of the scheme are important for the inference-engine definition. The inference engine on the KRFPN scheme consists of two automated reasoning processes: fuzzy inheritance and fuzzy recognition. All the steps of the inference algorithms are given in (Ribarić & Pavešić, 2009). The complexity of both inference algorithms is O(nm), where n is the number of places (concepts) and m is the number of transitions (relations) in the knowledge base. The algorithms are used in novel and original ways to check the consistency of classes, for scene recognition, and for reasoning on more abstract classes, as described in detail below.
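The sketch below (a deliberately simplified rendering, not the authors' implementation; it assumes one input and one output place per transition) shows how the execution-relevant components of the 13-tuple and the firing rule c(m2) = c(m1)·f(tj) can be represented:

```python
from dataclasses import dataclass, field

@dataclass
class KRFPNSketch:
    alpha: dict                 # place -> concept, e.g. {"p45": "Seaside"}
    beta: dict                  # transition -> relation, e.g. {"t86": "consists_of"}
    inp: dict                   # transition -> its input place I(t)
    out: dict                   # transition -> its output place O(t)
    f: dict                     # transition -> degree of truth f(t)
    lam: float = 0.0            # firing threshold λ
    tokens: dict = field(default_factory=dict)  # marked place -> token value c(m)

    def fire(self, t):
        """Fire transition t if it is enabled; the token moves to the
        output place with the new value c(m2) = c(m1) * f(t)."""
        c_m1 = self.tokens.get(self.inp[t])
        if c_m1 is None or c_m1 <= self.lam:
            return None          # transition not enabled
        del self.tokens[self.inp[t]]
        self.tokens[self.out[t]] = c_m1 * self.f[t]
        return self.out[t], self.tokens[self.out[t]]
```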
5.2. Modeling the truth value of relationships

Given that the mapping between concepts and image features is often unreliable, and due to incomplete knowledge about the concepts, the uncertainty is implemented in the scheme by associating a value with a transition and with a token in a marked place. A transition value expresses the degree of truth or the reliability of the related relationship, while a token value corresponds to the truth value or the reliability of the concept. The degree of truth of the relationships depends on the type of the relationship and is set according to expert knowledge or computed using the data in the training set. For example, the degrees of truth of the relationships that model the generalization of classes are determined by the expert, while the truth values of the relationships consists_of and occurs_with are computed using the data in the training set, as explained below.

5.2.1. Relationship consists_of

To define the truth value of the aggregation relationship consists_of, it is assumed here that a scene may contain several characteristic elementary classes. Thus, the relation among the scene and elementary classes is an aggregation relationship where the scene plays the role of the aggregation and the elementary classes have the role of the components of the aggregation. By analyzing the data in the training set, the common occurrence of elementary classes in the scene class is determined and used for the creation of the rules on relationships between scenes and elementary classes. Instead of choosing an elementary class with a maximum posterior probability, a modified Bayes rule is used to form a set MS that corresponds to the specific scene class. A set MS_{SC_i} for a specific scene class SC_i, ∀i, is given by:

MS_{SC_i} = \left\{ EC_k : \arg_i P(SC_i|EC_k) \approx \arg_k \frac{P(EC_k|SC_i)}{P(EC_k)} \geq \varepsilon \right\}.  (3)

Eq. (3) mirrors the idea of finding the most representative set of elementary classes for a given scene class. MS_{SC_i} is the set of all those elementary classes EC_k, k = 1, 2, ..., that participate in a scene class SC_i with the posterior probability P(SC_i|EC_k), ∀k EC_k, exceeding the marginal value ε ≥ 0.05. The marginal value is determined experimentally. The prior probability P(EC_k) for a given elementary class EC_k is computed from the training set to bring in the degree of discrimination of each elementary class for a given scene class.

The truth value attached to the aggregation relationship consists_of between the elementary classes and the scene class was determined using the Bayes rule for the posterior probability P(SC_i|EC_k), ∀k EC_k ∈ MS_{SC_i}, for the specific scene:

P(SC_i|EC_k) = \frac{P(EC_k|SC_i) P(SC_i)}{\sum_{j=1}^{s} P(EC_k|SC_j) P(SC_j)},  (4)

where s = |SC| is the number of scene classes.

Fig. 6. Relations among the scene "Seaside" and its components.

In Fig. 6, a part of the knowledge base is presented, showing the relationships among a particular scene class Seaside and its components that correspond to the elementary classes from the set MS_seaside = {sky, cloud, water, grass, tree, rock, sand, building} defined by Eq. (3). The degree of truth f(tj) of the transition tj that corresponds to the relation consists_of between the particular scene class "Seaside" and its components is given by P(Seaside|EC_k), EC_k ∈ MS_seaside, and is determined by Eq. (4). For instance, the truth value of the relation consists_of mapped to the transition t86 between the scene class "Seaside" of place p45 and the elementary class "water" of place p26 is f(t86) = 0.95.
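A sketch of how the consists_of truth values could be estimated from an annotated training set (our own encoding of the data as (scene label, set of elementary classes) pairs; only Eqs. (3) and (4) are implemented):

```python
def consists_of_truth(images, eps=0.05):
    """Estimate f(t) for consists_of relations (Eqs. (3) and (4)).

    images: list of (scene_label, set_of_elementary_classes) pairs.
    Returns {SC: {EC: P(SC|EC)}} restricted to the representative sets MS_SC.
    """
    n = len(images)
    scene_n, joint_n = {}, {}
    for sc, ecs in images:
        scene_n[sc] = scene_n.get(sc, 0) + 1
        for ec in ecs:
            joint_n[(sc, ec)] = joint_n.get((sc, ec), 0) + 1

    truth = {sc: {} for sc in scene_n}
    for (sc, ec), n_joint in joint_n.items():
        # Bayes rule (Eq. (4)): P(SC|EC) = P(EC|SC)P(SC) / sum_j P(EC|SC_j)P(SC_j)
        num = (n_joint / scene_n[sc]) * (scene_n[sc] / n)
        den = sum((joint_n.get((s, ec), 0) / scene_n[s]) * (scene_n[s] / n)
                  for s in scene_n)
        posterior = num / den
        if posterior >= eps:          # marginal value of Eq. (3)
            truth[sc][ec] = posterior
    return truth
```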
5.2.2. Relationship occurs_with

To create the rules on relationships between elementary classes and to define the truth value of the relationship occurs_with, the mutual occurrence of the classes EC_j and EC_i in each image in the training set is analyzed. This can be formally defined as:

P(EC_j|EC_i) = \frac{P(EC_j \cap EC_i)}{P(EC_i)}.  (5)

If P(EC_j|EC_i) is less than the threshold value τ = 0.1, then the relationship not_occurs_with is defined between the elementary classes EC_j and EC_i, i ≠ j, with a truth value of 0.9. Otherwise, the truth value of the not_occurs_with relationship is 1 − P(EC_j|EC_i). The occurs_with relationship is used in the proposed inconsistency-checking procedure to validate the results of the image-segment classification and to check whether the results obtained on all the image segments are consistent.
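The pairwise rules of Section 5.2.2 can be derived with a short sketch; the code below follows the stated thresholding (0.9 below τ, 1 − P(EC_j|EC_i) otherwise), while using P(EC_j|EC_i) itself as the occurs_with truth value is our assumption:

```python
def pairwise_relations(images, tau=0.1):
    """Derive occurs_with / not_occurs_with truth values (Eq. (5)).

    images: list of sets of elementary classes, one set per training image.
    """
    classes = set().union(*images)
    count = {ec: sum(ec in img for img in images) for ec in classes}
    relations = {}
    for ec_i in classes:
        for ec_j in classes - {ec_i}:
            both = sum(ec_i in img and ec_j in img for img in images)
            p = both / count[ec_i]                # P(EC_j | EC_i), Eq. (5)
            if p < tau:
                relations[(ec_i, "not_occurs_with", ec_j)] = 0.9
            else:
                relations[(ec_i, "not_occurs_with", ec_j)] = 1.0 - p
                relations[(ec_i, "occurs_with", ec_j)] = p  # assumed truth value
    return relations
```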
5.2.3. Spatial relationships

Spatial relationships like at the top and at the bottom have not been used in this experiment, since the relationships between the objects in the images differed from the natural relations. In the images from the domain of natural scenes that we have used, the sky, trees, grass and water can appear both at the bottom and at the top of the image; so, for example, water can appear above the grass and trees, as in Fig. 7e. The ellipses in Fig. 7a-e show the positions of the segments that are not in line with the common knowledge about the spatial relationships of objects in nature. For example, the grass segment in Fig. 7c is above the tiger segment.

Fig. 7. Position of objects sky, water, grass, trees and the spatial relations between the objects in the image.

If it turns out to be useful, spatial relationships, as well as fuzzy temporal relationships or new concepts, can be added to the scheme and used by the inference engine.

6. Knowledge-based approach to inconsistency checking

It is to be expected that some of the elementary classes obtained using the Bayes classification rule (Eq. (1)) do not fit the image context. To check for the inconsistency of the obtained elementary classes, an inconsistency-checking procedure is proposed that uses the facts in the knowledge base related to the occurs_with and not_occurs_with relations. The relations occurs_with and not_occurs_with for each obtained elementary class can be analyzed using the fuzzy-inheritance algorithm. Based on the results of the fuzzy inheritance, the classes which are elements of the domain of the relation not_occurs_with are eliminated from the set MI1, the first semantic layer.

In order to illustrate the proposed procedure for inconsistency checking, an example follows. Let the image I in Fig. 8 be given for a multi-layered image annotation. After the segmentation, using a normalized-cuts algorithm, the image is segmented into 7 areas: V1(I) = {s1, s2, ..., s7}. For each image segment, the low-level features are extracted and a feature vector is formed, so the image is represented at level V2 by the set of feature vectors: V2(I) = {x1, x2, ..., x7}. Then, using the Bayes classification method, each feature vector is classified into one of the elementary classes EC_i ∈ EC according to the maximum posterior probability (c_MAP, Eq. (1)). For the image I in Fig. 8, the obtained result, after the classification of all the image segments, is: "sky, water, water, shuttle, rock, water, sand". Thus, the set of obtained elementary classes forms an automatic image interpretation at the layer MI1 of the image I, as MI1(I) = {sky, water, shuttle, rock, sand}. Note that the elementary class shuttle is a result of misclassification, because a shuttle is not present in the image.

Fig. 8. Example of image representations at layers V0, V1, MI1.

Every obtained elementary class can be checked for inconsistency using the not_occurs_with relationships defined between the elementary classes in MI1(I) and the fuzzy-inheritance algorithm. For instance, to check the inconsistency of the elementary class shuttle, the fuzzy-inheritance algorithm is used as follows. The appropriate place in the knowledge-representation scheme is determined by the function α−1(shuttle) = p19, shuttle ∈ EC (Fig. 9). In Fig. 9, those not_occurs_with relations are presented for which shuttle is the input place. The initial token distribution is Ω0 = (∅, ∅, ..., {m1}, ..., ∅), i.e. the initial token is placed only at the place p19. For inconsistency checking, only the relations with outputs from the set MI1(I) are useful, and they are shown in black. According to the original FPN algorithm, all transitions related to these relations are enabled and can be fired, because the number of tokens in the input place (shuttle) is equal to the number of input arcs of the transitions. The transition values are obtained from the training set using Eq. (5). After firing, the token is removed from the input place (shuttle) and new tokens are created and distributed to the output places (sand, rock, water, …) as shown in Fig. 9b.

Fig. 9. A part of the KRFPN scheme related to the elementary class shuttle and the relationship not_occurs_with (a) before firing and (b) after firing the transitions.

The inheritance tree is formed starting from the root node, which is for this example π0(p19, {1.0}). Firing of the transitions creates new frontier nodes of the inheritance tree that correspond to the output places of the transitions. This step is repeated until the condition for stopping the algorithm is satisfied or the desired depth of the inheritance tree is reached. The frontier nodes are converted by the inheritance-tree algorithm into a frozen node (marked F), a k-terminal node (marked k-T) or an identical node (marked I), or into one of the types of nodes defined for the reachability tree (terminal, duplicate and interior). The inheritance tree of the KRFPN is similar to the concept of the reachability tree of Petri Nets (Chen, Ke, & Chang, 1990), except for the stopping conditions that are integrated in the KRFPN scheme (by the set Σ\{is_a}) or defined by the desired number of tree levels.

Fig. 10. Inheritance tree for the class "shuttle" (Fig. 9).

Fig. 10 shows a 1-level inheritance tree on the KRFPN scheme and the appropriate semantic interpretation of the inheritance paths for the elementary class shuttle. The nodes of the inheritance tree have the form (p_j, c(m_l)), j = 1, 2, ..., p, l = 1, 2, ..., r, 0 ≤ r ≤ |M|, where c(m_l) is the value of a token m_l in the place p_j, computed as the product of the token value at the input place and the corresponding value f(t_j). The arcs of the inheritance tree are marked with the value f(t_j) and the label of a transition t_j ∈ T, where, for example, t396 = 0.8 means f(t396) = 0.8. For each of the inheritance paths, the measure of truth is determined by the token value in the leaf node (the node in which the algorithm stops).

The obtained inheritance tree for the concept shuttle leads to the conclusion that the class shuttle does not occur with the elementary classes from the set MI1(I), so it can be concluded that the class shuttle most likely does not match the context of the image depicted in Fig. 8 and can therefore be discarded. Accordingly, after checking for the inconsistency, the refined image interpretation at the semantic layer MI1 is: MI1(I) = {sky, rock, sand, water}.
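A simplified rendering of this check (not the full fuzzy-inheritance algorithm of the KRFPN; the cut-off value below is our assumption) could look as follows:

```python
def check_consistency(mi1, relations, strength=0.5):
    """Discard elementary classes that contradict the rest of MI1(I).

    mi1:       set of elementary classes from layer MI1
    relations: {(EC_i, 'not_occurs_with', EC_j): truth value}
    A class is dropped when it has a strong not_occurs_with relation to
    every other class in the set, as 'shuttle' does in the Fig. 8 example.
    """
    refined = set(mi1)
    for ec in mi1:
        others = mi1 - {ec}
        if others and all(relations.get((ec, "not_occurs_with", o), 0.0) > strength
                          for o in others):
            refined.discard(ec)
    return refined

# Fig. 8 example: {'sky', 'water', 'shuttle', 'rock', 'sand'} -> 'shuttle'
# is discarded, leaving MI1(I) = {'sky', 'water', 'rock', 'sand'}.
```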
7. Scene recognition

For the task of scene recognition for a new, unknown image, the fuzzy-recognition algorithm based on the inverse KRFPN scheme (marked as –KRFPN) is used in an original way. The –KRFPN scheme is obtained by interchanging the positions of the input function I and the output function O for the transitions T in the 13-tuple (Ribarić & Pavešić, 2009). Additionally, by changing the position of the input and output functions, the relation mapped to a transition is transformed into its corresponding inverse relation. For example, for the relation consists_of in the KRFPN scheme, its inverse relation is_part_of is used in the –KRFPN scheme, i.e., –(consists_of) = is_part_of. Also, the co-domain of the association function c: M → [0, 1] that assigns values to the tokens (see Section 5.1) is expanded to c_r: M → [−1, 1], so that in the case of an exception a token may be associated with a negative value.

The proposed procedure for the scene recognition is as follows. The results of the image interpretation at layer MI1, after the inconsistency checking, are the input to the scheme used for further image interpretation at the layer MI2. The obtained elementary classes EC_i from MI1(I) are treated as components of an unknown scene class X. The elementary classes EC_i are mapped to the places {p1, p2, ..., pn} using the function α−1: EC_i → p_k. If defined, the reliability based on the posterior probability of each elementary class EC_i can be used as the token value c_r(m_l) in the place p_k whose interpretation corresponds to the given class EC_i. Otherwise, the token value is set to 1.

For instance, let us take the image I depicted in Fig. 8. The results of the image interpretation at the layer MI1(I) are the elementary classes {sky, rock, sand, water} that exist in the knowledge base. Based on the Bayes classification rule (Eq. (1)), the degrees of truth are assigned: sky ({0.5}), sand ({0.7}), rock ({0.4}), water ({0.6}). By using the function α−1, the initially marked places are determined (α−1(sky) = p20, α−1(sand) = p18, α−1(rock) = p17, α−1(water) = p26). A small part of the –KRFPN scheme with the initially marked places and the corresponding token values is given in Fig. 11.

Fig. 11. A small part of the –KRFPN scheme for the scene recognition for the image depicted in Fig. 8.

According to the initially marked places and the corresponding degrees of truth, four root nodes π_0^i, i = 1, ..., 4, of the recognition trees will be formed: π_0^1(p20, {0.5}), π_0^2(p18, {0.7}), π_0^3(p17, {0.4}), π_0^4(p26, {0.6}).

Fig. 12 shows the four corresponding recognition trees in the –KRFPN scheme with enabled transitions, starting from the root nodes. By firing the enabled transitions on the –KRFPN scheme, new nodes at the next higher level of the recognition tree are created and the appropriate values of the tokens are obtained:

c_r(m_{k+1}) = c_r(m_k) f(t_l),  (6)

where t_l is the transition between the concepts EC_i and SC_l, c_r(m_k) is the reliability of the elementary class EC_i, and f(t_l) is computed by Eq. (4). Due to the simplicity of the example, only one level of the recognition tree is generated. Note that only the recognition tree with the root node π_0^2 (Fig. 12b) directly corresponds to the small part of the –KRFPN depicted in Fig. 11.
The other recognition trees (Fig. 12a, c and d) also contain leaf nodes corresponding to the scene classes that are part of the knowledge base but are not depicted in Fig. 11.

Fig. 12. Recognition trees with enabled transitions for each root node.

The following steps of the scene recognition are as follows. Each leaf node π_i^k in the recognition tree k = 1, 2, ..., b is represented by a vector of dimension |P|, where P is the set of places, so that the index of a node in the recognition tree corresponds to the index of the vector component and the value of the node is assigned to the value of the vector component. For example, the node π_1^2 = (p45, 0.595) (Fig. 12b) is represented by the vector π_1^2 = (0, 0, ..., 0, 0.595, 0, ..., 0), so that all the vector components are assigned the value 0, except the 45th vector component, to which the node value of 0.595 is assigned. Accordingly, the total sum Z of all leaf nodes in all recognition trees is computed:

Z = \sum_{k=1}^{b} \sum_{i=1}^{o_k} \pi_i^k,  (7)

where π_i^k is the ith leaf node in the kth recognition tree, b ≤ |M| is the number of recognition trees, and o_k ≤ |P| is the total number of leaves in the recognition tree k. In this example there are b = 4 recognition trees, the corresponding numbers of leaves are o1 = 12, o2 = 2, o3 = 9, o4 = 11, and the total sum is:

Z = \sum_{k=1}^{4} \sum_{i=1}^{o_k} \pi_i^k = \sum_{i=1}^{12} \pi_i^1 + \sum_{i=1}^{2} \pi_i^2 + \sum_{i=1}^{9} \pi_i^3 + \sum_{i=1}^{11} \pi_i^4 = (0, ..., 0, 0.36, 0.44, 0.09, 0.05, 0, 0.03, 0, 0.04, 0, 0, 0.05, 0.03, 0.03, 0.04, 0, 0.16, 1.80, 0.05, 0.05, 1.11, 0, ..., 0).

For example, the 30th component of the vector Z, with the value 0.44, is obtained by summing all the values of the nodes in all the recognition trees that correspond to the place p30 (i.e. π_7^1, π_2^2, π_8^3, π_9^4): 0.115 + 0.175 + 0.052 + 0.102 = 0.44.

Then, the index of the element with the highest sum in Z = (Z1, Z2, ..., Z_{|P|}) among all of the nodes in all the recognition trees is selected as:

i^* = \arg\max_{i=1,...,|P|} \{Z_i\}.  (8)

In the case that there are several i for which the same maximum value of {Z_i} is obtained, the set I^* is created:

I^* = \{i_1^*, i_2^*, ...\}.  (9)

A scene class assigned to a place with the max argument p_i: i ∈ I^* is chosen as the best match for the given set of elementary classes obtained during the image interpretation at the layer MI1. In this example, the 45th component of the vector Z has the maximum value 1.80. Therefore, the set of max arguments consists of only one element, i_1^* = 45, so only one scene class is chosen as the best match, i.e., the one that is assigned to the place with that max argument, α(p45) = Seaside. The next scene candidate is α(p48) = Inland, with a value of 1.11.

By merging all the classes that are so far associated with the image, from the elementary classes to the scene class, the multi-layered interpretation of the image is formed. For example, the multi-layered interpretation of the image I (in Fig. 8) includes the results of the image interpretation at the layers MI1 and MI2: MI(I) = MI1(I) ∪ MI2(I) = {sky, rock, sand, water} ∪ {Seaside}.
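Flattening the recognition trees into a single weighted sum gives a compact approximation of Eqs. (6)-(8); the encoding of the inverse relations as a dictionary is our own:

```python
def recognize_scene(mi1_truth, is_part_of):
    """Score scene classes by summing leaf-node token values (Eqs. (6)-(8)).

    mi1_truth:  {EC: degree of truth c_r}, e.g. {'sky': 0.5, 'sand': 0.7, ...}
    is_part_of: {(EC, SC): f(t)} - truth values of the inverse relation
                -(consists_of) = is_part_of in the -KRFPN scheme.
    """
    z = {}
    for (ec, sc), f_t in is_part_of.items():
        if ec in mi1_truth:
            # Eq. (6): c_r(m_{k+1}) = c_r(m_k) * f(t); Eq. (7): sum over leaves
            z[sc] = z.get(sc, 0.0) + mi1_truth[ec] * f_t
    return max(z, key=z.get), z                    # Eq. (8): argmax over Z_i

# For Fig. 8, the inputs {'sky': 0.5, 'sand': 0.7, 'rock': 0.4, 'water': 0.6}
# give Z(Seaside) = 1.80 as the maximum, so 'Seaside' is recognized.
```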
8. Inference of more abstract classes

The obtained scene classes can be used as root nodes for the next inheritance process that infers more abstract concepts from higher semantic levels (here referred to as generalized and derived classes), either because they are directly linked with the concept or because they may be inferred by means of concepts at a higher level of abstraction (parents).

To determine the related, more abstract classes for a given scene class, the relations with its parents at higher levels of abstraction are inspected using the inheritance algorithm. The proposed procedure by which the more abstract classes are concluded will be illustrated using the example of the scene class Seaside ∈ SC that was the result of the recognition algorithm in Section 7. In Fig. 13, a part of the knowledge base is shown that includes information about the components of the class "Seaside" and its more abstract classes defined by the expert.

Fig. 13. A part of the knowledge base that shows the properties of the class "Seaside" and its parents.

At the first step of the algorithm, the appropriate place is determined by the function α−1(Seaside) = p45. A token value c(m_l) is set to 1, so the corresponding root node of the inheritance tree is π0(p45, {1.0}). Fig. 14 shows a 3-level inheritance tree on the KRFPN scheme for the class "Seaside" that shows its more abstract classes (nodes within the ellipses) as well as its properties.

Fig. 14. The inheritance tree for the concept "Seaside".

To determine the more abstract classes associated with the given class, the key nodes are those in the parent-child relationship with the given class. The nodes in the parent-child relationship for the class "Seaside" are: π_4^1(p59, {0.7}), π_5^1(p60, {1.0}), π_1^2(p53, {1.0}) and π_1^3(p55, {1.0}), and the following applies: α(p59) = Vacation, α(p60) = Landscape, α(p53) = Natural scene, α(p55) = Outdoor scene. The classes "Landscape", "Natural scene" and "Outdoor scene" are generalizations of the class "Seaside", while the class "Vacation" is a derived class that one can associate with the class "Seaside" using the relation is_associated_to.

Thus, the result of the multi-layered image annotation for the image I given in Fig. 8, after the generalization and the derived-concepts inference, is: MI(I) = MI1(I) ∪ MI2(I) ∪ MI3(I) ∪ MI4(I) = {sky, rock, sand, water} ∪ {Seaside} ∪ {Landscape, Natural scene, Outdoor scene} ∪ {Vacation}.

Also, new concepts can be added to the knowledge base. Some examples of such an extension are synonyms of the concepts defined in the scheme, like Seacoast for Seaside, or terms that are colloquially understood as synonyms, like Forest or Logs for Trees. In these cases, the is_synonym_of relation is defined between a class that is already defined in the knowledge base (e.g. Seaside) and the synonym that should be added (e.g. Seacoast). Fig. 15 shows the fuzzy-inheritance tree for the concept Seacoast, for which α−1(Seacoast) = p57 applies, so the corresponding root node of the inheritance tree is π0(p57, {1.0}).

Fig. 15. The inheritance tree for Seacoast, a synonym of the concept Seaside.

Inclusion of concepts at different levels of abstraction maps the organization of concepts from natural language to the image annotation and facilitates the adjustment of the system to the user's needs and expectations.
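The inheritance step of this section reduces to a bounded traversal of the is_a and is_associated_to links, multiplying the token value along each path; the sketch below uses a hypothetical encoding of the knowledge base:

```python
def infer_abstract(concept, links, depth=3):
    """Collect generalized and derived classes reachable from a concept.

    links: {concept: [(relation, parent, truth), ...]}, e.g.
           {'Seaside': [('is_a', 'Landscape', 1.0),
                        ('is_associated_to', 'Vacation', 0.7)],
            'Landscape': [('is_a', 'Natural scene', 1.0)]}
    """
    found, frontier = {}, [(concept, 1.0)]
    for _ in range(depth):                     # bounded depth of the tree
        nxt = []
        for node, value in frontier:
            for rel, parent, truth in links.get(node, []):
                v = value * truth              # token value in the new node
                if v > found.get(parent, 0.0):
                    found[parent] = v
                    nxt.append((parent, v))
        frontier = nxt
    return found  # e.g. {'Landscape': 1.0, 'Natural scene': 1.0, 'Vacation': 0.7}
```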
The inclusion of concepts at different levels of abstraction maps the organization of concepts from natural language onto image annotation and facilitates the adjustment of the system to the user's needs and expectations.

9. Experiments and discussion

To evaluate the proposed model of multi-layered image annotation, an experiment was performed on a part of the Corel image dataset related to outdoor scenes (e.g., Landscape, Vehicles, Animals, Space). The images were automatically segmented, based on the visual similarity of pixels, using the normalized-cut algorithm. Most of the images were segmented into approximately 10 regions. Every image segment was characterized by a set of 16 visual features based on the color in the CIE L∗a∗b∗ color model and on the size, position, height, width and shape of the region (Duygulu et al., 2002).

Also, each image segment of interest was annotated with the first keyword from the set of corresponding keywords provided by Carbonetto, Freitas, and Barnard (2004). The vocabulary used to annotate the image segments has 28 keywords related to natural and artificial objects such as 'airplane', 'bird', 'lion', 'train', etc. and landscapes like 'ground', 'sky', 'water', etc. The keywords from the vocabulary correspond to the elementary classes. The visual features and keywords of each image segment make up a data set.

The data set used for the experiment consists of 3960 segments obtained from 475 images of outdoor scenes. For supervised learning, the data was divided into training (2772) and test (1188) subsets by a 10-fold holdout cross-validation. Data in the training set were used to learn a classification model of each elementary class using a Bayes classifier. The features from the test set were used to test the model, using the corresponding elementary classes as ground truth.

To evaluate the MIAS system at layer MI1, the results of image classification at that layer are compared to the ground truth. The performance of the MIAS system at layer MI1 is expressed with the measures of recall (10) and precision (11). Average scores after 10 runs are shown in Fig. 16.

The recall is the ratio of the correctly predicted elementary classes (tp, true positives) to all elementary classes in the ground-truth data (tp + fn; fn, false negatives):

$$\mathrm{Recall} = \frac{tp}{tp + fn}. \qquad (10)$$

The precision is the ratio of correctly predicted elementary classes (tp) to the total number of elementary classes obtained from the automatic image interpretation at layer MI1 of the MIAS (tp + fp; fp, false positives):

$$\mathrm{Precision} = \frac{tp}{tp + fp}. \qquad (11)$$

The proposed MIAS system for image interpretation at the layer MI1 achieves an average precision of 32.6% and an average recall of 27.5%. The average precision is calculated as the average of all the values of precision obtained for each elementary class in the test set using the 10-fold cross-validation; the average recall is calculated analogously. Each elementary class in the graph (Fig. 16) is marked with a class ID, so that ID 1 corresponds to the elementary class 'airplane', ID 2 to 'bear', ID 3 to 'bird', and so on up to ID 28, which corresponds to 'zebra'. The highest precision, over 56%, was obtained for the elementary classes 'grass' - ID 11, 'polar bear' - ID 15, 'rock' - ID 16, 'sky' - ID 20 and 'tracks' - ID 23.
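For concreteness, Eqs. (10) and (11) can be computed per class and then macro-averaged, which is how the average precision and recall above are obtained. The following sketch uses invented toy labels rather than the Corel data; the function and variable names are ours, not from the MIAS implementation.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred, classes):
    """Eq. (10): recall = tp / (tp + fn); Eq. (11): precision = tp / (tp + fp),
    computed per elementary class from segment-level labels."""
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    truth = Counter(y_true)    # tp + fn per class
    pred = Counter(y_pred)     # tp + fp per class
    recall = {c: tp[c] / truth[c] if truth[c] else 0.0 for c in classes}
    precision = {c: tp[c] / pred[c] if pred[c] else 0.0 for c in classes}
    return precision, recall

# Toy ground-truth and predicted labels for six segments (invented data).
y_true = ["sky", "rock", "sky", "water", "grass", "sky"]
y_pred = ["sky", "sky",  "sky", "water", "rock",  "sky"]
classes = ["sky", "rock", "water", "grass"]
precision, recall = per_class_precision_recall(y_true, y_pred, classes)
avg_precision = sum(precision.values()) / len(classes)  # macro average
avg_recall = sum(recall.values()) / len(classes)
```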
The highest recall, over 57%, was obtained for the elementary classes 'grass' - ID 11, 'ground' - ID 12, 'rock' - ID 16, 'sky' - ID 20, 'train' - ID 24 and 'trees' - ID 25. For the elementary classes 'bear' - ID 2, 'building' - ID 4, 'cheetah' - ID 5, 'coral' - ID 7, 'dolphin' - ID 8, 'fox' - ID 10 and 'zebra' - ID 28, the obtained value for both precision and recall is zero. Some of the reasons for this outcome are the small number of samples available for a particular elementary class (e.g., for the class building we had only 24 segments, for the class dolphin only 20 segments), the large diversity of features within a class (e.g., instances of the class coral differ significantly in color), as well as errors in segmentation.

Fig. 16. A precision/recall graph for the automatic image interpretation with MIAS at layer MI1.

The obtained results, given as outputs of MIAS at layer MI1, are compared to the results of the models published in Carbonetto et al. (2004). The results of the automatic image annotation obtained for the mentioned set of images with the dMRF model defined in Carbonetto et al. (2004) and the dInd model from Duygulu et al. (2002) are published in Carbonetto et al. (2004). The dMRF model uses the method of Markov random fields for automatic image annotation, while the dInd model is an example of a translation model that treats image annotation as the translation between two discrete languages. The authors have reported the precision for the task of automatic image annotation for each of the 28 keywords in the vocabulary achieved by both models. Comparing the results of the automatic annotation on images related to outdoor scenes, the dMRF model achieves an average precision of 21%, while the dInd model achieves an average precision of 20%. As specified in Table 1, the average precision of MIAS at layer MI1 exceeds the precision of both models, although our system has correctly predicted fewer classes, 21 out of 28 possible.

Table 1
Comparison of the results achieved with MIAS at MI1, dInd, and dMRF.

Models                                   MIAS-MI1   dInd    dMRF
Number of correctly predicted classes    21         23      24
Average precision                        32.6%      19.9%   21%

Comparing the results presented in Table 1, it should be noted that, for learning, the models dInd and dMRF used image labels, while our approach uses labels of the image segments. Because of the supervised learning approach, somewhat better results were expected. However, the achieved difference in average precision is significant, even though the dMRF model took the context into account. Note that the given results of our system MIAS at layer MI1 (presented in Table 1) are without any inconsistency checking.

Generally, the results achieved by automatic image annotation models on outdoor image domains are relatively poor, and the question is whether they can meet user requirements when retrieving or organizing images.
Often the results of automatic annotation depend on the quality of the segmentation, so when an image has a lot of segments and when an object is over-segmented, the results can include labels that do not correspond to the object or context of an image. Here, to mitigate this problem, the obtained results of the image interpretation at the layer MI1 are analyzed with the fuzzy inheritance algorithm. The aim is to purify the classification results of class labels that do not match the likely context of the image. To do so, the facts from the knowledge base related to the relationships between elementary classes, as well as the reliability of the relationships computed with (5), are used with the fuzzy inheritance algorithm (a simplified sketch of this filtering step is given after Table 2). Using inconsistency checking, those elementary classes that are obtained as a result of the image interpretation at layer MI1 and do not fit the likely context are discarded. As a consequence, the precision of the image interpretation at layer MI1 is increased up to 43%. A further improvement of the precision could be achieved by defining additional relationships between the elementary classes.

Afterwards, automatic image interpretation at layer MI2 of the MIAS is performed by the fuzzy-recognition algorithm, using the elementary classes obtained as results of image interpretation at layer MI1 and using knowledge about the particular domain. To define the relationships between the scenes and the elementary classes automatically, and to determine their reliability, we have analyzed the elementary classes and scenes related to each image. Therefore, we have supplemented the existing vocabulary with 20 classes related to the scenes, such as 'Airplane Scene', 'Bird Scene', 'Sea', etc. Then we have used these classes of 475 images to make a data set to be used for scene recognition. The data was divided into training and test subsets (70:30) by a 5-fold holdout cross-validation. Data in the training set were used to produce the rules about relationships between scenes and elementary classes according to (3), and to learn a classification model of each scene class using a Bayes classifier, according to (4).

The obtained precision of automatic image interpretation at layer MI2 is 61% and the recall is 55%. The results at the layer MI2 depend on the results at layer MI1. For those scenes for which there is one main object class that is highly discriminative for that scene (e.g., train for Train Scene), it is crucial to detect that object. In this kind of scene, background objects that are common to most scenes do not play an important role, but in scenes without one prominent object (e.g., Sea, Inland) they are important. Additionally, the inheritance algorithm is used to infer generalized classes related to a scene class, which make up the interpretation at the layer MI3, and derived classes at the layer MI4. In Table 2, some examples of multi-layered image annotation obtained by MIAS are shown.

Table 2
Examples of multi-layered image annotation by MIAS (four example images, one per column; the images themselves are not reproduced here).

MI1: 'shuttle' - ID 19 | 'train' - ID 24, 'tracks' - ID 23, 'sky' - ID 20 | 'grass' - ID 11, 'tiger' - ID 22 | 'water' - ID 26, 'sand' - ID 18, 'sky' - ID 20, 'rock' - ID 16
MI2: 'Shuttle Scene' | 'Train Scene' | 'Tiger Scene' | 'Seaside'
MI3: 'Vehicle', 'Man-Made Object', 'Outdoor' | 'Vehicle', 'Man-Made Object', 'Outdoor' | 'Wildcat', 'Wildlife', 'Natural Scenes', 'Outdoor Scene' | 'Natural Scenes', 'Outdoor Scene'
MI4: 'Space' | 'Transport' | 'Savannah' | 'Vacation'
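The sketch below illustrates the inconsistency check described above, assuming the knowledge base is reduced to pairwise occurs_with reliabilities between elementary classes. The values and the simple thresholding rule are illustrative stand-ins for the fuzzy inheritance analysis, not the authors' algorithm.

```python
# Pairwise occurs_with reliabilities between elementary classes
# (symmetric; all values invented for illustration).
COOCCURS = {
    ("sky", "water"): 0.90, ("sky", "rock"): 0.80,
    ("water", "sand"): 0.85, ("rock", "sand"): 0.70,
    ("sky", "train"): 0.20,
}

def support(a, b):
    return COOCCURS.get((a, b)) or COOCCURS.get((b, a)) or 0.0

def drop_intruders(labels, threshold=0.3):
    """Keep a label only if it is sufficiently consistent with at least
    one other label found in the image -- a thresholded stand-in for
    the fuzzy inheritance analysis described above."""
    kept = []
    for lab in labels:
        best = max((support(lab, other) for other in labels if other != lab),
                   default=0.0)
        if best >= threshold:
            kept.append(lab)
    return kept

print(drop_intruders(["sky", "water", "sand", "train"]))
# ['sky', 'water', 'sand'] -- 'train' does not fit the seaside context
```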
10. Conclusion

The aim of the present research was to automatically annotate images with words that are used in an intuitive image search. These words correspond to concepts on different levels of abstraction, in order to enable simple retrieval and organization of images. These concepts cannot be simply mapped to features but require additional reasoning with general and domain-specific knowledge, which in the context of image interpretation can often be incomplete, imprecise, and ambiguous. Therefore, the ability to handle uncertainty and to reason with fuzzy knowledge turned out to be an important property. We have developed the fuzzy-knowledge-based intelligent system MIAS for multi-layered image annotation, supported by a fuzzy inference engine that is capable of approximate reasoning and uses the available knowledge to draw conclusions about image semantics.

In order to bridge the semantic gap between the visual content of an image and the image semantics, the MIAS system deals with the visual content of images (low-level features) and with image semantics (elementary, scene, generalized and derived classes) that are inspired by human image interpretation and represented on layers MI1–MI4. We have merged the statistical approach for classification of image segments with a knowledge-based approach to infer more abstract concepts. For classification of image segments into elementary classes below the MI1 layer, a Bayesian classifier is used. The architecture of the MIAS system facilitates its compatibility with various classification methods, so other classification methods can be used as well. The fuzzy knowledge base is built using a fuzzy knowledge-representation scheme based on the Fuzzy Petri Net (KRFPN) formalism. The hierarchical arrangement of the KRFPN schemes used in MIAS allows the schemes to be independently used, modified and connected with each other into a new hierarchical structure, e.g., to expand the knowledge base with new concepts that may be synonyms or with concepts on different semantic levels.

The acquisition of knowledge was facilitated so that all the facts and rules on the composition and distribution of concepts, as well as their reliability, are produced automatically from data in a training set. A human expert explicitly specifies only the facts about general knowledge and heuristics about the particular domain. Both new relationships and new concepts with appropriate reliabilities can be stored in the knowledge base and used by the inference engine.

The approximate-reasoning capability of the inference engine supported in the KRFPN scheme was used in an original way for automatic scene recognition, for inference of more abstract classes, as well as for inconsistency checking of the classified image segments. The concepts obtained by classification of image segments at the layer MI1 (elementary classes) were treated as components of scenes at the layer MI2 that can be inferred by further analysis with the inference engine. The concepts thus obtained were then used for inferring generalized and derived classes related to the image at the layers MI3 and MI4.

Since the decisions about more abstract concepts can be made even when the input information about the concepts present in an image is imprecise and vague, errors can be propagated through the hierarchical structure of concepts and affect the inference on higher levels.
To reduce this problem, we have proposed a novel consistency-checking procedure that checks the consistency of the obtained elementary classes at the layer MI1 with the determined image context and discards the intruder classes, to increase the reliability of conclusions as well as to improve the precision of image annotation.

The results of image annotation at layer MI1 of the MIAS were compared with the published results of automatic image annotation (Carbonetto et al., 2004; Duygulu et al., 2002) on the same set of images and using the same image features. It has been shown that the supervised learning approach provides significantly better results than the unsupervised methods used in Carbonetto et al. (2004) and Duygulu et al. (2002), even when they take the context into account (Carbonetto et al., 2004). After the inconsistency checking is performed, the results of image annotation at layer MI1 are significantly improved in terms of average precision. Additionally, the proposed MIAS system supports the recognition of scenes and reasoning about the related concepts at different levels of abstraction, to mimic the way people interpret images and to enrich the image annotation with concepts that people would most likely use when searching for these images.

The main contributions of the presented research in the field of expert and intelligent systems are related to the definition of the fuzzy knowledge-representation scheme KRFPN for automatic multi-layered image annotation and to the novel and original use of the approximate-reasoning capabilities of the inference engine for inferring the semantics of images. The main advantage of the proposed multi-layered image annotation system MIAS is the fusion of low-level image features and knowledge-based concepts related to the semantics of an image. Another advantage stems from the connection between the statistical and knowledge-based approaches, which exploits their respective strengths: statistical methods are used to facilitate knowledge acquisition, to automatically generate relationships between concepts, and to compute their reliability. Other strengths of the MIAS system arise from the original use of the fuzzy inference engine for scene recognition and for reasoning about more abstract concepts, as well as the novel use of the inference engine for checking the consistency of concepts to reduce error propagation through the hierarchical structure of the scheme. Thanks to the KRFPN formalism, the MIAS system proved to be successful in coping with incomplete, imprecise, uncertain and ambiguous knowledge. The rules in the knowledge base of MIAS can be visualized using Fuzzy Petri Nets, and conclusions can be directly understood using the inference trees. Another advantage of the proposed system that arises from the KRFPN formalism is the ability to be extended by adding new rules and to be adapted to a new domain by acquiring new facts and adapting the fuzzy knowledge base.

The proposed system architecture facilitates the knowledge-acquisition phase, but due to the automatic generation of rules, a larger training set of images is needed. The automatically generated rules strongly depend on the data set used, so when images in the training set are not representative, the automatically generated rules on spatial relationships between objects in the images and on the relationships between objects and scenes may not be general enough, and their reliability may not be properly set. Therefore, after development, the system should be additionally tuned for accuracy. Although the
architecture of the MIAS system is general, the limitation is that it cannot be immediately used for new applications or domains. The images should be preprocessed to obtain low-level features, a new vocabulary should be defined, and new rules created, either automatically using the training data set or provided by an expert.

This research was oriented to the domain of outdoor images, so we plan to implement and test the proposed system in new domains and with very large image databases. Since the architecture of the MIAS system facilitates its compatibility with various classification methods, for the first layer of image interpretation we will examine different classification methods and methods of probability estimation, as well as optimized mechanisms for extracting visual features. In future research, we plan to expand the proposed model and to examine the possibilities of its adaptation for the annotation of videos, for the recognition of activities in image or video content, and for the prediction of future actions. Therefore, we will examine the possibility of including fuzzy spatial and temporal relations in the MIAS system, as well as explore the required adaptations of the formalisms to be used in the system.

Acknowledgment

This work has been fully supported by the Croatian Science Foundation under the project 6733 De-identification for Privacy Protection in Surveillance Systems (DePPSS).

References

Athanasiadis, T., et al. (2009). Integrating image segmentation and classification for fuzzy knowledge-based multimedia. In Proceedings of the MMM2009.
Barnard, K., Duygulu, P., Forsyth, D., Freitas, N., Blei, D. M., & Jordan, M. I. (2003). Matching words and pictures. Journal of Machine Learning Research, 3, 1107–1135.
Benitez, A. B., Smith, J. R., & Chang, S. F. (2000). MediaNet: A multimedia information network for knowledge representation. In Proceedings of IS&T/SPIE: vol. 4210. MA.
Blei, D., & Jordan, M. (2003). Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 127–134).
Carbonetto, P., Freitas, N. de, & Barnard, K. (2004). A statistical model for general contextual object recognition. In Proceedings of ECCV 2004, Czech Republic (pp. 350–362).
Chen, S. M., Ke, J. S., & Chang, J. F. (1990). Knowledge representation using fuzzy Petri nets. IEEE Transactions on Knowledge and Data Engineering, 2(3), 311–319.
Chengjian, S., Zhu, S., & Shi, Z. (2015). Image annotation via deep neural network. In Proceedings of the IEEE 14th IAPR International Conference on Machine Vision Applications (MVA) (pp. 518–521).
Datta, R., Joshi, D., & Li, J. (2008). Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 20, 1–60.
Dong, P. T. (2014). A survey of refining image annotation techniques. International Journal of Multimedia & Ubiquitous Engineering, 9(3).
Duygulu, P., Barnard, K., de Freitas, J. F. G., & Forsyth, D. A. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of the European Conference on Computer Vision (pp. 97–112).
Eakins, J., & Graham, M. (2000). Content-based image retrieval. Technical Report JTAP-039, JISC, Institute for Image Data Research, University of Northumbria, Newcastle.
Fei-Fei, L., & Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR): vol. 2 (pp. 524–531). IEEE.
Feng, S., & Xu, D. (2010). Transductive multi-instance multi-label learning algorithm with application to automatic image annotation. Expert Systems with Applications, 37(1), 661–670.
Hare, J. S., Lewis, P. H., Enser, P. G. B., & Sandom, C. J. (2006). Mind the gap: Another look at the problem of the semantic gap in image retrieval. In Proceedings of Multimedia Content Analysis, Management and Retrieval, San Jose, California, USA.
Hong, R., Wang, M., Gao, Y., Tao, D., Li, X., & Wu, X. (2014). Image annotation by multiple-instance learning with discriminative feature mapping and selection. IEEE Transactions on Cybernetics, 44(5), 669–680.
Hu, J., & Lam, K. M. (2013). An efficient two-stage framework for image annotation. Pattern Recognition, 46(3), 936–947.
Ivasic-Kos, M., Pavlic, M., & Pobar, M. (2009). Analyzing the semantic level of outdoor image annotation. In Proceedings of the 32nd IEEE International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) (pp. 293–296). Opatija, Croatia.
Ivasic-Kos, M., Ribarić, S., & Ipsic, I. (2010). Image annotation using fuzzy knowledge representation scheme. In Proceedings of the IEEE 2010 International Conference of Soft Computing and Pattern Recognition (pp. 218–223). Paris, France.
Li, X., & Lara-Rosano, F. (2000). Adaptive fuzzy Petri nets for dynamic knowledge representation and inference. Expert Systems with Applications, 19(3), 235–241.
Li, J., & Wang, J. Z. (2003). Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9), 1075–1088.
Liu, Y., Zhang, D., Lu, G., & Ma, W. Y. (2007). A survey of content-based image retrieval with high-level semantics. Pattern Recognition, 40(1), 262–282.
Maillot, N. E. (2005). Ontology based object learning and recognition (PhD thesis). Université de Nice-Sophia Antipolis.
Marques, O., & Barman, N. (2003). Semi-automatic semantic annotation of images using machine learning techniques. In Proceedings of the International Semantic Web Conference (pp. 550–565).
Nezamabadi-pour, H., & Kabir, E. (2009). Concept learning by fuzzy k-NN classification and relevance feedback for efficient image retrieval. Expert Systems with Applications, 36(3), 5948–5954.
Papadopoulos, G. T. H., Saathoff, C., Escalante, H. J., Mezaris, V., Kompatsiaris, I., & Strintzis, M. G. (2011). A comparative study of object-level spatial context techniques for semantic image analysis. Computer Vision and Image Understanding, 115(9), 1288–1307.
Peterson, J. L. (1981). Petri net theory and the modeling of systems. Prentice Hall.
Ribarić, S., & Pavešić, N. (2009). Inference procedures for fuzzy knowledge representation scheme. Applied Artificial Intelligence, 23, 16–43.
Shatford, S. (1986). Analyzing the subject of a picture: A theoretical approach. Cataloging & Classification Quarterly, 5(3), 39–61.
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.
Simou, N., Athanasiadis, T., Stoilos, G., & Kollias, S. (2008). Image indexing and retrieval using expressive fuzzy description logics. Signal, Image and Video Processing, 2, 321–335. Springer.
Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349–1380.
Srikanth, M., Varner, J., Bowden, M., & Moldovan, D. (2005). Exploiting ontologies for automatic image annotation. In Proceedings of SIGIR '05 (pp. 552–558).
Stoilos, G., Stamou, G., Tzouvaras, V., Pan, J. Z., & Horrocks, I. (2005). The fuzzy description logic f-SHIN. In Proceedings of the International Workshop on Uncertainty Reasoning for the Semantic Web.
Tousch, A. M., Herbin, S., & Audibert, J. Y. (2012). Semantic hierarchies for image annotation: A survey. Pattern Recognition, 45(1), 333–345.
Yin, H., Jiao, X., Chai, Y., & Fang, B. (2015). Scene classification based on single-layer SAE and SVM. Expert Systems with Applications, 42(7), 3368–3380.
Yu, Y., Pedrycz, W., & Miao, D. (2014). Multi-label classification by exploiting label correlations. Expert Systems with Applications, 41(6), 2989–3004.
Zhang, D., Islam, M. M., & Lu, G. (2012). A review on automatic image annotation techniques. Pattern Recognition, 45(1), 346–362.