Probabilistic approaches to rough sets

Y.Y. Yao
Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2
E-mail: yyao@cs.uregina.ca

Abstract: Probabilistic approaches to rough sets in granulation, approximation and rule induction are reviewed. The Shannon entropy function is used to quantitatively characterize partitions of a universe. Both algebraic and probabilistic rough set approximations are studied. The probabilistic approximations are defined in a decision-theoretic framework. The problem of rule induction, a major application of rough set theory, is studied in probabilistic and information-theoretic terms. Two types of rules are analyzed: the local, low order rules, and the global, high order rules.

Keywords: rough set approximations, granular computing, decision-theoretic rough set model, rule induction, high order rules, belief functions

1. Introduction

As a recently renewed research topic, granular computing is an umbrella term that covers any theories, methodologies, techniques and tools that make use of granules (i.e. subsets of a universe) in problem solving (Zadeh, 1979, 1997; Yao, 2000; Lin et al., 2002). The basic guiding principle of granular computing is to 'exploit the tolerance for imprecision, uncertainty and partial truth to achieve tractability, robustness, low solution cost and better rapport with reality' (Zadeh, 1997). This principle offers a more practical philosophy for real-world problem solving: instead of searching for the optimal solution, one may search for good approximate solutions. The theory of rough sets provides a special and concrete model of granular computing (Pawlak, 1982, 1991).

Three related issues of granulation, approximation and rule induction are central to studies of rough sets. Granulation of a universe involves the decomposition of the universe into families of subsets, or the clustering of elements into groups. It leads to a collection of granules, with a granule being a clump of points (objects) drawn together by indistinguishability, similarity, proximity or functionality (Zadeh, 1997). Granulation may produce either a single-level flat structure or a multi-level hierarchical structure (Yao, 2001a). The theory of rough sets uses equivalence relations to represent relationships between elements of a universe. An equivalence relation induces a single-level granulation, namely, a partition of the universe.

A natural consequence of granulation is approximation. With respect to an equivalence relation, some subsets cannot be exactly expressed in terms of the equivalence classes and must be approximately represented by a pair of lower and upper approximations. By extending the approximations of subsets to a family of subsets, it is possible to study the approximation of a partition.

Rule induction deals with finding relationships between concepts. With each granule, or a family of granules, representing instances of a certain concept, one can study rule induction in set-theoretic terms (Yao, 2001b). Approximations of subsets and families of subsets offer insights and methods for rule induction (Pawlak, 1991).

Although mainstream research in rough set theory has been dominated by algebraic and non-probabilistic studies, probabilistic approaches have been applied to the theory ever since its inception (Wong & Ziarko, 1987; Pawlak et al., 1988; Yao et al., 1990; Yao & Wong, 1992; Düntsch & Gediga, 2001).
More specifically, many authors implicitly used a probabilistic approach by counting the number of elements of a set. On the other hand, there is still a lack of systematic study of probabilistic approaches in a unified and general framework. The main objective of this paper is to provide a critical analysis and review of probabilistic and information-theoretic approaches to rough sets. With respect to the three related issues of granulation, approximation and rule induction, we focus the discussion on probability-related measures. Such a comprehensive study provides a solid basis for further study of probabilistic rough set theory.

2. Approximation space and information granulation

The underlying notion for granulation in rough sets is equivalence relations or partitions. Let $U$ be a finite and nonempty universe. A binary relation $E \subseteq U \times U$ on $U$ is called an equivalence relation if it is reflexive, symmetric and transitive. A partition of $U$ is a collection of nonempty and pairwise disjoint subsets of $U$ whose union is $U$. Each subset in a partition is also called a block. There is a one-to-one relationship between equivalence relations and partitions. For an equivalence relation $E$, the equivalence class

$$[x]_E = \{y \in U \mid yEx\} \qquad (1)$$

consists of all elements equivalent to $x$, and is the equivalence class containing the element $x$. The family of equivalence classes

$$U/E = \{[x]_E \mid x \in U\} \qquad (2)$$

is a partition of the universe $U$. On the other hand, given a partition $\pi$ of the universe, one can uniquely define an equivalence relation $E_\pi$:

$$x E_\pi y \Longleftrightarrow x \text{ and } y \text{ are in the same block of the partition } \pi \qquad (3)$$

In this paper, we will use equivalence relations and partitions interchangeably. The pair $apr = (U, E)$ is called an approximation space, indicating the intended application of the partition $U/E$ for approximation (Pawlak, 1982). The partition $U/E$ is commonly known as the quotient set and provides a granulated view of the universe.

A coarse-grained view of the universe may arise in several ways. For instance, the equivalence relation is derived based on available knowledge. Due to a lack of information or vague information, some distinct objects cannot be differentiated (Pawlak, 1982). That is, the available information only allows us to talk about an equivalence class as a whole instead of many individuals. In some situations, it may only be possible to observe or measure equivalence classes. It may also happen that a coarse-grained view is sufficient for a particular problem (Zhang & Zhang, 1992; Zadeh, 1997).

In the granulated view, equivalence classes are the basic building blocks and are called elementary (equivalence) granules. They are the smallest nonempty subsets that can be defined, observed or measured. From elementary granules, we can construct larger granules by taking unions of elementary granules. It is reasonable to assume that one can define, observe and measure these granules through the information and knowledge on the equivalence granules. The set of all definable granules, denoted by $\sigma(U/E)$, consists of the empty set $\emptyset$, the entire universe $U$, and unions of equivalence classes. The system $\sigma(U/E)$ is closed under set complement, intersection and union. It is a sub-Boolean algebra of the Boolean algebra formed by the power set $2^U$ of $U$, and a $\sigma$-algebra of subsets of $U$ generated by the family of equivalence classes $U/E$. In addition, $U/E$ is the basis of the $\sigma$-algebra $\sigma(U/E)$.
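To make these constructions concrete, the following is a minimal Python sketch; the universe and the parity relation are hypothetical choices for illustration only. It computes the equivalence classes of equation (1) and the quotient set of equation (2).

```python
U = {1, 2, 3, 4, 5, 6}

def E(x, y):
    # A hypothetical equivalence relation: two numbers are
    # equivalent when they have the same parity.
    return x % 2 == y % 2

def equivalence_class(x, U, E):
    # [x]_E = {y in U | y E x}  -- equation (1)
    return frozenset(y for y in U if E(y, x))

def quotient_set(U, E):
    # U/E = {[x]_E | x in U}  -- equation (2); a partition of U
    return {equivalence_class(x, U, E) for x in U}

print(quotient_set(U, E))  # {frozenset({1, 3, 5}), frozenset({2, 4, 6})}
```

Unions of the resulting blocks then give exactly the definable granules of $\sigma(U/E)$.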
Each partition represents one granulated view of the universe. Granulated views induced by all partitions form a partition lattice. The order relation of the lattice is defined as follows. A partition $\pi_1$ is a refinement of another partition $\pi_2$, or equivalently $\pi_2$ is a coarsening of $\pi_1$, denoted by $\pi_1 \preceq \pi_2$, if every block of $\pi_1$ is contained in some block of $\pi_2$, or equivalently each block of $\pi_2$ is a union of some blocks of $\pi_1$. In terms of equivalence relations, we have $U/E_1 \preceq U/E_2$ if and only if $E_1 \subseteq E_2$. Given two partitions $\pi_1$ and $\pi_2$, their meet $\pi_1 \wedge \pi_2$ is the largest partition which is a refinement of both $\pi_1$ and $\pi_2$, and their join $\pi_1 \vee \pi_2$ is the smallest partition which is a coarsening of both $\pi_1$ and $\pi_2$. The meet has all nonempty intersections of a block from $\pi_1$ and a block from $\pi_2$ as its blocks. The blocks of the join are the smallest subsets which are exactly a union of blocks from both $\pi_1$ and $\pi_2$. In terms of equivalence relations, given two equivalence relations $E_1$ and $E_2$, the meet of $U/E_1$ and $U/E_2$ is defined by the equivalence relation $E_1 \cap E_2$, and the join is defined by the equivalence relation $(E_1 \cup E_2)^*$, the transitive closure of the relation $E_1 \cup E_2$. The finest partition is $\{\{x\} \mid x \in U\}$, consisting of the singleton subsets of $U$, and the coarsest partition is $\{U\}$.

The partition lattice clearly shows the structure of different granulations of the universe. It can be used to search for a suitable level of granulation for problem solving (Zhang & Zhang, 1992). Many machine learning algorithms using rough sets are based on a search of the partition lattice (Yao & Yao, 2002).

Information-theoretic measures can be used to quantify the degree of granularity of each partition (Lee, 1987; Düntsch & Gediga, 2001; Yao, 2003b). With respect to a partition $\pi = \{A_1, A_2, \ldots, A_m\}$, we have a probability distribution

$$P_\pi = \left( \frac{|A_1|}{|U|}, \frac{|A_2|}{|U|}, \ldots, \frac{|A_m|}{|U|} \right) \qquad (4)$$

where $|\cdot|$ denotes the cardinality of a set. The Shannon entropy function of the probability distribution is defined by

$$H(\pi) = H(P_\pi) = -\sum_{i=1}^{m} \frac{|A_i|}{|U|} \log \frac{|A_i|}{|U|} \qquad (5)$$

The entropy reaches the maximum value $\log |U|$ for the finest partition consisting of singleton subsets of $U$, and it reaches the minimum value $0$ for the coarsest partition $\{U\}$. In general, for two partitions with $\pi_1 \preceq \pi_2$, we have $H(\pi_1) \geq H(\pi_2)$. That is, the value of the entropy correctly reflects the order of partitions with respect to their granularity.

Additional support for using the entropy as a measure of granularity can be seen as follows. We can re-express equation (5) as

$$H(\pi) = \log |U| - \sum_{i=1}^{m} \frac{|A_i|}{|U|} \log |A_i| \qquad (6)$$

The first term is a constant independent of any partition. The quantity $\log |A_i|$ is commonly known as the Hartley measure of information of the set $A_i$. It has been used to measure the amount of uncertainty associated with a finite set of possible alternatives, namely the nonspecificity inherent in the set (Klir & Folger, 1988). The function $\log |A_i|$ is a monotonic increasing transformation of the size of a set. It may be used to measure the granularity of the set. Large sets result in higher degrees of granularity than small sets. The second term of the equation is basically an expectation of granularity with respect to all subsets in a partition. It follows that we can use the following function as a measure of granularity for a partition:

$$G(\pi) = \sum_{i=1}^{m} \frac{|A_i|}{|U|} \log |A_i| \qquad (7)$$

In contrast to the entropy function, for two partitions $\pi_1$ and $\pi_2$ with $\pi_1 \preceq \pi_2$, we have $G(\pi_1) \leq G(\pi_2)$. The coarsest partition $\{U\}$ has the maximum granularity value $\log |U|$, and the finest partition $\{\{x\} \mid x \in U\}$ has the minimum granularity value $0$.
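As an illustration, here is a small Python sketch of the entropy (5) and granularity (7) measures; the three partitions below are toy examples of our own choosing.

```python
from math import log2

def entropy(partition, U):
    # H(pi) = -sum (|A_i|/|U|) log(|A_i|/|U|)  -- equation (5)
    n = len(U)
    return -sum(len(A) / n * log2(len(A) / n) for A in partition)

def granularity(partition, U):
    # G(pi) = sum (|A_i|/|U|) log|A_i|  -- equation (7)
    n = len(U)
    return sum(len(A) / n * log2(len(A)) for A in partition)

U = set(range(8))
finest = [{x} for x in U]                    # H = log|U| = 3, G = 0
middle = [set(range(4)), set(range(4, 8))]   # H = 1, G = 2
coarsest = [U]                               # H = 0, G = log|U| = 3

for pi in (finest, middle, coarsest):
    print(entropy(pi, U), granularity(pi, U))
```

In each case $H(\pi) + G(\pi) = \log |U|$, as equation (6) requires, and refining a partition raises $H$ while lowering $G$.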
3. Rough set approximations

In this section we discuss approximations of sets and approximations of probabilities, as well as probabilistic approximations of sets.

3.1. Approximations of sets

Consider an approximation space $apr = (U, E)$. The set of definable subsets is given by $\sigma(U/E)$. For a subset $A \subseteq U$, the greatest definable set contained in $A$ is called the lower approximation of $A$, written $\underline{apr}_{U/E}(A)$, and the least definable set containing $A$ is called the upper approximation of $A$, written $\overline{apr}_{U/E}(A)$. The subscript $U/E$ indicates that the approximations are defined with respect to the partition $U/E$. When no confusion arises, we simply drop $U/E$. Lower and upper approximations can be expressed as

$$\underline{apr}(A) = \bigcup \{X \mid X \in \sigma(U/E),\ X \subseteq A\}$$
$$\overline{apr}(A) = \bigcap \{X \mid X \in \sigma(U/E),\ X \supseteq A\} \qquad (8)$$

In terms of equivalence classes, lower and upper approximations can be expressed by

$$\underline{apr}(A) = \bigcup_{[x]_E \subseteq A} [x]_E$$
$$\overline{apr}(A) = \bigcup_{[x]_E \cap A \neq \emptyset} [x]_E \qquad (9)$$

The lower approximation $\underline{apr}(A)$ is the union of equivalence classes which are subsets of $A$. The upper approximation $\overline{apr}(A)$ is the union of equivalence classes which have a nonempty intersection with $A$.

One may interpret $\underline{apr}, \overline{apr}: 2^U \to 2^U$ as two unary set-theoretic operators, called approximation operators. The system $(2^U, \neg, \underline{apr}, \overline{apr}, \cap, \cup)$ is called a Pawlak rough set algebra (Yao, 1996). It is an extension of the set algebra $(2^U, \neg, \cap, \cup)$. Properties of approximation operators, pertinent to our discussion, are summarized below:

(i) $\underline{apr}(A) = \neg\overline{apr}(\neg A)$, $\overline{apr}(A) = \neg\underline{apr}(\neg A)$;
(ii) $\underline{apr}(A) = \overline{apr}(A) = A$, for $A \in \sigma(U/E)$;
(iii) $\underline{apr}(A) \subseteq A \subseteq \overline{apr}(A)$;
(iv) $\underline{apr}(A \cap B) = \underline{apr}(A) \cap \underline{apr}(B)$, $\overline{apr}(A \cup B) = \overline{apr}(A) \cup \overline{apr}(B)$;
(v) $\underline{apr}(A \cup B) \supseteq \underline{apr}(A) \cup \underline{apr}(B)$, $\overline{apr}(A \cap B) \subseteq \overline{apr}(A) \cap \overline{apr}(B)$.

Property (i) shows that lower and upper approximations are dual to each other. Property (ii) indicates that the lower and upper approximations of a definable set are the set itself. By property (iii), a set lies within its lower and upper approximations. Property (iv) states that the lower approximation distributes over intersection, and the upper approximation distributes over union. Property (v) shows the sub-distributivity of approximation operators.

Many probability-related measures on approximations have been proposed and studied. Pawlak (1982, 1991) suggested an accuracy measure of rough set approximation given by

$$\alpha(A) = \frac{|\underline{apr}(A)|}{|\overline{apr}(A)|} = P(\underline{apr}(A) \mid \overline{apr}(A)) \qquad (10)$$

It may be interpreted as the probability that an element belongs to the lower approximation, given that the element belongs to the upper approximation. This measure can also be expressed in terms of the well-known Marczewski–Steinhaus metric (Yao, 2001a). Measures of the quality of lower and upper approximations are given respectively by Pawlak (1991):

$$\underline{q}(A) = \frac{|\underline{apr}(A)|}{|U|} = P(\underline{apr}(A))$$
$$\overline{q}(A) = \frac{|\overline{apr}(A)|}{|U|} = P(\overline{apr}(A)) \qquad (11)$$

They are referred to as rough probability by Pawlak (1984) and have been used by many authors (Grzymala-Busse, 1987; Wong & Lingras, 1989; Yao & Lingras, 1998; Düntsch & Gediga, 2001).
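Equations (9)–(11) transcribe directly into code. The following sketch uses a toy partition of our own; the function names are ours.

```python
def lower(A, partition):
    # apr_(A): union of blocks contained in A  -- equation (9)
    return set().union(*(blk for blk in partition if blk <= A))

def upper(A, partition):
    # apr^(A): union of blocks intersecting A  -- equation (9)
    return set().union(*(blk for blk in partition if blk & A))

U = set(range(6))
partition = [{0, 1}, {2, 3}, {4, 5}]
A = {0, 1, 2}

lo, up = lower(A, partition), upper(A, partition)
print(lo, up)               # {0, 1} {0, 1, 2, 3}
print(len(lo) / len(up))    # accuracy alpha(A) = 0.5    -- equation (10)
print(len(lo) / len(U))     # quality q_(A) ~ 0.333      -- equation (11)
print(len(up) / len(U))     # quality q^(A) ~ 0.667      -- equation (11)
```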
The relationship between the accuracy and quality of approximations can be expressed as

$$\alpha(A) = \frac{\underline{q}(A)}{\overline{q}(A)} \qquad (12)$$

The accuracy measure can be re-expressed as

$$\alpha(A) = \frac{|\underline{apr}(A)|}{|\overline{apr}(A)|} = \frac{|\underline{apr}(A)|}{|U| - |\underline{apr}(\neg A)|} \qquad (13)$$

which suggests that the accuracy measure also depends on the lower approximation of $\neg A$. Based on this observation, Gediga and Düntsch (2001) suggested use of the function

$$\gamma(A) = \frac{|\underline{apr}(A)|}{|A|} = P(\underline{apr}(A) \mid A) \qquad (14)$$

as a measure of the precision of the deterministic approximation of $A$. Using the same argument, we suggest that the quality of the non-deterministic approximation of $A$ can be measured by

$$\overline{\gamma}(A) = \frac{|A|}{|\overline{apr}(A)|} = P(A \mid \overline{apr}(A)) \qquad (15)$$

In this case, we have

$$\alpha(A) = \gamma(A)\,\overline{\gamma}(A) \qquad (16)$$

The two measures $\underline{q}$ and $\gamma$ monotonically increase as $\underline{apr}(A)$ approaches $A$ when different partitions are used. On the other hand, as $\overline{apr}(A)$ approaches $A$, the measures $\overline{q}$ and $\overline{\gamma}$ change in opposite directions: $\overline{q}$ decreases while $\overline{\gamma}$ increases. For two partitions $\pi_1$ and $\pi_2$ with $\pi_1 \preceq \pi_2$, we have

(vi) $\underline{apr}_{\pi_1}(A) \supseteq \underline{apr}_{\pi_2}(A)$, $\overline{apr}_{\pi_1}(A) \subseteq \overline{apr}_{\pi_2}(A)$.

By combining these with (iii), we obtain

$$\underline{apr}_{\pi_2}(A) \subseteq \underline{apr}_{\pi_1}(A) \subseteq A \subseteq \overline{apr}_{\pi_1}(A) \subseteq \overline{apr}_{\pi_2}(A) \qquad (17)$$

As expected, a finer partition induces a tighter approximation. All of the measures correctly reflect this observation, as shown by the following properties:

(1) $\alpha_{\pi_1}(A) \geq \alpha_{\pi_2}(A)$;
(2) $\underline{q}_{\pi_1}(A) \geq \underline{q}_{\pi_2}(A)$, $\overline{q}_{\pi_1}(A) \leq \overline{q}_{\pi_2}(A)$;
(3) $\gamma_{\pi_1}(A) \geq \gamma_{\pi_2}(A)$, $\overline{\gamma}_{\pi_1}(A) \geq \overline{\gamma}_{\pi_2}(A)$.

That is, for two partitions with $\pi_1 \preceq \pi_2$, we obtain the same qualitative evaluation by all those measures, namely, $\pi_1$ is the same as or better than $\pi_2$. For an arbitrary pair of partitions, the pair of $\underline{q}$ and $\gamma$, or the pair of $\overline{q}$ and $\overline{\gamma}$, produce the same qualitative evaluation, which may be different from the one given by the accuracy measure $\alpha$ (Gediga & Düntsch, 2001).

The approximation of a subset can be easily extended to the approximation of a family of subsets. Consider two partitions $\pi_A = \{A_1, A_2, \ldots, A_n\}$ and $\pi_B = \{B_1, B_2, \ldots, B_m\}$. We construct an approximation space $apr_{\pi_A} = (U, E_{\pi_A})$ using the partition $\pi_A$. Each equivalence class of $\pi_B$ is approximated by $\underline{apr}_{\pi_A}(B_i)$ and $\overline{apr}_{\pi_A}(B_i)$. By extending the accuracy and quality measures to the approximation of a partition, Pawlak (1991) suggested the following quantities:

$$\alpha_{\pi_A}(\pi_B) = \frac{\sum_{i=1}^{m} |\underline{apr}_{\pi_A}(B_i)|}{\sum_{i=1}^{m} |\overline{apr}_{\pi_A}(B_i)|}$$
$$\gamma_{\pi_A}(\pi_B) = \frac{\sum_{i=1}^{m} |\underline{apr}_{\pi_A}(B_i)|}{|U|} \qquad (18)$$

Furthermore, $\gamma_{\pi_A}$ can be expressed as, respectively, the expectation of $\gamma$, a weighted sum of $\alpha$, and the sum of $\underline{q}$ on individual equivalence classes (Gediga & Düntsch, 2001):

$$\gamma_{\pi_A}(\pi_B) = \sum_{i=1}^{m} \frac{|B_i|}{|U|}\,\gamma_{\pi_A}(B_i) = \sum_{i=1}^{m} P(B_i)\,\gamma_{\pi_A}(B_i)$$
$$\gamma_{\pi_A}(\pi_B) = \sum_{i=1}^{m} \frac{|\overline{apr}_{\pi_A}(B_i)|}{|U|}\,\alpha_{\pi_A}(B_i) = \sum_{i=1}^{m} P(\overline{apr}_{\pi_A}(B_i))\,\alpha_{\pi_A}(B_i) \qquad (19)$$
$$\gamma_{\pi_A}(\pi_B) = \sum_{i=1}^{m} \underline{q}_{\pi_A}(B_i)$$

The overall measure $\alpha_{\pi_A}(\pi_B)$ cannot be similarly expressed. It is reasonable to use as an alternative overall measure

$$\alpha'_{\pi_A}(\pi_B) = \sum_{i=1}^{m} \alpha_{\pi_A}(B_i) \qquad (20)$$

In fact, $\alpha_{\pi_A}(\pi_B)$ and $\alpha'_{\pi_A}(\pi_B)$ represent two different averaging methods: one is the application of a measure to the pooled results, and the other is the average of the measurements on the individual results.

Given a subset $B \subseteq U$, we can partition the universe as $\{B, \neg B\}$. By applying the partition-based measure $\gamma_{\pi_A}$, Düntsch and Gediga (2001) suggested the following measure of the approximation quality of $\pi_A$ with respect to $B$:

$$\gamma'_{\pi_A}(B) = \gamma_{\pi_A}(\pi_B) = \underline{q}_{\pi_A}(B) + \underline{q}_{\pi_A}(\neg B) \qquad (21)$$

This measure is the ratio of correct classification of either $B$ or $\neg B$ based on the partition $\pi_A$.
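Under the same toy set-up as before, the partition-level measures of equation (18) can be sketched as follows; `lower` and `upper` are repeated from the previous sketch so that the block is self-contained.

```python
def lower(A, blocks):
    return set().union(*(b for b in blocks if b <= A))

def upper(A, blocks):
    return set().union(*(b for b in blocks if b & A))

def partition_accuracy(piA, piB):
    # alpha_{pi_A}(pi_B): pooled lower over pooled upper sizes -- eq. (18)
    low = sum(len(lower(B, piA)) for B in piB)
    up = sum(len(upper(B, piA)) for B in piB)
    return low / up

def partition_quality(piA, piB, U):
    # gamma_{pi_A}(pi_B): total lower coverage of U -- eq. (18)
    return sum(len(lower(B, piA)) for B in piB) / len(U)

U = set(range(6))
piA = [{0, 1}, {2, 3}, {4, 5}]
piB = [{0, 1, 2}, {3, 4, 5}]
print(partition_accuracy(piA, piB))    # 4 pooled lower / 8 pooled upper = 0.5
print(partition_quality(piA, piB, U))  # 4/6 ~ 0.667
```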
3.2. Approximations of probabilities

In an approximation space $apr = (U, E)$, suppose a set function is defined on $\sigma(U/E)$. One can extend the function to nondefinable subsets through the lower and upper approximations. Many authors have studied the approximation of probabilities in the framework of rough sets, which leads to belief functions (Pawlak, 1984; Grzymala-Busse, 1987; Skowron, 1989, 1990; Wong & Lingras, 1989; Skowron & Grzymala-Busse, 1994; Yao & Lingras, 1998).

A belief function is a mapping from $2^U$ to the unit interval $[0, 1]$ that satisfies the following axioms.

(F1) $Bel(\emptyset) = 0$.
(F2) $Bel(U) = 1$.
(F3) For every positive integer $n$ and every collection $A_1, \ldots, A_n \subseteq U$,
$$Bel(A_1 \cup A_2 \cup \ldots \cup A_n) \geq \sum_i Bel(A_i) - \sum_{i<j} Bel(A_i \cap A_j) + \ldots + (-1)^{n+1} Bel(A_1 \cap \ldots \cap A_n)$$

Axioms (F1) and (F2) may be considered as normalization conditions. Axiom (F3) is a weaker version of the commonly known additivity axiom of probability functions. It is referred to as the axiom of superadditivity. The dual of a belief function, called a plausibility function $Pl$, is defined by

$$Pl(A) = 1 - Bel(\neg A) \qquad (22)$$

For any subset $A \subseteq U$, $Bel(A) \leq Pl(A)$.

Consider first a simple case where a probability function $P$ on $2^U$ is defined based on the counting of elements in a set (Grzymala-Busse, 1987; Skowron & Grzymala-Busse, 1994); namely, for $A \subseteq U$,

$$P(A) = \frac{|A|}{|U|} \qquad (23)$$

Clearly, we have

$$\underline{q}(A) = P(\underline{apr}(A)) \leq P(A) \leq P(\overline{apr}(A)) = \overline{q}(A) \qquad (24)$$

While $\underline{apr}(A)$ and $\overline{apr}(A)$ are approximations of the set $A$, $\underline{q}(A)$ and $\overline{q}(A)$ are approximations of the probability of the set $A$. It can easily be verified by using properties (i)–(iv) that the qualities of lower and upper approximations are a pair of belief and plausibility functions.

Suppose $P$ is a probability function defined on $\sigma(U/E)$. It is not defined for subsets of $U$ which are not members of $\sigma(U/E)$. One can extend $P$ to $2^U$ in two standard ways by defining functions $P_*$ and $P^*$, traditionally called the inner measure and the outer measure induced by $P$. For an arbitrary subset $A \subseteq U$, we define

$$P_*(A) = \sup\{P(X) \mid X \in \sigma(U/E),\ X \subseteq A\} = P(\underline{apr}(A))$$
$$P^*(A) = \inf\{P(X) \mid X \in \sigma(U/E),\ X \supseteq A\} = P(\overline{apr}(A)) \qquad (25)$$

Pawlak (1984) referred to the pair $(P_*(A), P^*(A))$ as the rough probability of $A$. The inner and outer probabilities $P_*$ and $P^*$ are a pair of belief and plausibility functions (Wong & Lingras, 1989; Fagin & Halpern, 1991; Yao & Lingras, 1998).
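The correspondence between rough approximations and belief/plausibility functions, equations (22)–(25), can be checked numerically. This sketch reuses the `lower` and `upper` operators of section 3.1 on a toy approximation space.

```python
def lower(A, blocks):
    return set().union(*(b for b in blocks if b <= A))

def upper(A, blocks):
    return set().union(*(b for b in blocks if b & A))

def bel(A, blocks, U):
    # Bel(A) = P(apr_(A)) = q_(A): the inner measure  -- equations (24), (25)
    return len(lower(A, blocks)) / len(U)

def pl(A, blocks, U):
    # Pl(A) = P(apr^(A)) = q^(A): the outer measure  -- equations (24), (25)
    return len(upper(A, blocks)) / len(U)

U = set(range(6))
blocks = [{0, 1}, {2, 3}, {4, 5}]
A = {0, 1, 2}

# Duality (22): Pl(A) = 1 - Bel(U - A), via apr^(A) = -apr_(-A)
assert abs(pl(A, blocks, U) - (1 - bel(U - A, blocks, U))) < 1e-12
print(bel(A, blocks, U), pl(A, blocks, U))  # 0.333... <= P(A) = 0.5 <= 0.666...
```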
3.3. Probabilistic rough set approximations

Algebraic rough set approximations may be considered as qualitative approximations of a set. The extent of overlap between a set and an equivalence class is not considered. By incorporating the overlap, many authors have introduced and studied probabilistic rough set approximations (Wong & Ziarko, 1987; Pawlak et al., 1988; Ziarko, 1993; Pawlak & Skowron, 1994). Most proposals introduced certain parameters based on intuitive arguments. The decision-theoretic rough set model provides a solid basis for probabilistic approximations (Yao et al., 1990; Yao & Wong, 1992).

We first briefly review the Bayesian decision-theoretic framework (Duda & Hart, 1973). Let $\Omega = \{\omega_1, \ldots, \omega_s\}$ be a finite set of $s$ states, and let $\mathcal{A} = \{a_1, \ldots, a_m\}$ be a finite set of $m$ possible actions. Let $P(\omega_j \mid \mathbf{x})$ be the conditional probability of an object $x$ being in state $\omega_j$ given that the object is described by $\mathbf{x}$. Let $\lambda(a_i \mid \omega_j)$ denote the loss, or cost, for taking action $a_i$ when the state is $\omega_j$. For an object with description $\mathbf{x}$, suppose an action $a_i$ is taken. Since $P(\omega_j \mid \mathbf{x})$ is the probability that the true state is $\omega_j$ given $\mathbf{x}$, the expected loss associated with taking action $a_i$ is given by

$$R(a_i \mid \mathbf{x}) = \sum_{j=1}^{s} \lambda(a_i \mid \omega_j)\,P(\omega_j \mid \mathbf{x}) \qquad (26)$$

The quantity $R(a_i \mid \mathbf{x})$ is called the conditional risk. Given a description $\mathbf{x}$, a decision procedure is a function $\tau(\mathbf{x})$ that specifies which action to take. For every $\mathbf{x}$, $\tau(\mathbf{x})$ chooses one action from $a_1, \ldots, a_m$. The overall risk $R$ is the expected loss associated with a given decision procedure. Since $R(\tau(\mathbf{x}) \mid \mathbf{x})$ is the conditional risk associated with action $\tau(\mathbf{x})$, the overall risk is defined by

$$R = \sum_{\mathbf{x}} R(\tau(\mathbf{x}) \mid \mathbf{x})\,P(\mathbf{x}) \qquad (27)$$

where the summation is over the set of all possible descriptions of objects. One can obtain an optimal decision procedure by minimizing the overall risk. If $\tau(\mathbf{x})$ is chosen so that $R(\tau(\mathbf{x}) \mid \mathbf{x})$ is as small as possible for every $\mathbf{x}$, the overall risk $R$ is minimized. The Bayesian decision procedure can therefore be formally stated as follows. For every $\mathbf{x}$, compute the conditional risk $R(a_i \mid \mathbf{x})$ for $i = 1, \ldots, m$ defined by equation (26), and then select the action for which the conditional risk is the minimum. If more than one action minimizes $R(a_i \mid \mathbf{x})$, any tie-breaking rule can be used.

The Bayesian decision procedure can be applied to define probabilistic rough set approximations. Given a subset $A \subseteq U$, we can form a set of two states $\Omega = \{A, \neg A\}$, indicating that an element is in $A$ or not in $A$, respectively. We use the same symbol to denote both a subset $A$ and the corresponding state. In the non-probabilistic rough set model, with respect to $A$, we divide the universe $U$ into three disjoint regions, the positive region $POS(A)$, the negative region $NEG(A)$ and the boundary region $BND(A)$:

$$POS(A) = \underline{apr}(A)$$
$$NEG(A) = U - \overline{apr}(A)$$
$$BND(A) = \overline{apr}(A) - \underline{apr}(A) \qquad (28)$$

In developing a probabilistic rough set model, with respect to the three regions, the set of actions is given by $\mathcal{A} = \{a_1, a_2, a_3\}$, where $a_1$, $a_2$ and $a_3$ represent the three actions in classifying an object: deciding $POS(A)$, deciding $NEG(A)$ and deciding $BND(A)$, respectively. The symbol $[x]_E$, the equivalence class containing $x$, is also used to represent a description of $x$. The required conditional probabilities are defined by the rough membership functions (Pawlak & Skowron, 1994)

$$\mu_A(x) = \frac{|[x]_E \cap A|}{|[x]_E|} = P(A \mid [x]_E) \qquad (29)$$

Let $\lambda(a_i \mid A)$ denote the loss incurred for taking action $a_i$ when an object in fact belongs to $A$, and let $\lambda(a_i \mid \neg A)$ denote the loss incurred for taking the same action when the object does not belong to $A$. The expected losses $R(a_i \mid [x]_E)$ associated with taking the individual actions can be expressed as

$$R(a_1 \mid [x]_E) = \lambda_{11}\,P(A \mid [x]_E) + \lambda_{12}\,P(\neg A \mid [x]_E)$$
$$R(a_2 \mid [x]_E) = \lambda_{21}\,P(A \mid [x]_E) + \lambda_{22}\,P(\neg A \mid [x]_E)$$
$$R(a_3 \mid [x]_E) = \lambda_{31}\,P(A \mid [x]_E) + \lambda_{32}\,P(\neg A \mid [x]_E) \qquad (30)$$

where $\lambda_{i1} = \lambda(a_i \mid A)$, $\lambda_{i2} = \lambda(a_i \mid \neg A)$ and $i = 1, 2, 3$. The Bayesian decision procedure leads to the following minimum-risk decision rules.

(P) If $R(a_1 \mid [x]_E) \leq R(a_2 \mid [x]_E)$ and $R(a_1 \mid [x]_E) \leq R(a_3 \mid [x]_E)$, decide $POS(A)$.
(N) If $R(a_2 \mid [x]_E) \leq R(a_1 \mid [x]_E)$ and $R(a_2 \mid [x]_E) \leq R(a_3 \mid [x]_E)$, decide $NEG(A)$.
(B) If $R(a_3 \mid [x]_E) \leq R(a_1 \mid [x]_E)$ and $R(a_3 \mid [x]_E) \leq R(a_2 \mid [x]_E)$, decide $BND(A)$.

Tie-breaking rules should be added so that each element is classified into only one region. Since $P(A \mid [x]_E) + P(\neg A \mid [x]_E) = 1$, the above decision rules can be simplified such that only the probabilities $P(A \mid [x]_E)$ are involved. We can classify any object in the equivalence class $[x]_E$ based only on the probabilities $P(A \mid [x]_E)$, i.e. the rough membership values, and the given loss function $\lambda_{ij}$ ($i = 1, 2, 3$; $j = 1, 2$).
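A minimal sketch of this minimum-risk classification of an equivalence class, following equations (29) and (30); the loss matrix is a hypothetical choice (cf. the loss function of equation (33) discussed below), and the function names are ours.

```python
def rough_membership(block, A):
    # mu_A(x) = |[x]_E & A| / |[x]_E| = P(A | [x]_E)  -- equation (29)
    return len(block & A) / len(block)

def min_risk_action(p, losses):
    # R(a_i | [x]_E) = l_i1 * P(A|[x]) + l_i2 * P(~A|[x])  -- equation (30)
    # losses[i] = (l_i1, l_i2) for a1 = POS, a2 = NEG, a3 = BND;
    # ties are broken in favour of the earlier action.
    risks = [l1 * p + l2 * (1 - p) for l1, l2 in losses]
    return ['POS', 'NEG', 'BND'][risks.index(min(risks))], risks

# Hypothetical losses: unit cost for a wrong definite decision,
# half a unit for deferring to the boundary.
losses = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
block, A = {1, 2, 3, 4}, {1, 2, 3}
p = rough_membership(block, A)     # 0.75
print(min_risk_action(p, losses))  # ('POS', [0.25, 0.75, 0.5])
```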
Consider a special kind of loss function with $\lambda_{11} \leq \lambda_{31} < \lambda_{21}$ and $\lambda_{22} \leq \lambda_{32} < \lambda_{12}$. That is, the loss of classifying an object $x$ belonging to $A$ into the positive region $POS(A)$ is less than or equal to the loss of classifying $x$ into the boundary region $BND(A)$, and both of these losses are strictly less than the loss of classifying $x$ into the negative region $NEG(A)$. The reverse order of losses is used for classifying an object that does not belong to $A$. For this type of loss function, the minimum-risk decision rules (P), (N), (B) can be written as

(P) if $P(A \mid [x]_E) \geq \gamma$ and $P(A \mid [x]_E) \geq \alpha$, decide $POS(A)$;
(N) if $P(A \mid [x]_E) \leq \beta$ and $P(A \mid [x]_E) \leq \gamma$, decide $NEG(A)$;
(B) if $\beta \leq P(A \mid [x]_E) \leq \alpha$, decide $BND(A)$;

where

$$\alpha = \frac{\lambda_{12} - \lambda_{32}}{(\lambda_{31} - \lambda_{32}) - (\lambda_{11} - \lambda_{12})}$$
$$\gamma = \frac{\lambda_{12} - \lambda_{22}}{(\lambda_{21} - \lambda_{22}) - (\lambda_{11} - \lambda_{12})}$$
$$\beta = \frac{\lambda_{32} - \lambda_{22}}{(\lambda_{21} - \lambda_{22}) - (\lambda_{31} - \lambda_{32})} \qquad (31)$$

By the assumptions $\lambda_{11} \leq \lambda_{31} < \lambda_{21}$ and $\lambda_{22} \leq \lambda_{32} < \lambda_{12}$, it follows that $\alpha \in (0, 1]$, $\gamma \in (0, 1)$ and $\beta \in [0, 1)$.

A loss function should be chosen in such a way as to satisfy the condition $\alpha \geq \beta$. This ensures that the results are consistent with rough set approximations; namely, the lower approximation is a subset of the upper approximation, and the boundary region may be nonempty. When $\alpha > \beta$, we have $\alpha > \gamma > \beta$. After tie-breaking, we obtain the decision rules

(P1) if $P(A \mid [x]_E) \geq \alpha$, decide $POS(A)$;
(N1) if $P(A \mid [x]_E) \leq \beta$, decide $NEG(A)$;
(B1) if $\beta < P(A \mid [x]_E) < \alpha$, decide $BND(A)$.

When $\alpha = \beta$, we have $\alpha = \gamma = \beta$. In this case, we use the decision rules

(P2) if $P(A \mid [x]_E) > \alpha$, decide $POS(A)$;
(N2) if $P(A \mid [x]_E) < \alpha$, decide $NEG(A)$;
(B2) if $P(A \mid [x]_E) = \alpha$, decide $BND(A)$.

For the second set of decision rules, we use a tie-breaking criterion so that the boundary region may be nonempty.

The standard and other probabilistic rough set models can be easily derived by choosing different loss functions. Consider the loss function

$$\lambda_{12} = \lambda_{21} = 1, \quad \lambda_{11} = \lambda_{22} = \lambda_{31} = \lambda_{32} = 0 \qquad (32)$$

There is a unit cost if an object belonging to $A$ is classified into the negative region or if an object not belonging to $A$ is classified into the positive region; otherwise there is no cost. In this case, we have $\alpha = 1 > \beta = 0$, $\alpha = 1 - \beta$ and $\gamma = 0.5$. According to decision rules (P1), (N1), (B1), we obtain the standard rough set approximations (Pawlak, 1982, 1991). Another loss function is given by

$$\lambda_{12} = \lambda_{21} = 1, \quad \lambda_{31} = \lambda_{32} = 0.5, \quad \lambda_{11} = \lambda_{22} = 0 \qquad (33)$$

A unit cost is incurred if the system classifies an object belonging to $A$ into the negative region or an object not belonging to $A$ into the positive region; half of a unit cost is incurred if any object is classified into the boundary region. There is no cost for other cases. It follows that $\alpha = \beta = \gamma = 0.5$. By using decision rules (P2), (N2), (B2), we obtain the probabilistic rough set approximation proposed by Pawlak et al. (1988).

Suppose a loss function with $\lambda_{11} \leq \lambda_{31} < \lambda_{21}$ and $\lambda_{22} \leq \lambda_{32} < \lambda_{12}$ satisfies the conditions

$$\lambda_{12} - \lambda_{32} \geq \lambda_{31} - \lambda_{11}$$
$$(\lambda_{12} - \lambda_{32})(\lambda_{32} - \lambda_{22}) = (\lambda_{31} - \lambda_{11})(\lambda_{21} - \lambda_{31}) \qquad (34)$$

We have $\alpha = 1 - \beta \geq 0.5$. This leads to the variable precision rough set model (Ziarko, 1993).
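The threshold formulae (31) and rules (P1), (N1), (B1) translate directly into code. The first two loss functions below are those of equations (32) and (33); the third is a hypothetical choice satisfying condition (34).

```python
def thresholds(l11, l12, l21, l22, l31, l32):
    # alpha, gamma, beta from the loss function  -- equation (31)
    alpha = (l12 - l32) / ((l31 - l32) - (l11 - l12))
    gamma = (l12 - l22) / ((l21 - l22) - (l11 - l12))
    beta = (l32 - l22) / ((l21 - l22) - (l31 - l32))
    return alpha, gamma, beta

def region(p, alpha, beta):
    # decision rules (P1), (N1), (B1), assuming alpha > beta
    if p >= alpha:
        return 'POS'
    if p <= beta:
        return 'NEG'
    return 'BND'

print(thresholds(0, 1, 1, 0, 0, 0))        # (1.0, 0.5, 0.0): standard model, eq. (32)
print(thresholds(0, 1, 1, 0, 0.5, 0.5))    # (0.5, 0.5, 0.5): Pawlak et al. (1988), eq. (33)
print(thresholds(0, 1, 1, 0, 0.25, 0.25))  # (0.75, 0.5, 0.25): alpha = 1 - beta, eq. (34)
```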
4. Probabilistic measures for rule induction

An important application of rough sets is data analysis and rule induction (Wong & Ziarko, 1986; Pawlak, 1991; Grzymala-Busse, 1992; Tsumoto, 1998). This section reviews probabilistic and information-theoretic measures used in rule induction algorithms (Yao & Zhong, 1999; Yao, 2003b).

4.1. Information tables

An information table provides a convenient way to describe a finite set of objects by a finite set of attributes (Pawlak, 1991). In this paper, we use an extended information table obtained by adding binary relations on attribute values, and two languages (Yao, 2001b). Formally, an information table can be expressed as

$$S = (U, At, L_v, L_r, \{V_a \mid a \in At\}, \{R_a \mid a \in At\}, \{I_a \mid a \in At\})$$

where $U$ is a finite nonempty set of objects, $At$ is a finite nonempty set of attributes, $L_v$ is a language dealing with values of objects, $L_r$ is a language dealing with relations of objects, $V_a$ is a nonempty set of values for $a \in At$, $R_a$ is a nonempty set of binary relations on $V_a$, $a \in At$, and $I_a: U \to V_a$ is an information function. Each information function $I_a$ is a total function that maps an object of $U$ to exactly one value in $V_a$. An information table represents all available information and knowledge. That is, objects are only perceived, observed or measured by using a finite number of properties.

In the language $L_v$, an atomic formula is given by $(a, R, v)$, where $a \in At$, $R \in R_a$ and $v \in V_a$. If $\phi$ and $\psi$ are formulae, then so are $\neg\phi$, $\phi \wedge \psi$, $\phi \vee \psi$, $\phi \to \psi$ and $\phi \leftrightarrow \psi$. The semantics of the language $L_v$ can be defined in the Tarski style through the notions of a model and satisfiability. The model is an information table $S$, which provides an interpretation for symbols and formulae of $L_v$. The satisfiability of a formula $\phi$ by an object $x$, written $x \models_S \phi$, or in short $x \models \phi$ if $S$ is understood, is given by the following conditions:

(m1) $x \models (a, R, v)$ iff $I_a(x)\, R\, v$,
(m2) $x \models \neg\phi$ iff not $x \models \phi$,
(m3) $x \models \phi \wedge \psi$ iff $x \models \phi$ and $x \models \psi$,
(m4) $x \models \phi \vee \psi$ iff $x \models \phi$ or $x \models \psi$,
(m5) $x \models \phi \to \psi$ iff $x \models \neg\phi \vee \psi$,
(m6) $x \models \phi \leftrightarrow \psi$ iff $x \models \phi \to \psi$ and $x \models \psi \to \phi$.

If $\phi$ is a formula, the set $m_S(\phi)$ defined by

$$m_S(\phi) = \{x \in U \mid x \models \phi\} \qquad (35)$$

is called the meaning of the formula $\phi$ in $S$. If $S$ is understood, we simply write $m(\phi)$. The following properties hold:

(a) $m(a, R, v) = \{x \in U \mid I_a(x)\, R\, v\}$,
(b) $m(\neg\phi) = \neg m(\phi)$,
(c) $m(\phi \wedge \psi) = m(\phi) \cap m(\psi)$,
(d) $m(\phi \vee \psi) = m(\phi) \cup m(\psi)$,
(e) $m(\phi \to \psi) = \neg m(\phi) \cup m(\psi)$,
(f) $m(\phi \leftrightarrow \psi) = (m(\phi) \cap m(\psi)) \cup (\neg m(\phi) \cap \neg m(\psi))$.

The meaning of a formula $\phi$ is therefore the set of all objects having the property expressed by the formula $\phi$. In other words, $\phi$ can be viewed as the description of the set of objects $m(\phi)$. Thus, a connection between formulae of $L_v$ and subsets of $U$ is established. When the relation $R$ is chosen to be the equality relation $=$, we obtain the conventional decision logic language (Pawlak, 1991).

With the introduction of the language $L_v$, we have a formal description of concepts. A concept definable in an information table is a pair $(\phi, m(\phi))$, where $\phi \in L_v$. More specifically, $\phi$ is a description of $m(\phi)$ in $S$, the intension of the concept $(\phi, m(\phi))$, and $m(\phi)$ is the set of objects satisfying $\phi$, the extension of the concept $(\phi, m(\phi))$. A concept $(\phi, m(\phi))$ is said to be a sub-concept of another concept $(\psi, m(\psi))$, or $(\psi, m(\psi))$ a super-concept of $(\phi, m(\phi))$, if $m(\phi) \subseteq m(\psi)$. A concept $(\phi, m(\phi))$ is said to be the smallest nonempty concept in $S$ if there does not exist another proper nonempty sub-concept of $(\phi, m(\phi))$. Two concepts $(\phi, m(\phi))$ and $(\psi, m(\psi))$ are disjoint if $m(\phi) \cap m(\psi) = \emptyset$. If $m(\phi) \cap m(\psi) \neq \emptyset$, we say that the two concepts have a nonempty overlap and hence are related.
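As a sketch, the meaning-set semantics of equation (35) and properties (a)–(f) can be implemented compositionally; the three-object table below is hypothetical and the helper names are ours.

```python
import operator

# A hypothetical information table with the equality relation on values.
table = {
    'o1': {'height': 'tall',  'hair': 'dark'},
    'o2': {'height': 'tall',  'hair': 'blond'},
    'o3': {'height': 'short', 'hair': 'blond'},
}
U = set(table)

def m_atomic(a, R, v):
    # m(a, R, v) = {x in U | I_a(x) R v}  -- property (a)
    return {x for x in U if R(table[x][a], v)}

def m_not(s):           # m(~phi) = complement        -- property (b)
    return U - s

def m_and(s, t):        # m(phi ^ psi) = intersection -- property (c)
    return s & t

def m_or(s, t):         # m(phi v psi) = union        -- property (d)
    return s | t

def m_implies(s, t):    # m(phi -> psi)               -- property (e)
    return (U - s) | t

tall = m_atomic('height', operator.eq, 'tall')
blond = m_atomic('hair', operator.eq, 'blond')
print(m_and(tall, blond))      # {'o2'}: meaning of (height,=,tall) ^ (hair,=,blond)
print(m_implies(tall, blond))  # {'o2', 'o3'}
```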
The language $L_r$ is defined in a similar manner to $L_v$, except that an atomic formula is given by $(a, R)$, where $R \in R_a$ and $a \in At$. Semantics of formulae of $L_r$ are interpreted through pairs of objects in $U$. That is,

(m1') $(x, y) \models (a, R)$ iff $I_a(x)\, R\, I_a(y)$.

For a formula $\phi$, the set $m_S(\phi)$ defined by

$$m_S(\phi) = \{(x, y) \in U \times U \mid (x, y) \models \phi\} \qquad (36)$$

is called the meaning set of $\phi$ in $S$. If $S$ is understood, we simply write $m(\phi)$. A pair $(x, y) \in m(\phi)$ is said to satisfy the expression $\phi$. Similarly, the formula $\phi$ can be viewed as the description of the set of object pairs $m(\phi)$, and each object pair in $m(\phi)$ as an instance of the concept given by $\phi$.

4.2. Two types of rules

Knowledge derivable from an information table is commonly represented in the form of rules. Roughly speaking, rules show the connections between attributes, which are normally characterized by the problem of determining the values of one set of attributes based on the values of another set of attributes. Depending on the meanings and forms of rules, we can classify rules in many ways. A clear classification of rules is useful for an understanding of the basic tasks of machine learning and data mining. Rules can be classified into two groups in terms of their directions, one-way and two-way connections, and further classified into two levels in terms of their applicability, local and global connections (Yao & Zhong, 1999; Yao, 2001b, 2003b). A one-way connection shows that the values of one set of attributes determine the values of another set of attributes, but does not say anything about the reverse. A two-way connection is a combination of two one-way connections, representing two different directions of connection. A local connection is characterized by a rule showing the relationship between one specific combination of values on one set of attributes and one specific combination of values on another set of attributes. A global connection is characterized by a rule showing the relationships between all combinations of values on one set of attributes and all combinations of values on another set of attributes.

For clarity, we only consider one-way connections and the equality relation on attribute values, as is commonly done in rough sets. In this case, a local one-way connection is expressed by a rule of the form, using formulae of $L_v$,

$$(a, =, v_a) \Rightarrow (b, =, v_b) \qquad (37)$$

where $a, b \in At$, $v_a \in V_a$ and $v_b \in V_b$. It can be more conveniently expressed as, for $x \in U$,

$$I_a(x) = v_a \Rightarrow I_b(x) = v_b \qquad (38)$$

The rule is commonly paraphrased as 'if the value of an object is $v_a$ on an attribute $a$, then its value is $v_b$ on another attribute $b$'. A global one-way connection is expressed by a rule of the form, using formulae of $L_r$,

$$(a, =) \Rightarrow (b, =) \qquad (39)$$

where $a, b \in At$, or conveniently as, for $(x, y) \in U \times U$,

$$I_a(x) = I_a(y) \Rightarrow I_b(x) = I_b(y) \qquad (40)$$

That is, 'if two objects have the same value on an attribute $a$, then they have the same value on another attribute $b$'. Functional dependence in a database is an example of such global rules.
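A brute-force check of a high order rule such as (40) runs over object pairs. The table below extends the hypothetical one from the previous sketch with a class attribute.

```python
from itertools import combinations

def global_rule_holds(table, W, Z):
    # W => Z: agreement on all attributes in W forces
    # agreement on all attributes in Z  -- equation (40)
    for x, y in combinations(table, 2):
        if all(table[x][a] == table[y][a] for a in W):
            if not all(table[x][b] == table[y][b] for b in Z):
                return False
    return True

table = {
    'o1': {'height': 'tall',  'hair': 'dark',  'class': 'plus'},
    'o2': {'height': 'tall',  'hair': 'blond', 'class': 'minus'},
    'o3': {'height': 'short', 'hair': 'blond', 'class': 'minus'},
}
print(global_rule_holds(table, ['hair'], ['class']))    # True
print(global_rule_holds(table, ['height'], ['class']))  # False: o1, o2 disagree
```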
The formulation of rules using atomic formulae can be easily extended to any formulae of the languages $L_v$ and $L_r$. A local rule states knowledge about one object. A local one-way rule shows that, if the object has a specific value on one set of attributes, then it will have a specific value on another set of attributes. On the other hand, a global rule states knowledge about a pair of objects. A global one-way rule suggests that, if a pair of objects have the same value on one set of attributes, then they will have the same value on another set of attributes. Based on this observation, a global rule is also called a high order rule, while a local rule is called a low order rule (Yao, 2003a).

4.3. Interpretation of rough set theory in information tables

The abstract theory of rough sets can be explained by using an information table. Such an interpretation is useful for rule induction. Let $W = \{W_1, \ldots, W_n\} \subseteq At$ be a set of attributes in an information table. We form a family of elementary formulae $F_W = \{\wedge_{i=1}^{n} (W_i, =, w_i) \mid w_i \in V_{W_i}\}$ of the language $L_v$. For simplicity, let $V_W = V_{W_1} \times \ldots \times V_{W_n}$. We also express the family of elementary formulae by $F_W = \{W = w \mid w \in V_W\}$. The family of nonempty meaning sets forms a partition of the universe, namely

$$\pi_W = \{m(\wedge_{i=1}^{n} (W_i, =, w_i)) \neq \emptyset \mid w_i \in V_{W_i}\} \qquad (41)$$

It is referred to as the partition induced by the set of attributes $W$. For the same set of attributes, we can construct a formula $\wedge_{i=1}^{n} (W_i, =)$ of the language $L_r$. The meaning of the formula,

$$E_W = m(\wedge_{i=1}^{n} (W_i, =)) = \{(x, y) \in U \times U \mid I_{W_i}(x) = I_{W_i}(y),\ 1 \leq i \leq n\} \qquad (42)$$

is an equivalence relation on $U$. Similarly, another set of attributes $Z = \{Z_1, \ldots, Z_m\}$ defines another partition $\pi_Z$ and the corresponding equivalence relation $E_Z$.

Rough set approximations of a single subset are relevant to the induction of local or low order rules. Consider a formula $\wedge_{j=1}^{m} (Z_j, =, z_j)$. In terms of attributes in $W$, we can obtain various local rules of the following format:

$$\wedge_{i=1}^{n} (W_i, =, w_i) \Rightarrow \wedge_{j=1}^{m} (Z_j, =, z_j) \qquad (43)$$

Let $\phi_W = \wedge_{i=1}^{n} (W_i, =, w_i)$ and $\psi_Z = \wedge_{j=1}^{m} (Z_j, =, z_j)$. With respect to the three regions of rough set approximations, we can construct three classes of rules.

(I) Positive region: certain positive rules, $m(\phi_W) \subseteq m(\psi_Z)$, $\phi_W \to \psi_Z$.
(II) Boundary region: uncertain positive rules, $m(\phi_W) \not\subseteq m(\psi_Z)$ and $m(\phi_W) \cap m(\psi_Z) \neq \emptyset$, $\phi_W \Rightarrow \psi_Z$.
(III) Negative region: certain negative rules, $m(\phi_W) \cap m(\psi_Z) = \emptyset$, $\phi_W \to \neg\psi_Z$.

Certain rules can be considered as degenerate cases of uncertain rules. Since certain rules can be interpreted using the logical connective $\to$, we use the same symbol. All three classes of rules express the relationship between two concepts in terms of their meaning sets. The probabilistic measures introduced earlier can be used to quantify the uncertainty of rules. For example, a measure from the rough membership function,

$$\frac{|m(\phi_W) \cap m(\psi_Z)|}{|m(\phi_W)|} \qquad (44)$$

can be used to measure the accuracy of one rule. Other measures associated with approximation can be used to show the characteristics of a set of rules. For instance, the measure suggested by Gediga and Düntsch (2001),

$$\gamma_{\pi_W}(m(\psi_Z)) = \frac{|\underline{apr}_{\pi_W}(m(\psi_Z))|}{|m(\psi_Z)|} = \frac{\sum_{\phi_W \in F_W,\, m(\phi_W) \subseteq m(\psi_Z)} |m(\phi_W)|}{|m(\psi_Z)|} \qquad (45)$$

is the ratio of objects correctly classified by all certain positive rules to the objects satisfying the condition $\psi_Z$. Similarly, the accuracy of approximation $\alpha_{\pi_W}(m(\psi_Z))$ is the ratio of objects correctly classified by all certain positive rules to the objects classified by both certain and uncertain positive rules.

Approximation of one partition based on another partition can be summarized by the following rule:

$$\wedge_{i=1}^{n} (W_i, =) \Rightarrow \wedge_{j=1}^{m} (Z_j, =), \quad \text{or simply} \quad W \Rightarrow Z \qquad (46)$$

The measures $\alpha_{\pi_W}(\pi_Z)$ and $\gamma_{\pi_W}(\pi_Z)$ can be used to measure the strength of this global, high order rule.
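The three classes of rules (I)–(III) and the rule accuracy (44) can be sketched directly over the blocks of $\pi_W$ and the target set $m(\psi_Z)$; the data below are toy values of our own.

```python
def rule_classes(piW, target):
    # Sort each block m(phi_W) of pi_W into rule classes (I)-(III)
    certain, uncertain, negative = [], [], []
    for blk in piW:
        if blk <= target:
            certain.append(blk)    # (I)   m(phi_W) subset of m(psi_Z)
        elif blk & target:
            uncertain.append(blk)  # (II)  nonempty partial overlap
        else:
            negative.append(blk)   # (III) empty intersection
    return certain, uncertain, negative

piW = [{0, 1}, {2, 3}, {4, 5}]
target = {0, 1, 2}                 # m(psi_Z)
cert, unc, neg = rule_classes(piW, target)
print(cert, unc, neg)              # [{0, 1}] [{2, 3}] [{4, 5}]
for blk in unc:
    # accuracy of an uncertain rule  -- equation (44)
    print(len(blk & target) / len(blk))   # 0.5
```

The sizes of `cert` blocks also give the numerator of the measure $\gamma_{\pi_W}(m(\psi_Z))$ in equation (45).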
A more detailed probabilistic and information-theoretic analysis of low and high order rules is given in the following sections (Yao & Zhong, 1999; Yao, 2003b).

4.4. Probabilistic measures for low order rules

Suppose $\phi$ and $\psi$ are two formulae of the language $L_v$. For a rule $\phi \Rightarrow \psi$, its characteristics can be summarized by the following contingency table:

$$\begin{array}{c|cc|c}
 & \psi & \neg\psi & \text{Total} \\
\hline
\phi & a & b & a + b \\
\neg\phi & c & d & c + d \\
\hline
\text{Total} & a + c & b + d & a + b + c + d = |U|
\end{array}$$

where

$$a = |m(\phi \wedge \psi)|, \quad b = |m(\phi \wedge \neg\psi)|, \quad c = |m(\neg\phi \wedge \psi)|, \quad d = |m(\neg\phi \wedge \neg\psi)|$$

Different measures can be defined to reflect various aspects of rules. The generality of $\phi$ is defined by

$$G(\phi) = \frac{|m(\phi)|}{|U|} = \frac{a + b}{|U|} \qquad (47)$$

which indicates the relative size of the concept $\phi$. Obviously, we have $0 \leq G(\phi) \leq 1$. A concept is more general if it covers more instances of the universe. A sub-concept has a lower generality than its super-concept. The quantity may be viewed as the probability of a randomly selected element satisfying $\phi$.

The absolute support of $\psi$ provided by $\phi$ is the quantity

$$AS(\phi \Rightarrow \psi) = AS(\psi \mid \phi) = \frac{|m(\psi) \cap m(\phi)|}{|m(\phi)|} = \frac{a}{a + b} \qquad (48)$$

The quantity $0 \leq AS(\psi \mid \phi) \leq 1$ states the degree to which $\phi$ supports $\psi$. It may be viewed as the conditional probability of a randomly selected element satisfying $\psi$ given that the element satisfies $\phi$. In set-theoretic terms, it is the degree to which $m(\phi)$ is included in $m(\psi)$. Clearly, $AS(\psi \mid \phi) = 1$ if and only if $m(\phi) \neq \emptyset$ and $m(\phi) \subseteq m(\psi)$. That is, a rule with the maximum absolute support 1 is a certain rule.

The change of support of $\psi$ provided by $\phi$ is defined by

$$CS(\phi \Rightarrow \psi) = CS(\psi \mid \phi) = AS(\psi \mid \phi) - G(\psi) = \frac{a}{a + b} - \frac{a + c}{|U|} \qquad (49)$$

Unlike the absolute support, the change of support varies from $-1$ to $1$. We can consider $G(\psi)$ to be the prior probability of $\psi$ and $AS(\psi \mid \phi)$ the posterior probability of $\psi$ after knowing $\phi$. The difference between the posterior and prior probabilities represents the change in our confidence regarding whether $\phi$ is actually related to $\psi$. For a positive value, we may say that $\phi$ is positively related to $\psi$; for a negative value, we may say that $\phi$ is negatively related to $\psi$.

The change of support relative to $\psi$ is given by

$$RCS(\phi \Rightarrow \psi) = \frac{CS(\psi \mid \phi)}{G(\psi)} = \frac{AS(\psi \mid \phi)}{G(\psi)} - 1 = \frac{G(\psi \wedge \phi)}{G(\psi)G(\phi)} - 1 = \frac{|U|\,|m(\psi) \cap m(\phi)|}{|m(\psi)|\,|m(\phi)|} - 1 = \frac{a|U|}{(a + c)(a + b)} - 1 \qquad (50)$$

It is interesting to note that the first term in the relative change of support is related to the probabilistic independence of $\psi$ and $\phi$.

The generality $G(\psi)$ is related to the satisfiability of $\psi$ by all objects in the database, and $AS(\phi \Rightarrow \psi)$ is related to the satisfiability of $\psi$ in the subset $m(\phi)$. A high $AS(\phi \Rightarrow \psi)$ does not necessarily suggest a strong association between $\phi$ and $\psi$, as a concept $\psi$ with a large $G(\psi)$ value tends to have a large $AS(\phi \Rightarrow \psi)$ value. The change of support $CS(\phi \Rightarrow \psi)$, or the relative change of support $RCS(\phi \Rightarrow \psi)$, may be a more accurate indication of the association.
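The four measures (47)–(50) depend only on the contingency counts. A minimal sketch with hypothetical counts:

```python
def rule_measures(a, b, c, d):
    # Generality, absolute support, change of support and relative
    # change of support for phi => psi  -- equations (47)-(50)
    n = a + b + c + d        # |U|
    g_phi = (a + b) / n      # G(phi)
    g_psi = (a + c) / n      # G(psi), the prior probability of psi
    as_ = a / (a + b)        # AS(psi | phi), the posterior probability
    cs = as_ - g_psi         # CS(psi | phi)
    rcs = as_ / g_psi - 1    # RCS(phi => psi)
    return g_phi, as_, cs, rcs

print(rule_measures(40, 10, 20, 30))
# (0.5, 0.8, 0.2, 0.333...): the support 0.8 exceeds the prior 0.6,
# so phi is positively related to psi.
```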
4.5. Information-theoretic measures for high order rules

Recall that a set of attributes $W$ induces a partition $\pi_W$ of the universe. Let

$$P(w) = P(m(W = w)) = \frac{|m(W = w)|}{|U|} \qquad (51)$$

Shannon's entropy function of $\pi_W$, simply written as $H(P(W))$, is given by

$$H(P(W)) = E_{P(W)}[-\log P(W)] = -\sum_{w \in V_W} P(w) \log P(w) \qquad (52)$$

where $E_{P(W)}[\cdot]$ denotes the expected value with respect to the probability distribution of $W$. For two sets of attributes $W$ and $Z$, their joint entropy is defined by

$$H(Z, W) = -\sum_{z \in V_Z} \sum_{w \in V_W} P(z, w) \log P(z, w) \qquad (53)$$

The conditional entropy $H(Z \mid W)$ is defined as the expected value of the subpopulation entropies $H(Z \mid w)$ with respect to the probability distribution $P(W)$:

$$H(Z \mid W) = \sum_{w \in V_W} P(w) H(Z \mid w) = -\sum_{w \in V_W} P(w) \sum_{z \in V_Z} P(z \mid w) \log P(z \mid w) = -\sum_{z \in V_Z} \sum_{w \in V_W} P(z, w) \log P(z \mid w) = E_{P(Z,W)}[-\log P(Z \mid W)] \qquad (54)$$

Conditional entropy is nonnegative and nonsymmetric, namely $H(Z \mid W) \geq 0$ and, in general, $H(Z \mid W) \neq H(W \mid Z)$. Conditional entropy can also be expressed by

$$H(Z \mid W) = H(Z, W) - H(W) \qquad (55)$$

It measures the additional amount of information provided by $Z$ if $W$ is already known.

The probability $P(z)$ is the generality of the granule $m(Z = z)$. The function $-\log P(z)$ is a monotonic decreasing transformation of $P(z)$. As the expected value of $-\log P(z)$, the entropy function is related to the granularity of the partition $\pi_Z$. The probability $P(z \mid w)$ is a measure for the local rule $W = w \Rightarrow Z = z$. As an expected value, the conditional entropy $H(Z \mid W)$ provides a measure for the global rule $W \Rightarrow Z$. It may be viewed as an inverse measure of the global one-way association of two sets of attributes (Pawlak et al., 1988):

$$IC_1(W \Rightarrow Z) = H(Z \mid W) \qquad (56)$$

A normalized version is given by (Pawlak et al., 1988)

$$IC_2(W \Rightarrow Z) = \frac{H(Z \mid W)}{\log |V_Z|} \qquad (57)$$

For an attribute set $Z$, conditional entropy can be used to select important attributes for discovering a one-way association $W \Rightarrow Z$. Measures $IC_1$ and $IC_2$ can be used to rank attributes in increasing order. If one prefers to rank attributes in decreasing order, the following corresponding direct measures of one-way association can be used:

$$C_1(W \Rightarrow Z) = \log |V_Z| - H(Z \mid W) \qquad (58)$$
$$C_2(W \Rightarrow Z) = 1 - \frac{H(Z \mid W)}{\log |V_Z|} \qquad (59)$$

In these measures, the attribute entropy $H(Z)$ may be used in place of $\log |V_Z|$. We obtain the following measures:

$$C_3(W \Rightarrow Z) = H(Z) - H(Z \mid W) = I(Z; W) \qquad (60)$$
$$C_4(W \Rightarrow Z) = 1 - \frac{H(Z \mid W)}{H(Z)} = \frac{I(Z; W)}{H(Z)} \qquad (61)$$

Measure $C_3$ is in fact the mutual information between $W$ and $Z$. It is commonly referred to as information gain and is widely used in machine learning (Quinlan, 1986). Like the change of support for local rules, $C_3$ may be viewed as a change of entropy for global rules. Similarly, $C_4$ may be viewed as a relative change of entropy for global rules.
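The conditional entropy (54) and the information gain $C_3$ of equation (60) can be estimated from a list of observed $(w, z)$ value pairs; the sample data below are hypothetical.

```python
from math import log2
from collections import Counter

def conditional_entropy(pairs):
    # H(Z|W) = -sum P(z, w) log P(z|w)  -- equation (54)
    n = len(pairs)
    joint = Counter(pairs)
    marg_w = Counter(w for w, _ in pairs)
    return -sum(cnt / n * log2(cnt / marg_w[w])
                for (w, _), cnt in joint.items())

def information_gain(pairs):
    # C3(W => Z) = H(Z) - H(Z|W) = I(Z;W)  -- equation (60)
    n = len(pairs)
    h_z = -sum(cnt / n * log2(cnt / n)
               for cnt in Counter(z for _, z in pairs).values())
    return h_z - conditional_entropy(pairs)

pairs = [('w1', 'z1'), ('w1', 'z1'), ('w2', 'z2'), ('w2', 'z1')]
print(conditional_entropy(pairs))  # 0.5
print(information_gain(pairs))     # about 0.311
```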
5. Conclusion

While nonprobabilistic studies of rough sets focus on algebraic and qualitative properties of the theory, probabilistic approaches are more practical and capture quantitative properties of the theory. The granularity of a partition can be quantified by information-theoretic measures. Existing measures of the accuracy and quality of approximations can be expressed as probability-related measures. The probabilistic and information-theoretic approaches are particularly useful in rule induction, an important application of rough set theory.

Most of the measures discussed in this paper are based on simple counting of the number of elements of a set. Furthermore, we have restricted our discussion to granulations by equivalence relations or partitions. It should be pointed out that the argument can be easily extended to more general probability functions and general granulation structures.

References

DUDA, R.O. and P.E. HART (1973) Pattern Classification and Scene Analysis, New York: Wiley.
DÜNTSCH, I. and G. GEDIGA (2001) Rough information analysis, International Journal of Intelligent Systems, 16, 121–147.
FAGIN, R. and J.Y. HALPERN (1991) Uncertainty, belief, and probability, Computational Intelligence, 7, 160–173.
GEDIGA, G. and I. DÜNTSCH (2001) Rough approximation quality revisited, Artificial Intelligence, 132, 219–234.
GRZYMALA-BUSSE, J.W. (1987) Rough-set and Dempster–Shafer approaches to knowledge acquisition under uncertainty—a comparison, Manuscript, Department of Computer Science, University of Kansas.
GRZYMALA-BUSSE, J.W. (1992) LERS—a system for learning from examples based on rough sets, in Intelligent Decision Support: Handbook of Applications and Advances of the Rough Set Theory, R. Slowinski (ed.), Dordrecht: Kluwer Academic, 3–18.
KLIR, G.J. and T.A. FOLGER (1988) Fuzzy Sets, Uncertainty, and Information, Englewood Cliffs, NJ: Prentice Hall.
LEE, T.T. (1987) An information-theoretic analysis of relational databases—part I: data dependencies and information metric, IEEE Transactions on Software Engineering, SE-13, 1049–1061.
LIN, T.Y., Y.Y. YAO and L.A. ZADEH (eds) (2002) Data Mining, Rough Sets and Granular Computing, Heidelberg: Physica-Verlag.
PAWLAK, Z. (1982) Rough sets, International Journal of Computer and Information Sciences, 11, 341–356.
PAWLAK, Z. (1984) Rough probability, Bulletin of the Polish Academy of Sciences, Mathematics, 32, 607–615.
PAWLAK, Z. (1991) Rough Sets: Theoretical Aspects of Reasoning about Data, Dordrecht: Kluwer Academic.
PAWLAK, Z. and A. SKOWRON (1994) Rough membership functions, in Advances in the Dempster–Shafer Theory of Evidence, R.R. Yager, M. Fedrizzi and J. Kacprzyk (eds), New York: Wiley, 251–271.
PAWLAK, Z., S.K.M. WONG and W. ZIARKO (1988) Rough sets: probabilistic versus deterministic approach, International Journal of Man–Machine Studies, 29, 81–95.
QUINLAN, J.R. (1986) Induction of decision trees, Machine Learning, 1, 81–106.
SKOWRON, A. (1989) The relationship between the rough set theory and evidence theory, Bulletin of the Polish Academy of Sciences, Mathematics, 37, 87–90.
SKOWRON, A. (1990) The rough set theory and evidence theory, Fundamenta Informaticae, 13, 245–262.
SKOWRON, A. and J. GRZYMALA-BUSSE (1994) From rough set theory to evidence theory, in Advances in the Dempster–Shafer Theory of Evidence, R.R. Yager, M. Fedrizzi and J. Kacprzyk (eds), New York: Wiley, 193–236.
TSUMOTO, S. (1998) Automated extraction of medical expert system rules from clinical databases based on rough set theory, Information Sciences, 112, 67–84.
WONG, S.K.M. and P.J. LINGRAS (1989) The compatibility view of Shafer–Dempster theory using the concept of rough set, Methodologies for Intelligent Systems, 4, 33–42.
WONG, S.K.M. and W. ZIARKO (1986) Algorithm for inductive learning, Bulletin of the Polish Academy of Sciences, Technical Sciences, 34, 271–276.
WONG, S.K.M. and W. ZIARKO (1987) Comparison of the probabilistic approximate classification and the fuzzy set model, Fuzzy Sets and Systems, 21, 357–362.
YAO, J.T. and Y.Y. YAO (2002) Induction of classification rules by granular computing, Proceedings of the Third International Conference on Rough Sets and Current Trends in Computing, LNAI 2475, Berlin: Springer, 331–338.
YAO, Y.Y. (1996) Two views of the theory of rough sets in finite universes, International Journal of Approximate Reasoning, 15, 291–317.
YAO, Y.Y. (2000) Granular computing: basic issues and possible solutions, Proceedings of the 5th Joint Conference on Information Sciences, Association for Intelligent Machinery, 186–189.
YAO, Y.Y. (2001a) Information granulation and rough set approximation, International Journal of Intelligent Systems, 16, 87–104.
YAO, Y.Y. (2001b) Modeling data mining with granular computing, Proceedings of the 25th Annual International Computer Software and Applications Conference, 638–643.
YAO, Y.Y. (2003a) Mining high order decision rules, in Rough Set Theory and Granular Computing, M. Inuiguchi, S. Hirano and S. Tsumoto (eds), Berlin: Springer, 125–135.
YAO, Y.Y. (2003b) Information-theoretic measures for knowledge discovery and data mining, in Entropy Measures, Maximum Entropy Principle and Emerging Applications, Karmeshu (ed.), Berlin: Springer, 115–136.
YAO, Y.Y. and P.J. LINGRAS (1998) Interpretations of belief functions in the theory of rough sets, Information Sciences, 104, 81–106.
YAO, Y.Y. and S.K.M. WONG (1992) A decision theoretic framework for approximating concepts, International Journal of Man–Machine Studies, 37, 793–809.
YAO, Y.Y. and N. ZHONG (1999) An analysis of quantitative measures associated with rules, Proceedings of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining, LNAI 1574, Berlin: Springer, 479–488.
YAO, Y.Y., S.K.M. WONG and P.J. LINGRAS (1990) A decision-theoretic rough set model, in Methodologies for Intelligent Systems, Vol. 5, Z.W. Ras, M. Zemankova and M.L. Emrich (eds), New York: North-Holland, 17–24.
ZADEH, L.A. (1979) Fuzzy sets and information granularity, in Advances in Fuzzy Set Theory and Applications, M.M. Gupta, R. Ragade and R. Yager (eds), Amsterdam: North-Holland, 3–18.
ZADEH, L.A. (1997) Towards a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems, 90, 111–127.
ZHANG, B. and L. ZHANG (1992) Theory and Applications of Problem Solving, Amsterdam: North-Holland.
ZIARKO, W. (1993) Variable precision rough set model, Journal of Computer and System Sciences, 46, 39–59.

The author

Yiyu Yao

Yiyu Yao is a professor of computer science in the Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada. His research interests are information retrieval, Web intelligence, data mining, fuzzy sets, rough sets and granular computing. He received a PhD degree from the University of Regina, Canada. He has published over 100 journal and conference papers. He is a member of the editorial board of the Web Intelligence and Agent Systems journal (IOS Press). He has served and is serving as a program co-chair of three international conferences, and as a program committee member of over 20 international conferences. He is a member of the IEEE and the ACM.