Probabilistic approaches to rough sets

Y.Y. Yao
Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2
E-mail: yyao@cs.uregina.ca

Abstract: Probabilistic approaches to rough sets in granulation, approximation and rule induction are reviewed. The Shannon entropy function is used to quantitatively characterize partitions of a universe. Both algebraic and probabilistic rough set approximations are studied. The probabilistic approximations are defined in a decision-theoretic framework. The problem of rule induction, a major application of rough set theory, is studied in probabilistic and information-theoretic terms. Two types of rules are analyzed: the local, low order rules, and the global, high order rules.

Keywords: rough set approximations, granular computing, decision-theoretic rough set model, rule induction, high order rules, belief functions

1. Introduction

As a recently renewed research topic, granular computing is an umbrella term that covers any theories, methodologies, techniques and tools that make use of granules (i.e. subsets of a universe) in problem solving (Zadeh, 1979, 1997; Yao, 2000; Lin et al., 2002). The basic guiding principle of granular computing is to 'exploit the tolerance for imprecision, uncertainty and partial truth to achieve tractability, robustness, low solution cost and better rapport with reality' (Zadeh, 1997). This principle offers a more practical philosophy for real-world problem solving: instead of searching for the optimal solution, one may search for good approximate solutions. The theory of rough sets provides a special and concrete model of granular computing (Pawlak, 1982, 1991).

Three related issues of granulation, approximation and rule induction are central to studies of rough sets. Granulation of a universe involves the decomposition of the universe into families of subsets, or the clustering of elements into groups. It leads to a collection of granules, with a granule being a clump of points (objects) drawn together by indistinguishability, similarity, proximity or functionality (Zadeh, 1997). Granulation may produce either a single-level flat structure or a multi-level hierarchical structure (Yao, 2001a). The theory of rough sets uses equivalence relations to represent relationships between elements of a universe. An equivalence relation induces a single-level granulation, namely, a partition of the universe.

A natural consequence of granulation is approximation. With respect to an equivalence relation, some subsets cannot be exactly expressed in terms of the equivalence classes and must be approximately represented by a pair of lower and upper approximations. By extending the approximations of subsets to a family of subsets, it is possible to study the approximation of a partition.

Rule induction deals with finding relationships between concepts. With each granule, or a family of granules, representing instances of a certain concept, one can study rule induction in set-theoretic terms (Yao, 2001b). Approximations of subsets and families of subsets offer insights and methods for rule induction (Pawlak, 1991).

Although mainstream research in rough set theory has been dominated by algebraic and non-probabilistic studies, probabilistic approaches have been applied to the theory ever since its inception (Wong & Ziarko, 1987; Pawlak et al., 1988; Yao et al., 1990; Yao & Wong, 1992; Düntsch & Gediga, 2001).
More specifically, many authors implicitly used a probabilistic approach by counting the number of elements of a set. On the other hand, there is still a lack of systematic study of probabilistic approaches in a unified and general framework. The main objective of this paper is to provide a critical analysis and review of probabilistic and information-theoretic approaches to rough sets. With respect to the three related issues of granulation, approximation and rule induction, we focus the discussion on probability-related measures. Such a comprehensive study provides a solid basis for further study of probabilistic rough set theory.

2. Approximation space and information granulation

The underlying notion for granulation in rough sets is equivalence relations or partitions. Let $U$ be a finite and nonempty universe. A binary relation $E \subseteq U \times U$ on $U$ is called an equivalence relation if it is reflexive, symmetric and transitive. A partition of $U$ is a collection of nonempty and pairwise disjoint subsets of $U$ whose union is $U$. Each subset in a partition is also called a block. There is a one-to-one relationship between equivalence relations and partitions. For an equivalence relation $E$, the equivalence class

$$[x]_E = \{y \in U \mid yEx\} \qquad (1)$$

consists of all elements equivalent to $x$, and is the equivalence class containing the element $x$. The family of equivalence classes

$$U/E = \{[x]_E \mid x \in U\} \qquad (2)$$

is a partition of the universe $U$. On the other hand, given a partition $\pi$ of the universe, one can uniquely define an equivalence relation $E_\pi$:

$$x E_\pi y \Longleftrightarrow x \text{ and } y \text{ are in the same block of the partition } \pi \qquad (3)$$

In this paper, we will use equivalence relations and partitions interchangeably. The pair $apr = (U, E)$ is called an approximation space, indicating the intended application of the partition $U/E$ for approximation (Pawlak, 1982). The partition $U/E$ is commonly known as the quotient set and provides a granulated view of the universe.

A coarse-grained view of the universe may arise in several ways. For instance, the equivalence relation is derived based on available knowledge. Due to a lack of information or vague information, some distinct objects cannot be differentiated (Pawlak, 1982). That is, the available information only allows us to talk about an equivalence class as a whole instead of many individuals. In some situations, it may only be possible to observe or measure equivalence classes. It may also happen that a coarse-grained view is sufficient for a particular problem (Zhang & Zhang, 1992; Zadeh, 1997).

In the granulated view, equivalence classes are the basic building blocks and are called elementary (equivalence) granules. They are the smallest nonempty subsets that can be defined, observed or measured. From elementary granules, we can construct larger granules by taking unions of elementary granules. It is reasonable to assume that one can define, observe and measure these granules through the information and knowledge on the equivalence granules. The set of all definable granules, denoted by $\sigma(U/E)$, consists of the empty set $\emptyset$, the entire universe $U$, and unions of equivalence classes. The system $\sigma(U/E)$ is closed under set complement, intersection and union. It is a sub-Boolean algebra of the Boolean algebra formed by the power set $2^U$ of $U$, and a $\sigma$-algebra of subsets of $U$ generated by the family of equivalence classes $U/E$. In addition, $U/E$ is the basis of the $\sigma$-algebra $\sigma(U/E)$.
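To make these constructions concrete, the following is a minimal Python sketch; the universe and the parity relation are hypothetical choices for illustration only. It computes the equivalence classes of equation (1) and the quotient set of equation (2).

```python
U = {1, 2, 3, 4, 5, 6}

def E(x, y):
    # A hypothetical equivalence relation: two numbers are
    # equivalent when they have the same parity.
    return x % 2 == y % 2

def equivalence_class(x, U, E):
    # [x]_E = {y in U | y E x}  -- equation (1)
    return frozenset(y for y in U if E(y, x))

def quotient_set(U, E):
    # U/E = {[x]_E | x in U}  -- equation (2); a partition of U
    return {equivalence_class(x, U, E) for x in U}

print(quotient_set(U, E))  # {frozenset({1, 3, 5}), frozenset({2, 4, 6})}
```

Unions of the resulting blocks then give exactly the definable granules of $\sigma(U/E)$.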
Each partition represents one granulated view of the universe. Granulated views induced by all partitions form a partition lattice. The order relation of the lattice is defined as follows. A partition $\pi_1$ is a refinement of another partition $\pi_2$, or equivalently $\pi_2$ is a coarsening of $\pi_1$, denoted by $\pi_1 \preceq \pi_2$, if every block of $\pi_1$ is contained in some block of $\pi_2$, or equivalently each block of $\pi_2$ is a union of some blocks of $\pi_1$. In terms of equivalence relations, we have $U/E_1 \preceq U/E_2$ if and only if $E_1 \subseteq E_2$. Given two partitions $\pi_1$ and $\pi_2$, their meet $\pi_1 \wedge \pi_2$ is the largest partition which is a refinement of both $\pi_1$ and $\pi_2$, and their join $\pi_1 \vee \pi_2$ is the smallest partition which is a coarsening of both $\pi_1$ and $\pi_2$. The meet has all nonempty intersections of a block from $\pi_1$ and a block from $\pi_2$ as its blocks. The blocks of the join are the smallest subsets which are exactly a union of blocks from both $\pi_1$ and $\pi_2$. In terms of equivalence relations, given two equivalence relations $E_1$ and $E_2$, the meet of $U/E_1$ and $U/E_2$ is defined by the equivalence relation $E_1 \cap E_2$, and the join is defined by the equivalence relation $(E_1 \cup E_2)^*$, the transitive closure of the relation $E_1 \cup E_2$. The finest partition is $\{\{x\} \mid x \in U\}$, consisting of the singleton subsets of $U$, and the coarsest partition is $\{U\}$.

The partition lattice clearly shows the structure of different granulations of the universe. It can be used to search for a suitable level of granulation for problem solving (Zhang & Zhang, 1992). Many machine learning algorithms using rough sets are based on a search of the partition lattice (Yao & Yao, 2002).

Information-theoretic measures can be used to quantify the degree of granularity of each partition (Lee, 1987; Düntsch & Gediga, 2001; Yao, 2003b). With respect to a partition $\pi = \{A_1, A_2, \ldots, A_m\}$, we have a probability distribution

$$P_\pi = \left( \frac{|A_1|}{|U|}, \frac{|A_2|}{|U|}, \ldots, \frac{|A_m|}{|U|} \right) \qquad (4)$$

where $|\cdot|$ denotes the cardinality of a set. The Shannon entropy function of the probability distribution is defined by

$$H(\pi) = H(P_\pi) = -\sum_{i=1}^{m} \frac{|A_i|}{|U|} \log \frac{|A_i|}{|U|} \qquad (5)$$

The entropy reaches the maximum value $\log |U|$ for the finest partition consisting of singleton subsets of $U$, and it reaches the minimum value $0$ for the coarsest partition $\{U\}$. In general, for two partitions with $\pi_1 \preceq \pi_2$, we have $H(\pi_1) \geq H(\pi_2)$. That is, the value of the entropy correctly reflects the order of partitions with respect to their granularity.

Additional support for using the entropy as a measure of granularity can be seen as follows. We can re-express equation (5) as

$$H(\pi) = \log |U| - \sum_{i=1}^{m} \frac{|A_i|}{|U|} \log |A_i| \qquad (6)$$

The first term is a constant independent of any partition. The quantity $\log |A_i|$ is commonly known as the Hartley measure of information of the set $A_i$. It has been used to measure the amount of uncertainty associated with a finite set of possible alternatives, namely the nonspecificity inherent in the set (Klir & Folger, 1988). The function $\log |A_i|$ is a monotonic increasing transformation of the size of a set. It may be used to measure the granularity of the set. Large sets result in higher degrees of granularity than small sets. The second term of the equation is basically an expectation of granularity with respect to all subsets in a partition. It follows that we can use the following function as a measure of granularity for a partition:

$$G(\pi) = \sum_{i=1}^{m} \frac{|A_i|}{|U|} \log |A_i| \qquad (7)$$

In contrast to the entropy function, for two partitions $\pi_1$ and $\pi_2$ with $\pi_1 \preceq \pi_2$, we have $G(\pi_1) \leq G(\pi_2)$. The coarsest partition $\{U\}$ has the maximum granularity value $\log |U|$, and the finest partition $\{\{x\} \mid x \in U\}$ has the minimum granularity value $0$.
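As an illustration, here is a small Python sketch of the entropy (5) and granularity (7) measures; the three partitions below are toy examples of our own choosing.

```python
from math import log2

def entropy(partition, U):
    # H(pi) = -sum (|A_i|/|U|) log(|A_i|/|U|)  -- equation (5)
    n = len(U)
    return -sum(len(A) / n * log2(len(A) / n) for A in partition)

def granularity(partition, U):
    # G(pi) = sum (|A_i|/|U|) log|A_i|  -- equation (7)
    n = len(U)
    return sum(len(A) / n * log2(len(A)) for A in partition)

U = set(range(8))
finest = [{x} for x in U]                    # H = log|U| = 3, G = 0
middle = [set(range(4)), set(range(4, 8))]   # H = 1, G = 2
coarsest = [U]                               # H = 0, G = log|U| = 3

for pi in (finest, middle, coarsest):
    print(entropy(pi, U), granularity(pi, U))
```

In each case $H(\pi) + G(\pi) = \log |U|$, as equation (6) requires, and refining a partition raises $H$ while lowering $G$.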
3. Rough set approximations

In this section we discuss approximations of sets and approximations of probabilities, as well as probabilistic approximations of sets.

3.1. Approximations of sets

Consider an approximation space $apr = (U, E)$. The set of definable subsets is given by $\sigma(U/E)$. For a subset $A \subseteq U$, the greatest definable set contained in $A$ is called the lower approximation of $A$, written $\underline{apr}_{U/E}(A)$, and the least definable set containing $A$ is called the upper approximation of $A$, written $\overline{apr}_{U/E}(A)$. The subscript $U/E$ indicates that the approximations are defined with respect to the partition $U/E$. When no confusion arises, we simply drop $U/E$. Lower and upper approximations can be expressed as

$$\underline{apr}(A) = \bigcup \{X \mid X \in \sigma(U/E),\ X \subseteq A\}$$
$$\overline{apr}(A) = \bigcap \{X \mid X \in \sigma(U/E),\ X \supseteq A\} \qquad (8)$$

In terms of equivalence classes, lower and upper approximations can be expressed by

$$\underline{apr}(A) = \bigcup_{[x]_E \subseteq A} [x]_E$$
$$\overline{apr}(A) = \bigcup_{[x]_E \cap A \neq \emptyset} [x]_E \qquad (9)$$

The lower approximation $\underline{apr}(A)$ is the union of equivalence classes which are subsets of $A$. The upper approximation $\overline{apr}(A)$ is the union of equivalence classes which have a nonempty intersection with $A$.

One may interpret $\underline{apr}, \overline{apr}: 2^U \to 2^U$ as two unary set-theoretic operators, called approximation operators. The system $(2^U, \neg, \underline{apr}, \overline{apr}, \cap, \cup)$ is called a Pawlak rough set algebra (Yao, 1996). It is an extension of the set algebra $(2^U, \neg, \cap, \cup)$. Properties of approximation operators, pertinent to our discussion, are summarized below:

(i) $\underline{apr}(A) = \neg\overline{apr}(\neg A)$, $\overline{apr}(A) = \neg\underline{apr}(\neg A)$;
(ii) $\underline{apr}(A) = \overline{apr}(A) = A$, for $A \in \sigma(U/E)$;
(iii) $\underline{apr}(A) \subseteq A \subseteq \overline{apr}(A)$;
(iv) $\underline{apr}(A \cap B) = \underline{apr}(A) \cap \underline{apr}(B)$, $\overline{apr}(A \cup B) = \overline{apr}(A) \cup \overline{apr}(B)$;
(v) $\underline{apr}(A \cup B) \supseteq \underline{apr}(A) \cup \underline{apr}(B)$, $\overline{apr}(A \cap B) \subseteq \overline{apr}(A) \cap \overline{apr}(B)$.

Property (i) shows that lower and upper approximations are dual to each other. Property (ii) indicates that the lower and upper approximations of a definable set are the set itself. By property (iii), a set lies within its lower and upper approximations. Property (iv) states that the lower approximation distributes over intersection, and the upper approximation distributes over union. Property (v) shows the sub-distributivity of approximation operators.

Many probability-related measures on approximations have been proposed and studied. Pawlak (1982, 1991) suggested an accuracy measure of rough set approximation given by

$$\alpha(A) = \frac{|\underline{apr}(A)|}{|\overline{apr}(A)|} = P(\underline{apr}(A) \mid \overline{apr}(A)) \qquad (10)$$

It may be interpreted as the probability that an element belongs to the lower approximation, given that the element belongs to the upper approximation. This measure can also be expressed in terms of the well-known Marczewski–Steinhaus metric (Yao, 2001a). Measures of the quality of lower and upper approximations are given respectively by Pawlak (1991):

$$\underline{q}(A) = \frac{|\underline{apr}(A)|}{|U|} = P(\underline{apr}(A))$$
$$\overline{q}(A) = \frac{|\overline{apr}(A)|}{|U|} = P(\overline{apr}(A)) \qquad (11)$$

They are referred to as rough probability by Pawlak (1984) and have been used by many authors (Grzymala-Busse, 1987; Wong & Lingras, 1989; Yao & Lingras, 1998; Düntsch & Gediga, 2001).
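Equations (9)–(11) transcribe directly into code. The following sketch uses a toy partition of our own; the function names are ours.

```python
def lower(A, partition):
    # apr_(A): union of blocks contained in A  -- equation (9)
    return set().union(*(blk for blk in partition if blk <= A))

def upper(A, partition):
    # apr^(A): union of blocks intersecting A  -- equation (9)
    return set().union(*(blk for blk in partition if blk & A))

U = set(range(6))
partition = [{0, 1}, {2, 3}, {4, 5}]
A = {0, 1, 2}

lo, up = lower(A, partition), upper(A, partition)
print(lo, up)               # {0, 1} {0, 1, 2, 3}
print(len(lo) / len(up))    # accuracy alpha(A) = 0.5    -- equation (10)
print(len(lo) / len(U))     # quality q_(A) ~ 0.333      -- equation (11)
print(len(up) / len(U))     # quality q^(A) ~ 0.667      -- equation (11)
```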
The relationship between the accuracy and quality of approximations can be expressed as

$$\alpha(A) = \frac{\underline{q}(A)}{\overline{q}(A)} \qquad (12)$$

The accuracy measure can be re-expressed as

$$\alpha(A) = \frac{|\underline{apr}(A)|}{|\overline{apr}(A)|} = \frac{|\underline{apr}(A)|}{|U| - |\underline{apr}(\neg A)|} \qquad (13)$$

which suggests that the accuracy measure also depends on the lower approximation of $\neg A$. Based on this observation, Gediga and Düntsch (2001) suggested use of the function

$$\gamma(A) = \frac{|\underline{apr}(A)|}{|A|} = P(\underline{apr}(A) \mid A) \qquad (14)$$

as a measure of the precision of the deterministic approximation of $A$. Using the same argument, we suggest that the quality of the non-deterministic approximation of $A$ can be measured by

$$\overline{\gamma}(A) = \frac{|A|}{|\overline{apr}(A)|} = P(A \mid \overline{apr}(A)) \qquad (15)$$

In this case, we have

$$\alpha(A) = \gamma(A)\,\overline{\gamma}(A) \qquad (16)$$

The two measures $\underline{q}$ and $\gamma$ monotonically increase as $\underline{apr}(A)$ approaches $A$ when different partitions are used. On the other hand, as $\overline{apr}(A)$ approaches $A$, the measures $\overline{q}$ and $\overline{\gamma}$ change in opposite directions: $\overline{q}$ decreases while $\overline{\gamma}$ increases. For two partitions $\pi_1$ and $\pi_2$ with $\pi_1 \preceq \pi_2$, we have

(vi) $\underline{apr}_{\pi_1}(A) \supseteq \underline{apr}_{\pi_2}(A)$, $\overline{apr}_{\pi_1}(A) \subseteq \overline{apr}_{\pi_2}(A)$.

By combining these with (iii), we obtain

$$\underline{apr}_{\pi_2}(A) \subseteq \underline{apr}_{\pi_1}(A) \subseteq A \subseteq \overline{apr}_{\pi_1}(A) \subseteq \overline{apr}_{\pi_2}(A) \qquad (17)$$

As expected, a finer partition induces a tighter approximation. All of the measures correctly reflect this observation, as shown by the following properties:

(1) $\alpha_{\pi_1}(A) \geq \alpha_{\pi_2}(A)$;
(2) $\underline{q}_{\pi_1}(A) \geq \underline{q}_{\pi_2}(A)$, $\overline{q}_{\pi_1}(A) \leq \overline{q}_{\pi_2}(A)$;
(3) $\gamma_{\pi_1}(A) \geq \gamma_{\pi_2}(A)$, $\overline{\gamma}_{\pi_1}(A) \geq \overline{\gamma}_{\pi_2}(A)$.

That is, for two partitions with $\pi_1 \preceq \pi_2$, we obtain the same qualitative evaluation by all those measures, namely, $\pi_1$ is the same as or better than $\pi_2$. For an arbitrary pair of partitions, the pair of $\underline{q}$ and $\gamma$, or the pair of $\overline{q}$ and $\overline{\gamma}$, produce the same qualitative evaluation, which may be different from the one given by the accuracy measure $\alpha$ (Gediga & Düntsch, 2001).

The approximation of a subset can be easily extended to the approximation of a family of subsets. Consider two partitions $\pi_A = \{A_1, A_2, \ldots, A_n\}$ and $\pi_B = \{B_1, B_2, \ldots, B_m\}$. We construct an approximation space $apr_{\pi_A} = (U, E_{\pi_A})$ using the partition $\pi_A$. Each equivalence class of $\pi_B$ is approximated by $\underline{apr}_{\pi_A}(B_i)$ and $\overline{apr}_{\pi_A}(B_i)$. By extending the accuracy and quality measures to the approximation of a partition, Pawlak (1991) suggested the following quantities:

$$\alpha_{\pi_A}(\pi_B) = \frac{\sum_{i=1}^{m} |\underline{apr}_{\pi_A}(B_i)|}{\sum_{i=1}^{m} |\overline{apr}_{\pi_A}(B_i)|}$$
$$\gamma_{\pi_A}(\pi_B) = \frac{\sum_{i=1}^{m} |\underline{apr}_{\pi_A}(B_i)|}{|U|} \qquad (18)$$

Furthermore, $\gamma_{\pi_A}$ can be expressed as, respectively, the expectation of $\gamma$, a weighted sum of $\alpha$, and the sum of $\underline{q}$ on individual equivalence classes (Gediga & Düntsch, 2001):

$$\gamma_{\pi_A}(\pi_B) = \sum_{i=1}^{m} \frac{|B_i|}{|U|}\,\gamma_{\pi_A}(B_i) = \sum_{i=1}^{m} P(B_i)\,\gamma_{\pi_A}(B_i)$$
$$\gamma_{\pi_A}(\pi_B) = \sum_{i=1}^{m} \frac{|\overline{apr}_{\pi_A}(B_i)|}{|U|}\,\alpha_{\pi_A}(B_i) = \sum_{i=1}^{m} P(\overline{apr}_{\pi_A}(B_i))\,\alpha_{\pi_A}(B_i) \qquad (19)$$
$$\gamma_{\pi_A}(\pi_B) = \sum_{i=1}^{m} \underline{q}_{\pi_A}(B_i)$$

The overall measure $\alpha_{\pi_A}(\pi_B)$ cannot be similarly expressed. It is reasonable to use as an alternative overall measure

$$\alpha'_{\pi_A}(\pi_B) = \sum_{i=1}^{m} \alpha_{\pi_A}(B_i) \qquad (20)$$

In fact, $\alpha_{\pi_A}(\pi_B)$ and $\alpha'_{\pi_A}(\pi_B)$ represent two different averaging methods: one is the application of a measure to the pooled results, and the other is the average of the measurements on the individual results.

Given a subset $B \subseteq U$, we can partition the universe as $\{B, \neg B\}$. By applying the partition-based measure $\gamma_{\pi_A}$, Düntsch and Gediga (2001) suggested the following measure of the approximation quality of $\pi_A$ with respect to $B$:

$$\gamma'_{\pi_A}(B) = \gamma_{\pi_A}(\pi_B) = \underline{q}_{\pi_A}(B) + \underline{q}_{\pi_A}(\neg B) \qquad (21)$$

This measure is the ratio of correct classification of either $B$ or $\neg B$ based on the partition $\pi_A$.
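Under the same toy set-up as before, the partition-level measures of equation (18) can be sketched as follows; `lower` and `upper` are repeated from the previous sketch so that the block is self-contained.

```python
def lower(A, blocks):
    return set().union(*(b for b in blocks if b <= A))

def upper(A, blocks):
    return set().union(*(b for b in blocks if b & A))

def partition_accuracy(piA, piB):
    # alpha_{pi_A}(pi_B): pooled lower over pooled upper sizes -- eq. (18)
    low = sum(len(lower(B, piA)) for B in piB)
    up = sum(len(upper(B, piA)) for B in piB)
    return low / up

def partition_quality(piA, piB, U):
    # gamma_{pi_A}(pi_B): total lower coverage of U -- eq. (18)
    return sum(len(lower(B, piA)) for B in piB) / len(U)

U = set(range(6))
piA = [{0, 1}, {2, 3}, {4, 5}]
piB = [{0, 1, 2}, {3, 4, 5}]
print(partition_accuracy(piA, piB))    # 4 pooled lower / 8 pooled upper = 0.5
print(partition_quality(piA, piB, U))  # 4/6 ~ 0.667
```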
3.2. Approximations of probabilities

In an approximation space $apr = (U, E)$, suppose a set function is defined on $\sigma(U/E)$. One can extend the function to nondefinable subsets through the lower and upper approximations. Many authors have studied the approximation of probabilities in the framework of rough sets, which leads to belief functions (Pawlak, 1984; Grzymala-Busse, 1987; Skowron, 1989, 1990; Wong & Lingras, 1989; Skowron & Grzymala-Busse, 1994; Yao & Lingras, 1998).

A belief function is a mapping from $2^U$ to the unit interval $[0, 1]$ that satisfies the following axioms.

(F1) $Bel(\emptyset) = 0$.
(F2) $Bel(U) = 1$.
(F3) For every positive integer $n$ and every collection $A_1, \ldots, A_n \subseteq U$,
$$Bel(A_1 \cup A_2 \cup \ldots \cup A_n) \geq \sum_i Bel(A_i) - \sum_{i<j} Bel(A_i \cap A_j) + \ldots + (-1)^{n+1} Bel(A_1 \cap \ldots \cap A_n)$$

Axioms (F1) and (F2) may be considered as normalization conditions. Axiom (F3) is a weaker version of the commonly known additivity axiom of probability functions. It is referred to as the axiom of superadditivity. The dual of a belief function, called a plausibility function $Pl$, is defined by

$$Pl(A) = 1 - Bel(\neg A) \qquad (22)$$

For any subset $A \subseteq U$, $Bel(A) \leq Pl(A)$.

Consider first a simple case where a probability function $P$ on $2^U$ is defined based on the counting of elements in a set (Grzymala-Busse, 1987; Skowron & Grzymala-Busse, 1994); namely, for $A \subseteq U$,

$$P(A) = \frac{|A|}{|U|} \qquad (23)$$

Clearly, we have

$$\underline{q}(A) = P(\underline{apr}(A)) \leq P(A) \leq P(\overline{apr}(A)) = \overline{q}(A) \qquad (24)$$

While $\underline{apr}(A)$ and $\overline{apr}(A)$ are approximations of the set $A$, $\underline{q}(A)$ and $\overline{q}(A)$ are approximations of the probability of the set $A$. It can easily be verified by using properties (i)–(iv) that the qualities of lower and upper approximations are a pair of belief and plausibility functions.

Suppose $P$ is a probability function defined on $\sigma(U/E)$. It is not defined for subsets of $U$ which are not members of $\sigma(U/E)$. One can extend $P$ to $2^U$ in two standard ways by defining functions $P_*$ and $P^*$, traditionally called the inner measure and the outer measure induced by $P$. For an arbitrary subset $A \subseteq U$, we define

$$P_*(A) = \sup\{P(X) \mid X \in \sigma(U/E),\ X \subseteq A\} = P(\underline{apr}(A))$$
$$P^*(A) = \inf\{P(X) \mid X \in \sigma(U/E),\ X \supseteq A\} = P(\overline{apr}(A)) \qquad (25)$$

Pawlak (1984) referred to the pair $(P_*(A), P^*(A))$ as the rough probability of $A$. The inner and outer probabilities $P_*$ and $P^*$ are a pair of belief and plausibility functions (Wong & Lingras, 1989; Fagin & Halpern, 1991; Yao & Lingras, 1998).
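The correspondence between rough approximations and belief/plausibility functions, equations (22)–(25), can be checked numerically. This sketch reuses the `lower` and `upper` operators of section 3.1 on a toy approximation space.

```python
def lower(A, blocks):
    return set().union(*(b for b in blocks if b <= A))

def upper(A, blocks):
    return set().union(*(b for b in blocks if b & A))

def bel(A, blocks, U):
    # Bel(A) = P(apr_(A)) = q_(A): the inner measure  -- equations (24), (25)
    return len(lower(A, blocks)) / len(U)

def pl(A, blocks, U):
    # Pl(A) = P(apr^(A)) = q^(A): the outer measure  -- equations (24), (25)
    return len(upper(A, blocks)) / len(U)

U = set(range(6))
blocks = [{0, 1}, {2, 3}, {4, 5}]
A = {0, 1, 2}

# Duality (22): Pl(A) = 1 - Bel(U - A), via apr^(A) = -apr_(-A)
assert abs(pl(A, blocks, U) - (1 - bel(U - A, blocks, U))) < 1e-12
print(bel(A, blocks, U), pl(A, blocks, U))  # 0.333... <= P(A) = 0.5 <= 0.666...
```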
3.3. Probabilistic rough set approximations

Algebraic rough set approximations may be considered as qualitative approximations of a set. The extent of overlap between a set and an equivalence class is not considered. By incorporating the overlap, many authors have introduced and studied probabilistic rough set approximations (Wong & Ziarko, 1987; Pawlak et al., 1988; Ziarko, 1993; Pawlak & Skowron, 1994). Most proposals introduced certain parameters based on intuitive arguments. The decision-theoretic rough set model provides a solid basis for probabilistic approximations (Yao et al., 1990; Yao & Wong, 1992).

We first briefly review the Bayesian decision-theoretic framework (Duda & Hart, 1973). Let $\Omega = \{\omega_1, \ldots, \omega_s\}$ be a finite set of $s$ states, and let $\mathcal{A} = \{a_1, \ldots, a_m\}$ be a finite set of $m$ possible actions. Let $P(\omega_j \mid \mathbf{x})$ be the conditional probability of an object $x$ being in state $\omega_j$ given that the object is described by $\mathbf{x}$. Let $\lambda(a_i \mid \omega_j)$ denote the loss, or cost, for taking action $a_i$ when the state is $\omega_j$. For an object with description $\mathbf{x}$, suppose an action $a_i$ is taken. Since $P(\omega_j \mid \mathbf{x})$ is the probability that the true state is $\omega_j$ given $\mathbf{x}$, the expected loss associated with taking action $a_i$ is given by

$$R(a_i \mid \mathbf{x}) = \sum_{j=1}^{s} \lambda(a_i \mid \omega_j)\,P(\omega_j \mid \mathbf{x}) \qquad (26)$$

The quantity $R(a_i \mid \mathbf{x})$ is called the conditional risk. Given a description $\mathbf{x}$, a decision procedure is a function $\tau(\mathbf{x})$ that specifies which action to take. For every $\mathbf{x}$, $\tau(\mathbf{x})$ chooses one action from $a_1, \ldots, a_m$. The overall risk $R$ is the expected loss associated with a given decision procedure. Since $R(\tau(\mathbf{x}) \mid \mathbf{x})$ is the conditional risk associated with action $\tau(\mathbf{x})$, the overall risk is defined by

$$R = \sum_{\mathbf{x}} R(\tau(\mathbf{x}) \mid \mathbf{x})\,P(\mathbf{x}) \qquad (27)$$

where the summation is over the set of all possible descriptions of objects. One can obtain an optimal decision procedure by minimizing the overall risk. If $\tau(\mathbf{x})$ is chosen so that $R(\tau(\mathbf{x}) \mid \mathbf{x})$ is as small as possible for every $\mathbf{x}$, the overall risk $R$ is minimized. The Bayesian decision procedure can therefore be formally stated as follows. For every $\mathbf{x}$, compute the conditional risk $R(a_i \mid \mathbf{x})$ for $i = 1, \ldots, m$ defined by equation (26), and then select the action for which the conditional risk is the minimum. If more than one action minimizes $R(a_i \mid \mathbf{x})$, any tie-breaking rule can be used.

The Bayesian decision procedure can be applied to define probabilistic rough set approximations. Given a subset $A \subseteq U$, we can form a set of two states $\Omega = \{A, \neg A\}$, indicating that an element is in $A$ or not in $A$, respectively. We use the same symbol to denote both a subset $A$ and the corresponding state. In the non-probabilistic rough set model, with respect to $A$, we divide the universe $U$ into three disjoint regions, the positive region $POS(A)$, the negative region $NEG(A)$ and the boundary region $BND(A)$:

$$POS(A) = \underline{apr}(A)$$
$$NEG(A) = U - \overline{apr}(A)$$
$$BND(A) = \overline{apr}(A) - \underline{apr}(A) \qquad (28)$$

In developing a probabilistic rough set model, with respect to the three regions, the set of actions is given by $\mathcal{A} = \{a_1, a_2, a_3\}$, where $a_1$, $a_2$ and $a_3$ represent the three actions in classifying an object: deciding $POS(A)$, deciding $NEG(A)$ and deciding $BND(A)$, respectively. The symbol $[x]_E$, the equivalence class containing $x$, is also used to represent a description of $x$. The required conditional probabilities are defined by the rough membership functions (Pawlak & Skowron, 1994)

$$\mu_A(x) = \frac{|[x]_E \cap A|}{|[x]_E|} = P(A \mid [x]_E) \qquad (29)$$

Let $\lambda(a_i \mid A)$ denote the loss incurred for taking action $a_i$ when an object in fact belongs to $A$, and let $\lambda(a_i \mid \neg A)$ denote the loss incurred for taking the same action when the object does not belong to $A$. The expected losses $R(a_i \mid [x]_E)$ associated with taking the individual actions can be expressed as

$$R(a_1 \mid [x]_E) = \lambda_{11}\,P(A \mid [x]_E) + \lambda_{12}\,P(\neg A \mid [x]_E)$$
$$R(a_2 \mid [x]_E) = \lambda_{21}\,P(A \mid [x]_E) + \lambda_{22}\,P(\neg A \mid [x]_E)$$
$$R(a_3 \mid [x]_E) = \lambda_{31}\,P(A \mid [x]_E) + \lambda_{32}\,P(\neg A \mid [x]_E) \qquad (30)$$

where $\lambda_{i1} = \lambda(a_i \mid A)$, $\lambda_{i2} = \lambda(a_i \mid \neg A)$ and $i = 1, 2, 3$. The Bayesian decision procedure leads to the following minimum-risk decision rules.

(P) If $R(a_1 \mid [x]_E) \leq R(a_2 \mid [x]_E)$ and $R(a_1 \mid [x]_E) \leq R(a_3 \mid [x]_E)$, decide $POS(A)$.
(N) If $R(a_2 \mid [x]_E) \leq R(a_1 \mid [x]_E)$ and $R(a_2 \mid [x]_E) \leq R(a_3 \mid [x]_E)$, decide $NEG(A)$.
(B) If $R(a_3 \mid [x]_E) \leq R(a_1 \mid [x]_E)$ and $R(a_3 \mid [x]_E) \leq R(a_2 \mid [x]_E)$, decide $BND(A)$.

Tie-breaking rules should be added so that each element is classified into only one region. Since $P(A \mid [x]_E) + P(\neg A \mid [x]_E) = 1$, the above decision rules can be simplified such that only the probabilities $P(A \mid [x]_E)$ are involved. We can classify any object in the equivalence class $[x]_E$ based only on the probabilities $P(A \mid [x]_E)$, i.e. the rough membership values, and the given loss function $\lambda_{ij}$ ($i = 1, 2, 3$; $j = 1, 2$).
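A minimal sketch of this minimum-risk classification of an equivalence class, following equations (29) and (30); the loss matrix is a hypothetical choice (cf. the loss function of equation (33) discussed below), and the function names are ours.

```python
def rough_membership(block, A):
    # mu_A(x) = |[x]_E & A| / |[x]_E| = P(A | [x]_E)  -- equation (29)
    return len(block & A) / len(block)

def min_risk_action(p, losses):
    # R(a_i | [x]_E) = l_i1 * P(A|[x]) + l_i2 * P(~A|[x])  -- equation (30)
    # losses[i] = (l_i1, l_i2) for a1 = POS, a2 = NEG, a3 = BND;
    # ties are broken in favour of the earlier action.
    risks = [l1 * p + l2 * (1 - p) for l1, l2 in losses]
    return ['POS', 'NEG', 'BND'][risks.index(min(risks))], risks

# Hypothetical losses: unit cost for a wrong definite decision,
# half a unit for deferring to the boundary.
losses = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
block, A = {1, 2, 3, 4}, {1, 2, 3}
p = rough_membership(block, A)     # 0.75
print(min_risk_action(p, losses))  # ('POS', [0.25, 0.75, 0.5])
```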
Consider a special kind of loss function with $\lambda_{11} \leq \lambda_{31} < \lambda_{21}$ and $\lambda_{22} \leq \lambda_{32} < \lambda_{12}$. That is, the loss of classifying an object $x$ belonging to $A$ into the positive region $POS(A)$ is less than or equal to the loss of classifying $x$ into the boundary region $BND(A)$, and both of these losses are strictly less than the loss of classifying $x$ into the negative region $NEG(A)$. The reverse order of losses is used for classifying an object that does not belong to $A$. For this type of loss function, the minimum-risk decision rules (P), (N), (B) can be written as

(P) if $P(A \mid [x]_E) \geq \gamma$ and $P(A \mid [x]_E) \geq \alpha$, decide $POS(A)$;
(N) if $P(A \mid [x]_E) \leq \beta$ and $P(A \mid [x]_E) \leq \gamma$, decide $NEG(A)$;
(B) if $\beta \leq P(A \mid [x]_E) \leq \alpha$, decide $BND(A)$;

where

$$\alpha = \frac{\lambda_{12} - \lambda_{32}}{(\lambda_{31} - \lambda_{32}) - (\lambda_{11} - \lambda_{12})}$$
$$\gamma = \frac{\lambda_{12} - \lambda_{22}}{(\lambda_{21} - \lambda_{22}) - (\lambda_{11} - \lambda_{12})}$$
$$\beta = \frac{\lambda_{32} - \lambda_{22}}{(\lambda_{21} - \lambda_{22}) - (\lambda_{31} - \lambda_{32})} \qquad (31)$$

By the assumptions $\lambda_{11} \leq \lambda_{31} < \lambda_{21}$ and $\lambda_{22} \leq \lambda_{32} < \lambda_{12}$, it follows that $\alpha \in (0, 1]$, $\gamma \in (0, 1)$ and $\beta \in [0, 1)$.

A loss function should be chosen in such a way as to satisfy the condition $\alpha \geq \beta$. This ensures that the results are consistent with rough set approximations; namely, the lower approximation is a subset of the upper approximation, and the boundary region may be nonempty. When $\alpha > \beta$, we have $\alpha > \gamma > \beta$. After tie-breaking, we obtain the decision rules

(P1) if $P(A \mid [x]_E) \geq \alpha$, decide $POS(A)$;
(N1) if $P(A \mid [x]_E) \leq \beta$, decide $NEG(A)$;
(B1) if $\beta < P(A \mid [x]_E) < \alpha$, decide $BND(A)$.

When $\alpha = \beta$, we have $\alpha = \gamma = \beta$. In this case, we use the decision rules

(P2) if $P(A \mid [x]_E) > \alpha$, decide $POS(A)$;
(N2) if $P(A \mid [x]_E) < \alpha$, decide $NEG(A)$;
(B2) if $P(A \mid [x]_E) = \alpha$, decide $BND(A)$.

For the second set of decision rules, we use a tie-breaking criterion so that the boundary region may be nonempty.

The standard and other probabilistic rough set models can be easily derived by choosing different loss functions. Consider the loss function

$$\lambda_{12} = \lambda_{21} = 1, \quad \lambda_{11} = \lambda_{22} = \lambda_{31} = \lambda_{32} = 0 \qquad (32)$$

There is a unit cost if an object belonging to $A$ is classified into the negative region or if an object not belonging to $A$ is classified into the positive region; otherwise there is no cost. In this case, we have $\alpha = 1 > \beta = 0$, $\alpha = 1 - \beta$ and $\gamma = 0.5$. According to decision rules (P1), (N1), (B1), we obtain the standard rough set approximations (Pawlak, 1982, 1991). Another loss function is given by

$$\lambda_{12} = \lambda_{21} = 1, \quad \lambda_{31} = \lambda_{32} = 0.5, \quad \lambda_{11} = \lambda_{22} = 0 \qquad (33)$$

A unit cost is incurred if the system classifies an object belonging to $A$ into the negative region or an object not belonging to $A$ into the positive region; half of a unit cost is incurred if any object is classified into the boundary region. There is no cost for other cases. It follows that $\alpha = \beta = \gamma = 0.5$. By using decision rules (P2), (N2), (B2), we obtain the probabilistic rough set approximation proposed by Pawlak et al. (1988).

Suppose a loss function with $\lambda_{11} \leq \lambda_{31} < \lambda_{21}$ and $\lambda_{22} \leq \lambda_{32} < \lambda_{12}$ satisfies the conditions

$$\lambda_{12} - \lambda_{32} \geq \lambda_{31} - \lambda_{11}$$
$$(\lambda_{12} - \lambda_{32})(\lambda_{32} - \lambda_{22}) = (\lambda_{31} - \lambda_{11})(\lambda_{21} - \lambda_{31}) \qquad (34)$$

We have $\alpha = 1 - \beta \geq 0.5$. This leads to the variable precision rough set model (Ziarko, 1993).
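The threshold formulae (31) and rules (P1), (N1), (B1) translate directly into code. The first two loss functions below are those of equations (32) and (33); the third is a hypothetical choice satisfying condition (34).

```python
def thresholds(l11, l12, l21, l22, l31, l32):
    # alpha, gamma, beta from the loss function  -- equation (31)
    alpha = (l12 - l32) / ((l31 - l32) - (l11 - l12))
    gamma = (l12 - l22) / ((l21 - l22) - (l11 - l12))
    beta = (l32 - l22) / ((l21 - l22) - (l31 - l32))
    return alpha, gamma, beta

def region(p, alpha, beta):
    # decision rules (P1), (N1), (B1), assuming alpha > beta
    if p >= alpha:
        return 'POS'
    if p <= beta:
        return 'NEG'
    return 'BND'

print(thresholds(0, 1, 1, 0, 0, 0))        # (1.0, 0.5, 0.0): standard model, eq. (32)
print(thresholds(0, 1, 1, 0, 0.5, 0.5))    # (0.5, 0.5, 0.5): Pawlak et al. (1988), eq. (33)
print(thresholds(0, 1, 1, 0, 0.25, 0.25))  # (0.75, 0.5, 0.25): alpha = 1 - beta, eq. (34)
```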
4. Probabilistic measures for rule induction

An important application of rough sets is data analysis and rule induction (Wong & Ziarko, 1986; Pawlak, 1991; Grzymala-Busse, 1992; Tsumoto, 1998). This section reviews probabilistic and information-theoretic measures used in rule induction algorithms (Yao & Zhong, 1999; Yao, 2003b).

4.1. Information tables

An information table provides a convenient way to describe a finite set of objects by a finite set of attributes (Pawlak, 1991). In this paper, we use an extended information table obtained by adding binary relations on attribute values, and two languages (Yao, 2001b). Formally, an information table can be expressed as

$$S = (U, At, L_v, L_r, \{V_a \mid a \in At\}, \{R_a \mid a \in At\}, \{I_a \mid a \in At\})$$

where $U$ is a finite nonempty set of objects, $At$ is a finite nonempty set of attributes, $L_v$ is a language dealing with values of objects, $L_r$ is a language dealing with relations of objects, $V_a$ is a nonempty set of values for $a \in At$, $R_a$ is a nonempty set of binary relations on $V_a$, $a \in At$, and $I_a: U \to V_a$ is an information function. Each information function $I_a$ is a total function that maps an object of $U$ to exactly one value in $V_a$. An information table represents all available information and knowledge. That is, objects are only perceived, observed or measured by using a finite number of properties.

In the language $L_v$, an atomic formula is given by $(a, R, v)$, where $a \in At$, $R \in R_a$ and $v \in V_a$. If $\phi$ and $\psi$ are formulae, then so are $\neg\phi$, $\phi \wedge \psi$, $\phi \vee \psi$, $\phi \to \psi$ and $\phi \leftrightarrow \psi$. The semantics of the language $L_v$ can be defined in the Tarski style through the notions of a model and satisfiability. The model is an information table $S$, which provides an interpretation for symbols and formulae of $L_v$. The satisfiability of a formula $\phi$ by an object $x$, written $x \models_S \phi$, or in short $x \models \phi$ if $S$ is understood, is given by the following conditions:

(m1) $x \models (a, R, v)$ iff $I_a(x)\, R\, v$,
(m2) $x \models \neg\phi$ iff not $x \models \phi$,
(m3) $x \models \phi \wedge \psi$ iff $x \models \phi$ and $x \models \psi$,
(m4) $x \models \phi \vee \psi$ iff $x \models \phi$ or $x \models \psi$,
(m5) $x \models \phi \to \psi$ iff $x \models \neg\phi \vee \psi$,
(m6) $x \models \phi \leftrightarrow \psi$ iff $x \models \phi \to \psi$ and $x \models \psi \to \phi$.

If $\phi$ is a formula, the set $m_S(\phi)$ defined by

$$m_S(\phi) = \{x \in U \mid x \models \phi\} \qquad (35)$$

is called the meaning of the formula $\phi$ in $S$. If $S$ is understood, we simply write $m(\phi)$. The following properties hold:

(a) $m(a, R, v) = \{x \in U \mid I_a(x)\, R\, v\}$,
(b) $m(\neg\phi) = \neg m(\phi)$,
(c) $m(\phi \wedge \psi) = m(\phi) \cap m(\psi)$,
(d) $m(\phi \vee \psi) = m(\phi) \cup m(\psi)$,
(e) $m(\phi \to \psi) = \neg m(\phi) \cup m(\psi)$,
(f) $m(\phi \leftrightarrow \psi) = (m(\phi) \cap m(\psi)) \cup (\neg m(\phi) \cap \neg m(\psi))$.

The meaning of a formula $\phi$ is therefore the set of all objects having the property expressed by the formula $\phi$. In other words, $\phi$ can be viewed as the description of the set of objects $m(\phi)$. Thus, a connection between formulae of $L_v$ and subsets of $U$ is established. When the relation $R$ is chosen to be the equality relation $=$, we obtain the conventional decision logic language (Pawlak, 1991).

With the introduction of the language $L_v$, we have a formal description of concepts. A concept definable in an information table is a pair $(\phi, m(\phi))$, where $\phi \in L_v$. More specifically, $\phi$ is a description of $m(\phi)$ in $S$, the intension of the concept $(\phi, m(\phi))$, and $m(\phi)$ is the set of objects satisfying $\phi$, the extension of the concept $(\phi, m(\phi))$. A concept $(\phi, m(\phi))$ is said to be a sub-concept of another concept $(\psi, m(\psi))$, or $(\psi, m(\psi))$ a super-concept of $(\phi, m(\phi))$, if $m(\phi) \subseteq m(\psi)$. A concept $(\phi, m(\phi))$ is said to be the smallest nonempty concept in $S$ if there does not exist another proper nonempty sub-concept of $(\phi, m(\phi))$. Two concepts $(\phi, m(\phi))$ and $(\psi, m(\psi))$ are disjoint if $m(\phi) \cap m(\psi) = \emptyset$. If $m(\phi) \cap m(\psi) \neq \emptyset$, we say that the two concepts have a nonempty overlap and hence are related.
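As a sketch, the meaning-set semantics of equation (35) and properties (a)–(f) can be implemented compositionally; the three-object table below is hypothetical and the helper names are ours.

```python
import operator

# A hypothetical information table with the equality relation on values.
table = {
    'o1': {'height': 'tall',  'hair': 'dark'},
    'o2': {'height': 'tall',  'hair': 'blond'},
    'o3': {'height': 'short', 'hair': 'blond'},
}
U = set(table)

def m_atomic(a, R, v):
    # m(a, R, v) = {x in U | I_a(x) R v}  -- property (a)
    return {x for x in U if R(table[x][a], v)}

def m_not(s):           # m(~phi) = complement        -- property (b)
    return U - s

def m_and(s, t):        # m(phi ^ psi) = intersection -- property (c)
    return s & t

def m_or(s, t):         # m(phi v psi) = union        -- property (d)
    return s | t

def m_implies(s, t):    # m(phi -> psi)               -- property (e)
    return (U - s) | t

tall = m_atomic('height', operator.eq, 'tall')
blond = m_atomic('hair', operator.eq, 'blond')
print(m_and(tall, blond))      # {'o2'}: meaning of (height,=,tall) ^ (hair,=,blond)
print(m_implies(tall, blond))  # {'o2', 'o3'}
```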
The language $L_r$ is defined in a similar manner to $L_v$, except that an atomic formula is given by $(a, R)$, where $R \in R_a$ and $a \in At$. Semantics of formulae of $L_r$ are interpreted through pairs of objects in $U$. That is,

(m1') $(x, y) \models (a, R)$ iff $I_a(x)\, R\, I_a(y)$.

For a formula $\phi$, the set $m_S(\phi)$ defined by

$$m_S(\phi) = \{(x, y) \in U \times U \mid (x, y) \models \phi\} \qquad (36)$$

is called the meaning set of $\phi$ in $S$. If $S$ is understood, we simply write $m(\phi)$. A pair $(x, y) \in m(\phi)$ is said to satisfy the expression $\phi$. Similarly, the formula $\phi$ can be viewed as the description of the set of object pairs $m(\phi)$, and each object pair in $m(\phi)$ as an instance of the concept given by $\phi$.

4.2. Two types of rules

Knowledge derivable from an information table is commonly represented in the form of rules. Roughly speaking, rules show the connections between attributes, which are normally characterized by the problem of determining the values of one set of attributes based on the values of another set of attributes. Depending on the meanings and forms of rules, we can classify rules in many ways. A clear classification of rules is useful for an understanding of the basic tasks of machine learning and data mining. Rules can be classified into two groups in terms of their directions, one-way and two-way connections, and further classified into two levels in terms of their applicability, local and global connections (Yao & Zhong, 1999; Yao, 2001b, 2003b). A one-way connection shows that the values of one set of attributes determine the values of another set of attributes, but does not say anything about the reverse. A two-way connection is a combination of two one-way connections, representing two different directions of connection. A local connection is characterized by a rule showing the relationship between one specific combination of values on one set of attributes and one specific combination of values on another set of attributes. A global connection is characterized by a rule showing the relationships between all combinations of values on one set of attributes and all combinations of values on another set of attributes.

For clarity, we only consider one-way connections and the equality relation on attribute values, as is commonly done in rough sets. In this case, a local one-way connection is expressed by a rule of the form, using formulae of $L_v$,

$$(a, =, v_a) \Rightarrow (b, =, v_b) \qquad (37)$$

where $a, b \in At$, $v_a \in V_a$ and $v_b \in V_b$. It can be more conveniently expressed as, for $x \in U$,

$$I_a(x) = v_a \Rightarrow I_b(x) = v_b \qquad (38)$$

The rule is commonly paraphrased as 'if the value of an object is $v_a$ on an attribute $a$, then its value is $v_b$ on another attribute $b$'. A global one-way connection is expressed by a rule of the form, using formulae of $L_r$,

$$(a, =) \Rightarrow (b, =) \qquad (39)$$

where $a, b \in At$, or conveniently as, for $(x, y) \in U \times U$,

$$I_a(x) = I_a(y) \Rightarrow I_b(x) = I_b(y) \qquad (40)$$

That is, 'if two objects have the same value on an attribute $a$, then they have the same value on another attribute $b$'. Functional dependence in a database is an example of such global rules.
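A brute-force check of a high order rule such as (40) runs over object pairs. The table below extends the hypothetical one from the previous sketch with a class attribute.

```python
from itertools import combinations

def global_rule_holds(table, W, Z):
    # W => Z: agreement on all attributes in W forces
    # agreement on all attributes in Z  -- equation (40)
    for x, y in combinations(table, 2):
        if all(table[x][a] == table[y][a] for a in W):
            if not all(table[x][b] == table[y][b] for b in Z):
                return False
    return True

table = {
    'o1': {'height': 'tall',  'hair': 'dark',  'class': 'plus'},
    'o2': {'height': 'tall',  'hair': 'blond', 'class': 'minus'},
    'o3': {'height': 'short', 'hair': 'blond', 'class': 'minus'},
}
print(global_rule_holds(table, ['hair'], ['class']))    # True
print(global_rule_holds(table, ['height'], ['class']))  # False: o1, o2 disagree
```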
The formulation of rules using atomic formulae can be easily extended to any formulae of the languages $L_v$ and $L_r$. A local rule states knowledge about one object. A local one-way rule shows that, if the object has a specific value on one set of attributes, then it will have a specific value on another set of attributes. On the other hand, a global rule states knowledge about a pair of objects. A global one-way rule suggests that, if a pair of objects have the same value on one set of attributes, then they will have the same value on another set of attributes. Based on this observation, a global rule is also called a high order rule, while a local rule is called a low order rule (Yao, 2003a).

4.3. Interpretation of rough set theory in information tables

The abstract theory of rough sets can be explained by using an information table. Such an interpretation is useful for rule induction. Let $W = \{W_1, \ldots, W_n\} \subseteq At$ be a set of attributes in an information table. We form a family of elementary formulae $F_W = \{\wedge_{i=1}^{n} (W_i, =, w_i) \mid w_i \in V_{W_i}\}$ of the language $L_v$. For simplicity, let $V_W = V_{W_1} \times \ldots \times V_{W_n}$. We also express the family of elementary formulae by $F_W = \{W = w \mid w \in V_W\}$. The family of nonempty meaning sets forms a partition of the universe, namely

$$\pi_W = \{m(\wedge_{i=1}^{n} (W_i, =, w_i)) \neq \emptyset \mid w_i \in V_{W_i}\} \qquad (41)$$

It is referred to as the partition induced by the set of attributes $W$. For the same set of attributes, we can construct a formula $\wedge_{i=1}^{n} (W_i, =)$ of the language $L_r$. The meaning of the formula,

$$E_W = m(\wedge_{i=1}^{n} (W_i, =)) = \{(x, y) \in U \times U \mid I_{W_i}(x) = I_{W_i}(y),\ 1 \leq i \leq n\} \qquad (42)$$

is an equivalence relation on $U$. Similarly, another set of attributes $Z = \{Z_1, \ldots, Z_m\}$ defines another partition $\pi_Z$ and the corresponding equivalence relation $E_Z$.

Rough set approximations of a single subset are relevant to the induction of local or low order rules. Consider a formula $\wedge_{j=1}^{m} (Z_j, =, z_j)$. In terms of attributes in $W$, we can obtain various local rules of the following format:

$$\wedge_{i=1}^{n} (W_i, =, w_i) \Rightarrow \wedge_{j=1}^{m} (Z_j, =, z_j) \qquad (43)$$

Let $\phi_W = \wedge_{i=1}^{n} (W_i, =, w_i)$ and $\psi_Z = \wedge_{j=1}^{m} (Z_j, =, z_j)$. With respect to the three regions of rough set approximations, we can construct three classes of rules.

(I) Positive region: certain positive rules, $m(\phi_W) \subseteq m(\psi_Z)$, $\phi_W \to \psi_Z$.
(II) Boundary region: uncertain positive rules, $m(\phi_W) \not\subseteq m(\psi_Z)$ and $m(\phi_W) \cap m(\psi_Z) \neq \emptyset$, $\phi_W \Rightarrow \psi_Z$.
(III) Negative region: certain negative rules, $m(\phi_W) \cap m(\psi_Z) = \emptyset$, $\phi_W \to \neg\psi_Z$.

Certain rules can be considered as degenerate cases of uncertain rules. Since certain rules can be interpreted using the logical connective $\to$, we use the same symbol. All three classes of rules express the relationship between two concepts in terms of their meaning sets. The probabilistic measures introduced earlier can be used to quantify the uncertainty of rules. For example, a measure from the rough membership function,

$$\frac{|m(\phi_W) \cap m(\psi_Z)|}{|m(\phi_W)|} \qquad (44)$$

can be used to measure the accuracy of one rule. Other measures associated with approximation can be used to show the characteristics of a set of rules. For instance, the measure suggested by Gediga and Düntsch (2001),

$$\gamma_{\pi_W}(m(\psi_Z)) = \frac{|\underline{apr}_{\pi_W}(m(\psi_Z))|}{|m(\psi_Z)|} = \frac{\sum_{\phi_W \in F_W,\, m(\phi_W) \subseteq m(\psi_Z)} |m(\phi_W)|}{|m(\psi_Z)|} \qquad (45)$$

is the ratio of objects correctly classified by all certain positive rules to the objects satisfying the condition $\psi_Z$. Similarly, the accuracy of approximation $\alpha_{\pi_W}(m(\psi_Z))$ is the ratio of objects correctly classified by all certain positive rules to the objects classified by both certain and uncertain positive rules.

Approximation of one partition based on another partition can be summarized by the following rule:

$$\wedge_{i=1}^{n} (W_i, =) \Rightarrow \wedge_{j=1}^{m} (Z_j, =), \quad \text{or simply} \quad W \Rightarrow Z \qquad (46)$$

The measures $\alpha_{\pi_W}(\pi_Z)$ and $\gamma_{\pi_W}(\pi_Z)$ can be used to measure the strength of this global, high order rule.
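The three classes of rules (I)–(III) and the rule accuracy (44) can be sketched directly over the blocks of $\pi_W$ and the target set $m(\psi_Z)$; the data below are toy values of our own.

```python
def rule_classes(piW, target):
    # Sort each block m(phi_W) of pi_W into rule classes (I)-(III)
    certain, uncertain, negative = [], [], []
    for blk in piW:
        if blk <= target:
            certain.append(blk)    # (I)   m(phi_W) subset of m(psi_Z)
        elif blk & target:
            uncertain.append(blk)  # (II)  nonempty partial overlap
        else:
            negative.append(blk)   # (III) empty intersection
    return certain, uncertain, negative

piW = [{0, 1}, {2, 3}, {4, 5}]
target = {0, 1, 2}                 # m(psi_Z)
cert, unc, neg = rule_classes(piW, target)
print(cert, unc, neg)              # [{0, 1}] [{2, 3}] [{4, 5}]
for blk in unc:
    # accuracy of an uncertain rule  -- equation (44)
    print(len(blk & target) / len(blk))   # 0.5
```

The sizes of `cert` blocks also give the numerator of the measure $\gamma_{\pi_W}(m(\psi_Z))$ in equation (45).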
A more detailed probabilistic and information-theoretic analysis of low and high order rules is given in the following sections (Yao & Zhong, 1999; Yao, 2003b).

4.4. Probabilistic measures for low order rules

Suppose $\phi$ and $\psi$ are two formulae of the language $L_v$. For a rule $\phi \Rightarrow \psi$, its characteristics can be summarized by the following contingency table:

$$\begin{array}{c|cc|c}
 & \psi & \neg\psi & \text{Total} \\
\hline
\phi & a & b & a + b \\
\neg\phi & c & d & c + d \\
\hline
\text{Total} & a + c & b + d & a + b + c + d = |U|
\end{array}$$

where

$$a = |m(\phi \wedge \psi)|, \quad b = |m(\phi \wedge \neg\psi)|, \quad c = |m(\neg\phi \wedge \psi)|, \quad d = |m(\neg\phi \wedge \neg\psi)|$$

Different measures can be defined to reflect various aspects of rules. The generality of $\phi$ is defined by

$$G(\phi) = \frac{|m(\phi)|}{|U|} = \frac{a + b}{|U|} \qquad (47)$$

which indicates the relative size of the concept $\phi$. Obviously, we have $0 \leq G(\phi) \leq 1$. A concept is more general if it covers more instances of the universe. A sub-concept has a lower generality than its super-concept. The quantity may be viewed as the probability of a randomly selected element satisfying $\phi$.

The absolute support of $\psi$ provided by $\phi$ is the quantity

$$AS(\phi \Rightarrow \psi) = AS(\psi \mid \phi) = \frac{|m(\psi) \cap m(\phi)|}{|m(\phi)|} = \frac{a}{a + b} \qquad (48)$$

The quantity $0 \leq AS(\psi \mid \phi) \leq 1$ states the degree to which $\phi$ supports $\psi$. It may be viewed as the conditional probability of a randomly selected element satisfying $\psi$ given that the element satisfies $\phi$. In set-theoretic terms, it is the degree to which $m(\phi)$ is included in $m(\psi)$. Clearly, $AS(\psi \mid \phi) = 1$ if and only if $m(\phi) \neq \emptyset$ and $m(\phi) \subseteq m(\psi)$. That is, a rule with the maximum absolute support 1 is a certain rule.

The change of support of $\psi$ provided by $\phi$ is defined by

$$CS(\phi \Rightarrow \psi) = CS(\psi \mid \phi) = AS(\psi \mid \phi) - G(\psi) = \frac{a}{a + b} - \frac{a + c}{|U|} \qquad (49)$$

Unlike the absolute support, the change of support varies from $-1$ to $1$. We can consider $G(\psi)$ to be the prior probability of $\psi$ and $AS(\psi \mid \phi)$ the posterior probability of $\psi$ after knowing $\phi$. The difference between the posterior and prior probabilities represents the change in our confidence regarding whether $\phi$ is actually related to $\psi$. For a positive value, we may say that $\phi$ is positively related to $\psi$; for a negative value, we may say that $\phi$ is negatively related to $\psi$.

The change of support relative to $\psi$ is given by

$$RCS(\phi \Rightarrow \psi) = \frac{CS(\psi \mid \phi)}{G(\psi)} = \frac{AS(\psi \mid \phi)}{G(\psi)} - 1 = \frac{G(\psi \wedge \phi)}{G(\psi)G(\phi)} - 1 = \frac{|U|\,|m(\psi) \cap m(\phi)|}{|m(\psi)|\,|m(\phi)|} - 1 = \frac{a|U|}{(a + c)(a + b)} - 1 \qquad (50)$$

It is interesting to note that the first term in the relative change of support is related to the probabilistic independence of $\psi$ and $\phi$.

The generality $G(\psi)$ is related to the satisfiability of $\psi$ by all objects in the database, and $AS(\phi \Rightarrow \psi)$ is related to the satisfiability of $\psi$ in the subset $m(\phi)$. A high $AS(\phi \Rightarrow \psi)$ does not necessarily suggest a strong association between $\phi$ and $\psi$, as a concept $\psi$ with a large $G(\psi)$ value tends to have a large $AS(\phi \Rightarrow \psi)$ value. The change of support $CS(\phi \Rightarrow \psi)$, or the relative change of support $RCS(\phi \Rightarrow \psi)$, may be a more accurate indication of the association.
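The four measures (47)–(50) depend only on the contingency counts. A minimal sketch with hypothetical counts:

```python
def rule_measures(a, b, c, d):
    # Generality, absolute support, change of support and relative
    # change of support for phi => psi  -- equations (47)-(50)
    n = a + b + c + d        # |U|
    g_phi = (a + b) / n      # G(phi)
    g_psi = (a + c) / n      # G(psi), the prior probability of psi
    as_ = a / (a + b)        # AS(psi | phi), the posterior probability
    cs = as_ - g_psi         # CS(psi | phi)
    rcs = as_ / g_psi - 1    # RCS(phi => psi)
    return g_phi, as_, cs, rcs

print(rule_measures(40, 10, 20, 30))
# (0.5, 0.8, 0.2, 0.333...): the support 0.8 exceeds the prior 0.6,
# so phi is positively related to psi.
```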
4.5. Information-theoretic measures for high order rules

Recall that a set of attributes $W$ induces a partition $\pi_W$ of the universe. Let

$$P(w) = P(m(W = w)) = \frac{|m(W = w)|}{|U|} \qquad (51)$$

Shannon's entropy function of $\pi_W$, simply written as $H(P(W))$, is given by

$$H(P(W)) = E_{P(W)}[-\log P(W)] = -\sum_{w \in V_W} P(w) \log P(w) \qquad (52)$$

where $E_{P(W)}[\cdot]$ denotes the expected value with respect to the probability distribution of $W$. For two sets of attributes $W$ and $Z$, their joint entropy is defined by

$$H(Z, W) = -\sum_{z \in V_Z} \sum_{w \in V_W} P(z, w) \log P(z, w) \qquad (53)$$

The conditional entropy $H(Z \mid W)$ is defined as the expected value of the subpopulation entropies $H(Z \mid w)$ with respect to the probability distribution $P(W)$:

$$H(Z \mid W) = \sum_{w \in V_W} P(w) H(Z \mid w) = -\sum_{w \in V_W} P(w) \sum_{z \in V_Z} P(z \mid w) \log P(z \mid w) = -\sum_{z \in V_Z} \sum_{w \in V_W} P(z, w) \log P(z \mid w) = E_{P(Z,W)}[-\log P(Z \mid W)] \qquad (54)$$

Conditional entropy is nonnegative and nonsymmetric, namely $H(Z \mid W) \geq 0$ and, in general, $H(Z \mid W) \neq H(W \mid Z)$. Conditional entropy can also be expressed by

$$H(Z \mid W) = H(Z, W) - H(W) \qquad (55)$$

It measures the additional amount of information provided by $Z$ if $W$ is already known.

The probability $P(z)$ is the generality of the granule $m(Z = z)$. The function $-\log P(z)$ is a monotonic decreasing transformation of $P(z)$. As the expected value of $-\log P(z)$, the entropy function is related to the granularity of the partition $\pi_Z$. The probability $P(z \mid w)$ is a measure for the local rule $W = w \Rightarrow Z = z$. As an expected value, the conditional entropy $H(Z \mid W)$ provides a measure for the global rule $W \Rightarrow Z$. It may be viewed as an inverse measure of the global one-way association of two sets of attributes (Pawlak et al., 1988):

$$IC_1(W \Rightarrow Z) = H(Z \mid W) \qquad (56)$$

A normalized version is given by (Pawlak et al., 1988)

$$IC_2(W \Rightarrow Z) = \frac{H(Z \mid W)}{\log |V_Z|} \qquad (57)$$

For an attribute set $Z$, conditional entropy can be used to select important attributes for discovering a one-way association $W \Rightarrow Z$. Measures $IC_1$ and $IC_2$ can be used to rank attributes in increasing order. If one prefers to rank attributes in decreasing order, the following corresponding direct measures of one-way association can be used:

$$C_1(W \Rightarrow Z) = \log |V_Z| - H(Z \mid W) \qquad (58)$$
$$C_2(W \Rightarrow Z) = 1 - \frac{H(Z \mid W)}{\log |V_Z|} \qquad (59)$$

In these measures, the attribute entropy $H(Z)$ may be used in place of $\log |V_Z|$. We obtain the following measures:

$$C_3(W \Rightarrow Z) = H(Z) - H(Z \mid W) = I(Z; W) \qquad (60)$$
$$C_4(W \Rightarrow Z) = 1 - \frac{H(Z \mid W)}{H(Z)} = \frac{I(Z; W)}{H(Z)} \qquad (61)$$

Measure $C_3$ is in fact the mutual information between $W$ and $Z$. It is commonly referred to as information gain and is widely used in machine learning (Quinlan, 1986). Like the change of support for local rules, $C_3$ may be viewed as a change of entropy for global rules. Similarly, $C_4$ may be viewed as a relative change of entropy for global rules.
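The conditional entropy (54) and the information gain $C_3$ of equation (60) can be estimated from a list of observed $(w, z)$ value pairs; the sample data below are hypothetical.

```python
from math import log2
from collections import Counter

def conditional_entropy(pairs):
    # H(Z|W) = -sum P(z, w) log P(z|w)  -- equation (54)
    n = len(pairs)
    joint = Counter(pairs)
    marg_w = Counter(w for w, _ in pairs)
    return -sum(cnt / n * log2(cnt / marg_w[w])
                for (w, _), cnt in joint.items())

def information_gain(pairs):
    # C3(W => Z) = H(Z) - H(Z|W) = I(Z;W)  -- equation (60)
    n = len(pairs)
    h_z = -sum(cnt / n * log2(cnt / n)
               for cnt in Counter(z for _, z in pairs).values())
    return h_z - conditional_entropy(pairs)

pairs = [('w1', 'z1'), ('w1', 'z1'), ('w2', 'z2'), ('w2', 'z1')]
print(conditional_entropy(pairs))  # 0.5
print(information_gain(pairs))     # about 0.311
```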
5. Conclusion

While nonprobabilistic studies of rough sets focus on algebraic and qualitative properties of the theory, probabilistic approaches are more practical and capture quantitative properties of the theory. The granularity of a partition can be quantified by information-theoretic measures. Existing measures of the accuracy and quality of approximations can be expressed as probability-related measures. The probabilistic and information-theoretic approaches are particularly useful in rule induction, an important application of rough set theory.

Most of the measures discussed in this paper are based on simple counting of the number of elements of a set. Furthermore, we have restricted our discussion to granulations by equivalence relations or partitions. It should be pointed out that the argument can be easily extended to more general probability functions and general granulation structures.

References

DUDA, R.O. and P.E. HART (1973) Pattern Classification and Scene Analysis, New York: Wiley.
DÜNTSCH, I. and G. GEDIGA (2001) Rough information analysis, International Journal of Intelligent Systems, 16, 121–147.
FAGIN, R. and J.Y. HALPERN (1991) Uncertainty, belief, and probability, Computational Intelligence, 7, 160–173.
GEDIGA, G. and I. DÜNTSCH (2001) Rough approximation quality revisited, Artificial Intelligence, 132, 219–234.
GRZYMALA-BUSSE, J.W. (1987) Rough-set and Dempster–Shafer approaches to knowledge acquisition under uncertainty—a comparison, Manuscript, Department of Computer Science, University of Kansas.
GRZYMALA-BUSSE, J.W. (1992) LERS—a system for learning from examples based on rough sets, in Intelligent Decision Support: Handbook of Applications and Advances of the Rough Set Theory, R. Slowinski (ed.), Dordrecht: Kluwer Academic, 3–18.
KLIR, G.J. and T.A. FOLGER (1988) Fuzzy Sets, Uncertainty, and Information, Englewood Cliffs, NJ: Prentice Hall.
LEE, T.T. (1987) An information-theoretic analysis of relational databases—part I: data dependencies and information metric, IEEE Transactions on Software Engineering, SE-13, 1049–1061.
LIN, T.Y., Y.Y. YAO and L.A. ZADEH (eds) (2002) Data Mining, Rough Sets and Granular Computing, Heidelberg: Physica-Verlag.
PAWLAK, Z. (1982) Rough sets, International Journal of Computer and Information Sciences, 11, 341–356.
PAWLAK, Z. (1984) Rough probability, Bulletin of the Polish Academy of Sciences, Mathematics, 32, 607–615.
PAWLAK, Z. (1991) Rough Sets: Theoretical Aspects of Reasoning about Data, Dordrecht: Kluwer Academic.
PAWLAK, Z. and A. SKOWRON (1994) Rough membership functions, in Advances in the Dempster–Shafer Theory of Evidence, R.R. Yager, M. Fedrizzi and J. Kacprzyk (eds), New York: Wiley, 251–271.
PAWLAK, Z., S.K.M. WONG and W. ZIARKO (1988) Rough sets: probabilistic versus deterministic approach, International Journal of Man–Machine Studies, 29, 81–95.
QUINLAN, J.R. (1986) Induction of decision trees, Machine Learning, 1, 81–106.
SKOWRON, A. (1989) The relationship between the rough set theory and evidence theory, Bulletin of the Polish Academy of Sciences, Mathematics, 37, 87–90.
SKOWRON, A. (1990) The rough set theory and evidence theory, Fundamenta Informaticae, 13, 245–262.
SKOWRON, A. and J. GRZYMALA-BUSSE (1994) From rough set theory to evidence theory, in Advances in the Dempster–Shafer Theory of Evidence, R.R. Yager, M. Fedrizzi and J. Kacprzyk (eds), New York: Wiley, 193–236.
TSUMOTO, S. (1998) Automated extraction of medical expert system rules from clinical databases based on rough set theory, Information Sciences, 112, 67–84.
WONG, S.K.M. and P.J. LINGRAS (1989) The compatibility view of Shafer–Dempster theory using the concept of rough set, Methodologies for Intelligent Systems, 4, 33–42.
WONG, S.K.M. and W. ZIARKO (1986) Algorithm for inductive learning, Bulletin of the Polish Academy of Sciences, Technical Sciences, 34, 271–276.
WONG, S.K.M. and W. ZIARKO (1987) Comparison of the probabilistic approximate classification and the fuzzy set model, Fuzzy Sets and Systems, 21, 357–362.
YAO, J.T. and Y.Y. YAO (2002) Induction of classification rules by granular computing, Proceedings of the Third International Conference on Rough Sets and Current Trends in Computing, LNAI 2475, Berlin: Springer, 331–338.
YAO, Y.Y. (1996) Two views of the theory of rough sets in finite universes, International Journal of Approximate Reasoning, 15, 291–317.
YAO, Y.Y. (2000) Granular computing: basic issues and possible solutions, Proceedings of the 5th Joint Conference on Information Sciences, Association for Intelligent Machinery, 186–189.
YAO, Y.Y. (2001a) Information granulation and rough set approximation, International Journal of Intelligent Systems, 16, 87–104.
YAO, Y.Y. (2001b) Modeling data mining with granular computing, Proceedings of the 25th Annual International Computer Software and Applications Conference, 638–643.
YAO, Y.Y. (2003a) Mining high order decision rules, in Rough Set Theory and Granular Computing, M. Inuiguchi, S. Hirano and S. Tsumoto (eds), Berlin: Springer, 125–135.
YAO, Y.Y. (2003b) Information-theoretic measures for knowledge discovery and data mining, in Entropy Measures, Maximum Entropy Principle and Emerging Applications, Karmeshu (ed.), Berlin: Springer, 115–136.
YAO, Y.Y. and P.J. LINGRAS (1998) Interpretations of belief functions in the theory of rough sets, Information Sciences, 104, 81–106.
YAO, Y.Y. and S.K.M. WONG (1992) A decision theoretic framework for approximating concepts, International Journal of Man–Machine Studies, 37, 793–809.
YAO, Y.Y. and N. ZHONG (1999) An analysis of quantitative measures associated with rules, Proceedings of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining, LNAI 1574, Berlin: Springer, 479–488.
YAO, Y.Y., S.K.M. WONG and P.J. LINGRAS (1990) A decision-theoretic rough set model, in Methodologies for Intelligent Systems, Vol. 5, Z.W. Ras, M. Zemankova and M.L. Emrich (eds), New York: North-Holland, 17–24.
ZADEH, L.A. (1979) Fuzzy sets and information granularity, in Advances in Fuzzy Set Theory and Applications, M.M. Gupta, R. Ragade and R. Yager (eds), Amsterdam: North-Holland, 3–18.
ZADEH, L.A. (1997) Towards a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems, 90, 111–127.
ZHANG, B. and L. ZHANG (1992) Theory and Applications of Problem Solving, Amsterdam: North-Holland.
ZIARKO, W. (1993) Variable precision rough set model, Journal of Computer and System Sciences, 46, 39–59.

The author

Yiyu Yao

Yiyu Yao is a professor of computer science in the Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada. His research interests are information retrieval, Web intelligence, data mining, fuzzy sets, rough sets and granular computing. He received a PhD degree from the University of Regina, Canada. He has published over 100 journal and conference papers. He is a member of the editorial board of the Web Intelligence and Agent Systems journal (IOS Press). He has served and is serving as a program co-chair of three international conferences, and as a program committee member of over 20 international conferences. He is a member of the IEEE and the ACM.