key: cord-1022119-orq82f3c
authors: Zhang, Xiaoping; Li, Jinjin; Li, Weikang
title: A new mechanism of rule acquisition based on covering rough sets
date: 2022-02-04
journal: Appl Intell (Dordr)
DOI: 10.1007/s10489-021-03067-x
sha: 85b0d3390eb05788ab491a8516f5aeef9ef82727
doc_id: 1022119
cord_uid: orq82f3c

Rule acquisition, known as knowledge acquisition, is an important and topical issue in granular computing theory. Granules are not only composed of objects but also have feature values. However, In the granule associativity rules, the traditional rule extraction methods fail to consider the influence of granules on the decision, thus the method is not well adapted in reality. On the other hand, the existing methods lack a rule-based measure for information systems. In this paper, the action parameters are first introduced to establish a more realistic granule associativity rule in covering information system. Further, we present the rule-based data potential to address the measurement problem. In addition, rule-based scale selection in multi-scale covering rough sets is explored, followed by a scale combination integrating generalization capability, data potential, and lower approximation. Finally, algorithm is designed and experiments are conducted.

Granular computing, a data processing method, has become a popular research direction [17] . In granular computing, granule is the basic unit for solving problems and generally known as an indivisible element or subset of the set [23] . Thus, the main research problem of granular computing is how to construct and use information granules to calculate [27] . Different level and thickness of granules have different influence on the problem solving process. For example, we will gain smaller knowledge granularity if the description of an object is more careful. However, we will gain the lower generalization ability in acquiring knowledge and greater complexity in solving the problem at the same time.

Information system is a common carrier used for granular computing [19] . In order to deal with incomplete and fuzzy data, Pawlak proposed the classical rough set theory based on the idea that upper and lower approximation operators can approach the target set [12, 13] . In data mining, we do not need any priori knowledge using rough set theory. Thus, the theory has attracted considerable attention in the area of computer and mathematics since it was proposed [16, 30] . At present, the theory and its innovative extensions have been widely used in data mining, medical diagnosis, decision analysis, feature selection and so on [14] .

Classical rough set is based on equivalence relation [12, 13] . But an object may belong to multiple classes in practice. Therefore, Zakowski proposed the covering rough set in 1983 [24] . Covering rough set is one of the important generalized models. Many scholars constructed different upper and lower approximation operators of covering rough set in different perspectives [1, 4, 14, 15, [20] [21] [22] [29] [30] [31] [32] . However, these studies are limited to single-scale covering. Subsequently, some scholars extended the covering rough set from single scale to multi-scale [7] [8] [9] [10] 25] . It is more comprehensive and flexible to apply multi-scale covering rough set to data mining. In multi-scale covering rough set, not only knowledge acquisition at different levels but also knowledge and granularity conversion between different levels can be realized.

In data mining, rule extraction serves as knowledge discovery [19, 27] . A rule contains predecessor and successor, i.e. the condition and decision. The certain factor of the rule is essentially the degree to which condition granule supports decision granule. According to the idea of rough set theory, the lower approximation represents certain classification and certain rules, while the upper approximation represents uncertain information and uncertain (i.e., possible) rules [18] . There have been many studies on rule extraction [2, 5, 6, 11, 18, 26, 28] .

In this paper, we firstly propose a rule validity function to reflect the classical rule and promote it to fit the granule associativity rule. Then motivated by the adaptability of rule extraction methods and the measurement of decision descriptions, more effective granule associativity rules are established by considering the impact of different granules on decision descriptions. There are three types of objects holding intensities of rules, i.e., strongly, generally and barely. Further, the rule-based data potential of information systems for decision descriptions is reflected by considering the rule-based classification of objects and then measuring the number of objects. We mainly use it to evaluate the potential of information systems. Subsequently, we explore the use of the new rule extraction mechanism in scale selection. The optimal scale combination based on rule extraction is put forward for the first time. The final work is to conduct experiments on three UCI datasets [3] , including Iris Plants, Hayes-Roth and Somerville Happiness Survey. As a result, we present a new mechanism of rule acquisition based on covering rough sets, which can be more effectively fitted to the reality than the classical rule extraction.

The rest parts of the paper are organized as follows. Some related work on our proposed mechanism of covering rough set and rule acquisition is introduced in Section 2. In Section 3, some basic notions of covering rough set and rules extraction are reviewed, and some notations are explained. In Section 4, the mechanism of rule extraction is studied, and rule-based data potential is proposed. In Section 5, the proposed mechanism of rule acquisition is employed for scale selection. In Section 6, algorithm for calculation of rule-based data potential is given, and several experiments are shown. Finally, we wrap up the paper with a summary and an outlook of further research in Section 7.

In this section, we mainly review some ralated studies in covering rough set and rule extraction.

Since the covering rough set was proposed, there have been two studies directions, approximate operators and multi-scale covering. In [31] , Zhang unified the operators of covering rough set in neighborhood systems based on elements, granules and subsystems and then presented the properties. In [22] , Yao summarized and generalized the framework of covering approximate operators based on element, granularity and subsystem. Both of the works are basic but very important. Zhang et al. [29] described the covering rough set in an axiomatic way and gave a more detailed axiom set. Axiomatic characterizations of covering-based approximation operators guarantee the existence of coverings reproducing the operators. In terms of multi-scale covering, Li [7] constructed the scale relation based on covering and extended the multi-scale covering rough set model. Since then, the relation between scales is no longer limited to the usual coarse and fine relationships. The study is also extended to the fuzzy set field. For example, Zhan [25] studied multi-attribute decision making based on the mult-grain (I, T) fuzzy rough set model. Combined with the fuzzy set, theoretical and applied research on rough sets will become more and more extensive.

In 1991, Pawlak began to study the rule extraction in rough set [13] . However, Pawlak's model was only applied to symbolic data. Thus, the extended researches of rule extraction have been made by many scholars. In 1998, Kryszkiewicz proposed an incomplete information system based on tolerance relationship and derived its knowledge [5, 6] . In 2011, Duyong et al. [2] utilized the neighborhood rough set model to extract rule on numerical data. Zhang proposed a rule classification method for symbolic data in the covering [26] . Further, many researchers realized that data structures have complexity and heterogeneity and tried to overcome it [11, 28] . In 2013, Zhang et al. [28] realized multi-confidence rule acquisition in the covering decision system by using the idea of combinatorial optimization. Meng and Shi explored the rule acquisition in incomplete heterogeneous information system in 2020 [11] . But these studies focused more on the attribute reduction rather than on generalization ability of rules. In covering rough set, different description granules of an object may support the same decision in different degrees. In 2017, Wu [18] combined belief function and plausibility function with rough set theory to extract rule in incomplete multi-scale information systems. The work was systematically and completely but didn't show the different effects of the description granules to decision granule. The existing methods do not consider the interactions between different granules, so the extracted rules are poorly adapted. In addition, decision classes in information systems determine the diversity of knowledge, yet little research has been done on this aspect. Motivated by this, we will discuss the new mechanism of rule acquisition based on covering rough sets.

Different relations but not limited to equivalence relations yield different granularities. A coverage can be considered as the granularity of a finite and nonempty universe of discourse U = {x 1 , x 2 , ⋯ , x n } . In this section, we review some basic conceptual knowledge of covering information system (CIS) and rule extraction. Please see References [14, 16, 22, 32] for more details.

For a decision information system (DIS) (U, A, d), define a binary relation R a = {(x, y) ∈ U × U|a(x) = a(y)} , where a ∈ A is called a condition attribute, a(⋅) represents the attribute value of the object, and the partition of U consisting of all decision classes is denoted as U/d. Clearly, we can obtain a covering {(x) a |x ∈ U} of U by R a , where (x) a = {y ∈ U|(x, y) ∈ R a } . This is one but not the only way to generate coverage by a DIS. Thus the CIS is defined as follows.

Definition 1 A covering information system is an ordered pair (U, ℂ) , where ℂ is a family of coverings of U. Meanwhile, (U, ℂ, d) is known as the covering decision information system, where d is a decision attribute.

Notice that attribute induces the binary relation on U to further generate granularity in the covering information system. Generally, complete, incomplete, set-valued or intervalvalued information systems can be handled by the covering information system approach. In Example 1, we imitate the application of a covering information system in a medical diagnosis scenario.

In medical diagnosis, giving a diagnosis based on symptoms is making a decision. Table 1 records the symptoms of 10 people who went to the doctor at a hospital and the final diagnosis of COVID-19. Normal body temperature (BT) is less than 37.3 • C, otherwise there is a fever. There are four degrees of cough, difficult breathing (DB), and limbs weakness (LW): no, slight, general and severe. In terms of suffering from COVID-19 (SfC), we have two decision classes: no and yes. How do you generate a covering information system based on Table 1 ?

For the conditional attribute body temperature, group the ID with the same temperature value together, so we get a covering as follows,

In [22] , Yao summarized four common neighborhood operators for a covering C of U as follows, In addition, there are two pairs of dual granule-based lower and upper approximation operators as follows,

Rule acquisition is to extract judgment criteria with logical connection from empirical data. The symbols ∧ , ∨ and → denote the conjunction, disjunction, and implication, respectively. Any attribute value pair (a, v) is called atom, where a is the attribute, v ∈ V a . The logical connection between atoms is called a description. If t is a description, the attribute appearing in t is denoted as

The object set with description is called the support set represented by

The set consisting of all the condition descriptions of the information system is denoted as T.

Assume that t is a description consisting of condition attributes and their attribute values, and s = (d, v) is a decision description, v ∈ V d . Then t → s is called a decision rule. Sometimes we also use r to denote a decision rule.

The certain factor of the decision rule is

The decision rule with Cer(t → s) = 1 is called certain rule, and the decision rule with 0 < Cer(t → s) < 1 is called the possible rule. The validity of the rule can be judged by the certain factor, see Example 2.

Example 2 According to Example 1, three rules are obtained as follows,

Calculate their certain factors as follows,

Since

On the contrary, the other two are valid rules. In addition, we obtain the lower and upper approximations of the two decision classes ||(SfC, no)|| and ||(SfC, yes)|| as follows,

Furthermore, it is easy to conclude that the lower and upper approximations of equivalence classes generate the certain rules and possible rules respectively.

The classical rough set method theory presents that a conjunctions operation is used to combine different granules to construct the antecedents of rules. Fig. 1 shows the action mechanism of the classical rule extraction method on different granules. In Fig. 1 , G 1 , G 2 , G 3 and G 4 respectively represent granules from different attributes, then the intersection is obtained through intersection operation ∧ , and finally the corresponding rules are obtained by using the certain factor Cer. In this section, a new mechanism of rule acquisition in covering information system is first proposed.

In a granularity space, granules are the basic unit of computation. People can detect the validity of rules by the interaction between granules. The definition of the rule validity function is presented as follows.

Definition 4 Let (U, ℂ, d) be a granularity decision space, where ℂ is a family of condition granularity, and d is a decision granularity, |U| = n , |C| = m . Consider a function

A larger value for the validity of a rule indicates that the rule is more trustworthy. It is obvious that the certain factor Cer is a rule validity function. We can conclude from Fig. 1 that in the classical rule extraction method, all the condition granules involved in the rule have the same effect on the decision granule. However, in reality different condition granules act differently on the decision granule which causes discrepancies with reality in the classical approach. As a result we employ granule action parameters to address this problem and then propose granule associativity rule, as follows.

When V = Cer , assuming (1)

we can use Fig. 2 to illustrate the granule associativity rule extraction process. The color-filled blocks in Fig. 2 indicate that they need to be taken into account when we do rule extraction.

Comparing Fig. 2 with Fig. 1 , it can be concluded that the rule extraction method proposed in this paper comprehensively considers the effect of condition granules on decision granules to extract granule associativity rules by (1) . In addition, an interesting and significant theorem is obtained as follows.

Theorem 1 When 1 = ⋯ = m = 1 , the granule associativity rule is the same as the ordinary rule.

Proof It is obvious according to Definition 5. Table 3 .

The columns 3-6 of Table 3 , respectively, represent the certain factor, the validity during the outbreak, stabilization and mutation periods. From Table 3 , it can be seen that the rule validity is generally minimum during the period of mutation. In addition, the rule validity of R1 and R2 are equal, and the rule validity of R3, R4, R5 and R6 are equal. Therefore, according to the data in Table 3 , the variation of the rule validity in different periods can be obtained, as shown in Fig. 3. In Fig. 3 , the red line represents the variation of the rule validity using the traditional method for three different periods, while the blue line shows the application of the method proposed in this paper. The abscissa values 1, 2 and 3 indicate the outbreak, stabilization and mutation periods, respectively. Through the analysis of Fig. 3 , the conclusions can be drawn as follows.

• The rule validity decreases as the pneumonia virus transitions from outbreak to mutation. Therefore, during the mutation period, we need to find new symptoms to improve rules. • Traditional method cannot reflect the variation of rule validity in different periods, but the method proposed in this paper can.

In a covering information system, the ultimate diversity of rules lies in the decision classes. The identification of decision class, which is one of the most important applications of information system, depends on the object in U. Inspired by the rules, the rule-based classification of object is carried out.

Given a rule validity function V, let s be a decision description, and are two non-negative parameters,

According to Definition 6, when x ⇒ ≥ s , x strongly supports decision class ||s||. When x ⇒ ( , ) s , x generally supports decision class ||s||. However, when x ⇒ ≤ s , the decision class ||s|| is barely supported by x. The potential of an information system for a decision class is determined by the object. To measure the potential, the sets {x ∈ U|x ⇒ ≥ s} , , Ω s 2 and Ω s 3 constitute a partition of U.

Assume that the variable E ≥ 0 . If E is respectively positive correlation with 1 and 2 , and negative correlation with 3 , then E is said to be rule-based data potential for s.

In accordance with Definition 7, we attempt to construct the rule-based data potential calculation formula as follows, (2) reveals that when 1 and 2 are increased by the same amount separately, the increase brought by 1 of E is greater than the increase brought by 2 of E. On the other hand when 1 = 2 = 0 , E = 0 . The denominator 3 + 1 is not going to be equal to 0.

In the case of 1 = ⋯ = m = 1 and = 1 , = 0 , the rule extraction can be transformed into classical rule extraction by Equation (1) . An interesting theorem is presented as follows.

( .

In addition, a one-to-one mapping can be established between T and ℂ . Hence, based on Definition 6, Theorem 2 is straightforward.

The traditional method of rule extraction lacks the evaluation of the information regarding decision class, which leads to the inability to pick a suitable information system. Theorem 2 shows that the rule-based data potential can be considered in classical rule extraction. Item (4) of Theorem 2 states that based on Ω s 1 and Ω s 2 , the approximation accuracy of the equivalence class is not reduced. A greater data potential refers to the more information contained in the information system. As a result, after finding the specific rule validity and rule-based data potential, E can be employed as the measurement criterion for choosing an information system. In other words, for a decision description (d, v), an information system with a greater E should be selected for consideration. On the other hand, based on rule-based data potential, a comprehensive comparison of all decision descriptions can be drawn. An example of the application of rule-based data potential in business management is given as follows.

Wages are the key to good management. There are two groups of employees in a certain department of a company. The two groups of employees are numbered 1001 to 1010, and 2001 to 2010, respectively. Tables 4 and 5 collect the performance and salary of the employees respectively. There are three levels of salary, I:1000∼1100, II:1100∼1500 and III:1500∼2000, unit: US dollar. Two factors that affect salary are the completed workload (CW) and the number of absences (NoA). The company needs to choose either Tables 4 or 5 as the reference for the coming salary level.

From Table 4 , we derive that as follows,

From Table 5 , we derive that as follows, The level  of salary   1001 100 4  I  1002 400 2  II  1003 600 1  III  1004 200 3  I  1005 200 3  II  1006 100 4  I  1007 100 3  I  1008 300 2  II  1009 400 2 II 1010 500 1 III Table 5 The performance and salary of the employees ID CW NoA The level  of salary   2001 500 1  III  2002 100 4  I  2003 200 3  I  2004 300 3  II  2005 400 2  II  2006 200 3  I  2007 200 3  II  2008 100 4  I  2009 400 2  II  2010 According to Theorem 2, the results are displayed as follows by calculation.

For Table 4 , {1002, 1003, 1008, 1009, 1010} , 1002, 1004, 1005, 1006, 1007 , 1008, 1009}.

For Table 5 , .

Since E I,Table5 > E I,Table4 , choose Table 5 as the reference for the salary level I; since E II,Table4 > E II, Table5 , choose  Table 4 as the reference for the salary level II; since E III,Table4 = E III, Table5 , choose Tables 4 or 5 as the reference for the salary level III.

Scale selection can effectively improve the generalization ability of the covering information system. However, the different scales inherently cause a shift in the rules. In this section, scale selection based on rule extraction is explored on multi-scale covering information system.

Let C ′ and C ′′ be two coverings of U, respectively. We say that there is a scale relationship between C ′ and C ′′ , denoted as C ′ ⊳ C ′′ , if and only if there exists a function f ∶ C � → 2 C �� satisfying the following conditions:

A covering information system (U, ℂ 1 ∪ ⋯ ∪ ℂ m ) is said to be a multi-scale covering information system if for any C � , C �� ∈ ℂ j , C ′ ⊳ C ′′ , where j = 1, ⋯ , m , m is a positive integer, and ℂ j is a family of coverings. At the same time, 

Assume that if S � ⪯ S �� and S �� ⪯ S � , then S � = S �� . Thus, ( , ⪯) is a partial order set.

In the partial order set ( , ⪯) , S * is known as optimal scale combination based on rule extraction meeting the requirements as follows,

where E * s and E * s 0 are the rule-based data potentials of S * for s and s 0 respectively, and E s , E s 0 the rule-based data potentials of S for s, s 0 .

Sometimes, there is no optimal scale combination based on rule extraction in a multi-scale covering decision information system. However, an interesting theorem is presented as follows.

Theorem 3 In the partial order set ( , ⪯) , for s ∈ U∕d , there exists S * ,s ∈ such that

where E * s and E s are the rule-based data potentials of S * ,s and S for s respectively. Proof Due to the finite set , it is straightforward.

Theorem 3 states that for a decision description in a multi-scale covering decision information system, there is an optimal scale combination that maximizes the data potential of the description. In the previous method of scale combination selection, the scale combination with greater generalization ability is selected by keeping the lower approximation of the equivalence class of S 1 . What is amazing is the introduction of the following theorem. Theorem 4 delivers a way to construct optimal scale combination based on rule extraction with keeping the lower approximation. Meanwhile, Theorem 4 theoretically proves that there exists a scale combination such that the generalization ability and data potential of the model are optimal and the lower approximation is maintained.

S 1 = {C 1 1 , ⋯ , C 1 m } , a decision description s, apr S 1 N 1 (||s||) and Ω s 1 , Ω s 2 , Ω s 3 . For x ∈ (Ω s 1 ∪ Ω s 2 ) − apr S 1 N 1 (||s||) , y ∈ Ω s 3 , K x ∈ C 1 j,x , K y ∈ C 1 j,y , let K xy = K x ∪ K y , C j = (C 1 j − {K x , K y }) ∪ {K xy } , S = {C 1 1 , ⋯ , C 1 j−1 , C j , C 1 j+1 , ⋯ , C 1 m } , then it is concluded that 1. S 1 ⊳ S, 2. E s ≥ E 1 s , 3. apr S N 1 (||s||) = apr S 1 N 1 (||s||),

In this section, firstly, Algorithm 1 is designed to calculate the data potential. Then experiments are carried out to illustrate the effectiveness of the algorithm and the comparison of data potentials at different scales.

Algorithm 1 is designed to calculate the rule-based data potential for an equivalence class D of an information + 1) ).

In order to demonstrate the effectiveness of the algorithm and the differences in data potential at different scales, we randomly obtained three data sets from UCI [3] as the original data of the experiment that fit the rule extraction model, i.e. all have only one decision attribute. All of the data sets are numerical data. These three data sets are 'Iris Plants', 'Hayes-Roth' and 'Somerville Happiness Survey'. Where 'Iris Plants' was also employed by references [11] and [26] . Tables 6 and 7 contain the basic and key information for the data sets. We directly use the raw data as the data combination at the first scale. By dividing the range equally, we can construct different scale combinations. Table 8 is the basic information of the constructed scale combination. After combinating the attribution, Iris Plants has three scales, Hayes-Roth has two scales,so does Somerville Happiness Survey. Let V = Cer , = 1 , and = 0 . According to Algorithm 1, MATLAB software is used to calculate the data potential. Some results are shown in Tables 9, 10 and 11. Based on these results, the effectiveness of Algorithm 1 is obvious.

According to Tables 9, 10 and 11, we can get the comparison results of the data potential of information system at different scales. As for 'Iris Plants', it is noted from Table 9 that the original information system has the smallest data potential 5.069 for class 1 in the first scale. In Table 9 , as the scale changes from the first to the third, |Ω 2 | of all equivalence classes is getting system. In Algorithm 1, if p = ∏ m i=1 �C i � , then x ∈ Ω 1 ; if q = ∏ m i=1 �C i � , then x ∈ Ω 3 . Lastly, Ω 2 = U − Ω 1 − Ω 3 . Let h = max{|C 1 |, ⋯ , |C m |} . Lines 4-8 possess the time larger, while |Ω 1 | and |Ω 3 | are getting smaller, and thus the data potential is getting larger. However, for 'Hayes-Roth' and 'Somerville Happiness Survey', it can be concluded from Tables 10 and 11 that the scale change does not cause a change in the data potential. In fact, for |Ω 3 | of all the equivalence classes, their values are 0 in Tables 10 and 11 , so it is impossible to construct the scale combination with greater data potential through Theorem 4.

Compared with the traditional rule extraction model, the granule associativity rule proposed in this paper enjoys better adaptability. The introduced action parameters address the validity of the rule. The objects are classified by rules, and then the measure of covering information systems about the decision class are carried out. The rule-based data potential is presented as a criterion for evaluating information system and is well employed in the scale selection for multi-scale covering rough set. Such a mechanism of rule acquisition further develops the rough set theory.

In the future we will devote to the association of action parameters with fuzzy measures and the use of improved rule extraction methods in larger data scenarios. What's more, the fusion of the three-way decision with the rule extraction mechanism arouses our interest. The theory is advised for pattern recognition, medical diagnosis and other fields. Table 10 The rule-based data potential of 'Hayes-Roth' under the different scales

The first scale

The second scale 

A new approach to attribute reduction of consistent and inconsistent covering decision systems with covering rough sets

Rule learning for classifification based on neighborhood covering reduction

UCI machine learning repository

Fuzzy-rough attributes reduction with application to web categorization

Rough set approach to incomplete information systems

Rules in incomplete information systems

A new rough set model based on multi-scale covering

Covering rough set model based on multigranulations. Fuzzy Set

Covering fuzzy rough set based on multigranulations

On multi-granulation covering rough sets

On rule acquisition methods for data classification in heterogeneous incomplete decision system

Rough sets

Theoretical Aspects of Reasoning about Data

The further investigation of covering-based rough sets: uncertainty characterization, similarity measure and generalized models

Fuzzy covering generalized rough sets (in chinese)

Relationships among generalized rough sets in six coverings and pure reflflexive neighborhood system

Theory and applications of granular labelled partitions in multi-scale decision tables

On rule acquisition in incomplete multi-scale decision tables

Granular computing and theory method of measurement of uncertain information(in Chinese)

On generalizing rough set theory

Neighborhood systems and approximation retrieval

Covering based rough set approximations

Towards a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic

Approximations in the space

Covering based multigranulation (i, t )-fuzzy rough set models and applications in multi-attribute group decision-making

Representative-based classification through covering-based neighborhood rough sets

Multi-granularity knowledge acquisition and uncertainty measure (in Chinese)

Multiconfifidence rule acquisition oriented attribute reduction of covering decision systems via combinatorial optimization

On minimization of axiom sets characterizing covering-based approximation operators

Relationships between covering-based rough sets and relation-based rough sets

Relationships between generalized rough sets based on covering and reflflexive neighborhood system

Covering rough sets based on neighborhoods: an approach without using neighborhoods