key: cord-0047127-frrxbklw
authors: Hu, Mengjun
title: Concept Analysis Using Quantitative Structured Three-Way Rough Set Approximations
date: 2020-06-10
journal: Rough Sets
DOI: 10.1007/978-3-030-52705-1_21
sha: 327404350c28f08a71a039e71456e5777f32b5ca
doc_id: 47127
cord_uid: frrxbklw

One important topic of concept analysis is to learn an intension of a concept through a given extension. In the case where an exact intension cannot be formulated due to limited information, rough set theory introduces approximations to roughly learn the intension. Pawlak originally proposes a qualitative formulation of approximations which allows no error in the learned intension. Various quantitative formulations have been studied as generalizations, most of which use probabilistic measures. In contrast, non-probabilistic formulations have not been fully investigated. On the other hand, three-way approximations and structured approximations have been proposed to emphasize the semantics of approximations for the purpose of learning and interpreting intension. To combine the benefits of these two directions of generalizations, this paper investigates quantitative structured three-way approximations based on both probabilistic and non-probabilistic measures in the context of both complete and incomplete information.

Concept analysis is one common application of rough set theory [18, 19] . A concept can be formally represented by a pair of intension and extension [3] where the intension describes the definition and the extension lists all instances. Concept analysis is usually based on a dataset represented in a tabular form with rows as objects and columns as attributes [7, 20, 21, 24, 31, 32] . The attributes are used to describe the properties of objects as well as to formulate intensions. While learning extension from a given intension is not difficult, the opposite task may be complicated. In particular, we may not be able to find an exact intension of a given extension due to insufficient, incomplete, or limited information.

To solve the above issue, rough set theory introduces the concept of approximations to roughly approximate the true intension. As illustrated by Fig. 1(a) [12] , the set of all objects in a given table, represented by the biggest rectangle, is divided into pieces called definable sets. Each definable set can be precisely described by an intension and is used to approximate a given extension. Pawlak [18, 19] proposes a pair of lower and upper approximations, where the lower approximation corresponds to the positive region in Fig. 1 (a) and the upper approximation corresponds to the union of the positive and boundary regions. A set of classification rules is derived from an approximation by using the intension of a definable set as the premise of one rule. All such rules together approximate the true intension and can be used to classify instances of the concept. By interpreting classification rules through associated actions, Yao [26] further considers the negative region, which leads to a three-way approximation consisting of the positive, boundary, and negative regions. Semantically, the positive and negative regions are associated with actions of accepting and rejecting instances of the concept, respectively. The boundary region is associated with a non-commitment action, which reflects the limitation of our knowledge. [12] There are at least two directions in which the above approximations can be improved. The first direction is to allow a certain rate of misclassification in order to enlarge the positive and negative regions and shrink the boundary region. Instead of the qualitative set-inclusion used in Fig. 1(a) , a consideration of quantitative measures results in various quantitative rough set models. Most related research uses probabilistic measures [9, 23, 29, 33 ] and a few considers non-probabilistic [10, 28] . Yao and Deng [28] propose a general framework of formulating both probabilistic and non-probabilistic approximations based on subsethood measures whose properties are further studied in [11] . A few related works regarding concept-based non-probabilistic classifiers are investigated in [17] , which may inspire the research on non-probabilistic quantitative rough set models.

The second direction is to build explanation-oriented approximations that emphasize on explaining and understanding the semantics. Most formulations of approximations focus on which objects should be included without due attention to their descriptions which are necessary to formulate rules. The lower and upper approximations proposed by Pawlak in [18, 19] and the three-way approximations proposed by Yao in [26] are defined as sets of objects, which are better illustrated by Fig. 1(b) . A lack of internal structure leads to certain difficulties in deriving and interpreting classification rules. In contrast, a structured approximation [2, 12] is defined as a set of definable sets instead of a set of objects. With clearer semantics, structured approximations can be conveniently and meaningfully applied to learn concepts with incomplete information, where most existing rough set models face a common challenge of interpreting their approximations in order to formulate classification rules [12, 15] . This paper studies quantitative structured approximations as improvements in both directions. More specifically, we investigate quantitative structured approximations based on both probabilistic and non-probabilistic measures with both complete and incomplete information. This work focuses on exploring meaningful approaches to building explanation-oriented quantitative structured approximations. Accordingly, we present conceptual formulations of approximations that emphasize on the semantics, rather than computational formulations that emphasize efficient computations in practice. Further discussions on conceptual and computational formulations can be found in [4, 12, 16, 25] .

In the remainder of this paper, Sect. 2 reviews qualitative structured approximations with both complete (Sect. 2.1) and incomplete (Sect. 2.2) information. Section 3 explores the generalizations into quantitative structured approximations, including both complete (Sect. 3.1) and incomplete (Sects. 3.2 and 3.3) information. Conclusion and future work are discussed in Sect. 4.

This section reviews the main results of qualitative structured approximations proposed in [12] with both complete and incomplete information.

An information table is formally used in rough sets to represent a given dataset.

In the case of complete information, a complete information table is formulated as the following tuple:

where OB is a finite nonempty set of objects, AT is a finite nonempty set of attributes, V a is the domain of an attribute a, and I a is an information function which maps each object to a unique value in V a . This unique value reflects the complete information or our complete knowledge. Logic formulas regarding attributes and their values are used as formal descriptions of objects. By arguing that a consideration of logic conjunction is sufficient for the rule-learning purpose, Hu and Yao [12] use a conjunctive description language DL c consisting of formulas defined as follows:

(1) Atomic formulas: (a = v) ∈ DL c , where a ∈ AT and v ∈ V a ;

(2) Composite formulas: if p, q ∈ DL c , and p and q do not share any attribute, then p ∧ q ∈ DL c .

The satisfiability of a formula by an object, denoted by |=, is defined as:

where o ∈ OB, a ∈ AT , v ∈ V a , and p, q, p ∧ q ∈ DL c . Accordingly, a formula is associated with a set of objects exhibiting its meaning.

For a formula p ∈ DL c , the set of objects:

is called the meaning set of p.

On the other hand, objects in m(p) can be uniformly described by p and thus, is considered to be definable. By using a formula as intension and its meaning set as extension, one can form a definable concept.

DEF(T ) is widely used to represent the family of definable sets in recent works [4, 22, 25] . Accordingly, we use CDEF(T ) to represent the family of conjunctively definable concepts which is used to construct structured approximations.

For a set of objects X ⊆ OB, its structured positive and negative regions [12] are defined as:

where X c is the complement of X.

The boundary region is commonly defined through the positive and negative regions. With respect to Definition 3, the structured boundary region of X can be defined as:

From the view of learning intension, we are not interested in the boundary region since it doesn't lead to classification rules for recognizing either positive or negative instances of the concept.

Most research in the literature applies unstructured approximations [18, 19] which can be expressed as [12] :

As argued and illustrated in [12] , the structured approximations benefit rule learning with clear semantics obtained through preserving the internal structure as well as introducing intensions which are left-hand-sides of rules.

Although an object actually takes exactly one value on an attribute, due to our limited or incomplete information, we may not be able to know this actual value. In such a case, an incomplete information table is used, which can be formally represented as the following tuple:

where OB, AT , and V a have the same meanings as in a complete table, and I a maps each object to a nonempty subset of V a . Every value in I a (x) may be the actual value of an object x ∈ OB on an attribute a ∈ AT , but exactly one value is indeed the actual one which we do not know due to incomplete information.

Lipski [14] equivalently interprets an incomplete table as a family of complete

One gets a completion of T by picking up exactly one value for each object on each attribute. Since each value in I a (x) represents one possibility of the actual value, a completion is a possibility of the actual table and called a possible world. Accordingly, Lipski's interpretation is called the possible-world semantics of an incomplete table. The family of all completions of T is denoted as COMP( T ). The meaning set of a formula p in a completion T , denoted by m(p|T ), is a possibility of its actual meaning set. The collection of p's meaning sets in all completions covers all possibilities of p's actual meaning set and can be used to interpret p. Definition 4. The meaning set of a formula p ∈ DL c in an incomplete table T is defined as:

It is verified that m(p) is actually an interval set defined as [12] :

The sets m * (p) and m * (p) are the lower and upper bounds of m(p), respectively. The interval set [m * (p), m * (p)] contains all sets in-between these two bounds (inclusive). Moreover, the two bounds can be computed as:

The lower bound m * (p) contains objects satisfying p in every possible world, that is, they must satisfy p in the actual table and be included in p's actual meaning set. Similarly, the upper bound m * (p) contains objects satisfying p in at least one possible world, that is, they possibly satisfy p in the actual table and may be included in p's actual meaning set. By means of m(p), the definability can be generalized with respect to an incomplete table.

The family of conjunctively definable interval concepts CDEFI( T ) is used to construct the structured approximations in an incomplete table.

Definition 6. Given a set of objects X ⊆ OB in an incomplete table T , two pairs of structured regions are constructed as [12] :

SPOS * (X) and SNEG * (X) are called lower structured regions, and SPOS * (X) and SNEG * (X) are upper structured regions.

A lower structured region requires an exhaustivity of the set-inclusion relationship between a set in m(p) and X (or X c ), and an upper structured region requires an existence of such a relationship. The two lower structured regions give the lower bounds of the actual structured positive and negative regions, respectively, and the upper structured regions give the upper bounds [12] .

In this section, we generalize the qualitative structured regions into quantitative structured regions, in both complete and incomplete tables, based on two types of quantitative measures, namely, probabilities and subsethood measures. The generalization with respect to a complete table is straightforward based on existing research on quantitative unstructured approximations. In contrast, the generalization with respect to an incomplete table needs further investigation.

A probabilistic rough set model [27] replaces the qualitative set-inclusion with quantitative probabilities in defining unstructured approximations. By the same idea, if an object described by p ∈ DL c has a high probability of being a positive instance of X, then the concept (p, m(p)) is included in the structured positive region. This leads to the following probabilistic structured regions.

For a set of objects X ⊆ OB, its probabilistic structured positive and negative regions are defined as:

where 0 ≤ α, γ ≤ 1 are two thresholds, a dot represents a non-relevant threshold, and the probabilities are computed as:

The qualitative structured regions can be viewed as a special case of the probabilistic structured regions with α = γ = 1.

(α,·) (X) and SNEG pr (·,γ) (X)

As shown in Fig. 2(a) , (p, m(p)) is included in SPOS pr (α,·) (X) if the intersection between m(p) and X occupies a large portion of m(p). Accordingly, p is used to classify positive instances of X with an error rate:

which is called the rate of incorrect acceptance error (IAE) [5] . Similarly, a concept (q, m(q)) ∈ SNEG pr (·,γ) (X) is associated with the following rate of incorrect rejection error (IRE):

At the expenses of IAE and IRE, we are able to approximate a larger part of X compared with the qualitative regions.

Most existing research on quantitative rough sets is based on the probabilistic (unstructured) approximations, such as decision-theoretic rough sets [29] , gametheoretic rough sets [10] , information-theoretic rough sets [6] , naive Bayesian rough sets [30] , confirmation-theoretic rough sets [9] , and Bayesian rough sets [23] . Variable precision rough sets [33] can be viewed as both probabilistic and non-probabilistic in the sense that approximations are defined in terms of precisions which are estimated through probabilities.

In contrast, there is limited research [8, 13, 28] on non-probabilistic rough set models, which mainly uses subsethood measures and similarity measures instead of probabilities. Subsethood measure is a quantitative generalization of the qualitative set-inclusion. Given a universe OB, a normalized subsethood measure is defined as a mapping sh : 2 OB × 2 OB → [0, 1] where 2 OB is the power set of OB. For two sets A, B ⊆ OB, sh(A B) represents the degree to which A is a subset of B. Yao and Deng [28] formulate quantitative unstructured approximations through subsethood measures, which unifies both probabilistic and non-probabilistic models. Following their formulation, we present the following quantitative structured regions based on subsethood measures.

Suppose sh : 2 OB × 2 OB → [0, 1] is a normalized subsethood measure. For a given set of objects X ⊆ OB, its quantitative structured positive and negative regions can be defined as:

By using a subsethood measure sh(A B) = |A∩B| |A| , we have:

and similarly, sh(m(p) X c ) = P r(X c |m(p)). Thus, the two probabilistic regions in Definition 7 are special cases of the above two regions defined through subsethood measures. One may also consider many other meaningful cardinalitybased subsethood measures [1] to formulate non-probabilistic approximations. For example, by using a measure sh R c 5 [11] :

we can formulate a pair of non-probabilistic structured regions of X ⊆ OB as:

One may also consider both probabilities and subsethood measures in defining quantitative regions in an incomplete table. In this subsection, we present two ways to define probabilistic structured regions. An intuitive way is to simply replace the set-inclusion in Definition 6 with probabilities, which leads to the following definition.

Definition 9. Given a set of objects X ⊆ OB in an incomplete table T , one can construct the following probabilistic lower and upper structured regions:

where the probabilities are computed as:

The condition (p, m(p)) ∈ CDEFI( T ) is omitted in Definition 9 and the following definitions where this doesn't cause misunderstanding. Since m(p) is a collection of p's meaning sets in all completions, the above regions can be equivalently expressed through the family COMP( T ).

(1) SPOS pr * (α,·) (X) = {(p, m(p)) | ∀ T ∈ COMP( T ), P r(X|m(p|T )) ≥ α}, SNEG pr * (·,γ) (X) = {(p, m(p)) | ∀ T ∈ COMP( T ), P r(X c |m(p|T )) ≥ γ}; (2) SPOS * pr (α,·) (X) = {(p, m(p)) | ∃ T ∈ COMP( T ), P r(X|m(p|T )) ≥ α}, SNEG * pr (·,γ) (X) = {(p, m(p)) | ∃ T ∈ COMP( T ), P r(X c |m(p|T )) ≥ γ}. (23) The possible-world semantics also connects the above probabilistic regions in an incomplete table to those in a complete table (i.e., Definition 7). The condition P r(X|m(p|T )) ≥ α implies that the concept (p, m(p|T )) is included in SPOS pr (α,·) (X|T ). Accordingly, we get the following theorem. Theorem 1. Given a set of objects X ⊆ OB, we have:

(1) (p, m(p)) ∈ SPOS pr * (α,·) (X) ⇔ ∀ T ∈ COMP( T ), (p, m(p|T )) ∈ SPOS pr (α,·) (X|T ), (p, m(p)) ∈ SNEG pr * (·,γ) (X) ⇔ ∀ T ∈ COMP( T ), (p, m(p|T )) ∈ SNEG pr (·,γ) (X|T ); (2) (p, m(p)) ∈ SPOS * pr (α,·) (X) ⇔ ∃ T ∈ COMP( T ), (p, m(p|T )) ∈ SPOS pr (α,·) (X|T ), (p, m(p)) ∈ SNEG * pr (·,γ) (X) ⇔ ∃ T ∈ COMP( T ), (p, m(p|T )) ∈ SNEG pr (·,γ) (X|T ).

Definition 9 and Proposition 1 provide two mathematically equivalent but semantically different formulations. Definition 9 is a straightforward generalization of the qualitative regions. Proposition 1 provides an equivalent version through the family COMP( T ), which offers a clearer semantics. This clear semantics enables us to explore the relationships stated in Theorem 1.

Instead of considering P r(X|S) for every set S ∈ m(p), we may generalize P r(X|S) into a probability P r(X| m(p)) regarding a set X and an interval set m(p) which has not been well studied. Different interpretations of P r(X| m(p)) may lead to different formulas. In our work, we interpret P r(X| m(p)) as the probability of a set in m(p) being included in X. Accordingly, we define the following probabilistic structured regions.

Definition 10. Given a set of objects X ⊆ OB, one may define the following pair of probabilistic structured positive and negative regions:

where the probabilities are computed as:

Since each set in m(p) represents a possibility of p's actual meaning set, a high probability P r(X| m(p)) means that, in a large portion of all possible worlds COMP( T ), the meaning set of p is included in X, or equivalently, p appears in the qualitative structured positive region of X. Thus, it is with high probability that p appears in the actual qualitative structured positive region of X. Proposition 2. Given a set of objects X ⊆ OB, we have: (27) where T 0 ∈ COMP( T ) is the actual table.

The two probabilities P r(X| m(p)) and P r(X c | m(p)) can be efficiently computed through the two bounds of m(p). For P r(X| m(p)), if m * (p) ⊆ X, then no set in m(p) is included X, that is, P r(X| m(p)) = 0. Otherwise, we have:

The probability P r(X c | m(p)) can be similarly computed and the following computational formulation of the structured regions can be accordingly obtained.

Given a set of objects X ⊆ OB, one may construct a pair of probabilistic structured regions of X as: (29) where μ and μ c are two numbers defined as:

While Definition 10 provides a conceptual understanding of the structured regions which requires an exhaustive scan of m(p) to compute the probabilities, Theorem 2 gives an equivalent computational formulation where the probabilities can be efficiently computed through the two bounds of m(p). Example 1. We illustrate the above probabilistic structured regions in Definitions 9 and 10 with an incomplete table given by Table 1 [12] . The family CDEFI( T ) is given by Table 2 . 

With the same set X and the same thresholds, the pair of regions defined in Definition 10 are: 

We present a more general formulation of quantitative structured regions by using a subsethood measure instead of the probabilities in Definition 9.

Definition 11. Suppose sh : 2 OB × 2 OB → [0, 1] is a normalized subsethood measure. Given a set X ⊆ OB, one can define the following structured regions:

By using a subsethood measure sh(A B) = |A∩B| |A| , the probabilistic regions in Definition 9 become special cases of the above regions. Non-probabilistic struc-tured regions can be constructed by applying non-probabilistic subsethood measures such as sh R c 5 given in Eq. (19) . Since each set S ∈ m(p) is a meaning set of p in a completion, one can equivalently express the above regions through the family COMP( T ), for example:

Similar as in Theorem 1, one may also establish relationships between the above subsethood-based regions and those in the completions, for example:

Alternatively, one may generalize subsethood measures for two sets into those for an interval set and a set to construct quantitative regions. Such subsethood measures evaluate the degree to which an interval set is included in a set. 

One may also define Sh through the qualitative set-inclusion such as using the proportion of subsets of B in A:

With the latter, the probabilistic regions in Definition 10 become special cases of the above regions in Definition 12. One may also construct non-probabilistic quantitative regions through Definition 12 by applying a subsethood measure Sh that cannot be explained through probabilities.

To combine the advantages of both quantitative and structured approximations, this paper investigates quantitative formulations of structured approximations in both complete and incomplete tables. We consider both probabilistic formulations which are widely studied in the literature regarding unstructured approximations and non-probabilistic formulations which have not received due attention.

Our work brings up several interesting topics to work on. A first topic is the interpretation and determination of thresholds in various quantitative regions. While there are lots of existing related studies with respect to the probabilistic unstructured approximations in a complete table, the thresholds in subsethoodbased quantitative regions and those regions in incomplete tables need further investigation. Solutions to this topic will help construct efficient computational formulations of approximations, which is a second topic for future work. A third topic is to investigate the relationships between this work and other concept analysis approaches such as lattice theory, formal concept analysis, and pattern structures. 

On rational cardinality-based inclusion measures

A calculus of rough sets of the first order

Port Royal Logic. Stanford Encyclopedia of Philosophy

A semantically sound approach to Pawlak rough sets and covering-based rough sets

Three-way classification models

A multifaceted analysis of probabilistic three-way decisions

Formal Concept Analysis: Foundations and Applications

Rough approximation based on weak q-RIFs

Parameterized rough set model using rough membership and Bayesian confirmation measures

Game-theoretic rough sets

On the properties of subsethood measures

Structured approximations as a basis for three-way decisions with rough sets. Knowl.-Based Syst

Optimal approximations with rough sets and similarities in measure spaces

On semantics issues connected with incomplete information table

On modeling similarity and three-way decision under incomplete information in rough set theory. Knowl.-Based Syst

Three-way decision with incomplete information based on similarity and satisfiability

Notes on relation between symbolic classifiers

Rough Sets: Theoretical Aspects of Reasoning about Data

Rough sets

The connections between three-way and classical concept lattices. Knowl.-Based Syst

An analysis of three types of partially-known formal concepts

Generalized multigranulation double-quantitative decision-theoretic rough set of multi-source information system

Bayesian rough set model

Three-way granular computing, rough sets, and formal concept analysis

The two sides of the theory of rough sets. Knowl.-Based Syst

Three-way decision: an interpretation of rules in rough set theory

Probabilistic rough set approximations

Quantitative rough sets based on subsethood measures

A decision-theoretic rough set model

Naive Bayesian rough sets

Incremental concept-cognitive learning based on attribute topology

Three-way dual concept analysis

Variable precision rough set model

The author thanks reviewers for their valuable comments and suggestions.