title: Feature Reduction in Superset Learning Using Rough Sets and Evidence Theory
authors: Campagner, Andrea; Ciucci, Davide; Hüllermeier, Eyke
date: 2020-05-18
journal: Information Processing and Management of Uncertainty in Knowledge-Based Systems
DOI: 10.1007/978-3-030-50146-4_35

Supervised learning is an important branch of machine learning (ML), which requires a complete annotation (labeling) of the involved training data. This assumption, which may constitute a severe bottleneck in the practical use of ML, is relaxed in weakly supervised learning. In this ML paradigm, training instances are not necessarily precisely labeled. Instead, annotations are allowed to be imprecise or partial. In the setting of superset learning, instances are assumed to be labeled with a set of possible annotations containing the correct one. In this article, we study the application of rough set theory in the setting of superset learning. In particular, we consider the problem of feature reduction as a means for data disambiguation, i.e., for the purpose of figuring out the most plausible precise instantiation of the imprecise training data. To this end, we define appropriate generalizations of decision tables and reducts, using information-theoretic techniques based on evidence theory. Moreover, we analyze the complexity of the associated computational problems.

In recent years, the increased availability of data has fostered interest in machine learning (ML) and knowledge discovery, in particular in supervised learning methodologies. These require each training instance to be annotated with a target value (a discrete label in classification, or a real number in regression). The annotation task is a fundamental component of the ML pipeline, and often a bottleneck in terms of cost. Indeed, the high costs caused by the standard annotation process, which may require the involvement of domain experts, have triggered the development of alternative annotation protocols, such as those based on crowdsourcing [4] or (semi-)automated annotation [12]. A different approach, which has attracted increasing attention in recent years, is the combination of supervised and unsupervised learning techniques, sometimes referred to as weakly supervised learning [30]. In this setting, training instances are not necessarily labeled precisely. Instead, annotations are allowed to be imprecise or partial. A specific variant of weakly supervised learning is the setting of superset learning [9, 16, 18], where an instance x is annotated with a set S of (precise) candidate labels that are deemed possible. In other words, the label of x cannot be determined precisely, but is known to be an element of S. For example, an image could be tagged with {horse, pony, zebra}, suggesting that the animal shown in the picture is one of these three, though it is not known exactly which one. Superset learning has been widely investigated from the classification perspective [10, 15], that is, with the goal of training a predictive model that is able to correctly classify new instances despite the weak training information. Nevertheless, the task of feature selection [6], which is of critical importance for machine learning in general, has not received much attention so far. In this article, we study the application of rough set theory in the setting of superset learning.
In particular, we consider the problem of feature reduction as a means for data disambiguation, i.e., for the purpose of figuring out the most plausible precise instantiation of the imprecise training data. Broadly speaking, the idea is as follows: an instantiation that can be explained with a simple model, i.e., a model that uses only a small subset of features, is more plausible than an instantiation that requires a complex model. To this end, we will define appropriate generalizations of decision tables and reducts, using information-theoretic techniques based on evidence theory. Moreover, we analyze the complexity of the associated computational problems.

In this section, we recall basic notions of rough set theory (RST) and evidence theory, which will be used in the main part of the article. Rough set theory was proposed by Pawlak [19] as a framework for representing and managing uncertain data, and has since been widely applied to various problems in the ML domain (see [2] for a recent overview and survey). We briefly recall the main notions of RST, especially regarding its application to feature reduction.

A decision table (DT) is a triple $DT = (U, Att, t)$ such that $U$ is a universe of objects and $Att$ is a set of attributes employed to represent objects in $U$. Formally, each attribute $a \in Att$ is a function $a : U \to V_a$, where $V_a$ is the domain of values of $a$. Moreover, $t \notin Att$ is a distinguished decision attribute, which represents the target decision (also labeling or annotation) associated with each object in the universe. We say that $DT$ is inconsistent if the following holds: $\exists x_1, x_2 \in U$ such that $\forall a \in Att,\ a(x_1) = a(x_2)$ and $t(x_1) \neq t(x_2)$.

Given $B \subseteq Att$, we can define the indiscernibility partition with respect to $B$ as $\pi_B = \{[x]_B : x \in U\}$, where $[x]_B = \{y \in U : \forall a \in B,\ a(y) = a(x)\}$. We say that $B \subseteq Att$ is a decision reduct for $DT$ if $\pi_B \leq \pi_t$ (where the order $\leq$ is the refinement order for partitions, that is, $\pi_t$ is a coarsening of $\pi_B$) and there is no $C \subsetneq B$ such that $\pi_C \leq \pi_t$. Evidently, then, a reduct of a decision table $DT$ represents a non-redundant set of features that is sufficient to represent the information in $DT$. We say that a reduct $R$ is minimal if it is among the smallest (with respect to cardinality) reducts. Given $B \subseteq Att$ and a set $S \subseteq U$, a rough approximation of $S$ (with respect to $B$) is defined as the pair $(\underline{B}S, \overline{B}S)$, where $\underline{B}S = \{x \in U : [x]_B \subseteq S\}$ and $\overline{B}S = \{x \in U : [x]_B \cap S \neq \emptyset\}$. Finally, given $B \subseteq Att$, the generalized decision with respect to $B$ for an object $x \in U$ is defined as $\partial_B(x) = \{t(y) : y \in [x]_B\}$.

We notice that several definitions of reduct exist in the RST literature; we refer the reader to [25] for an overview of these definitions and a study of their dependencies. We further notice that, given a decision table, the problem of finding a minimal reduct is in general $\Sigma^P_2$-complete (by reduction from the Shortest Implicant problem [28]), while the problem of finding a reduct is in general NP-complete [23]. We recall that $\Sigma^P_2$ is the complexity class of problems whose solutions can be verified in polynomial time given access to an oracle for an NP-complete problem [1].
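To make these notions concrete, the following minimal sketch (in Python) computes indiscernibility partitions and enumerates decision reducts by exhaustive search for a toy decision table; the table, the attribute names and the values are purely illustrative and are not taken from the paper.

```python
from itertools import combinations

# Toy decision table (illustrative only): each row maps attribute name -> value,
# plus the decision attribute "t".
U = [
    {"a": 0, "b": 1, "c": 0, "t": "yes"},
    {"a": 0, "b": 1, "c": 1, "t": "yes"},
    {"a": 1, "b": 0, "c": 0, "t": "no"},
    {"a": 1, "b": 1, "c": 0, "t": "no"},
]
Att = ["a", "b", "c"]

def partition(objects, attrs):
    """Indiscernibility partition pi_B: objects are grouped by their values on attrs."""
    blocks = {}
    for i, x in enumerate(objects):
        blocks.setdefault(tuple(x[a] for a in attrs), set()).add(i)
    return list(blocks.values())

def refines(p, q):
    """True if partition p refines partition q (every block of p lies inside a block of q)."""
    return all(any(block <= other for other in q) for block in p)

def is_reduct(objects, B):
    """B is a decision reduct if pi_B refines pi_t and no proper subset of B does."""
    pi_t = partition(objects, ["t"])
    if not refines(partition(objects, B), pi_t):
        return False
    return all(not refines(partition(objects, list(C)), pi_t)
               for r in range(len(B)) for C in combinations(B, r))

reducts = [list(B) for r in range(1, len(Att) + 1)
           for B in combinations(Att, r) if is_reduct(U, list(B))]
print(reducts)  # for this toy table: [['a']]
```

For the toy table above the only decision reduct is {a}; exhaustive search of this kind is of course only feasible for very small attribute sets, in line with the complexity results recalled above.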
Evidence theory (ET), also known as Dempster-Shafer theory or belief function theory, was originally introduced by Dempster [5] and subsequently formalized by Shafer [21] as a generalization of probability theory (although this interpretation has been disputed [20]). The starting point is a frame of discernment $X$, which represents all possible states of a system under study, together with a basic belief assignment (bba) $m : 2^X \to [0, 1]$ such that $m(\emptyset) = 0$ and $\sum_{A \in 2^X} m(A) = 1$. From this bba, a pair of functions, called respectively belief and plausibility, can be defined as follows:

$$Bel(A) = \sum_{B \subseteq A} m(B), \qquad Pl(A) = \sum_{B \cap A \neq \emptyset} m(B).$$

As can be seen from these definitions, there is a clear correspondence between belief functions (resp., plausibility functions) and lower approximations (resp., upper approximations) in RST; we refer the reader to [29] for further connections between the two theories. Starting from a bba, a probability distribution, called the pignistic probability, can be obtained [26]:

$$P^m_{Bet}(x) = \sum_{A \subseteq X : x \in A} \frac{m(A)}{|A|}.$$

Finally, we recall that appropriate generalizations of information-theoretic concepts [22], specifically the concept of entropy (which has also been proposed to generalize the definition of reducts in RST [24]), have been defined for evidence theory. Most relevantly, we recall the definition of aggregate uncertainty [7],

$$AU(m) = \max_{p \in \mathcal{P}(m)} H(p), \quad (4)$$

where $H(p)$ is the Shannon entropy of $p$ and $\mathcal{P}(m)$ is the set of probability distributions $p$ such that $Bel_m \leq p \leq Pl_m$; and the definition of the normalized pignistic entropy (see [13] for the un-normalized definition),

$$H_{Bet}(m) = \frac{H(P^m_{Bet})}{H(\hat{p}_m)}, \quad (5)$$

where $\hat{p}_m$ is the probability distribution that is uniform on the support of $P^m_{Bet}$, i.e., on the set of elements $\{x \mid P^m_{Bet}(x) > 0\}$.

In this section, we extend some key concepts of rough set theory to the setting of superset learning. In superset learning, each object $x \in U$ is not associated with a single annotation $t(x) \in V_t$, but with a set $S$ of candidate annotations, one of which is assumed to be the true annotation associated with $x$. In order to model this idea in terms of RST, we generalize the definition of a decision table.

Definition 1. A superset decision table (SDT) is a tuple $S = (U, Att, d)$ such that:

- $U$ is a universe of objects of interest;
- $Att$ is a set of attributes (or features);
- $t$ is the decision attribute (whose value, in general, is not known);
- $d : U \to \mathcal{P}(V_t)$ is a set-valued decision attribute such that $t(x) \in d(x)$ for all $x \in U$.

The intuitive meaning of the set-valued information $d$ is that, if $|d(x)| > 1$ for some $x \in U$, then the real decision associated with $x$ (i.e., $t(x)$) is not known precisely, but is known to be in $d(x)$. Notice that Definition 1 is a proper generalization of decision tables: if $|d(x)| = 1$ for all $x \in U$, then we have a standard decision table.

Remark 1. In Definition 1, a set-valued decision attribute is modelled as a function $d : U \to \mathcal{P}(V_t)$. While this mapping is formally well-defined for a concrete decision table, let us mention that, strictly speaking, there is no functional dependency between $x$ and $d(x)$. In fact, $d(x)$ is not considered as a property of $x$, but rather represents information about a property of $x$, namely the underlying decision attribute $t(x)$. As such, it reflects the epistemic state of the decision maker.

Based on the notion of SDT, we can generalize the notion of inconsistency: an SDT is inconsistent if there exist $x_1, x_2 \in U$ such that $\forall a \in Att,\ a(x_1) = a(x_2)$ and $d(x_1) \cap d(x_2) = \emptyset$. We call such a pair $x_1, x_2$ inconsistent; otherwise it is consistent. Thus, inconsistency implies the existence of (at least) two indiscernible objects with non-overlapping superset decisions. An instantiation $I$ of an SDT $S$ is a standard decision table obtained by selecting, for each $x \in U$, a single decision $t_I(x) \in d(x)$. We say that an instantiation $I$ is consistent with an SDT $S$ (for short, is consistent) if the following holds for all $x_1, x_2$: if $x_1, x_2$ are consistent in $S$, then they are also consistent in $I$.
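As a small illustration of these notions (objects, attribute names and label sets are hypothetical and not taken from the paper), the following sketch represents an SDT as a list of records, checks pairwise consistency, and tests whether a given instantiation is consistent with the SDT in the sense just defined.

```python
from itertools import product

# Hypothetical SDT: each object has attribute values and a set d of candidate decisions,
# one of which is the unknown true decision.
sdt = [
    {"a": 0, "b": 0, "d": {"yes"}},
    {"a": 0, "b": 0, "d": {"yes", "no"}},
    {"a": 1, "b": 1, "d": {"no"}},
]
Att = ["a", "b"]

def indiscernible(x, y, attrs):
    return all(x[a] == y[a] for a in attrs)

def consistent_in_sdt(x, y, attrs):
    """A pair is inconsistent iff indiscernible with non-overlapping superset decisions."""
    return not indiscernible(x, y, attrs) or bool(x["d"] & y["d"])

def consistent_in_dt(x, y, tx, ty, attrs):
    """Classical consistency: indiscernible objects must share the same decision."""
    return not indiscernible(x, y, attrs) or tx == ty

def instantiation_consistent(sdt, labels, attrs):
    """labels[i] picks one decision from sdt[i]['d']; the instantiation is consistent
    with the SDT if every pair that is consistent in the SDT stays consistent."""
    for i in range(len(sdt)):
        for j in range(i + 1, len(sdt)):
            if consistent_in_sdt(sdt[i], sdt[j], attrs) and \
               not consistent_in_dt(sdt[i], sdt[j], labels[i], labels[j], attrs):
                return False
    return True

# Enumerate all instantiations and report which ones are consistent with the SDT.
for labels in product(*[sorted(x["d"]) for x in sdt]):
    print(labels, instantiation_consistent(sdt, list(labels), Att))
# ('yes', 'no', 'no') False   -- the two indiscernible objects get conflicting labels
# ('yes', 'yes', 'no') True
```

Only the instantiation that assigns "yes" to both indiscernible objects is consistent here, matching the intuition that disambiguation should not introduce new conflicts.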
Learning from superset data is closely connected to the idea of data disambiguation in the sense of figuring out the most plausible instantiation of the set-valued training data [8, 11]. But what makes one instantiation more plausible than another one? One approach, originally proposed in [9], refers to the principle of simplicity in the spirit of Occam's razor (which can be given a theoretical justification in terms of Kolmogorov complexity [14]): an instantiation that can be explained by a simple model is more plausible than an instantiation that requires a complex model. In the context of RST-based data analysis, a natural measure of model complexity is the size of the reduct. This leads us to the following definition: given an SDT $S$, we say that $B \subseteq Att$ is an MDL reduct of $S$ if $B$ is a reduct for some instantiation $I$ of $S$, and no instantiation of $S$ admits a reduct of strictly smaller cardinality.

An algorithmic solution to the problem of finding the MDL reducts of an SDT can be given as a brute-force algorithm, which computes the reducts of all the possible instantiations, see Algorithm 1. It is easy to see that the worst-case runtime complexity of this algorithm is exponential in the size of the input. Unfortunately, it is unlikely that an asymptotically more efficient algorithm exists. Indeed, if we consider the problem of finding any MDL reduct, then the number of instantiations of $S$ is, in the general case, exponential in the number of objects, and for each such instantiation one should find the shortest reduct for the corresponding decision table, a problem which is known to be in $\Sigma^P_2$. Interestingly, we can prove that the decision problem MDL-Reduct, related to finding MDL reducts, is also in $\Sigma^P_2$; that is, finding an MDL reduct is no more complex than finding a minimal reduct in standard decision tables. To see this, we need to show that there is an algorithm for verifying instances of MDL-Reduct whose runtime is polynomial given access to an oracle for an NP-complete problem. Indeed, a certificate can be given by an instantiation $I$ (whose size is clearly polynomial in the size of the input SDT) together with a reduct $R$ for $I$, which is an MDL reduct. Verifying whether $R$ is a minimal reduct for $I$ can then be done in polynomial time with an oracle for NP, hence the result. Further, as finding a minimal reduct for classical decision tables is $\Sigma^P_2$-complete (by reduction from the Shortest Implicant problem), MDL-Reduct is also $\Sigma^P_2$-complete.

While heuristics could be applied to speed up the computation of reducts [27] (specifically, to reduce the complexity of the find-shortest-reducts step in Algorithm 1), the approach described in Algorithm 1 still requires enumerating all the possible instantiations. Thus, in the following section we propose two alternative definitions of reduct in order to reduce the computational costs.
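Before turning to these alternative definitions, the following self-contained sketch illustrates the brute-force idea just described: enumerate all instantiations, collect the reducts of each, and keep those of globally minimal size. The data are hypothetical, and the code is only an illustration of this idea under the reconstructed MDL-reduct definition, not a reproduction of the paper's Algorithm 1.

```python
from itertools import combinations, product

# Hypothetical superset decision table: d(x) is a set of candidate decisions.
sdt = [
    {"a": 0, "b": 0, "d": {"yes"}},
    {"a": 0, "b": 1, "d": {"yes", "no"}},
    {"a": 1, "b": 0, "d": {"no"}},
    {"a": 1, "b": 1, "d": {"no", "maybe"}},
]
Att = ["a", "b"]

def partition(rows, attrs):
    blocks = {}
    for i, x in enumerate(rows):
        blocks.setdefault(tuple(x[a] for a in attrs), set()).add(i)
    return list(blocks.values())

def refines(p, q):
    return all(any(b <= c for c in q) for b in p)

def reducts(rows, attrs):
    """All decision reducts of a precise decision table whose decision is stored under 't'."""
    pi_t = partition(rows, ["t"])
    supersets = [set(B) for r in range(1, len(attrs) + 1)
                 for B in combinations(attrs, r)
                 if refines(partition(rows, list(B)), pi_t)]
    return [B for B in supersets if not any(C < B for C in supersets)]

def mdl_reducts(sdt, attrs):
    """Brute force: enumerate every instantiation, collect the reducts of each,
    and keep those of globally minimal cardinality."""
    found = []
    for choice in product(*[sorted(x["d"]) for x in sdt]):
        instantiation = [dict(x, t=c) for x, c in zip(sdt, choice)]
        found.extend(reducts(instantiation, attrs))
    if not found:
        return []
    k = min(len(B) for B in found)
    return [B for B in found if len(B) == k]

print(mdl_reducts(sdt, Att))  # [{'a'}] for the table above
```

For this toy SDT the result is [{'a'}], corresponding to the disambiguation that labels the second object "yes" and the fourth one "no"; the nested enumeration over instantiations and attribute subsets makes the exponential cost discussed above explicit.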
In this section, we present the main results concerning the application of rough set theory and evidence theory to feature reduction in the superset learning setting. We begin with an alternative definition of reduct, based on the notion of entropy [24], which reduces the complexity of finding a reduct in an SDT. Given a decision $d$, we can associate with it a pair of belief and plausibility functions. Let $v \in V_t$ and let $[x]_B$, for $B \subseteq Att$, be an equivalence class; then

$$Bel_{[x]_B}(\{v\}) = \frac{|\{y \in [x]_B : d(y) = \{v\}\}|}{|[x]_B|}, \qquad Pl_{[x]_B}(\{v\}) = \frac{|\{y \in [x]_B : v \in d(y)\}|}{|[x]_B|}.$$

For each $W \subseteq V_t$, the corresponding basic belief assignment is defined as

$$m_{[x]_B}(W) = \frac{|\{y \in [x]_B : d(y) = W\}|}{|[x]_B|}.$$

Given this setting, we now consider two different entropies. The first one is the pignistic entropy $H_{Bet}(m)$ as defined in (5). As regards the second definition, we will not directly employ the AU measure (see Eq. (4)): this measure, in fact, corresponds to a quantification of the degree of conflict in the bba $m$, which is not appropriate in our context, as it would imply finding an instantiation which is maximally inconsistent. We thus define a modification of the AU measure that we call Optimistic Aggregate Uncertainty (OAU). This measure, which has already been studied in the context of superset decision tree learning [9] and soft clustering [3], is defined as follows:

$$OAU(m) = \min_{p \in \mathcal{P}(m)} H(p),$$

where, as above, $\mathcal{P}(m)$ is the set of probability distributions compatible with $m$. Given $B \subseteq Att$, we write $OAU(d \mid B)$ (resp., $H_{Bet}(d \mid B)$) for the average of $OAU(m_{[x]_B})$ (resp., $H_{Bet}(m_{[x]_B})$) over the equivalence classes of $\pi_B$, weighted by their relative sizes $|[x]_B| / |U|$.

Definition 5. We say that $B \subseteq Att$ is

- an OAU super-reduct (resp., $H_{Bet}$ super-reduct) if $OAU(d \mid B) \leq OAU(d \mid Att)$ (resp., $H_{Bet}(d \mid B) \leq H_{Bet}(d \mid Att)$);
- an OAU reduct (resp., $H_{Bet}$ reduct) if it is a super-reduct and no proper subset of $B$ is also a super-reduct;
- an OAU $\epsilon$-approximate super-reduct (resp., $H_{Bet}$ $\epsilon$-approximate super-reduct) if $OAU(d \mid B) \leq OAU(d \mid Att) + \epsilon$ (resp., $H_{Bet}(d \mid B) \leq H_{Bet}(d \mid Att) + \epsilon$);
- an OAU $\epsilon$-approximate reduct (resp., $H_{Bet}$ $\epsilon$-approximate reduct) if it is an $\epsilon$-approximate super-reduct and no proper subset of $B$ is also an $\epsilon$-approximate super-reduct.

Let $[x]_B$ be one of the granules with respect to an OAU reduct. Then, the OAU instantiation with respect to $[x]_B$ is given by

$$dec_{OAU(B)}([x]_B) = \arg\max_{v \in V_t} p^*(v), \quad \text{where } p^* \in \arg\min_{p \in \mathcal{P}(m_{[x]_B})} H(p),$$

that is, the most probable among the classes under the probability distribution which corresponds to the minimum value of entropy. Similarly, the $H_{Bet}$ instantiation with respect to $[x]_B$ is given by

$$dec_{H_{Bet}(B)}([x]_B) = \arg\max_{v \in V_t} P^{m_{[x]_B}}_{Bet}(v),$$

i.e., the most probable class under the pignistic probability. The following example shows, for the simple SDT given in Table 1, the OAU reducts, MDL reducts, and $H_{Bet}$ reducts and their relationships.

Example 1. Consider the SDT in Table 1. Notice that $\{z\}$ is also an OAU reduct.

In Example 1, it is shown that the MDL reduct is one of the OAU reducts. Indeed, we can prove that this holds in general.

Theorem 2. Let $R$ be an MDL reduct of a consistent SDT. Then $R$ is also an OAU reduct.

Proof. As the instantiation corresponding to $R$ is consistent, $OAU(d \mid R) = 0$. Thus $R$ is an OAU reduct.

Concerning the computational complexity of finding a minimal OAU reduct, or any OAU reduct, we have the following results.

Proposition 1. Finding the minimal OAU reduct for a consistent SDT is $\Sigma^P_2$-complete.

Proof. As any MDL reduct of a consistent SDT is also an OAU reduct and MDL reducts are by definition minimal, the complexity of finding a minimal OAU reduct is equivalent to that of finding MDL reducts, hence is $\Sigma^P_2$-complete.

On the other hand, as both OAU [3, 9] and $H_{Bet}$ can be computed in polynomial time, finding an OAU (resp., $H_{Bet}$) reduct is NP-complete.

As shown in Example 1, the relationship between MDL reducts (or OAU reducts) and $H_{Bet}$ reducts is more complex, as, in general, an OAU reduct is not necessarily an $H_{Bet}$ reduct. In particular, one could be interested in whether an $H_{Bet}$ reduct exists and whether there exists an $H_{Bet}$ reduct which is able to disambiguate objects that are not disambiguated when taking into consideration the full set of attributes $Att$. The following two results provide a characterization in the binary (i.e., $V_t = \{0, 1\}$), consistent case: their proofs reduce to an inequality in non-negative integer counts $n_1, n_2, m_1, m_2$ (and symmetrically when exchanging $n_1$ and $n_2$), whose satisfaction implies that the probability is more peaked on a single alternative; the integer solutions of this inequality provide the statement of the theorem, and one can see that the strict inequality is not achievable.

Notably, these two results also provide an answer to the second question, that is, whether an $H_{Bet}$ reduct can disambiguate instances that are not disambiguated when considering the whole attribute set $Att$. Indeed, Theorem 4 provides sufficient conditions for this property and shows that, in the binary case, disambiguation is possible only when at least one of the equivalence classes (w.r.t. $Att$) that are merged w.r.t. the reduct is already disambiguated. On the contrary, in the general n-ary case, disambiguation could also happen in more general situations. This is shown by the following example, in which $dec_{H_{Bet}(a)}(x_i) = 1$ for all $x_i \in U$; notice that, in this case, $\{a\}$ would also be an OAU reduct (and hence an MDL reduct, as it is minimal).
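As an illustration of how these quantities can be computed for a single granule $[x]_B$, the following sketch builds the frequency-based bba described above from a hypothetical multiset of superset labels, and derives the OAU value, the pignistic distribution, the corresponding instantiations, and the normalized pignistic entropy. The labels are made up; the OAU computation is a brute-force enumeration for illustration, not the polynomial-time procedure referred to in the text, and the entropy normalization follows the reconstruction given above (entropy of the uniform distribution on the support).

```python
import math
from itertools import product

# One granule [x]_B, represented by the multiset of superset decisions d(y) of its objects.
granule = [{"a"}, {"a", "b"}, {"a", "b"}, {"b", "c"}]

def bba(granule):
    """Frequency-based bba over decision sets: m(W) = |{y : d(y) = W}| / |granule|."""
    m = {}
    for dy in granule:
        key = frozenset(dy)
        m[key] = m.get(key, 0.0) + 1.0 / len(granule)
    return m

def entropy(p):
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def oau_and_decision(m):
    """Brute-force OAU: minimum entropy over the distributions obtained by committing
    each focal set's mass to one of its elements (enumeration only, for illustration)."""
    focal = list(m.items())
    best = None
    for choice in product(*[sorted(A) for A, _ in focal]):
        p = {}
        for (A, mass), v in zip(focal, choice):
            p[v] = p.get(v, 0.0) + mass
        h = entropy(p)
        if best is None or h < best[0]:
            best = (h, max(p, key=p.get))
    return best  # (OAU value, OAU instantiation = most probable class of the minimizer)

def pignistic(m):
    """Pignistic distribution: each focal set spreads its mass uniformly over its elements."""
    p = {}
    for A, mass in m.items():
        for v in A:
            p[v] = p.get(v, 0.0) + mass / len(A)
    return p

m = bba(granule)
oau_value, dec_oau = oau_and_decision(m)
p_bet = pignistic(m)
dec_bet = max(p_bet, key=p_bet.get)
support = sum(1 for q in p_bet.values() if q > 0)
h_bet = entropy(p_bet) / math.log2(support) if support > 1 else 0.0
print(round(oau_value, 3), dec_oau)                                  # 0.811 a
print({v: round(q, 3) for v, q in sorted(p_bet.items())}, dec_bet)   # {'a': 0.5, 'b': 0.375, 'c': 0.125} a
print(round(h_bet, 3))                                               # 0.887
```

In this particular granule the OAU instantiation and the $H_{Bet}$ instantiation coincide (both select "a"); in general they may differ, in line with the observation above that OAU reducts and $H_{Bet}$ reducts do not coincide.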
A characterization of $H_{Bet}$ reducts in the n-ary case is left as future work. Finally, we notice that, while the complexity of finding OAU (resp., $H_{Bet}$) reducts is still NP-complete, even in the approximate case, these definitions are more amenable to optimization through heuristics, as they employ a quantitative measure of quality for each attribute. Indeed, a simple greedy procedure can be implemented, as shown in Algorithm 2, which obviously has polynomial time complexity.

In this article we investigated strategies for the simultaneous solution of the feature reduction and disambiguation problems in the superset learning setting through the application of rough set theory and evidence theory. We first defined a generalization of decision tables to this setting and then studied a purely combinatorial definition of reducts inspired by the Minimum Description Length principle, which we called MDL reducts. After studying the computational complexity of finding this type of reduct, which was shown to be NP-hard, we harnessed the natural relationship between superset learning and evidence theory and proposed two alternative definitions of reducts, based on the notion of entropy. We then provided a characterization of both these notions in terms of their relationship with MDL reducts, their existence conditions, and their disambiguation power. Finally, after having illustrated the proposed notions by means of examples, we suggested a simple heuristic algorithm for computing approximate entropy reducts under the two proposed definitions.

While this paper provides a first investigation of the application of RST to feature reduction in the superset learning setting, it leaves several interesting open problems to be investigated in future work:

- In Theorem 2, we proved that (in the consistent case) $RED_{MDL} \subset RED_{OAU}$, that is, every MDL reduct is also an OAU reduct; in particular, the MDL reducts are the minimal OAU reducts. As $RED_{MDL} \subseteq RED_{super}$, the relationship between the OAU reducts and the superset reducts should be investigated in more depth. Specifically, we conjecture that $RED_{super} \subseteq RED_{OAU}$ holds in general: while this inclusion is easy to prove in the consistent case, the general case should also be considered.
- In Theorem 4, we provided a characterization of $H_{Bet}$ reducts in the binary consistent case; however, the behavior of this type of reduct should also be investigated in the more general setting, specifically with respect to the relationship between $RED_{OAU}$ and $RED_{H_{Bet}}$.
- Given the practical importance of the superset learning setting, an implementation of the presented ideas and algorithms should be developed, in order to provide a computational framework for the application of the rough set methodology to these tasks as well, in particular with respect to the implementation of algorithms (both exact and heuristic) for finding MDL or entropy reducts.

In closing, we would like to highlight an alternative motivation for the superset extension of decision tables in general, and for the search for reducts of such tables in particular. In this paper, the superset extension was motivated by the assumption of imprecise labeling: the value of the decision attribute is not known precisely but only characterized in terms of a set of possible candidates. Finding a reduct is then supposed to help disambiguate the data, i.e., to figure out the most plausible among the candidates.
Instead of this "don't know" interpretation, a superset S can also be given a "don't care" interpretation: In a certain context characterized by x, all decisions in S are sufficiently good, or "satisficing" in the sense of March and Simon [17] . A reduct can then be considered as a maximally simple (least cognitively demanding) yet satisficing decision rule. Thus, in spite of very different interpretations, the theoretical problems that arise are essentially the same as those studied in this paper. Nevertheless, elaborating on the idea of reduction as a means for specifically finding satisficing decision rules from a more practical point of view is another interesting direction for future work. Computational Complexity: A Modern Approach Rough sets in machine learning: a review Orthopartitions and soft clustering: soft mutual information measures for clustering validation. Knowl.-Based Syst Revolt: collaborative crowdsourcing for labeling machine learning datasets Upper and lower probabilities induced by a multivalued mapping An introduction to variable and feature selection Measuring total uncertainty in Dempster-Shafer theory: a novel approach Learning from imprecise and fuzzy observations: data disambiguation through generalized loss minimization Learning from ambiguously labeled examples Superset learning based on generalized loss minimization Learning from imprecise data: adjustments of optimistic and pessimistic variants Interactive machine learning system for automated annotation of information in text Measuring ambiguity in the evidence theory An Introduction to Kolmogorov Complexity and Its Applications Learnability of the superset label learning problem A conditional multinomial mixture model for superset label learning Classification with partial labels Rough sets Reasoning with belief functions: an analysis of compatibility A Mathematical Theory of Evidence A mathematical theory of communication The discernibility matrices and functions in information systems Approximate entropy reducts Dynamic and discernibility characteristics of different attribute reduction criteria The transferable belief model Dimensionality reduction based on rough set theory: a review On the complexity and inapproximability of shortest implicant problems Interpretations of belief functions in the theory of rough sets A brief introduction to weakly supervised learning