key: cord-0044806-2kimhmg0
authors: Mencar, Corrado
title: Possibilistic Bounds for Granular Counting
date: 2020-05-16
journal: Information Processing and Management of Uncertainty in Knowledge-Based Systems
DOI: 10.1007/978-3-030-50153-2_3
sha: 5fba3de07dd0f25827f63f42b7caf73ec68a53ba
doc_id: 44806
cord_uid: 2kimhmg0

Abstract. Uncertain data are observations that cannot be uniquely mapped to a referent. When the uncertainty is due to incompleteness, possibility theory can be used as an appropriate model for processing such data. In particular, granular counting is a way to count data in the presence of uncertainty represented by possibility distributions. Two algorithms have been proposed in the literature to compute granular counts: exact granular counting, with quadratic time complexity, and approximate granular counting, with linear time complexity. This paper extends approximate granular counting by computing bounds for the exact granular count. In this way, the efficiency of approximate granular counting is combined with certified bounds whose width can be adjusted according to user needs.

Data uncertainty may arise as a consequence of several conditions and requires proper management [1]. The simplest approach is to ignore uncertainty by estimating a precise value for each observation, but this simplistic approach, though easy to apply, can lead to a distortion in the subsequent processing stages that is difficult to detect. A more comprehensive approach should take data uncertainty into account and propagate its footprint throughout the entire data processing flow. In this way, the results of data processing reveal their uncertainty, which can be evaluated to assess their ultimate usefulness. Several theories can be applied to represent and process uncertainty, such as Probability Theory [2], which is, however, a particular case falling within the Granular Computing paradigm.
Granular Computing also includes classical Set Theory [3], Rough Sets Theory [4], Evidence Theory [5] and Possibility Theory [6]. The choice of a particular theory depends on the nature of the uncertainty; in particular, possibility theory deals with uncertainty due to incomplete information, e.g. when the value of an observation cannot be precisely determined. We will use the term uncertain data to denote data characterized by this specific type of uncertainty; therefore, we adopt the possibilistic framework in this paper.

A common process on data is counting, i.e. searching for the number of data samples with a specific value. Data counting is often a preliminary step for different types of analysis, such as descriptive statistics, comparative analysis, etc. It is a fairly simple operation when data are accurate, but it becomes nontrivial when data are uncertain. In fact, the uncertainty in the data should propagate into the count, so that the results are granular rather than precise.

Recently, a definition of granular count based on Possibility Theory was proposed [7]. It was shown that the resulting counts are fuzzy intervals in the domain of natural numbers. Based on this result, two algorithms for granular counting were defined: an exact granular counting algorithm with quadratic-time complexity and an approximate counting algorithm with linear-time complexity. Approximate granular counting is appealing in applications dealing with large amounts of data due to its low complexity, but a compromise must be accepted in terms of accuracy of the resulting fuzzy interval. In particular, it is not immediate to know how far the result of the approximate count is from the fuzzy interval resulting from the exact granular count. In this paper an algorithm is proposed for bounded granular counting, which computes an interval-valued fuzzy set representing the boundaries within which the exact granular count is located.
In this way, the efficiency of approximate granular counting is combined with certified bounds whose width can be adjusted according to user needs. The concept of granular count and the related algorithms are briefly described in Sect. 2, while the proposal of bounded granular count is introduced in Sect. 3. Section 4 reports some numerical experiments to assess the efficiency of the proposed algorithm, as well as an outline of an application in Bioinformatics.

A brief summary of granular counting is reported in this section; further details can be found in the original papers [7, 8]. We assume that data are manifested through observations, which refer to some objects or referents. The relation between observations and referents, which is called reference, may be uncertain in the sense that an unequivocal reference of the observation to one of the referents is not possible. We model such uncertainty with Possibility Theory [6], as we assume that uncertainty is due to the imprecision of the observation, i.e. the observation is not complete enough to make the reference unequivocal.

Given a set R of referents and an observation o ∈ O, a possibility distribution is a mapping π_o : R → [0, 1] such that ∃r ∈ R : π_o(r) = 1. The value π_o(r) = 0 means that it is impossible that the referent r is referred to by the observation, while π_o(r) = 1 means that the referent r is absolutely possible (though not certain). Intermediate values of π_o(r) stand for gradual values of possibility, which quantify the completeness of the information resulting from an observation. (More specifically, the lower the possibility degree, the more information we have to exclude a referent.) The possibility distributions of all observations can be arranged in a possibilistic assignment table, as exemplified in Table 1. By using the operators of Possibility Theory, as well as the assumption that observations are non-interactive (i.e.
they do not influence each other), the possibility degree that a subset O_x ⊆ O of x ∈ ℕ observations is exactly the set of observations referring to a referent r_i ∈ R is defined as:

π_{O_x}(r_i) = min( min_{o ∈ O_x} π_o(r_i), min_{o ∉ O_x} max_{r ≠ r_i} π_o(r) )    (1)

with the convention that min ∅ = 1. Informally speaking, Eq. (1) defines the possibility degree that O_x is the subset of all and only the observations of r_i by computing the least possibility degree of two simultaneous events: (i) all observations of O_x refer to r_i, and (ii) all the other observations refer to a different referent. In order to compute the possibility degree that the number of observations referring to a referent r_i is N_i, we are not interested in a specific set O_x, but in any set of x elements. We can therefore define the possibility value that the number of observations for a referent r_i is x as:

π_{N_i}(x) = max_{O_x ⊆ O : |O_x| = x} π_{O_x}(r_i)    (2)

for x ≤ m, and π_{N_i}(x) = 0 for x > m.

Equation (2) provides a granular definition of count: counting is imprecise because observations are uncertain. It is possible to prove that a granular count as in Eq. (2) is a fuzzy interval in the domain of natural numbers. A fuzzy interval is a convex and normal fuzzy set on a numerical domain (in our case, ℕ). Convexity of a fuzzy set can be established by proving that all α-cuts are intervals, while normality of the granular count is guaranteed by the normality of the possibility distributions π_o for all o ∈ O. Figure 1 depicts an example of granular count.

The direct application of Eq. (2) leads to an intractable counting procedure, as all possible subsets of O must be considered. On the other hand, a polynomial-time algorithm can be devised by taking advantage of the representation of a granular count as a fuzzy interval. In particular, the granular counting algorithm builds the fuzzy interval by considering the α-cut representation of fuzzy sets.
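To make the definition concrete, the following sketch evaluates Eqs. (1) and (2) by brute-force subset enumeration on a small hypothetical possibilistic assignment table (three observations, two referents). It is exponential in the number of observations and serves only to illustrate the definition, not as a practical algorithm:

```python
from itertools import combinations

def brute_force_count(table, i):
    """Granular count of referent i by direct application of Eqs. (1)-(2).

    table[j][k] is the possibility degree that observation j refers to
    referent k. Exponential cost: for illustration only.
    """
    m = len(table)
    r = [row[i] for row in table]                                # pi_o(r_i)
    rbar = [max(v for k, v in enumerate(row) if k != i) for row in table]
    pi_N = []
    for x in range(m + 1):
        best = 0.0
        for Ox in combinations(range(m), x):
            inside = min((r[j] for j in Ox), default=1.0)        # min over empty set = 1
            outside = min((rbar[j] for j in range(m) if j not in Ox), default=1.0)
            best = max(best, min(inside, outside))               # Eq. (2): max over subsets
        pi_N.append(best)
    return pi_N

# Hypothetical table: one observation surely of r_0, one ambiguous, one surely of r_1.
table = [[1.0, 0.3],
         [0.7, 1.0],
         [0.0, 1.0]]
print(brute_force_count(table, 0))   # -> [0.3, 1.0, 0.7, 0.0]
```

The result is normal (degree 1 at x = 1) and convex, in accordance with the fuzzy-interval property stated above.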
On this basis, two variants of the granular counting algorithm can be devised:
- Exact granular counting uses all the values of α that correspond to some possibility degree in the possibilistic assignment table;
- Approximate granular counting uses values of α taken from a finite set of evenly spaced numbers over [0, 1]. The number of such values depends on a user-defined parameter n_α.

Approximate granular counting is more efficient than the exact version because it does not require scanning the possibilistic assignment table, though at the price of a new required parameter. Exact granular counting (Algorithm 1) and approximate granular counting (Algorithm 2) share the same core algorithm (Algorithm 3) and differ only in how the set of α-values is computed. In essence, the core algorithm computes the granular count incrementally, by reckoning the α-cuts of the fuzzy interval for each α value provided in input.

In brief, the core algorithm works as follows. Given the possibilistic assignment table R, the index i of the referent and the set A of α-values, the array r represents the possibility degrees that an observation refers to r_i, i.e. r_j = π_{o_j}(r_i) (line 1), while r̄ represents the possibility degrees that an observation refers to any other referent different from r_i (line 2). N is the array representing the granular count (line 3). The main cycle (lines 4-17) loops over each α ∈ A and computes the bounds x_min and x_max of the corresponding α-cut (line 5). These bounds are calculated by looping over all observations (lines 6-13), so that x_max is incremented if the possibility degree that the current observation refers to r_i is greater than or equal to α (lines 7-8), while x_min further requires that the possibility degree that the observation refers to any other referent is less than α (lines 9-10). When both x_min and x_max have been computed, the membership degrees of the granular count are updated accordingly (lines 14-16).
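The core procedure and the two ways of choosing the α-values can be sketched in Python as follows (a reimplementation from the description above, not the authors' code; the function names are assumptions):

```python
def granular_count(table, i, alphas):
    """Core granular counting (Algorithm 3): one alpha-cut per alpha value."""
    m = len(table)
    r = [row[i] for row in table]       # degrees of referring to r_i
    rbar = [max(v for k, v in enumerate(row) if k != i) for row in table]
    N = [0.0] * (m + 1)                 # membership array of the granular count
    for alpha in alphas:
        # bounds of the alpha-cut of the fuzzy interval
        x_max = sum(1 for j in range(m) if r[j] >= alpha)
        x_min = sum(1 for j in range(m) if r[j] >= alpha and rbar[j] < alpha)
        for x in range(x_min, x_max + 1):
            N[x] = max(N[x], alpha)     # update membership degrees on the cut
    return N

def exact_alphas(table):
    """Exact variant: all distinct positive degrees in the table."""
    return sorted({v for row in table for v in row if v > 0})

def approx_alphas(n_alpha, eps=1e-12):
    """Approximate variant: n_alpha evenly spaced values over (0, 1]."""
    return [eps + k * (1.0 - eps) / (n_alpha - 1) for k in range(n_alpha)]

table = [[1.0, 0.3], [0.7, 1.0], [0.0, 1.0]]          # hypothetical example
print(granular_count(table, 0, exact_alphas(table)))  # -> [0.3, 1.0, 0.7, 0.0]
```

The exact variant here yields the same fuzzy interval as the brute-force definition, without enumerating subsets.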
For a fixed referent, the time complexity of exact granular counting is O(nm²) (where n is the number of referents and m the number of observations), while the time complexity of approximate granular counting drops to O(m(n + n_α)). Considering that, in typical scenarios, the number of observations is very large compared to the number of referents (i.e., m ≫ n), approximate granular counting is the preferred choice in the case of very large amounts of uncertain data.

The time complexity of approximate granular counting depends linearly on the number of α-values which, in turn, depends on the value of the parameter n_α. On one hand, low values of n_α lead to fast computation of granular counts; on the other hand, low values of n_α may lead to a rough estimate of the possibility degrees of the exact granular count. In Fig. 2 the Jaccard similarity measure between approximate and exact granular counts is reported for n_α between 2 and 100: even though similarity values close to 1 are reached for n_α ≥ 20, for smaller values a significant dissimilarity can be observed. In order to assess whether the discrepancy between approximate and exact granular counts is acceptable for the problem at hand, it is important to identify bounds for the exact granular count when only an approximate count is available.

In order to identify such bounds, a closer look at Algorithm 3 is necessary. The algorithm computes the granular count for the i-th referent given a possibilistic assignment table R and a set A of α-values. The main cycle within the algorithm computes the α-cut of the granular count, which is represented by the array N and corresponds to the possibility distribution π_N.
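One way to quantify the discrepancy between two granular counts is the Jaccard similarity between the corresponding fuzzy sets. The sketch below uses the standard fuzzy-set form (sum of minima over sum of maxima); whether this is the exact variant used in the paper's Fig. 2 is an assumption, and the two membership arrays are hypothetical:

```python
def jaccard(a, b):
    """Jaccard similarity of two fuzzy sets given as membership arrays."""
    num = sum(min(x, y) for x, y in zip(a, b))   # cardinality of the intersection
    den = sum(max(x, y) for x, y in zip(a, b))   # cardinality of the union
    return num / den if den > 0 else 1.0

exact = [0.3, 1.0, 0.7, 0.0]      # hypothetical exact granular count
approx = [0.25, 1.0, 0.75, 0.0]   # hypothetical approximate count
print(jaccard(exact, approx))     # close to, but less than, 1
```

A similarity of 1 means the approximate count coincides with the exact one; values noticeably below 1 signal that n_α is too small.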
For a given value of α, the variable x_max counts the number of observations that refer to r_i with a possibility degree ≥ α; on the other hand, the variable x_min counts the number of observations that refer to r_i with a possibility degree ≥ α and to any other referent with a possibility degree < α. As a consequence, x_min ≤ x_max. Since in our analysis we consider different values of α, we shall denote the two variables as x_min^(α) and x_max^(α). The value x_min^(α) is the cardinality of the set

O_min^(α) = { o ∈ O : π_o(r_i) ≥ α ∧ max_{r ≠ r_i} π_o(r) < α }    (3)

while x_max^(α) is the cardinality of the set

O_max^(α) = { o ∈ O : π_o(r_i) ≥ α }    (4)

with the obvious relation that O_min^(α) ⊆ O_max^(α). On this basis, it is possible to prove the following lemmas.

Lemma 1. If x < x_min^(α), then π_N(x) < α.

Proof. By Definition (1), we can write

π_{O_x}(r_i) = min(P, Q)    (5)

where P = min_{o ∈ O_x} π_o(r_i) and Q = min_{o ∉ O_x} max_{r ≠ r_i} π_o(r). We focus on Q. Let O_x ⊂ O be a subset of x observations. Since x < x_min^(α), at least one observation o ∈ O_min^(α) does not belong to O_x; for such an observation, by definition, max_{r ≠ r_i} π_o(r) < α, therefore Q < α because o ∉ O_x. As a consequence, π_{O_x}(r_i) < α. This is true for all subsets of cardinality x < x_min^(α), therefore π_N(x) < α.

Lemma 2. If x > x_max^(α), then π_N(x) < α, unless x is a number of observations greater than m; in such a case, π_N(x) = 0 by definition.

Proof. Similarly to the proof of the previous lemma, we split Definition (1) as in Eq. (5), but now we focus on P. Since x > x_max^(α), at least one observation o ∈ O_x does not belong to O_max^(α); for such an observation π_o(r_i) < α. As a consequence, P < α, thus π_{O_x}(r_i) < α. This is true for all subsets of cardinality x > x_max^(α), thus proving the thesis.

Lemma 3. If x_min^(α) ≤ x ≤ x_max^(α), then π_N(x) ≥ α. (The proof follows from the sets defined in Eqs. (3) and (4).)

The previous lemmas show that, for a given value of α, the exact granular count must satisfy the following relations:

π_N(x) < α for x < x_min^(α) or x > x_max^(α);
π_N(x) ≥ α for x_min^(α) ≤ x ≤ x_max^(α).

In Fig. 3 the 0.3- and 0.7-cuts are used to depict the regions that bound the values of π_N. Notice that such regions have been computed without knowing the actual values of the exact granular count. Thanks to the properties of α-cuts, it is possible to identify tight bounds for an exact granular count by using the results of an approximate granular count: in fact, given two values α′ < α″, the α″-cut is included in the α′-cut (Fig. 3).
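The lemmas can be checked numerically. The sketch below computes x_min^(α) and x_max^(α) for a small hypothetical table whose exact granular count (obtained from Eq. (2)) is [0.3, 1.0, 0.7, 0.0], and verifies that π_N(x) ≥ α exactly inside the α-cut and π_N(x) < α outside it:

```python
def alpha_cut_bounds(table, i, alpha):
    """x_min and x_max of the alpha-cut, as computed in Algorithm 3."""
    r = [row[i] for row in table]
    rbar = [max(v for k, v in enumerate(row) if k != i) for row in table]
    x_max = sum(1 for j in range(len(table)) if r[j] >= alpha)
    x_min = sum(1 for j in range(len(table)) if r[j] >= alpha and rbar[j] < alpha)
    return x_min, x_max

table = [[1.0, 0.3], [0.7, 1.0], [0.0, 1.0]]   # hypothetical example
pi_N = [0.3, 1.0, 0.7, 0.0]                    # its exact granular count, Eq. (2)

for alpha in (0.3, 0.7, 1.0):
    x_min, x_max = alpha_cut_bounds(table, 0, alpha)
    for x, degree in enumerate(pi_N):
        if x_min <= x <= x_max:
            assert degree >= alpha   # Lemma 3
        else:
            assert degree < alpha    # Lemmas 1 and 2
print("lemmas verified on the example")
```

For instance, at α = 0.7 the cut is [1, 2], and indeed π_N(1) = 1 and π_N(2) = 0.7 are the only degrees not below 0.7.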
Fig. 3. α values bound the exact granular count π_N.

The previous relations suggest a strategy for computing the bounds of an exact granular count: supposing that, for some x, it is known that α_1 ≤ π_N(x) ≤ α_2 and a new value α is considered, then, if x_min^(α) ≤ x ≤ x_max^(α), the lower bound can be raised to max(α_1, α); otherwise, the upper bound can be lowered to min(α_2, α). On the basis of this strategy, it is possible to define a bounded granular counting algorithm that computes bounds on the exact granular count when a set A of α-values is given; it is reported in Algorithm 4. In this algorithm the bounds are represented by the arrays N_l and N_u (lines 3-4), which are updated for each α ∈ A so as to satisfy the above-mentioned relations (lines 15-22). The resulting bounded granular count is an Interval-Valued Fuzzy Set (IVFS) [9] which assigns, to each x ∈ ℕ, an interval [π_{N_L}(x), π_{N_U}(x)] containing the (possibly unknown) value of π_N(x).

Figure 4 shows an example of bounded granular count obtained by generating the set of α-values as in approximate granular counting with n_α = 5. It is possible to observe that the core of the granular count (i.e. the set of x values such that π_N(x) = 1) is precisely identified, because the approximate granular counting algorithm includes α = 1 in the set of α-values to be considered. Also, since a value of α very close to 0 (namely, ε = 10⁻¹²) is included as well, the impossible counts (i.e. the values of x such that π_N(x) = 0) are identified too.

It is also possible to set n_α in order to achieve a desired precision. By looking at Algorithm 2, it is possible to observe that the values of α are equally spaced at distance Δα = (1 − ε)/(n_α − 1), where Δα coincides with the maximum length of the intervals computed by the bounded granular counting algorithm. By reversing the problem, it is possible to set the value of n_α so that the maximum length is less than a desired threshold β: since ε ≈ 0, it suffices to set n_α ≥ 1 + 1/β.

The evaluation of efficiency has been performed on synthetically generated data.
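The strategy above can be sketched as a bounded granular counting routine (a reimplementation from the description, not the authors' Algorithm 4; the interface is an assumption). It returns two membership arrays enclosing the exact count, with interval widths bounded by Δα = (1 − ε)/(n_α − 1):

```python
def bounded_count(table, i, n_alpha, eps=1e-12):
    """Lower/upper membership arrays enclosing the exact granular count."""
    m = len(table)
    r = [row[i] for row in table]
    rbar = [max(v for k, v in enumerate(row) if k != i) for row in table]
    lower = [0.0] * (m + 1)
    upper = [1.0] * (m + 1)
    alphas = [eps + k * (1.0 - eps) / (n_alpha - 1) for k in range(n_alpha)]
    for alpha in alphas:
        x_max = sum(1 for j in range(m) if r[j] >= alpha)
        x_min = sum(1 for j in range(m) if r[j] >= alpha and rbar[j] < alpha)
        for x in range(m + 1):
            if x_min <= x <= x_max:
                lower[x] = max(lower[x], alpha)   # inside the cut: pi_N(x) >= alpha
            else:
                upper[x] = min(upper[x], alpha)   # outside the cut: pi_N(x) < alpha
    return lower, upper

table = [[1.0, 0.3], [0.7, 1.0], [0.0, 1.0]]       # hypothetical example
lower, upper = bounded_count(table, 0, n_alpha=5)  # delta_alpha = 0.25
exact = [0.3, 1.0, 0.7, 0.0]                       # exact count, for comparison
assert all(l <= e <= u for l, e, u in zip(lower, exact, upper))
```

Following the relation above, choosing n_α ≥ 1 + 1/β then guarantees interval widths below a threshold β; e.g. β = 0.01 requires n_α = 101.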
In particular, a number of random possibilistic assignment tables have been generated by varying the number of observations in a geometric progression with common ratio 10, while keeping the number of referents fixed to three. For each possibilistic assignment table, both the exact and the bounded granular counting algorithms (the latter with n_α = 10) have been applied to the first referent, and the time required to complete the operations has been recorded. Each experiment has been repeated 7 times and the average time has been recorded; for each repetition, the experiment has been looped 10 times and the best timing retained.

The average execution time is reported in Table 2 and depicted in Fig. 5. A linear regression in the log-log scale confirms the quadratic trend of the time required by exact granular counting and the linear trend of the bounded granular counting algorithm. Noticeably, the change of complexity is almost exclusively due to the way the set of α-values is generated: the selection of all values occurring in the possibilistic assignment table, which is required for exact granular counting, determines a significant reduction of the overall efficiency. On the other hand, bounded granular counting retains the light-weight computations of approximate granular counting but, at the same time, offers certified bounds on the possibility degrees of the exact granular count.

In Bioinformatics, RNA-Seq is a protocol that allows examining the gene expression in a cell by sampling fragments of RNA called "reads". When the RNA-Seq output is mapped against a reference database of known genes, a high percentage of reads, called multireads, map to more than one gene [10]. Multireads are a source of uncertainty in the quantification of gene expression, which should be managed in order to provide significant results.
To this end, the mapping procedure provides a quality index that is a biologically plausible estimate of the possibility that a read can be associated to a gene [11]. However, a high quality index does not imply certainty in the association: two or more genes can be candidates for mapping a read because they can be mapped with similarly high quality. Granular counting finds a natural application in the specific problem of counting the number of reads that are possibly associated to a gene (reads are considered as observations, while genes are referents). However, the amount of data involved in such a process may be overwhelming. For example, the public dataset SRP014005, downloaded from the NCBI-SRA archive, contains a case-control study of the Asthma disease with 55,579 reads mapped on 14,802 genes (16% are multireads). Nonetheless, accurate granular counting can be achieved with the proposed algorithm. As an example, in Fig. 6 the bounded granular count has been computed for gene OTTHUMG00000189570-HELLPAR with n_α = 10. It is worth observing how imprecise the count of this gene is, which is due to a large number of multireads (with different quality levels).

The proposed bounded granular counting algorithm is an extended version of approximate granular counting in which efficient computation is combined with the ability to bound the exact granular count within intervals whose granularity can be decided by the user. In most cases, it is more than enough that the possibility degrees of the exact granular count are assured to be within a small range of some approximate values. When this type of imprecision is tolerated, a significant speed-up in calculations can be achieved, thus opening the door of granular counting to big-data problems.
References

[1] How data workers cope with uncertainty: a task characterisation study
[2] A survey of uncertain data algorithms and applications
[3] Learning from ambiguously labeled examples
[4] Rough Sets and Data Mining
[5] Reasoning with unlabeled samples and belief functions
[6] Computational Complexity
[7] Granular counting of uncertain data
[8] GrCount: counting method for uncertain data
[9] Interval-valued fuzzy sets, possibility theory and imprecise probability
[10] Managing NGS differential expression uncertainty with fuzzy sets
[11] A fuzzy method for RNA-Seq differential expression analysis in presence of multireads

Acknowledgments. This work was partially funded by the INdAM - GNCS Project 2019 "Metodi per il trattamento di incertezza ed imprecisione nella rappresentazione e revisione di conoscenza" ("Methods for handling uncertainty and imprecision in knowledge representation and revision").