key: cord-0972268-kt9sbn40 authors: Lee, Guanling; Peng, Sheng-Lung; Lin, Yuh-Tzu title: Proportional fault-tolerant data mining with applications to bioinformatics date: 2009-02-19 journal: Inf Syst Front DOI: 10.1007/s10796-009-9158-z sha: 50421f4a4d4bc6bf6e923f2bfc806bde6529f34f doc_id: 972268 cord_uid: kt9sbn40 The mining of frequent patterns in databases has been studied for several years, but few reports have discussed for fault-tolerant (FT) pattern mining. FT data mining is more suitable for extracting interesting information from real-world data that may be polluted by noise. In particular, the increasing amount of today’s biological databases requires such a data mining technique to mine important data, e.g., motifs. In this paper, we propose the concept of proportional FT mining of frequent patterns. The number of tolerable faults in a proportional FT pattern is proportional to the length of the pattern. Two algorithms are designed for solving this problem. The first algorithm, named FT-BottomUp, applies an FT-Apriori heuristic and finds all FT patterns with any number of faults. The second algorithm, FT-LevelWise, divides all FT patterns into several groups according to the number of tolerable faults, and mines the content patterns of each group in turn. By applying our algorithm on real data, two reported epitopes of spike proteins of SARS-CoV can be found in our resulting itemset and the proportional FT data mining is better than the fixed FT data mining for this application. Valuable information is one of the most powerful weapons in the current knowledge age, and hence knowledge discovery has become a popular field of research. For example, in bioinformatics, biological data grow exponentially in size and complexity. It is not an easy work to extract useful information from it. Thus an important goal of data mining in bioinformatics is to extract valuable information from a large amount incomprehensible, biological data. Traditional algorithmical techniques use pattern matching algorithms to find valuable information for biological sequences. For example, linear scan (Knuth et al. 1977) and suffix tree (Ukkonen 1995) are two wellknown approaches. For more references and applications on patter matching, we refer to (Gusfield 1997) . In contrast, data mining in bioinformatics deals with different techniques and algorithms to gain knowledge from data of biological sequences, structures, and microarrays (Chen 2005) . Association-rule mining, which was first investigated in (Agrawal et al. 1993) , explores the relationships among data items. It is a very popular technique in data mining. For example, in the analysis of association rules of a metabolic pathway, the rule "Gene 1 → Gene 2 , support=10%, confidence=90%" means that the support of 10% indicates that Gene 1 and Gene 2 are expressed together in 10% of all datasets, and the confidence of 90% indicates that 90% of the metabolic pathway which activated Gene 1 also correlated with Gene 2 . Recently, Kotlyar and Jurisica used this technique to predict protein-protein interactions (Kotlyar and Jurisica 2006) . The approaches used to find association rules can be roughly classified into two categories. The first category is the Apriori-based algorithm (Agrawal and Srikant 1994) . This influential algorithm generates candidate patterns according to the non-monotonicity heuristic. The two main problems of Apriori are that (1) it needs to scan the database several times and (2) it generates too many candidate patterns. Various techniques have been used to overcome these problems. In Park et al. (1995) , a hash table is used to store candidate patterns to increase the scanning efficiency; Agrawal and Srikant (1994) , Han and Fu (1995) , and Park et al. (1995) attempt to reduce the number of transactions scanned in future iterations; Savasere et al. (1995) and Zaki (2000) adopt the techniques of partitioning and sampling; and Brin et al. (1997) proposes a dynamic pattern-counting method in which candidate patterns are added at different points during a scan. And the negative association rules are considered in Antonie and Zaïane (2004) ; Thiruvady and Webb (2004) ; Zhang and Zhang (2004) . Moreover, the problem of mining temporal indirect association patterns is considered in Chen et al. (2006) . The second category is the tree-based algorithm, which was proposed as the FP-tree (frequent-pattern tree) in Han et al. (2000) . This algorithm scans the database to find all frequent items, and compresses the database by representing the frequent items in an FP-tree. Finally, all frequent patterns can be obtained by searching the tree. When the database is large, it is sometimes unrealistic to construct an FP-tree that resides in main memory. This leads to the proposed extension of the pattern-growth concept, namely, H-mine. H-mine designs a dynamic structure to adjust links dynamically, instead of requiring an FP-tree to be maintained or a physical database to be created. The motivation of this method is to preserve space, and initially involves loading transactions into memory. However, H-mine has to maintain a head table at each level of the tree, and modify the links to build a queue of the collection of transactions containing the same prefix before the pattern support is counted. Expect for the two categories of approaches, some works study on other ways to extract association rules. Matrix Apriori (Pavon et al. 2006) utilizes simple structures such as matrices and vectors in the process of generating frequent patterns, and it also minimizes the number of candidate sets. In (Lee et al. 2006) , the idea of compressions rules is proposed and a data mining structure is used to extract association rules from a database. Redundant data will then be replaced by means of compression rules. (Chu et al. 2005 ) uses a simple method to transform the transactions read from the database into their corresponding patterns and then accumulates the occurring times of these patterns. (Chen et al. 2002) refines sampling functions to a two-phase sampling based algorithm that attempts to reduce the errors caused by sampling functions. (Chen and Ho 2005) proposed a sampling-based method that contains three phases. The first phase draws a small sample of data to estimate the set of frequent patterns, denoted as FS. The second phase computes the actual supports of the patterns in FS as well as identifies a subset of patterns in FS that need to be further examined in the next phase. Finally, the third phase explores this set and finds all missing frequent patterns. Traditional association-rule mining extracts patterns that match exactly. However, real-world databases contain noise that can make important information ambiguous, resulting in it not appearing in the mining result. Moreover, sometimes a decision maker will not be helped by the limited knowledge mined from a small database. Therefore, we need a method that copes with such variations in an association pattern (within predefined limits), which is called a fault-tolerant (FT) pattern. In contrast to traditional frequent-pattern mining, the mining of FT patterns must tolerate a certain degree of inexactitude. For example, coughing, fever, a runny nose, a headache, and a sore throat are all signs of catching a cold. However, these symptoms are seldom present at the same time, and hence a doctor will not diagnose the disease exactly following the rule R1: {coughing, fever, runny nose, headache, sore throat} à{catch a cold}. Instead, a better rule corresponding to the real-world situation would be R2: Patients who have at least two of the following symptoms {coughing, fever, runny nose, headache, sore throat} are catching a cold. R2 requires matching just part of the data, which illustrates the sense of allowing for fault tolerance in data mining. Yang et al. (2001) was the first to propose discovering FT frequent patterns in many dimensions. Their primary motivation was to find frequent groups of transactions (user groups, web sessions, and so on.) instead of focusing on just the items themselves, allowing for the discovery of groups of similar transactions that share most items. Unfortunately, the approach proposed in Yang et al. (2001) may generate sparse patterns, which may contain subpatterns that do not appear frequently. Another milestone of FT pattern mining is the work described in Pei et al. (2001) , in which extending Apriori and developing FT-Apriori for FT frequent-pattern mining allows a complete set of FT patterns to be mined out. However, the disadvantages of Apriori-based algorithms, including a huge number of candidate patterns and high database scanning frequency, also occurred in Pei et al. (2001) . In response, Wang and Lee (2002) suggested the algorithm FTP-mine that finds FT patterns using the concept of pattern growth. The main defect of Pei et al. (2001) and Wang and Lee (2002) is their definition of the number of tolerable faults in a pattern as a fixed number. Defining the number of tolerable faults in the patterns as a fixed number of items is not objective. The matters of "tolerant 1 item" in a pattern with length 4 and that in a pattern with length 10 give people entirely different sense. For example, the function of a protein is determined by its structure but not sequence. It is possible that two proteins of similar function have different sequence lengths, e.g., the family of heat shock proteins. In this case, it is hard to mine them together using FT pattern mining with fixed number of tolerable faults. In this paper, we introduce the problem of mining proportional FT patterns; in these patterns, the number of tolerable faults in the pattern is proportional to the pattern length. Two approaches are proposed to solve this problem. The remainder of this paper is organized as follows: background knowledge and problem definitions are presented in Section 2, the principles underlying our approaches are presented in Section 3, Section 4 describes the experimental results, and the conclusions and future work are discussed in Section 5. Let pattern X={i 1 ,…,i n } be a set of items, where the length of X is the cardinality of X, denoted as |X|. Moreover, X is called an |X|-pattern since it contains |X| items. A transaction T = (tid, X) is a 2-tuple record, where tid is the transaction-id and X is a pattern. Transaction T=(tid, X) is said to contain pattern Y iff Y⊆X. A transaction database TDB is a set of transactions. The number of transactions in TDB containing pattern X is called the support of X, denoted as sup(X). Given a transaction database TDB and a user-defined support threshold min_sup>0, pattern X is a frequent pattern iff sup(X)≥min_sup. A frequent pattern with length k is denoted as a frequent-k pattern. In the process of frequent-pattern mining, possible patterns are generated as candidate patterns, and these are later tested to determine whether they are frequent. The problem of frequent-pattern mining is to find the complete set of frequent patterns in a given transaction database with respect to a given support threshold. Extending the problem of mining frequent patterns, the FT frequent-pattern-mining problem relaxes the definition of containing to FT-containing. In addition to mining exact patterns that occur with high frequencies, we find those frequent patterns that contain some errors. In Pei et al. (2001) and Wang and Lee (2002) , FT-containing is defined as mismatches in a fixed number of items in a pattern. However, as mentioned above, it is disadvantageous for the same number of faults to be tolerated in patterns with different lengths. Therefore, the problem of proportional FT frequent-pattern mining is proposed. Let P be a pattern. A transaction T=(tid, X) is said to FTcontain pattern P with respect to a given FT parameter δ (0<δ≤1) iff there exists P′⊆P such that P′⊆ X and P 0 j j P j j ! d. The number of transactions in a database FT-containing pattern P is called the FT-support of P, denoted as sup FT (P). Let B(P) be the set of transactions FT-containing pattern P. Given a frequent item-support threshold min_sup item and an FT-support threshold min_sup FT , a pattern P is called an FT frequent pattern iff 1. sup FT (P)≥min_sup FT ; and 2. for each item p∈P, sup item B(P) (p)≥min_sup item , where sup item B(P) (p) is the number of transactions in B(P) containing item p. Definition 2.1 mostly extends (Pei et al. 2001) , except the FT parameter and the definition of FT-containing. The itemsupport threshold avoids the problem of sparse patterns appearing which can occur in (Yang et al. 2001) , by constraining the frequency of occurrence of each item in a pattern. Moreover, our definition of FT-containing releases the constraint that the number of fault items tolerable in a pattern is fixed-instead, the number of tolerable-fault items increases depending on the length of the pattern. Figure 1 shows the relation between the length of pattern X and #fault(|X|), where #fault(|X|) denotes the number of fault items tolerable in pattern X. According to Definition 2.1, we have the following equation c . In the horizontal parts of the stair shown in Fig. 1 , our problem can be simplified to previous works, i.e., FT-Apriori (Pei et al. 2001) can be extended as the following lemma to solve part of our problem. then none of its superset that tolerates the same number of faults will be an FT pattern However, we have a challenge where the gaps occur in Fig. 1 , since these areas do not comply with the nonmonotonicity property. That is, if a pattern is not a frequent FT pattern, its superset can still be a frequent FT pattern. Therefore, solutions from previous studies cannot solve our problem. For example, please refer to However, observation of the properties of patterns separated by the gap produces the following lemma: Lemma 2.2 Let set(X subpattern ) denote the set of X's subpatterns whose length is one less than the length of X. Moreover, let #fault(|X|)-1= #fault(|X|-1) (i.e., X and the considered subset are separated by the gap). If X is not a frequent FT pattern, then we have the following two conditions: case 1. if sup FT (X)min_sup FT . Because sup FT (X subpattern ) is included in sup FT (X), sup FT (X) >min_sup FT , which disagrees with the supposition. Therefore, no patterns in set(X subpattern ) can be frequent patterns if sup FT (X)min_sup item . Since the item support of X subpattern counts from S', that of X is from S, and S′⊂S, we obtain sup item B(X) (x j )>min_sup item , which disagrees with the supposition. Therefore, none of patterns in set(X subpattern ) that contains item x j can be an FT pattern if sup item B(X) (x j )