Constrained Dynamic Rule Induction Learning

Fadi Thabtah (a), Issa Qabajeh (b), Francisco Chiclana (c)

a. Applied Business and Computing, NMIT, Auckland, New Zealand
b. School of Computer Sciences and Informatics, De Montfort University, Leicester, UK
c. Centre for Computational Intelligence, Faculty of Technology, De Montfort University, Leicester, UK

Abstract

One of the known classification approaches in data mining is rule induction (RI). RI algorithms such as PRISM usually produce If-Then classifiers, which have a predictive performance comparable to other traditional classification approaches such as decision trees and associative classification. Hence, these classifiers are favourable for carrying out decisions by users and can therefore be utilised as decision making tools. Nevertheless, RI methods, including PRISM and its successors, suffer from a number of drawbacks, primarily the large number of rules derived. This can be a burden especially when the input data is highly dimensional. Therefore, pruning unnecessary rules becomes essential for the success of this type of classifier. This article proposes a new RI algorithm that reduces the search space for candidate rules by early pruning of any irrelevant items during the process of building the classifier. Whenever a rule is generated, our algorithm updates the frequencies of the candidate items to reflect the discarded data examples associated with the derived rules. This makes item frequencies dynamic rather than static and ensures that irrelevant rules are deleted at preliminary stages, when they do not hold enough data representation. The major benefit will be a concise set of decision making rules that are easy to understand and controlled by the decision maker. The proposed algorithm has been implemented in the WEKA (Waikato Environment for Knowledge Analysis) environment and hence it can now be utilised by different types of users such as managers, researchers and students.
Experimental results using real data from the security domain as well as sixteen classification datasets from the University of California Irvine (UCI) repository reveal that the proposed algorithm is competitive with regard to classification accuracy when compared to known RI algorithms. Moreover, the classifiers produced by our algorithm are smaller in size, which increases their possible use in practical applications.

Keywords: Classification, Data Mining, Prediction, PRISM, Rule Induction, Online Security

1. Introduction

Data mining, which is based on computing and mathematical sciences, is a common intelligent tool currently used by managers to perform key business decisions. Traditionally, data analysts used to spend a long time gathering data from multiple sources, and little time was spent on analysis due to the limited computing resources. However, since the rapid development of computer networks and the hardware industry, analysts nowadays are spending more time on examining data, seeking useful concealed information. In fact, after the recent development of cloud computing, data collection, noise removal, data size and data location are no longer obstacles facing analysts. Data analysis or data mining is concerned with finding patterns in datasets that are useful for users, particularly managers, to perform planning (Thabtah and Hamoud, 2014). One of the known data mining tasks, which involves forecasting class labels in previously unseen data based on classifiers learnt from a training dataset, is classification. Normally, classification is performed in two steps: constructing a model, often named the classifier, from a training dataset, and then utilising the classifier to accurately guess the class value of test data. This type of learning is called supervised learning since, while building the classifier, the learning is guided toward the class label.
Common applications for classification are medical diagnoses (Rameshkumar et al., 2013), phishing detection (Abdelhamid et al., 2014), etc. There have been many different classification approaches including decision trees (Witten and Frank, 2005), Neural Network (NN) (Mohammad et al., 2013), Support Vector Machine (SVM) (Cortes and Vapnik, 1995), Associative Classification (AC) (Thabtah, et al., 2004), rule induction (RI) (Holt, 1993) and others. The latter two approaches, i.e. AC and RI, extract classifiers which contain "If-Then" rules, which explains their widespread applicability. However, there are differences between AC and RI, especially in the way rules are induced as well as pruned. This article falls under the umbrella of RI research. PRISM is one of the RI techniques; it was developed in (Cendrowska, 1987) and slightly enhanced by others, i.e. (Stahl and Bramer, 2008) (Elgibreen and Aksoy, 2013) (Stahl and Bramer, 2014). This algorithm employs a separate-and-conquer strategy for knowledge discovery in which PRISM generates rules according to the class labels in the training dataset. Normally, for a class, PRISM starts with an empty rule and keeps appending items to the rule's body until this rule reaches zero error (Definition 8, Section 2.2). When this occurs, the rule gets induced and the training data samples connected with the rule are discarded. The algorithm continues building other rules in the same way until no more data connected with the current class can be found. At this point, the same steps are repeated for the next-in-line class until the training dataset becomes empty. One of the obvious problems associated with PRISM is the massive number of rules induced, which normally results in large size classifiers. This problem is attributed to the way PRISM induces the rules, where it keeps adding items to the rule's body until the rule becomes 100% accurate despite the low data coverage.
In other words, PRISM does not mind inducing many specific rules, each covering a single data sample, rather than producing a rule with, say, 90% accuracy that covers 10 data samples. This excessive learning limits the applicability of PRISM as a decision making tool for managers in application domains and definitely overfits the training dataset. This is because managers normally prefer a summarised set of rules that they are able to control and comprehend rather than a larger, high-maintenance set of rules. In fact, there should be a trade-off between the number of rules offered and the predictive accuracy of these rules. This paper investigates shortcomings associated with the PRISM algorithm. Specifically, we look into three main issues:

1) Search space reduction: When constructing a rule for a particular class, PRISM has to evaluate the accuracy of all available items linked with that class in order to select the best one to add to the rule's body. This necessitates large computations when the training data has many attribute values and can be a burden, especially when several unnecessary computations are made for items that have low data representation (weak items). A frequency threshold that we call freq can be employed for pre-pruning of items with low frequency. It prevents these items from being part of rules, and therefore the search space gets minimised.

2) PRISM only generates a rule when its error is zero, which may cause overfitting of the training dataset. We want to derive good quality rules, not necessarily with 100% accuracy, to reduce overfitting and increase data coverage. We utilise a rule strength parameter (Rule_Strength) to separate acceptable from non-acceptable rules in our classifier.

3) When removing training data linked with a rule, we ensure that other items which appeared in the removed data are updated. In particular, we amend the frequency of the impacted items.
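The freq pre-pruning of issue (1) can be illustrated with a short sketch. This is not the authors' code; the function names (count_items, prune_weak_items) and the toy dataset are our own, and it only shows the idea of discarding items whose relative frequency in the training data falls below a user threshold before rule construction begins.

```python
from collections import Counter

def count_items(rows):
    """Count occurrences of each (attribute, value) item across the rows."""
    counts = Counter()
    for attrs, _cls in rows:
        for item in attrs.items():
            counts[item] += 1
    return counts

def prune_weak_items(rows, freq):
    """Keep only items whose relative frequency is at least `freq`."""
    counts = count_items(rows)
    n = len(rows)
    return {item for item, c in counts.items() if c / n >= freq}

# Hypothetical toy dataset: each row is ({attribute: value}, class_label).
rows = [({"outlook": "sunny"}, "play"),
        ({"outlook": "sunny"}, "no"),
        ({"outlook": "rain"}, "no"),
        ({"outlook": "sunny"}, "play")]
strong = prune_weak_items(rows, freq=0.5)
# ("outlook", "sunny") appears in 3/4 rows and survives the threshold;
# ("outlook", "rain") appears in 1/4 rows and is pruned.
```

Only the surviving (strong) items are then considered as candidates for rule bodies, which is how the search space shrinks.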
This indeed maintains the true weight of the items rather than the frequency computed from the initial input dataset. In response to the issues raised above, we develop in this article a new dynamic learning method based on RI that we name enhanced Dynamic Rule Induction (eDRI). Our algorithm discovers rules one by one per class and primarily uses a freq threshold to limit the search space for rules by discarding any items with insufficient data representation. For each rule, eDRI updates the frequencies of the items that appeared within the deleted training instances of the generated rule. This indeed gives a more realistic classifier with a lower number of rules, leading to a natural pruning of items during the rule discovery phase. Lastly, the proposed algorithm limits the use of the default class rule by generating rules with accuracy below 100%. Often such rules are ignored by the PRISM algorithm since they do not hold zero error. These rules are only used during the class prediction phase, instead of the default class rule, when no 100% accuracy rule is able to classify a test data example. This paper is structured as follows: Section 2 illustrates the classification problem and its main related definitions. Section 3 critically analyses PRISM and its successors, and Section 4 discusses the proposed algorithm and its related phases besides a comprehensive example that reveals eDRI's insight. Section 5 is devoted to the data and the experimental results analysis, and finally, conclusions are provided in Section 6.
Input: Training dataset T
Output: A classifier that consists of If-Then rules
Step 1: For each subset Ti in T that belongs to class wi Do
  Step 1.1: Make an empty rule rj: If Empty then wi
  Step 1.2: For each attribute value Attx in Ti, calculate p(wi | Attx)
  Step 1.3: Append the Attx with the largest p(wi | Attx) to the body of rj
Step 2: Repeat Steps 1.1-1.3 until rj has 100% accuracy or can no longer be improved
  Step 2.1: Generate rj and insert it into the classifier
Step 3: Discard all data examples from Ti that match rj's body
Step 4: Continue producing rules until Ti is empty or no rule with an accepted error can be found
Step 5: Repeat Steps 1-4 until no more data subsets of wi can be found
Step 6: If Ti does not contain any data examples of class wi, generate the classifier.

Fig. 1 PRISM Pseudocode

2. The Classification Problem and Definitions

Given a training dataset T, which has n distinct columns (attributes) Att1, Att2, …, Attn, one of which is the class, i.e. cl. The cardinality of T is |T|. An attribute may be nominal, which means it takes a value from a predefined set of values, or continuous. Nominal attribute values are mapped to a set of positive integers, whereas continuous attributes are preprocessed by discretising their values using any discretisation method. The aim is to build a classifier from T, e.g. Classifier: Att → cl, which forecasts the class of previously unseen data. Our classification method employs a user threshold called freq. This threshold serves as a fine line to distinguish strong ruleitems from weak ones based on their computed occurrences in T. Any ruleitem whose frequency passes freq is called a strong ruleitem; otherwise it is called a weak ruleitem. Below are the main terms used and their definitions.

Definition 1: An item is an attribute plus its value, denoted (Ai, ai).
Definition 2: A training example in T is a row consisting of attribute values (Aj1, aj1), …, (Ajv, ajv), plus a class denoted by cj.
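The separate-and-conquer loop of Fig. 1 can be sketched compactly in code. The following is a minimal illustration under our own naming, not the paper's implementation: it assumes nominal attributes only, represents a row as ({attribute: value}, class), breaks ties arbitrarily, and omits the acceptable-error stopping test of Step 4.

```python
def prism(rows):
    """Greedy separate-and-conquer: returns a list of (body, class) rules."""
    classes = {cls for _, cls in rows}
    classifier = []
    for target in classes:
        data = list(rows)
        # Keep inducing rules while the target class still has examples.
        while any(cls == target for _, cls in data):
            body, covered = {}, data
            while True:
                # Candidate items from attributes not already in the body.
                candidates = {item for attrs, _ in covered
                              for item in attrs.items() if item[0] not in body}
                if not candidates:
                    break
                # Pick the item with the highest p(target | item), as in Step 1.3.
                def accuracy(item):
                    match = [c for a, c in covered if a.get(item[0]) == item[1]]
                    return sum(c == target for c in match) / len(match)
                best = max(candidates, key=accuracy)
                body[best[0]] = best[1]
                covered = [(a, c) for a, c in covered if a.get(best[0]) == best[1]]
                if all(c == target for _, c in covered):
                    break  # rule reached 100% accuracy (Step 2)
            classifier.append((dict(body), target))
            # Separate: discard examples matched by the rule body (Step 3).
            data = [(a, c) for a, c in data
                    if not all(a.get(k) == v for k, v in body.items())]
    return classifier
```

On a toy dataset such as [({"a": "1"}, "y"), ({"a": "1"}, "y"), ({"a": "2"}, "n")], this induces the two rules ({"a": "1"}, "y") and ({"a": "2"}, "n").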
Definition 3: A ruleitem r has the format body → c, where body is a set of disjoint items and c is a class value.
Definition 4: The frequency threshold (freq) is a predefined threshold given by the end user.
Definition 5: The body frequency (body_Freq) of a ruleitem r in T is the number of data examples in T that match r's body.
Definition 6: The frequency of a ruleitem r in T (ruleitem_freq) is the number of data examples in T that match r.
Definition 7: A ruleitem r passes the freq threshold if |body_Freq| / |T| ≥ freq. Such a ruleitem is said to be a strong ruleitem.
Definition 8: A rule r's expected accuracy is defined as |ruleitem_freq| / |body_Freq|.
Definition 9: A rule in our classifier is represented as body → cl, where body is a set of disjoint attribute values and the consequent is a class value. The format of the rule is: a1 ∧ a2 ∧ … ∧ an → cl.

3. Literature Review

PRISM is a key algorithm for building classification models that contain simple yet effective, easy to understand rules. This algorithm was developed in 1987 based on the concept of separate-and-conquer, where data examples are separated using the available class labels. For each class (wi), an empty rule (If nothing then wi) is formed from that class's data samples. The algorithm computes the frequency of each attribute value linked with that class and appends the attribute value with the largest frequency to the current rule's body. PRISM terminates the process of building a rule when that rule has 100% accuracy according to Definition 8. When the rule is generated, the algorithm continues producing the remaining possible rules from wi's data subset until the data subset becomes empty or no more attribute values can be found with an acceptable accuracy. At that point, PRISM moves to the second class's data subset and repeats the same steps. The algorithm terminates when the whole training dataset has been evaluated, and when this happens the rules formed make up the classifier. Figure 1 depicts the PRISM algorithm's major steps.
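Definitions 5-8 can be made concrete with a small worked example. The helper names and the toy dataset below are ours, not the paper's; a ruleitem is represented as a (body, class) pair over rows of the form ({attribute: value}, class).

```python
def body_freq(rows, body):
    """Definition 5: number of rows whose attributes match the body."""
    return sum(all(a.get(k) == v for k, v in body.items()) for a, _ in rows)

def ruleitem_freq(rows, body, cls):
    """Definition 6: number of rows matching both the body and the class."""
    return sum(all(a.get(k) == v for k, v in body.items()) and c == cls
               for a, c in rows)

# Hypothetical toy dataset of four training examples.
rows = [({"wind": "strong"}, "no"),
        ({"wind": "strong"}, "no"),
        ({"wind": "strong"}, "yes"),
        ({"wind": "weak"}, "yes")]

body = {"wind": "strong"}
bf = body_freq(rows, body)                  # body matches 3 of the 4 rows
acc = ruleitem_freq(rows, body, "no") / bf  # Definition 8: expected accuracy 2/3
strong = bf / len(rows) >= 0.5              # Definition 7 with freq = 0.5
```

Here the ruleitem "wind = strong → no" is strong (3/4 ≥ 0.5) but has expected accuracy 2/3, so classic PRISM would keep specialising it, whereas a method accepting near-perfect rules could stop.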
Below are the main pros and cons associated with the PRISM algorithm based on the comprehensive review that we have carried out.

PRISM Pros
1) The simplicity of rule generation, in which only the rule's accuracy parameter is computed to decide on the rule's significance.
2) The classifier contains easy to understand rules which can empower decision makers, particularly in application domains that necessitate interpretation.
3) Easy implementation, especially with the available computing resources. Actually, PRISM can be easily implemented in different versions: local, online, distributed and in environments that require parallelism.
4) The predictive power of its classifier can be seen as acceptable when contrasted with other classic data mining approaches such as search methods, decision trees, neural networks, associative classification and many others.

PRISM Cons
1) In the original PRISM algorithm, there is no clear search space reduction mechanism for candidate items, and therefore for large dimensional datasets the expected number of candidate items can be huge. This may limit its use for certain applications.
2) Noisy datasets that contain incomplete attributes and missing values. This can be seen as a major challenge for PRISM since no clear mechanism for handling noise is presented. Currently, approaches adopted from information theory are used to handle noise (Bramer, 2000) (Stahl and Bramer, 2014).
3) Handling numeric attributes. No built-in strategy is available in PRISM to discretise continuous attributes.
4) No clear rule pruning methodology is present in the original PRISM. This may lead to the discovery of large numbers of rules and to combinatorial explosion (Abdelhamid and Thabtah, 2014). There is a high demand for pruning methods to cut down the number of rules without hindering the overall performance of the classifiers.
5) Conflicting rules: There is no clear mechanism on how to resolve conflicting rules in PRISM.
Currently the choice is random and favours the class label with the largest frequency linked with the item rather than keeping multiple class labels per item.
6) Breaking ties among item frequencies while building a rule: When two or more items have the same frequency, PRISM looks at their denominators in the expected accuracy formula. Yet, sometimes these items have similar accuracies and denominators, which makes the choice random. Arbitrary selection without scientific justification can be seen as a biased decision and may not be useful for the overall algorithm's performance.

One of the major challenges faced by the PRISM algorithm is noisy training datasets. These are datasets that contain missing values, data redundancy, and incomplete data examples. A modified version called N-PRISM has been developed to handle noisy datasets besides focusing on maximising the prediction accuracy (Bramer, 2000). Experimental tests using eight datasets from the UCI repository (Lichman, 2013) showed consistent performance of N-PRISM when compared with classic PRISM, particularly on classification accuracy obtained from noisy and noise-free datasets. It should be noted that the differences between the original PRISM and N-PRISM are very minimal. Modified PRISM methods were developed based on a previously designed pre-pruning method called J-Pruning (Bramer, 2002) (Stahl and Bramer, 2012). The purpose was to reduce overfitting of the training dataset. J-Pruning is based on the information theory test typically performed in decision trees, measuring the significance of removing an attribute from the rule's body. The algorithm of (Stahl and Bramer, 2012) was proposed to address the rule pruning issue during the process of building rules for each class label. Experimental results using sixteen UCI datasets showed a reduction in the number of items per rule. In 2015, (Othman and Bryant, 2015) investigated instance reduction methods as a paradigm for rule pruning in RI.
They claimed that minimising the training dataset to the subset needed to learn the rules is one way to shrink the search space. The authors applied three common instance reduction methods, namely DROPS, ENN and ALLKNN, on a limited number of datasets to evaluate their impact on the resulting classifiers. The limited results obtained can be seen as a promising direction for using instance reduction methods as a pre-pruning phase in RI. The database coverage pruning (Liu et al., 1998) and its successors (Abdelhamid, et al., 2014) (Thabtah, et al., 2011) were among the major breakthroughs in associative classification that can be employed successfully in RI as late or post pruning methods. These pruning methods necessitate a positive data coverage per rule with a certain accuracy so the rule can be part of the classifier. A new version of the database coverage pruning was developed by (Ayyat, et al., 2014) as a rule prediction measure. The authors proposed a method that considers the rule rank as a measure of goodness besides the number of similar rules sharing items in their bodies. These two parameters play a critical role in differentiating among available rules, especially in predicting the class labels of test data. Recently, (Elgibreen and Aksoy, 2013) investigated a common problem in the PRISM and RULES (Pham and Soroka, 2007) families of algorithms, which is the trade-off between training time and classification accuracy when constructing rule based classifiers. Moreover, the same article also highlighted the performance deterioration of RI methods when applied to incomplete datasets. The result is a new covering algorithm that utilises a "Transfer Knowledge" approach to fill in missing attribute values, specifically the target attribute, in the training dataset before mining kicks in.
"Transfer Knowledge" basically builds up a knowledge base from learning curves via agents from different environments and from previous learning experience, and then uses this knowledge base to fill in incomplete data examples (Ramon et al., 2007). Experimental results against eight datasets revealed that the improved covering algorithm consistently produced competitive classifiers when compared with other RI algorithms such as RULES-IS and PRISM. One of the major obstacles facing data mining algorithms is the massive amount of data stored and scattered in different geographical locations. Decision makers are striving for an "on the fly" mining approach that processes very large databases simultaneously, hoping to improve planning decisions. Most of the research work on parallelisation in classification has focused on decision trees. However, RI may present simpler classifiers to decision makers. The big data problem has been investigated by amending the PRISM algorithm to handle parallelisation (Stahl and Bramer, 2012). The authors developed a strategy for parallel RI called Parallel Modular Classification Rule Induction (PMCRI). This strategy is a continuation of earlier work by the same authors in 2008, which resulted in parallel PRISM (P-PRISM) (Stahl and Bramer, 2008). The P-PRISM algorithm was developed to overcome PRISM's excessive computational process of testing the entire population of data attributes inside the training dataset. The parallelism involves sorting the attribute values based on their frequency in the input dataset and the class attribute. This means the algorithm only needs this information for processing the rules, and hence holding this data rather than the entire input data reduces computing resources such as processing time and memory. The attribute values and their frequencies are then distributed to clusters and rules are generated from each cluster. All rules are finally merged to form the classifier.
Experiments using a replication of the Diabetes and Yeast datasets from the UCI repository have been conducted. The results show that P-PRISM as well as the parallel RI strategy scales well. However, a better approach to evaluating the parallelisation would be to utilise unstructured real data, such as Bioinformatics or text mining data where dimensionality is huge and the number of instances varies, rather than structured datasets. (Stahl and Bramer, 2014) developed a PRISM based method for ensemble learning in classification. Usually, ensemble learning is an approach in classification used to improve the classifier's predictive accuracy by deriving multiple classifiers using any learning approach, such as Neural Networks, decision trees, etc., and then merging them to form a global classifier. Since PRISM often suffers from the overfitting problem, to decrease this risk the authors built multiple classifiers based on PRISM using the ensemble learning approach. The results derived from fifteen UCI datasets revealed that the ensemble learning model based on PRISM was able to generate results comparable with the classic PRISM algorithm. Moreover, a parallel version of the new model has also been implemented by the authors and tested with respect to training time. Results on run time showed that the parallel version scales well during the process of constructing the classifier. PRISM's successors have focused mainly on improving the algorithm's scalability or reducing overfitting. We have seen approaches such as P-PRISM that showed a promising research direction toward parallel data mining using RI. Other approaches, such as ensemble learning based on PRISM, were able to minimise overfitting of the training data by generating ensemble classifiers that are later integrated to make a final classifier. Lastly, early pruning has been introduced to further minimise the search space through the introduction of J-Pruning.
There is a need to further cut down the irrelevant candidate items while forming the rules in PRISM. This indeed will have a positive impact on the algorithm's efficiency in mining the rules and building the classifier. We think that post pruning is essential in PRISM and should increase its chance of being used as a data mining solution and decision making tool in practical applications. Yet, since the classifiers produced by this family of algorithms are large in size, their usage is limited. One promising solution to shrink the size of the classifier is to eliminate rules that overlap in the training examples and to have rules cover larger numbers of data samples. This solution can be accomplished by maintaining live item frequencies during the mining phase, since PRISM relies primarily on item frequency to build rules. In particular, the PRISM algorithm often employs the rule's accuracy as a measure to generate the rule, especially when the rule's accuracy reaches 100%. This normally results in rules with low data coverage. We want each candidate item that can be part of a rule body to be associated with its true frequency in the training dataset, which usually changes when a rule is generated. Recall that when a rule is outputted by PRISM, all training data linked with it are discarded, and this may affect items appearing in those discarded rows. When each item has its true data representation while building the classifier, this indeed decreases the search space, and all insufficient candidate items will be deleted rather than stored as in PRISM. The overall benefit will be classifiers with fewer rules. These classifiers can be seen as a real decision support system since they hold concise knowledge that is maintainable and usable by decision makers.

4. Enhanced Dynamic Rule Induction Algorithm

The proposed algorithm (Figure 2) has two primary phases: rule production and class assignment of test data.
In phase (1), eDRI produces rules from the training dataset that have accuracy ≥ Rule_Strength. This parameter is similar to the confidence threshold used in associative classification (Thabtah, et al., 2004) and aims to generate near-perfect rules besides the rules with 100% accuracy produced by classic PRISM. The proposed learning method ensures the production of rules that have zero error as well as rules that survive the Rule_Strength parameter. The algorithm terminates building the classifier when no more rules have an acceptable accuracy or the training dataset becomes empty. When this occurs, all rules are merged together to form the classifier. There is another parameter utilised by the eDRI algorithm in phase one to minimise the search space of items. This parameter is called freq and it is similar to the minimal support threshold used in association rule mining. The freq parameter is primarily utilised to differentiate between items that are frequent (have a high number of occurrences in the training dataset) and those which are not. This surely eliminates items which have low frequency from the search space.

Step 2: Repeat until rj's accuracy ≥ Rule_Strength
  Step 2.1: Generate rj
Step 3: Discard all data examples from T that contained rj's body
  Step 3.1: Update the frequency of all impacted candidate items to reflect Step 3
  Step 3.2: Continue producing rules from Ti until all remaining unclassified items have frequency
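The dynamic frequency update at Step 3.1 can be sketched as follows. This is a minimal illustration under our own naming (remove_covered is hypothetical, not from the paper): when a generated rule's covered rows are removed, the counts of every item in those rows are decremented, and items whose relative frequency falls below freq drop out of the search space.

```python
from collections import Counter

def remove_covered(rows, counts, body, freq):
    """Remove rows matched by `body`, decrement the counts of items in those
    rows, and return the surviving rows plus the items still strong w.r.t. freq."""
    remaining = []
    for attrs, cls in rows:
        if all(attrs.get(k) == v for k, v in body.items()):
            for item in attrs.items():
                counts[item] -= 1        # dynamic, not static, frequency
        else:
            remaining.append((attrs, cls))
    n = len(remaining)
    strong = {i for i, c in counts.items() if n and c / n >= freq}
    return remaining, strong

# Hypothetical toy dataset; the rule body {"a": "1"} covers the first two rows.
rows = [({"a": "1", "b": "x"}, "y"),
        ({"a": "1", "b": "y"}, "y"),
        ({"a": "2", "b": "x"}, "n")]
counts = Counter(item for attrs, _ in rows for item in attrs.items())
remaining, strong = remove_covered(rows, counts, {"a": "1"}, freq=0.5)
# ("a", "1") and ("b", "y") no longer appear in the remaining data and are
# pruned; ("a", "2") and ("b", "x") survive as strong items.
```

This is the sense in which item frequencies are "live": each removal reshapes the candidate item set before the next rule is built.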