Inducing Safer Oblique Trees without Costs

Sunil Vadera
S.Vadera@salford.ac.uk
School of Computing, Science and Engineering, University of Salford, Salford M5 4WT, UK

Abstract

Decision tree induction has been widely studied and applied. In safety applications, such as determining whether a chemical process is safe or whether a person has a medical condition, the cost of misclassification in one of the classes is significantly higher than in the other class. Several authors have tackled this problem by developing cost-sensitive decision tree learning algorithms or by suggesting ways of changing the distribution of training examples to bias the decision tree learning process so as to take account of costs. A pre-requisite for applying such algorithms is the availability of costs of misclassification. Although this may be possible for some applications, in the area of safety, obtaining reasonable estimates of costs of misclassification is not easy. This paper presents a new algorithm for applications where the cost of misclassification cannot be quantified, though the cost of misclassification in one class is known to be significantly higher than in another class. The algorithm utilises linear discriminant analysis to identify oblique relationships between continuous attributes and then carries out an appropriate modification to ensure that the resulting tree errs on the side of safety. The algorithm is evaluated with respect to ICET, one of the best known cost-sensitive algorithms, OC1, a well known oblique decision tree algorithm, and an algorithm that utilises robust linear programming.

1 Introduction

Research on decision tree learning is one of the major successes of AI, with many commercial products, such as case based reasoning systems (Kolodner, 1993) and data mining tools, utilising decision tree learning algorithms (Berry & Linoff, 2004). Early research in this area developed algorithms that aimed to maximize overall accuracy. More recent research is influenced by the recognition that human decision making is not focused solely on accuracy, but also takes account of the potential implications of a decision. For example, a chemical engineer considers the risks of explosion when considering the safety of a process plant, a bank manager carefully considers the implications of a customer defaulting on a loan, and a medical consultant does not ignore the potential consequences of misdiagnosing a patient. This motivation has led research in the area of decision tree learning to progress from considering accuracy alone to the development of algorithms that minimize costs of misclassification (e.g., Breiman, Friedman, Olsen, & Stone, 1984; Turney, 1995).
Research on this problem can be categorised into two broad approaches: algorithms that aim to reduce a cost-sensitive problem to an equivalent accuracy maximization problem, and those that extend accuracy based algorithms to handle costs of misclassification.[1]

[1] There are, of course, some approaches that do not use decision trees at all, but these are outside the scope of this paper.

The first category of systems includes the proposal by Breiman et al. (1984), who suggested changing the distribution of the training examples to be equivalent to the ratio of costs of misclassification. More recently, Zadrozny, Langford, and Abe (2003) formalize this idea, called stratification, and show how the distribution of examples can be changed to implicitly reflect the costs and thereby enable the application of accuracy based classifiers. However, Elkan (2001) shows that an impurity measure with a similar basis to those adopted in many decision tree learning algorithms (such as CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993)) is insensitive to stratification, casting doubt on the effectiveness of using such systems after changing the distribution of examples.
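To make the stratification idea concrete, the following is a minimal sketch that rebalances a training set by oversampling the costlier class in proportion to an assumed integer cost ratio. The function name, the resampling-with-replacement scheme and the use of NumPy arrays are illustrative assumptions rather than the procedure used by any of the systems cited above.

```python
import numpy as np

def stratify_by_cost(X, y, costly_class, cost_ratio, seed=None):
    """Change the class distribution so that it reflects the ratio of
    misclassification costs (the stratification idea of Breiman et al. and
    Zadrozny et al.), here by oversampling the costly class with replacement.
    Assumes X, y are NumPy arrays and cost_ratio is a positive integer."""
    rng = np.random.default_rng(seed)
    costly = np.where(y == costly_class)[0]
    # draw (cost_ratio - 1) extra copies of the costly examples on average
    extra = rng.choice(costly, size=(cost_ratio - 1) * len(costly), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]
```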
The second category of systems is based on the observation that decision tree learning algorithms that aim to maximize accuracy construct trees in a greedy, top-down manner and utilize an information-theoretic measure to select decision nodes. Hence, a natural approach to cost-sensitive decision tree construction is to alter the information-theoretic selection measure so that it includes costs. A number of systems take this approach, including EG2 (Núñez, 1991), CS-ID3 (Tan, 1993) and ICET (Turney, 1995). The studies by Pazzani, Merz, Murphy, Ali, Hurne, and Brunk (1994) and Turney (1995) include empirical evaluations of these types of algorithms. Pazzani et al. (1994) conclude that these algorithms are no more effective than algorithms that ignore costs. Turney (1995) develops ICET, an algorithm based on using Genetic Algorithms (GAs) to evolve cost-sensitive trees that aim to take account of both costs of misclassification and costs of tests. Turney evaluates ICET with respect to several algorithms and shows that ICET performs better than the other algorithms, though he also notes the difficulty ICET has for domains where costs of misclassification are not symmetric:[2]

"it is easier to avoid false positive diagnosis (a patient is diagnosed as being sick, but is actually healthy) than it is to avoid false negative diagnosis (a patient is diagnosed as healthy, but is actually sick). This is unfortunate, since false negative diagnosis usually carry a heavier penalty, in real life."

[2] Turney refers to a simple cost matrix as one where the costs of misclassifications are equal, and a complex cost matrix as one where the costs are different.

Thus, the problem of developing an effective cost-sensitive decision tree algorithm when given costs of misclassification remains a difficult and open problem.

Even if there is an effective algorithm for minimising costs of misclassification and costs are available, there is a further potential problem, which we now illustrate. Consider the application of a decision tree induction algorithm on the Diabetes data set, which has been widely used for benchmarking. The data set has 768 examples, of which 65% are in the safe class (i.e., no diabetes) and 35% are in the unsafe class (i.e., has diabetes). Suppose we use 50% of this data for training and 50% for testing; then decision tree algorithms would typically produce the results given in table 1.[3]

[3] These particular results were obtained with 10 random trials using an axis-parallel algorithm similar to C4.5, and the literature contains similar results.

                          Algorithm returns Safe   Algorithm returns Unsafe
    Actually Safe                  200                        50
    Actually Unsafe                 59                        75

    Table 1: Typical results produced for the diabetes data set

If, say, the cost of misclassifying unsafe cases was four times that of misclassifying safe cases, then the expected cost and accuracy are:

    Expected Cost = (50 * 1 + 59 * 4) / 384 = 0.75
    Accuracy = 275 / 384 = 71.6%

However, a trivial algorithm that always returns the unsafe class would result in:

    Expected Cost = (250 * 1 + 0 * 4) / 384 = 0.65
    Accuracy = 125 / 384 = 32.6%

Thus, as this example illustrates, if a learning algorithm aims to minimise the cost, there is a danger that it will achieve this by sacrificing overall accuracy significantly. Of course, if there was an algorithm that produced high accuracies for both classes, then there would be no need to sacrifice accuracy in one of the classes and, indeed, no need for cost-sensitive algorithms, since accuracy based algorithms would suffice.
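The arithmetic above can be reproduced directly from a confusion matrix. The snippet below is a minimal sketch; the function name and the 4:1 cost ratio are those of the illustration, not part of any of the algorithms discussed.

```python
def cost_and_accuracy(confusion, cost_safe=1, cost_unsafe=4):
    """Expected misclassification cost per test example and overall accuracy
    for a 2x2 confusion matrix laid out as in table 1:
    [[safe->safe, safe->unsafe], [unsafe->safe, unsafe->unsafe]]."""
    (ss, su), (us, uu) = confusion
    n = ss + su + us + uu
    cost = (su * cost_safe + us * cost_unsafe) / n   # errors weighted by class cost
    accuracy = (ss + uu) / n
    return cost, accuracy

print(cost_and_accuracy([[200, 50], [59, 75]]))      # table 1 figures, 4:1 cost ratio
```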
Given this motivation, this paper presents an alternative algorithm that aims to improve the accuracy of the unsafe class whilst aiming to retain overall accuracy, and which does not require the costs of misclassification. The paper is organised in two main parts: section 2 develops the algorithm and section 3 presents an evaluation of the algorithm with respect to three other decision tree algorithms on six data sets.

2 Development of a Safer Induction Algorithm

The central idea of the algorithm is to err on the side of caution and aim to avoid concluding that a process or situation is safe without supporting examples. This aim is not easy to implement with current algorithms, as we now illustrate. When faced with continuous attributes, most decision tree learning algorithms discretise the attributes, resulting in what are known as axis-parallel splits (such as in C4.5 (Quinlan, 1993) and Remind (Althoff, Auriol, Barletta, & Manago, 1995)).[4] Figure 1 illustrates what happens when axis-parallel splits are used in an example where there are two classes denoted by 'x' and 'o'. Line 1 represents the first division, line 2 the second division, and so on. Consequently, the decision tree for this example would have 5 non-leaf nodes (one for each dividing line) and 6 leaf nodes for the conclusions (one for each region). Thus, to achieve our aim of not drawing conclusions about a region being safe, the only freedom we have is to move the lines along each of the axes.

[4] The exceptions, CART and OC1, are considered later.

[Figure 1: An ID3 division (five axis-parallel splits of the X1, X2 plane separating the classes 'x' and 'o')]

[Figure 2: A linear or oblique division (a single oblique line separating the same examples)]

However, a greater degree of freedom would be possible if we could use oblique divisions, which are linear combinations of the attributes (i.e., hyperplanes), to identify a more appropriate division of the space. For figure 1, we could adopt the oblique division shown in figure 2. Such a division would result in a decision tree that has only one non-leaf node, and the conclusions would be based on more examples. Of course, this example is for illustrative purposes and is simple enough to allow a single linear relationship to divide the classes. In general, the problem will be much more complex: a clean separation may not be possible, more attributes are usually involved, and there may be some discrete, as well as continuous, attributes present. The following describes how the greater degree of freedom available with linear relationships is used to obtain safer divisions and incorporated as part of a top-down tree induction algorithm.

Following the intuition suggested by the above example, the aim is to identify a linear combination of the form:

    Σ_{j=1}^{m} a_j x_j ≤ c        (1)

where m is the number of continuous attributes, a_j are the coefficients, x_j are the values of the continuous attributes and c is the threshold. This linear combination should be such that it "best" separates the two classes. By "best" we mean the division that minimizes the probability of incorrectly classifying an example. With this interpretation of "best", we can use a well known statistical technique, called the linear discriminant method, for identifying the initial form of such relationships. As described below, the identified relationship is then modified so that the resulting decision tree errs on the safe side.

The linear discriminant method enables us to find relationships like (1) by using the following equation (Morrison, 1976):

    x^t Σ^{-1} (µ1 − µ2) − (1/2) (µ1 + µ2)^t Σ^{-1} (µ1 − µ2) ≤ ln( P(C2) / P(C1) )        (2)

where x is a vector representing the new example to be classified, µ1, µ2 are the mean vectors for the two classes, Σ is the pooled co-variance matrix, and P(Ci) is the probability of an example being in class Ci. Theoretically, it can be shown that this equation minimizes the misclassification rate when x has a multivariate normal distribution and when the co-variance matrices for each of the two groups are equal (this is sometimes known as the homogeneity condition). Although in practice this equation is sometimes used without testing these optimality conditions, there remains a danger that it could give poor results if the data departs far from these conditions. Hence later, when we incorporate such linear discriminants into a decision tree learning algorithm, we will guard against this possibility.
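As an illustration of equation (2), the sketch below estimates the class means and pooled covariance from two sample matrices and applies the resulting rule. It assumes the two classes are supplied as NumPy row matrices with at least two attributes, and that the priors P(Ci) are estimated from the class frequencies; these choices are assumptions of the sketch rather than requirements stated above.

```python
import numpy as np

def fit_linear_discriminant(X1, X2):
    """Pooled-covariance linear discriminant of equation (2), estimated from
    the class-1 and class-2 training rows X1 and X2."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    pooled = ((n1 - 1) * np.cov(X1, rowvar=False) +
              (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    w = np.linalg.solve(pooled, mu1 - mu2)          # Sigma^-1 (mu1 - mu2)
    c = 0.5 * (mu1 + mu2) @ w + np.log(n2 / n1)     # threshold, including the prior term
    return w, c

def predict(x, w, c):
    # equation (2): assign to class 2 when x.w <= c, otherwise to class 1
    return 2 if x @ w <= c else 1
```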
In the context of safety, we need to ensure that the linear discriminant obtained by the above equation is on the safe side. In particular, there may be situations where the classes overlap and it is not possible to separate them using a hyperplane. Since the linear discriminant method minimizes the probability of an error, it may misclassify some of the examples of both classes. Figure 3 illustrates this problem for a two-attribute and two-class problem.

[Figure 3: An illustration of overlapping classes]

If we adopt such a linear split, each of the regions would be further sub-divided by the ID3 based algorithms and therefore result in a smaller version of the kind of problem we illustrated in figure 1. To reduce this, we can adjust the linear discriminant so that only one of the regions requires further sub-divisions. In making such an adjustment, we have a choice:

1. Adjust the discriminant so that one of the regions has just unsafe examples.
2. Adjust the discriminant so that one of the regions has just safe examples.

If we choose the second alternative, there may be a few safe examples in a region that has numerous unsafe examples. Hence, at some point, the learning algorithm will further sub-divide the unsafe region and isolate part of that region as safe. That is, a region would be inferred as safe based on just a small number of examples, which is contrary to our aims. As an illustration of this effect, figure 4(a) shows the regions identified when we apply this approach on figure 3. Notice that the region on the right of line A and below line B is considered as safe based on just two examples. In contrast, if we select the first alternative, some regions may be inferred as unsafe based on a small number of examples. As an illustration, figure 4(b) shows the effect of applying this approach on figure 3.

[Figure 4: An illustration of adjustments; (a) an unsafe linear adjustment, (b) a safer linear adjustment]

Hence, since we wish to err on the side of safety, we prefer to adjust the discriminant towards the first alternative. To achieve this, the linear discriminant is adjusted by changing a coefficient (i.e., one of the a_i, or c in equation (1)) in a way that captures the furthest misclassified safe example and results in the best separation (as measured by the information gain). That is, equation (1) is used to compute a new value for each coefficient in turn, so that it excludes the furthest misclassified safe example from the unsafe region while keeping the other coefficients constant. The new coefficient that results in the best division is then adopted.
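A minimal sketch of this adjustment step is given below. It assumes a two-class problem with integer labels in which the safe class is 0, a hyperplane held as a coefficient vector a and threshold c with the safe side taken to be a.x ≤ c, and it uses information gain to choose among the candidate coefficients. The helper names and these conventions are assumptions made for illustration, not part of the algorithm's published description.

```python
import numpy as np

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def info_gain(y, left_mask):
    n = len(y)
    split = (left_mask.sum() / n) * entropy(y[left_mask]) + \
            ((~left_mask).sum() / n) * entropy(y[~left_mask])
    return entropy(y) - split

def adjust_to_safe_side(a, c, X, y, safe=0):
    """Move one coefficient (or the threshold) of a.x <= c so that the safe
    side captures the furthest misclassified safe example, keeping the
    variant with the best information gain."""
    scores = X @ a
    mis = (y == safe) & (scores > c)                 # safe examples on the unsafe side
    if not mis.any():
        return a, c
    far = X[np.flatnonzero(mis)[np.argmax(scores[mis])]]   # furthest such example
    candidates = [(a.copy(), float(far @ a))]              # option: shift the threshold c
    for j in range(len(a)):                                # option: recompute one a_j
        if far[j] != 0:
            a_j = a.copy()
            a_j[j] = (c - (far @ a - a[j] * far[j])) / far[j]
            candidates.append((a_j, c))
    return max(candidates, key=lambda ac: info_gain(y, X @ ac[0] <= ac[1]))
```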
Once we have this method for identifying suitable linear divisions and adapting them to err on the safe side, we need to incorporate it within the usual inductive learning algorithm used by the ID3 family. This is done in a manner that remains faithful to the principle behind ID3 based algorithms. That is, from the available divisions: axis-parallel splits, the linear division and the other attributes, we select the one that maximizes the information gained. Adopting this view gives added protection against situations where the data does not satisfy the theoretical conditions that guarantee a minimum probability of misclassification (i.e., the homogeneity condition described above), since the algorithm can default to the usual ID3 process. Figure 5 summarizes the modification.

    IF all examples belong to the same class THEN
        stop
    ELSE BEGIN
        Find a linear discriminant on the safe side.
        Find an axis-parallel split.
        Evaluate all discrete attributes.
        Select the test with the best measure from the above three.
        Partition the examples with respect to the selected test.
        Repeat the process for each partition.
    END

    Figure 5: Top-level of LDT

The process of constructing axis-parallel splits is similar to that found in algorithms such as C4.5, except that the threshold is chosen to be on the safe side. More precisely, instead of selecting the mid-point between two consecutive examples of an interval, selecting the value of an example that lies on the safe boundary reduces the risk of misclassifying an unsafe example as safe.

Decision tree induction algorithms of the kind proposed above are known to result in trees that overfit the data and are unnecessarily complex. One way of reducing this problem, which is widely used, is to adopt post-pruning methods (Quinlan, 1987; Mingers, 1989). The central operation in most post-pruning methods is to replace a subtree by a leaf node which takes the value of the majority class of the examples in the subtree. Since, in practice, the unsafe class usually has fewer cases than the safe class, post-pruning methods that do not take account of safety may result in more accurate trees but without improving their safety. Hence, we now consider how to adapt an existing post-pruning method so that it takes account of safety and can be coupled with the tree induction method proposed in this paper.

The literature contains numerous methods for post-pruning, including: minimum cost-complexity pruning (Breiman et al., 1984), critical value pruning (Mingers, 1989), error-based pruning (Quinlan, 1993) and cost sensitive pruning (Knoll, Nakhaeizadeh, & Tausend, 1994). The reader is referred to the survey paper by Breslow and Aha (1997), a very good comparison by Frank and Witten (1998) and an analysis of the reduced-error pruning method by Elomaa and Kaariaine (2001) for further information on pruning methods. Knoll et al.'s work is particularly relevant since it suggests replacing the accuracy based replacement criterion with one that takes account of the cost of misclassification. Thus, we have a ready made method that can take account of safety: simply allocate a large cost of misclassification to the unsafe class relative to the safe class. Unfortunately, it is not that easy, since assigning a relatively large cost to one class simply has the effect of pruning trees so that the accuracy in the other class is sacrificed to an unacceptable level. This behaviour is confirmed in Knoll et al.'s empirical evaluation, where in a credit allocation problem they adopt a misclassification cost of 1 for the "no risk" class and a misclassification cost of 13 for the "risk" class. In analyzing the results obtained for this problem, they conclude (Knoll et al., 1994, p. 385):

"Concerning the credit data, the cost matrix is strongly asymetric. As a consequence, the pruning methods tried to classify all the test examples as "risk", i.e. granting no credit at all. Obviously such a classifier is useless in practice."

In a related study, Bradford, Kunz, Kohavi, Brunk, and Brodley (1998) carry out an empirical evaluation of the effect of cost-based versions of pruning algorithms, similar to those proposed in (Knoll et al., 1994), on trees produced by systems such as C4.5. Surprisingly, they conclude that such pruning methods appear not to be effective in reducing costs, even relative to results from unpruned trees.

Thus, instead of using Knoll's approach, one of the other techniques, known as the minimum cost-complexity pruning method (Breiman et al., 1984), is adapted for our purposes. This technique works in two stages. First, it generates a series of pruned trees. This is done by considering the effect of replacing each subtree by a leaf and working out the change in error per leaf (α):

    α = (R(t) − R(T_t)) / (N_t − 1)        (3)

where R(t) is the expected error rate if the subtree is pruned, R(T_t) is the expected error rate if the subtree is not pruned, and N_t is the number of leaf nodes in the subtree. The subtree with the least value per leaf (smallest α) is pruned by replacing the subtree by its majority class, and the process is repeated recursively to generate a series of increasingly pruned trees. Second, once a series of increasingly pruned trees has been generated, it uses an independent testing set to select the tree that is the most accurate.[5]

[5] More precisely, it selects the smallest tree within one standard error of the most accurate tree.
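The first stage can be sketched as follows; the subtree records and their field names are assumptions made purely for illustration.

```python
def alpha(R_pruned, R_subtree, n_leaves):
    """Equation (3): change in error per leaf if a subtree is collapsed to a leaf."""
    return (R_pruned - R_subtree) / (n_leaves - 1)

def weakest_subtree(subtrees):
    # subtrees: iterable of dicts with keys 'R_t' (error if pruned),
    # 'R_Tt' (error if kept) and 'N_t' (number of leaves in the subtree)
    return min(subtrees, key=lambda s: alpha(s["R_t"], s["R_Tt"], s["N_t"]))
```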
In its pure form, the minimum cost-complexity pruning (MCCP) method does not take account of costs of misclassification or safety. To take account of safety, the MCCP method is adapted as follows. Given that the primary goal is to minimize the number of misclassified unsafe examples, the weakest subtrees are those that have least effect on the misclassification of unsafe examples, irrespective of the number of leaves in the subtree. This leads to an adaptation of equation (3) so that R(t) is the expected error rate on the unsafe class if the subtree is pruned, R(T_t) is the expected error rate on the unsafe class if the subtree is not pruned, and the divisor N_t − 1 is omitted. In addition, the second part of MCCP is modified to select the safest pruned tree instead of the most accurate. Section 3 will include the results of coupling this simple adaptation of the MCCP method with the developed algorithm.

This concludes the development of the algorithm and an associated pruning method. The next section presents an evaluation with respect to several algorithms.

3 Empirical Evaluation with Respect to Existing Systems

As suggested by the introduction, decision tree learning is a substantial research area, with many algorithms. The choice of which algorithms we should compare with is guided by the following potential counter claims:

• Existing cost-sensitive algorithms can handle this problem by simply allocating a larger nominal cost to the more expensive class.

• Any potential improvement may be due to the use of oblique divisions only, and algorithms that utilise such splits may be just as effective.

Thus, the empirical evaluation needs to be with respect to a cost-sensitive algorithm and an algorithm that induces oblique decision trees. But which ones and why?

There are several systems that aim to take account of the costs (e.g., those proposed by Pazzani et al., 1994; Draper, Brodley, & Utgoff, 1994; Provost & Buchanan, 1995; Turney, 1995).[6] Despite some differences in their approaches, most of these systems extend the attribute selection measure to include cost and aim to optimize a function that measures the cost of a decision tree. Amongst these systems, ICET is possibly the best known, and one which Turney has demonstrated to be superior to several other systems (Turney, 1995, p. 384). In another empirical study, ICET performs better than LMDT (Vadera & Ventura, 2001). Algorithms that utilise oblique splits include CART (Breiman et al., 1984), OC1 (Murthy, Kasif, & Salzberg, 1994) and RLP (Bennett, 1999). Murthy et al. (1994) developed OC1 as an improvement of CART and show OC1 to be empirically superior.

[6] The reader can browse the following for a more complete bibliography: http://members.rogers.com/peter.turney/ml cost.html

Hence, given these considerations, we include an evaluation with respect to ICET, OC1, RLP and an axis parallel (AP) algorithm. The following subsection summarises the main characteristics of the selected algorithms and subsection 3.2 presents the empirical evaluation.

3.1 Summary of ICET, OC1 and RLP

Summary of ICET

The ICET system takes an evolutionary approach to inducing cost effective decision trees (Turney, 1995). It utilizes a genetic algorithm, GENESIS (Grefenstette, 1986), and an extension of C4.5 in which a cost function is used instead of the information gain measure.
The cost function used is borrowed from the EG2 system (Núñez, 1991) and takes the following form for the ith attribute:

    ICF_i = (2^∆_i − 1) / (C_i + 1)^ω

where ∆_i is the information gain, C_i is the cost of carrying out the ith test and ω is a bias parameter used to control the amount of weight that should be given to the costs.

The central idea of ICET is summarized in figure 6.

[Figure 6: The ICET system (GENESIS passes the biases C_i, ω, CF to C4.5, which returns trees whose fitness is fed back to GENESIS)]

The genetic algorithm GENESIS begins with a population of 50 individuals, where each individual consists of the parameters C_i, ω, CF, whose values are randomly selected initially. The parameters C_i, ω are utilized in the above equation, and the parameter CF is used by C4.5 to decide the amount of pruning. Notice that in ICET, the parameters C_i are biases and not true costs as defined in EG2. Given the individuals, C4.5 is run on each one of them to generate a corresponding decision tree. Each decision tree is obtained by applying C4.5 on a randomly selected sub-training set, which is about half the available training data. The fitness of each tree is obtained by using the remaining half of the training data as a sub-testing set. The fitness function used by ICET is the cost of the tests required plus the cost of any misclassifications, averaged over the number of cases in the sub-testing set. The individuals in the current generation are used to generate a new population using the mutation and crossover operators. This cycle is repeated 20 times and the fittest tree is returned as the output.
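For reference, the selection measure above can be written directly as below; in ICET the test_cost and omega arguments correspond to the GA-evolved biases rather than true costs. The function name is illustrative.

```python
def icf(gain, test_cost, omega):
    """EG2/ICET selection measure: ICF_i = (2**gain_i - 1) / (C_i + 1)**omega."""
    return (2 ** gain - 1) / (test_cost + 1) ** omega
```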
Summary of OC1

The OC1 algorithm is motivated by the process adopted by CART. The following first summarises how CART finds the linear relationships and then how OC1 improves upon CART.

The method used by CART works in two stages. First, it uses a heuristic to find a linear combination of the continuous attributes that best divides the space. Then it eliminates those continuous attributes that have little influence (a process called backward elimination in CART). This latter process of eliminating variables is carried out by simply comparing the benefit of retaining a variable over deleting it, and is aimed at reducing the complexity of the equation rather than improving the accuracy of a decision tree. The first step, that of finding the linear combination, is central to CART and is now described in more detail.

CART aims to find a hyperplane by searching for the coefficients and a threshold (for equation 1) that divide the space of examples so that its evaluation function (the GINI index in CART) is minimized. Initially, the linear form takes an arbitrary set of coefficients and an arbitrary threshold. It then cycles through the continuous attributes one by one, updating the coefficients and the threshold to find an improved linear form with respect to the evaluation function and the training examples. This is repeated until the change in measures between the combination obtained in the previous cycle and the new division is less than a pre-specified amount. An individual coefficient a_k is updated by viewing equation 1 as a function of a_k, so that an example is above the current hyperplane if (for positive x_k):

    a_k > (c − Σ_{j≠k} a_j x_j) / x_k        (4)

A value for a_k can then be found by evaluating the right hand side of this equation for each training example, sorting them, and selecting the mid-point between two successive values that results in the best division.

Murthy et al. (1994) argue that CART's approach to identifying a hyperplane is focused on finding hyperplanes that are locally optimal and can therefore result in the identification of "poor" hyperplanes. Hence, although OC1's approach to finding a linear combination is similar to CART's cycling process, it provides the following two key refinements that enable it to escape local minima:

• When a locally optimal linear form is obtained, it is perturbed in a random direction. The amount by which it is perturbed in the random direction is then optimized with respect to the evaluation function. If this new hyperplane is better than the previous one, then the cycling process is repeated with this new hyperplane. The number of such random jumps it performs is a user specified parameter.

• The linear combination to which the iterative process converges depends mainly on the initial starting hyperplane. Hence, OC1 also takes different random starting hyperplanes to see if they converge to better hyperplanes. The number of restarts it makes is also a user specified parameter.

Summary of RLP

A number of authors have proposed the use of linear programming to obtain linear discriminants (see Bennett (1999) for a good overview). The central idea is to formulate a linear programming problem where the constraints are based on discriminating the classes and an objective function that minimises the distances between the misclassified points and the discriminant. Given matrices A and B that consist of m and k examples from their respective classes, and a vector of ones e, Bennett and Mangasarian (1992) develop the following linear programming formulation, called Robust Linear Programming (RLP), to identify a discriminant wx = γ:

    Objective:     min_{w,γ,y,z}  (1/m) e^t y + (1/k) e^t z

    Subject to:    Aw − eγ + y ≥ e
                   −Bw + eγ + z ≥ e
                   y, z ≥ 0

This minimisation problem can be solved using standard linear programming methods and incorporated as part of the top-down procedure outlined in figure 5, resulting in the RLP algorithm.
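Since the RLP formulation is an ordinary linear program, it can be posed to any LP solver (the paper uses loqo). The sketch below is a hedged illustration using SciPy's linprog; the function name is an assumption, and A and B are the example matrices of the two classes as above.

```python
import numpy as np
from scipy.optimize import linprog

def rlp_discriminant(A, B):
    """Solve  min (1/m) e'y + (1/k) e'z  s.t.  Aw - e*g + y >= e,
    -Bw + e*g + z >= e,  y, z >= 0, and return the hyperplane w.x = g."""
    m, n = A.shape
    k = B.shape[0]
    # decision vector: [w (n), g (1), y (m), z (k)]
    c = np.concatenate([np.zeros(n + 1), np.full(m, 1.0 / m), np.full(k, 1.0 / k)])
    # rewrite the two >= constraint blocks as <= for linprog
    top = np.hstack([-A, np.ones((m, 1)), -np.eye(m), np.zeros((m, k))])
    bot = np.hstack([B, -np.ones((k, 1)), np.zeros((k, m)), -np.eye(k)])
    res = linprog(c,
                  A_ub=np.vstack([top, bot]),
                  b_ub=-np.ones(m + k),
                  bounds=[(None, None)] * (n + 1) + [(0, None)] * (m + k))
    w, g = res.x[:n], res.x[n]
    return w, g
```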
3.2 Empirical Evaluation

This section presents the results of an empirical evaluation of the tree induction algorithm developed in this paper relative to the algorithms outlined above. The section also includes the results obtained when axis-parallel splits are utilised, since such splits are used as a comparative benchmark by a number of authors. The OC1 system is available, and RLP was relatively easy to implement with the aid of the linear programming package loqo (Vanderbe, 2002). Implementation of ICET is more involved, and some of the experiments in Turney (1995) were repeated to ensure that the implementation was faithful (see appendix B). Since post-pruning potentially adds its own effects on the results, experiments were carried out both before and after post-pruning was employed.

The experiments utilized six data sets whose basic characteristics are summarized in table 2. The diabetes, breast cancer, housing, heart and hepatitis data sets are from the UCI repository of machine learning databases, where they are fully described (Blake & Merz, 1998). For the housing data set, we followed Murthy et al.'s (1994) discretisation of the class so that values below $21000 are of class 1 and otherwise of class 2. Arbitrarily, values below $21000 were considered safe. The intensive therapy data set was obtained from Manchester Royal Infirmary in the UK.[7] It consists of a database where each record describes a patient's state by measures such as age, temperature, mean arterial pressure, heart rate, respiratory rate, and has a class variable with values "critical" or "not critical".

[7] This data set is available from the author.

    Data Set                 No of examples   %Safe   %Unsafe   No of attributes
    Pima Indians Diabetes          768          65       35             9
    Intensive Therapy              400          77       23            16
    Breast Cancer                  683          65       35            11
    Housing                        506          49       51            14
    Cleveland Heart                297          54       46            14
    Hepatitis                      155          79       21            20

    Table 2: Characteristics of the data sets

Bearing in mind our earlier comments about the optimality conditions for linear discriminant analysis (p. 5), it is worth noting that none of the selected data sets satisfy the homogeneity condition for the covariance matrices. This was tested by Box's M test using SPSS (Norusis, 1994, p. 73). The diabetes data, which was the closest to satisfying the optimality requirements, resulted in Box's M = 229, F = 6.2 with (36, 1049414) DF, and zero significance. That is, these data sets are not somehow biased towards the use of linear discriminant analysis, and therefore the algorithm developed in this paper.

Results before pruning

For each data set, we randomly partitioned the data into a training and a testing set, where the training set was used to learn the decision tree and the testing set was used to evaluate its accuracy. To ensure robustness of the results, the experiments involved three different proportions of training/testing sets: 30/70, 50/50, and 70/30. For each of these training/testing splits, 50 experiments with randomly selected examples were carried out. Figure 7 shows the results obtained for the average accuracy on the unsafe class using OC1, axis-parallel (AP), Robust Linear Programming (RLP) and the algorithm presented in this paper (LDT). It presents the results for each of the data sets and across each of the training/testing splits.[8] Table 3 presents the average results for the 50/50 training/testing experiments (the complete results are given in appendix A, and the conclusions drawn apply to all the results). The table gives the averages for the overall accuracy, the accuracy of the safe class, the accuracy of the unsafe class, the depth of trees, and the time taken on a Sun SPARCStation2.

[8] The use of ROC curves was also considered but is not appropriate given the lack of variation in the results, and the more direct presentation is preferred. Readers interested in ROC can refer to (Provost, Fawcett, & Kohavi, 1998; Martin, Doddington, Kamm, Ordowski, & Przybocki, 1997) for some of the non-trivial issues with their use for such problems.

The OC1 results were obtained by using the system supplied by Murthy et al. (1994) with the default parameters: number of restarts set to 20, and the random directional jumps set to 5. The results for the axis-parallel approach were obtained by using the axis-parallel option of OC1.[9] In all cases, the Information Gain measure, which is probably the most widely used, was employed.

[9] This produces similar results to C4.5 in the empirical trials presented in (Murthy et al., 1994).

These results show that, in general, when pruning is omitted, the developed algorithm produces safer trees than OC1, AP and RLP, without a significant loss in the overall accuracy. The only exception to this is on the 70/30 Hepatitis trial, where the AP algorithm does produce better results than all the other algorithms. The degree of improvement in safety varies from a significant improvement for the intensive therapy data set to only a very small improvement in the breast cancer data set.
The fact that there is only a small improvement in the breast cancer data set is probably due to the "ceiling effect". That is, since the accuracy of the algorithms is already very high and close to the upper bound, the level of improvement obtained by any algorithm will be limited (Cohen, 1995).

[Figure 7: Summary of accuracy for the unsafe class without pruning (one plot per data set showing LDT, OC1, AP and RLP across the 30/70, 50/50 and 70/30 training/testing splits)]

    Data set            System   Overall     Safe        Unsafe      Tree depth   Time (secs)
    Diabetes            LDT      69.4±1.5    73.8±0.9    61.3±1.4    17.1±0.7       3.3±0.09
                        OC1      68.5±1.2    75.5±1.0    55.4±1.6    11.2±0.3     192.3±5.10
                        AP       69.7±1.1    76.4±0.9    57.2±1.4    14.2±0.4       1.8±0.03
                        RLP      67.4±0.7    72.5±1.1    57.9±1.5    46.0±3.8      27.3±0.31
    Breast Cancer       LDT      95.1±1.0    96.4±0.3    92.6±1.1     7.1±0.4       1.1±0.03
                        OC1      94.3±0.5    96.4±0.3    90.4±1.2     6.3±0.3      91.5±7.27
                        AP       94.5±0.8    96.4±0.4    90.8±1.0     7.7±0.4       0.7±0.02
                        RLP      94.8±0.4    96.4±0.3    91.9±1.2    16.5±5.1      13.8±0.52
    Heart               LDT      74.1±1.1    73.9±1.9    74.7±1.6     9.8±0.5       1.3±0.05
                        OC1      74.0±1.0    74.4±1.7    73.9±1.7     6.8±0.3      44.8±1.26
                        AP       72.9±0.9    74.4±1.7    71.3±1.6     7.8±0.3       0.7±0.02
                        RLP      76.1±1.1    77.8±2.2    74.6±1.7    26.6±6.1       7.4±0.25
    Hepatitis           LDT      80.5±1.1    87.0±1.5    56.0±3.6     5.1±0.4       0.7±0.1
                        OC1      80.7±1.5    87.4±1.6    54.6±4.4     4.7±0.4      22.1±1.7
                        AP       80.0±1.5    87.1±1.7    52.7±3.9     5.6±0.4       0.4±0.0
                        RLP      79.4±1.3    86.1±1.4    53.3±4.2     4.8±3.3       3.4±0.1
    Housing             LDT      83.3±1.2    81.3±1.3    85.2±1.1    10.2±0.5       2.3±0.07
                        OC1      82.4±0.8    82.3±1.2    82.6±1.2     7.7±0.3     104.6±4.41
                        AP       81.7±0.8    81.3±1.2    82.1±1.1     9.2±0.4       1.4±0.03
                        RLP      83.1±0.5    81.2±1.2    85.0±1.1    25.3±5.9      18.3±0.40
    Intensive Therapy   LDT      70.4±1.0    78.2±1.3    44.2±2.4    11.2±0.5       3.0±0.09
                        OC1      72.5±0.6    83.0±1.0    37.8±2.8     8.2±0.4      99.1±2.87
                        AP       72.2±0.7    81.9±1.0    39.7±2.3     9.7±0.4       1.4±0.04
                        RLP      70.5±1.0    80.6±1.4    37.2±2.4    25.0±5.9      17.3±0.46

    Table 3: Results for 50/50 experiments before pruning, together with 95% confidence intervals

The behaviour of ICET is also of interest, since although Turney (1995) notes the relative difficulty of the problem of inducing safer trees, ICET is shown to perform better than other cost-sensitive algorithms in such situations (Turney, 1995, figure 5, p. 389). Thus it is worth investigating ICET's behaviour as the cost of misclassification in the unsafe class is increased relative to the safe class. Figure 8 shows the results obtained by ICET prior to pruning when the 50/50 experiments are carried out and as the ratio of misclassification costs is increased.
As the results show, there is little variation, and the trees produced are not as safe as those produced by LDT.[10] It would be reasonable to suggest that, given the evolutionary nature of ICET, removing the pruning component restricts the search space unfairly and is bound to result in the lack of variation displayed in this figure. Hence, discussion of this behaviour is postponed until the results after pruning are presented.

[10] Note that the large ratio of costs is chosen to give it every chance to show variation.

[Figure 8: ICET behaviour before pruning, as the misclassification ratio unsafe/safe is increased (one plot per data set showing overall, safe and unsafe accuracy against the cost ratio)]

Results after pruning

As mentioned earlier, post-pruning methods can result in improved accuracy, so the above experiments were repeated with pruning. The pruning methods employed were the default methods associated with the systems: OC1 and AP utilize the MCCP method, and LDT is coupled with the safer variation of MCCP described in section 2. Figure 9 presents the results for the unsafe class, and table 4 gives the results for the 50/50 experiments (the complete results are given in appendix A).

[Figure 9: Summary of accuracy for the unsafe class after pruning (one plot per data set showing LDT, OC1, AP and RLP across the 30/70, 50/50 and 70/30 training/testing splits)]

The results obtained show that LDT produces significantly safer trees than the other methods on almost all the datasets. The only dataset where it does not improve the safety of the trees relative to the other methods is the breast cancer data, where both LDT and OC1 obtain an accuracy of 94.3% for the unsafe class. When comparing overall accuracy, the only significant difference occurs on the intensive therapy dataset, where the performance of both OC1 and AP is better than LDT.
But since the accuracy on the unsafe class is significantly poorer than LDT's performance, such trees are useless in practice. Indeed, the overall accuracy of 74.8% obtained by OC1 masks the information that in more than a quarter of the trials, OC1 produces simple, one-node trees, labelled with the safe class, that are 100% accurate on the safe class but 0% accurate on the unsafe class.

    Data set            System   Overall     Safe        Unsafe      Tree depth   Time (secs)
    Diabetes            LDT      69.9±0.8    67.6±1.7    74.6±2.1    15.3±0.8       3.1±0.08
                        OC1      71.4±0.8    80.3±1.8    55.0±3.6     5.1±1.0     167.8±4.25
                        AP       73.8±0.6    84.2±1.5    54.3±2.4     5.0±1.1       1.7±0.03
                        RLP      75.1±0.7    81.4±1.3    63.3±2.1     3.2±0.6      23.0±0.30
    Breast Cancer       LDT      95.7±0.3    96.5±0.3    94.3±1.0     5.5±0.6       1.0±0.03
                        OC1      94.6±0.5    94.7±0.7    94.3±1.2     1.9±0.3      82.2±5.04
                        AP       94.1±0.5    94.9±0.7    92.6±1.2     3.1±0.5       0.7±0.02
                        RLP      96.5±0.2    96.9±0.3    95.6±0.6     1.2±0.2      12.0±0.50
    Heart               LDT      75.0±1.1    72.8±2.1    78.0±1.5     9.8±0.6       1.2±0.06
                        OC1      75.8±1.2    77.9±2.5    73.6±2.5     2.4±0.5      38.0±1.04
                        AP       74.4±1.1    77.8±2.1    70.7±2.5     3.6±0.7       0.6±0.01
                        RLP      79.3±0.9    82.1±1.5    76.2±1.9     1.8±0.4       6.4±0.18
    Hepatitis           LDT      79.4±1.4    85.6±1.8    56.2±4.2     5.0±0.4       0.6±0.04
                        OC1      79.3±1.5    85.4±2.1    56.2±4.5     1.6±0.3      18.5±1.48
                        AP       80.2±1.3    89.4±2.1    45.6±6.6     1.8±0.3       0.4±0.02
                        RLP      78.5±1.4    84.9±1.4    54.7±4.4     1.0±0.0       2.9±0.10
    Housing             LDT      82.7±0.6    76.2±1.8    88.8±1.2     8.5±0.5       2.1±0.07
                        OC1      81.0±1.0    82.0±2.9    80.3±2.0     3.3±0.6      84.7±2.24
                        AP       81.5±0.6    84.1±1.9    79.3±2.0     3.7±0.8       1.3±0.03
                        RLP      85.2±0.6    86.0±1.4    84.5±1.1     2.4±0.6      15.8±0.42
    Intensive Therapy   LDT      70.0±1.0    76.8±1.3    47.8±2.9     9.9±0.6       2.6±0.08
                        OC1      74.8±0.9    89.8±2.4    25.1±5.3     2.5±0.6      82.0±3.51
                        AP       75.9±0.9    92.8±1.8    19.2±4.7     3.8±0.9       1.3±0.04
                        RLP      73.8±1.0    88.5±2.7    25.7±5.9     2.1±0.4      14.9±0.4

    Table 4: Results for 50/50 experiments after pruning, together with 95% confidence intervals

A comparison of these results with those obtained without pruning (given in table 3) reveals the effects of pruning. As expected, pruning does improve accuracy in most cases. More interestingly, the use of pruning with OC1 and AP results in improving the accuracy of the unsafe class for just the breast cancer dataset. For the other 3 data sets, the use of pruning with OC1 and AP actually results in a decrease in the accuracy of the unsafe class. The adoption of the safer MCCP method does improve the accuracy of the unsafe class, but this is at the expense of a reduction in the accuracy of the safe class. Although this is reasonable in the context of safety, these results show that pruning methods do not necessarily result in balanced improvements to the accuracy of all the classes.

Earlier, discussion of ICET's results prior to pruning (figure 8) was postponed in case the lack of variation in accuracy was due to the omission of pruning. As the results in figure 10 show, even after pruning is employed, the accuracy does not vary much on either class as the ratio of misclassification costs is increased. This behaviour is not too surprising when we consider how ICET works. At the bottom level (figure 6), it uses C4.5 to generate trees based on the post-pruning parameter CF, and a selection function ICF that has biases Ci and ω. The pruning algorithm adopted is the error based pruning method that does not directly take account of costs but may be influenced indirectly by the parameter CF.
Thus the extent to which the trees generated by C4.5 are influenced by costs depends on how well the genetic algorithm learns biases that relate to the costs. The results show that, in its current form, the genetic algorithm is unable to learn the biases in a way that is sufficient for generating trees that are biased towards a particular class. Prior to these experiments, this author (as well as Turney (1995, p. 390)) thought that ICET's inability to improve the accuracy of the unsafe class was due to the smaller number of examples available relative to the safe class. So it is worth noting that, as the results of the housing data (where each class has approximately 50% of the data) show, ICET's lack of sensitivity to unbalanced misclassification costs is independent of the proportion of examples available for training from each class.

[Figure 10: ICET's behaviour after pruning, as the misclassification ratio unsafe/safe is increased (one plot per data set showing overall, safe and unsafe accuracy against the cost ratio)]

4 Conclusion and Future Work

This paper has presented a decision tree induction algorithm aimed at domains where safety is paramount, costs of misclassification are not easy to estimate accurately, and where it is known that the cost of misclassification for one of the classes is significantly higher than another. The algorithm errs on the side of caution by aiming to avoid concluding that a process is safe without supporting examples. To improve the chances of achieving this goal, the algorithm utilises oblique divisions in preference to the less flexible axis-parallel splits. The algorithm, called LDT, utilises the linear discriminant method to identify the initial division, which is then adapted to favour the development of a tree that errs on the safe side.

An alternative way of tackling the problem could be to utilise an existing cost-sensitive algorithm and use a high cost for the more costly class. Given the use of oblique splits, it could also be argued that existing oblique decision tree algorithms may be adequate. Hence, the algorithm is evaluated relative to OC1, the axis parallel approach (AP), Robust Linear Programming (RLP) and an implementation of ICET. Since pruning can have its own effects on accuracy, experiments were carried out both before and after pruning. The default pruning methods were adopted for OC1, AP, and ICET, while a simple variation of the MCCP method that took account of safety was used with LDT. Both before and after pruning, LDT produces trees that are significantly safer than OC1, AP, RLP and ICET. The only exception occurs on the breast cancer data set after pruning, where OC1 performs as well as LDT, and where, with accuracies of about 95%, there is not much scope for improvement.

Since ICET is an approach that takes account of costs of misclassification, a reasonable proposal for inducing safer trees is to increase the cost of misclassification in the unsafe class relative to the safe class.
Hence experiments, in which the ratio of costs of misclassifications was varied were carried out on an implementation of ICET. The results revealed that ICET is not particularly sensitive to imbalanced costs of misclassification and 20 that the trees do not become any safer as the ratio is increased. Further, the experiments show that this lack of sensitivity is not simply due to an imbalance in the proportion of examples available in the classes as previously thought. In general, the improvements in safety obtained by LDT are without sacrificing the overall accuracy. For the diabetes, breast cancer, and housing datasets, the overall accuracy obtained by LDT is similar to that obtained by OC1, both before and after pruning. For the intensive therapy data set, the overall accuracy obtained by OC1 and AP is significantly higher than that obtained by LDT. However, this is achieved by taking advantage of the low ratio of unsafe cases in the data set to the extent that over a quarter of the trees degenerate into always returning the safe class. Such degenerate trees are, of course, not much use in any application. Comparing the results obtained before and after pruning reveals that post-pruning meth- ods can have an unbalanced effect on the accuracy of the classes. The use of the default pruning strategies with OC1, AP, and ICET resulted in no significant improvement to the safety of the trees, while the use of the adapted pruning method improved the safety of the trees, but with a reduction in the accuracy of the safe class. There are several areas for further work. In a related study, non-linear divisions have been used to develop a cost-sensitive algorithm (Vadera, 2005) but which relies on the availability of costs. Adapting non-linear divisions for safety applications is likely to be more difficult but represents a natural extension of this work. Several authors have also experimented with the use of boosting for cost-sensitive problems (Domingos, 1999; Fan, Stolfo, Zhang, & Chan, 1999; Ting, 2000; Merler, Furlanello, Larcher, & Sboner, 2003). Attempts to use boosting coupled with the algorithm presented in this paper also represents a further potential enhancement. To conclude, this paper has developed a tree induction algorithm that results in safer trees whilst maintaining overall accuracy and which is more appropriate for safety applica- tions. 5 Acknowledgements The work presented in this paper is a significant development of work that was initiated with the help of a Research Assistant, Said Nechab in 1994-95. In particular, the motivation and examples in section 2 appear in a previous joint paper written and presented by the author at Knowldege Based Systems Conference in 1996. The author is grateful to David Ventura for implementing ICET as part of his MSc project, and for the results included in appendix B. The author is also grateful to Sreerama Murthy for making his OC1 system available and the anonymous referees for their comments. References Althoff, K., Auriol, E., Barletta, R., & Manago, M. (1995). A review of industrial case-based reasoning tools. Oxford: AI Intelligence. 21 Bennett, K. P. (1999). On mathematical programming methods and support vector ma- chines. In Schoelkopf, A., Burges, C., & Smola, A. (Eds.), Advances in Kernel Methods – Support Vector Machines, chap. 19, pp. 307–326. MIT Press, Cambridge, MA. Berry, M., & Linoff, G. (2004). Data Mining Techniques (2nd edition). John Wiley. Blake, C., & Merz, C. (1998). UCI Repository of Machine Learning Databases. 
[http: //www.ics.uci.edu/ mlearn/ MLRepository.html], Irvine, CA: University of California, Department of Information and Computer Science, USA. Bradford, J., Kunz, C., Kohavi, R., Brunk, C., & Brodley, C. (1998). Pruning decision trees with misclassification costs. In Proceedings of the Tenth European Conference on Machine Learning, pp. 131–136. Breiman, L., Friedman, J. H., Olsen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Belmont: Wadsworth. Breslow, L., & Aha, D. (1997). Simplifying decision trees: A survey. Knowledge Engineering Review, 12, 1–40. Cohen, R. (1995). Empirical Methods for Artificial Intelligence. Massachusetts : MIT Press. Domingos, P. (1999). MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pp. 155–164. Draper, B., Brodley, C. E., & Utgoff, P. E. (1994). Goal-directed classification using lin- ear machine decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16 (9), 888–893. Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, pp. 973–978. Elomaa, T., & Kaariaine, M. (2001). An analysis of reduced error pruning. Journal of Artificial Intelligence Research, 15, 163–187. Fan, W., Stolfo, S., Zhang, J., & Chan, P. (1999). AdaCost: Misclassification cost-sensitive boosting. In Proceedings of the 16th International Conference on Machine Learning, pp. 97–105. Morgan Kaufmann, San Francisco, CA. Frank, E., & Witten, I. (1998). Reduced-error pruning with significance tests, [http://citeseer.ist.psu.edu/frank98reducederror.html].. Grefenstette, J. (1986). Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, 16, 122–128. Knoll, U., Nakhaeizadeh, G., & Tausend, B. (1994). Cost-sensitive pruning of decision trees. In Proceedings of the Eight European Conference on Machine Learning, Vol. 2, pp. 383–386, Berlin, Germany. Springer-Verlag. Kolodner, J. (1993). Case-based Reasoning. Palo Also, CA:Morgan Kaufman. Martin, A., Doddington, G., Kamm, T., Ordowski, M., & Przybocki, M. (1997). The DET curve in assessment of detection task performance. In Proceedings of Eurospeech ’97, pp. 1895–1898, Rhodes, Greece. 22 Merler, S., Furlanello, C., Larcher, B., & Sboner, A. (2003). Automatic model selection in cost-sensitive boosting. Information Fusion, 4, 3–10. Mingers, J. (1989). An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4, 227–243. Morrison, D. (1976). Multivariate Statistical Methods (Second edition). New York : McGraw-Hill. Murthy, S., Kasif, S., & Salzberg, S. (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2, 1–32. Norusis, M. (1994). SPSS Advanced Statistics 6.1. SPSS Inc., Chicago, Illinois 60611, USA. Núñez, M. (1991). The use of background knowledge in decision tree induction. Machine Learning, 6, 231–250. Pazzani, M., Merz, C., Murphy, P., Ali, K., Hurne, T., & Brunk, C. (1994). Reducing misclassification costs: Knowledge-intensive approaches to learning from noisy data. In Proceedings of the Eleventh International Conference on Machine Learning, pp. 217–225. Provost, F. J., & Buchanan, B. G. (1995). Inductive policy: The pragmatics of bias selection. Machine Learning, 20, 35–61. Provost, F. J., Fawcett, T., & Kohavi, R. (1998). 
The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 445–553. Quinlan, J. R. (1987). Simplifying decision trees. International Journal of Man-Macine Studies, 27, 221–234. Quinlan, J. R. (1993). C4.5 : Programs for Machine Learning. California:Morgan Kauffman. Tan, M. (1993). Cost-sensitive learning of classification knowledge and its applications in robotics. Machine Learning, 13, 7–33. Ting, K. (2000). A comparative study of cost-sensitive boosting algorithms. In Proceedings of the 17th International Conference on Machine Learning, pp. 983–990. Turney, P. (1995). Cost sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research, 2, 369– 409. Vadera, S. (2005). Inducing Cost-Sensitive Nonlinear Decision Trees, Technical Report. School of Computing, Science and Engineering, University of Salford, Salford, M5 4WT, UK, [http://citeseer.ist.psu.edu]. Vadera, S., & Ventura, D. (2001). A comparison of cost-sensitive decision tree learning algo- rithms. In Proceedings of the Second European Conference on Intelligent Management Systems in Operations, pp. 79–86. Vanderbe, R. (2002). Loqo: Optimisation and Applications Web Site. Princeton University, NJ 08544, [http://orfe.princeton.edu/ loqo/]. Zadrozny, B., Langford, J., & Abe, N. (2003). A simple method for cost-sensitive learning. IBM Technical Report RC22666, [http://citeseer.ist.psu.edu/zadrozny03simple.html]. 23 A Results of Empirical Evaluations of LDT, AP, RLP and OC1 This appendix presents the complete set of results for LDT, AP,RLP and OC1 both before and after pruning. The experimental method used is described in the body of the paper. 
Diabetes (unpruned)

Train/Test   System   Accuracy    Safe        Unsafe      Depth       Time (s)
30/70        LDT      68.1±1.6    73.1±1.2    58.9±2.0    13.3±0.5    1.7±0.05
             OC1      67.7±0.8    75.2±1.2    54.0±1.6    9.4±0.4     92.0±2.46
             AP       68.5±1.0    76.0±1.1    54.8±1.9    12.2±0.5    1.0±0.02
             RLP      66.9±0.7    71.8±1.1    58.0±1.7    38.4±5.4    13.60±0.20
50/50        LDT      69.4±1.5    73.8±0.9    61.3±1.4    17.1±0.7    3.3±0.09
             OC1      68.5±1.2    75.5±1.0    55.4±1.6    11.2±0.3    192.3±5.10
             AP       69.7±1.1    76.4±0.9    57.2±1.4    14.2±0.4    1.8±0.03
             RLP      67.4±0.7    72.5±1.1    57.9±1.5    46.0±3.8    27.3±0.31
70/30        LDT      68.7±2.2    73.8±1.0    59.5±1.8    19.1±0.6    5.0±0.08
             OC1      68.4±1.3    75.8±1.1    55.1±1.7    12.7±0.4    313.0±9.60
             AP       69.6±1.4    77.5±1.0    55.3±1.6    15.4±0.5    2.7±0.05
             RLP      68.0±0.8    74.0±1.1    57.1±1.5    46.9±3.3    42.8±0.30

Breast Cancer (unpruned)

Train/Test   System   Accuracy    Safe        Unsafe      Depth       Time (s)
30/70        LDT      94.9±0.8    96.4±0.3    92.0±0.9    5.5±0.3     0.6±0.03
             OC1      93.6±0.5    95.8±0.5    89.5±1.4    4.4±0.3     43.0±2.09
             AP       93.9±0.7    95.6±0.4    90.8±0.8    6.2±0.5     0.3±0.01
             RLP      93.9±0.4    96.2±0.37   89.64±1.41  4.6±2.7     6.3±0.40
50/50        LDT      95.1±1.0    96.4±0.3    92.6±1.1    7.1±0.4     1.1±0.03
             OC1      94.3±0.5    96.4±0.3    90.4±1.2    6.3±0.3     91.5±7.27
             AP       94.5±0.8    96.4±0.4    90.8±1.0    7.7±0.4     0.7±0.02
             RLP      94.8±0.4    96.4±0.3    91.9±1.2    16.5±5.1    13.8±0.52
70/30        LDT      95.1±1.0    96.7±0.4    92.2±1.2    8.3±0.4     1.8±0.05
             OC1      94.5±0.7    96.3±0.4    91.0±1.0    6.8±0.3     145.2±6.77
             AP       94.6±0.8    96.2±0.5    91.7±1.0    8.7±0.4     1.2±0.04
             RLP      93.5±0.5    95.8±0.4    89.2±1.5    14.6±4.5    23.4±0.60

Heart (unpruned)

Train/Test   System   Accuracy    Safe        Unsafe      Depth       Time (s)
30/70        LDT      73.8±1.0    74.0±1.8    73.9±2.3    7.5±0.5     0.6±0.03
             OC1      73.1±0.9    73.8±2.0    72.5±1.9    5.6±0.4     19.9±0.76
             AP       72.9±1.0    75.9±1.6    69.6±2.5    6.7±0.4     0.4±0.01
             RLP      72.8±1.0    74.9±1.6    70.7±2.2    9.0±4.0     3.8±0.15
50/50        LDT      74.1±1.1    73.9±1.9    74.7±1.6    9.8±0.5     1.3±0.05
             OC1      74.0±1.0    74.4±1.7    73.9±1.7    6.8±0.3     44.8±1.26
             AP       72.9±0.9    74.4±1.7    71.3±1.6    7.8±0.3     0.7±0.02
             RLP      76.1±1.1    77.8±2.2    74.6±1.7    26.6±6.1    7.4±0.25
70/30        LDT      73.6±1.2    72.0±1.9    75.2±2.0    11.2±0.5    2.1±0.07
             OC1      74.8±1.3    75.6±2.1    74.2±1.9    7.6±0.3     75.3±2.18
             AP       73.4±1.3    75.1±2.0    71.6±1.7    9.2±0.5     1.1±0.02
             RLP      74.1±1.2    74.4±2.3    74.1±2.3    33.3±5.8    12.2±0.31

Hepatitis (unpruned)

Train/Test   System   Accuracy    Safe        Unsafe      Depth       Time (s)
30/70        LDT      79.6±1.3    86.0±1.8    55.9±3.7    3.0±0.4     0.3±0.0
             OC1      80.0±1.2    87.3±1.8    53.3±4.4    3.2±0.3     8.6±0.6
             AP       79.7±1.1    87.4±1.5    50.5±3.9    4.1±0.3     0.2±0.0
             RLP      77.1±1.6    83.4±1.8    54.3±4.2    1.1±0.1     1.6±0.0
50/50        LDT      80.5±1.1    87.0±1.5    56.0±3.6    5.1±0.4     0.7±0.1
             OC1      80.7±1.5    87.4±1.6    54.6±4.4    4.7±0.4     22.1±1.7
             AP       80.0±1.5    87.1±1.7    52.7±3.9    5.6±0.4     0.4±0.0
             RLP      79.4±1.3    86.1±1.4    53.3±4.2    4.8±3.3     3.4±0.1
70/30        LDT      80.9±1.7    86.9±1.9    58.4±4.5    7.1±0.5     1.4±0.1
             OC1      82.9±1.5    89.5±1.7    57.4±4.9    5.5±0.3     38.3±2.0
             AP       81.9±1.4    88.0±1.8    59.6±4.2    6.4±0.4     0.6±0.0
             RLP      80.1±1.4    87.4±1.6    52.5±4.5    13.2±5.0    5.9±0.4

Housing (unpruned)

Train/Test   System   Accuracy    Safe        Unsafe      Depth       Time (s)
30/70        LDT      82.8±1.0    81.3±1.5    84.3±1.0    7.9±0.5     1.1±0.04
             OC1      80.9±0.5    80.8±1.4    81.1±1.1    6.6±0.3     45.5±1.31
             AP       80.9±0.7    80.2±1.3    81.6±1.1    8.1±0.4     0.7±0.02
             RLP      82.3±0.6    82.3±1.3    82.4±1.1    18.1±5.8    9.2±0.18
50/50        LDT      83.3±1.2    81.3±1.3    85.2±1.1    10.2±0.5    2.3±0.07
             OC1      82.4±0.8    82.3±1.2    82.6±1.2    7.7±0.3     104.6±4.41
             AP       81.7±0.8    81.3±1.2    82.1±1.1    9.2±0.4     1.4±0.03
             RLP      83.1±0.5    81.2±1.2    85.0±1.1    25.3±5.9    18.3±0.40
70/30        LDT      83.4±1.3    81.1±1.4    85.6±1.0    11.4±0.5    3.6±0.09
             OC1      82.3±0.8    82.4±1.3    82.3±1.1    9.0±0.3     164.5±4.10
             AP       82.6±0.8    82.8±1.5    82.6±1.2    11.0±0.4    2.3±0.04
             RLP      83.2±0.7    80.5±1.6    85.7±1.1    34.2±5.8    30.1±0.56

Intensive Therapy (unpruned)

Train/Test   System   Accuracy    Safe        Unsafe      Depth       Time (s)
30/70        LDT      70.0±0.9    77.7±1.3    44.3±2.4    8.9±0.5     1.5±0.07
             OC1      72.1±0.5    82.1±1.3    38.7±2.2    6.6±0.3     42.2±2.50
             AP       71.5±0.7    81.5±1.4    38.1±2.0    7.9±0.4     0.7±0.03
             RLP      69.5±0.8    78.6±1.4    39.1±2.6    20.2±5.9    8.3±0.27
50/50        LDT      70.4±1.0    78.2±1.3    44.2±2.4    11.2±0.5    3.0±0.09
             OC1      72.5±0.6    83.0±1.0    37.8±2.8    8.2±0.4     99.1±2.87
             AP       72.2±0.7    81.9±1.0    39.7±2.3    9.7±0.4     1.4±0.04
             RLP      70.5±1.0    80.6±1.4    37.2±2.4    25.0±5.9    17.3±0.46
70/30        LDT      71.4±0.9    79.1±1.2    45.3±3.2    12.4±0.5    4.4±0.11
             OC1      71.3±0.8    81.0±1.1    38.5±3.1    9.3±0.4     169.5±4.35
             AP       72.0±0.8    81.5±1.2    39.6±3.3    11.0±0.5    2.2±0.05
             RLP      70.1±1.2    79.0±1.5    39.7±3.4    40.2±5.1    29.4±0.49

Diabetes (pruned)

Train/Test   System   Accuracy    Safe        Unsafe      Depth       Time (s)
30/70        LDT      68.8±0.8    67.9±2.1    70.8±2.8    11.8±0.8    1.6±0.04
             OC1      70.6±0.7    79.5±2.7    54.2±5.1    3.4±0.7     79.0±2.16
             AP       71.7±0.7    82.5±2.4    52.0±3.5    4.9±1.1     0.9±0.02
             RLP      74.0±0.54   80.82±1.76  61.52±3.12  4.78±1.73   10.2±0.47
50/50        LDT      69.9±0.8    67.6±1.7    74.6±2.1    15.3±0.8    3.1±0.08
             OC1      71.4±0.8    80.3±1.8    55.0±3.6    5.1±1.0     167.8±4.25
             AP       73.8±0.6    84.2±1.5    54.3±2.4    5.0±1.1     1.7±0.03
             RLP      75.1±0.7    81.4±1.3    63.3±2.1    3.2±0.6     23.0±0.30
70/30        LDT      69.9±0.8    67.6±1.7    74.6±2.1    15.3±0.8    3.1±0.08
             OC1      71.7±0.9    79.9±2.1    56.6±3.3    5.2±0.9     276.5±6.98
             AP       73.6±0.8    85.1±1.6    52.8±2.5    5.9±1.3     2.6±0.05
             RLP      74.5±0.9    81.9±1.5    60.9±2.8    3.1±0.6     37.5±0.37

Breast Cancer (pruned)

Train/Test   System   Accuracy    Safe        Unsafe      Depth       Time (s)
30/70        LDT      95.8±0.5    96.0±0.5    95.3±1.2    5.5±0.9     1.6±0.05
             OC1      94.0±0.4    93.8±0.8    94.3±0.8    1.3±0.2     36.1±1.95
             AP       92.7±0.5    93.7±0.9    90.8±1.8    2.2±0.4     0.3±0.02
             RLP      95.5±0.4    96.3±0.4    93.8±1.0    5.1±0.3     5.1±0.30
50/50        LDT      95.7±0.3    96.5±0.3    94.3±1.0    5.5±0.6     1.0±0.03
             OC1      94.6±0.5    94.7±0.7    94.3±1.2    1.9±0.3     82.2±5.04
             AP       94.1±0.5    94.9±0.7    92.6±1.2    3.1±0.5     0.7±0.02
             RLP      96.5±0.2    96.9±0.3    95.6±0.6    1.2±0.2     12.0±0.50
70/30        LDT      95.8±0.5    96.0±0.5    95.3±1.2    5.5±0.9     1.6±0.05
             OC1      95.4±0.4    95.6±0.6    95.2±1.0    2.2±0.5     123.4±5.88
             AP       94.4±0.5    95.4±0.7    92.8±1.2    3.9±0.6     1.1±0.04
             RLP      96.4±0.4    96.6±0.5    96.6±0.5    2.5±2.0     9.8±0.60

Heart (pruned)

Train/Test   System   Accuracy    Safe        Unsafe      Depth       Time (s)
30/70        LDT      74.0±1.0    72.7±2.5    75.8±2.1    7.2±0.7     0.6±0.05
             OC1      74.3±1.3    75.9±3.2    72.6±3.0    1.4±0.2     18.0±0.66
             AP       73.9±0.9    78.1±2.0    69.1±2.6    2.3±0.5     0.3±0.01
             RLP      78.3±1.0    80.1±1.6    76.1±1.9    1.1±0.1     3.4±0.13
50/50        LDT      75.0±1.1    72.8±2.1    78.0±1.5    9.8±0.6     1.2±0.06
             OC1      75.8±1.2    77.9±2.5    73.6±2.5    2.4±0.5     38.0±1.04
             AP       74.4±1.1    77.8±2.1    70.7±2.5    3.6±0.7     0.6±0.01
             RLP      79.3±0.9    82.1±1.5    76.2±1.9    1.8±0.4     6.4±0.18
70/30        LDT      75.4±1.4    70.1±2.2    81.7±1.8    11.0±0.6    1.9±0.07
             OC1      75.4±1.3    78.0±2.6    72.6±2.8    2.8±0.5     62.7±1.59
             AP       75.7±1.4    79.1±2.2    71.8±2.4    3.7±0.7     1.0±0.02
             RLP      79.9±1.2    82.0±1.9    77.7±1.8    2.2±0.6     10.4±0.29

Hepatitis (pruned)

Train/Test   System   Accuracy    Safe        Unsafe      Depth       Time (s)
30/70        LDT      78.8±1.4    85.4±2.0    54.5±4.0    2.3±0.4     0.2±0.01
             OC1      78.5±1.3    85.9±2.2    50.8±5.4    1.2±0.2     7.2±0.53
             AP       78.2±1.3    87.5±2.5    44.3±6.5    1.3±0.2     0.2±0.01
             RLP      76.6±1.4    83.3±1.7    52.4±4.3    1.0±0.0     1.7±0.03
50/50        LDT      79.4±1.4    85.6±1.8    56.2±4.2    5.0±0.4     0.6±0.04
             OC1      79.3±1.5    85.4±2.1    56.2±4.5    1.6±0.3     18.5±1.48
             AP       80.2±1.3    89.4±2.1    45.6±6.6    1.8±0.3     0.4±0.02
             RLP      78.5±1.4    84.9±1.4    54.7±4.4    1.0±0.0     2.9±0.10
70/30        LDT      81.7±1.3    87.1±1.8    61.9±4.2    6.5±0.5     1.2±0.08
             OC1      82.4±1.5    88.8±1.9    58.2±5.1    1.6±0.3     31.4±2.13
             AP       82.3±1.7    91.9±1.7    47.1±6.7    2.7±0.5     0.6±0.03
             RLP      80.3±1.5    86.0±1.8    59.7±4.6    2.3±2.0     5.3±0.30

Housing (pruned)

Train/Test   System   Accuracy    Safe        Unsafe      Depth       Time (s)
30/70        LDT      81.9±0.6    76.9±1.5    86.6±1.1    7.2±0.7     1.1±0.04
             OC1      80.9±0.6    82.6±1.8    79.2±1.9    2.8±0.6     38.2±1.55
             AP       81.0±0.6    83.4±2.5    78.8±2.3    2.9±0.7     0.6±0.02
             RLP      84.1±0.7    84.4±5.8    83.8±4.5    1.5±1.0     8.3±0.14
50/50        LDT      82.7±0.6    76.2±1.8    88.8±1.2    8.5±0.5     2.1±0.07
             OC1      81.0±1.0    82.0±2.9    80.3±2.0    3.3±0.6     84.7±2.24
             AP       81.5±0.6    84.1±1.9    79.3±2.0    3.7±0.8     1.3±0.03
             RLP      85.2±0.6    86.0±1.4    84.5±1.1    2.4±0.6     15.8±0.42
70/30        LDT      82.1±1.9    71.9±4.5    91.5±1.4    8.9±0.7     3.2±0.11
             OC1      82.3±1.1    82.4±2.3    82.4±1.6    4.3±0.7     146.6±5.64
             AP       81.6±0.8    84.4±1.7    79.3±1.7    5.4±1.1     2.1±0.05
             RLP      85.1±0.8    85.3±1.5    85.0±1.3    3.2±0.7     26.2±0.60

Intensive Therapy (pruned)

Train/Test   System   Accuracy    Safe        Unsafe      Depth       Time (s)
30/70        LDT      70.2±0.9    77.7±1.6    45.0±2.9    8.6±0.5     1.4±0.08
             OC1      74.7±0.8    88.8±2.4    27.6±5.5    2.2±0.4     36.5±2.08
             AP       75.3±0.7    93.9±2.0    13.3±4.2    1.7±0.4     0.7±0.03
             RLP      72.4±1.1    84.7±3.2    31.9±6.4    1.7±0.3     7.1±0.26
50/50        LDT      70.0±1.0    76.8±1.3    47.8±2.9    9.9±0.6     2.6±0.08
             OC1      74.8±0.9    89.8±2.4    25.1±5.3    2.5±0.6     82.0±3.51
             AP       75.9±0.9    92.8±1.8    19.2±4.7    3.8±0.9     1.3±0.04
             RLP      73.8±1.0    88.5±2.7    25.7±5.9    2.1±0.4     14.9±0.4
70/30        LDT      69.4±1.2    75.3±1.7    49.6±3.2    11.3±0.5    3.8±0.09
             OC1      75.1±1.1    91.0±2.3    21.4±5.2    3.4±0.8     142.5±4.30
             AP       76.4±1.0    93.4±1.9    18.7±5.2    3.7±0.9     2.0±0.05
             RLP      74.1±1.4    91.3±2.7    15.8±5.0    2.2±0.4     25.2±0.53

B Repeated Experiments using a Re-implementation of ICET

Figure 11 shows the results obtained by the implementation of ICET utilized in this paper alongside the published results of Turney (1995, Figure 3). The results give confidence that the implementation is a fair and accurate representation of ICET. The measure 'Average % Standard Cost' labelling the Y axis is a normalized average cost of classification; this measure and the details of the experimental method are as presented in Turney (1995).

[Figure 11 compares, for the Diabetes, Bupa Liver Disease and Thyroid Disease data sets, the ICET results published in Turney (1995) with the results from the implementation used in this paper. In each plot the X axis is the misclassification cost (10 to 10000) and the Y axis is the Average % Standard Cost (0 to 100).]

Figure 11: Published results and those obtained by the implementation of ICET used in this paper.
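To make the Y axis of Figure 11 concrete, the sketch below outlines how an average cost of classification can be expressed as a percentage of a standard cost. It is an illustration only: the precise definitions of the per-case costs (test costs plus any misclassification penalty) and of the standard cost are those given in Turney (1995) and are not reproduced here, so the `standard_cost` value and helper name used below are purely hypothetical.

```python
# Illustrative sketch only: expressing an average classification cost as a
# percentage of a "standard" cost. The exact normalization behind Figure 11
# is the one defined in Turney (1995); the figures below are hypothetical.

def average_percent_standard_cost(case_costs, standard_cost):
    """Average cost per test case, as a percentage of the standard cost."""
    avg_cost = sum(case_costs) / len(case_costs)
    return 100.0 * avg_cost / standard_cost

# Hypothetical example: total cost incurred on each of five test cases
# (tests performed plus any misclassification penalty), normalized by an
# assumed standard cost of 150 units.
costs = [35.0, 35.0, 135.0, 40.0, 35.0]
print(f"Av. % Standard Cost: {average_percent_standard_cost(costs, 150.0):.1f}")
```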