Expert Systems with Applications 35 (2008) 1177–1185
doi:10.1016/j.eswa.2007.08.049

Incremental clustering of mixed data based on distance hierarchy

Chung-Chian Hsu a, Yan-Ping Huang a,b,*

a Department of Information Management, National Yunlin University of Science and Technology, Taiwan
b Department of Information Management, Chin Min Institute of Technology, Taiwan

* Corresponding author. Address: Department of Information Management, National Yunlin University of Science and Technology, Taiwan. Tel.: +886 37627153; fax: +886 37605684. E-mail addresses: hsucc@yuntech.edu.tw (C.-C. Hsu), sunny@ms.chinmin.edu.tw (Y.-P. Huang).

Abstract

Clustering is an important function in data mining; typical applications include the analysis of consumer data. The adaptive resonance theory (ART) network is a popular unsupervised neural network: ART1 handles binary data, whereas ART2 handles general numeric data. Many information systems, however, collect mixed-type data containing both numeric and categorical attributes, which neither ART1 nor ART2 can process directly. If categorical attributes are transformed into a binary format, the resulting binary data do not reflect the degree of similarity between categorical values, which degrades clustering quality. This paper therefore proposes a modified adaptive resonance theory network (M-ART) that uses concept hierarchy trees to express the degree of similarity among mixed data. Experiments on synthetic data and on a real dataset concerning family income show that the M-ART algorithm can process mixed data and yields good clustering results.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Adaptive resonance theory network; Conceptual hierarchy; Clustering algorithm; Unsupervised neural network; Data mining

1. Introduction

Clustering is the unsupervised classification of patterns into groups. It is an important data analysis technique that organizes a collection of patterns into clusters based on similarity (Hsu, 2006; Hsu & Wang, 2005; Jain & Dubes, 1988). Clustering is useful in many exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. Clustering methods have been successfully applied in many fields, including pattern recognition (Anderberg, 1973), biology, psychiatry, psychology, archaeology, geology, geography, marketing, image processing (Jain & Dubes, 1988) and information retrieval (Rasmussen, 1992; Salton & Buckley, 1991). Intuitively, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster.

Data clustering has been considered a primary data mining method for knowledge discovery, and many clustering algorithms have been proposed in the literature. In general, major clustering methods can be classified into the hierarchical or the partitional category. A hierarchical method creates a hierarchical decomposition of the given set of data patterns, whereas a partitional approach produces k partitions of the patterns, where each partition represents a cluster.
Further classification within each category is possible (Jain & Dubes, 1988). In addition, Jain, Murty, and Flynn (1999) discussed some cross-cutting issues that affect all of the different approaches regardless of their placement in these categories. Being non-incremental or incremental is one of those issues (Hsu, 2006; Hsu & Wang, 2005). Non-incremental clustering methods process all the data patterns at once; they usually require the entire dataset to be loaded into memory and therefore have high memory requirements. Incremental clustering, in contrast, considers input patterns one at a time and assigns them to the existing clusters (Jain & Dubes, 1988); a new input pattern is assigned to a cluster without significantly affecting the existing clusters. The major advantage of incremental clustering algorithms is their limited space requirement, since the entire pattern matrix need not be stored in memory. These algorithms are therefore well suited to dynamic environments and to very large datasets, and they have already been applied along these lines (Can, 1993; Ester, Kriegel, Sander, Wimmer, & Xu, 1998; Somlo & Adele, 2001).

Most clustering algorithms consider either categorical data or numeric data, yet many real datasets contain both categorical and numeric values. A common practice for clustering a mixed dataset is to transform categorical values into numeric values and then apply a numeric clustering algorithm. Another approach is to compare categorical values directly, where two distinct values result in distance 1 and identical values in distance 0. Neither method takes into account the similarity information embedded between categorical values, so the clustering results do not faithfully reveal the similarity structure of the dataset (Hsu, 2006; Hsu & Wang, 2005).

Based on distance hierarchy (Hsu, 2006; Hsu & Wang, 2005), this article proposes a new incremental clustering algorithm for mixed datasets in which the similarity information embedded between categorical values is considered during clustering. In our setting, each attribute of the data is associated with a distance hierarchy, which is an extension of the concept hierarchy (Somlo & Adele, 2001) with link weights representing the distance between concepts. The distance between two mixed data patterns is then calculated according to the distance hierarchies. It is worth mentioning that the distance-hierarchy representation generalizes several conventional distance computation schemes, including simple matching and binary encoding for categorical values and the subtraction method for continuous and ordinal values.

The rest of this article is organized as follows. Section 2 reviews clustering algorithms and discusses the shortcomings of the conventional approaches to clustering mixed data. Section 3 presents the distance hierarchy for categorical data and proposes the incremental clustering algorithm based on distance hierarchies.
Experimental results on synthetic and real datasets are presented in Section 4, and conclusions are given in Section 5.

2. Literature review

Adaptive resonance theory (ART) neural networks model real-time prediction, search, learning, and recognition, and function as models of human cognitive information processing (Carpenter, 1997; Carpenter & Grossberg, 1993; Grossberg, 1980, 1999, 2003). A central feature of all ART systems is a pattern-matching process that compares an external input with the internal memory of an active code. ART1 deals with binary data and ART2 with general numeric data (Grossberg, 1999); neither handles mixed data attributes.

For clustering mixed data there are two broad approaches. The first resorts to preprocessing that converts the data to a single type, either all numeric or all categorical. For example, continuous data can be discretized into categorical data, and a metric based on simple matching is then employed, in which two distinct values result in distance 1 and identical values in distance 0 (Guha, Rastogi, & Shim, 1999). The second uses a metric function that can handle mixed data directly (Wilson & Martinez, 1997), with an overlap metric for nominal attributes and a normalized Euclidean distance for continuous attributes.

Both simple matching and binary encoding have problems. Simple matching is a common approach to handling categorical data, in which comparing two identical categorical values results in distance 0 while two distinct values result in distance 1 (Ester et al., 1998; Wilson & Martinez, 1997). In this case, the distance between the patterns Gary and John in the example of Fig. 1 becomes d(Gary, John) = 1, which is the same as d(John, Tom) = d(Gary, Tom) = 1. The simple matching approach thus disregards the similarity information embedded in categorical values.

Another typical approach to handling categorical attributes is binary encoding, which transforms each categorical attribute into a set of binary attributes so that each categorical value is encoded as a set of binary values. The new relation then contains only numeric data, and clustering is conducted on this new dataset. For example, let the domain of the categorical attribute Favorite_Drink be {Coke, Pepsi, Coffee}. Favorite_Drink is transformed into three binary attributes, Coke, Pepsi and Coffee, in the new relation, and the value Coke of Favorite_Drink in a pattern is encoded as the three binary values {Coke = 1, Pepsi = 0, Coffee = 0}. According to the new relation, the Manhattan distance between the patterns Gary and John is dM(Gary, John) = 2, which is the same as dM(Gary, Tom) and dM(John, Tom). Fig. 1 shows how a traditional clustering algorithm transforms the categorical attribute Favorite_Drink into a binary numeric attribute type.

Fig. 1. Traditional clustering algorithm transfers the Favorite_Drink categorical attribute into a binary numeric attribute type.

  Original relation              Transformed relation
  ID    Favorite_Drink  Amt.     ID    Coke  Pepsi  Coffee  Amt.
  Gary  Coke            70       Gary  1     0      0       70
  John  Pepsi           70       John  0     1      0       70
  Tom   Coffee          70       Tom   0     0      1       70

After transformation, each original categorical attribute handled by the binary encoding approach contributes twice as much to the distance as it does under the simple matching approach, as the example of d(Gary, John) above shows. Consequently, when a clustering algorithm adopts binary encoding, categorical attributes have a larger influence on the clustering than they do under simple matching.
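To make the two conventional treatments concrete, the following short sketch recomputes the distances of the Fig. 1 example under simple matching and under binary (one-hot) encoding with the Manhattan distance. It is an illustration only, not code from the paper; the record set and the helper names (simple_matching, one_hot, manhattan) are ours.

```python
# Illustration of simple matching vs. binary encoding for the Fig. 1 records.
# The records and helper names are illustrative, not taken from the paper.

RECORDS = {"Gary": "Coke", "John": "Pepsi", "Tom": "Coffee"}
DOMAIN = ["Coke", "Pepsi", "Coffee"]

def simple_matching(a, b):
    """Distance 0 for identical categorical values, 1 otherwise."""
    return 0 if a == b else 1

def one_hot(value):
    """Binary encoding: one new binary attribute per domain value."""
    return [1 if value == v else 0 for v in DOMAIN]

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

if __name__ == "__main__":
    # Simple matching: every pair of distinct values is equally far apart.
    print(simple_matching(RECORDS["Gary"], RECORDS["John"]))              # 1
    # Binary encoding doubles that contribution: dM(Gary, John) = 2.
    print(manhattan(one_hot(RECORDS["Gary"]), one_hot(RECORDS["John"])))  # 2
    # Neither scheme can express that Coke is more similar to Pepsi than to Coffee.
```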
The ART network is a popular incremental clustering algorithm (Jain & Dubes, 1988) with several variants (Carpenter & Grossberg, 1987; Carpenter, Grossberg, & Rosen, 1991), but ART1 handles only binary data and ART2 handles only continuous data. K-prototype (Huang, 1998) is a more recent clustering algorithm for mixed data; it transforms categorical attributes into a binary format, yet the binary data do not reflect the degree of similarity between categorical values, which affects clustering quality. This paper therefore proposes a modified adaptive resonance theory network algorithm that uses concept hierarchy trees to express the degree of similarity among mixed data.

3. Clustering hybrid data based on distance hierarchy

This paper proposes the distance hierarchy tree structure to express the degree of similarity between categorical values. The distance hierarchy is combined with the adaptive resonance theory network algorithm so that mixed data can be clustered effectively. This section first presents the distance hierarchy for categorical data and then proposes the incremental clustering algorithm based on distance hierarchies.

3.1. Distance hierarchy

The distance hierarchy tree is a concept hierarchy structure and provides a better mechanism for representing and computing the distance between categorical values. A concept hierarchy consists of two parts, a node set and a link set (Dash, Choi, Scheuermann, & Liu, 2002; Hsu, 2006; Hsu & Wang, 2005; Maulik & Bandyopadhyay, 2002). As discussed above, the binary encoding approach does not reflect the degree of similarity, which affects clustering quality. Maintenance is also difficult when the domain of a categorical attribute changes, because the transformed relation schema must change as well, and the transformed binary attributes cannot preserve the semantics of the original attribute. Because of these drawbacks of binary encoding, this paper uses the distance hierarchy to express the degree of similarity among mixed data. A concept hierarchy extended with distance weights is shown in Fig. 2.

Fig. 2. (a) A distance hierarchy with weight 1, (b) a two-level distance hierarchy for the simple matching approach, and (c) a degenerated distance hierarchy with w = (max − min) for a numeric attribute.

This paper extends the concept hierarchy structure with link weights, each link having a weight that represents a distance. Link weights are assigned by domain experts, and several assignment alternatives exist. The simplest is to assign a uniform constant weight to all links; another is to assign heavier weights to links closer to the root and lighter weights to links farther from the root. For simplicity, unless stated explicitly, each link weight is set to 1 in this article. The distance between two concepts at the leaf nodes is the total weight along the path between those two nodes.

A point X in a distance hierarchy consists of two parts, an anchor and a positive real-valued offset, denoted X(N, d), that is, anchor(X) = N and offset(X) = d. The anchor is a leaf node and the offset represents the distance from the root of the hierarchy to the point.
A point X is an ancestor of Y if X is on the path from Y to the root of the hierarchy. If neither of two points is an ancestor of the other, their least common ancestor, denoted LCA(X, Y), is the deepest node that is an ancestor of both X and Y. Let X(N_X, d_X) and Y(N_Y, d_Y) be two points; the distance between X and Y is defined as

$$|X - Y| = d_X + d_Y - 2\, d_{LCP(X,Y)} \qquad (1)$$

where LCP(X, Y) is the least common point of X and Y in the distance hierarchy and d_{LCP(X,Y)} is the distance between the least common point and the root. The distance between two equivalent points is 0, i.e. |X − Y| = 0.

For example, in the hierarchy of Fig. 2a let W = (Coke, 1.4), X = (Coke, 0.3), Y = (Pepsi, 0.3) and Z = (Pepsi, 1.4). Both X and Y are ancestors of W and Z. X is equivalent to Y, so their distance is 0, and X and Y are the least common points of Y and W as well as of Y and Z. The distance between the nodes Any and Drink is 1. The distance between Y and Z is |1.4 + 0.3 − 2 × 0.3| = 1.1. The least common point LCP(W, Z) of W and Z is the node Drink, which is also the least common ancestor of these two points; therefore d_{LCP(W,Z)} = 1 and the distance between W and Z is |1.4 + 1.4 − 2 × 1| = 0.8.

A special distance hierarchy, called a numeric distance hierarchy, is used for a numeric attribute x_i. It is a degenerate hierarchy consisting of only two nodes, a root MIN and a leaf MAX (Fig. 2c), with link weight w equal to the domain range of x_i, i.e. w = (max_i − min_i). A point p in such a hierarchy has the value (MAX, d_p), where the anchor is always MAX and the offset d_p is the distance from the point to the root MIN.

The distance between two data points is measured as follows. Let x = [x_1 x_2 ... x_n] and y = [y_1 y_2 ... y_n]. The distance between a training pattern x and an M-ART neuron y is the square root of the sum of the squared differences between the paired components of x and y. Specifically, if x and y represent an n-dimensional training pattern and a map neuron, respectively, and C is the associated set of n distance hierarchies, then the distance between x and y is

$$d(x, y) = \|x - y\| = \left( \sum_{i=1}^{n} w_i (x_i - y_i)^2 \right)^{1/2} = \left( \sum_{i=1}^{n} w_i \bigl( h(x_i) - h(y_i) \bigr)^2 \right)^{1/2} \qquad (2)$$

where h(x_i) and h(y_i) are the mappings of x_i and y_i to their associated distance hierarchy h_i, and the attribute weight w_i is a user-specified parameter that allows the domain expert to weight attributes differently. For a numeric attribute i, h(x_i) − h(y_i) is equal to x_i − y_i, since h(x_i) − h(y_i) = (MAX, x_i − min_i) − (MAX, y_i − min_i) = x_i − y_i.

The attribute weight w_i can be used to remedy the mixed-depth effect of distance hierarchies, especially when all the attribute domains are normalized to a small range such as [0, 1]. For example, the difference between any two distinct values of an attribute with a two-level distance hierarchy, as in Fig. 2b, is always 1, while the difference between two distinct values of a three-level distance hierarchy can be 0.5 (minimum) or 1 (maximum). Therefore, in a multidimensional dataset whose attributes are associated with distance hierarchies of various depths, attributes with shallow hierarchies tend to dominate the distance computation. Decreasing the weight of attributes with shallow hierarchies or, conversely, increasing the weight of attributes with deep hierarchies can alleviate this mixed-depth effect.
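The following sketch illustrates how a distance hierarchy with a uniform link weight of 1 supports Eq. (1), and how Eq. (2) aggregates the per-attribute distances; it reproduces the W, Z example above. The class and function names are our own simplified reading of the scheme, not the authors' implementation; the least common point is obtained here by capping the depth of the anchors' least common ancestor by the two offsets.

```python
# Sketch of a distance hierarchy (Eq. (1)) and the mixed-attribute distance (Eq. (2)).
# Names and structure are illustrative; every link weight defaults to 1 as in the text.
from math import sqrt

class DistanceHierarchy:
    def __init__(self, parent_of, weight=1.0):
        self.parent_of = parent_of   # maps each node to its parent; the root maps to itself
        self.weight = weight         # uniform link weight

    def path_to_root(self, node):
        path = [node]
        while self.parent_of[node] != node:
            node = self.parent_of[node]
            path.append(node)
        return path

    def depth(self, node):
        return self.weight * (len(self.path_to_root(node)) - 1)

    def point_distance(self, x, y):
        # x = (anchor, offset); the offset is the distance from the root to the point.
        (ax, dx), (ay, dy) = x, y
        ancestors_x = set(self.path_to_root(ax))
        lca = next(n for n in self.path_to_root(ay) if n in ancestors_x)
        d_lcp = min(self.depth(lca), dx, dy)     # depth of the least common *point*
        return dx + dy - 2 * d_lcp               # Eq. (1)

def mixed_distance(x, y, hierarchies, weights):
    # Eq. (2): each x[i], y[i] is an (anchor, offset) point in the i-th hierarchy;
    # for a numeric attribute the degenerate MIN-MAX hierarchy reduces this to |x_i - y_i|.
    return sqrt(sum(w * h.point_distance(a, b) ** 2
                    for a, b, h, w in zip(x, y, hierarchies, weights)))

# The drink hierarchy of Fig. 2a and the examples from the text:
drink = DistanceHierarchy({"Any": "Any", "Drink": "Any", "Appliance": "Any",
                           "Coke": "Drink", "Pepsi": "Drink",
                           "TV": "Appliance", "Refrig.": "Appliance"})
print(drink.point_distance(("Coke", 1.4), ("Pepsi", 1.4)))   # approx. 0.8
print(drink.point_distance(("Pepsi", 0.3), ("Pepsi", 1.4)))  # approx. 1.1
```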
3.2. Incremental clustering of hybrid data

The unsupervised clustering algorithm has two layers, an input layer and an output layer, together with one distance hierarchy tree per attribute. As shown in Fig. 3, the layers are as follows.

Fig. 3. Modified adaptive neural network architecture.

The input layer receives the data vectors; each input neuron corresponds to one component of the input vector, which can be a numeric attribute or a categorical attribute. The output layer presents the clustering result: each output neuron represents one cluster, and its associated prototype vector is the representative of that cluster.

Specifically, suppose D = {1, 2, ..., n} is the training dataset. The input neurons ⟨x1, x2, ..., xn⟩ receive the input data vector, where a neuron xk represents one input component, i.e. an attribute or a variable. The set of distance trees is {dt1, dt2, ..., dtn}, where each dti is associated with attribute xi of the training data and represents the conceptual distances of the specialization–generalization relation. A neuron yi of the output layer is a vector yi = [y1i, y2i, ..., yni]. Each component yki = (N, d) consists of two parts: N is a symbol and d is a real number. When yki corresponds to a categorical attribute, N is a categorical attribute value; when yki corresponds to a numeric attribute, N is the numeric symbol (the anchor MAX of the numeric hierarchy). As in traditional self-organizing maps and adaptive resonance networks, the prototype vector plays two roles: it is the representative vector in the output layer, and it is mapped to the training data through the distance hierarchies, since each dti relates the attribute xi in the training data to the neuron component yki in the output layer.

The algorithm has the following steps:

Input: N records of the training dataset, the set of distance hierarchies, the warning (vigilance) value and the stopping parameter value.
Output: Z neurons (the cluster prototypes).

Step 1. Read the first record and map it into the first neuron's prototype vector.
Step 2. Read the next record until no records remain. Find the most similar output neuron. If the similarity exceeds the warning value, assign the record to that group and adjust the prototype vector; otherwise create a new neuron and set the record as the new neuron's prototype vector.
Step 3. If the stopping condition is met, stop; otherwise repeat Step 2 over the records.
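Read as pseudocode, Steps 1–3 amount to the following incremental loop. This is only a schematic sketch: the similarity score, the prototype-update rule and the round-based stop test are simplified assumptions supplied by the caller rather than the authors' exact formulas, and distance_fn stands for the mixed distance of Eq. (2).

```python
# Schematic sketch of the M-ART training loop (Steps 1-3). The similarity score,
# update rule and round-based stop test are simplifying assumptions; the paper's
# stop condition is based on the momentum of the output layer instead.

def m_art(records, distance_fn, update_fn, vigilance, rounds=10):
    # Step 1: the first record initialises the first output neuron (prototype).
    prototypes = [records[0]]

    for _ in range(rounds):                                  # Step 3: repeat until stopping
        for x in records:                                    # Step 2: read the next record
            # Find the most similar output neuron under the distance hierarchies.
            dists = [distance_fn(x, p) for p in prototypes]
            j = min(range(len(prototypes)), key=dists.__getitem__)
            similarity = 1.0 / (1.0 + dists[j])              # assumed similarity measure
            if similarity >= vigilance:                      # exceeds the warning value
                prototypes[j] = update_fn(prototypes[j], x)  # join the group, adjust vector
            else:
                prototypes.append(x)                         # otherwise create a new neuron
    return prototypes
```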
3.3. Evaluating clustering results

Regarding time complexity, the main factors in the training algorithm are the number of training records N, the data dimension P, the number of output neurons Z and the number of training rounds C. In each round, every training vector is compared on all attributes with every output neuron, so the time complexity of the whole process is O(C · N · Z · P).

Another evaluation method is a significance test on external variables. This technique compares the clusters on variables that did not participate in generating them. One way of doing this is to compute the expected entropy of the clusters with respect to such a variable, say a class attribute C that does not take part in the clustering. The expected entropy of an attribute C over a set of clusters is computed by first calculating the entropy of C in each cluster and then summing the entropies weighted by cluster size:

$$\bar{E}(C) = -\sum_{k} \frac{|C_k|}{|D|} \left( \sum_{j} P(C = V_j) \log P(C = V_j) \right) \qquad (3)$$

where V_j denotes one of the possible values that the attribute C can take, |C_k| is the size of cluster k, and |D| is the size of the dataset.

The category utility function (Gluck & Corter, 1985) attempts to maximize both the probability that two objects in the same cluster have attribute values in common and the probability that objects from different clusters have different attribute values. The category utility of a set of clusters can be calculated as

$$CU = \sum_{k} \frac{|C_k|}{|D|} \left( \sum_{i} \sum_{j} \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right] \right) \qquad (4)$$

where P(A_i = V_{ij} | C_k) is the conditional probability that attribute i has value V_{ij} given cluster C_k, and P(A_i = V_{ij}) is the overall probability of attribute i having value V_{ij} in the entire dataset. The function measures whether the clustering improves the likelihood of similar values falling into the same cluster; the higher the CU value, the better the clustering result (Barbara, Couto, & Li, 2002).

For numeric attributes, the standard deviation represents the dispersion of values, so the variance (σ²) can be used to evaluate the quality of clustering numeric data. Here, the respective variances of every numeric attribute in all clusters are summed to evaluate the clustering quality, as in Eq. (5), where V^k_{i,avg} and V^k_{i,j} are the average and the j-th record value of attribute i in cluster k, respectively. The attribute values are normalized before clustering. The lower the variance, the better the clustering result:

$$\sigma^2 = \sum_{k} \frac{1}{|C_k|} \sum_{i} \sum_{j} \left( V^k_{i,j} - V^k_{i,avg} \right)^2 \qquad (5)$$

Several cluster validity indices, such as the Davies–Bouldin (DB) index and the Calinski–Harabasz (CH) index (Halkidi, Batistakis, & Vazirgiannis, 2001; Maulik & Bandyopadhyay, 2002), have been published; however, they are suitable only for numeric data. Hence, in order to evaluate the effectiveness of clustering mixed data, this paper uses the CV index (Hsu & Chen, 2007), which combines the category utility (CU) function with the variance. CV is defined as in Eq. (6), where CU and Variance are the validity indices for the categorical and numeric data, respectively. The higher the CV value, the better the clustering result:

$$CV = \frac{CU}{1 + \text{Variance}} \qquad (6)$$
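For reference, Eqs. (3)–(6) can be computed directly from a clustering as in the sketch below. The data layout (each cluster as a list of record dicts) and the function names are assumptions made for illustration; the entropy is taken with base-2 logarithms, and the numeric values are assumed to be normalized beforehand, as stated above.

```python
# Sketch of the evaluation measures in Eqs. (3)-(6). Each cluster is assumed to be
# a list of records (dicts); attribute-name lists identify categorical/numeric fields.
from collections import Counter
from math import log2

def expected_entropy(clusters, class_attr):                      # Eq. (3)
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = Counter(r[class_attr] for r in c)
        h = -sum((k / len(c)) * log2(k / len(c)) for k in counts.values())
        total += (len(c) / n) * h                                # weight by cluster size
    return total

def category_utility(clusters, categorical_attrs):               # Eq. (4)
    data = [r for c in clusters for r in c]
    n = len(data)
    overall = {a: Counter(r[a] for r in data) for a in categorical_attrs}
    cu = 0.0
    for c in clusters:
        inner = 0.0
        for a in categorical_attrs:
            within = Counter(r[a] for r in c)
            inner += sum((cnt / len(c)) ** 2 for cnt in within.values())
            inner -= sum((cnt / n) ** 2 for cnt in overall[a].values())
        cu += (len(c) / n) * inner
    return cu

def total_variance(clusters, numeric_attrs):                     # Eq. (5)
    var = 0.0
    for c in clusters:
        for a in numeric_attrs:
            values = [r[a] for r in c]                           # assumed already normalized
            mean = sum(values) / len(values)
            var += sum((v - mean) ** 2 for v in values) / len(c)
    return var

def cv_index(clusters, categorical_attrs, numeric_attrs):        # Eq. (6)
    return category_utility(clusters, categorical_attrs) / \
           (1 + total_variance(clusters, numeric_attrs))
```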
4. Experiments and discussion

A prototype system was developed with Borland C++ Builder 6, and a series of experiments was performed to verify the method. A mixed synthetic dataset and a UCI dataset were used to show the capability of M-ART to reasonably express and faithfully preserve the distances between categorical data. The experimental results on both the artificial and the real data are reported below.

4.1. Synthetic data sets

The synthetic dataset consists of three categorical attributes, Sex, Department and Product, two numeric attributes, Age and Amt, and one class-level attribute, College, with 600 records in total. Table 1 shows the distributions of the synthetic dataset, and Fig. 4 shows the concept hierarchies for its categorical attributes.

Table 1
Synthetic dataset (Sex: F 50%, M 50%; Age: N(20, 2); Amt: 70–90)

Department  Product  College      Count
EE, ME      Coke     Engineering  100
            Pepsi    Engineering  100
ID, VC      Bread    Design       100
            Rice     Design       100
MBA, MIS    Apple    Management   100
            Orange   Management   100

Fig. 4. The concept hierarchies for the synthetic datasets.

The M-ART parameters are established as follows: the initial warning value is 0.5 and increases progressively by 0.05 up to 0.7, and the initial learning rate is 0.4–0.6. The stop condition occurs when the momentum of the output layer falls below 0.000015. Because the input sequence influences the performance of the M-ART and ART2 algorithms, M-ART was run with five different input orders; the experimental results are shown in Table 2. In the K-prototype method the clustering is influenced by the initial central points, so five experiments were run, each with a randomly chosen central point. The ART2 parameters are a = 0.1, b = 0.1, c = 0.1, d = 0.9, theta = 0.1–0.2, rho = 0.755–0.76 and bottom-up weight = 2; these results are also shown in Table 2.

Table 2
Entropy values for the Student dataset with clusters by ART2, K-prototype and M-ART

Order  ART2  K-prototype  M-ART
1      0.47  0.44         0
2      0.42  0.38         0
3      0.43  0.38         0
4      0.42  0.37         0
5      0.42  0.39         0
Mean   0.43  0.39         0

The entropy value can be used to evaluate the quality of clustering: the smaller the value, the better the clustering. In these five experiments the entropy values of M-ART are zero, which means that the class value within each group is identical; that is, the dataset is divided into exactly the three groups Engineering, Design and Management. The experimental results show that, for the same input order, the clustering quality of the M-ART method is the highest, the K-prototype method is second, and traditional ART2 is the lowest; ART2 and K-prototype are unable to divide the dataset into the three groups.

4.2. UCI Adult data

These experiments use the real Adult dataset from the UCI repository (Murphy, 1992), which contains 48,842 records with 15 attributes: eight categorical attributes, six numeric attributes, and one class attribute. The experiment uses seven of these attributes: three categorical attributes, Relationship, Marital_status and Education, and four numeric attributes, Capital_gain, Capital_loss, Age and Hours_per_week. The concept hierarchies are constructed as in Fig. 5.

Fig. 5. Concept hierarchies for (a) Marital-status, (b) Education and (c) Relationship attributes of the Adult dataset.

The M-ART parameters are established as follows: the initial warning value is 0.55 and increases progressively by 0.05 up to 0.75, the initial learning rate is 0.9, and the stop condition occurs when the momentum of the output layer falls below 0.000015.

4.3. Experiments and discussion

The Adult dataset was clustered with the different methods into 5, 6, 7 and 8 groups. Concerning the CU values for the categorical attributes, the higher the CU value, the better the clustering result. The CU value of the M-ART method is the highest, the K-prototype method is second, and traditional ART2 is the lowest. The symbol '***' indicates that no suitable parameters could be found to divide the dataset into the corresponding number of groups: ART2 was never able to produce seven groups, because it has too many parameters.
Table 3 shows the CU values of the clustering results obtained by M-ART, ART2 and K-prototypes for each categorical attribute, evaluated at level 1 and at the leaf level of the individual concept hierarchies, with cluster numbers 5, 6, 7 and 8. The M-ART parameters are established as follows: the initial warning value is 0.53–0.58 and the initial learning rate is 0.7.

Table 3
The CU values for the Adult dataset with 5, 6, 7 and 8 clusters by M-ART, ART2 and K-prototypes

             Clusters  Leaf_Level  Level 1  Increased (%)
M-ART        5         1.069       1.16       8.5
             6         1.113       1.20       7.8
             7         1.115       1.21       8.5
             8         1.177       1.31      11.3
K-prototype  5         0.859       0.834     −2.9
             6         1.002       0.977     −2.5
             7         1.039       0.919    −11.6
             8         1.088       1.087     −0.3
ART2         5         0.001       0.00081  −19
             6         0.0043      0.0057     3.3
             7         ***         ***        ***
             8         0.0073      0.00757    3.7

The variance of the numeric results is normalized to between 0 and 1, which makes it usable in the CV index. Table 4 shows the CV values of the clustering results by M-ART, ART2 and K-prototypes at level 1 and at the leaf level of the individual concept hierarchies with cluster numbers 5, 6, 7 and 8. The higher the CV value, the better the clustering result. The CV value of the M-ART method is the highest, the K-prototype method is second, and traditional ART2 is the lowest.

Table 4
The CV values for the Adult dataset with 5, 6, 7 and 8 clusters by M-ART, ART2 and K-prototypes

             Clusters  CU                     Variance                                              CV                    Increased (%)
                       Leaf_Level  Level 1    Age    Capital_gain  Capital_loss  Hours_per_week     Leaf_Level  Level 1
M-ART        5         1.069       1.16       0.127  0.022         0.038         0.071              0.85        0.59      31.08
             6         1.113       1.2        0.163  0.023         0.044         0.085              0.85        0.60      29.05
             7         1.115       1.21       0.182  0.011         0.044         0.1                0.83        0.61      27.05
             8         1.177       1.31       0.218  0.014         0.052         0.115              0.84        0.65      23.00
K-prototype  5         0.859       0.834      0.114  0.033         0.046         0.073              0.68        0.46      32.85
             6         1.002       0.977      0.424  0.72          1.454         2.937              0.15        0.16       1.72
             7         1.039       0.919      0.515  0.893         1.801         3.635              0.13        0.12       7.01
             8         1.088       1.087      0.61   1.068         2.154         4.342              0.12        0.13       5.56
ART2         5         0.001       0.00081    0.176  0.027         0.042         0.079              0.001       0.001      6.64
             6         0.0043      0.0057     0.211  0.033         0.051         0.095              0.003       0.005     55.87
             8         0.0073      0.00757    0.281  0.044         0.068         0.128              0.005       0.006     26.71

After M-ART clustering, each cluster has a prototype vector that represents the characteristics of the group; Table 5 shows the result. For a categorical attribute, the prototype component is a pair consisting of the anchor and the offset. The anchor is the mode of this field in the group, and the offset is the distance of the prototype point from the root; roughly speaking, it indicates what proportion of the group takes the anchor value. The higher the offset, the larger the proportion of the anchor (mode) value. For example, the Education anchor of the fourth group is HS-grad and its offset is 0.84 (offsets are normalized between 0 and 1); the proportion of HS-grad in that group is 73%. The anchors of groups 2, 3, 6 and 8 are also HS-grad, but their offsets approach zero, which places the prototype points near the root. Table 6 shows the corresponding distributions.
As another example, in the fifth group the offsets of Marital-status and Relationship are quite large: the populations of Married-civ-spouse and Husband in this group are 99% and 89%, respectively. The results show that each prototype vector reflects the characteristics of its group.

Table 5
The categorical attribute prototypes for the Adult dataset with clusters 1–8 by M-ART

C_no  >50k (%)  Education           Marital-status               Relationship          Age  Hours  Gain    Loss
5     100       (Prof-school, 0.34) (Married-civ-spouse, 0.99)   (Husband, 0.89)       48   51     99,999  0
2      46       (HS-grad, 0.03)     (Married-civ-spouse, 0.99)   (Wife, 0.98)          40   37     1000    131
3      44       (HS-grad, 0)        (Married-civ-spouse, 1)      (Husband, 1)          44   44     1000    131
7      13       (Some-college, 0.5) (Never-married, 0.61)        (Not-in-family, 1)    38   41     1000    87
4       7       (HS-grad, 0.84)     (Never-married, 0.5)         (Not-in-family, 1)    40   40     0       87
6       6       (HS-grad, 0)        (Divorced, 0.5)              (Unmarried, 1)        40   39     0       44
8       3       (HS-grad, 0.05)     (Never-married, 0.5)         (Other-relative, 1)   33   37     0       44
1       1       (Some-college, 0)   (Never-married, 0.38)        (Own-child, 1)        25   33     0       44

Comparing these prototype vectors with the Salary attribute of each group reveals some characteristics of that attribute. Table 5 lists the groups in descending order of the proportion of salaries over 50k: in the fifth group every value of the Salary attribute is over 50k, whereas in the first group the proportion is only 1%. From these prototype vectors the characteristics of the high-income groups can be understood. The proportion of salaries over 50k is large in groups 2, 3 and 5; their characteristics are stable marital and family relationships, generally higher age, longer working hours, and relatively large Capital_gain and Capital_loss values. Conversely, the proportion is small in groups 1, 4, 6 and 8, whose members are mostly never married or divorced, younger, work fewer hours, and have relatively small Capital_gain and Capital_loss values. The differences between groups 1 and 5 are especially obvious.

Comparing K-prototype with M-ART, K-prototype can also present a prototype vector for each group, but for a categorical attribute it represents only the mode and cannot express the weight of that mode within the group, whereas the M-ART algorithm uses the offset value to convey this rough proportion.

Table 6
The distribution (%) of the Education attribute in each cluster

Education     C5  C2  C3  C7  C4  C6  C8  C1
Preschool      0   0   0   0   0   0   0   0
1st–4th        0   0   1   0   1   1   2   0
5th–6th        1   1   1   1   1   1   4   0
7th–8th        0   1   3   1   2   2   3   1
9th            0   1   2   1   2   2   3   1
10th           1   2   2   0   6   4   4   4
11th           0   2   2   0   6   4   6   9
12th           0   1   1   0   2   1   3   3
HS-grad       12  31  33   0  73  38  40  30
Some-college   6  20  19  37   0  22  20  34
Assoc-voc      2   5   5   7   0   5   3   3
Assoc-acdm     0   5   3   7   0   4   2   3
Bachelors     24  20  18  34   0  10   9  10
Masters       13   8   7   2   5   4   0   1
Prof-school   33   2   3   8   1   1   1   0
Doctorate      8   1   2   1   1   1   0   0

In the prototype vector of the fourth cluster (C4), the value of the Education field is (HS-grad, 0.84): the anchor is HS-grad and the offset is 0.84. This means the prototype point lies at a distance of 0.84 from the root, i.e. only 0.16 from the HS-grad leaf node, so HS-grad accounts for a heavy proportion of this group; in Table 6 that proportion is 73%. Moreover, the offset of the Relationship field is about 1 in every group, which means that the mode proportions of Relationship in each group are between 96% and 100%.
5. Conclusions

Most traditional clustering algorithms can handle only categorical or only numeric values. Although some methods for handling mixed data have been published, they still cannot reasonably express the similarities among categorical data. This paper presents the M-ART algorithm, which can handle mixed datasets directly. The experimental results on synthetic datasets show that the proposed approach better reveals the similarity structure among the data, particularly when categorical attributes with different degrees of similarity are involved, a situation in which traditional clustering approaches do not perform well. The experimental results on the real dataset also show better performance than the other algorithms. Future work will apply this method to finding pattern rules in large time-series databases.

References

Anderberg, M. (1973). Cluster analysis for applications. New York: Academic Press.
Barbara, D., Couto, J., & Li, Y. (2002). COOLCAT: An entropy-based algorithm for categorical clustering. In Proceedings of the 11th international conference on information and knowledge management (pp. 582–589).
Can, F. (1993). Incremental clustering for dynamic information processing. ACM Transactions on Information Systems, 11, 143–164.
Carpenter, G.-A. (1997). Distributed learning, recognition, and prediction by ART and ARTMAP neural networks. Neural Networks, 10(8), 1473–1494.
Carpenter, G.-A., & Grossberg, S. (1987). ART2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics: Special Issue on Neural Networks, 26, 4919–4930.
Carpenter, G.-A., & Grossberg, S. (1993). Normal and amnesic learning, recognition, and memory by a neural model of cortico-hippocampal interactions. Trends in Neurosciences, 16(4), 131–137.
Carpenter, G.-A., Grossberg, S., & Rosen, D.-B. (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4, 759–771.
Dash, M., Choi, K., Scheuermann, P., & Liu, H. (2002). Feature selection for clustering – a filter solution. In IEEE international conference on data mining (pp. 115–122).
Ester, M., Kriegel, H.-P., Sander, J., Wimmer, M., & Xu, X. (1998). Incremental clustering for mining in a data warehousing environment. In Proceedings of the 24th international conference on very large data bases (VLDB) (pp. 323–333).
Gluck, M.-A., & Corter, J.-E. (1985). Information, uncertainty, and the utility of categories. In Proceedings of the seventh annual conference of the cognitive science society.
Grossberg, S. (1980). How does a brain build a cognitive code? Psychological Review, 87, 1–51.
Grossberg, S. (1999). The link between brain, learning, attention, and consciousness. Consciousness and Cognition, 8, 1–44.
Grossberg, S. (2003). How does the cerebral cortex work? Development, learning, attention, and 3D vision by laminar circuits of visual cortex. Behavioral and Cognitive Neuroscience Reviews, 2(1), 47–76.
Guha, S., Rastogi, R., & Shim, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the IEEE conference on data engineering (pp. 512–521).
Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17, 107–145.
Hsu, C.-C., & Chen, Y.-C. (2007). Mining of mixed data with application to catalog marketing. Expert Systems with Applications, 32(1), 12–23.
Hsu, C.-C. (2006). Generalizing self-organizing map for categorical data. IEEE Transactions on Neural Networks, 17(2), 294–304.
Hsu, C.-C., & Wang, S.-H. (2005). An integrated framework for visualized and exploratory pattern discovery in mixed data. IEEE Transactions on Knowledge and Data Engineering, 18(2), 161–173.
Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 283–304.
Jain, A., & Dubes, R. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall.
Jain, A.-K., Murty, M.-N., & Flynn, P.-J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
Maulik, U., & Bandyopadhyay, S. (2002). Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12), 1650–1654.
Rasmussen, E. (1992). Clustering algorithms. In W. B. Frakes & R. Baeza-Yates (Eds.), Information retrieval: Data structures & algorithms. Prentice Hall.
Salton, G., & Buckley, C. (1991). Automatic text structuring and retrieval – experiments in automatic encyclopedia searching. In Proceedings of the 14th international ACM SIGIR conference on research and development in information retrieval (pp. 21–30).
Somlo, G.-L., & Adele, E.-H. (2001). Incremental clustering for profile maintenance in information gathering web agents. In Proceedings of the fifth international conference on autonomous agents (pp. 262–269).
Wilson, D.-R., & Martinez, T.-R. (1997). Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6, 1–34.