06 April 2021 POLITECNICO DI TORINO Repository ISTITUZIONALE Analysis of diabetic patients through their examination history / Antonelli, D.; Baralis, E.; Bruno, G.; Cerquitelli, T.; Chiusano, S.; Mahoto, N.. - In: EXPERT SYSTEMS WITH APPLICATIONS. - ISSN 0957-4174. - STAMPA. - 40:11(2013), pp. 4672-4678. Original Analysis of diabetic patients through their examination history Publisher: Published DOI:10.1016/j.eswa.2013.02.006 Terms of use: openAccess Publisher copyright (Article begins on next page) This article is made available under terms and conditions as specified in the corresponding bibliographic description in the repository Availability: This version is available at: 11583/2518542 since: 2016-11-23T14:12:11Z Elsevier Ltd Analysis of diabetic patients through their examination history Dario Antonellia, Elena Baralisb, Giulia Brunoa, Tania Cerquitellib,∗, Silvia Chiusanob, Naeem Mahotob aDepartment of Production Systems and Economics, Politecnico di Torino, Turin, Italy bDepartment of Control and Computer Engineering, Politecnico di Torino, Turin, Italy Abstract The analysis of medical data is a challenging task for health care systems since a huge amount of interesting knowledge can be automatically mined to effectively support both physicians and health care organizations. This paper proposes a data analysis framework based on a multiple-level cluster- ing technique to identify the examination pathways commonly followed by patients with a given disease. This knowledge can support health care or- ganizations in evaluating the medical treatments usually adopted, and thus the incurred costs. The proposed multiple-level strategy allows clustering patient examination datasets with a variable distribution. To measure the relevance of specific examinations for a given disease complication, patient examination data has been represented in the Vector Space Model using the TF-IDF method. As a case study, the proposed approach has been applied to the diabetic care scenario. The experimental validation, performed on a real collection of diabetic patients, demonstrates the effectiveness of the ap- proach in identifying groups of patients with a similar examination history and increasing severity in diabetes complications. Keywords: data mining, cluster analysis, patient examination history, diabetes ∗Corresponding author Email addresses: dario.antonelli@polito.it (Dario Antonelli), elena.baralis@polito.it (Elena Baralis), giulia.bruno@polito.it (Giulia Bruno), tania.cerquitelli@polito.it (Tania Cerquitelli), silvia.chiusano@polito.it (Silvia Chiusano), naeem.mahoto@polito.it (Naeem Mahoto) Preprint submitted to 1. Introduction Nowadays, large amount of medical data, storing the medical patient his- tory, is collected by health care organizations. Data mining, which focuses on studying effective and efficient algorithms to transform large amounts of data into useful knowledge (Pang-Ning T. and Steinbach M. and Kumar V., 2006), may provide valuable insight into these huge data collections. For example, data mining techniques can be used to extract a variety of infor- mation on the patient history such as the medical protocols usually adopted for patients with a given disease. Healthcare organizations can exploit this knowledge to improve their current processes, assess new medical guidelines, or enrich the existing ones. Medical guidelines represent standard medical pathways specifying the actions necessary to treat, with optimal effectiveness and efficiency, patients with a given disease. This study addresses the problem of analysing patients’ examination data to identify the examination pathways commonly followed by patients. This issue is crucial for health care organizations, because it can significantly impact on the effectiveness of the medical treatments as well as on the costs incurred by the organizations. The data analysis framework proposed in this paper exploits a multiple- level clustering approach to discover, in a data collection with a variable distribution, cohesive and well-separated groups of patients with a similar ex- amination history. Cluster analysis is an exploratory data mining technique that partitions a data object collection into groups based on object proper- ties, without the support of additional a priori knowledge (in contrast with classification algorithms using class label information) (Pang-Ning T. and Steinbach M. and Kumar V., 2006). To cluster data collections with a vari- able distribution, the proposed multiple-level clustering strategy iteratively focuses on disjoint dataset portions and locally identifies clusters. Among the state-of-the-art clustering techniques, the density-based DBSCAN algo- rithm (Ester et al., 1996) has been adopted due to its properties. DBSCAN allows the identification of arbitrarily shaped clusters, is less susceptible to noise and outliers, and does not require the specification of the number of expected clusters in the data. To highlight the relevance of specific exam- inations for a given clinical condition, in the proposed framework patient examination data has been represented in the Vector Space Model (VSM) (Salton G., 1971) using the TF-IDF method (Pang-Ning T. and Steinbach M. and Kumar V., 2006). 2 As a reference case study, the proposed framework has been applied to a real dataset of diabetic patients provided by the National Health Center (NHC) of the Asti province (Italy). The results showed that, starting from a large collection of raw examination data, the framework allows the iden- tification of clusters containing patients with a similar examination history. More specifically, clusters contain patients with increasing disease severity, as patients are tested with more and more specific examinations to diagnose diabetes complications. The results were discussed with the support of clini- cal domain experts, showing a fairly good correlation among the examination pathways suggested by the clusters and the guidelines for diabetes disease (ICD-9-CM, 2011). The paper is organized as follows. Section 2 describes the state-of-the- art clustering methods and explains the selection of the DBSCAN algorithm for this study. Section 3 analyses previous related work. Section 4 presents the proposed framework and describes its building blocks, while the results obtained for the real diabetic patient dataset are discussed in Section 5. Finally, Section 6 draws conclusions and future work. 2. Selection of the clustering algorithm Cluster analysis partitions objects into groups (clusters) so that objects within the same group are more similar to each other than those objects assigned to different groups (Pang-Ning T. and Steinbach M. and Kumar V., 2006). Clustering algorithms require the definition of a metric to evaluate the similarity (or distance) between objects based on the features describing objects. Clustering algorithms can be classified into the following four categories (Pang-Ning T. and Steinbach M. and Kumar V., 2006): (i) center, (ii) density, (iii) model, and (iv) hierarchical-based methods. In center-based methods (e.g., K-means (Juang & Rabiner, 1990)), a clus- ter is a set of data objects in which each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster. The prototype is the most representative point in the cluster. These meth- ods find spherical-shaped clusters, unless clusters are well separated, and are sensitive to outliers. In density-based methods (e.g., DBSCAN (Ester et al., 1996)), a cluster is a dense area of data objects surrounded by an area of low density. These 3 approaches are less sensitive to the presence of outliers than center-based techniques and can identify non-spherical shaped clusters. Model-based methods (e.g. EM (G. McLachlan and T. Krishnan, 1997), COBWEB (Fisher, 1987a,b)) hypothesize a mathematical model for each cluster, and then determine the best fit between the model and the object collection. These methods can deal with outliers and noise. However, sim- ilarly to K-means, EM requires the specification of the number of expected clusters. Hierarchical-based methods exploit an agglomerative or divisive approach to generate a hierarchical collection of clusters. The (most common) agglom- erative approach (Pang-Ning T. and Steinbach M. and Kumar V., 2006) ini- tially assigns each data object to a singleton cluster. The two closest clusters are then iteratively merged using a cluster proximity measure (e.g., single- link, complete-link, or group average average) (Pang-Ning T. and Steinbach M. and Kumar V., 2006). These methods are often used when the underlying application requires the creation of a taxonomy, which is not the case for our application scenario. Density-based methods showed remarkable properties for clustering pa- tients based on their examination history. More specifically, the very effective density-based algorithm DBSCAN (Ester et al., 1996) has been selected in this study. Differently from other algorithms (e.g., K-means), DBSCAN is less sensitive to outliers and can find arbitrarily shaped clusters. Outliers, when not identified and isolated as in DBSCAN, are clustered together with the other data objects, thus decreasing cluster cohesion. In addition, DB- SCAN does not require an a priori specification of the number of clusters in the data, as opposed to K-means and EM. Medical datasets can include outliers as specific examination pathways for some disease conditions and clusters can be non-spherical shaped. Besides, since our aim is discovering the examination pathways usually adopted for a given disease through an explorative data analysis, the expected number of clusters can be hardly guessed a priori. For these reasons, the DBSCAN algorithm has been adopted for the cluster analysis in this study. The main characteristics of DBSCAN are reported in the following Section 2.1. To discover clusters in datasets with a variable distribution, state-of-the- art clustering algorithms can be applied in a multiple-level fashion to focus on different dataset portions and locally identify cluster (e.g., the bisecting K-means algorithm (Pang-Ning T. and Steinbach M. and Kumar V., 2006)). 4 The multiple-level strategy adopted in this study exploits the DBSCAN algo- rithm to select the dataset part analyzed at each iteration and locally cluster it. 2.1. The DBSCAN algorithm The DBSCAN algorithm (Ester et al., 1996) relies on two input parame- ters, named Eps and MinPts, to define a density threshold in the data space. A dense region in the data space is a n-dimensional sphere with radius Eps and containing at least MinPts objects. The DBSCAN algorithm iterates over the data objects in the collection by analyzing their neighborhood. It classifies objects as being (i) in the interior of a dense region (a core point), (ii) on the edge of a dense region (a border point), or (iii) in a sparsly occupied region (a noise or outlier point). Any two core points that are close enough (within a distance Eps of one another) are put in the same cluster. Any border point close enough to a core point is put in the same cluster as the core point. Outlier points (i.e., points far from any core point) are isolated. DBSCAN can discover arbitrarily shaped clusters and identify outliers as objects in a low density area in the data space. The effectiveness of DBSCAN is affected by the selection of the Eps and MinPts values. Section 4.2 discusses how this issue has been addressed in this study. 3. Related work Several works exploited data mining techniques to analyze medical data by addressing different pathologies and facets of various diseases. Clustering algorithms have been widely exploited to analyze medical data for patients affected by different diseases (Ahmed & Funk, 2011; Buczak et al., 2009; Choong et al., 2000; Mulroy et al., 2003; Van Rooden et al., 2010; Santamaria et al., 2003). (Van Rooden et al., 2010) reviewed the cluster methods used to identify Parkinson’s disease subtypes. It showed that the K-means algorithm (Juang & Rabiner, 1990) was mostly adopted, but also highlited its two major limitations, i.e., the sensitiveness to outliers and the need of defining the expected number of clusters. To overcome these issues, the DBSCAN algorithm has been used in this study, being able to internally evaluate the optimal number of clusters and automatically identify outliers. In (Buczak et al., 2009), patients were represented by tracking the number of occurrences for each clinical event (e.g., hospital visit, lab order). Patients 5 were then grouped using a hierarchical agglomerative clustering method with Ward’s linkage as proximity measure (Pang-Ning T. and Steinbach M. and Kumar V., 2006). Differently from (Buczak et al., 2009), in this study we ex- ploit the TF-IDF scheme to weight the relevance of specific examinations for each diabetes condition. In addition, a multiple-level cluster approach is used to identify groups of patients in datasets with a variable data distribution. Concerning the analysis of medical data for diabetic patients, several works, mainly exploiting classification techniques, have been proposed (Kare- gowda et al., 2012; Meng et al., 2012; Mohamudally & Khan, 2011; Zhong et al., 2012). Classification is a supervised data mining approach that assigns new unlabeled data to a class label by means of a model built from data with known class label (Pang-Ning T. and Steinbach M. and Kumar V., 2006). In (Karegowda et al., 2012), diabetic patients were categorized with a K-nearest neighbor (KNN) classifier (Pang-Ning T. and Steinbach M. and Kumar V., 2006). To ehnance classification accuracy, in a pre-processing step a genetic algorithm and a feature selection technique (i.e., the correlation method (Pang-Ning T. and Steinbach M. and Kumar V., 2006)) are used to identify the relevant features for classification. In (Meng et al., 2012) the authors compared three prediction models (i.e., logistic regression, decision tree, and artificial neural networks) for diabetic patients classification. The comparison was based on common risk factors collected from both diabetic and pre-diabetic patients with a standard questionnaire. This work showed that the decision tree model (C5.0 (Pang-Ning T. and Steinbach M. and Kumar V., 2006)) yielded the best accuracy, followed by logistic regression, and artificial neural networks. The work in (Zhong et al., 2012) proposed a multi-level support vector machine approach to classify and predict clinical charge profiles as well as the length of hospital stay for patients affected by heart, diabetes, and cancer diseases. In (Mohamudally & Khan, 2011), different data mining algorithms (the K-means algorithm, C4.5 decision tree classifier, artificial neural networks, and 2D graphs for data visualization) are integrated to predict, classify, and visualize a medical diabetic dataset. A parallel effort has been devoted to exploit clustering techniques for diabetic patients by addressing different issues as food analysis (Phanich et al., 2010), gait patterns (Sawacha et al., 2010), and relationships among diabetes and risk factors (Chaturvedi, 2003). Food clustering analysis for diabetic patients has been proposed in (Phanich et al., 2010). Using Self- Organizing Map (SOM) and the K-means algorithm, this work provided a Food Recommendation System suggesting proper substituted foods. The 6 work in (Sawacha et al., 2010) proposed a cluster analysis of biomechanical data to group patients with similar diabetic gait patterns. In (Chaturvedi, 2003) the spatial clusters of diabetes prevalence in Texas has been analyzed. The relationship of risk factors (i.e., age and obesity) associated with diabetes have also been analyzed. Differently from these works, this study exploits clustering techniques to identify groups of patients with similar examinations history. 4. Methodology The proposed framework to analyse the patient examination history con- tains four main steps: (i) data collection, (ii) data transformation, (iii) cluster analysis, and (iv) cluster evaluation. The building blocks of the framework are shown in Figure 4 and detailed in the following subsections. The patient examination log data is first collected and then transformed using the Vector Space Model (VSM) representation (Salton G., 1971). Ex- amination frequencies are weighted through the Term Frequency (TF) - In- verse Document Frequency (IDF) scheme (Pang-Ning T. and Steinbach M. and Kumar V., 2006). The multiple-level clustering approach is applied to identify, in a dataset with a variable distribution, groups of patients with a similar examination history. The DBSCAN algorithm has been exploited for the cluster analysis. Finally, clustering results are evaluated through a quality index balanc- ing both intra-cluster homogeneity and inter-cluster separation. Silhouette (Rousseeuw, 1987) has been considered as reference index. Cluster sets are also analysed together with a domain expert to assess their significance. To analyse the examination pathways represented by the cluster set, each cluster has been characterized with the examinations appearing in it. 4.1. Representation of the patient examination history The data recording the patient examination history is represented using the Vector Space Model (VSM) (Salton G., 1971). Each patient is a vector in the examination space. Each vector element corresponds to a different exam- ination and is associated with a weight describing the examination relevance for the patient. More specifically, it reports the weighted number of times the examination was repeated by the patient. The Term Frequency (TF) - Inverse Document Frequency (IDF) scheme (Pang-Ning T. and Steinbach M. and Kumar V., 2006) has been adopted to weight examination frequency. 7 W eight ed examination frequency vect ors Data transformation Cluster analysis Cluster evaluation Data collection VSM representation using TF- I DF scheme Pat ient examination log dat a Mult iple- level clust ering Cluster set Clust er evaluat ion Figure 1: The proposed framework for the cluster analysis of patient examination data Both the VSM representation and the TF-IDF scheme have been applied in previous works to represent text documents. The adopted data representation allows highlighting the relevance of spe- cific examinations for a given patient condition. The TF-IDF value increases proportionally to the number of times an examination appears in the pa- tient history, but is offset by the frequency of the examination in the patient collection, which helps to control the fact that some examinations are gen- erally more common than others. Unweighted examination frequencies do not properly characterize the patient condition, since standard routine tests usually appear with high frequency, while more specific tests may appear with lower frequency. More formally, let D be a collection of patient records and Σ = {e1, . . . , ek} the set of examinations done by at least one patient in D. Each patient pi in D is represented by a weighted examination frequency vector vpi of |Σ| cells. Each element vpi[j] of vector vpi reports the weighted frequency wpi,ej of examination ej for patient pi, i.e., 8 vpi = [wpi,e1, . . . , wpi,e|Σ|]. (1) The TF-IDF weight wpi,ej for the pair (pi, ej) is computed as the prod- uct of two terms, called Term Frequency (TFpi,ej) and Inverse Document Frequency (IDFej), wpi,ej = TFpi,ej ∗ IDFej. (2) The Term Frequency TFpi,ej for the pair (pi, ej) represents the relative frequency of examination ej for patient pi. It is given by TFpi,ej = fpi,ej/ ∑ 1≤k≤|Σ| fpi,ek, (3) where fpi,ej is the number of times patient pi underwent examination ej and ∑ 1≤k≤|Σ| fpi,ek is the total number of examinations done by patient pi. The Inverse Document Frequency IDFej for examination ej represents the frequency of ej in the patient collection. It is computed as IDFej = Log[|D|/|pk ∈ D : fpk,ej 6= 0|], (4) where |D| is the number of patients in the collection D and |pk ∈ D : fpk,ej 6= 0| is the number of patients in D who underwent (at least once) examina- tion ej. Mathematically, the base of the log function does not matter and constitutes a constant multiplicative factor towards the overall result. The TF-IDF weight wpi,ej for the pair (pi, ej) is high when examination ej appears with high frequency in patient pi and low frequency in patients in the collection D. When examination ej appears in more patients, the ratio inside the IDF’s log function approaches 1, and the IDFej value and TF- IDF weight wpi,ej become close to 0. Hence, the approach tends to filter out common examinations. 4.2. The multiple-level DBSCAN approach for cluster analysis Density-based algorithms can effectively discover clusters of arbitary shape and filter out outliers, thus increasing cluster homogeneity. Clusters are iden- tified as dense areas of data objects surrounded by an area of low density. In the DBSCAN algorithm, density is evaluated based on the user-specified parameters Eps and MinPts. One single execution of DBSCAN discovers dense groups of patients according to one specific setting for these parame- ters. Patients in lower density areas are labeled as outliers and not assigned 9 to any cluster. Hence, different parameter settings are needed to discover clusters in datasets with a variable data distribution as the one considered in this study. Groups of patients with close examination histories may have both different cardinalities and densities, i.e., groups may contain examina- tion histories with different degrees of similarity. The proposed multiple-level clustering approach allows clustering datasets with a variable distribution by iteratively applying the DBSCAN algorithm on different (disjoint) dataset portions. The whole original dataset is clus- tered at the first level. Then, at each subsequent level, patients labeled as outliers in the previous level are re-clustered. The DBSCAN parameters Eps and MinPts are properly set at each level. In the following, Section 4.2.1 presents the measure used to evaluate the similarity between patient examination histories, while Section 4.2.2 de- scribes how the DBSCAN parameters and the number of clustering levels have been selected. 4.2.1. Similarity between patient examination histories The cosine similarity measure has been adopted to evaluate the simi- larity between the weighted examination frequency vectors representing the patient examination histories. This measure has been often used to compare documents in text mining (Steinbach et al., 2000). Let pi and pj be two arbitrary patients in the collection D. Let vi and vj be the corresponding weighted examination frequency vectors as in Equation 1. The cosine similarity between vi and vj is computed as cos(vi, vj) = vi • vj ‖vi‖ ‖vj‖ = ∑ 1≤k≤|Σ| vi[k]vj[k] √ ∑ 1≤k≤|Σ| vi[k] 2 √ ∑ 1≤k≤|Σ| vj[k] 2 , (5) where cos(vi, vj) is in the range [0,1]. cos(vi, vj) equal to 1 describes the exact similarity of examination histories for patients pi and pj, while cos(vi, vj) equal to 0 points out that patients have complementary histories (i.e., the sets of their examinations are disjoint). 4.2.2. Number of clustering levels and DBSCAN parameters The dataset density is preliminarly analysed using the k-dist graph (Pang- Ning T. and Steinbach M. and Kumar V., 2006) to select both the number of 10 iterations for the multiple-level clustering approach and the Eps and MinPts values for each iteration. For each patient in the collection, the k-dist graph plots the distance to its kth nearest neighbor according to their examination history. On the x- axis patients are sorted by the distance to the kth nearest neighbor, while on the y-axis distances to the kth nearest neighbor are reported. The k value corresponds to the MinPts parameter. The y-axis represents possible values of the Eps parameter. By cutting the graph at a given value on the y-axis, the corresponding px value on the x-axis partitions the patient collection into the following two subsets. Patients placed on the left hand side of px are labeled by DBSCAN as core points, and those on the rigth side of px as outlier or border points. Sharp changes in the k-dist graph identify dataset portions with a dif- ferent density (Pang-Ning T. and Steinbach M. and Kumar V., 2006). The multiple-level strategy analyzes these dataset portions in different iterations. The Eps value to cluster each dataset part is selected in correspondence with the sharp change appearing in the graph. Section 5.2 discusses the selection of DBSCAN parameters and number of iteration levels for the diabetes dataset considered in this study. 4.3. Cluster evaluation The discovered cluster set is evaluated using the Silhouette index (Rousseeuw, 1987). Silhouette allows evaluating the appropriateness of the assignment of a data object to a cluster rather than to another by measuring both intra- cluster cohesion and inter-cluster separation. The silhouette value for a given patient pi in a cluster C is computed as s(pi) = b(pi) − a(pi) max{a(pi), b(pi)} , s(pi) ∈ [−1, 1], (6) where a(pi) is the average distance of patient pi from all other patients in the cluster C, and b(pi) is the smallest of average distances from its neighbour clusters. The silhouette value for a cluster C is the average silhouette value on all its patients. Negative silhouette values represent wrong patient place- ments, while positive silhouette values a better patient assignments. Clusters with silhouette values in the range [0.51,0.70] and [0.71,1] respectively show that a reasonable and a strong structure have been found (Kaufman, L. and 11 Rousseeuw, P. J., 1990). The cosine similarity metric has been used for sil- houette evaluation, since this measure was used to evaluate patient similarity in the cluster analysis (see Section 4.2.1). 4.4. Data mining tool The DBSCAN algorithm available in the RapidMiner system has been used for the cluster analysis within the proposed framework. The RapidMiner toolkit (Rapid Miner Project, 2013) is an open-source system consisting of a number of data mining algorithms to automatically analyze a large data collection and extract useful knowledge. The procedures for data transformation and cluster evaluation have been developed in the Phyton programming language (Python Software Founda- tion, 2013). These procedures transform the patient examination log data into the VSM representation using the TF-IDF scheme and compute the silhouette values for the cluster set provided by the cluster analysis. 5. Results and discussion This section presents and discusses the results obtained when analysing a real collection of examination log data for diabetic patients with the proposed framework. In the following, Section 5.1 describes the dataset considered in this study, while Section 5.2 specifies the framework configuration for the cluster analysis. The cluster results are presented and discussed in Section 5.3. Finally, Section 5.4 evaluates the performance of the multiple-level clus- tering approach in terms of execution time. 5.1. Diabetic patient dataset The dataset considered in this study was collected by the Local Health Center (LHC) of the Asti province in Italy. The LHC database stores all the accesses to the health care system in the Asti province in the year 2007. From this database the examination log data of all patients with overt di- abetes were extracted. Raw data contain 95,788 records with examinations performed by 6,380 patients. They contain both (i) routine and (ii) more specific examinations to analyze diabetes complications on various degrees of severity. The dataset includes both male and female patients in a wide age range (between 4 and 95 years). The diagnostic and therapeutic procedures are defined using the ICD 9-CM (International Classification of Diseases, 9th revision, Clinical Modification) (ICD-9-CM, 2011). 12 5.2. Selection of parameters for clustering diabetic patients To select the number of iterations for the multiple-level clustering strategy and the DBSCAN parameter for each level, we relied on the k-dist graph as discussed in Section 4.2.2. In selecting these parameters, we addressed the following issues. To discover representative examination pathways for the diabetes, we aim at avoiding clusters including few patients. In addition, to consider all different patient examination histories, we aim at limiting the number of patients labeled as outliers and thus unclustered. We plotted the k-dist graph, and analysed the cluster results, by varying the MinPts (i.e., k) value. Low MinPts values were not suitable for our purpose, because DBSCAN identifies small groups of patients. At the same time, large MinPts values may increase the number of outlier points included in the clusters. The experimental results showed that MinPts in the range [25,35] provided similar results, that were compliant with the above issues. We selected MinPts=30. From the k-dist graph for k=30, we observed three main dataset portions with a different density for Eps in the range [0.2-0.3], [0.4-0.5], and [0.6-0.7], respectively. Consequently, we adopted a three-level clustering approach, with each level focusing on one among the three dataset parts. The selected Eps values were 0.3, 0.4, and 0.6 for the first, second, and third clustering level, respectively. 5.3. Analysis of the clustering results Starting from a large collection of raw examination data, the proposed framework allows the discovery of a set of clusters containing patients with a similar examination history. The multiple-level DBSCAN approach, iterated for three levels, computed clusters that progressively contain patients with increasing severity in diabetes, because patients are tested using more and more specialized examinations. More specifically, first-level clusters contain patients mainly undergoing routine tests to monitor diabetes conditions, or some basic tests to diagnose disease complications. Second-level clusters col- lect patients that are tested using an increasing number of examinations to diagnose some diabetes complications. Examinations become progressively more numerous and specific in third-level clusters, indicating patients that can have diabetes complications of increasing severity. Since at each level clusters contain more specific examinations, a lower number of patients is contained in each cluster and the cluster size tends to reduce progressively. Clusters show good cohesion and separation as they are characterized by high 13 silhouette values. The results, discussed with the support of a clinical do- main expert, show a fairly good correlation among the examination pathways suggested by the clusters and the guidelines for diabetes disease (ICD-9-CM, 2011). Cluster properties are discussed in detail in the following subsections. Tables 1, 2, and 3 report, for each first- and second-level cluster, the ex- amination frequencies computed as the percentage of patients in the cluster tested by each examination. Clusters are named as Cij in the tables, where j denotes the level of the multiple-level DBSCAN approach providing the cluster and i locally identifies the cluster at each level j. 5.3.1. First-level cluster set First-level clusters are reported in Tables 1 and 2. Clusters can be par- titioned into the following two main groups: clusters containing patients (i) with standard examinations to monitor diabetes conditions (clusters C11- C51, in Table 1), (ii) coupled with basic examinations to diagnose disease complications (clusters C61-C111 in Table 2). The two largest clusters (C11 and C21) contain patients who mostly per- formed standard routine tests. Besides routine examinations, all patients in cluster C31 had a specialistic visit and were tested with usual basic exam- inations to diagnose the most frequent diabetes complications, as risks for cardiovascular disease and eye problems. All patients in clusters C41 and C51 only had a checkup visit, together with the glucose level test in cluster C51. These two clusters may include patients usually tested in private structures and periodically reporting test results to NHC. Patients in clusters C61-C111 were additionally tested to diagnose diabetes complications in the (a) eye (C61), (b) cardiovascular system (C71), (c) both eye and cardiovascular system (C81), (d) carotid (C91), and (e) limb (C101). Finally, (f) cluster C111 includes tests for the liver, kidneys, and in particular cardiovascular system. Differently from second- and third-level clusters, di- abetes complications were monitored in clusters C61-C111 using a few (quite standard) tests which showed a limited degree of severity. Only the cardio- vascular system has been thoroughly tested in cluster C111. Standard routine examinations still appear in clusters C61-C111 even though (usually) with a lower frequency than in clusters C11-C51. Patients with an examination history significantly dissimilar from all the others are labeled as outliers and not included in any cluster. The DBSCAN 14 Table 1: Examination frequencies (%) in first-level clusters containing patients undergoing routine tests (Eps=0.3, MinPts=30) Category Examination C11 C21 C31 C41 C51 Routine Checkup visit 78 96 58 100 100 Glucose level 78 98 63 - 100 Urine test 72 97 58 - - Venous blood 96 75 35 - - Capillary blood 72 97 58 - - Haemoglobin 100 - - - - Specialistic visit - 13 100 - - Cardiovascular Electrocardiogram - - 100 - - Eye Fundus Oculi - - 28 - - Number of patients 223 1,764 43 110 41 Silhouette 0.67 0.55 0.85 0.99 1.0 algorithm was re-applied on these patients only (3,509 patients), with differ- ent parameter values. The results are discussed in Section 5.3.2. 5.3.2. Second-level cluster set Second-level clusters contain patients with more diversified examination histories. More specifically, the following two main categories of clusters can be identified: (i) clusters containing patients tested with specific examina- tions to diagnose a given diabetes complication (clusters C12-C22); (ii) clus- ters with patients who underwent various examinations to diagnose different diabetes complications (clusters C32-C52). These two categories respectively indicate patients that can be seriously affected by a particular disease com- plication or by more than one disease complication at the same time. Second- level clusters are reported in Table 3. Clusters C12 and C22 include specific examination to diagnose eye compli- cations. More specifically, all patients in cluster C12 underwent a battery of tests to assess vision and ability to focus on objects (called ”Eye examination” in Table 3). Instead, all patients in cluster C22 had Retinal photocoagula- tion, a laser operation done in cases of long-term eye complications, such as proliferative retinopathy. All patients in clusters C32-C52 may suffer complications on the cardio- vascular, liver, and kidneys systems, but with different degrees of severity. 15 Table 2: Examination frequencies (%) in first-level clusters including basic tests to diagnose diabetes complications (Eps=0.3, MinPts=30) Category Examination C61 C71 C81 C91 C101 C111 Routine Checkup visit 77 78 66 62 68 97 Glucose level 74 74 64 62 59 97 Urine test 74 74 64 57 56 92 Venous blood 57 60 53 48 44 97 Capillary blood 74 73 63 55 56 92 Haemoglobin - - - 14 12 100 Specialistic visit - - - - - - Complete blood count - - - - 3 3 Cardiovascular Electrocardiogram - 100 100 - - 42 Cholesterol - - - - - 100 HDL Cholesterol - - - - - 100 Triglycerides - - - - - 100 Eye Fundus Oculi 100 - 100 - - 39 Liver ALT - - - - - 100 AST - - - - - 100 Kidney Creatinine - - - - - 3 Creatinine clearance - - - - - 100 Culture urine - - - - - 100 Microscopic urine analysis - - - - - 100 Uric Acid - - - - - 100 Carotid ECO doppler carotid - - - 100 - - Limb ECO doppler limb - - - - 100 - Number of patients 294 144 140 42 34 36 Silhouette 0.65 0.74 0.79 0.95 0.97 0.90 16 Patients in cluster C52 are at risk of more severe liver complications than those in clusters C32 and C42, since their examination history includes more tests in this category. Analogously, cluster C32 is characterized by more severe renal complications than clusters C42 and C52. Cardiovascular com- plications have similar severity in clusters C32-C52, because examinations in this category appear with similar frequency. At this stage, 2,939 patients are classified as outliers and not assigned to any cluster. We further applied the DBSCAN algorithm on them by modifying the parameter setting. The results are analyzed in Section 5.3.3. 5.3.3. Third-level cluster set The results collected at this stage (with DBSCAN parameter Eps=0.6 and MinPts=30) show a similar trend to second-level clusters. Third-level clusters contain patients that may suffer more complications at the same time (e.g., complications on carotid, liver, and cardiovascular systems) and are tested with more specific examinations (e.g., transcutaneous oxygen and carbon dioxide monitor or upper abdominal ultrasound). By stopping the multiple-level DBSCAN approach at this level, only 1,239 patients labeled as outliers remain unclustered, with respect to the initial collection of 6,380 patients considered at the first level (i.e., about 19% of patients). By further applying the DBSCAN algorithm on this outlier set, fragmented groups of patients can be identified. These clusters can represent patients affected by more rare diabetes complications, and thus with examinations different from those done by most patients. 5.4. Execution time The run time of DBSCAN at the first, second, and third level is 9 min 10 sec, 2 min 25 sec, and 1 min 45 sec, respectively. The run time progressively reduces because less patients are considered at each subsequent level. 6. Conclusion To get the most out of large and complex medical databases, innovative data analysis techniques are needed to extract useful knowledge in a timely fashion. In this paper a data analysis framework based on a multiple-level clustering approach has been proposed to identify groups of patients with a similar examination history in a dataset with a variable data distribution. The TF-IDF method has been exploited to represent patient examination 17 Table 3: Examination frequencies (%) in second-level clusters, including examinations to diagnose more severe diabetes complications (Eps=0.4, MinPts=30) Category Examination C12 C22 C32 C42 C52 Routine Checkup visit 65 90 22 98 71 Glucose level 52 92 22 100 95 Urine test 37 90 19 98 68 Venous blood 35 84 100 100 98 Capillary blood 37 85 17 98 63 Haemoglobin 13 34 100 98 83 Specialistic visit - 13 - - - Complete blood count 3 15 45 7 93 Cardiovascular Electrocardiogram 17 21 70 47 20 Cholesterol 7 25 100 97 93 HDL Cholesterol 4 26 100 99 92 Triglycerides 7 26 100 97 90 Eye Fundus Oculi 50 38 53 49 27 Angioscopy - 30 - - - Eye examination 100 5 - - - Renital photocoagulation - 100 - - - Liver ALT - 21 10 98 98 AST - 21 8 97 98 Bilirubin - 2 - - 100 Gamma GT - - 100 - 95 Kidney Creatinine 4 20 99 11 78 Creatinine clearance 2 10 - 99 - Culture urine 2 11 67 97 44 Microscopic urine analysis 4 16 2 82 73 Uric acid - 13 3 97 68 Microalbuminuria - 7 100 60 37 Carotid ECO doppler carotid - 5 - - 2 Limb ECO doppler limb - - - - - Number of patients 46 61 139 283 41 Silhouette 0.92 0.88 0.72 0.69 0.87 18 data, thus highlighting the relevance of the different examinations. The pro- posed methodology has been validated in the diabetic care scenario on a real dataset of diabetic patients. The analysis identified cohesive and well- separated groups of patients with standard or more specific examinations for diabetes, showing a good correlation with medical guidelines for dia- betes (ICD-9-CM, 2011). Future developments of the proposed approach will explore the correlation between patient examination data and additional aspects of the medical treatments such as pharmaceutical drug therapies. Furthermore, we plan to apply the proposed approach to different medical contexts (e.g., cardiac patients). 7. Acknowledgments The authors wish to thank Dr. Baudolino Mussa and Dr. Dario Bellomo for their advice and fruitfull discussions. References Ahmed, M. U., & Funk, P. (2011). Mining rare cases in post-operative pain by means of outlier detection. In IEEE Symposium on Signal Processing and Information Technology (ISSPIT) (pp. 35–41). Buczak, A. L., Moniz, L. J., Feighner, B. H., & Lombardo, J. S. (2009). Mining electronic medical records for patient care patterns. In IEEE Sym- posium on Computational Intelligence and Data Mining (CIDM) (pp. 146– 153). Chaturvedi, K. (2003). Geographic concentrations of diabetes preva- lence clusters in texas and their relationship to age and obesity. http://www.ucgis.org/summer03/studentpapers/kshitijchaturvedi.pdf. Re- trieved, 9 , 2010. Choong, P. F. M., Langford, A. K., Dowsey, M. M., & Santamaria, N. M. (2000). Clinical pathway for fractured neck of femur: a prospective, con- trolled study. Med J Aust., 172 , 423–426. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Knowledge Discovery and Data Mining (KDD) (pp. 226–231). 19 Fisher, D. H. (1987a). Improving inference through conceptual clustering. In National conference on Artificial intelligence (AAAI) (pp. 461–465). Fisher, D. H. (1987b). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2 , 139–172. G. McLachlan and T. Krishnan (1997). The EM algorithm and extensions. Wiley series in probability and statistics. John Wiley and Sons. ICD-9-CM, I. (2011). International Classification of Diseases, 9th revision, Clinical Modification. Available: http://icd9cm.chrisendres.com. Last ac- cess on March 2011, . Juang, B.-H., & Rabiner, L. (1990). The segmental k-means algorithm for estimating parameters of hidden markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 38 , 1639–1641. Karegowda, A., Jayaram, M., & Manjunath, A. (2012). Cascading k-means clustering and k-nearest neighbor classifier for categorization of diabetic patients. International Journal of Engineering and Advanced Technology (IJEAT), (pp. 147 – 151). Kaufman, L. and Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis. Wiley. Meng, X.-H., Huang, Y.-X., Rao, D.-P., Zhang, Q., & Liu, Q. (2012). Com- parison of three data mining models for predicting diabetes or prediabetes by risk factors. The Kaohsiung Journal of Medical Sciences, (pp. 93–99). Mohamudally, N., & Khan, D. M. (2011). Application of a unified med- ical data miner (umdm) for prediction, classification, interpretation and visualization on medical datasets: the diabetes dataset case. In Interna- tional conference on Advances in data mining: applications and theoretical aspects (ICDM) (pp. 78–95). Berlin, Heidelberg: Springer-Verlag. Mulroy, S., Gronley, J., Weiss, W., Newsam, C., & Perry, J. (2003). Use of cluster analysis for gait pattern classification of patients in the early and late recovery phases following stroke. Gait Posture, 18 , 114–25. Pang-Ning T. and Steinbach M. and Kumar V. (2006). Introduction to Data Mining. Addison-Wesley. 20 Phanich, M., Pholkul, P., & Phimoltares, S. (2010). Food recommendation system using clustering analysis for diabetic patients. In IEEE Interna- tional Conference on Information Science and Applications (ICISA) (pp. 1–8). Python Software Foundation, P. S. (2013). Python Programming Language Official Website. Available: http://www.python.org/ Last access on Jan- uary 2013, . Rapid Miner Project, R. M. (2013). The Rapid Miner Project for Machine Learning. Available: http://rapid-i.com/ Last access on January 2013, . Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Computational and Applied Mathemat- ics, (pp. 53–65). Salton G. (1971). The SMART retrieval system: experiments in automatic document processing. Prentice-Hall. Santamaria, N., Houghton, L., & Kimmel, A., L. Graham (2003). Clinical pathways for fractured neck of femur: A cohort study of health related quality of life, patient satisfaction and clinical outcome. Australian Journal of Advanced Nursing, 20 , 24–29. Sawacha, Z., Guarneri, G., Avogaro, A., & Cobelli, C. (2010). A new classi- fication of diabetic gait pattern based on cluster analysis of biomechanical data. Journal of Diabetes Science and Technology, 4 , 1127–38. Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. In KDD Workshop on Text Mining. Van Rooden, S. M., Heiser, W. J., Kok, J. N., Verbaan, D., van Hilten, J. J., & Marinus, J. (2010). The identification of parkinson’s disease subtypes using cluster analysis: A systematic review. Movement Disorders, 25 , 969–978. Zhong, W., Chow, R., & He, J. (2012). Clinical charge profiles prediction for patients diagnosed with chronic diseases using multi-level support vector machine. Expert Systems with Applications, 39 , 1474 – 1483. 21