key: cord-0003632-u9z8x4v9
authors: Pei, Dongmei; Zhang, Chengpu; Quan, Yu; Guo, Qiyong
title: Identification of Potential Type II Diabetes in a Chinese Population with a Sensitive Decision Tree Approach
date: 2019-01-22
journal: J Diabetes Res
DOI: 10.1155/2019/4248218
sha: 0dc958db13e5ac3b76b8204a8081ad0cec38bcfa
doc_id: 3632
cord_uid: u9z8x4v9

BACKGROUND: Diabetes mellitus is a chronic disease with a steadily increasing prevalence. Due to the chronic course of the disease combined with devastating complications, this disorder can easily impose a financial burden. The early diagnosis of diabetes remains one of the major challenges facing medical providers, and satisfactory screening tools or methods are still required, especially population- or community-based tools.

METHODS: This is a retrospective cross-sectional study involving 15,323 subjects who underwent the annual check-up in the Department of Family Medicine of Shengjing Hospital of China Medical University from January 2017 to June 2017. After strict data filtration, 10,436 records from eligible participants were used to develop a prediction model using the J48 decision tree algorithm. Nine variables, including age, gender, body mass index (BMI), hypertension, history of cardiovascular disease or stroke, family history of diabetes, physical activity, work-related stress, and salty food preference, were considered.

RESULTS: The accuracy, precision, recall, and area under the receiver operating characteristic curve (AUC) value for identifying potential diabetes were 94.2%, 94.0%, 94.2%, and 94.8%, respectively. The structure of the decision tree shows that age is the most significant feature. Among participants aged ≤ 49, 5497 (97%) were identified as nondiabetic, whereas among those aged > 49, 771 (50%) were identified as nondiabetic.
In the subgroup with 34 < age ≤ 49 and BMI ≥ 25, 89 (92%) of the 97 individuals with a positive family history of diabetes were identified as diabetic, whereas 576 (58%) of those without a family history of diabetes were identified as nondiabetic. Work-related stress was identified as being associated with diabetes. Among individuals with 34 < age ≤ 49, BMI ≥ 25, and no family history of diabetes, 22 (51%) of those with high work-related stress were identified as nondiabetic, compared with 349 (88%) of those with low or moderate work-related stress.

CONCLUSIONS: We proposed a decision tree classifier that uses nine easily obtained, noninvasive patient features as predictor variables to identify potential cases of diabetes. The classifier indicates that a decision tree analysis can be successfully applied to screen for diabetes, which will support clinical practitioners in rapid diabetes identification. The model provides a means to target the prevention of diabetes, which could reduce the burden on the health system through effective case management.

Diabetes mellitus is a chronic disease that affects a large portion of the population. Its global prevalence is rising steadily, and this trend is projected to continue as the population grows and ages. As reported by the World Health Organization (WHO), the number of people with diabetes rose from 108 million in 1980 to 422 million in 2014, and diabetes is projected to be the seventh leading cause of death in 2030 [1]. Due to the chronic course of the disease combined with devastating complications, this disorder can easily cost medical care tens of billions of dollars. In a recent report, the direct and indirect economic costs of diabetes in the United States of America in 2017 were as high as 327 billion dollars [2].
People with diabetes can carry a financial burden of about $13,700 per year [3]. The pathological mechanisms of diabetes appear to be closely associated with either reduced production of insulin by the beta cells in the pancreas or failure to transport circulating glucose into the tissues via glucose receptors. Nevertheless, the hyperglycemia and the subsequent complications involving multiple organs and systems [4] can be managed by a range of means, including diet control, lifestyle modifications, and effective therapeutic interventions such as insulin administration [4, 5]. However, due to its insidious development, the early diagnosis of diabetes remains one of the major challenges facing medical providers, and satisfactory screening tools or methods are still required, especially population- or community-based tools [6]. In the last decade, attempts have been made to identify the factors potentially associated with the development of diabetes by constructing predictive models with data mining techniques, with some promising results in predicting or even capturing diabetes at its early stage [4, 7-12]. Among these techniques, the decision tree has been widely used in the medical field to support diagnosis during clinical practice [4, 11, 13-15]. By creating a set of simple classification rules, this simple but sensitive approach can predict a disease by extracting meaningful information from a large dataset composed of many contributing factors [4, 14, 16]. Therefore, the objective of the present study was to employ a decision tree method as a support system for the rapid and automated identification of individuals with potential diabetes.

2.1. Study Population.
Subjects aged ≥18 years who underwent the annual health check-up in the Department of Family Medicine of Shengjing Hospital of China Medical University from January 2017 to June 2017 were enrolled in this retrospective cross-sectional study. Informed consent was obtained from all participants. The Ethics Committee of Shengjing Hospital of China Medical University approved this study (reference number: 2017PS42K).

By examining the medical record of each enrolled participant, nine input variables pertinent to the participant's health state were identified and later used for constructing the decision tree. These variables, which have been widely used by others in establishing diabetes prediction models [8, 11, 17-20], are listed in Table 1. They were obtained through either direct observations or laboratory tests of each participant. In detail, the demographic characteristics age and gender were analyzed for all subjects; family history of diabetes was defined as any family member previously having been diagnosed with diabetes (Yes = 1, No = 0); history of cardiovascular disease or stroke was defined as a previous diagnosis of coronary heart disease or stroke (Yes = 1, No = 0); physical activity was coded according to whether the participant took at least 30 minutes of exercise on 3 days per week (More = 1, Less = 0); work-related stress had three levels according to the participant's subjective impression (High = 2, Moderate = 1, Low = 0); and salty food preference was defined as preferring salty food in daily life (Yes = 1, No = 0). Anthropometric information comprised body mass index (BMI) and hypertension.
BMI was calculated as weight in kilograms divided by the square of height in meters (kg/m²), and a BMI ≥ 25 was defined as overweight; hypertension was defined as systolic blood pressure ≥ 140 mmHg or diastolic blood pressure ≥ 90 mmHg and/or current use of antihypertensive drugs for blood pressure control. Participants were classified as having prediabetes or diabetes based on a fasting plasma glucose cut-off value of ≥5.6 mmol/L [8, 11], and each medical report was then marked as either diabetes or nondiabetes accordingly.

A total of 15,323 records were initially examined for potential construction of a decision tree. To meet the strict criteria for building data mining algorithms, records with missing variables could not be used, since the decision tree approach will not work with missing data points [21]. Accordingly, 28.6% of the total data (4384 out of 15,323 records) failed to meet the criteria for data selection and were excluded. The excluded records were missing BMI (N = 1139, 26.0%), family history of diabetes (N = 313, 7.1%), history of cardiovascular disease or stroke (N = 1035, 23.6%), physical activity (N = 517, 11.8%), work-related stress (N = 618, 14.1%), or salty food preference (N = 762, 17.4%). Records from participants with a past history of diabetes (N = 503, 3.3% of all records) were also discarded. After this strict data filtration, the remaining 10,436 records from eligible participants were used for further analysis in this study. The final database consisted of 10 variables: 9 input variables and one target variable. The target variable consists of two classes, the presence of diabetes or no diabetes.
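The variable definitions and the record filtration described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the `Exam` record and its field names are hypothetical, while the thresholds and record counts are taken from the text. The missing-data categories are assumed to be mutually exclusive, as their percentages suggest.

```python
from dataclasses import dataclass


@dataclass
class Exam:
    """Hypothetical check-up record; field names are illustrative only."""
    weight_kg: float
    height_m: float
    systolic_mmhg: float
    diastolic_mmhg: float
    on_antihypertensives: bool
    fasting_glucose_mmol_l: float


def bmi(exam: Exam) -> float:
    # Weight in kilograms divided by the square of height in meters.
    return exam.weight_kg / exam.height_m ** 2


def is_overweight(exam: Exam) -> bool:
    return bmi(exam) >= 25


def has_hypertension(exam: Exam) -> bool:
    # Elevated systolic or diastolic pressure, or current drug treatment.
    return (exam.systolic_mmhg >= 140
            or exam.diastolic_mmhg >= 90
            or exam.on_antihypertensives)


def target_class(exam: Exam) -> str:
    # Prediabetes and diabetes share one label (FPG >= 5.6 mmol/L).
    return "diabetes" if exam.fasting_glucose_mmol_l >= 5.6 else "nondiabetes"


# Record counts as reported in the text.
TOTAL_RECORDS = 15_323
MISSING = {
    "BMI": 1139,
    "family history of diabetes": 313,
    "history of cardiovascular disease or stroke": 1035,
    "physical activity": 517,
    "work-related stress": 618,
    "salty food preference": 762,
}
PAST_HISTORY_OF_DIABETES = 503

missing_total = sum(MISSING.values())
remaining = TOTAL_RECORDS - missing_total - PAST_HISTORY_OF_DIABETES
```

With these counts, `missing_total` is 4384 (28.6% of 15,323) and `remaining` is 10,436, which matches the size of the final dataset.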
Classification of individuals as being at risk for diabetes or not was performed using the decision tree classifier J48 (C4.5 algorithm), a robust classifier that operates on numeric and nominal attributes, as implemented in WEKA 3.8.1 (Waikato Environment for Knowledge Analysis, University of Waikato, New Zealand) [22]. With this technique, a tree is constructed to model the classification process based on the most informative attributes. At each iteration, the attribute with the maximum gain ratio is selected as the splitting attribute. Data mining methods often divide the dataset into two parts: a training dataset and a testing dataset. The prediction model is first constructed on the training dataset and then tested on the testing dataset [23]. Consistent with these methods, the final dataset, consisting of 10,436 records, was randomly divided into a training group and a test group. The training group comprised 7305 cases (70% of the whole dataset). The remaining 3131 cases (30% of the whole dataset) were allocated to the test group for model validation. In the decision tree, the first variable (root) is the most important factor, and variables farther from the root are next in importance in classifying the data [4]. All the variables in one path are considered predictors (IF part), and the class label of the leaf node is the expected outcome (THEN part). To avoid overfitting and maintain parsimony, the tree may be pruned by removing nonessential terminal branches based on defined algorithms without affecting the classification accuracy [4, 24]. We used a confusion matrix to determine the performance of the decision tree for diabetes. In this study, "diabetes" was defined as a positive event and "nondiabetes" as a negative event. The two-class confusion matrix was used to extract true positives, true negatives, false positives, and false negatives.
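The gain-ratio criterion mentioned above for choosing the splitting attribute can be sketched as follows. This is a simplified illustration for nominal attributes only, under stated assumptions: the toy records and attribute names are hypothetical, and refinements of the full C4.5/J48 implementation (numeric threshold search, pruning) are omitted.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class label after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes attributes with many small partitions.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: 1 = diabetes, 0 = nondiabetes.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
attributes = {
    "family_history": [1, 1, 1, 0, 0, 0, 0, 0],
    "gender": [0, 1, 0, 1, 0, 1, 0, 1],
}

ratios = {name: gain_ratio(vals, labels) for name, vals in attributes.items()}
best_attribute = max(ratios, key=ratios.get)
```

In this toy dataset `family_history` separates the classes perfectly, so it has the larger gain ratio and would be chosen as the splitting attribute at this node.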
Using the Statistical Package for the Social Sciences (SPSS) version 20.0, a chi-square test was used to compare the categorical characteristics between the nondiabetes and diabetes groups. To measure the performance of the model, we used accuracy, precision, and recall. A receiver operating characteristic (ROC) graph is a technique for visualizing, organizing, and selecting classifiers based on their performance. The area under the ROC curve (AUC) of a classifier can be interpreted as the probability that the classifier ranks a randomly selected positive case higher than a randomly selected negative case.

The characteristics of the 10,436 records, divided into two groups (diabetes and nondiabetes), are shown in Table 1. Chi-square analysis revealed significant differences between the two groups, as indicated in the table. We found that 1570 persons were diabetic and 8866 were not. A decision tree was built on the training dataset (N = 7305), while the testing dataset (N = 3131) was used to evaluate the model. By applying the J48 algorithm, a decision tree with 19 nodes and 20 leaves was built, and the results are illustrated in Figure 1. The tree shows that age is the most discriminatory attribute, followed by BMI, family history of diabetes, work-related stress, physical activity, salty food preference, hypertension, gender, and history of cardiovascular disease or stroke. Table 2 lists all 19 IF-THEN rules created by the model. The model was further evaluated by applying a confusion matrix analysis to the testing dataset, and the result is shown in Table 3. The prediction model had an accuracy of 94.2%: 2948 out of 3131 individuals were correctly classified, whereas only 5.8% (183 out of 3131 individuals) were incorrectly classified.
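The performance measures defined above can be written out directly. The sketch below is illustrative only: the per-cell counts of Table 3 are not reproduced in the text, so the confusion-matrix cells and the scores used for the AUC example are hypothetical, and only the totals of correctly and incorrectly classified test records (2948 and 183) come from the study.

```python
def accuracy(tp, tn, fp, fn):
    # Share of all cases that were classified correctly.
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    # Share of predicted positives that are truly positive.
    return tp / (tp + fp)


def recall(tp, fn):
    # Share of true positives that were detected.
    return tp / (tp + fn)


def auc_by_ranking(positive_scores, negative_scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win, following the usual rank-based definition.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Totals reported for the testing dataset.
correct, incorrect = 2948, 183
reported_accuracy = correct / (correct + incorrect)

# Hypothetical confusion-matrix cells and classifier scores.
toy = {"tp": 40, "tn": 45, "fp": 10, "fn": 5}
toy_auc = auc_by_ranking([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

Here `reported_accuracy` rounds to 0.942, in line with the 94.2% stated above. For the toy scores, eight of the nine positive-negative pairs are ranked correctly, so `toy_auc` is 8/9.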
The area under the ROC curve (AUC) for the model was 94.8%, indicating that the model ranks a randomly selected diabetic case above a randomly selected nondiabetic case with high probability. Furthermore, the precision and recall values for the model were 94.0% and 94.2%, respectively, which were well balanced. The decision tree demonstrated that among participants aged ≤ 49, 5497 (97%) were identified as nondiabetic, whereas among those aged > 49, 771 (50%) were identified as nondiabetic. In the subgroup with 34 < age ≤ 49 and BMI ≥ 25, 89 (92%) of the 97 individuals with a positive family history of diabetes were identified as diabetic, whereas 576 (58%) of those without a family history of diabetes were identified as nondiabetic. Work-related stress was identified as being associated with diabetes. Among individuals with 34 < age ≤ 49, BMI ≥ 25, and no family history of diabetes, 22 (51%) of those with high work-related stress were identified as nondiabetic, compared with 349 (88%) of those with low or moderate work-related stress (Table 2).

By factoring nine variables pertinent to the prediction of diabetes in a community-based setting into a sensitive decision tree model, we systematically analyzed a large dataset collected solely from a Chinese population. Our decision tree approach yielded a highly accurate (94.2%) model with balanced precision (94.0%) and recall (94.2%).
Although all nine variables (age, gender, BMI, hypertension, history of cardiovascular disease or stroke, family history of diabetes, physical activity, work-related stress, and salty food preference) contributed to an early and noninvasive prediction of diabetes before expensive laboratory tests become necessary, age played the dominant role as the single most discriminatory attribute. After age, BMI stood out as another critical predictive attribute. Subjects with BMI ≥ 25 who were older than 34, no older than 49, and had a positive family history of diabetes were more likely to be identified as diabetic. Under such circumstances, BMI itself could be considered a key predictor. In contrast, work-related stress appeared to be an important predictive factor for those younger than 49. Our findings were consistent with previous results [11, 25, 26]. As suggested by our decision tree model, participants with BMI < 25 who were younger than 49 carry a low risk of having diabetes. Similarly, as shown in rules 1 and 13 (Table 2), subjects with BMI < 25 and active physical exercise have a much lower chance of developing diabetes. In addition, work-related stress appeared to be another key risk factor, as older people with a relatively higher BMI had a relatively higher likelihood of developing diabetes. Because the decision tree combines multiple variables pertinent to the prediction of diabetes in a community-based setting, medical care providers could use this simple but meaningful approach in their practice to identify people who are at high risk for diabetes [27].
Thus, by screening candidates for potential diabetes, this decision tree approach becomes a powerful tool for health care providers during their routine encounters with patients in a community-based setting. The successful identification of a person with early diabetes could not only reduce the chance that the patient develops full-blown diabetes with severe complications but also significantly mitigate the direct and indirect costs incurred. Additionally, the key features presented in the tree structure could further facilitate diabetes prevention through community interventions, as individuals at high risk of developing diabetes could be quickly and easily identified and targeted for education programs specifically designed for diabetes. Indeed, several large-scale trials have demonstrated the benefits of preventing diabetes with simple lifestyle interventions [8, 11, 28-31]. Appropriately implemented preventive interventions in the early stage of the disease could make its progression less aggressive and less expensive, both for households with individuals with diabetes and for the health care system. It is worth pointing out that such an approach can be extended to the management of other chronic diseases in a community setting.

Nevertheless, the present study could have been improved by conducting it at multiple sites. A longitudinal approach with additional prospective data would also better explore the relationships among the risk factors studied here and other risk factors with distinctive predictive value. As a direction for future study, prediction models with sensitivity and specificity higher than those reported in the present study should be sought in conjunction with data collected from multiple sites.
In doing so, additional variables could be incorporated into this model to establish a more comprehensive decision tree, making the holistic approach toward an individual with the potential to develop diabetes more precise and effective [32].

We proposed a classifier, based on a decision tree, to identify potential cases of diabetes from a database of annual health check-up reports in a large Chinese hospital. We used nine easily obtained, noninvasive patient features as predictor variables. The classifier, trained by J48, indicates that a decision tree analysis can be successfully applied to screen for diabetes, which will support clinical practitioners in rapid diabetes identification. This type of work is essential in regions where the epidemiologic risk is high and medical expenses are unaffordable for most of the population. In addition, the proposed model provides a means to target the prevention of diabetes through community interventions, which could help improve early diabetes diagnosis and reduce the burden on the health system. Such an approach will be cost-effective when more chronic disease-related care can be avoided through effective case management.

BMI: Body mass index
AUC: The area under the receiver operating characteristic curve
WHO: World Health Organization
SPSS: Statistical Package for the Social Sciences
WEKA: Waikato Environment for Knowledge Analysis

The data used to support the findings of this study are available from the corresponding author upon request.

The authors declare no conflict of interest, and the work has not been published elsewhere.

Economic costs of diabetes in the U.S.
Comparative approaches for classification of diabetes mellitus data: machine learning paradigm
Applying decision tree for identification of a low risk population for type 2 diabetes. Tehran Lipid and Glucose Study
Prediction models for risk of developing type 2 diabetes: systematic literature search and independent external validation study
World Health Organization, 2008-2013 action plan for the global strategy for the prevention and control of noncommunicable diseases, World Health Organization
Comparison of predictive models for the early diagnosis of diabetes
Screening for prediabetes using machine learning models
Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes - ELSA-Brasil: accuracy study
Type 2 diabetes mellitus screening and risk factors using decision tree: results of data mining
Comparison of three data mining models for predicting diabetes or prediabetes by risk factors
Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes
Predictive data mining in clinical medicine: current issues and guidelines
Tuberculosis transmission in nontraditional settings: a decision-tree approach
Data-mining technologies for diabetes: a systematic review
Data mining in healthcare and biomedicine: a survey of the literature
Diet, lifestyle, and the risk of type 2 diabetes mellitus in women
Carbohydrate and fiber recommendations for individuals with diabetes: a quantitative assessment and meta-analysis of the evidence
Weight gain as a risk factor for clinical diabetes mellitus in women
Moderate alcohol consumption lowers the risk of type 2 diabetes: a meta-analysis of prospective observational studies
The application of a decision tree to establish the parameters associated with hypertension
Identification of metabolic syndrome using decision tree analysis
Building predictive models for MERS-CoV infections using data mining techniques
Prediction of periventricular leukomalacia. Part I: selection of hemodynamic features using logistic regression and decision tree algorithms
Prevalence and risk factors of diabetes in a community-based study in North India: the Chandigarh Urban Diabetes Study (CUDS)
Lifestyle factors and risk for new-onset diabetes: a population-based cohort study
Development and validation study of a non-alcoholic fatty liver disease risk scoring model among adults in China
The long-term effect of lifestyle interventions to prevent diabetes in the China Da Qing diabetes prevention study: a 20-year follow-up study
Lifestyle intervention for prevention of type 2 diabetes in primary health care: one-year follow-up of the Finnish National Diabetes Prevention Program (FIN-D2D)
Prevention of type 2 diabetes mellitus by changes in lifestyle among subjects with impaired glucose tolerance
Electronic health record phenotyping improves detection and screening of type 2 diabetes in the general United States population: a cross-sectional, unselected, retrospective study
A software tool for determination of breast cancer treatment methods using data mining approach

This work was supported by the China Medical Board (Grant number 15-219).