key: cord-0819870-03mhju8e authors: Yang, TengFei; Zhao, Bo; Pei, Dongmei title: Estimation of the Prevalence of Nonalcoholic Fatty Liver Disease in an Adult Population in Northern China Using the Data Mining Approach date: 2021-07-28 journal: Diabetes Metab Syndr Obes DOI: 10.2147/dmso.s320808 sha: 810b0fc43fa944c5a726a9d214a7ade6d198302f doc_id: 819870 cord_uid: 03mhju8e BACKGROUND: Nonalcoholic fatty liver disease (NAFLD) is the commonest form of chronic liver disease worldwide and its prevalence is rapidly increasing. Screening and early diagnosis of high-risk groups are important for the prevention and treatment of NAFLD; however, traditional imaging examinations are expensive and difficult to perform on a large scale. This study aimed to develop a simple and reliable predictive model based on the risk factors for NAFLD using a decision tree algorithm for the diagnosis of NAFLD and reduction of healthcare costs. METHODS: This retrospective cross-sectional study included 22,819 participants who underwent annual health examinations between January 2019 and December 2019 at Physical Examination Center in Shengjing Hospital of China Medical University. After rigorous data screening, data of 9190 participants were retained in the final dataset for use in the J48 decision tree algorithm for the construction of predictive models. Approximately 66% of these patients (n=6065) were randomly assigned to the training dataset for the construction of the decision tree, while 34% of the patients (n=3125) were assigned to the test dataset to evaluate the performance of the decision tree. RESULTS: The results showed that the J48 decision tree classifier exhibited good performance (accuracy=0.830, precision=0.837, recall=0.830, F-measure=0.830, and area under the curve=0.905). The decision tree structure revealed waist circumference as the most significant attribute, followed by triglyceride levels, systolic blood pressure, sex, age, and total cholesterol level. 
CONCLUSION: Our study suggests that decision tree analysis can be used to screen individuals at high risk of NAFLD. The key attributes in the tree structure can further contribute to NAFLD prevention by suggesting implementable, targeted community interventions, which can help improve NAFLD outcomes and reduce the burden on the healthcare system.

Nonalcoholic fatty liver disease (NAFLD) is a type of metabolic stress liver injury that is closely associated with insulin resistance and genetic susceptibility. [1] [2] [3] If it is not managed early, it may progress to cirrhosis and liver cancer, both of which carry higher mortality. 3 NAFLD is one of the commonest chronic liver diseases worldwide, affecting approximately one-quarter of adults. In the United States, the prevalence of NAFLD among adults was 24.13% in 2015. 4 In 2020, it was predicted that the prevalence of NAFLD in China would increase thereafter at an annual rate of 0.5%. 5 In recent years, the prevalence of fatty liver among younger patients has been increasing. Among them, patients with NAFLD are more prone to severe cardiovascular and cerebrovascular diseases, which have high mortality and morbidity. 6 Therefore, population-based interventions for NAFLD prevention are urgently needed; these include early diagnosis of NAFLD, helping patients modify their lifestyle, and prescribing appropriate treatment.

The gold standard diagnostic method for NAFLD is liver tissue biopsy; however, it is not widely used in clinical practice because of its invasive nature, high technical requirements, and low patient acceptance. 7, 8 Biochemical and imaging criteria (ultrasonography, computed tomography, and magnetic resonance imaging findings) are used in NAFLD diagnosis. However, imaging techniques are expensive and, therefore, not suitable for mass screening of asymptomatic individuals.
Moreover, the accuracy of imaging is subjective and depends on the operator's assessment. Considering the ongoing coronavirus disease outbreak, the risk of infection should also be minimized. Therefore, a simple, highly accurate, and non-invasive method should be developed to identify individuals at high risk of NAFLD.

Data mining techniques have advanced rapidly in recent years. Data mining is a practical branch of artificial intelligence that yields well-defined and useful information by selecting, exploring, and modeling large amounts of data to discover unknown patterns or relationships. 9,10 It includes traditional and non-traditional statistical methods such as logistic regression and decision tree analysis. In this study, data mining algorithms were used to study the patterns of NAFLD onset. Data mining can be used to screen patients at high risk of NAFLD because it allows the automated extraction of rules from large-scale data. It also assists in developing strategies for NAFLD prevention through education and counseling of high-risk individuals. Therefore, this study aimed to explore the feasibility of using common risk factors to screen patients for NAFLD and to develop classifiers for NAFLD using exploratory data mining techniques.

A total of 13,629 participants with (i) a mean alcohol consumption of >140 g/week for men and >70 g/week for women in the previous month, (ii) other specific diseases causing fatty liver, (iii) severe liver insufficiency, or (iv) missing data were excluded. Finally, 9190 participants were included in this study (Figure 1). Six input variables drawn from other predictive models of NAFLD or from risk factor studies were selected for this study.
[11] [12] [13] The data collection process was as follows: ① Recording of the participants' general conditions; ② Anthropometric measurement: waist circumference was measured at the midpoint of the line between the lower edge of the costal arch and the iliac spine, and the average of two consecutive measurements, accurate to 0.5 cm, was recorded; ③ Blood pressure measurement: blood pressure was measured three times at one-minute intervals with the participant at rest, and the average of the three measurements was calculated; ④ Blood biochemical tests: participants were prohibited from consuming food and water after 12:00 a.m. on the day of the physical examination, and venous blood was drawn at 8:00 a.m. Fasting blood glucose, triglyceride (TG), and total cholesterol (TC) levels were measured on the same day. All tests were performed using the same reagents and methods. Blood glucose was measured using the glucose oxidase method, and TG and TC levels were measured using the enzymatic method; all assays were performed by the laboratory staff of Shengjing Hospital of China Medical University.

NAFLD was diagnosed in participants 1) with a mean alcohol consumption of <140 g/week for men and <70 g/week for women in the previous month, 2) with negative hepatitis B serum antigen and/or anti-hepatitis C virus tests, 3) with a definite ultrasonography-based diagnosis of fatty liver, and 4) without other liver diseases, such as drug-related liver injury. The presence of at least two of the following ultrasonography findings was used for the primary diagnosis of NAFLD: 1 (i) diffuse enhancement of near-field echoes in the liver, with echoes stronger than those in the kidneys; (ii) poor visualization of intrahepatic ductal structures; and (iii) progressive attenuation of far-field echoes in the liver. Abdominal ultrasonography was performed by experienced and uniformly trained ultrasonographers.
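The diagnostic criteria above combine an alcohol threshold, viral marker results, an "at least two of three" ultrasonography rule, and the exclusion of other liver diseases. The following Python sketch shows one way to express that logic. It is an illustration only, not the procedure used in the study: the `Exam` record and its field names are hypothetical, and the viral marker criterion is collapsed into a single boolean because the text does not specify how the two tests are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exam:
    """Hypothetical record of one participant's examination."""
    is_man: bool
    alcohol_g_per_week: float      # mean intake over the previous month
    viral_markers_negative: bool   # hepatitis B antigen / anti-HCV tests
    other_liver_disease: bool      # e.g. drug-related liver injury
    near_field_enhanced: bool      # ultrasonography finding (i)
    ducts_poorly_seen: bool        # ultrasonography finding (ii)
    far_field_attenuated: bool     # ultrasonography finding (iii)


def fatty_liver_on_ultrasound(e: Exam) -> bool:
    """True when at least two of the three ultrasonography findings are present."""
    findings = (e.near_field_enhanced, e.ducts_poorly_seen, e.far_field_attenuated)
    return sum(findings) >= 2


def meets_nafld_criteria(e: Exam) -> bool:
    """Apply the four diagnostic criteria listed in the text."""
    alcohol_limit = 140 if e.is_man else 70  # g/week, from the stated thresholds
    return (
        e.alcohol_g_per_week < alcohol_limit
        and e.viral_markers_negative
        and fatty_liver_on_ultrasound(e)
        and not e.other_liver_disease
    )


# Illustrative participant (invented values): a woman drinking 30 g/week
# with findings (i) and (ii) present.
example = Exam(False, 30.0, True, False, True, True, False)
print(meets_nafld_criteria(example))
```

Keeping the ultrasonography rule in its own function makes the "two of three" threshold easy to audit separately from the exclusion criteria.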
A Philips iU22 color Doppler ultrasound scanner with a linear array transducer (frequency range, 5-13 MHz) was used.

Data mining algorithms, especially decision trees, do not handle records with missing values well. Therefore, records with missing variables had to be removed from the dataset. Records were deleted because of missing values for waist circumference (n=8760, 38.3%), systolic blood pressure (n=4689, 20.5%), TG level (n=2589, 11.3%), and TC level (n=421, 1.8%). After rigorous data screening, the records of the remaining 9190 eligible participants were used for further analysis.

We trained five widely used classifiers on the dataset: J48, AdaBoostM1, SMO, Bayes Net, and Naïve Bayes. The performance of all classifiers is shown in Table 1. Because it gave the most favorable prediction results across the runs, we chose the J48 algorithm. J48, which can handle both continuous and discrete attributes, was used to construct the decision tree model in WEKA software, version 3.8.1. The dependent (output) variable was binary: 0 meant healthy and 1 meant having NAFLD. The independent (input) variables were age, waist circumference, sex (men=1, women=2), systolic blood pressure, TC level, and TG level; these were significant risk factors in the t-test or chi-square test. The dataset was randomly divided into two groups: a training dataset containing the data of 66% of the participants (n=6065) and a test dataset containing the data of the remaining 34% (n=3125). The estimation model was constructed using the training dataset and then tested on the test dataset.
14 J48 is WEKA's implementation of the C4.5 algorithm. C4.5 constructs a decision tree from the characteristics of the data samples: the leaf nodes of the tree represent the categories obtained after classification, the non-leaf nodes represent attributes of the data, and each path from the root node to a leaf node is a classification rule. In essence, the algorithm generalizes a set of classification rules from the training data; these rules are mutually exclusive and collectively exhaustive. To avoid overfitting and maintain parsimony, the tree can be pruned by removing unimportant end branches according to the defined algorithm without affecting the classification accuracy. 15

In this study, "patients with NAFLD" were defined as positive events and "patients without NAFLD" as negative events. True positives, true negatives, false positives, and false negatives were extracted from confusion matrices. Accuracy, precision, recall, F-measure, and area under the receiver operating characteristic curve (AUC) were used to evaluate the performance of the models.

All variables were included in the descriptive statistical analysis. Quantitative data were expressed as mean ± standard deviation (x̄ ± s), and the independent samples t-test was used to compare the two groups. Qualitative data were expressed as relative numbers, and the χ² test was used to compare the two groups. A p value of less than 0.05 was considered significant. All statistical analyses were performed using SPSS 19 statistical software. The receiver operating characteristic (ROC) curve was used to evaluate the predictive performance of each algorithm for NAFLD.

The characteristics of the participants with and without NAFLD are shown in Table 2. There were significant differences in age, waist circumference, sex, systolic blood pressure, TC level, and TG level between the two groups (p<0.05).
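The evaluation metrics named in the Methods (accuracy, precision, recall, and F-measure) all follow from the four confusion-matrix counts. The Python sketch below shows the arithmetic, assuming NAFLD is the positive class. The study itself used WEKA; the `ConfusionMatrix` class is hypothetical, and the counts are invented for illustration and are not the values in Table 3. AUC is omitted because it requires ranked prediction scores rather than a single confusion matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary classifier; 'positive' means NAFLD present."""
    tp: int  # NAFLD correctly identified
    fp: int  # healthy participants flagged as NAFLD
    fn: int  # NAFLD cases missed
    tn: int  # healthy participants correctly identified

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def accuracy(self) -> float:
        # Share of all participants classified correctly.
        return (self.tp + self.tn) / self.total

    def precision(self) -> float:
        # Share of predicted NAFLD cases that truly have NAFLD.
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # Share of true NAFLD cases that the model detects.
        return self.tp / (self.tp + self.fn)

    def f_measure(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r)


# Hypothetical counts for illustration only; not the study's results.
cm = ConfusionMatrix(tp=40, fp=10, fn=20, tn=30)
print(f"accuracy={cm.accuracy():.3f} precision={cm.precision():.3f} "
      f"recall={cm.recall():.3f} F={cm.f_measure():.3f}")
```

These are positive-class metrics. When recall is instead averaged over both classes with weights proportional to class size, it equals accuracy by construction, which would be consistent with the identical accuracy and recall values reported for J48 (0.830); the source does not state which averaging was used.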
The data were divided into a training dataset (66% of the total, n=6065) and a test dataset (the remaining 34%, n=3125). Six variables were used as input variables of the model. The model was further evaluated by performing a confusion matrix analysis on the test dataset (class = NAFLD), from which accuracy, precision, and recall were calculated. The results are shown in Table 3. The accuracy of the model was 83.0%: of the 3125 individuals, 2594 were correctly classified and 531 (16.99%) were incorrectly classified.

The performance of all the classifiers is shown in Table 1. J48 showed better results than the other classifiers (accuracy=0.830, precision=0.837, recall=0.830, F-measure=0.830, and area under the curve=0.905). The ROC curves of all the classifiers are shown in Figure 2.

A decision tree with 22 nodes and 23 leaves was constructed using the J48 algorithm (Figure 3). The decision tree structure showed that waist circumference was the most significant attribute, followed by TG level, systolic blood pressure, sex, age, and TC level. The decision rules derived from the tree are presented in Table 4. With the introduction of additional, highly correlated input variables for multiple segmentations, the decision rules became more exhaustive. Under similar conditions, however, more unique interactions were observed between different input variables, more significant results were obtained, and the predictive values were good.

In this study, a decision tree model for NAFLD screening was developed using data from a physically healthy population in northeastern China. According to the model, waist circumference, TG level, systolic blood pressure, sex, age, and TC level were significant screening factors for NAFLD. Data on waist circumference, systolic blood pressure, sex, and age are easy to collect, and the devices used for measuring TC and TG levels are widely available and inexpensive.
These six indicators can therefore be used to identify high-risk individuals among patients with limited financial means and in primary hospitals without liver imaging equipment.

In this study, five classification algorithms for the prediction of NAFLD were compared. Several evaluation metrics were used to determine the effectiveness of the classification algorithms on the WEKA data mining platform. Based on our findings, the J48 algorithm provided the best classification results on the NAFLD dataset. The J48 decision tree algorithm is widely used because of its low computational load, easy-to-understand rules, ability to handle continuous and discrete attributes, automatic capture of multi-layer interactions between predictors, and ease of interpretation and visualization. 15, 16 Decision trees are simple and can be used effectively in public health programs for early NAFLD screening and health interventions in the future. 17 In this study, the decision tree rules for NAFLD screening were constructed using a large sample of data from a physically healthy population; these rules can be used in epidemiological screening of persons at high risk of NAFLD.

The first variable in the tree (the root) is the most important factor, while variables farther from the root are considered secondary factors for data classification. 15 The present study showed that waist circumference was the most important factor distinguishing the presence and absence of NAFLD, followed by TG level, systolic blood pressure, sex, age, and TC level. The global prevalence of fatty liver is closely related to the rapidly increasing prevalence of obesity. 18, 19 Waist circumference is a convenient measure of abdominal obesity that correlates with abdominal fat volume and area. 20 It is also an independent predictor of NAFLD severity and steatosis.
21 Recent studies have reported a strong association between increased waist circumference and NAFLD risk. 22 Based on these data, reducing waist circumference can help prevent NAFLD. This is a feasible intervention for community health care systems; several methods should be used to decrease waist circumference in people with abdominal obesity.

Blood pressure has a significant impact on the development of NAFLD and interacts with increased waist circumference. In a cross-sectional study of 5362 individuals, elevated blood pressure was found to be a risk factor for the ultrasound diagnosis of NAFLD. 23 Moreover, the incidence of NAFLD was 30% higher in pre-hypertensive patients and up to 80% higher in patients with hypertension than in normotensive patients. In this study, Rule 18 of the 22 rules showed that IF a patient had a waist circumference >95.8 cm, systolic blood pressure (SBP) >109 mmHg, TG level >1.08 mmol/L, and TC level >3.59 mmol/L, THEN the patient had NAFLD (1741/169). In WEKA's leaf notation, 1741 instances reached this leaf and 169 of them were misclassified, so the accuracy of the rule was (1741 − 169)/1741 = 90.3%. The introduction of waist circumference, TG level, TC level, and systolic blood pressure as input variables into the algorithm ensured efficient detection of patients with NAFLD. The significance of this finding is that NAFLD can be prevented by reducing these risk factors and by developing more cost-effective measures. The rules in our tree model suggest that reducing waist circumference and controlling lipid and blood pressure levels through lifestyle modification can decrease the prevalence of NAFLD.

Moreover, this is the first study investigating a screening method for NAFLD in people with a normal waist circumference. The left branch of the decision tree diagram focuses on the screening rules for NAFLD when the waist circumference is less than 84.5 cm.
For example, Rule 10 suggests that IF a person has a waist circumference >72 cm and ≤84.5 cm, TG level >1.79 mmol/L, and SBP >109 mmHg, and is a woman, THEN the patient has NAFLD (315/98). In particular, it provides ideas for screening for NAFLD in women with a normal waist circumference. Although the classifier rules have a slightly lower diagnostic efficacy for screening NAFLD in people with a normal waist circumference, they are still cost-effective and useful for rapid NAFLD screening during community health services in developing countries.

This study has some limitations. First, only participants who underwent physical examinations in northern China were included; this limits the generalizability of the findings. Most participants were employees of enterprises and institutions or unemployed people with stable financial status; the composition of the study population was therefore limited, which may explain the higher prevalence in our study than in previous studies. The results should thus be validated in future studies that include participants of different ethnicities and genetic backgrounds. Second, this study used ultrasonography, not liver tissue biopsy, as the reference standard for diagnosing NAFLD. Liver tissue biopsy cannot be used on a large scale, whereas ultrasonography has a high degree of accuracy and is the most widely used method in population-based studies. Finally, the results of this retrospective cross-sectional study need to be validated in a large prospective study assessing the outcomes of interventions such as waist circumference reduction for NAFLD prevention.

We propose a decision tree-based classification model that can help clinicians at the grassroots level in developing countries to rapidly screen patients for NAFLD.
The key attributes of the decision tree can further contribute to the prevention of NAFLD through the implementation of targeted community interventions, improvement of NAFLD outcomes, and reduction of the burden on the healthcare system. More comprehensive and rigorous prospective studies should be conducted in the future to explore the sensitivity and specificity of waist circumference, blood pressure, and lipid levels for the early diagnosis and treatment of NAFLD.

The datasets used and analyzed in the present study can be made available from the corresponding author upon reasonable request.

This human study was approved by the Shengjing Hospital of China Medical University Ethics Committee (ref. Ethics 2019PS089J). It was a retrospective study with no direct intervention. All patient data were anonymized, and the patients' information and privacy were fully protected. Therefore, the institutional review board waived the need for written informed consent from the participants. This study complied with the Declaration of Helsinki.

Table 4 (partial). Decision rules; [...] marks missing text. WC, waist circumference; SBP, systolic blood pressure.
[...] WC≤84.5 cm and TG>0.79 mmol/L, sex=women, SBP≤109 mmHg, THEN patient without NAFLD
[...] 79 mmol/L, sex=women, SBP>109 mmHg and age>71, THEN patient without NAFLD (38/1)
Rule 5: IF WC≤79 cm and 1.79≥TG>0.79 mmol/L, sex=women, SBP>109 mmHg and age≤71, THEN patient without NAFLD (566/92)
Rule 6: IF 79[...]1.06 mmol/L, sex=women [...]
Rule 15: IF 87[...]109 mmHg, 0.79[...]95.5 cm, SBP>109 mmHg and 0.79[...]84.5 cm, SBP>109 mmHg, TG>1.08 mmol/L and TC≤3.59 mmol/L, THEN patient without NAFLD (47)
Rule 18: IF WC>95.8 cm, SBP>109 mmHg, TG>1.08 mmol/L and TC>3.59 mmol/L, THEN patient with NAFLD (1741/169)

The Chinese National Workshop on Fatty Liver and Alcoholic Liver disease for the Chinese Liver Disease Association.
Guidelines for management of nonalcoholic fatty liver disease: an updated and revised edition
Nonalcoholic fatty liver disease: a systematic review
Cause, pathogenesis, and treatment of nonalcoholic steatohepatitis
Global epidemiology of nonalcoholic fatty liver disease-meta-analytic assessment of prevalence, incidence, and outcomes
Prevalence of fatty liver disease and the economy in China: a systematic review
Prevalence of nonalcoholic fatty liver disease in the United States: the third national health and nutrition examination survey, 1988-1994
Asia-Pacific Working Party on non-alcoholic fatty liver disease guidelines 2017-part 1: definition, risk factors and assessment
Development of chronic kidney disease in patients with non-alcoholic fatty liver disease: a cohort study
Data mining and computational modeling of high-throughput screening datasets
Semi supervised data mining model for the prognosis of pre-diabetic conditions in type 2 diabetes mellitus
Non-invasive assessment of non-alcoholic fatty liver disease: clinical prediction rules and blood-based biomarkers
Prevalence and risk factors of nonalcoholic fatty liver disease and advanced fibrosis in general population: the French Nationwide NASH-CO Study
Non-alcoholic fatty liver disease: a review of epidemiology, risk factors, diagnosis and management
Building predictive models for MERS-CoV infections using data mining techniques
Applying decision tree for identification of a low risk population for type 2 diabetes
Tuberculosis transmission in nontraditional settings: a decision-tree approach
Data mining in healthcare and biomedicine: a survey of the literature
Obesity and nonalcoholic fatty liver disease: from pathophysiology to therapeutics
Fatty liver disease: is it nonalcoholic fatty liver disease or obesity-associated fatty liver disease
Waist circumference and abdominal sagittal diameter: best simple anthropometric indexes of abdominal visceral adipose tissue accumulation and related cardiovascular risk in men and women
Nonalcoholic fatty liver disease: is all the fat bad
Blood pressure is associated with the presence and severity of nonalcoholic fatty liver disease across the spectrum of cardiometabolic risk

All authors made a significant contribution to the work reported, whether in the conception, study design, execution, acquisition of data, analysis, and interpretation or all; took part in drafting, revising or critically reviewing the article; gave final approval of the version to be published; agreed on the manuscript to be submitted; and agreed to be accountable for all aspects of the work.

This study was funded by China Medical Board under the grant number #15-219.

All authors declare that they have no conflicts of interest.