key: cord-0029887-ucv0g9iz
authors: Edeh, Michael Onyema; Khalaf, Osamah Ibrahim; Tavera, Carlos Andrés; Tayeb, Sofiane; Ghouali, Samir; Abdulsahib, Ghaida Muttashar; Richard-Nnabu, Nneka Ernestina; Louni, AbdRahmane
title: A Classification Algorithm-Based Hybrid Diabetes Prediction Model
date: 2022-03-31
journal: Front Public Health
DOI: 10.3389/fpubh.2022.829519
sha: 593e93b54e56ce6805a3f25e99ed41f00b95ab2b
doc_id: 29887
cord_uid: ucv0g9iz

Diabetes is considered one of the leading causes of death globally. If diabetes is not detected and treated early, it can lead to a variety of complications. The aim of this study was to develop a model that predicts the likelihood of developing diabetes with the greatest possible precision. Classification algorithms are widely used in the medical field to sort data into categories according to criteria that are specific to each classifier. Four supervised machine learning classification algorithms (Random Forest, SVM, Naïve Bayes, and Decision Tree (DT)), together with the unsupervised k-means algorithm, were therefore used in this investigation to identify diabetes in its early stages. The experiments were performed on two databases: one extracted from the Frankfurt Hospital in Germany, and the PIMA Indian Diabetes Database (PIDD) provided by the UCI Machine Learning Repository. On the database extracted from Frankfurt Hospital, the Random Forest algorithm achieved the highest accuracy, 97.6%, while on the Pima Indian database the SVM algorithm achieved the highest accuracy, 83.1%, compared with the other algorithms. The validity of these results is supported by splitting the data set into two parts, a training set and a test set, as described below. The training set is used to build the model, and the test set is used to exercise the model and measure its correctness.

Diabetes is a chronic disease, also known as a silent disease. The World Health Organization (WHO) defines diabetes as a disease that prevents the body from properly using the energy provided by the food it consumes. The disease occurs when there are problems with the hormone insulin, which is naturally produced by the pancreas to help the body use sugar and fat and store some of it (1). More plainly, when we eat, food is broken down into glucose (sugar), which provides the energy the body needs to function properly. During digestion, the blood carries the glucose throughout the body and supplies the cells. However, for the sugar in the blood to reach the cells, the body needs insulin, the hormone secreted by the pancreas that acts as a key letting glucose pass from the blood into the cells of our body (2). There are three common types of diabetes. Type 1 diabetes is a condition in which the pancreas cannot produce enough insulin, or produces none at all; it accounts for 5-10% of all diabetes cases and commonly appears in childhood and adolescence. Type 2 diabetes is a condition in which the insulin produced is not used effectively to maintain the blood sugar level in the body. This type of diabetes is most common in people aged 40 and above, but it also appears in younger people.
Type 2 diabetes accounts for 90-95% of all diagnosed diabetes cases worldwide. There is another type of diabetes called gestational diabetes, caused by a lack of response of the insulin receptors in the body's tissues even though insulin levels are normal, which makes this condition different from type 2 diabetes. This case is very rare, accounting for 1-2% of all diabetes cases, and it also increases the risk of developing type 2 diabetes later (3). People with diabetes must be treated according to their type of diabetes, and the goal of treatment is to keep the patient's blood sugar level within a normal range.

The related studies reviewed in this work are summarized below (study abstract and reference):
• In the proposed work, SVM with a radial basis function (RBF) kernel is used for classification. Performance parameters such as the classification accuracy (78.2%), sensitivity (80%), and specificity of the SVM with RBF were found to be high, making it a good option for the classification process.
• Ahmed: the accuracy of the proposed models was compared; the random forest method provided an accuracy of 74.7%, the ANN gave 75.7%, and the K-means clustering method gave 73.6%.
• The authors used data from the University of California (UCI) machine learning data repository to classify diabetes mellitus patients and compare the performance of four common classifiers (J48 DT, K-Nearest Neighbors, Random Forest, and Support Vector Machine). Preliminary results suggest that the J48 DT classifier outperforms the other three classifiers in terms of accuracy (73.82%) before data preparation, and that the KNN (k = 1) and Random Forest classifiers outperform the others after data pre-processing.
• Soleh et al.: the data in this study were divided into two parts, 75% for training and 25% for testing. The study produces an evaluation with an accuracy of 80%, which is better than the previous paper's 75.97%.
• Rajput et al.: the aim of the analysis in this research is to list the risk factors and the correlations that exist among them. Logistic regression, support vector machine, random forest, decision tree, Naive Bayes, and K-nearest-neighbor classifiers are used for prediction, and their accuracies are compared to choose the better machine learning model; SVM provides the highest accuracy (96.0%) among the chosen algorithms.
• Deepa et al.: this work proposes an artificial-intelligence-based intelligent system for earlier prediction of the disease using a Ridge-Adaline Stochastic Gradient Descent (RASGD) classifier. The results of the proposed scheme were compared with state-of-the-art machine learning algorithms such as support vector machine and logistic regression; the RASGD intelligent system attains an accuracy of 92%, which is better than the other selected classifiers.

The proposed process is presented as a model diagram in Figure 1 below, which depicts the flow of the research carried out to build the model. In this study, we split the data set into two parts, a training set and a test set: the training set is used to train the model, and the test set is used to test the model and evaluate its accuracy. In a second step, we use the k-means algorithm for data correction in order to improve performance and to control the resulting classification model (by varying the number of clusters). A minimal sketch of this data split is given below.
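As a concrete illustration of this first step, the following minimal sketch shows one way the split could be performed with scikit-learn. The file name "diabetes.csv" and the "Outcome" column follow the layout of the public PIMA Indian Diabetes CSV and are assumptions rather than the authors' exact files or settings.

```python
# Minimal sketch of the train/test split described above.
# Assumptions: a PIMA-style CSV named "diabetes.csv" with an "Outcome" column.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("diabetes.csv")
X = data.drop(columns=["Outcome"])   # the eight clinical attributes
y = data["Outcome"]                  # 0 = non-diabetic, 1 = diabetic

# The paper evaluates both an 80/20 and a 70/30 split (test_size = 0.2 or 0.3).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), "training records,", len(X_test), "test records")
```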
The trained algorithms are then tested on the test database, and finally the resulting models are evaluated against other related work.

The Support Vector Machine is based on statistical learning theory. SVMs were originally developed for binary classification, but they can be effectively extended to multi-class problems. An SVM, often trained with Sequential Minimal Optimization (SMO), is a learning system that uses a hypothesis space of linear functions in a high-dimensional feature space and is trained with an optimization-based learning algorithm whose learning bias is derived from statistical learning theory. The SVM implements nonlinear class boundaries by mapping the input vectors into a high-dimensional feature space in which a linear model is fitted; this mapping is realized through the SVM's kernel. Support vectors are the training instances that lie closest to the maximum-margin hyperplane; the remaining training samples play no role in defining the binary class boundary. The support vectors are therefore used to construct the optimal separating hyperplane (in the case of pattern recognition) or the linear regression function (in the case of regression) in the feature space in question (13).

The k-means clustering algorithm is a machine learning algorithm that groups nearby points into clusters. No explicit learning model is constructed: a new point is assigned to a cluster based on its distance to each cluster (typically to the cluster centre, i.e., its arithmetic mean) and is placed in the closest one. For example, to divide the points of a line into three groups, the closeness of a point to a particular group is measured by its distance to that group (for example, the distance between two points) (14).

Naive Bayes (NB) is a classification approach built on an independence assumption over the features: the state of a given feature within a class has no effect on the state of any other feature within that class. Because it is founded on conditional probability, it is regarded as a robust algorithm for classification applications, and it performs well on data with class-imbalance issues and missing values (15). The classifier rests on Bayes' rule, P(A|B) = P(B|A) · P(A) / P(B), where:
• P(A|B): the conditional probability that the response variable takes a given value given the input features, also referred to as the posterior probability.
• P(A): the prior probability of the response variable.
• P(B): the probability of the training data (the evidence).
• P(B|A): the likelihood of the training data given the class.

DT learning is one of the predictive modelling techniques used in statistics, data mining, and machine learning. A decision tree (used as a predictive model) moves from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). Decision trees use a hierarchical representation of the data in the form of a sequence of decisions (tests) to predict an outcome or a category. Each individual (or observation) to be allocated to a class is described by a collection of variables, which are tested in the nodes of the tree: tests are carried out in the internal nodes and decisions are made in the leaf nodes (16). A brief illustrative sketch of fitting such classifiers is given below, before the tree terminology and the random forest procedure are detailed.
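The sketch below illustrates, on synthetic toy data, how classifiers of the kinds described above (SVM, Gaussian Naive Bayes, and a decision tree) are typically instantiated and fitted with scikit-learn. It uses default hyperparameters and generated data, so it is only an illustration of the techniques, not the configuration or the results of this study.

```python
# Illustrative only: fit the three supervised classifiers described above
# on synthetic data shaped like the diabetes datasets (8 features).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy data; not the PIMA or Frankfurt records.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```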
In graph theory, a tree is a connected, undirected, acyclic graph. There are three categories of nodes:
• Root node: the base of the tree and the most sensitive element when the tree is created, before it is split.
• Internal nodes: nodes whose children are themselves nodes.
• Leaf (terminal) nodes: nodes that carry no further branches.
There are many decision-tree algorithms, among which we can cite ID3, C4.5, CART, C5, CHAID, SLIQ, UFFT, and VFDT (16).

The Random Forest algorithm (17) is an ensemble method for statistics and machine learning that combines several learners to improve prediction performance. The algorithm has two parts, (a) tree bagging and (b) the step from tree bagging to the random forest, and each tree is produced as follows:
1. If the training set contains N cases, sample N instances at random, but with replacement, from the original data; this sample becomes the training set for growing the tree.
2. If there are M input variables, a random subset of these characteristics is picked at each node and the best split among them is used to divide the node; the size of this subset is kept constant while the forest grows.
3. Each tree is grown as deep as possible.

We used two different databases in this study: the Pima Indian Diabetes database provided by the UCI Machine Learning Repository (18) and a database extracted from the hospital in Frankfurt, Germany (19). The database extracted from the Frankfurt hospital contains 2,000 patient records, while the Pima Indian database contains 768, each with 8 attributes/features and one output holding the patient's label/outcome (0: not diabetic, 1: diabetic). The two databases share the same medical variables, such as:
• Blood Pressure: diastolic blood pressure (mm Hg).
• Skin Thickness: thickness of the triceps skin fold (mm).
• Diabetes Pedigree Function: a heredity score.
• Class = 1 indicates a diabetic patient; Class = 0 indicates a non-diabetic patient.
The choice of these two databases is justified by the following criteria:
• The size of the database.
• The number of attributes.
• The number of classes.

Data cleaning is the next step in the machine learning workflow. It is one of the main stages of the work, and it can either make the model or break it; as the saying goes in machine learning, "the best data beats the most complex algorithms." Several aspects of data cleansing must be considered:
1. Inconsistencies and omissions.
2. Mislabelled data, e.g., the same category repeated.
3. Invalid or missing data.
4. Outliers.
The observation and analysis of the two databases are presented in Table 1. Several strategies are possible for handling the zero values:
• Ignore/delete these cases: delete all observations with zero values; with this method, however, we lose a significant amount of data (about 50% of the data set).
• Impute central values: calculate the median value of a specific column and substitute it in that column wherever there is a zero.
• Avoid using those parameters: the model can avoid attributes with too many invalid values; this may be appropriate for Skin Thickness, although it is hard to tell.
• Use a classification/clustering algorithm: use an algorithm to recover the missing data where zeros occur and replace them with the value found. In our case we apply this method, choosing the k-means algorithm with a variable number of clusters K and replacing the zeros in each column that needs cleaning with the representative value of its cluster (a rough sketch of this replacement is given after the split description below).

The data set is then split into two parts, training and testing: the training set teaches the model and the test set is used to evaluate the model's correctness. With "test size = 0.2", 20% of the data is used for the test and the remaining 80% for training; with "test size = 0.3", 30% is used for the test and 70% for training.
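The following is a rough sketch of one possible reading of the k-means-based correction described above: the records are clustered, and the invalid zeros in a column are replaced by the value of the corresponding cluster centre. The number of clusters K, the columns cleaned, the cluster "representative" used, and the file and column names (taken from the public PIMA-style CSV) are assumptions; the authors' exact procedure may differ.

```python
# Rough sketch of k-means based replacement of invalid zero values.
# Assumptions: PIMA-style "diabetes.csv", K=3 clusters, cluster centre as
# the representative value; the study's exact choices may differ.
import pandas as pd
from sklearn.cluster import KMeans

def kmeans_impute_zeros(features, columns, n_clusters=3, random_state=0):
    """Replace zeros in `columns` with the centre value of each record's cluster."""
    filled = features.copy()
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(filled.values)          # cluster assignment per record
    for col in columns:
        col_idx = filled.columns.get_loc(col)
        zero_mask = (filled[col] == 0).to_numpy()   # records with an invalid zero
        # Take the column value of the centre of the cluster the record belongs to.
        filled.loc[zero_mask, col] = km.cluster_centers_[labels[zero_mask], col_idx]
    return filled

data = pd.read_csv("diabetes.csv")
cleaned = kmeans_impute_zeros(
    data.drop(columns=["Outcome"]),
    columns=["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"],
)
print(cleaned.describe().loc[["min", "mean"]])
```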
This study uses the Naive Bayes, Random Forest, SVM, and DT algorithms, and the train/test split is used in the experiments. Accuracy, F1-measure, recall, and precision are used as classification metrics; see Table 2 for the accuracy measures (20). This is clear from Table 3, which compares the different performance measures (accuracy, recall, and F1 score) used to evaluate the investigated machine learning models: Random Forest (RF) demonstrated the best accuracy on the database extracted from the Frankfurt hospital (see also Table 4). Based on the same performance measures, the examined ML models are also comparable on the Pima Indian database, as seen in Table 5: SVM (Support Vector Machine) demonstrated the best accuracy in its optimized form (83.1%), while other algorithms, such as Random Forest (RF), also achieved adequate accuracy (80.5%). DT, Gaussian Naive Bayes, Random Forest, and SVM all performed well, and at a basic level the Support Vector Machine and Random Forest work better than the other algorithms.

From the results of this experiment, we observe that the accuracy values for this database are satisfactory across all measurements, with some fluctuation: the rate sometimes increases and sometimes decreases by a small margin when the number of clusters is changed, and the random initialization of the cluster centres can also influence the results. Note that after running the model several times, different results can be obtained for the same number of clusters; this depends on the random initialization step of the cluster centres.

According to the table above, the SVM model obtained the best accuracy, equal to 83.1%; that is, of the 153 records chosen for testing this model, 127 patients were classified correctly. We select SVM as the model that works best for this data set because of its high accuracy. Likewise, according to the corresponding table, the Random Forest model obtained the best accuracy, equal to 97.6%; that is, of the 600 records chosen for testing this model, 582 patients were classified correctly. We select Random Forest as the model that works best for that data set because of its high accuracy.

The goal of the four algorithms is to classify future observations correctly while reducing classification errors. Based on the results given in Tables 5 and 6, it can be concluded that the suggested models are more accurate than the other type 2 diabetes prediction models investigated in the studies listed in these tables. Comparing the above findings, it is clear that the idea of using the k-means algorithm was successful in our work; we therefore infer that improving the quality of the data improves the outcomes. This is consistent with other studies (26-35) which demonstrate the predictive accuracy of machine learning algorithms.

In this study, we proposed a supportive diagnosis system based on a comparison of four prediction algorithms for predicting diabetes on two different databases. The machine learning algorithms are compared and assessed on the basis of several performance measures, such as accuracy and recall as well as the F1 score; an illustrative computation of these metrics is sketched below.
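For reference, the sketch below shows how the evaluation metrics named above (accuracy, precision, recall, and F1 score) can be computed with scikit-learn from a set of predictions. The label vectors here are made-up examples, not results from this study.

```python
# Illustrative computation of the evaluation metrics used in the study.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # hypothetical true outcomes
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]   # hypothetical model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```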
From the classification results obtained, it can be concluded that the random forest machine learning technique provides more accurate predictions and higher performance than the other methods described in this study. Moreover, when compared with other research available in the current literature, the approaches used in this study, such as Naive Bayes, DT, SVM, and Random Forest, produce optimal outcomes. The main objective of this study is to help diabetologists establish an accurate treatment routine for their diabetic patients. Thanks to its high accuracy, its ability to diagnose the disease in a shorter time, and the resulting rapid treatment, this study could open a window onto the development of an electronic health system for diabetic patients. There are also a few aspects of this study that could be improved or expanded in the future:
• Creation of a diabetes database for Algerian patients.
• Diabetes prediction with a deep learning approach.
• Development of a solution based on an Android application to help people predict whether they have diabetes.

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

References
Organisation mondiale de la santé. Diabète en quelques mots.
DIABETES - TYPE 2: The Review of Diabetic Studies.
Classification de la maladie du diabète à l'aide d'une machine à vecteur de soutien.
Using data mining to develop model for classifying diabetic patient control level based on historical medical records.
Diabetes disease prediction using data mining.
Prediction of Diabetes in Females of Pima Indian Heritage: A Complete Supervised Learning Approach.
Performance analysis of classifier models to predict diabetes mellitus.
Study of data mining algorithms for prediction and diagnosis of diabetes mellitus.
Website-based application for classification of diabetes using logistic regression method.
Providing diagnosis on diabetes using cloud computing environment to the people living in rural areas of India.
An AI-based intelligent system for healthcare analysis using Ridge-Adaline Stochastic Gradient Descent Classifier.
Performance et évaluation des techniques d'exploration de données de classification dans le diabète.
Contribution des arbres dirigés et les k-means pour l'indexation et recherche d'images par contenu. Mémoire de Magister en Informatique, by H. Fizazi Izabatene.
Naïve Bayes Algorithm.
Exploration de données avec arbres de décision: théorie et applications, 2e édition.
Base de donnée Pima Indian Diabetes.
Ensemble de données sur le diabète, extrait de l'hôpital de Francfort, Allemagne.
Prediction of Diabetes using Classification Algorithms. Master's thesis in Computer Science, supervised by Brahimi Farida, University AMO of Bouira, Faculty of Sciences and Applied Sciences.
Performance assessment of different machine learning algorithms in predicting diabetes mellitus.
Predicting Diabetes Diseases Using Mixed Data and Supervised Machine Learning Algorithms.
Opportunities and challenges of use of mobile phone technology in teaching and learning in Nigeria: a review.
Evaluation of the Performance of K-Nearest Neighbor Algorithm in Determining Student Learning Styles.
Smart Health Application for Remote Tracking of Ambulatory Patients.
COVID-19 patient health prediction using boosted random forest algorithm.
An efficient and unique TF/IDF algorithmic model-based data analysis for handling applications with big data streaming.
On detection of Sybil attack in large scale VANETs using spider-monkey technique.
A metaheuristic optimization approach for energy efficiency in the IoT networks.
MapReduce-based big data classification model using feature subset selection and hyperparameter tuned deep belief network.
Machine learning for modeling and control of industrial clarifier process.
Data Analytics for the Identification of Fake Reviews Using Supervised Learning.
Optimized dynamic storage of data (ODSD) in IoT based on blockchain for wireless sensor networks. Peer-to-Peer Netw Appl.
Deep learning based intelligent industrial fault diagnosis model.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.