key: cord-0058179-dzfm9fdj
authors: Hrimov, Andrew; Meniailov, Ievgen; Chumachenko, Dmytro; Bazilevych, Kseniia; Chumachenko, Tetyana
title: Classification of Diabetes Disease Using Logistic Regression Method
date: 2020-12-04
journal: Integrated Computer Technologies in Mechanical Engineering - 2020
DOI: 10.1007/978-3-030-66717-7_13
sha: 5dd29c5f1eb20a57f525006dcb03a496b5c803d3
doc_id: 58179
cord_uid: dzfm9fdj

At the moment, there are many methods of analysis and classification aimed at building the most accurate and effective mathematical models that are widely used in medicine as a decision-making tool. Existing methods make it possible to identify the relationships between input and output variables in the sample, build models reflecting these relationships, compare them in terms of accuracy, profitability and costs, and choose the most effective model. The increase in the incidence of diabetes not only in the world, but also in Ukraine, dictates the need to introduce a mathematical apparatus for automatic diagnosis of the disease. Within the framework of the study, the classification of patients with diabetes by the logistic regression method was implemented. Python is used for software implementation.

The global coronavirus pandemic has once again demonstrated the need to introduce digital tools into healthcare [1]. The digitalization of medicine is one of the most urgent tasks in the modern world, and already created products and solutions are used in various areas of public health: in the management of medical institutions [2-5], diagnostics of various diseases [6-8], modeling epidemic processes [9, 10] and predicting morbidity [11], surgical treatment [12, 13] and even in training of medical personnel [14-17].
The modern development of information technologies makes it possible to develop not just shells for automating the work of medical institutions [18], but also complex systems based on methods and means of artificial intelligence [19], multi-agent modeling [20], game theory [21], decision theory [22], machine learning [23, 24], computer vision [25, 26] and other modern methods and approaches.

The worldwide attention to the incidence of COVID-19 does not diminish the importance of global epidemics of other diseases, and the pandemic often exacerbates them. One of these diseases is diabetes. Diabetes is a chronic disease that develops when the pancreas does not produce enough insulin, or when the body cannot efficiently use the insulin it makes. Insulin is a hormone that regulates blood sugar levels. A common result of uncontrolled diabetes is hyperglycemia, or elevated blood sugar levels, which over time leads to severe damage to many systems in the body, especially nerves and blood vessels.

According to the latest official data from the Ministry of Health, there were 1.27 million people with diabetes in Ukraine. Among them, almost 200,000 patients require daily insulin intake. From 2010 to 2017, the total number of patients increased by 4%, and the rate per 100 thousand population by 12%. The share of diabetes mellitus cases among all diseases increased during this period by 0.3 percentage points (from 1.4% to 1.7%). According to the Public Health Center, almost half of the patients with diabetes in Ukraine are not diagnosed.

One of the tools for solving the problem of diabetes diagnosis is the use of machine learning to identify affected patients based on test data. Thus, the aim of the research is the automated classification of patients with suspected diabetes based on the logistic regression method.
At the moment, there are many methods of analysis and classification aimed at building the most accurate and effective mathematical models that are widely used in medicine as a decision-making tool [27]. Existing methods make it possible to identify the relationships between input and output variables in the sample, build models reflecting these relationships, compare them in terms of accuracy, profitability and costs, and choose the most effective model [28, 29].

Linear regression is used to model linear relationships between a continuous output variable and a set of input variables [30]. Under certain conditions, the linear regression equation serves as an irreplaceable and very high-quality tool for analysis and forecasting. The linear regression model is the most common and simplest equation for the relationship between input and output variables. In addition, the constructed linear regression equation can be the starting point for data analysis.

When analyzing data, there are often problems where the output variable is categorical, and then the use of linear regression is difficult. In our case, the categorical output is the presence or absence of diabetes. Therefore, when looking for relationships between a set of input variables and a categorical output variable, logistic regression has become widespread. Logistic regression is a binary classification method. It allows one to estimate the probability that an event occurs (or does not occur) depending on the values of some independent variables. The logistic regression curve, unlike the linear one, is not a straight line: it is S-shaped and bounded between 0 and 1.

The ROC curve (Receiver Operating Characteristic) is a curve used to represent the results of binary classification in machine learning. Since there are two classes, one of them is called the class with positive outcomes, the other the class with negative outcomes. In the terminology of ROC analysis, correctly classified examples of the former are called true positives, and correctly classified examples of the latter, true negatives.
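The way logistic regression turns a linear combination of input variables into a probability can be illustrated with a minimal sketch. The function names and the coefficient values below are hypothetical and chosen only for illustration; they are not the fitted values of the model in this study.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval [0, 1]."""
    # Numerically stable form that avoids overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(features, weights, bias):
    """Estimated probability of the positive class for one observation."""
    # Linear part of the model, as in ordinary linear regression.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # The logistic function bends the straight line into an S-shaped curve.
    return sigmoid(z)


# Hypothetical coefficients and inputs (assumptions, not fitted values).
weights = [0.8, -0.5]
bias = -0.2
p = predict_probability([1.0, 2.0], weights, bias)
print(f"estimated probability of the positive class: {p:.3f}")
```

Unlike a linear regression output, the value returned here can always be read as a probability, which is what makes a cut-off threshold meaningful.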
ROC analysis assumes that the classifier has a parameter whose variation changes the division into two classes. This parameter is often called the threshold, or cut-off value. Depending on it, different rates of type I and type II errors are obtained. In logistic regression, the cut-off threshold ranges from 0 to 1, because it is applied to the calculated value of the regression equation. Let us call this value a rating.

To understand the essence of type I and II errors, consider a four-field confusion matrix (Table 1). Its four fields are: true positives (TP), positive examples classified as positive; false negatives (FN), positive examples classified as negative (type II error); false positives (FP), negative examples classified as positive (type I error); and true negatives (TN), negative examples classified as negative. What is positive and what is negative depends on the specific task.

Analysis often operates not with absolute indicators but with relative shares (rates), expressed as a percentage. The True Positive Rate is the proportion of positive examples that are classified correctly:

$$TPR = \frac{TP}{TP + FN} \cdot 100\%$$

The False Positive Rate is the proportion of negative examples that are classified incorrectly:

$$FPR = \frac{FP}{TN + FP} \cdot 100\%$$

Let us introduce two more definitions: sensitivity and specificity of the model. They determine the objective value of any binary classifier. Sensitivity is the proportion of true positive cases, and coincides with the True Positive Rate:

$$Se = TPR = \frac{TP}{TP + FN} \cdot 100\%$$

Specificity is the proportion of true negative cases that were correctly identified by the model:

$$Sp = \frac{TN}{TN + FP} \cdot 100\% = 100\% - FPR$$

A high-sensitivity model often gives a true result if there is a positive outcome (it detects positive examples). Conversely, a model with high specificity is more likely to give a true result when there is a negative outcome (it detects negative examples). In terms of medicine, where the problem is diagnosing a disease and the model for classifying patients into sick and healthy is called a diagnostic test, we get the following: a sensitive diagnostic test manifests itself in overdiagnosis, that is, the maximum prevention of missing patients; a specific diagnostic test diagnoses only those patients who are sick with certainty. The latter is important when, for example, the treatment of a patient is associated with serious side effects and overdiagnosis is not desirable.

To implement the classification method, we used an open dataset of diabetes patients: the PIMA Indians Diabetes Database.
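The dependence of the true positive and false positive rates on the cut-off threshold, described above, can be sketched as follows. The ratings and labels are hypothetical values invented for illustration, and the function name is an assumption; each threshold yields one point of a ROC curve.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (TPR, FPR) when a rating >= threshold is classified as positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1  # positive example detected
        elif predicted_positive and label == 0:
            fp += 1  # type I error
        elif label == 1:
            fn += 1  # type II error
        else:
            tn += 1  # negative example correctly rejected
    tpr = tp / (tp + fn)  # sensitivity
    fpr = fp / (fp + tn)  # 1 - specificity
    return tpr, fpr


# Hypothetical model ratings and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

# Sweep a few cut-off values; each one gives a (threshold, TPR, FPR) triple.
roc_points = [(t, *rates_at_threshold(scores, labels, t)) for t in (0.25, 0.5, 0.75)]
for threshold, tpr, fpr in roc_points:
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Lowering the threshold raises sensitivity at the cost of more false positives, which is the trade-off a ROC curve makes visible.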
Each instance in the dataset represents an individual patient, with various medical attributes and the diabetes classification. The database has 768 instances and 9 attributes (Table 2).

The first step of data analysis is data preprocessing. For the analysis and software implementation we used the Python language. First of all, we import the necessary modules and load the data from the database (Fig. 1). For correct analysis, the "infected" and "healthy" values have to be converted to "0" and "1". We then form data frames of the characteristics and of the Boolean values of the disease. Next, we split the data into training and validation sets and build a logistic regression model. Finally, we check the model on the test data and find the accuracy of its classification.

Figure 2 shows the error matrix for the constructed model. Here 39 is the number of correctly predicted healthy people, 35 are incorrectly predicted healthy people, 16 are incorrectly predicted patients and 141 people were correctly identified as sick. In other words, the model predicted 39 + 141 = 180 people correctly and was mistaken in 35 + 16 = 51 cases.

To improve the accuracy of the model, it is necessary to analyze the characteristics that were used for the classification (Fig. 3). The figure shows the differences in mean values for sick and healthy patients. To better understand the influence of each characteristic, we visualize their values (Fig. 4).

Next, we build several models on subsets of the characteristics: the first takes into account the factors Age, Pregnancies, Serum Ins and DP Function, and the second PG Concentration, Diastolic BP, Tri Fold Thick and BMI. The data frames and the training and test sets are created in the same way as before. The analysis shows that the characteristics of the second group have a greater impact on the classification accuracy.

In the ROC plots, the dashed line represents the ROC curve of a completely random classifier. A good classifier remains as far from it as possible (towards the upper left corner).
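The counts reported for the error matrix in Fig. 2 can be tallied with a short sketch. The function and argument names are hypothetical; only the four counts come from the text. Sensitivity and specificity are not computed here, because they depend on which off-diagonal cell holds the false positives and which the false negatives, and the description above does not state this unambiguously.

```python
def summarize_error_matrix(correct_healthy, wrong_healthy, wrong_sick, correct_sick):
    """Tally correct and incorrect predictions from four error-matrix cells."""
    correct = correct_healthy + correct_sick
    errors = wrong_healthy + wrong_sick
    total = correct + errors
    return {
        "total": total,
        "correct": correct,
        "errors": errors,
        "accuracy": correct / total,  # share of correctly classified people
    }


# The four counts reported for Fig. 2.
summary = summarize_error_matrix(
    correct_healthy=39, wrong_healthy=35, wrong_sick=16, correct_sick=141
)
print(summary)
```

With these counts the share of correct predictions is 180 out of 231, or about 78%, which agrees with the model accuracy of roughly 77% reported in the conclusions.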
In this case, the classifiers presented in Figs. 5 and 7 can be considered optimal.

The article presents the logistic regression method as a tool for developing a mathematical-statistical model that predicts the probability of an event of interest to the researcher when there are two possible outcomes. The ROC analysis method was selected and described in detail as a tool for assessing the quality of the model. The capabilities of these methods are demonstrated on a real example of creating a model for predicting the likelihood of diabetes and evaluating its effectiveness (sensitivity and specificity). The analysis showed that the factors PG Concentration, Diastolic BP, Tri Fold Thick and BMI most affect the accuracy of disease detection. The accuracy of the model was 77%. At first glance, this accuracy is relatively low for the problem of diagnosing morbidity. However, in our case, the option with the maximum sensitivity and specificity of the tests was chosen, which leads to overdiagnosis of patients. In the task of diagnosing diabetes, this is the best option, because a false-positive result entails, for example, only an additional visit to the doctor, whereas a false-negative result can leave a dangerous but curable disease undetected.

Acknowledgment. The study was funded by the Ministry of Health of Ukraine from the state budget in the framework of the research work on the theme "To develop a scientifically substantiated strategy of prevention antibiotic resistance of the bacteria causing of healthcare-associated infections in healthcare facilities" (State registration number 0118U000944).
[1] Global telemedicine implementation and integration within health systems to fight the COVID-19 pandemic: a call to action
[2] Modeling of the process of critical competencies management in the multi-project environment
[3] Management of critical competencies in a multi-project environment
[4] Project-oriented management of adaptive teams' formation resources in multi-project environment
[5] Modeling of the processes of stakeholder involvement in command management in a multi-project environment
[6] Determining the probability of heart disease using data mining methods
[7] Application of artificial neural networks in the problems of the patient's condition diagnosis in medical monitoring systems
[8] Application of the C-means fuzzy clustering method for the patient's state recognition problems in the medical monitoring systems
[9] Development of an intelligent agent-based model of the epidemic process of syphilis
[10] Intelligent agent-based simulation of HIV epidemic process
[11] Computer aided system of time series analysis methods for forecasting the epidemics outbreaks
[12] Uncertainty of measurement results for anatomical structures of paranasal sinuses
[13] Significance of anatomical variations of maxillary sinus and ostiomeatal components complex in surgical treatment of sinusitis
[14] Development of intelligent information technology of computer processing of pedagogical tests open tasks based on machine learning approach
[15] Conditionality examination of the new testing algorithms for coal-water slurries moisture measurement
[16] Intelligent expert system of knowledge examination of medical staff regarding infections associated with the provision of medical care
[17] Development, examination and optimization of the device for quality control of dielectric materials
[18] Web-application development for tasks of prediction in medical domain
[19] Group structures on quotient sets in classification problems
[20] Intelligent simulation of network worm propagation using the code red as an example
[21] On intelligent decision making in multiagent systems in conditions of uncertainty
[22] Point-set methods of clusterization of standard information
[23] Interpretable machine learning in healthcare through generalized additive model with pairwise interactions (GA2M): predicting severe retinopathy of prematurity
[24] Development and analysis of intelligent recommendation system using machine learning approach
[25] Application of the computer vision system for evaluation of pathomorphological images
[26] Assessment of measurement uncertainty of the uncinated process and middle nasal concha in spiral computed tomography data
[27] Anaplasmosis: experimental immunodeficient state model
[28] Pathomorphological peculiarities of tuberculous meningoencephalitis associated with HIV infection
[29] Peculiarities of proliferative activity of cervical squamous cancer in HIV infection
[30] Health management based on history of personalized physiological data using linear regression analysis