key: cord-0058990-v5jl0aju
authors: Wadhwa, Shruti; Babber, Karuna
title: Artificial Intelligence in Health Care: Predictive Analysis on Diabetes Using Machine Learning Algorithms
date: 2020-08-19
journal: Computational Science and Its Applications - ICCSA 2020
DOI: 10.1007/978-3-030-58802-1_26
sha: 79c9794945dd8d903f9870ad32a697b77792ab7d
doc_id: 58990
cord_uid: v5jl0aju

Background: Healthcare organizations are producing data at an alarming rate. This data comprises medical records, genomic and omics data, image scans and wearable medical device data, and it presents immense advantages and challenges at the same time. These ever-growing challenges can be addressed by applying effective artificial intelligence tools.
Methods: This paper uses a large volume of multimodal patient data to examine correlations between diabetes and the Body Mass Index, Blood Pressure, Glucose level, Diabetes Pedigree Function and Skin Thickness of people in different age groups. Python and its data analytics packages are used to predict diabetes.
Results: The blood pressure of diabetic people is around 75–85 mmHg and sometimes even higher, whereas it is in the range of 60–75 mmHg for non-diabetic people. People with a high body mass index and glucose levels of 120–200 mg/dl or more are found to be diabetic, as against a lower body mass index and glucose levels of 85–105 mg/dl in non-diabetic people. The Diabetes Pedigree Function distribution peaks at 0.25 for diabetic people and at 0.125 for non-diabetic people. A similarly slight difference in Age and Skin Thickness has been found between diabetic and non-diabetic people.
Conclusion: The above results indicate a strong relationship of diabetes with Blood Pressure, BMI and Glucose level, whereas a moderate correlation has been found between diabetes and Age, Skin Thickness and Diabetes Pedigree Function.
Although the present analysis confirms many previous research findings, reproducing these inferences with analytical tools is the sole purpose of this paper.

Intelligent information can be a key to a healthier world. The more data we have, the better we can generate health-specific information. These days we are surrounded by vast amounts of data from almost every aspect of our lives, be it societal, work, science or health. In the technology world the term 'Big data' has been coined to describe such huge volumes of data. Douglas Laney [1] described big data by three V's, namely Volume, Velocity and Variety: the 'Big' in big data indicates its large volume, velocity represents the rate and speed of data collection, and variety stands for the different types of organized and unorganized data. Over time, other authors [2] added two more V's, Veracity and Variability, to the big data definition. In recent years almost every sector of research has been analyzing big data for various purposes. The most challenging task is to manage the heaps of structured and unstructured data [3] and to transform them into subject-oriented information. It is almost impossible to process big data with conventional software; technically advanced software with high-end computational power is needed to make sense of such huge data. Artificial Intelligence tools [4, 5] and algorithms can help us generate decision-making information.

Healthcare is required at every stage of human life. Professionals are the first point of contact for primary care consultation; skilled professionals provide secondary care; extra or intensive medical treatment constitutes tertiary care; and highly specialized diagnostic and surgical procedures come under the ambit of quaternary care [6, 7]. In recent years a multi-dimensional model [8] of healthcare for the diagnosis, prevention and treatment of health-related issues in human beings has been suggested.
At all these levels, accurate health-specific information is required in one place. So far, heaps of health data have been collected from various sources, and the volume has now reached the point where it is almost unmanageable with currently available technologies [9, 10]. Big data analytics, and Artificial Intelligence in particular, can be viewed as a potential analyzer for improved health services. Figure 1 provides a generalized analytic workflow for health care systems.

Machine learning in healthcare has opened up a plethora of applications [11] to improve patients' lives, but errors in such applications may prove critical at times. A visual approach to 'Case based reasoning' that mixes quantitative and qualitative aspects has been proposed in [12]. Deep learning algorithms applied to the event sequences of patients' Electronic Health Records (EHRs) [13] have been used to detect sepsis. The National Institutes of Health (NIH) of the United States has recently started [14, 15] the 'All of Us' initiative (https://allofus.nih.gov) to collect data from more than one million patients, such as EHRs, medical images, and environmental and socio-behavioural data. Data mining and machine learning are helping medical professionals make diagnosis easier by bridging the gap between huge data sets and human knowledge [16, 17]. A reference study of artificial intelligence in healthcare and medicine is provided in [18]. Intelligent machines may improve surgical procedures in times to come [19, 20]. EHRs promise to improve public health surveillance through timely reporting of disease outbreaks and can facilitate faster data retrieval for health insurance programs [21]. Like EHRs, Electronic Medical Records (EMR), Personal Health Records (PHR), Medical Practice Management (MPM) software [22-24] and other related healthcare components collectively generate heaps of data that can be analyzed to provide real-time clinical care to patients in need.
Figure 2 gives a fair idea of the framework for integrating multi-level information in order to provide personalized and cost-effective treatment to patients.

The observed data is part of the Pima Indians Diabetes Database collected by the National Institute of Diabetes and Digestive and Kidney Diseases [25]. The observational data contains different parameters such as blood pressure, skin thickness, glucose and Diabetes Pedigree Function. First, the 'Pandas' [26] library is imported to read the data from a 'csv' file; then, with the help of Extract, Transform, Load (ETL) tools [27] and the 'Numpy' library [28], the data is transformed into a usable format for the classification models. Figure 3 shows our dataset with 9 columns and 392 rows. The 'Matplotlib' library functions [29] have been used to draw graphs and other visualizations. The machine learning algorithms available in the 'sklearn' library [30] are used for the final predictive analysis. To start with, we normalize the dataset and group it by its two outcomes: an outcome of zero (0) for non-diabetic people and an outcome of one (1) for diabetic people. The bar graph in Fig. 4 gives a fair idea of this split: more than 250 people have outcome '0' (non-diabetic), whereas around 125 people have outcome '1' (diabetic). To draw inferences about the effect of different parameters on diabetes, we have taken a number of features into consideration, the details of which are provided in the results section.

Blood Pressure [31] is defined as the measure of the pressure of the blood in the circulatory system. The authors of [32] explained the effect of hypertension on diabetes mellitus. To find a correlation between blood pressure and diabetes, a bar graph visualization is provided in Fig. 5. In graph I, the BP values are about normal, centred around 65-85 mmHg against the ideal value of 80 mmHg.
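The loading and outcome-count steps described above might look roughly like the following minimal sketch. The column names are an assumption (they follow the commonly distributed CSV of the Pima Indians Diabetes Database and are not given in the text), the rule that a zero reading means "not measured" is one plausible cleaning step rather than the authors' documented procedure, and the inline sample rows are made up so that the sketch runs without the real file.

```python
import io

import numpy as np
import pandas as pd

# Assumed column names (not stated in the text); a reading of 0 in these
# columns is treated as "not measured" (assumption, not the paper's rule).
ZERO_IS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def load_and_clean(csv_source):
    """Read the CSV and drop rows in which a physiological reading is 0."""
    df = pd.read_csv(csv_source)
    df[ZERO_IS_MISSING] = df[ZERO_IS_MISSING].replace(0, np.nan)
    return df.dropna().reset_index(drop=True)


def outcome_counts(df):
    """Count non-diabetic (0) and diabetic (1) records."""
    return df["Outcome"].value_counts().sort_index()


# Made-up rows for illustration only; these are NOT taken from the dataset.
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,95,68,25,90,27.5,0.20,24,0
1,150,82,36,170,38.2,0.45,41,1
4,130,78,33,120,34.0,0.30,35,1
3,180,85,40,300,36.5,0.60,50,1
1,110,0,0,0,33.0,0.15,28,0
6,118,72,0,0,26.0,0.22,31,0
"""

sample = load_and_clean(io.StringIO(SAMPLE_CSV))
counts = outcome_counts(sample)
print(counts)
```

With the real file, `load_and_clean` would be called with the path to the CSV instead of the in-memory sample, and `counts.plot(kind="bar")` would give a bar graph of the two outcome groups.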
In graph II, the BP distribution for diabetic people is slightly skewed towards the right, i.e. it lies around 75-85 mmHg, whereas for non-diabetic people it lies around 60-75 mmHg. The box-plot of BP in Fig. 6 shows the upper 'whisker', that is the maximum value of BP, at 100 mmHg for non-diabetic people but at 110 mmHg, with a few outliers, for diabetic people. Secondly, around 75% of diabetic people have BP in the range of 80-85 mmHg, whereas around 75% of non-diabetic people have BP in the range of 70-75 mmHg. Both plots indicate a strong association between the two parameters.

A person's age may play a role in Type 2 Diabetes Mellitus: the risk of type 2 diabetes increases with age [33, 34]. In graph I (Fig. 7), people aged around 20 to their mid-20s are predominantly non-diabetic, whereas people aged 30 and above are prone to diabetes. In graph II, the median age is 25 years for non-diabetic people but 32 years for diabetic people. Both graphs indicate a moderate correlation between the two features.

The body mass index (BMI) [35] is a person's weight in kilograms (kg) divided by the square of his/her height in meters (m). The National Institutes of Health (NIH) has adopted BMI as a parameter to define normal weight, overweight and obesity rather than the traditional height/weight charts. Graph I (Fig. 8) shows the BMI of all the people across the different age groups. In graph II, the BMI peak is at 28-32 for non-diabetic people, whereas it is skewed towards the right, at 36, for diabetic people. In the box-plot (Fig. 9), more than 50% of diabetic people have a high BMI compared with non-diabetic people. Secondly, for diabetic people some outliers lie beyond the maximum, which is not the case for non-diabetic people. The graph visualizations indicate a strong correlation between the two parameters.

The Diabetes Pedigree Function (DPF) [36] is a function which scores the likelihood of diabetes based on family history.
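The box-plot readings quoted in this section (medians, quartiles, whisker ends and outliers) can be computed directly from the data. The sketch below assumes the common Tukey convention, in which whiskers extend to the most extreme observation within 1.5 times the interquartile range of the box and anything beyond is drawn as an outlier; the text does not state which convention the figures use. The function name `box_stats` and the readings are hypothetical and for illustration only.

```python
import numpy as np


def box_stats(values, whis=1.5):
    """Quartiles, whisker ends and outliers under the whis x IQR rule."""
    x = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers end at the most extreme data points inside the fences.
    inside = x[(x >= low_fence) & (x <= high_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": x[(x < low_fence) | (x > high_fence)].tolist(),
    }


# Made-up blood pressure readings in mmHg; NOT values from the dataset.
bp_example = [60, 64, 68, 70, 72, 74, 76, 80, 110]
stats = box_stats(bp_example)
print(stats)
```

Applied per outcome group (for example to the blood pressure values of the diabetic and non-diabetic records separately), such a summary gives the numbers that the box-plots display graphically.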
Two histograms of DPF for both outcomes (0 and 1) are presented in Fig. 10. Graph I shows the DPF of all the people across the different age groups. In graph II the DPF peak is at 0.125 for non-diabetic people, whereas it is at 0.25 for diabetic people. The graph indicates a nominal relation between the two parameters. The box-plot in Fig. 11 shows that more than 50% of non-diabetic people have a DPF of 0.4, and the value is slightly higher, at 0.525, for diabetic people. Secondly, both plots have outliers beyond the maximum. Therefore, all the graphs indicate a weak correlation between the two features.

The skin thickness [37] is primarily determined by collagen content and is increased in insulin-dependent diabetes mellitus. Graph I of Fig. 12 provides the skin thickness for both outcomes (0 and 1). In graph II the peak is at 30 for non-diabetic people, whereas it is at 40 for diabetic people. The peak values indicate a moderate relation between the two parameters. In the box-plot (Fig. 13), more than 50% of non-diabetic people have a skin thickness of 28, whereas the value is 34 for more than 50% of diabetic people. Secondly, very few outliers lie beyond the maximum in either case, and the difference between the maximum values for diabetic and non-diabetic people is only 3 points. The graphs indicate a nominal relation between the two features.

The global mean fasting plasma glucose level [38] in humans is about 100 mg/dl. Graph I of the two histograms in Fig. 14 provides the glucose level for both outcomes (0 and 1). Graph II shows the glucose level of non-diabetic people in the range of 85-105 mg/dl, with a small spike at 100-120 mg/dl, whereas the range for diabetic people is 120-200 mg/dl. In the box-plot (Fig. 15), more than 50% of non-diabetic people have a glucose level of 100-110 mg/dl, but more than 50% of diabetic people have a level above 140 mg/dl.
Secondly, the maximum value for non-diabetic people is 170 mg/dl, whereas it is 200 mg/dl for diabetic people. All the above graphs indicate a strong correlation between these two parameters.

The above data were then analyzed with several machine learning algorithms, and it is found that Logistic Regression, the Support Vector Classifier (SVC) and the Linear Support Vector Classifier (LSVC) performed well, with a higher mean accuracy than the Decision Tree, Gaussian Naive Bayes and Random Forest classifiers [39-41]. The box-plot (Fig. 16) shows the interquartile range (IQR) of the accuracy of each machine learning algorithm, and Table 1 gives the detailed mean accuracy and standard deviation for each algorithm. By applying the rule of thumb for correlation together with the inferences obtained from the machine learning algorithms, we can infer that BP, BMI and Glucose level show a strong correlation with diabetes, whereas DPF, Age and Skin Thickness signal a moderate correlation with diabetes.

With the help of machine learning algorithms, big data health analytics has opened doors for new predictive systems. Along similar lines, this paper presents a predictive analysis of diabetes. We found a strong correlation of diabetes with Blood Pressure, BMI and Glucose level, whereas Age, Skin Thickness and Diabetes Pedigree Function showed a moderate relationship with diabetes. Although conventional clinical reports on the correlation of these parameters with diabetes exist, reproducing these inferences with analytical tools is the sole purpose of this paper. Through this paper we successfully predict diabetes and visualize the correlations between the different parameters and diabetes.
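A comparison of the six classifiers by mean accuracy and standard deviation, as reported above, could be set up along the following lines with 'sklearn'. This is only a sketch under stated assumptions: the text does not give the validation scheme, preprocessing or hyperparameters, so the 10-fold stratified cross-validation, the feature scaling for the linear and kernel models, and all parameter values here are illustrative choices. The data is a synthetic stand-in generated with `make_classification`, not the Pima data, so the printed accuracies say nothing about the paper's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the dataset (392 rows, 8 features).
X, y = make_classification(
    n_samples=392, n_features=8, n_informative=4, random_state=0
)

# Scaling is an assumption; it mainly helps the linear and kernel models.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "Linear SVC": make_pipeline(StandardScaler(), LinearSVC(max_iter=10000)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 10-fold stratified cross-validation (assumed; not stated in the text).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:22s} mean={scores.mean():.3f} std={scores.std():.3f}")
```

Keeping the per-fold `scores` for each model, rather than only their mean, is what allows a box-plot of accuracies such as the one described for Fig. 16.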
3D Data Management: controlling data volume, velocity and variety, Application delivery strategies
A formal definition of big data based on its essential features
The pathologies of big data
Mapreduce: simplified data processing on large clusters
Big data analytics in healthcare
Big data, big knowledge: big data for personalized healthcare
The role of mobile technologies in health care processes: the case of cancer supportive care
Artificial intelligence in healthcare and biomedical research: why a strong computational bioethics framework is required
The evolution of electronic health record
The internet of things in healthcare: an overview
Machine learning theory and applications for healthcare
Explainable artificial intelligence for breast cancer: a visual case-based reasoning approach
Early detection of sepsis utilizing deep learning on electronic health record event sequences
Traditional bioinformatics in the era of real-time biomedical, health care and wellness data streams
The potential for artificial intelligence in healthcare
Artificial intelligence in healthcare: past, present and future
Global evolution of research in artificial intelligence in health and medicine: a bibliometric study
Harnessing the power of intelligent machines to enhance primary care
Robot creating a people-powered future for EHRs: the challenge of making electronic data usable and interoperable
Using electronic health records for clinical research: the case of the EHR4CR project
A research agenda for Personal Health Records (PHR)
Personal health records: a systematic literature review
Python & ETL 2020: A list and comparison of the top python ETL tools
Medical definition of blood pressure
Hypertension and diabetes mellitus
Age is an important risk factor for type 2 Diabetes Mellitus and Cardiovascular Diseases
Diabetes mellitus: the epidemic of the century
Body mass index: obesity and health - a critical review
Diabetes Mellitus affected patients classification and diagnosis through machine learning techniques
Effects of age, gender and anatomical site on skin thickness in children and adults with diabetes
Estimation of blood glucose levels by people with diabetes: a cross-sectional study
Analysis of random forests model
Predicting company growth using logistic regression and neural networks
A comparative study on machine learning algorithms for smart manufacturing: tool wear prediction using Random Forests