key: cord-0300587-6yx9kxly authors: Zaki, N.; Mohamed, E. A.; Habuza, T. title: From Tabulated Data to Knowledge Graph: A Novel Way of Improving the Performance of the Classification Models in the Healthcare Data date: 2021-06-12 journal: nan DOI: 10.1101/2021.06.09.21258123 sha: 28c5869ab3227e1ed80f0549d84988024d2190f7 doc_id: 300587 cord_uid: 6yx9kxly In sectors like healthcare, having classification models that are both reliable and accurate is vital. Regrettably, contemporary classification techniques employing machine learning disregard the correlations between instances within data. This research, to rectify this, introduces a basic but effective technique for converting tabulated data into data graphs, incorporating structural correlations. Graphs have a unique capacity to capture structural correlations between data, allowing us to gain a deeper insight in comparison to carrying out isolated data analysis. The suggested technique underwent testing once the integration of graph data structure-related elements had been carried out and returned superior results to testing solely employing original features. The suggested technique achieved validity by returning significantly improved levels of accuracy. Over the last few years, there has been an explosion in the interest in the deployment of artificial intelligence, especially machine learning (ML) technologies for improving and accelerating the decisionmaking process within healthcare. This technology is seen as having the potential to provide diagnostic insights swiftly and accurately. A number of machine learning and deep learning (DL) methodologies have been created. These techniques have led to improvements in the accuracy of assessing diseases and fewer errors in treating disease. Nevertheless, numerous challenges still exist in terms of the development of workable ML models for use in healthcare. The difficulty of these challenges is exacerbated by the complex nature and sheer size of real-world data, particularly in the current Big Data environment. Additionally, numerous ML solutions, especially ones employing classification models, did not fulfill the hopes of the creators. The development of extremely accurate and workable classification models in a sector like healthcare is highly challenging. If it is shown to be possible to overcome problems concerning the quality of data, its volume, and complexity, more research is urgently needed regarding the way knowledge can be acquired and utilized practically within the healthcare sector [1] . In sectors like healthcare, this is vital that the classification model should be both reliable and accurate. Although researchers have recently used numerous high-quality and accurate classification methodologies, e.g., deep learning, not every classification model has demonstrated adequate superiority over previous techniques. This is due to the fact that these techniques gave little attention to correlations between data instances. In this research, it is demonstrated how the incorporation of basic knowledge graph algorithms can lead to improvements in the way classification models perform. Knowledge graph algorithms are created with a focus on correlations, and they have a unique capacity to discover structures and provide insights from connected data. As ML and knowledge discovery have rapidly developed in recent times, numerous new forms of analysis of graphs and algorithm mining have been used in a number of areas. Healthcare is one such area, a sector that has been under significant pressure as a result of the COVID-19 pandemic. The development of ML models that can understand how diseases are transmitted, treated, and prevented is urgently needed; these models should be able to mine data from numerous sources, including academic and professional literature, hospital records, the pharmaceutical industry, and biological, microbiological, and genetic research. In order to do this, ML techniques incorporating analytics based on graphs are showing significant potential. It is not surprising that knowledge graphs have been widely used in the healthcare sector in recent times and we may categories the different types that can be applied as generative graph models, knowledge graph construction and inference, and network embedding. Pham et al [2] have suggested the construction of knowledge-based heterogeneous information graphs to be used for classifications of medical health status. He et al [3] created synthetic triples using conceptualization, formulating the challenge as a triple classification that was addressed employing a discriminatory model, transferring knowledge from previously prepared language models. These researchers showed that their suggested methodology was effective in the identification of plausible triples and expansion of the knowledge graph using triples that could be highly diverse and novel in terms of edges and nodes. Additionally, Zhu et al [4] undertook a review of the literature regarding extant pharmaceutical knowledge bases and how they are applied in research by medical informatics. Yue et al. [5] undertook systematic comparisons with three significant predictive tasks for biomedical purposes: predictions of drug-disease Association (DDA), drug-drug interaction (DDI), and protein-protein interaction (PPI). They have provided a guideline framework for the proper selection of graph embedding methodology and delineating hyper-parameters for a variety of biomedical endeavors. Tu et al [6] put forward a Hyper-Network embedding model for the embedding of hyper networks having indecomposable hyperedges. This model was used with a quartet of different forms of hyper-network, including a drug network, and its performance showed promise. Chang Su et al. [7] undertook a review of the literature related to the application of networking embedding to effect advances within biomedicine. Baytas et al. [8] suggested a deep strategy for embedding heterogeneous attributed hyper-networks using complex non-linear node correlations. This involves the design of fully connected graph convolutional layers for the projection of a variety of node types. Chengxi Zang et al. [9] created MoFlow, and an invertible flow model for the generation of molecular graphs. Ling Chen et al. [10] proposed a partially supervised learning algorithm based on graphs in order to mine data from health examinations and to predict risks for the classification of ongoing situations in which most of the data was not labeled. Sacchet MD et al. [11] employed a support vector machines classifier for distinguishing between individuals suffering from depression and those who were not on the basis of a number of brain network indications. These researchers also carried out an assessment of the value of specific graph metrics for the differentiation of the different classes. Although these methods proved successful, a challenge is presented when faced with the unavailability of network data or if data is available but in standard tabulated format. To tackle this problem Tao et al. [1] designed a knowledge graph employing healthcare categorization and other knowledge mined from the NHANES data set, a collection of latent concepts employing the Pearson Correlation for decoding. This research presents a non-complex but extremely accurate way of converting tabulated data into graphs, allowing significant improvements in ML classifications. The suggested technique underwent testing using well-regarded data sets in the healthcare field:  D1 -Pima Indians Diabetes Dataset [12] : this is a dataset built by the National Institute of Diabetes and Digestive and Kidney Diseases. The database aims to provide diagnostic predictions as to the likelihood of a patient having diabetes on the basis of various diagnostic data including age in years, diabetes pedigree function, BMI (kilograms), insulin (mu U/m), skin thickness (millimeters) blood pressure (mm Hg), plasma glucose concentration levels, and the number of times pregnant. The dataset is publically available from Kaggle: https://www.kaggle.com/uciml/pima-indiansdiabetes-database  D2 -Stroke Prediction Dataset: this is a dataset containing a dozen attributes, these being patient id, gender, age in years, hypertension, heart_disease, ever_married, work_type, residence_type, avg_glucose_level, body mass index in kg), and smoking_status. There are missing entries in the data and setting attributes are normal, and so a pre-processing stage was required for imputation of missing entries and conversion of nominal attributes to integers. The dataset is publically available from Kaggle: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 12, 2021.  D3 -Heart Attack Analysis and Prediction Dataset [13] : this database is used for classifying heart attacks and has 14 data points including the likelihood of heart attack, maximum heart rate, resting ECG, fasting blood sugar (mg/dl), cholesterol level (mg/dl), blood pressure (mm Hg), type of chest pain, number of major vessels, angina induced by exercise, sex, and age (years). The dataset is publically available from Kaggle: https://www.kaggle.com/rashikrahmanpritom/heartattack-analysis-prediction-dataset/metadata  D4 -Hepatocellular Carcinoma dataset [14] , [15] : this dataset was provided by the ML Repository, containing laboratory values from blood donors as well as patients with hepatocellular carcinoma, along with demographic information including gender and age. The dataset was obtained from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/HCV+data Table 1 offers greater detail about the datasets employed by this research. For illustration of the methodology, an assumption is made that there is a hypothetical dataset of samples , = 1, ̅̅̅̅̅ . Every sample is shown through features , = 1, ̅̅̅̅̅ . For simplifying the difficulty, an assumption is made that every feature is defined, i.e. ∈ ℝ. A typical dataset structure would be: Every sample in the dataset, is part of to class , where ∈ , = { 1 , 2 , … }. This is a standard -class classification problem that may be resolved employing ML methods. For integrating graph-related features, we must firstly use the scalar product operation, thus: In this instance the adjacency matrix ̇= {̇, ∶ , ∈ ℝ, , = 1, ̅̅̅̅̅ }, is representative of a weighted graph having nodes. Additionally, we employ a threshold technique for further simplification of the problem, e.g., if = 1 ∑ , =1 , the adjacency matrix A of unweighted G (see below) can be written as For the extraction of meaningful features, we can represent graph G's notes employing the topological features , = 1, ̅̅̅̅̅ , a few of which are PageRank value, centrality score, and degree of the node. These topological features can then be integrated with original features for incorporation of useful graph information and finding correlations between instances, thus: . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 12, 2021. ; https://doi.org/10.1101/2021.06.09.21258123 doi: medRxiv preprint The topological features under consideration in this research can be seen in summary in Table 2 : Topological features extracted from the pre-processed datasetTable 2. The features final extracted can then be employed alongside the ML classifier, which will provide classifications of greater accuracy. For evaluating the performance of the suggested methodology, we employed 10-fold cross-validation. Widely excepted evaluation measures were employed for evaluation of the accuracy of the classifications, including F1-score, Recall (RE), Precision (PR), and Accuracy (AC). This section will detail the experimental work undertaken for effecting significant improvements in the accuracy of classifications through the conversion of tabulated data structures into graph data structures. It is predicted that graphs will be able to capture extra significant correlations between instances that are frequently disregarded in the process of classification. The intention is to undertake an evaluation of the benefits of the incorporation of adding graph-related features to original features. It is a simple matter to calculate the accuracy of classifications either with or without the suggested graph additions. In order to test this new methodology, we began by preprocessing the data sets, ensuring that numerical values replaced nominal entries and that every missing entry was imputed (employing simple methodologies, e.g. average values). Normalization of the dataset values was then undertaken using a . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 12, 2021. ; standard scaler or the MinMax scaler. In the initial phase, we employed a number of classifiers, including Naïve Bayes (NB), neural network (NN), support vector machines (SVM), logistic regression (LR), k-nearest neighbor (KNN), random forest (RF), and decision tree (DT) forecast for classifying both classes in all four data sets and recording the accuracy of classification. We only considered the original features (OF) and default parameter values. In the second phase, we only considered topological features. Additionally, we also tested combinations of original features and data extracted from the graphs. As Table 3 shows, incorporating the graph topological features affected a significant improvement in classification accuracy for every dataset and every classifier. Lastly, an elementary correlation-based feature subset selection [16] was undertaken for identification of an elementary correlation-based feature subset selection [16] was undertaken for identification of extremely informative features (either original or graph topological) and the measurement of the accuracy with a limited feature set. As table 4 shows, although 18% of the features based on D1 were selected, 21% of D2, 30% of D3, and 42% of D4, classification accuracy can be compared to the case when we consider every feature (original and graph topological). We additionally undertook an investigation of the information offered by every feature in order to increase the accuracy of classification. We employed heat map visualization techniques to illustrate the level that each feature adds to the classification model in 2D color. Figure 2 : heat map visualization demonstrates that the graph's topological features have a high level of correlation with class attributes, not original attributes; in each of the four datasets the topological features demonstrated more correlation with the class. (a) D1: Pima Indians Diabetes Dataset [12] (b) D2: Stroke Prediction Dataset . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 12, 2021. ; https://doi.org/10.1101/2021.06.09.21258123 doi: medRxiv preprint (c) D3: Heart Disease Dataset [13] (d) D4: Hepatocellular Carcinoma (HCC) dataset [14] , [15] Graphs have a unique capacity for the capture of structural correlations between data and this provides greater insight in comparison to isolated data analysis [17] . Many forms of data are not formulated as graphs originally, e.g., text data, images, and tabulated data. In this research, we have put forward a basic methodology for the conversion of tabulated data to graph data. Whilst this methodology is relatively simple, it has been shown to be a powerful new way of interpreting data. As the result section demonstrates, there have been significant improvements to the classification performance with all of the four sample data sets. Incorporating graph topological features has added information to the classifier which has resulted in better performance. The classifications may be improved to an even greater extent by applying more powerful classification techniques, e.g. deep learning. We did not test deep-learningbased classifiers in this research as the data sets are relatively small. Additionally, using a convolutional graph network (GCN) with the features extracted may return promising results. GCN has recently come to the fore in ML and related disciplines and has shown that it can provide improved results in a number of areas. Although our proposed methodology was tested in a healthcare context, it is a relatively generic methodology that can be employed for improving the classification accuracy of tabulated data in all disciplines. One limitation of the suggested methodology is that it will only work with numerical data. This meant that in this research nominal data had to be transformed to a numerical value when working with the D2 and D4 databases. Mining health knowledge graph for health risk prediction Constructing a knowledge-based heterogeneous information graph for medical health status classification On the Role of Conceptualization in Commonsense Knowledge Graph Construction Drug knowledge bases and their applications in biomedical informatics research Graph embedding on biomedical networks: methods, applications and evaluations Structural deep embedding for hyper-networks Network embedding in biomedical data science Heterogeneous Hyper-Network Embedding MoFlow: an invertible flow model for generating molecular graphs Mining Health Examination Records-A Graph-Based Approach Support vector machine classification of major depressive disorder using diffusion-weighted neuroimaging and graph theory Using the ADAP learning algorithm to forecast the onset of diabetes mellitus International application of a new probability algorithm for the diagnosis of coronary artery disease Using machine learning techniques to generate laboratory diagnostic pathways: a case study he Enhanced Liver Fibrosis (ELF) score: normal values, influence factors and proposed cut-off values Correlation-based Feature Subset Selection for Machine Learning Graph convolutional networks: a comprehensive review Acknowledgment: The authors would like to acknowledge partial support from the big data analytics center, united Arab Emirates University. Special thanks to Mrs. Asha Plackal for her valuable help. Competing interests: The authors declare that they have no competing interests.