key: cord-1042276-f6l6v8jz
authors: Almustafa, Khaled Mohamad
title: Covid19‐Mexican‐Patients' Dataset (Covid19MPD) Classification and Prediction Using Feature Importance
date: 2021-10-16
journal: Concurr Comput
DOI: 10.1002/cpe.6675
sha: 8819858c8652004c6d67c7aa74d9a07e46f6e713
doc_id: 1042276
cord_uid: f6l6v8jz

The coronavirus disease (Covid19) pandemic has had a great effect on human health worldwide since it was first detected in late 2019. A clear understanding of the structure of the available Covid19 datasets might help healthcare providers identify some cases at an early stage. In this article, we examine a Covid19 Mexican Patients' Dataset (Covid19MPD) and apply a number of machine learning algorithms to it in order to select the best classification algorithm for the death and survived cases in Mexico. We then study how feature selection enhances the performance of the specified classifiers, so that severe and/or death cases can be predicted from the available dataset. Results show that the J48 classifier gives the best classification accuracy, 94.41%, with RMSE = 0.2028 and ROC = 0.919, compared to the other classifiers. With feature selection, the J48 classifier can predict a surviving Covid19MPD case with 94.88% accuracy using only 10 of the 19 features.

of health-related datasets in the form of disease detection, dataset classification, and/or disease prediction. In Reference 2, a proposed classification model for electroencephalogram (EEG) recordings of epileptic seizure waves was presented using K-means clustering, and in References 3-5, automatic epileptic seizure detection based on a line length feature and a discrete wavelet transform (DWT)-based approach was proposed. EEG signal classification for epileptic seizures was presented in Reference 6, and multi-domain EEG signal feature extraction was shown in References 7 and 8.
Time-frequency and nonlinear detection analysis of recorded EEG waves was proposed in Reference 9, and detection of epileptic seizure cases using advanced ML algorithms was presented in Reference 10.

Parkinson disease (PD) has its share of ML algorithms applied to the detection, classification, and prediction of PD cases. Speech signal processing for Parkinson patients using classification algorithms has been proposed, and prediction of PD cases using different ML classifiers was presented in References 14-17.

A number of chronic kidney disease (CKD) related data analysis approaches can also be found in the literature. To mention a few recent ones, prediction of CKD using ML algorithms was presented in References 18-20. Performance evaluation of ML classification algorithms for CKD was proposed in References 21 and 22. A comparative study of different classifiers on the CKD dataset was shown in References 23 and 24, and detection of CKD cases was presented in Reference 25.

Among other health-related studies, the association between work-related features and coronary artery disease using ML classifiers was proposed in Reference 26, the diagnosis of coronary artery disease was presented in References 27 and 28, heart disease classification and prediction using different ML classification algorithms and approaches were presented in References 29-32, and prediction of cleft cases before birth using ML algorithms was proposed in Reference 33, to mention a few.

Feature selection for datasets also has a fair share in the literature, such as using a support vector machine (SVM) for feature selection of a recorded PD wave in References 34-36, prediction of CKD using feature selection in Reference 37, and feature reduction and selection based on a modified dominance soft set in Reference 38. Feature importance was presented in many health-related applications using different ML and classification algorithms in References 39-41. Other health informatics applications using ML and feature selection methods can also be found in References 42-47.

As a newly emerged disease, Covid19 has only a moderate number of ML-related approaches in the literature, especially work on classification, prediction, and feature selection and reduction for the available Covid19 datasets. In what follows, we present the work most relevant to our proposed techniques. In Reference 48, the authors proposed artificial intelligence (AI) and ML to fight possible diagnosed COVID-19 cases using the patients' distinct respiratory patterns and thoracic computed tomography (CT) images for the detection and monitoring of COVID-19 patients, and a similar approach with CT and X-ray images was presented in References 49 and 50. A classification approach for Covid19 cases, also based on CT images, was shown in Reference 51, where early-phase detection of coronavirus was achieved using different ML algorithms. A case study using intrinsic genomic signatures for COVID-19 classification was presented in References 52 and 53 by combining supervised machine learning algorithms with digital signal processing (MLDSP). Covid19 vaccine design using reverse vaccinology and ML was suggested in Reference 54. Prediction of the growth of Covid19 cases using different ML algorithms and mathematical models was proposed in Reference 55, and screening of Covid19 using infection size-aware random forest (RF) classification was presented in References 56 and 57. In References 58 and 59, the authors suggested a classification of Covid19 datasets from X-ray and CT images by applying ML algorithms and selecting certain features.
The rest of this article is organized as follows: the preparation of the Covid19MPD is presented in Section 2, the methodology is shown in Section 3, the classifiers used are described in Section 4, and Section 5 presents the experimental results. Remarks and the conclusion are presented in Sections 6 and 7, respectively.

The dataset was obtained from a publicly accessible link provided by the government of Mexico 1 and covers more than 500,000 cases of Covid19 patients admitted at different locations in the country. For the purpose of this study, only the first 200,000 cases were studied, since the aim was to evaluate common classification algorithms on the dataset. The sample was downsized for computational time reasons, but the selected samples have enough data to cover all possible cases, as we can see in Table 1. It is worth mentioning that there were two extra features, the date of the patient's first symptoms and the date he/she was admitted to hospital, but they were removed because they are not relevant to this study.

Table 1 shows the available features of the Covid19MPD. The set has 20 attributes: 19 distinct features and a class attribute. Each feature has a set of possible values assigned to it and a specific reason for its selection. What is interesting about this dataset is that it has a fair group of attributes associated with the patient's medical history, such as whether the patient has suffered from diabetes, asthma, or other illnesses, whether the patient is a smoker or had contact with a Covid19 case, and what type of healthcare the patient received after being admitted to hospital. With this wide range of features, we can later identify the importance of each feature to the dataset in hand and determine whether a specific illness actually contributes to the death of Covid19 cases in the given country, which in our case is Mexico.
The codes given in Table 2 are explained in Table 1. Figure 1 shows a visual representation of the distribution of the attribute values presented in Table 2, based on their available cases. We can see that there are noticeable numbers of cases of pneumonia, diabetes, hypertension, and obesity among the tested patients, and a large number of cases where the patient had contact with other Covid19 patients. Table 3 shows the distribution of the age groups in 10-year intervals from newborn to 110 years old. We can see from Figure 2 a close to normal distribution for the tested age groups, with most of the reported cases between 30 and 50 years old. It is worth mentioning that, for the selected sample of 200,000 patients, the numbers of male and female patients are almost equal for most age groups.

We will present three different methods in this study:
1. Direct classification using common classifiers, which are explained in the next section.
2. Feature selection for some of these classifiers, to evaluate classifier performance with a subset of the features.
3. Feature importance, to study the direct effect of some of the features on the outcome of death for some of the patients in the presented dataset, followed by a feature importance selection of attributes to evaluate the classification performance using these attributes.

We will briefly discuss some of the ML classification algorithms used in this study for the classification, feature selection, and prediction of Covid19MPD; more details can be found in Reference 32.

Naïve Bayes is a well-known classifier that uses the conditional probability of the occurrence of a given feature (attribute) with respect to another feature to perform the classification of the available features.
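A classifier based on conditional probabilities, such as the naïve Bayes classifier named in the conclusion, can be illustrated with a minimal sketch. The code below is not the implementation used in this study; the toy records, the two binary features, the function names, and the use of Laplace smoothing are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, label)][value] -> number of occurrences
    value_counts = defaultdict(Counter)
    # Distinct values seen for each feature, needed for Laplace smoothing.
    distinct = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            distinct[i].add(value)
    return class_counts, value_counts, distinct


def predict_naive_bayes(model, row):
    """Return the class with the largest unnormalized posterior."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace-smoothed estimate of P(feature_i = value | class)
            count = value_counts[(i, label)][value]
            score *= (count + 1) / (n_label + len(distinct[i]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy records: (pneumonia, diabetes) -> outcome
rows = [(1, 1), (1, 0), (0, 0), (0, 1), (0, 0), (1, 1)]
labels = ["died", "died", "survived", "survived", "survived", "died"]
model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, (1, 0)))
```

The "naïve" part is the product over features: each feature is treated as independent of the others given the class, which keeps training to simple counting.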
In a decision tree (DT) classifier, a dataset is usually divided into smaller subsets based on provided questions (conditions), and the decision is based on the comparison of these questions between all possible classes after covering all available attributes in the dataset. A special case of the DT is the J48 classifier, where a unified variable will be associated with the provided dataset.

The RF classifier is a collection of multiple random tree classifiers, and the classification result is taken as an average over the classifications of all associated trees.

The nearest neighbor (NN) classifier assigns a test element to a class by comparing it with the available training elements based on the distance between the test and training elements. "K" usually refers to the number of closest training elements whose classes are considered when assigning the test element. The distance can be a direct distance between two elements or, for more accurate results, the Euclidean distance between two samples.

Gradient descent is an algorithm used to optimize many loss functions, such as those of SVM and logistic regression models, and is usually applied to linear models; the stochastic concept is introduced here based on the root-finding nature of the optimization task. In stochastic gradient descent (SGD), the samples for each iteration are selected randomly in a "batch" rather than taken from the whole dataset, and these batches are used to calculate the gradient for each iteration.

In this section, we will go over the results obtained from the different methods and their classifiers, compare the performance of the classifiers per method, and then compare the methods to select the best approach for the classification and/or prediction of a Covid19MPD case.
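Before turning to the results, the nearest neighbor rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the training points, the two features (a scaled age and a pneumonia flag), and the function names are hypothetical, and the Euclidean distance is used as mentioned in the text.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_rows, train_labels, query, k=1):
    """Majority vote among the k training samples closest to the query."""
    pairs = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in pairs[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (age / 100, pneumonia flag) -> outcome
train_rows = [(0.30, 0), (0.45, 0), (0.70, 1), (0.82, 1)]
train_labels = ["survived", "survived", "died", "died"]

print(knn_predict(train_rows, train_labels, (0.65, 1), k=1))
```

With K = 1, as used for the K-NN results in this study, the prediction is simply the class of the single closest training sample.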
Statistical parameters were used to compare the classifiers' performance in the simulation results: the relative absolute error (RAE), which is the relative error of estimation with respect to the actual value; the mean absolute error (MAE), which is the error with respect to the number of instances; and the area under the ROC curve (ROC), which indicates how well a given classifier performs in identifying a specific data point, where the best performance is ROC = 1.

We will present the results obtained for the directly used classifiers in terms of classification accuracy, MAE, RMSE, and ROC for the purpose of performance comparison. A 10-fold cross-validation technique was used for classification in this simulation.

In this section, we will introduce a feature selection algorithm that uses classifier subset evaluation to select the most contributing features (attributes) in the Covid19MPD dataset. We then compute the classification accuracy of the selected classifiers with the selected features and evaluate the performance of these classifiers on them. This method can be used for case prediction with a minimal number of attributes, instead of using all the attributes for each patient. We see from Table 5 that the features selected for a slightly better classification accuracy with the J48 classifier are sex, intubed, pneumonia, age, copd, cardiovascular, obesity, contact_other_covid, covid_res, and icu.

FIGURE 6 Accuracy comparison after feature selection

FIGURE 7 MAE and RMSE values before and after feature selection

It is worth mentioning that pregnancy, diabetes, asthma, inmsupr, hypertension, other_disease, and tobacco have minimal or no effect on the classification of this specific dataset using the J48 classifier. Using J48, one can predict a surviving Covid19MPD case with 94.88% accuracy using only 10 of the 19 features of the original dataset, compared with 94.41% accuracy using the full feature set.
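The idea of evaluating feature subsets with the classifier itself can be sketched as a greedy forward search. The text does not state which search strategy was used, so the forward search below is an assumption, as are the function names, the candidate feature list, and the toy scoring function, which stands in for the cross-validated accuracy of a real classifier.

```python
def forward_select(features, score, min_gain=1e-9):
    """Greedy forward search: keep adding the feature that most improves score."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda c: c[0])
        if top_score - best <= min_gain:
            break  # no extension improves the score, so stop
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


def toy_score(subset):
    """Hypothetical stand-in for a classifier's cross-validated accuracy."""
    gain = {"age": 0.05, "pneumonia": 0.08, "icu": 0.02}
    # Features outside `gain` slightly hurt the score in this toy setup.
    return 0.80 + sum(gain.get(f, -0.01) for f in subset)


candidates = ["age", "pneumonia", "icu", "tobacco", "asthma"]
selected, best = forward_select(candidates, toy_score)
print(selected, round(best, 2))
```

In practice `score` would train and cross-validate the chosen classifier on the given subset, which is what makes this a wrapper-style selection rather than a filter.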
We can also see a slight improvement in the classification accuracy of the K-NN (K = 1) classifier, with 92.82% accuracy using the selected features compared to 92.71% with the full set. Figure 6 shows a graphical representation of the accuracy results in Table 5, and we can see an increase in classification accuracy after feature selection for both the K-NN (K = 1) and J48 classifiers. Figure 7 shows a graphical representation of the MAE and RMSE values for the results in Table 5. We can see a decrease in the MAE value for all concerned classifiers after applying the feature selection method, as well as a slight decrease in the RMSE value for J48.

In this section, we will introduce a feature importance method to highlight the effect of the selected features on the outcome for surviving and non-surviving cases of Covid19MPD, based on the provided samples of patients. A feature importance method using the interaction test will be used to calculate the impact of a given feature on the resulting class of the dataset. It is a statistical approach that measures the importance of a feature value in a dataset based on its p-value resulting from a χ² test in a DT (Reference 52). We will then compare the J48 classifier using the full attribute set, the attributes selected by subset evaluation, and the attributes selected by the interaction test, to evaluate and compare the classification accuracy of the three methods. Figure 8 shows the results obtained for positive and negative feature importance using the interaction test on the Covid19MPD, and the results show that some of the features contribute highly to the classification of severe and non-severe Covid19 cases.
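To make the role of the p-value concrete, the sketch below computes a plain Pearson chi-square test of independence between one binary feature and the class, then compares two features by p-value. It is not the interaction test used in this study, whose details are not given here; the contingency counts and feature names are hypothetical, and the p-value formula assumes one degree of freedom (a 2x2 table).

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts; rows = feature yes/no, columns = died/survived
tables = {
    "pneumonia": ((30, 20), (10, 40)),
    "tobacco": ((8, 12), (32, 48)),
}
# Smaller p-value = stronger evidence that the feature and the class are related.
ranking = sorted(tables, key=lambda name: chi_square_2x2(tables[name])[1])
print(ranking)
```

A feature whose class distribution is the same in both rows gives a statistic of zero and a p-value of one, so it ranks last.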
Table 6 shows the classification results in terms of accuracy, MAE, and RMSE for the three different methods, as well as the classification result for the case where the features common to the three methods were selected. We can see that

Based on the presented results, the following remarks can be made about the methodology used:
1. The case of direct classification: the J48 classifier gives the best classification accuracy compared to all other classifiers used.
2. The case of feature selection: the J48 classifier can predict a surviving Covid19MPD case with 94.88% accuracy, and it also outperforms all other classifiers used.
3. The case of feature importance: the feature importance method outperformed the classification results for the full-featured data, which gives direct importance to the major contributing features in the classification of the available dataset.

ML algorithms, namely the naïve Bayes, SGD, RF, K-NN (K = 1), and J48 classifiers, were applied to the Covid19MPD to select the best classification algorithm for the death and survived cases in Mexico. The performance enhancement of the specified classifiers through feature selection was then evaluated; such a task can help healthcare providers identify and diagnose Covid19 cases more efficiently. A feature importance algorithm was also applied to the dataset to evaluate the importance of all features for the Mexico region and to understand the main contributors to the severe cases among patients. Results show that the J48 classifier gives the best classification accuracy, 94.41%, with RMSE = 0.2028 and ROC = 0.919, compared to accuracies of 93.64%, 93.50%, and 92.71% for SGD, RF, and K-NN (K = 1), respectively.
When using the feature selection method, the J48 classifier can predict a surviving Covid19MPD case with 94.88% accuracy using only 10 of the 19 available features, which can be useful for healthcare providers in identifying possible infected Covid19MPD cases. Results for classification using feature selection based on the feature importance method show that this method outperformed the classification results for the full-featured data, with an accuracy of 94.65%, MAE = 0.0778, and RMSE = 0.214.

As a future extension of this work, we encourage researchers to investigate and develop feature importance evaluations of the coronavirus for different countries/regions. Such an investigation would assess the extent of the possible mutation of Covid19 per country or region. Ideally, if the feature importance of several regions were evaluated, it might reflect the possible mutation of the virus per country/region and help vaccine developers pinpoint the required treatment/vaccine based on the important features that contribute to the well-being of Covid19 patients.

The author would like to thank Prince Sultan University, Riyadh, KSA, for supporting this work.

EEG signals classification using the K means clustering and a multilayer perceptron neural network model
Automatic epileptic seizure detection in EEG based on line length feature and artificial neural network
Epileptic seizures detection in EEG using DWT-based ApEn and artificial neural network. Signal Image Video Process
Epilepsy detection using DWT based Hurst exponent and SVM, K-NN classifiers
EEG signal classification using PCA, ICA, LDA and support vector machine
Automatic epileptic seizure detection in EEG signals using multi-domain feature extraction and nonlinear analysis
Detection of epileptic electroencephalogram based on permutation entropy and support vector machine
Detection of epileptiform activity in EEG signals based on time-frequency and non-linear analysis
Automatic epileptic seizure detection using scalp EEG and advanced artificial intelligence techniques
A comparative analysis of speech signal processing algorithms for Parkinsons disease classification and the use of the tunable Q-factor wavelet transform
Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease
Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's disease symptom severity
Intelligent churn prediction for telecom using GP-AdaBoost learning and PSO undersampling
Combining multiple clusterings for protein structure prediction
Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings
A decision support system to improve medical diagnosis using a combination of k-medoids clustering based attribute weighting and SVM
Kidney disease prediction using SVM and ANN algorithms
Risk level prediction of chronic kidney disease using Neuro-fuzzy and hierarchical clustering algorithm(s)
Survey on prediction of chronic kidney disease using data mining classification techniques and feature selection
Performance evaluation on machine learning classification techniques for disease classification and forecasting through data analytics for chronic kidney disease (CKD)
Performance analysis of machine learning algorithms for predicting chronic kidney disease
Comparative study of chronic kidney disease prediction using KNN and SVM
Comparative study of classification algorithms in chronic kidney disease
Chronic kidney disease detection by analyzing medical datasets in Weka
Association between work-related features and coronary artery disease: a heterogeneous hybrid feature selection integrated with balancing approach
On the problems of knowledge acquisition and representation of expert system for diagnosis of coronary artery disease (CAD)
Hybrid genetic-discretized algorithm to handle data uncertainty in diagnosing stenosis of coronary arteries. Expert Systems
A new machine learning technique for an accurate diagnosis of coronary artery disease
Hybrid real-binary particle swarm optimization for rule discovery in the diagnosis of coronary artery disease
Towards real-time heartbeat classification: evaluation of nonlinear morphological features and voting method
Prediction of heart disease and classifiers' sensitivity analysis
Cleft prediction before birth using deep neural network
Feature selection for classification of hyperspectral data by SVM
A feature selection method based on kernel canonical correlation analysis and the minimum redundancy-maximum relevance filter method
Analyzing the effectiveness of vocal features in early telediagnosis of Parkinson's disease
Diagnosis of chronic kidney disease using effective classification and feature selection technique
Feature reduction based on modified dominance soft set
Conditional variable importance for random forests
An AUC-based permutation variable importance measure for random forests
Correlation and variable importance in random forests
A hyper learning binary dragonfly algorithm for feature selection: a COVID-19 case study
A fuzzy based feature selection from independent component subspace for machine learning classification of microarray data
Novel machine learning approach for classification of high-dimensional microarray data
A novel approach for dimension reduction of microarray
Comprehensive analysis of feature selection on early heart Strok prediction
Artificial neural network classification of high dimensional data with novel optimization approach of dimension reduction
Artificial intelligence and machine learning to fight COVID-19
Automatic detection of coronavirus disease (COVID-19) in X-ray and CT images: a machine learning-based approach
Automated detection of COVID-19 cases using deep neural networks with X-ray images
Preliminary bioinformatics studies on the design of a synthetic vaccine and a preventative peptidomimetic antagonist against the SARS-CoV-2 (2019-nCoV, COVID-19) coronavirus
Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
Applications of machine learning and artificial intelligence for Covid-19 (SARS-CoV-2) pandemic: a review
COVID-19 coronavirus vaccine design using reverse vaccinology and machine learning
Predicting the growth and trend of COVID-19 pandemic using machine learning and cloud computing
Large-scale screening of covid-19 from community acquired pneumonia using infection size-aware classification
Regression trees with unbiased variable selection and interaction detection
Coronavirus (covid-19) classification using ct images by machine learning methods
COVID-19) classification using deep features fusion and ranking technique. Big Data Analytics and Artificial Intelligence Against COVID-19: Innovation Vision and Approach

How to cite this article: Almustafa KM. Covid19-Mexican-Patients' Dataset (Covid19MPD) Classification and Prediction Using Feature Importance

Data were used from a publicly available dataset https://www.kaggle.com/tanmoyx/covid19-patient-precondition-dataset/data.

https://orcid.org/0000-0003-2129-7686