key: cord-0022054-wxaxbyw4
authors: Hatmal, Ma’mon M.; Alshaer, Walhan; Mahmoud, Ismail S.; Al-Hatamleh, Mohammad A. I.; Al-Ameer, Hamzeh J.; Abuyaman, Omar; Zihlif, Malek; Mohamud, Rohimah; Darras, Mais; Al Shhab, Mohammad; Abu-Raideh, Rand; Ismail, Hilweh; Al-Hamadi, Ali; Abdelhay, Ali
title: Investigating the association of CD36 gene polymorphisms (rs1761667 and rs1527483) with T2DM and dyslipidemia: Statistical analysis, machine learning based prediction, and meta-analysis
date: 2021-10-14
journal: PLoS One
DOI: 10.1371/journal.pone.0257857
sha: 0015ff3512c547223f524e7a74580063cb6fc620
doc_id: 22054
cord_uid: wxaxbyw4

CD36 (cluster of differentiation 36) is a membrane protein involved in lipid metabolism and has been linked to pathological conditions associated with metabolic disorders, such as diabetes and dyslipidemia. A case-control study including 177 patients with type-2 diabetes mellitus (T2DM) and 173 control subjects was conducted to study the involvement of the CD36 gene rs1761667 (G>A) and rs1527483 (C>T) polymorphisms in the pathogenesis of T2DM and dyslipidemia in the Jordanian population. Lipid profile, blood sugar, gender and age were measured and recorded, and genotyping analysis for both polymorphisms was performed. Following statistical analysis, 10 different neural networks and machine learning (ML) tools were used to predict subjects with diabetes or dyslipidemia. To further understand the role of the CD36 protein and gene in T2DM and dyslipidemia, a protein-protein interaction network and a meta-analysis were carried out. For both polymorphisms, the genotypic frequencies were not significantly different between the two groups (p > 0.05). On the other hand, some ML tools, such as multilayer perceptron, gave high prediction accuracy (≥ 0.75) and Cohen’s kappa (κ) (≥ 0.5). Interestingly, with the K-star tool, the accuracy and Cohen’s κ values were enhanced by including the genotyping results as inputs (0.73 and 0.46, respectively, compared to 0.67 and 0.34 without them). This study confirmed, for the first time, that there is no association between CD36 polymorphisms and T2DM or dyslipidemia in the Jordanian population. Prediction of T2DM and dyslipidemia using these ML tools and such input data is a promising approach for developing diagnostic and prognostic prediction models for a wide spectrum of diseases, especially based on large medical databases.

Introduction

Diabetes mellitus (DM) is a metabolic disorder characterized by high levels of blood glucose due to defective insulin production, insulin action, or both [1] . If left uncontrolled, diabetes can lead to serious health complications that affect various systems of the human body, including blood vessel and nervous system damage, vision complications, cardiovascular disease, and infection [1] . Diabetes affects millions of people worldwide. In 2014, it was reported that nearly 380 million people worldwide had the disease [2] ; this number is constantly increasing, and is expected to grow tremendously in the future. Type-2 DM (T2DM) is the prevalent form of diabetes, accounting for approximately 90% of all diagnosed cases of diabetes in adults [3] . T2DM mainly manifests as low insulin production by pancreatic cells and/or ineffective action of the insulin produced [4] . Many genetic factors and polymorphisms have been investigated in patients with T2DM; we have previously investigated the vitamin D receptor (VDR) gene FokI polymorphism, the DNA-binding domain of the regulatory factor X6 (RFX6) gene, as well as the epoxide hydrolase (EPHX2) gene rs4149243, rs2234914 and rs751142 variants [5] [6] [7] .

CD36 is a membrane glycoprotein receptor that is expressed on a variety of cells and tissues, including platelets, macrophages, adipocytes, hepatocytes, myocytes, and some specialized epithelia of the breast, kidney and gut [8] . The genetic composition and the 2D and 3D protein structures of CD36 are shown in Fig 1. CD36 is a multifunctional signaling receptor with several known ligands, including thrombospondin-1, long-chain fatty acids, and oxidized low-density and high-density lipoproteins (LDL and HDL) [9] . In macrophages, CD36 acts as a scavenger receptor that recognizes specific oxidized phospholipids and LDL, and participates in the internalization of apoptotic cells and certain bacterial and fungal pathogens, contributing to inflammatory responses and atherothrombotic diseases [8] . Also, CD36 functions on adipocytes, enterocytes, hepatocytes and muscles as a facilitator of long-chain fatty acid transport, participating in intestinal fat absorption, muscle lipid utilization, and adipose energy storage [10, 11] .

It has been shown that CD36 is involved in lipid metabolism and homoeostasis and has been linked to pathological conditions associated with metabolic disorders, such as obesity, insulin resistance, diabetes, dyslipidemia and atherosclerosis [9, 14-17] . The mechanistic role of CD36 in metabolic diseases seems to be complex and is yet to be resolved. However, the contribution of CD36 to cellular lipid transport and intracellular accumulation of lipid is expected to cause lipotoxicity and, hence, insulin dysregulation and resistance [18] . The CD36 protein is encoded by a gene located on chromosome 7q11.2 that has 15 exons [19] . It has been reported that genetic mutations in the CD36 gene could be associated with the pathogenesis of T2DM [20] [21] [22] [23] . The rs1761667 (G>A) and rs1527483 (C>T) polymorphisms are two main single nucleotide polymorphisms (SNPs) in the CD36 gene that have been previously studied in T2DM [24] . The aim of the current study is to assess the potential association of the rs1761667 and rs1527483 polymorphisms with T2DM and dyslipidemia in the Jordanian population. Although dyslipidemia was shown by previous studies to be associated with T2DM, not every patient with T2DM necessarily has dyslipidemia [25, 26] .

Fig 1 (caption, partial). [...] [12] . Based on a structure file obtained from the Protein Data Bank (PDB) (PDB ID: 5LGD), the middle illustration shows the three-dimensional structure of the CD36 protein extracellular domain, with the two main entrances for fatty acids indicated in red. The blue residues are NAG-glycosylated asparagine residues. Pro191 is a mutation at the frontage of entrance 1 which affects binding with [...]

On the other hand, recent years have witnessed an unprecedented development in the use of machine learning (ML) in various biotechnology, biomedicine, medical imaging and healthcare applications [27] [28] [29] [30] . Supervised ML tools can be utilized to build predictive models, which involve the implementation of statistical means for learning and predicting disease status, either including or excluding the polymorphism genotypes [5, [31] [32] [33] . The following popular ML algorithms were evaluated in the current research to predict T2DM and dyslipidemia based on the clinical parameters, demographic and polymorphism data: random forest (RF) [34] [35] [36] ; naïve Bayesian (NB) [37] [38] [39] [40] ; eXtreme Gradient Boosting (XGBoost) [41] [42] [43] ; k-nearest neighbors (kNN) [44] [45] [46] ; support vector machine (SVM) [47, 48] ; probabilistic neural networks (PNN) [49] [50] [51] [52] [53] ; multilayer perceptron (MLP) [54, 55] ; adaptive boosting (AdaBoost) [56, 57] ; gradient boost [58, 59] ; and K-star (K*) [60, 61] . It was reported that the odds ratio for each T2DM risk allele varies between 1.02 and 1.35. To produce improved prediction results for complicated polygenic traits, recent polygenic risk score models integrate expanded SNP selection [62, 63] . The goal of this research is to see how a small number of polymorphisms can improve machine learning prediction based on clinical and demographic data. However, ML needs to be validated vis-à-vis statistical accuracy (i.e., predictability).
Moreover, a protein-protein interaction network was used towards further understanding of CD36 interactions. Also, to compare our results with the previous findings, the first meta-analysis for the association of these two polymorphisms with diabetes was performed.

The mean (standard deviation, SD) ages of the T2DM patients (n = 177) and the control group (n = 173) were 50.8 (13.9) years and 57.4 (11.6) years, respectively. Based on t-test, there was no significant difference between the two groups. The distribution of the data among categorical classes is summarized in Fig 2. Baseline data of the study subjects are available in S1 Table. The differences between T2DM and control groups based on age, gender, FBS, and lipid parameters are shown in Table 1 , and none of them has shown significant differences between the two groups. Table 2 shows the distribution of samples used in the present study based on their CD36 polymorphisms. The genotypic and allelic frequencies, as well as the exact tests for Hardy-Weinberg equilibrium, are shown in S2 Table. The D value for linkage disequilibrium analysis between the two SNPs is very low (i.e., −0.01), which indicates that the gamete is not more frequent than expected. Table 3 shows the odds ratios and related p-values for the different rs1761667 genotypes in T2DM under different genetic models, while Table 4 shows them for rs1527483. None of the genotypes, under any genetic model, was significantly different between the T2DM and control groups. The stratified distributions among females and males, and their corresponding odds ratios and p-values, are shown in S3 and S4 Tables for rs1761667 and rs1527483, respectively. Haplotype frequencies among all people in the study are shown in S5 Table and their gender cross-classification is shown in S6 Table. Haplotype association with T2DM, based on the odds ratio test and its p-value, is shown in Table 5 , together with the haplotype frequencies among females and males. None of the haplotypes was significantly different between the T2DM and control groups.

On the other hand, Table 6 shows the odds ratios and p-values for the different rs1761667 genotypes in dyslipidemia under different genetic models, while Table 7 shows them for rs1527483. The stratified distributions among females and males are shown in S7 and S8 Tables for rs1761667 and rs1527483, respectively.
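To make the odds ratio (OR) comparisons reported in the tables above more concrete, the following minimal Python sketch computes an OR with a Wald 95% confidence interval from a 2×2 table of genotype group by disease status. The function name and the counts are hypothetical illustrations, not data from this study, and the "SNPStats" tool may use a different procedure (for example, one based on logistic regression).

```python
import math

def odds_ratio_with_ci(case_exposed, case_unexposed, ctrl_exposed, ctrl_unexposed, z=1.96):
    """Return (OR, lower, upper) for a 2x2 table using the Wald method on the log scale."""
    odds_ratio = (case_exposed * ctrl_unexposed) / (case_unexposed * ctrl_exposed)
    # Standard error of log(OR): square root of the summed reciprocal cell counts.
    se_log_or = math.sqrt(
        1 / case_exposed + 1 / case_unexposed + 1 / ctrl_exposed + 1 / ctrl_unexposed
    )
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se_log_or), math.exp(log_or + z * se_log_or)

# Hypothetical counts: carriers vs. non-carriers of a genotype among cases and controls.
or_value, ci_low, ci_high = odds_ratio_with_ci(60, 117, 55, 118)
print(f"OR = {or_value:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

When the confidence interval contains 1, as it does for these made-up counts, the corresponding p-value exceeds 0.05 and no association is inferred.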

Principal Component Analysis (PCA). The PCA function performs a principal component analysis (PCA) on the given data. The input data is projected from its original feature space into a space of (possibly) lower dimension with a minimum of information loss. Fig 3 represents the PCA for the normal subjects (status = 0, green color) and subjects with disease (status = 1, red color).
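As an illustration of the projection described above, the sketch below finds the first principal component of a small data set by power iteration on the sample covariance matrix, using only the Python standard library. This is a simplified stand-in, not the PCA implementation used in the study; the function name and the toy data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component of `rows`."""
    n = len(rows)
    dims = len(rows[0])
    # Center each feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    # Sample covariance matrix.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue,
    # provided the start vector is not orthogonal to it and the data are not constant.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Project the centered data onto the component (one score per subject).
    scores = [sum(c[j] * vec[j] for j in range(dims)) for c in centered]
    return vec, scores

# Toy data: two perfectly correlated features, so one component carries all the variance.
direction, scores = first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])
print(direction, scores)
```

In practice features such as age and lipid values are on different scales, so they would usually be standardized before the projection; that step is omitted here for brevity.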

Predicting diabetes and dyslipidemia for the testing set using different ML models. A heat map was generated for all parameters (input and output), except for those that have either binomial or discrete values (gender and polymorphisms). TG is moderately associated with BS, and a strong correlation between LDL and TC is clearly present (Fig 4) . Pearson's correlation (r, as shown on the Y axis of Fig 4) was used as the measure of correlation; r > 0.7 is considered a strong correlation, r between 0.4 and 0.7 a moderate correlation, and r < 0.4 a weak correlation [64] .
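The correlation measure and the strength categories above can be sketched in a few lines of Python. The function names and the sample values are hypothetical; applying the thresholds to the absolute value of r, so that negative correlations are graded the same way, is an assumption not stated in the text.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_strength(r):
    """Grade r with the thresholds from the text: > 0.7 strong, 0.4 to 0.7 moderate, < 0.4 weak."""
    magnitude = abs(r)  # assumption: sign is ignored when grading strength
    if magnitude > 0.7:
        return "strong"
    if magnitude >= 0.4:
        return "moderate"
    return "weak"

# Hypothetical LDL and TC values (mmol/L) for five subjects.
ldl = [2.1, 2.9, 3.4, 4.0, 4.6]
tc = [3.9, 4.8, 5.1, 6.0, 6.7]
r = pearson_r(ldl, tc)
print(round(r, 3), correlation_strength(r))
```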

For prediction using different ML tools, a confusion matrix was built for the testing set for each group of the selected inputs (Table 8). The number of people with the disease who were randomly selected into the testing set is "a + c", and the number of people with no disease randomly selected into the testing set is "b + d".

Accuracy was calculated according to the following formula (Eq 1):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

where TP is true positive; TN is true negative; FP is false positive; and FN is false negative. True positive rate (TPR or sensitivity) was calculated by the formula (Eq 2):

$$\mathrm{TPR} = \frac{TP}{TP + FN} \qquad (2)$$

while true negative rate (TNR or specificity) was calculated by the formula (Eq 3):

$$\mathrm{TNR} = \frac{TN}{TN + FP} \qquad (3)$$

Table 9 shows the accuracy, Cohen's κ, TPR and TNR values for predicting T2DM with different ML tools, for both 5-fold cross validation and the 20% testing set, based on all input data: lipid profile (TG, TC, LDL, and HDL), dyslipidemia status, age, gender, rs1761667 genotype, and rs1527483 genotype, while Table 10 shows the prediction results based on all input data used in Table 9 , excluding the polymorphism genotypes. MLP was also used to predict dyslipidemia for both 5-fold cross validation and the 20% testing set based on all input data and data excluding polymorphisms (Table 11 ).

Table 1 note: Some people have missing data as indicated in S1 Table.
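The three evaluation measures can be computed directly from the four cells of the confusion matrix. The short sketch below uses the cell labels of Table 8 (a = true positives, b = false positives, c = false negatives, d = true negatives); the function name and the counts are hypothetical and are not taken from Tables 9 to 11.

```python
def classification_metrics(a, b, c, d):
    """Accuracy, sensitivity (TPR) and specificity (TNR) from confusion-matrix cells.

    a = true positives, b = false positives, c = false negatives, d = true negatives.
    """
    total = a + b + c + d
    return {
        "accuracy": (a + d) / total,
        "tpr": a / (a + c),  # share of diseased subjects correctly predicted
        "tnr": d / (b + d),  # share of non-diseased subjects correctly predicted
    }

# Hypothetical testing set of 70 subjects (35 with the disease, 35 without).
metrics = classification_metrics(a=28, b=9, c=7, d=26)
print(metrics)
```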

The process of article identification and selection is illustrated in Fig 5. A total of 313 articles were found in different databases. Duplicate records and those that did not meet the inclusion criteria were removed. Thereafter, only one study, in which both rs1761667 and rs1527483 polymorphisms were studied, was eligible and included in the quantitative synthesis for T2DM ( Fig 5) . The results of the meta-analysis are shown in Fig 6. For CD36 rs1761667, there was no significant difference in the allele frequencies, while the genotypes (AA and GA vs. GG) and (AA vs. AG and GG), in the dominant and recessive models, respectively, were significantly different between the T2DM and control groups (p < 0.05), even though our data show no significant difference (p > 0.05). For the other CD36 polymorphism (i.e., rs1527483), there was no significant association with T2DM for either alleles or genotypes (p > 0.05, Fig 6) .

Variations in CD36 have been linked to several traits and conditions, such as altered sensory perception, diabetes, and coronary heart disease [21, 66] . The role of CD36 in lipid metabolism and in the pathogenesis and prevention of T2DM has attracted widespread attention. CD36 acts as a receptor for a broad range of ligands. Ligands can be of proteinaceous nature, like thrombospondin, fibronectin, collagen or amyloid-beta, as well as of lipidic nature, such as oxidized low-density lipoprotein (OxLDL), anionic phospholipids, long-chain fatty acids and bacterial diacylated lipopeptides.

p-value obtained based on odds ratio (OR) test.

Table 7 . Polymorphism rs1527483 association with response dyslipidemia based on "SNPStats" analysis tool.

Fig 3 (caption, partial). In the upper graph, five input features (LDL, HDL, TG, TC and Age) were calculated for subjects with diabetes (red squares) compared to all control subjects (green squares). In the lower graph, two input features (FBS and Age) were calculated for all subjects with dyslipidemia (red squares) compared to all control subjects (green squares). The clinical parameters used to define the diseases (lipid profile for dyslipidemia, and FBS for T2DM) were excluded from the input data as described before.

Fig 4 (caption, partial). The color scale is displayed at the right corner. A positive correlation is indicated by light colors (i.e., yellow), while a negative correlation is indicated by dark colors (i.e., dark purple). Triglyceride (TG) is moderately associated with FBS, and there is a strong correlation between low-density lipoprotein (LDL) and TC.

Table 8 . General shape of the confusion matrix.

                          Condition (disease)    No condition
Predicted condition       a                      b
Predicted no condition    c                      d
Total                     a + c                  b + d

"a" represents the people correctly predicted to have the disease (true positives), "b" represents the people falsely predicted to have the disease (false positives), "c" represents the people falsely predicted to have no disease (false negatives), and "d" represents the people correctly predicted to have no disease (true negatives). "a + c" represents the total number of people with the disease, while "b + d" represents the total number of people with no disease.

They are generally multivalent and can therefore engage multiple receptors simultaneously. The resulting formation of CD36 clusters initiates signal transduction and internalization of receptor-ligand complexes [12, 67, 68] . Multiple observational studies reported a correlation between CD36 polymorphisms and T2DM [24, 67, 69] . CD36 is involved in functional and physical interactions with many proteins, for example SRC, peroxisome proliferator-activated receptor gamma (PPARG) and toll-like receptor 4 (TLR4). SRC is one of the key regulators of lipid metabolism and diabetes pathogenesis. After activation, it participates in signaling pathways that control a diverse spectrum of biological activities including gene transcription, immune response, cell adhesion, cell cycle progression, apoptosis, migration, and transformation [70] . Fig 7 represents the potential cellular and molecular mechanisms of action for CD36 upon activation by OxLDL, which could be implicated in the development of dyslipidemia and T2DM [71, 72] . Also, for further understanding of CD36 associations with other proteins, a protein-protein interaction network for CD36 is shown in Fig 7.

The purpose of this case-control study was to assess how the distribution of haplotypes, genotypes and alleles of the CD36 polymorphisms affects the prevalence of T2DM and dyslipidemia in the Jordanian population. Two major groups were considered: the T2DM patients' group and the control group. The control group did not deviate from the HWE (p > 0.05) (S2 Table) .

In this comparison, the CD36 polymorphisms and their respective genotypes were assessed. There were no statistically significant differences for these polymorphisms (p > 0.05) for either T2DM or dyslipidemia. The frequency of the minor allele in the CD36 polymorphisms was approximately the same in T2DM patients and control subjects. These results agree with those of Banerjee et al. for rs1527483, but not for rs1761667 [24] . Another study on Egyptian people, involving 100 patients with metabolic syndrome (MS) and 100 control samples, showed that the rs1761667 variant was significantly associated with risk of MS [79] . The association results of the CD36 rs1761667 and rs1527483 polymorphisms with T2DM and MS in different populations are summarized in Table 12 . In this table, the outcome (Yes or No) indicates whether or not there is an association between the CD36 rs1761667 and rs1527483 polymorphisms and T2DM or metabolic syndrome.

Machine learning to predict T2DM based on clinical, demographic, and genetic data 

Several ML learners were evaluated against the training and testing sets, namely, XGBoost, SVM (C and nu), RF, PNN, NB, kNN, MLP, AdaBoost, and gradient boost. Age, gender, and lipid profile, with and without polymorphism genotypes, were evaluated as input descriptors, as indicated in Tables 10 and 11, respectively. Clearly from Table 10 , all learners achieved good accuracy; this indicates that the data is self-consistent and predictive. Still, in most cases, including the polymorphism genotypes did not yield apparently better accuracy compared to excluding them, except for the K-star (K*) ML tool. K* can handle noisy data and requires less training time; its performance improves further with large datasets [81] . Nevertheless, despite the excellent accuracies (i.e., accuracy > 70%) [82] of some ML models (e.g., MLP, RF, and XGBoost) in the present study, the need to evaluate the behavior of the ML tools prompted us to use Cohen's κ as an additional success criterion for the resulting ML models. Cohen's κ is a more robust measure than accuracy, as it takes into account the possibility of prediction by chance [83] . Fleiss's [84] equally arbitrary guidelines characterize κ over 0.75 as excellent, 0.40 to 0.75 as fair to good, and below 0.40 as poor.
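To show how Cohen's κ corrects accuracy for chance agreement, the following sketch computes κ from a 2×2 confusion matrix (a = true positives, b = false positives, c = false negatives, d = true negatives) and labels it with the Fleiss guidelines quoted above. The function names and counts are hypothetical, and assigning the boundary value 0.75 to "fair to good" is an assumption.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a binary confusion matrix."""
    total = a + b + c + d
    observed = (a + d) / total
    # Agreement expected by chance: product of predicted and actual marginals per class.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / total ** 2
    return (observed - expected) / (1 - expected)

def fleiss_label(kappa):
    """Fleiss guidelines: > 0.75 excellent, 0.40 to 0.75 fair to good, < 0.40 poor."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "fair to good"
    return "poor"

# Hypothetical testing set: accuracy is 54/70 (about 0.77), but kappa is lower.
kappa = cohens_kappa(a=28, b=9, c=7, d=26)
print(round(kappa, 3), fleiss_label(kappa))
```

With balanced classes, as in this example, chance agreement is 0.5, so an accuracy of about 0.77 corresponds to a κ of only about 0.54.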

Three learners (C-SVM, nu-SVM, and kNN) failed to yield significant Cohen's κ values for both 5-fold cross validation and the 20% testing set. On the other hand, several learners, namely RF, XGBoost, gradient boost, and AdaBoost, yielded good κ values for 5-fold cross validation but not for the 20% testing set.

Interestingly, MLP produced good Cohen's κ scores for both models. Only PNN produced a better Cohen's κ score for the testing set than for 5-fold cross validation. The artificial neural network (MLP) is well-known for its high performance and accuracy. Furthermore, due to the increasing size and complexity of the data, Deep Learning (DL) has been introduced as an improvement to ANN. Recent studies that have used DL produced remarkable results [85, 86] .

Moreover, a noticeable enhancement in κ score was obtained by including rs1761667 and rs1527483 when using K* and AdaBoost, which highlights the importance of such ML tools in large databases. The best ML tool for predicting T2DM, MLP, was also used to predict dyslipidemia based on FBS, age, and gender, with and without the rs1761667 and rs1527483 polymorphism genotypes. Despite the fair accuracies (particularly in 5-fold cross validation), it failed to yield a good Cohen's κ score.

Lai et al. used the most recent records of 13,309 Canadian patients aged between 18 and 90 years, along with their demographic and clinical information (age, sex, FBS, body mass index (BMI), HDL, TG, blood pressure (BP) and LDL). Predictive models were built using Logistic Regression and Gradient Boosting Machine (GBM) techniques. They also compared these models to other machine learning techniques such as Decision Tree and RF. According to their findings, the AROC for the proposed GBM model is 84.7% with a sensitivity of 71.6%, and the AROC for the proposed Logistic Regression model is 84.0% with a sensitivity of 73.4%. The GBM and Logistic Regression models perform better than the Random Forest and Decision Tree models [87] .

Moreover, in the study by Muhammad et al., a diagnostic dataset of T2DM was collected and used to develop predictive supervised machine learning models based on logistic regression, SVM, kNN, RF, NB and gradient boosting algorithms, using age, family history, glucose, TC, BP, HDL, TG and BMI as inputs. The random forest predictive model appeared to be one of the best developed models, with an accuracy of 88.76% [44] .

Fig 6 (caption, partial). [...] [24] ) and plugged in the software to do the calculations shown above.

Fig 7 (caption, partial). [...] initiates the activation of CD36, which binds to membrane-associated Src family non-receptor tyrosine kinases. This interaction enables three main cytoplasmic signaling domains: nuclear factor kappa B (NF-κB), mitogen-activated protein kinase kinase kinase 2 (MEKK2), and Vav-mediated signaling pathways. Activation of NF-κB contributes to inducing the levels of thioredoxin interacting protein (TXNIP), which in turn activates the NLR family pyrin domain containing 3 (NLRP3) and promotes the production of inflammatory cytokines (e.g., TNF and IL-1). Also, by inducing microRNA (miR-204) expression, which targets the insulin transcription factor (MafA), TXNIP contributes to inhibiting insulin production, and thus to pancreatic beta-cell dysfunction. Activation of Vav-mediated signaling pathways results in complicated cellular mechanisms ending in increased generation of reactive oxygen species (ROS), due to promoting nicotinamide adenine dinucleotide phosphate (NADPH) oxidase activation, free fatty acid (FFA) uptake, and Ox-LDL uptake. Furthermore, through the c-Jun N-terminal protein kinase (JNK1/2) pathway, MEKK2 also promotes Ox-LDL uptake. Thereafter, the excessive generation of ROS causes oxidative damage, and thus results in pancreatic beta-cell dysfunction and the evolution of dyslipidemia. The exacerbation of beta-cell dysfunction is involved in the progression of T2DM [71] [72] [73] [74] [75] [76] [77] [78] . In the middle illustration, the edges indicate both functional and physical protein associations. Settings included a minimum interaction score of 0.15. The maximum number of interactions is 10 in the first shell and 10 in the second shell. In the lower illustration, the edges indicate that the proteins are part of a physical complex. Settings included a minimum interaction score of 0.4. The maximum number of interactions is 20 in the first shell, and none in the second shell. Line thickness indicates the strength of data support. Both the middle and lower illustrations were created using the STRING database.

Such tools can be implemented in the future for larger databases which include an extensive number of cases and input features. In such a case, feature selection and weighting tools (e.g., genetic algorithm, SHAP, and stepwise forward and reverse methods) can be implemented to select the best predicting subsets of input features (Fig 8) .

Fig 8. A suggested efficient platform for medical databases with a large number of inputs (features). The platform starts with feeding an input file. The genetic algorithm loop randomly selects different subsets (chromosomes) of features to use in the prediction process; it starts with "feature selection start" and ends with "feature selection end", and this algorithm saves extensive time in analyzing large databases. Partitioning divides the data into a training set and a testing set, and the scorer evaluates the prediction accuracy/Cohen's κ for the testing set, so that the results for different chromosomes (subsets of features) can be evaluated. Different learners and predictors can be selected and evaluated for their performance in prediction (adapted from Hatmal et al. [5] ).

The present meta-analysis of CD36 included only one article, in addition to the results of the present study. This meta-analysis was performed under various genetic models, including allelic, homozygous, heterozygous, and dominant models. Random effects meta-analysis has become the standard to combine treatment effects from several studies when the presence of between trial heterogeneity is suspected, which is often the case [88] . In the present meta-analysis, no significant association was found between CD36 rs1527483 and T2DM (p > 0.05, Fig 6) . For rs1761667, no significant difference was found in the allele frequencies. However, the genotypes (AA and GA vs. GG) and (AA vs. GA and GG), in the dominant and recessive models, respectively, were significantly different between the T2DM and control groups (p < 0.05), even though our present study had shown no significant difference (p > 0.05). The heterogeneity score for these rs1761667 genetic models may represent substantial heterogeneity, which usually results from studies whose confidence intervals (generally depicted graphically using horizontal lines) overlap poorly. This may substantiate performing such a meta-analysis on more studies in the future. Meta-analysis of two studies is not uncommon in some diseases; however, it was concluded that confidence intervals based on normal quantiles do not have the right coverage and cannot be recommended for use in the case of two studies [88] . While a definite answer to this challenging problem is under dispute, the proposed Bayesian approach works well in many cases. In general, the current methods of meta-analysis have severe limitations, which may be addressed with future research. Until these limitations are resolved, it is recommended to meta-analyze two heterogeneous studies in a Bayesian way using plausible priors [88] .
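For readers unfamiliar with the random-effects model and heterogeneity score discussed above, the sketch below pools two log odds ratios with the conventional DerSimonian-Laird estimator and reports I². It is only an illustration under stated assumptions: the inputs are hypothetical, the function name is invented, the software used in this study may compute these quantities differently, and the normal-quantile confidence interval shown here is exactly the kind of interval that the cited work cautions against when only two studies are available.

```python
import math

def random_effects_pool(log_ors, variances, z=1.96):
    """DerSimonian-Laird random-effects pooling of log odds ratios."""
    weights = [1 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, log_ors)) / sum(weights)
    # Cochran's Q measures dispersion of the study effects around the fixed-effect mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, log_ors))
    df = len(log_ors) - 1
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c)  # between-study variance
    re_weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, log_ors)) / sum(re_weights)
    se = math.sqrt(1 / sum(re_weights))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "pooled_or": math.exp(pooled),
        "ci": (math.exp(pooled - z * se), math.exp(pooled + z * se)),
        "tau2": tau2,
        "i2": i2,
    }

# Hypothetical log odds ratios and variances for two studies with discordant results.
result = random_effects_pool(log_ors=[0.1, 0.9], variances=[0.04, 0.05])
print(result)
```

With these made-up inputs, I² is well above 50%, which is the range usually described as substantial heterogeneity.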

There were some limitations to the current investigation. To begin, a larger sample size of patients and controls may be required to better understand the influence of the CD36 rs1761667 and rs1527483 polymorphisms on T2DM. Furthermore, machine learning could be used to examine polymorphisms in other CD36 genotypes and their potential interactions with rs1761667 and rs1527483 variants.

On the other hand, glycated hemoglobin (HbA1c) and other dietary information could be included in the future to assess the relationship between its level and lipid profile. Furthermore, upon implementing these ML tools on a greater number of features, future perspectives could incorporate genetic function algorithms for features reduction and feature importance tools to weight which features significantly contribute to the risk of developing T2DM.

A total of 350 blood samples (177 samples from T2DM patients, and 173 from subjects with no diabetes) were collected from the Jordanian population. To avoid bias in the selection of control subjects, they were randomly selected for having no diabetes, but may have had other conditions such as obesity, blood pressure and dyslipidemia. Diabetic participants enrolled in this study had a known history of diabetes and were recruited from Jordan University Hospital (JUH). The study was conducted in accordance with the Declaration of Helsinki and was approved by the Institutional Review Board (IRB), JUH, and an informed consent form was obtained from each participant.

After fasting overnight (8-10 hrs), a total of 10 ml blood (5 ml in a plain tube, and 5 ml in an EDTA tube) was collected from every participant. Serum was collected from the plain tubes after centrifugation, and then used to measure the levels of fasting blood sugar (FBS) and lipid profile parameters, including total cholesterol (TC), triglycerides (TG) and HDL, by using a Cobas C111 analyzer (Roche Diagnostics, Indianapolis, IN, USA). The level of LDL was calculated according to Friedewald's equation [89] (Eq 4). Dyslipidemia was defined as having at least one of the following conditions: TC ≥ 6.2 mmol/L (240 mg/dL); TG ≥ 2.3 mmol/L (200 mg/dL); HDL ≤ 1.0 mmol/L (40 mg/dL); LDL ≥ 4.1 mmol/L (160 mg/dL) [90] .
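Eq 4 is not reproduced in this text. As a hedged illustration, the sketch below uses the commonly cited form of Friedewald's equation in mmol/L, LDL = TC − HDL − TG/2.2 (TG/5 when working in mg/dL), together with the dyslipidemia cut-offs listed above, read as TC, TG and LDL at or above, and HDL at or below, the stated values. The function names and the sample values are hypothetical.

```python
def friedewald_ldl_mmol(tc, hdl, tg):
    """Estimate LDL cholesterol (mmol/L) from TC, HDL and TG (all in mmol/L).

    The estimate is generally considered unreliable when TG exceeds about 4.5 mmol/L.
    """
    return tc - hdl - tg / 2.2

def has_dyslipidemia(tc, tg, hdl, ldl):
    """True if at least one lipid parameter crosses its cut-off (values in mmol/L)."""
    return tc >= 6.2 or tg >= 2.3 or hdl <= 1.0 or ldl >= 4.1

# Hypothetical subject with an unremarkable lipid profile.
ldl = friedewald_ldl_mmol(tc=5.0, hdl=1.3, tg=1.1)
print(round(ldl, 2), has_dyslipidemia(tc=5.0, tg=1.1, hdl=1.3, ldl=ldl))
```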

In the present study, subjects with dyslipidemia were defined based on lipid parameters from both T2DM and control groups. This was done because the diabetes status and FBS were used together with the polymorphisms to predict dyslipidemia.

DNA was extracted from whole-blood samples (EDTA tubes) using the Wizard genomic DNA purification kit (Promega Corporation, Madison, WI, USA), and then a NanoDrop™ 2000/2000c Spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) was used to assess the concentration and purity (A260/A280) of the extracted DNA. The extracted DNA was stored at −20 °C until used. Polymerase Chain Reaction (PCR) was used to amplify two CD36 SNPs: rs1527483 (C>T) in intron 11, and rs1761667 (G>A) in the -31118 promoter region of exon 1A. The PCR was performed in a total volume of 25 μl per reaction, containing 50 ng genomic DNA, 5 μl of 5x FIREPol® Master Mix (Solis BioDyne, Tartu, Estonia) and 1 μM of each primer (Gene Link, Hawthorne, NY, USA), by using a C1000 Touch™ thermal cycler (Bio-Rad, Hercules, CA, USA). The PCR primers and conditions were as described in Table 13 .

Electrophoresis was used to evaluate PCR amplification by verifying the migration of DNA fragments in an agarose gel prepared with 1x Tris-borate-EDTA (TBE) buffer at a concentration of 2.5% (m/v) (2.5 g agarose with 100 mL 1x TBE) and stained with 5.0 μL RedSafe™ Nucleic Acid Staining Solution (iNtRON Biotechnology, Seoul, South Korea). Subsequently, 3 μL of the amplified PCR products were loaded and the DNA fragments migrated through the gel at 120 Volt for 30 minutes. The gel was then visualized under a UV Transilluminator (UVP Bioimaging System, Upland, CA, USA) to compare the molecular weight of the DNA fragments against a 100 bp DNA ladder.

PCR purification was done by using ExoSAP-IT™ kit (Applied Biosystems, Waltham, MA, USA), to eliminate and neutralize PCR residuals, before sending selected samples for DNA sequencing using ABI3730xl DNA Analyzer (Applied Biosystems, Waltham, MA, USA) with big dye terminator version 3.1 kit at Macrogen Inc. (Seoul, South Korea). The determined sequences were aligned with the reference sequence of the CD36 gene that was downloaded from the NCBI-reference sequences (accession number: NG_008192.1) [91] .

The statistical analysis (i.e., odds ratio test, t-test, and chi-square test) was conducted using SPSS version 16.0 (SPSS Inc., Chicago, IL, USA) and the web tool "SNPStats" (www.snpstats.net/analyzer.php) [92] . The Comprehensive Meta-Analysis (CMA) software package was used for the meta-analysis. To verify whether the control group of the present study satisfied the assumptions of the Hardy-Weinberg law, genotype distributions between groups were determined and the Hardy-Weinberg equilibrium (HWE) test was carried out.
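The idea behind the HWE check can be sketched as follows: allele frequencies are estimated from the observed genotype counts, expected genotype counts are derived from them, and the two are compared. Note that this sketch uses a chi-square goodness-of-fit approximation with one degree of freedom, whereas the study reports exact tests (S2 Table); the function name and the counts are hypothetical.

```python
import math

def hwe_chi_square(n_major_hom, n_het, n_minor_hom):
    """Chi-square test (1 df) for deviation from Hardy-Weinberg equilibrium.

    Assumes both alleles are present; a monomorphic sample would divide by zero.
    """
    n = n_major_hom + n_het + n_minor_hom
    p = (2 * n_major_hom + n_het) / (2 * n)  # major allele frequency
    q = 1 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_major_hom, n_het, n_minor_hom]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical control-group genotype counts (e.g., GG, GA, AA).
chi2, p_value = hwe_chi_square(60, 82, 31)
print(round(chi2, 3), round(p_value, 3))
```

A p-value above 0.05 is read as no detectable deviation from HWE.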

For the aim of predicting diabetes or dyslipidemia, several orthogonal ML tools (i.e., tools that use different classification protocols, themes and concepts) were utilized, including MLP (logistic function), SVM, XGBoost, RF, AdaBoost, gradient boost, PNN, NB and K*. All models were built using version 4.1.3 of KNIME Analytics Platform (KNIME AG, Zurich, Switzerland). The data were either evaluated by 5-fold cross validation or divided into a training set (80%) and a testing set (20%). The input-output training set contained the polymorphism genotypes, gender, age, and clinical parameters (i.e., lipid profile (to predict diabetes) or blood sugar (to predict dyslipidemia)) as inputs, and either the diabetes status (1 for a person with diabetes and 0 for a person without diabetes) or the dyslipidemia status (1 for a person with dyslipidemia and 0 for a person without dyslipidemia) as output. The clinical parameters used to define each disease (lipid profile for dyslipidemia, and FBS for T2DM) were excluded from the input data for that disease. Herein, we aimed to predict dyslipidemia and T2DM based on other independent factors; there would be no added value in using ML tools to predict a disease from the same criteria by which it is defined.
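The two data-partitioning schemes mentioned above (an 80%/20% split and 5-fold cross validation) can be sketched as index bookkeeping. This is an illustrative sketch, not the KNIME partitioning nodes: the helper names are hypothetical, and plain random shuffling without stratification is an assumption, since the sampling scheme is not stated here.

```python
import random

def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Randomly split indices 0..n-1 into (train, test) lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists; each index is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, folds[i]

n_subjects = 350  # 177 patients + 173 controls, as reported in the abstract
train_idx, test_idx = train_test_split_indices(n_subjects)
print(len(train_idx), len(test_idx))
for fold_train, fold_test in k_fold_indices(n_subjects):
    print(len(fold_train), len(fold_test))
```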

Random Forest (RF). RF is a versatile ML approach [34] [35] [36] based on an ensemble of decision trees (DTs), with each tree independently predicting a classification and "voting" for the related class, and the majority of the votes deciding the overall RF prediction [42]. Within the KNIME Analytics Platform, we constructed an RF learner node with the following settings: the splitting criterion was the information gain ratio and the number of trees was 100. There were no restrictions on the tree depth (number of levels) or the minimum node size. Out-of-bag internal validation was used to calculate the accuracy.

eXtreme gradient boosting (XGBoost). XGBoost employs an ensemble of weak DT-type models to generate boosted DT-type models. This system incorporates a unique tree learning algorithm as well as a theoretically justified weighted quantile sketch technique with parallel and distributed computation [41, 42, 93]. We constructed the XGBoost learner node within the KNIME Platform as follows: tree booster with depth-wise grow policy, boosting rounds = 100, eta = 0.3, gamma = 0, maximum depth = 6, minimum child weight = 1, maximum delta step = 0, sub-sampling rate = 1, column sampling rate by tree = 1, column sampling rate by level = 1, lambda = 1, alpha = 0, sketch epsilon = 0.03, scaled position weight = 1, maximum number of bins = 256, sample type (uniform), normalize type (tree), and dropout rate = 0.
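The information gain ratio named above as the RF splitting criterion can be illustrated for a single nominal feature such as a genotype. The sketch below follows the textbook definition (information gain divided by the split information); it is an assumption that the KNIME node computes the same quantity, and the genotype labels and disease statuses are toy values.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain of a nominal feature divided by its split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0

# Toy example: genotype at one SNP versus disease status (hypothetical values)
status = [1, 1, 0, 0]
print(gain_ratio(["GG", "GG", "AA", "AA"], status))  # perfectly informative split
print(gain_ratio(["GG", "AA", "GG", "AA"], status))  # uninformative split
```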

k-Nearest Neighbors (kNN). The kNN classifier is based on a distance learning methodology that calculates an unknown member's disease status based on the disease status of a set number (k) of nearest neighbors in the training set. A distance metric is used to quantify similarity in this classifier [94] . With k = 6, we implemented the kNN Learner node within the KNIME Analytics Platform.
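The neighbor-voting rule described above can be sketched in a few lines with k = 6. The Euclidean distance metric and the tie-breaking rule are assumptions made for illustration; the toy points are hypothetical, and in practice features on different scales (e.g., age and lipid values) would usually be normalized first.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=6):
    """Predict the class of `query` by majority vote among its k nearest neighbors."""
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(y for _, y in neighbors[:k])
    # With an even k, ties are possible; this sketch resolves them in favor of
    # the tied class that appears first among the sorted neighbors.
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training data with binary disease status
toy_x = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]
print(knn_predict(toy_x, toy_y, (0.5, 0.5)))
print(knn_predict(toy_x, toy_y, (10.5, 10.5)))
```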

Probabilistic Neural Network (PNN). PNN is trained on labeled data with the DDA (Dynamic Decay Adjustment) method, using constructive training of probabilistic neural networks. This algorithm generates rules based on numeric data. Each rule is defined as a high-dimensional Gaussian function that is adjusted by two thresholds, theta minus and theta plus, to avoid conflicts with rules of different classes [95, 96]. We implemented the PNN Learner node within KNIME Analytics Platform using theta minus = 0.2 and theta plus = 0.4, without specifying a maximum number of epochs, so that the process is repeated until a stable rule model is achieved.

Naïve Bayesian (NB). NB is a simple classifier that predicts and assigns class labels to external data based on vectors of descriptors for a finite set of training observations. The NB classifier posits that each descriptor contributes independently to the probability that an observation belongs to a specific class (e.g., disease or no disease) [37] [38] [39] [40] . The chance of an observation belonging to a specific class is calculated by multiplying the individual probabilities of that class within each individual descriptor [37] [38] [39] [40] . We implemented NB learner node within KNIME Analytics Platform with the following parameters: default probability = 0.0001, minimum standard deviation = 0.0001, threshold standard deviation = 0.0 and maximum number of unique nominal values per attribute = 20.

Multilayer Perceptron (MLP). MLP is a multilayer feed-forward network trained with the RProp algorithm [97]. MLP is capable of learning nonlinear models in real time. Between the input and output layers of an MLP, one or more nonlinear hidden layers can exist, and a varied number of hidden neurons can be allocated to each hidden layer. Each hidden neuron computes a weighted linear sum of the previous layer's values and then applies a nonlinear activation function. After the output layer transforms the values from the last hidden layer, the output values are reported. We implemented the MLP learner node within KNIME Analytics Platform with the following optimized parameters: maximum number of iterations = 100, number of hidden layers = 3, and number of hidden neurons per layer = 100.

Support Vector Machine (SVM). The SVM selects a small number of boundary instances known as support vectors to generate a discriminating function that divides training observations into discrete classes with the broadest possible margins. SVM enables the efficient use of a number of kernels for classification. A major characteristic of SVMs is that they aim to minimize error on the training data while limiting model complexity, so that overfitting can be avoided by tuning the factors involved in the process [45, 47, 48]. C-SVM and nu-SVM were the two SVM methods tried. The regularization parameters C and nu penalize misclassifications. C ranges from 0 to infinity, while nu ranges from 0 to 1 and acts as an upper bound on the fraction of training examples that lie on the wrong side of the margin and a lower bound on the fraction of support vectors. In both SVM techniques, implemented in the WEKA-KNIME LibSVM node, the following default parameters were used: kernel cache (cache size = 40.0), kernel type is radial basis function: exp(-gamma × |u-v|^2), loss function is 0.1, kernel coefficients epsilon = 0.001 and gamma = 0.00. In nu-SVM, however, the optimized nu value of 0.1 was employed.
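The radial basis function kernel quoted above, exp(-gamma × |u-v|^2), can be written out directly. This is a minimal sketch; the helper `effective_gamma` encodes an assumption that a gamma setting of 0 is interpreted, as is common in LibSVM-style implementations, as 1/(number of features), which should be checked against the node documentation.

```python
import math

def effective_gamma(gamma, n_features):
    """Assumed convention: gamma == 0 falls back to 1 / n_features."""
    return gamma if gamma > 0 else 1.0 / n_features

def rbf_kernel(u, v, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

# Hypothetical two-feature vectors
g = effective_gamma(0.0, n_features=2)   # 0.5 under the assumed convention
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], g))  # identical points give 1.0
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], g))  # similarity decays with distance
```

Identical feature vectors give a kernel value of 1, and the value decays toward 0 as the distance grows, at a rate controlled by gamma.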

K-star (K*). K* is an instance-based classifier that is distinguished from other instance-based learners by its use of an entropy-based distance function. In this learner, the class of a test instance is determined by the class of the training instances that are similar to it, as determined by a similarity function [98]. The following default settings were used: the manual blend setting was 20% and the average column entropy curve was used for the missing mode.

Gradient-boost. The algorithm uses very shallow regression trees and a special form of boosting to build an ensemble of trees. The base learner for this ensemble method is a simple regression tree, as used in the tree ensemble, RF and simple regression tree nodes. By default, a tree is built using binary splits for numeric and nominal attributes [99]. The following default settings were used: tree depth is 4, number of models is 100, and learning rate is 0.1.

Adaptive boosting (AdaBoost). The constructed classifier is composed of multiple weaker models that are trained in sequence and whose predictions are combined to make the overall prediction [100]. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. It is likely less susceptible to the overfitting problem than other learning algorithms. The final model can be proven to converge to a strong learner [101]. The following settings were used: percentage of weight mass to base training is set to 100, use resampling for boosting is set as "false", random number seed is 1, and number of iterations is 10.
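The adaptive reweighting described above, in which misclassified instances gain weight before the next weak learner is trained, can be sketched for a single boosting round. The sketch uses the classic discrete AdaBoost update for binary labels; the exact variant used by the node employed here may differ, and the function name and toy labels are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One discrete-AdaBoost round: return (alpha, renormalized weights).

    Requires a weighted error strictly between 0 and 1.
    """
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    alpha = 0.5 * math.log((1.0 - err) / err)  # vote weight of this weak learner
    updated = [
        w * math.exp(alpha if t != p else -alpha)  # up-weight mistakes only
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]

# Four instances with equal starting weights; the weak learner misses the last one
alpha_1, new_weights = adaboost_round([0.25] * 4, [1, 1, 0, 0], [1, 1, 0, 1])
print(alpha_1, new_weights)
```

After this round the single misclassified instance carries half of the total weight, so the next weak learner is pushed to classify it correctly.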

ML model evaluation. The ML models were evaluated by calculating their accuracies (Eq 5) and Cohen's kappa (κ) values (Eq 6) [5] against the training and testing datasets.

$$\text{Accuracy} = \frac{TP + TN}{N} \tag{5}$$

where TP represents the true positives, TN represents the true negatives, and N represents the total number of cases. Section 3.2.2 contains more information on how it is calculated.

$$\kappa = \frac{P_0 - P_e}{1 - P_e} \tag{6}$$

where P_0 denotes the observed relative agreement among raters (i.e., accuracy) and P_e is the hypothetical probability of random agreement. P_e is obtained from the observed data by calculating the probability that each rater randomly assigns each category. If the raters are completely in agreement, then Cohen's κ = 1. Cohen's κ = 0 if there is no agreement among the raters other than what would be expected by chance (as given by P_e). A negative Cohen's κ value indicates that the agreement is poorer than random [102]. Models were trained either on the training set (80% of the data, randomly selected) or under 5-fold cross validation of the data points, and each trained model was then used to classify the testing data. In 5-fold cross validation, the process is repeated until all training data points have been removed from the training list and predicted at least once. Evaluation against the testing set involves calculating the accuracy of the particular ML model by comparing its classification results with the actual disease status of the testing set [103, 104].
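The two evaluation metrics can be computed from a binary confusion matrix as follows. This is a minimal sketch under the standard definitions of accuracy and Cohen's κ, with the model and the actual disease status treated as the two "raters"; the function name and the confusion-matrix counts are hypothetical and are not results from this study.

```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Return (accuracy, Cohen's kappa) for a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    p0 = (tp + tn) / n                             # observed agreement (accuracy)
    p_pos = ((tp + fp) / n) * ((tp + fn) / n)      # chance agreement on "positive"
    p_neg = ((fn + tn) / n) * ((fp + tn) / n)      # chance agreement on "negative"
    pe = p_pos + p_neg                             # total chance agreement
    kappa = (p0 - pe) / (1.0 - pe)
    return p0, kappa

# Hypothetical testing-set results for 70 subjects
acc, kappa = accuracy_and_kappa(tp=30, fp=5, fn=5, tn=30)
print(f"accuracy = {acc:.3f}, kappa = {kappa:.3f}")
```

When both the actual and the predicted labels are evenly split, P_e = 0.5 and κ reduces to 2 × accuracy − 1; for example, an accuracy of 0.73 then corresponds to κ = 0.46.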

The preferred reporting items for systematic reviews and meta-analyses (PRISMA) guidelines were followed for this work [105]. A literature search was carried out within the PubMed (Medline), Google Scholar and Science Direct databases up to February 2021, using the keywords CD36, gene, patient, polymorphism and disease name (i.e., T2DM/diabetes or dyslipidemia/lipids). Then, potentially relevant publications and studies were retrieved by examining their titles and abstracts and matching them against the eligibility criteria. To facilitate the proper interpretation of results and to minimize heterogeneity, all eligible studies had to fulfill the following inclusion criteria: evaluation of CD36 gene rs1761667 G>A and rs1527483 C>T with T2DM risk; use of a case-control or cohort design; recruitment of pathologically confirmed patients/condition and control subjects; and availability of genotypic frequencies in both cases and controls (Fig 9). The major reasons for exclusion of studies were overlapping data, case-only studies, review articles, family-based studies and animal studies. In addition to the present study, only one study was included in this meta-analysis for T2DM, in which both the rs1761667 and rs1527483 polymorphisms were studied by PCR-RFLP; that study was conducted by Banerjee et al. in North India and included 250 T2DM cases and 150 healthy controls (all of Asian ethnicity) [24]. After an extensive search, unfortunately, no other studies on dyslipidemia that met the inclusion criteria were found. The genotypic and allelic frequencies of both polymorphisms for both studies involved in the meta-analysis are shown in S9 Table. A random-effects model was used for all analyses. The greatest benefit of conducting the current meta-analysis is to examine sources of heterogeneity, if present, among studies.
To the best of our knowledge, there are no previous meta-analyses in the literature which covered these CD36 gene polymorphisms (i.e., rs1761667 and rs1527483), or even other polymorphisms on the CD36 gene.
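The random-effects pooling mentioned above can be illustrated with the DerSimonian-Laird estimator applied to log odds ratios. This is a sketch of one common random-effects method, not a description of the CMA software's internals; the function names are hypothetical, and the 2x2 counts are invented for illustration rather than taken from either study in the meta-analysis.

```python
import math

def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its variance from a 2x2 table.

    a, b: carriers / non-carriers among cases; c, d: same among controls.
    """
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

def dersimonian_laird(effects, variances):
    """Pool effects under a random-effects model; return (pooled, se, tau2)."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1.0 / sum(w_star)), tau2

# Two hypothetical studies (counts are illustrative only)
studies = [log_odds_ratio(40, 60, 35, 65), log_odds_ratio(90, 110, 70, 80)]
pooled_log_or, se, tau2 = dersimonian_laird(
    [s[0] for s in studies], [s[1] for s in studies]
)
low, high = pooled_log_or - 1.96 * se, pooled_log_or + 1.96 * se
print(f"pooled OR = {math.exp(pooled_log_or):.2f} "
      f"(95% CI {math.exp(low):.2f} to {math.exp(high):.2f}), tau^2 = {tau2:.3f}")
```

With only two studies, the between-study variance is estimated very imprecisely, so the heterogeneity estimate from such a sketch should be read with caution.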

This study investigated CD36 gene status in Jordanian subjects by screening for the rs1761667 and rs1527483 polymorphisms in T2DM patients compared to control subjects. For both polymorphisms, there was no statistically significant difference between patients and control subjects. However, ML tools (i.e., Logistic, Random Forest, XGBoost, PNN, C-LibSVM, nu-LibSVM, AdaBoost, kNN, K*, and NB) were used as computational platforms to predict subjects with diabetes or dyslipidemia (as output) based on their genotyping results, clinical parameters and demographic data (as input features). Some of these tools showed high prediction accuracy. Interestingly, in some ML tools (i.e., K*), the prediction accuracy and Cohen's κ were enhanced by including the genotyping results as inputs. Some ML tools like MLP gave good accuracy and Cohen's κ in all cases. Indeed, our findings emphasize the importance of embedding ML tools into large medical databases, as well as the potential to forecast patient vulnerability to certain diseases. ML tools can be deployed in medical databases and expanded in the future to include other clinical and genetic parameters, assisting in the early detection of diabetes.

Supporting information S1 

Current progress of human trials using stem cell therapy as a treatment for diabetes mellitus

Global estimates of diabetes prevalence for 2013 and projections for 2035

Genetic screening for the risk of type 2 diabetes: worthless or valuable? Diabetes Care

Past and current perspective on new therapeutic targets for Type-II diabetes

Artificial Neural Networks Model for Predicting Type 2 Diabetes Mellitus Based on VDR Gene FokI Polymorphism, Lipid Profile and Demographic Data

Screening the RFX6-DNA binding domain for potential genetic variants in patients with type 2 diabetes

No impact of soluble epoxide hydrolase rs4149243, rs2234914 and rs751142 genetic variants on the development of type II diabetes and its hypertensive complication among Jordanian patients

CD36, a scavenger receptor involved in immunity, metabolism, angiogenesis, and behavior

The origin of circulating CD36 in type 2 diabetes

CD36 mediates both cellular uptake of very long chain fatty acids and their intestinal absorption in mice

Defective uptake and utilization of long chain fatty acids in muscle and adipose tissues of CD36 knockout mice

The macrophage Ox-LDL receptor, CD36 and its association with type II diabetes mellitus

The structural basis for CD36 binding by the malaria parasite

The Multifunctionality of CD36 in Diabetes Mellitus and Its Complications-Update in Pathogenesis

CD36: a class B scavenger receptor involved in angiogenesis, atherosclerosis, inflammation, and lipid metabolism

CD36 and lipid metabolism in the evolution of atherosclerosis

Polymorphism rs1761667 in the CD36 Gene Is Associated to Changes in Fatty Acid Metabolism and Circulating Endocannabinoid Levels Distinctively in Normal Weight and Obese Subjects

Lipid-induced insulin resistance is associated with increased monocyte expression of scavenger receptor CD36 and internalization of oxidized LDL. Obesity (Silver Spring)

Molecular basis of human CD36 gene mutations

A CD36 nonsense mutation associated with insulin resistance and familial type 2 diabetes

Direct association of a promoter polymorphism in the CD36/FAT fatty acid transporter gene with Type 2 diabetes mellitus and insulin resistance

Variants in the CD36 gene associate with the metabolic syndrome and high-density lipoprotein cholesterol

Preliminary studies on CD36 gene in type 2 diabetic patients from north India

Association of CD36 gene variants rs1761667 (G>A) and rs1527483 (C>T) with Type 2 diabetes in North Indian population

Lipids and Lipoproteins in Patients With Type 2 Diabetes

Treatment of dyslipidemia in patients with type 2 diabetes. Lipids in Health and Disease

Involvement of Machine Learning Tools in Healthcare Decision Making

Docking-Generated Multiple Ligand Poses for Bootstrapping Bioactivity Classifying Machine Learning: Repurposing Covalent Inhibitors for COVID-19-Related TMPRSS2 as Case Study

Deep Learning in Medical Imaging: General Overview

Side Effects and Perceptions Following COVID-19 Vaccination in Jordan: A Randomized, Cross-Sectional Study Implementing Machine Learning for Predicting Severity of Side Effects

Machine Learning and Data Mining Methods in Diabetes Research

Virulence factor-related gut microbiota genes and immunoglobulin A levels as novel markers for machine learning-based classification of autism spectrum disorder

Comparison of unsupervised machine-learning methods to identify metabolomic signatures in patients with localized breast cancer

Cross-omics analysis revealed gut microbiome-related metabolic pathways underlying atherosclerosis development after antibiotics treatment

Predicting and understanding the response to short-term intensive insulin therapy in people with early type 2 diabetes

Unsupervised Spectral-Spatial Feature Learning With Stacked Sparse Autoencoder for Hyperspectral Imagery Classification. IEEE Geoscience and Remote Sensing Letters

Heart Disease Prediction System Using Decision Tree and Naive Bayes Algorithm

Feature extraction and I-NB classification of CT images for early lung cancer detection

Detection of Cardiovascular Disease Risk's Level for Adults Using Naive Bayes Classifier

Naive Bayes: applications, variations and vulnerabilities: a review of literature with code snippets for implementation

A Novel Image Classification Method with CNN-XGBoost Model In

Computational Intelligence in Smart Grid Environment

Infusion of donor feces affects the gut-brain axis in humans with metabolic syndrome

Predictive Supervised Machine Learning Models for Diabetes Mellitus

Automated identification of normal and diabetes heart rate signals using nonlinear measures

A comparative study on various data mining classification methods: KNN, PNN and ANN for tiles defect detection

A probabilistic neural network approach for modeling and classification of bacterial growth/no-growth data

Diagnosing Breast Cancer Type by Using Probabilistic Neural Network in Decision Support System

Application of probabilistic neural networks in modelling structural deterioration of stormwater pipes

Applications of Data Mining in the Healthcare Industry. Encyclopedia of Healthcare Information Systems

Data mining applications in healthcare

Advanced methods in neural computing

Neural Networks for Identification of Nonlinear Systems: An Overview

Recurrent neural network prediction of steam production in a Kraft recovery boiler

Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation

Adaptive Boosting Based Personalized Glucose Monitoring System (PGMS) for Non-Invasive Blood Glucose Prediction with Improved Accuracy

Diabetes classification model based on boosting algorithms

Diabetics Prediction using Gradient Boosted Classifier

A data-driven approach to predicting diabetes and cardiovascular disease with machine learning

An application of machine learning with feature selection to improve diagnosis and classification of neurodegenerative disorders

Current Techniques for Diabetes Prediction: Review and Case Study

The construction of risk prediction models using GWAS data and its application to a type 2 diabetes prospective cohort

Machine Learning SNP Based Prediction for Precision Medicine

User's guide to correlation coefficients

The PRISMA 2020 statement: an updated guideline for reporting systematic reviews

Common variants in the CD36 gene are associated with dietary fat intake, high-fat food consumption and serum triglycerides in a cohort of Quebec adults

CD36 gene variants and their association with type 2 diabetes in an Indian population

CD36 gene variants in early prediction of type 2 diabetes mellitus

CD36 gene variants is associated with type 2 diabetes mellitus through the interaction of obesity in rural Chinese adults

The role of Src in solid tumors

Rac1-NADPH oxidase signaling promotes CD36 activation under glucotoxic conditions in pancreatic beta cells

CD36, a scavenger receptor implicated in atherosclerosis

The Multifunctionality of CD36 in Diabetes Mellitus and Its Complications-Update in Pathogenesis

Cytokines Regulate beta-Cell Thioredoxin-interacting Protein (TXNIP) via Distinct Mechanisms and Pathways

NLRP3 inflammasome as a potential treatment in ischemic stroke concomitant with diabetes

The Role of CD36 in Type 2 Diabetes Mellitus: beta-Cell Dysfunction and Beyond

CD36 and lipid metabolism in the evolution of atherosclerosis

Oxidized low-density lipoprotein activates p66Shc via lectin-like oxidized low-density lipoprotein receptor-1, protein kinase C-beta, and c-Jun N-terminal kinase kinase in human endothelial cells

Association of cluster of differentiation 36 gene variant rs1761667 (G>A) with metabolic syndrome in Egyptian adults

Metabolic syndrome is linked to chromosome 7q21 and associated with genetic variants in CD36 and GNAT3 in Mexican Americans

Intrusion detection system based on K-star classifier and feature set reduction

Comparing different supervised machine learning algorithms for disease prediction

The measurement of observer agreement for categorical data

Statistical Methods for Rates and Proportions

Reduction of Overfitting in Diabetes Prediction Using Deep Learning Neural Network

Automated detection of diabetes using CNN and CNN-LSTM network and heart rate signals

Predictive models for diabetes mellitus using machine learning techniques

Meta-analysis of two studies in the presence of heterogeneity with applications in rare diseases

Estimation of the concentration of low-density lipoprotein cholesterol in plasma, without use of the preparative ultracentrifuge

Genetic factors increase the identification efficiency of predictive models for dyslipidaemia: a prospective cohort study

Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

SNPStats: a web tool for the analysis of association studies

Bioactive Molecule Prediction Using Extreme Gradient Boosting. Molecules

Distance and Similarity Measures Effect on the Performance of K-Nearest Neighbor Classifier: A Review

Virtual screening for PPAR modulators using a probabilistic neural network

Tumor classification by combining PNN classifier ensemble with neighborhood rough set based gene reduction

A direct adaptive method for faster backpropagation learning: the RPROP algorithm

Intelligence System for Diagnosis Level of Coronary Heart Disease with K-Star Algorithm

Greedy Function Approximation: A Gradient Boosting Machine

A Strong Machine Learning Classifier and Decision Stumps Based Hybrid AdaBoost Classification Algorithm for Cognitive Radios

The return of AdaBoost.MH: multi-class Hamming trees: Cornell University

Interrater reliability: the kappa statistic

Applications of machine learning techniques to predict filariasis using socio-economic factors

Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC

Preferred reporting items for systematic reviews and metaanalyses: the PRISMA statement