key: cord-0029067-ay5v9oor
authors: Kim, Jingeun; Yoon, Yourim; Park, Hye-Jin; Kim, Yong-Hyuk
title: Comparative Study of Classification Algorithms for Various DNA Microarray Data
date: 2022-03-11
journal: Genes (Basel)
DOI: 10.3390/genes13030494
sha: 0aa6e97786ebfd53b380ce72294f8bae1575c2ee
doc_id: 29067
cord_uid: ay5v9oor

Microarrays are applications of electrical engineering and technology in biology that allow simultaneous measurement of the expression of numerous genes, and they can be used to analyze specific diseases. This study undertakes classification analyses of various microarrays to compare the performances of classification algorithms over different data traits. The datasets were classified into test and control groups using five machine learning methods, namely MultiLayer Perceptron (MLP), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and k-Nearest Neighbors (KNN), and the resulting accuracies were compared. k-fold cross-validation was used to evaluate performance. Through the experiments, it was observed that the two tree-based methods, DT and RF, showed similar trends in results, and the remaining three methods, MLP, SVM, and KNN, showed similar trends. DT and RF generally showed worse performance than the other methods except for one dataset. This suggests that, for the effective classification of microarray data, selecting a classification algorithm that is suitable for data traits is crucial to ensure optimum performance.

Microarrays have been developed by combining modern mechanical and electrical engineering technologies with the existing knowledge in molecular biology. While the traditional methods allowed researchers to measure the expression of a small number of genes at a time, the introduction of microarrays enabled the expression analysis of tens of thousands of genes in a single experiment. This led to the development of experimental techniques that were capable of generating a large volume of genomic information from a single cell [1] . Since various genes in an organism mutually affect and regulate their expressions, microarray data can be used as a tool to analyze specific diseases.

Microarray-based analysis methods can be broadly classified into five categories: analysis of differentially expressed genes, analysis of differentially expressed gene pairs, functional annotation, clustering analysis, and classification analysis [2] [3] [4] [5] . In this study, classification analysis was selected for evaluation. The classification analysis method is a widely used multivariate statistical method that can be used to determine or predict classes of unknown groups of data. This method has typically been used to analyze cancer

Microarrays are highly dense arrays of DNA molecules with known nucleotide sequences aggregated on a small slide [19] . Microarrays may be used to assess the overall expression of a large number of genes and contribute to a genome-based analysis of biological phenomena [20, 21] . MicroRNAs (miRNAs) are short RNA molecules consisting of approximately 22 nucleotides and are involved in the post-transcriptional regulation of gene expression. miRNAs are known to regulate the expression of over 60% of human genes and are associated with various diseases [22] .

The miRNA and RNA datasets were acquired from the Gene Expression Omnibus repository at the National Center for Biotechnology Information (GEO-NCBI) [23]. Among the datasets available for lung cancer, chronic obstructive pulmonary disease (COPD), and cardiovascular disease (CVD), which are representative diseases induced by exposure to fine dust, we acquired four datasets with clear control groups and relatively large sample sizes. Furthermore, we acquired two microarray datasets that compared the profiles of occupations with and without exposure to fine dust. Some of the datasets used in this study also included information on smoking, sex, age, and BMI, in addition to miRNA and RNA. While all datasets were associated with exposure to fine dust to some extent, their classification criteria and data traits were distinct, as follows:

Dataset 01, titled "MicroRNA profiling of chronic lung disease for the Lung Genomics Research Consortium", provided miRNA data on patients with COPD and interstitial lung disease (ILD), which are two similar pulmonary diseases. COPD is a lung disease caused by repeated exposure to a noxious agent resulting in irreversible airflow limitation and ILD is a loosely defined group of diseases characterized by changes in the interstitium of the lung, causing pulmonary restriction and impaired gas exchange [24] .

Dataset 02, titled "The site and nature of terminal bronchiolar destruction in chronic obstructive pulmonary disease (COPD)", provided RNA data of patients with COPD besides that of healthy subjects [25] .

Dataset 03, titled "Genome-wide screening of transcriptional modulation in nonsmoking female lung cancer in Taiwan", provided RNA data of non-smoking females with lung cancer and those without lung cancer [26] .

Dataset 04 was titled "Differences in monocyte miRNA profiles between patients with coronary artery disease and healthy controls". This dataset included miRNA data from healthy subjects and patients with CVD, which is one of the primary causes of death in humans [27] .

Dataset 05 was titled "Transcriptomic changes in the nasal epithelium associated with diesel engine exhaust exposure". Diesel engine exhaust (DEE) is one of the primary causes of air pollution worldwide, which can affect the human body. This RNA dataset was collected from those working in diesel engine factories, along with those working without DEE exposure as the control group, for a detailed analysis of genes affected by air pollution [28] .

Dataset 06, titled "Expression of viral and human microRNAs in blood in the Beijing Truck Driver Air Pollution Study", provided miRNA data collected from the blood samples of truck drivers exposed to air pollution, with that of office workers as the control group. This dataset was constructed to investigate the association between air pollution exposure and clinical outcomes [29] .

Since the sample traits listed in the six datasets not only included occupation but also BMIs and smoking status, we were able to conduct classification experiments on various features, such as truck driver/office worker, obese/non-obese, and smoking/non-smoking. All datasets used for the experiments are listed in Table 1 with their titles, classification groups, variable counts, and sample counts. Table 2 shows an example microarray data (miRNA data indicating the effects of air pollution on truck drivers) used in the experiments. The rows specify the gene names and values, whereas the columns specify the sample names. The miRNA data used in the experiments represent the expression of each gene for each sample in a matrix form.

Feature scaling is a kind of data preprocessing to improve the performance of machine learning algorithms [30]. Normalization and standardization are the two most commonly used feature scaling techniques in machine learning. Normalization rescales the values into a range of [0, 1] and standardization rescales data to have a mean of 0 and a standard deviation of 1. The choice of normalization or standardization depends on data and machine learning algorithms. There is no simple rule that determines when to use normalization or standardization. Therefore, it is necessary to apply both methods and compare the results to know which method is better [31]. In this study, only normalization was applied, because normalization showed generally good performance according to the comparison. However, there were cases where standardization was better than normalization, and the results according to these feature scaling methods were attached to Appendix A. The equation for normalization used in this study is given as follows:

If there are no negative values in the dataset, the values are scaled to a range between 0 and 1, and if there are negative values, the values are scaled to a range between −1 and 1. As the microarray datasets used in this study did not contain any negative values, all of them were scaled to a range [0, 1].
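The two scaling methods described above can be illustrated with a minimal sketch. It assumes the common min-max form of normalization, which maps non-negative data to [0, 1]; the exact equation used in the study is not reproduced here, and the remark about negative values being scaled to [−1, 1] may correspond to a different scaler. The function names and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical expression values of one gene across four samples.
expression = [2.0, 4.0, 6.0, 10.0]
normalized = min_max_normalize(expression)
standardized = standardize(expression)
print(normalized)
print(standardized)
```

Scaling is applied per feature (here, per gene), so each gene's expression values are rescaled independently of the others.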

Machine learning is a subfield of artificial intelligence in which algorithms are developed to allow systems to train based on a given dataset and execute activities that are not specified in the code [32]. Machine learning can be further categorized into supervised or unsupervised learning based on whether or not the given data are labeled. In supervised learning, such as support vector machines, decision trees, and neural networks, systems use the features of the given data to predict their labels. On the other hand, in unsupervised learning, such as clustering, the system is trained entirely on the unlabeled input values. In this study, supervised learning was used to classify the microarray data.

MLP is a layered neural network with one or more hidden layers between the input and output layers [33]. The network is a feedforward network in which the layers are directed as per the order: input, hidden, and output layers, and there is no connection between the nodes of the same layer or between the output and input layers [34]. MLP uses activation functions, typically the sigmoid or rectified linear unit (ReLU) functions. The sigmoid function provides a value between 0 and 1 as output, enabling the neural network to make subtle classifications of the data. However, this characteristic of the sigmoid function poses a disadvantage: with deeper networks, the output of the function is heavily biased towards either end of the range, leading to a derivative value close to 0. To solve this problem, the ReLU function was proposed, which returns 0 for an input value smaller than 0 but returns the original input value for one larger than 0. This does not lead to derivative values converging to zero, even with deeper networks. The equation for the ReLU function is as follows:

$$f(x) = \max(0, x)$$

where x is the input to a neuron. Some optimizers that enhance and stabilize the learning of MLP include stochastic gradient descent, momentum, Nesterov accelerated gradient, and adaptive moment estimation (Adam). Adam was selected for this study, owing to its high computational efficiency, low memory requirements, and scalability in large datasets [35]. The default value of the learning rate, which controls the step size in weight updates, was set to 0.01 since the learning rate of 0.01 is known to be effective in preventing underfitting [36].
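The contrast between the sigmoid and ReLU activation functions can be made concrete with a small sketch. It is illustrative only and not taken from the study's source code; the function names are hypothetical, and the derivative of ReLU at exactly 0 is set to 0 by convention.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: maps any real input to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid, s(x) * (1 - s(x)); at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    """ReLU: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0


# Compare the derivatives at a saturated and an unsaturated input.
for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid_grad(x), relu_grad(x))
```

For large positive inputs the sigmoid derivative shrinks towards 0, whereas the ReLU derivative stays at 1, which is why gradients passed through many ReLU layers are less prone to vanishing.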

SVM is a machine learning algorithm proposed by Vapnik [37]. It is a highly generalizable classifier typically used for classification and regression analysis. SVM classifies a set containing elements of various classes in an N-dimensional space into several subsets using the (N-1)-dimensional hyperplane of the maximum margin [38]. The algorithm is currently being used in a wide range of fields, including text, handwriting, face, and object recognition, as well as bioinformatics [39]. The SVM provides different outputs based on the values of two parameters: C, specifying the degree of error permitted, and γ, specifying the curvature of the boundary.

In this study, the RBF (radial basis function)-SVM was used for its specialization in nonlinear factors. The RBF kernel function is expressed as follows [40]:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

where $\lVert x_i - x_j \rVert^2$ is the squared Euclidean distance between the two vectors $x_i$ and $x_j$. The RBF kernel requires predetermined values for the C and γ parameters, since the C and γ values that show the highest performance vary according to the size of the dataset. The value with the highest performance among 0.001, 0.01, 0.1, 1, 10, and 100 was selected to obtain experimental results. These six values are the most generally used for the γ value of RBF-SVM.
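The effect of γ on the RBF kernel can be seen in a short sketch. It assumes the common form K(x_i, x_j) = exp(−γ‖x_i − x_j‖²); the function name and the two example vectors are hypothetical, and only the six candidate values come from the text.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel value for two equal-length vectors, assuming exp(-gamma * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)


# Two hypothetical normalized expression vectors; their squared distance is 2.
a = [0.0, 1.0]
b = [1.0, 0.0]

# The six candidate values listed in the text.
for gamma in (0.001, 0.01, 0.1, 1, 10, 100):
    print(gamma, rbf_kernel(a, b, gamma))
```

A small γ keeps the kernel value close to 1 even for distant samples, which gives a smooth boundary, while a large γ makes the similarity fall off quickly and produces a more curved boundary.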

DT is an analytical method in which a tree-like structure consisting of decision rules is constructed to classify data into several subsets [41] . A tree is a collection of layered nodes and branches. Top-down algorithms are generally used to construct decision trees, and the selection of classification variables and threshold values is crucial in each step of constructing the tree from the top to the bottom. Without a limit on the depth of the decision tree, the tree can become infinitely deep and complex. Thus, trees without pruning may result in overfitting, which may be prevented by limiting the tree depth. In this study, the maximum tree depth parameter, max_depth, was set to 3 to avoid overfitting [42] .

RF is an ensemble classification algorithm based on decision trees trained on randomly selected subsets of data and features. The number of trees in the forest was limited to 100, since a forest with more than 100 trees is reported to risk overfitting [43]. For each node, the random forest randomly selects explanatory variables and provides the optimal output using the set of selected explanatory variables. However, the algorithm is unstable due to the instability in datasets or variability in classifiers, which causes even a small change in data to lead to a different result. To circumvent this issue, RF is usually performed with bagging and bootstrapping [44].

The KNN algorithm, proposed by Cover and Hart in 1968, is a nonparametric method used in classification or regression [45]. KNN intuitively classifies unlabeled samples based on the inter-sample similarity observed in the training set. A small value of the number of neighbors provided as a parameter leads to a complex decision boundary in the model and consequent overfitting, while a large value leads to a simple decision boundary and underfitting. Thus, it is important to determine an appropriate value for this parameter. In this study, the value showing the highest performance was set as the value of the core parameter of KNN, n_neighbors, individually for each dataset.
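The sensitivity of KNN to the number of neighbors can be sketched with a few lines of plain Python. This is a toy illustration under stated assumptions rather than the study's implementation: the function name, the Euclidean distance measure, and the six two-dimensional samples are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, n_neighbors):
    """Predict the label of `query` by majority vote among its nearest training samples."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in nearest[:n_neighbors])
    return votes.most_common(1)[0][0]


# Hypothetical data: three controls, two cases far away, and one case
# lying close to the control cluster.
train_X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15], [0.9, 0.8], [0.8, 0.9], [0.3, 0.3]]
train_y = ["control", "control", "control", "case", "case", "case"]
query = [0.28, 0.28]

print(knn_predict(train_X, train_y, query, n_neighbors=1))
print(knn_predict(train_X, train_y, query, n_neighbors=3))
```

With a single neighbor the prediction follows the one nearby case sample, whereas with three neighbors the surrounding controls outvote it, which is the overfitting versus smoothing trade-off described above.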

Cross-validation refers to averaging the performances of k models, each generated from a different partition of the dataset. k-fold refers to partitioning the dataset into k subsets using k-1 of them as the training sets and the remaining as the validation set. This process is repeated k times. The evaluation of models using this approach ensures that the entire data is used as both training and validation data, leading to a lower likelihood of overfitting. As usual, we set k as 10, partitioning the dataset into 10 subsets, for performing cross-validation.
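The partitioning step of k-fold cross-validation can be sketched as follows. The helper name and the sample count of 25 are hypothetical, and the sketch neither shuffles nor stratifies the samples, which a practical implementation would typically do.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# k = 10 as in the study; 25 samples is an arbitrary example size.
folds = list(k_fold_indices(n_samples=25, k=10))
for train, validation in folds:
    print(len(train), validation)
```

Every sample appears in exactly one validation fold and in the training set of the other k-1 folds, so the averaged score uses all of the data for both roles.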

In this study, the accuracy of the classification model is defined as

$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} 1(\hat{y}_i = y_i)$$

where $n$ is the number of samples, $\hat{y}_i$ is the predicted value of the i-th sample, $y_i$ is the corresponding actual value, and $1(x)$ is an indicator function.

Table 3 shows the types of classification algorithms used in the experiment, parameters used in algorithm design, and values used for parameters. We conducted nested cross-validation for parameter tuning and evaluation of SVM and KNN. For each fold of cross-validation, these two classification algorithms were tested with different parameter values to achieve optimal performance. For the other three classification algorithms, MLP, DT, and RF, predetermined values known to be effective in improving performance were used for parameter values [36, 42, 43]. For the SVM model, the RBF kernel was used, and the C and γ values were obtained through a grid search of the training set in each fold, resulting in different values across models. For the KNN model, a value between 1 and 58 was specified for each model as the number of neighbors used to find the optimal value. Thus, the range of the parameter values and the optimal k value were different for each model.

The MLP classification model generated two hidden layers since it has been verified effective in other studies for disease diagnosis [46]. Each hidden layer contained 10 neurons and used the ReLU activation function. Adam was used as the gradient descent algorithm with an initial learning rate of 0.01 and was executed over 500 epochs. The DT model was generated with a maximum tree depth of three. The RF model generated a random forest of 100 trees. Models using MLP, SVM, DT, RF, and KNN were implemented in Python, as mentioned earlier, and the source code was attached as File S1. Optimal models were developed by tuning the parameters. All candidate parameter values were tested, especially for SVM and KNN, and the results were compared to each other.
The classification models were trained and tested using the system with NVIDIA Tesla K80 GPU, Intel Core i5-6200 CPU @ 2.30 GHz, and 8 GB memory.
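The accuracy measure and the nested cross-validation used for parameter tuning can be sketched together in plain Python. This is a simplified illustration, not the code of File S1: the helper names, the interleaved fold split, the small KNN classifier, and the toy data are all assumptions, and only n_neighbors is tuned here.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the actual label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def knn_predict(train_X, train_y, query, n_neighbors):
    """Majority vote among the n_neighbors closest training samples."""
    nearest = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    return Counter(label for _, label in nearest[:n_neighbors]).most_common(1)[0][0]


def split_folds(indices, k):
    """Split a list of indices into k interleaved folds."""
    return [indices[i::k] for i in range(k)]


def evaluate(X, y, train, held_out, n_neighbors):
    """Accuracy on `held_out` of a KNN model built from the `train` indices."""
    train_X = [X[i] for i in train]
    train_y = [y[i] for i in train]
    preds = [knn_predict(train_X, train_y, X[j], n_neighbors) for j in held_out]
    return accuracy([y[j] for j in held_out], preds)


def cv_accuracy(X, y, indices, k, n_neighbors):
    """Mean validation accuracy over k folds of the given indices."""
    scores = []
    for val in split_folds(indices, k):
        train = [i for i in indices if i not in val]
        scores.append(evaluate(X, y, train, val, n_neighbors))
    return sum(scores) / len(scores)


def nested_cv(X, y, outer_k, inner_k, candidates):
    """Outer loop estimates performance; inner loop picks n_neighbors per outer fold."""
    all_idx = list(range(len(X)))
    outer_scores = []
    for test in split_folds(all_idx, outer_k):
        train = [i for i in all_idx if i not in test]
        # The parameter is chosen using the outer-training samples only.
        best = max(candidates, key=lambda c: cv_accuracy(X, y, train, inner_k, c))
        outer_scores.append(evaluate(X, y, train, test, best))
    return sum(outer_scores) / len(outer_scores)


# Hypothetical, well-separated toy data: 10 samples per class.
X = [[i * 0.1, 0.0] for i in range(10)] + [[5.0 + i * 0.1, 0.0] for i in range(10)]
y = [0] * 10 + [1] * 10
score = nested_cv(X, y, outer_k=5, inner_k=4, candidates=(1, 3, 5))
print(score)
```

Because the parameter is selected inside each outer fold, the held-out samples of that fold never influence the choice, so the outer accuracy is not inflated by the tuning step.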

Dataset 01 provided miRNA data for patients with COPD or ILD, which are two similar pulmonary diseases. All samples of the patients diagnosed with ILD or COPD were obtained from the Lung Tissue Research Consortium (LTRC). The dataset included 319 subjects, of which 183 had ILD and 136 had COPD [24, 47]. Table 4 shows the classification results for COPD and ILD obtained using five different algorithms: MLP, RF, DT, SVM, and KNN. The results show that SVM, MLP, and KNN had accuracies of over 80%. Figure 1 shows a heatmap of SVM accuracies with different values of C and γ, the two core parameters of SVM, displayed on a color gradient. SVM achieved the highest accuracy when both the C and γ values were 0.001. RF and DT, using tree structures, showed lower accuracy than the others. RF had an accuracy of 73.3%, and all five classification algorithms distinguished the patients with COPD from those with ILD with more than 70% accuracy.

Dataset 02 included microarray data from 77 patients with COPD and 40 healthy subjects [25]. Table 5 shows the classification results for patients with COPD and healthy subjects using five different algorithms: MLP, RF, DT, SVM, and KNN. SVM had the highest accuracy (99%) in classifying the subjects based on the presence or absence of COPD. The heatmap (Figure 2) shows that the highest accuracy was achieved with a C value of 0.001 and a γ value of 0.001, 0.01, or 0.1. While RF had the lowest accuracy (68.3%), the algorithms generally had accuracies over 80%, thus classifying the subjects with a disease satisfactorily.

Dataset 03 provided RNA data for healthy subjects and patients with lung cancer [26]. This dataset was used to comparatively analyze non-smoking female patients with lung cancer and healthy subjects [48, 49]. Table 6 shows the classification results between patients with lung cancer and healthy subjects using the five classification algorithms. The DT and SVM showed high accuracies of 95% each. The lowest accuracy was 73%, meaning that every algorithm distinguished the two groups with an accuracy of at least 73%.

Dataset 04 provided the miRNA data of 40 males with premature CVD and 40 healthy males of the same age, measured using microarrays [27, 50, 51]. Table 7 shows the classification results between the CVD and control groups using the five classification algorithms. All algorithms were able to distinguish the patients with CVD from healthy subjects with an accuracy of over 50%, and SVM had the highest accuracy (77%). Figure 4 shows a heatmap of the SVM accuracies with different C and γ values displayed on a color scale. The highest performance was observed when the C value was 0.001 and the γ value was 0.01.

Dataset 05 provided RNA data of factory workers exposed to DEE and those not exposed to DEE [28, 52]. The algorithms attempted to distinguish these two groups using the microarray data. Table 8 shows the classification results of the diesel factory workers and control subjects using the five classification algorithms. Figure 5 shows a heatmap of the SVM accuracies with different C and γ values displayed on a color scale. The highest performance was achieved when the C and γ values were both 0.001. Figure 6 shows the accuracy of the KNN classification algorithm obtained at different values of k plotted as a graph. KNN achieved the highest accuracy with a k value of five. MLP and KNN showed high accuracies in this experiment. However, the tree-based algorithms, RF and DT, failed to classify these data well, with accuracies of less than 50%.

Dataset 06 provided miRNA data on truck drivers with high exposure to air pollution, as well as office workers with relatively low exposure to air pollution [29] . All participants were residents of the Beijing metropolitan area and had been working for at least two years at the same location at the time of data collection. None of the participants took any regular medications, such as anti-inflammatory drugs or aspirin. For each participant, this dataset provided miRNA data collected over two days with an interval of 1-2 weeks in between [53, 54] .

This dataset listed occupations, BMIs, and smoking status as traits of each sample. This enabled us to conduct further classification experiments, such as truck driver/office worker, obese/non-obese, and smoking/non-smoking. Using this dataset, we first classified truck drivers and office workers.

Table 9 shows the classification results for truck drivers and office workers using the five classification algorithms. In general, the accuracies were low in classifying occupations. Figure 7 shows a heatmap of the SVM accuracies with different C and γ values displayed on a color scale. The highest accuracy was achieved when both C and γ values were 0.001. In this experiment, KNN showed the highest accuracy among the five classification algorithms. Figure 8 shows the accuracy of the KNN classification algorithm at different values of k plotted as a graph. KNN achieved the highest accuracy with a k value of 24. The decision tree correctly classified less than half of the dataset.

In the next experiment, the algorithms were applied to classify the same dataset by obesity (obese/non-obese). In this experiment, subjects with a BMI of 25 or higher were classified as obese. The WHO (World Health Organization, Geneva, Switzerland) defines obesity using a BMI cutoff point of 30 [55]; however, the lower cutoff point of 25 is usually used for identifying obesity in Asian populations [56]. Table 10 shows the classification results of the obese and control subjects using the five classification algorithms. Using the miRNA data, the algorithms classified obese and non-obese subjects more accurately than they classified the subjects by occupation. Figure 9 shows a heatmap of the SVM accuracies with different C and γ values displayed on a color scale. The highest accuracy was achieved with a C value of 0.001 and a γ value of 0.001 or 0.01. Figure 10 shows the accuracy of the KNN classification algorithm at different values of k plotted as a graph. The best performance was recorded with a k value of 3, and the other algorithms also performed the classification adequately.

The next experiment also used the same dataset to classify the subjects based on whether or not they smoked. Table 11 shows the classification results between smokers and non-smokers using the five classification algorithms. The classification results based on smoking showed lower accuracies than those obtained based on obesity but higher than those obtained based on occupation. Figure 11 shows a heatmap of SVM accuracies with different C and γ values displayed on a color scale. The highest accuracy was achieved when the C and γ values were both 0.001. Figure 12 shows the accuracy of the KNN classification algorithm at different values of k represented as a graph. KNN and DT classified the data adequately, with KNN achieving the highest accuracy at a k value of eight.

Experiments were conducted to classify microarray data into two groups using machine learning. In the first experiment, wherein incidences of COPD and ILD were classified, all five algorithms distinguished two groups with an accuracy of more than 70%. The second experiment aimed to classify the data regarding the differences between patients with COPD and healthy subjects. In this experiment, SVM showed the highest accuracy (99%) in contrast to RF (random forest), which showed the lowest accuracy (68.3%). In the third experiment, classification was carried out to differentiate between lung cancer patients and healthy people, and the DT (decision tree) and SVM models showed high accuracies. In the fourth experiment, subjects in the dataset were classified based on the difference between patients with CVD and control groups. SVM showed higher accuracy (77%) than the other algorithms. The fifth experiment was aimed at differentiating between factory workers with and without exposure to DEE. In this case, KNN classified the dataset optimally and was able to distinguish between the two groups with an accuracy of 90%, while RF and DT displayed low accuracies. The sixth experiment was performed to classify the individuals by occupation, obesity, and smoking habits. When classifying by occupation, all the algorithms showed low accuracy. KNN showed good accuracy of 80.3% in the classification based on obesity. However, all algorithms were also inadequate in classifying the same dataset based on smoking and displayed a mean accuracy of 61.5%. Figure 13 summarizes the accuracies of each machine learning classification algorithm on all datasets used in the experiments. Datasets 01-04 concerned specific diseases, such as lung cancer, CVD, and COPD, while datasets 05-06 compared the data between two different occupational groups. The machine learning model that showed the highest performance varied across datasets. 
Although SVM showed the highest performance on Datasets 01-04 and KNN showed the highest performance on Datasets 05-06, it is not reasonable to conclude that SVM or KNN is the best method for microarray data, because only a subset of the available microarray data was tested in this study. However, it was clearly observed that the two tree-based methods, DT and RF, showed similar trends in results, and the remaining three methods, MLP, SVM, and KNN, showed similar trends to one another. Both DT and RF are built on tree structures, which likely explains why they behave similarly on microarray data. DT and RF generally showed worse performance than the other methods, except on Dataset 03. Furthermore, the heatmaps for Datasets 02 and 03, which were classified by SVM with substantially high performances, displayed clearer boundaries for C values than the heatmaps for other datasets where the performance of SVM was poorer.

Figure 13. Accuracies for the tested classification algorithms.

Among the datasets used, Datasets 01, 04, and 06 are the miRNA datasets and the others are the mRNA datasets. Figure 14 shows the accuracies of each machine learning classification algorithm on mRNA datasets and miRNA datasets separately. The difference in the performance of each model between the two cases (miRNA and mRNA) was not clear. However, it can be observed that the performances of the tree-based models (RF, DT) were consistently worse than those of the distance-based models (MLP, SVM, KNN) on the miRNA datasets. In contrast, on the mRNA datasets, the tree-based models were sometimes better and sometimes worse.
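The comparison between the two groups of methods can be made explicit with a small helper such as the one below. This is only a sketch: the input layout (a mapping from dataset name to per-model accuracy) and the function names are hypothetical, and no accuracy values from the paper are embedded in it.

```python
from statistics import mean

TREE_BASED = ("DT", "RF")
OTHERS = ("MLP", "SVM", "KNN")


def family_means(accuracies):
    """Map each dataset to (mean tree-based accuracy, mean accuracy of the other models).

    `accuracies` is assumed to look like {dataset: {model: accuracy}}.
    """
    return {
        dataset: (
            mean(by_model[m] for m in TREE_BASED),
            mean(by_model[m] for m in OTHERS),
        )
        for dataset, by_model in accuracies.items()
    }


def datasets_where_trees_win(accuracies):
    """Datasets on which the tree-based mean exceeds the mean of the other models."""
    return [d for d, (tree, other) in family_means(accuracies).items() if tree > other]
```

Applied separately to the miRNA and mRNA datasets, an empty result for one group would correspond to the tree-based models being consistently worse on that group.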

Figure 14. Results of the tested classification algorithms reorganized into miRNA and mRNA data: (a) accuracies for miRNA datasets; (b) accuracies for mRNA datasets.

In this paper, microarray data with various traits were classified into two groups using five representative machine learning methods: MLP, SVM, DT, RF, and KNN. In the experiments, diverse classification criteria were applied, such as classification between two similar diseases, classification between people with and without diseases, and classification between two occupational groups. The accuracies of the five machine learning methods on these datasets were compared. The results suggest that the best-performing machine learning model varies across datasets. However, it was observed that the tree-based methods, DT and RF, showed similar trends in results, and the remaining methods, MLP, SVM, and KNN, showed similar trends to one another. DT and RF generally showed worse performance than the other methods, except on one dataset.

Although only microarray data were dealt with in this paper, the methodology of this study is similarly applicable to RNA-seq data, which is known to be more sensitive in detecting differential expression and to offer an increased dynamic range. Further study applying the methods in this paper to recent RNA-seq data is therefore needed. It will be interesting to verify whether experiments on RNA-seq data yield results similar to those obtained on microarray data.

The data presented in this study are openly available in references [24-29].

The authors declare no conflict of interest.

The effects of normalization and standardization were compared across all the experiments performed in this study. Table A1 compares the accuracy of the classification algorithms under normalization and standardization; each value is the accuracy averaged over all experiments performed in this study, and the higher of the two accuracies is shown in bold. The better feature scaling method differed depending on the machine learning model. According to the table, normalization performed better for SVM, DT, and KNN, and standardization performed better for MLP and RF. Averaged over all machine learning algorithms, normalization was better on the microarray data used in this study.
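For reference, the two feature scaling methods can be sketched as follows, assuming the usual meanings of the terms: normalization as min-max scaling of a feature to [0, 1], and standardization as rescaling to zero mean and unit variance. The function names are hypothetical, and the use of the population standard deviation and the handling of constant features are assumptions of this sketch.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale one feature to [0, 1]: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: no spread to rescale, so map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale one feature to zero mean and unit variance: (v - mean) / std."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # Constant feature: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

Each function would be applied per gene (column) of the expression matrix. Min-max scaling bounds every feature to the same range, whereas standardization preserves the relative distance of outliers from the mean, which is one reason the two can favor different classifiers.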

DNA microarrays in clinical oncology

Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data

A statistical method for identifying differential gene-gene co-expression patterns

Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources

Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization

MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia

Unsupervised feature selection algorithm for multiclass cancer classification of gene expression RNA-Seq data

Upper airway gene expression differentiates COVID-19 from other acute respiratory illnesses and reveals suppression of innate immune responses by SARS-CoV-2

Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers?

scPred: Accurate supervised method for cell-type classification from single-cell RNA-seq data

Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells

Advantages of RNA-seq compared to RNA microarrays for transcriptome profiling of anterior cruciate ligament tears

Transcriptomic responses in the fish intestine

Ensemble feature selection for stable biomarker identification and cancer classification from microarray expression data

Gene selection for microarray data classification via multi-objective graph theoretic-based method

Naive Bayes combined with partial least squares for classification of high dimensional microarray data

Deep-Learning-Based Cancer Profiles Classification Using Gene Expression Data Profile

Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification

Binding of cells to matrixes of distinct antibodies coated on solid surface

A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data

Using deep learning to enhance cancer diagnosis and classification

MicroRNA target predictions in animals

MicroRNA Profiling of Chronic Lung Disease for the Lung Genomics Research Consortium

The Site and Nature of Terminal Bronchiolar Destruction in Chronic Obstructive Pulmonary Disease (COPD)

Genome-Wide Screening of Transcriptional Modulation in Non-Smoking Female Lung Cancer in Taiwan

Differences in Monocyte miRNA Profiles Between Patients with Coronary Artery Disease and Healthy Controls

Transcriptomic Changes in the Nasal Epithelium Associated with Diesel Engine Exhaust Exposure (data)

Expression of Viral and Human microRNAs in Blood in the Beijing Truck Driver Air Pollution Study (BTDAS)

Data Mining

Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization. Analytics Vidhya

Some studies in machine learning using the game of checkers

The perceptron: A probabilistic model for information storage and organization in the brain

Applications of feedforward multilayer perceptron artificial neural networks and empirical correlation for prediction of thermal conductivity of Mg(OH)2-EG using experimental data

A method for stochastic optimization

Optimum learning rate for classification problem with MLP in data mining

Parameter selection in SVM with RBF kernel function

Support-vector networks

Core vector machines: Fast SVM training on very large data sets

An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods

An introduction to decision tree modeling

Principles of Data Mining

How Many Trees in a Random Forest? Proceedings of the Machine Learning and Data Mining in Pattern Recognition

Random Forests

An introduction to kernel and nearest-neighbor nonparametric regression

Study on the Efficiency of a Multi-layer Perceptron Neural Network Based on the Number of Hidden Layers and Nodes for Diagnosing Coronary-Artery Disease

Integrative phenotyping framework (iPF): Integrative clustering of multiple omics data identifies novel lung disease subphenotypes

Identification of a novel biomarker, SEMA5A, for non-small cell lung carcinoma in nonsmoking women

Identification of regulatory SNPs associated with genetic modifications in lung adenocarcinoma

High miR-124-3p expression identifies smoking individuals susceptible to atherosclerosis

Small sample sizes in high-throughput miRNA screens: A common pitfall for the identification of miRNA biomarkers

Transcriptomic changes in the nasal epithelium associated with diesel engine exhaust exposure

Blood pressure and expression of microRNAs in whole blood

Particulate air pollution exposure and expression of viral and human microRNAs in blood: The Beijing truck driver air pollution study

World Health Organization. Obesity: Preventing and Managing the Global Epidemic; World Health Organization

Defining overweight and obesity: What are the issues?

Informed Consent Statement: Not applicable.