key: cord-0058846-b9e1d6qy authors: Rueda, Edwin J.; Ramos, Rommel; Franco, Edian F.; Belo, Orlando; Morais, Jefferson title: One-Class SVM to Identify Candidates to Reference Genes Based on the Augment of RNA-seq Data with Generative Adversarial Networks date: 2020-08-24 journal: Computational Science and Its Applications - ICCSA 2020 DOI: 10.1007/978-3-030-58799-4_51 sha: 88025784801de884b8fd1e1afa7234f3d589699c doc_id: 58846 cord_uid: b9e1d6qy Reference Genes (RG) are constitutive genes required for the maintenance of basic cellular functions. Different high-throughput technologies are used to identify these types of genes, including RNA sequencing (RNA-seq), which allows measuring gene expression levels in a specific tissue or an isolated cell. In this paper, we present a new approach based on Generative Adversarial Network (GAN) and Support Vector Machine (SVM) to identify in-silico candidates for reference genes. The proposed method is divided into two main steps. First, the GAN is used to increase a small number of reference genes found in the public RNA-seq dataset of Escherichia coli. Second, a one-class SVM based on novelty detection is evaluated using some real reference genes and synthetic ones generated by the GAN architecture in the first step. The results show that increasing the dataset using the proposed GAN architecture improves the classifier score by 19%, making the proposed method have a recall score of 85% on the test data. The main contribution of the proposed methodology was to reduce the amount of candidate reference genes to be tested in the laboratory by up to 80%. Reference Genes (RG) are constitutive genes required for the maintenance of remains relatively constant in different cells, tissues, and stress levels [8, 15, 17] . For instance, in the real-time reverse transcription-polymerase chain reaction (RT-qPCR) which provides accurate and reproducible quantification of genetic copies, the selection of RG is essential for accurate normalization of gene expression data obtained [4, 11] . The RG identification is not a trivial task because in some tissues or cells, the number of validated RG in the literature is small, and the level of expression of some of these genes varies depending on tissue, cell, or level of stress. Several methods to identify RG are available in the literature, mainly using optimization algorithms and unsupervised machine learning (ML) approach [2, 5, 13, 14] . In [5] is proposed a methodology based on clustering algorithms using Euclidean distance as a similarity metric. In this methodology, the number of clusters is chosen based on the SD validity index and the Dunn index [9] . Then, based on some reference genes, are selected the clusters in which at least one reference gene is present, reducing the number of candidate genes to be considered. Finally, the candidate genes are selected based on the coefficient of variation and the standard deviation of the expression level of each gene. In [2] is proposed a methodology based on optimization algorithms, more specifically a particular type of linear programming called minimum cost network flow problem. The idea is to find the cheapest path in a directed, acyclic n sdimensional graph, where n s denotes the total number of samples, from the lowliest, non-zero expressed gene (source) to the most highly expressed gene (sink), given several constraints. The genes with identically zero expression are omitted from the analysis and the distance from one gene to another is calculated using the Euclidean distance. Finally, the genes belonging to the cheapest path are considered candidate genes. The main disadvantage of current methods is based on the small set of RG available in the literature for some tissues or cells. Therefore, methods that use the Euclidean distance between unclassified genes and RG as a similarity metric to detect potential candidates are not giving importance to unclassified genes that are distant from the RG, but that could be candidates due to stability in their gene expression. To mitigate the main disadvantage of current methods, this paper proposes a novel approach based on Generative Adversarial Networks (GAN) and a oneclass classifier based on novelty detection to the problem of RG identification. The proposal is divided into two main steps. First, the GAN is used to generate enough synthetic RG from the actual RG available in the literature for the Escherichia coli MG1655 dataset [2, 11] . Second, a one-class classifier is trained using some real RG and synthetic ones generated by the GAN in the first step to detect new candidate genes. In this work, we adopted the support vector machine as a one-class classifier, since this classifier has a good performance in the literature [1, 16] . The remaining sections of this paper are organized as follows. Section 2 shows the main related works and the contributions of this work; Sect. 3 presents the proposed method for candidate RG identification, and the results are discussed in Sect. 4 . Finally, Section 5 shows the final remarks on this research. Different papers in the literature propose the in-silico identification of candidates for reference genes. These approaches are based on datasets that contain a minority of genes labeled as reference genes. The objective is to classify the remaining genes as candidates or not to reference genes. Currently, emerging different methods that use optimization algorithms and unsupervised machine learning algorithms to tackle the problem. In [2] was proposed a novel method for the identification of RG called moose 2 . This method uses the minimum cost network flow problem to find candidates for RG. The idea is to find the cheapest path in a directed, acyclic n s -dimensional graph, where n s denotes the total number of samples, from the lowliest, non-zero expressed gene (source) to the most highly expressed gene (sink), taking into account that for any edge (i, j) that connects two nodes (genes) in the graph, this edge is in the direction of the lower to the higher ranked genes, where the ranking is determined by sorting the values of the gene expression averaged over all samples. Note that in the graph, each gene is connected to every other gene. The data set used is Escherichia coli obtained from CGSC (Coli Genetic Stock Center) at Yale University (http://cgsc.biology.yale.edu), where from 6 RG, moose 2 identifies 27 possible candidates for RG. In [5] is proposed a methodology based on clustering algorithms using Euclidean distance as a similarity metric. In this methodology, the number of clusters is chosen based on the SD validity index and the Dunn index [9] . Then, based on some RG, are selected the clusters in which at least one RG is present, reducing the number of candidate genes to be considered. Finally, the candidate genes are selected based on the coefficient of variation and the standard deviation of the expression level of each gene. This method was evaluated in the Corynebacterium pseudotuberculosis strains CP1002 and CP25 database [10] , obtaining 134 candidate genes in total. Vandesompele et al. [14] presents the gNorm algorithm, which uses a measure of gene-stability to determine the stability of gene expression. For each control gene, they determine the variation in pairs with all other control genes as the standard deviation of the logarithmically transformed expression ratios and define the stability measure of the internal control gene M as the average variation in pairs of a particular gene with all other control genes. Genes with the lowest M values have the most stable expression. Therefore, those genes can be considered candidates for RG. These current methods manage to identify candidate genes based on a number of initial RG, which is sometimes inefficient due to the small number of RG identified in certain cells or tissues. Therefore, in this paper, we propose to increase the number of reference genes synthetically through a GAN architecture. Then, based on the augmented reference data, train a one-class SVM classifier to select from a dataset of unclassified genes, the possible candidates for RG. This section presents the dataset used to select the possible candidates for RG and the algorithms used in this approach. Figure 1 summarizes the proposed approach, which consists of two main steps. First, the amount of RG available in the dataset is increased using a proposed GAN architecture. Second, a oneclass classifier based on support vector machine (SVM) is trained to detect new candidates for RG using 70% of the real RG and synthetic genes generated by the GAN architecture in the first step. The performance of the proposed classifier is evaluated with 30% of the remaining RG available in the training set. The next subsection explains the dataset used, its initial processing for its subsequent increase through the proposed GAN architecture. Then, the one-class SVM algorithm used for the selection of candidates is explained. This work uses the RNA-seq database of Esquerichia coli MG1655 available at [2] . This dataset consists of 4293 genes that were subjected to stress conditions. Three different samples were removed at 0, 30, and 90 min, and immediately mixed with 0.25 vol of RNA stop solution (95 % ethanol, 5 % phenol) on ice. Note that the dataset has 9 features, 3 for each sampling time. For the identification of the largest number of RG in this dataset, a review of the literature was necessary, which consisted of taking the genes that were at least tested in three different studies. Finding 21 reference genes in total, 15 in [11] and 6 in [2] . The RNA-Seq dataset was normalized by reads per kilobase of exon model per million mapped reads (RPKM) and applying (log 2 + 1) to mitigate the number of outliers and increase the proximity of the data features. To filter the dataset and based on the fact that the reference genes have a minimum level of variation [18] , a stability test is performed based on the coefficient of variation (CV ) given by Eq. 1, where σ represents the standard deviation of the set of features of each gene, and μ represents its mean. The stability test consists of selecting the genes for which their CV value is within an interquartile range given by [ with k = 1.5. Note that genes with a CV value outside this range are not considered. Finally, the data is scaled between [−1, 1] as show in Eq. 2, where X t represents the transformed data, x i the i-th gene and X max and X min represent the maximum and minimum values for each features of the dataset. Generative adversarial networks were introduced by [6] . The concept was developed for the estimation of generative models based on adversarial processes. In these adversarial processes are trained two neural networks; the first, a generative network G (z, θ g ) which receives as input some noise variables from a, p z (z) distribution and aims to learn a p g distribution on training data x, and a second discriminatory network D (x, θ d ), which has as output a single scalar which represents the probability that x comes from the distribution of the real data p data , and not from p g . In this minimax game, there is only one solution, which occurs when the G network recovers the distribution of the real data p data , making the D network fail to distinguish if a data comes from the original distribution or the distribution generated by the G network, D (x) = 1 2 . Figure 2 illustrates the training process of the GAN architecture. This training is divided into two main steps. First, the discriminative model D is trained based on real and synthetic data and its weights are frozen (these weights are not updated in the second step). Second, the generative model G is trained based on an input noise vector to maximize the probability that the model D makes a mistake. After repeated steps one and two a defined amount of epochs, the model G may generate synthetic data similar to the original data. The objective of this approach is to generate enough synthetic RG with a GAN architecture to be able to build a one-class model as general as possible. For this, Fig. 3 illustrates the proposed GAN architecture. Where the generator model ( Fig. 3(a) ) takes as input a noise vector based on a normal distribution with μ = 0 and σ = 1 and produces as output a synthetic RG. On the other hand, the discriminator model ( Fig. 3(b) ) was also built with fully connected dense layers. This model takes as input real reference genes (real distribution) and synthetic reference genes (generated by the G network) and its output represents the probability that a gene belongs to the real data distribution. The number of layers and neurons was taken based on the lowest cost value for the metrics proposed in Eqs. 9 and 10. The GAN architecture was trained with the stochastic gradient descent algorithm with a different learning rate l r for the discriminator and the generator. These learning rates decreased at each epoch by a factor equal to lr epochs to improve the convergence of the GAN. The initial weights for each layer were chosen based on Xavier uniform initializer [3] , which initializes the weights based on a uniform distribution U [−t, t], where t is equal to 6/ (n j + n j+1 ) being n j the number of input neurons and n j+1 the number of output neurons in the j layer. The one-class support vector machine induced by [12] , is an algorithm that tries to estimate a function f , being f ≥ 0 for positive examples and f < 0 for any other example. In this strategy, the data is mapped with a function Φ in a new characteristic space F and separated from the origin with a maximum margin. Consider the training set x 1 , . . . , x ∈ X, where ∈ N is the number of observations, and X is some set ∈ R N . To separate the training set from the origin, the following quadratic problem is solved: min w∈F,ξ∈R ,ρ∈R Deriving the dual problem, the solution can be shown to have an support vector expansion: where the sgn function is equal to 1 for values greater than or equal to zero and −1 otherwise. Patterns x i with nonzero α i are called support vectors, where the coefficients are found by solving the dual problem: In this approach, the parameter ν is a trade-off between the normal and anomaly data in the dataset, which was set to 0.010452, and the mapping function Φ was represented by the Gaussian kernel given in Eq. (7), where γ is equal to 1/n fe and n fe is the number of features of the training set. If the proposed dual problem is solved, the decision function f (x) will take positive values for the greatest number of samples x i contained in the training set. To validate the performance of the proposed classifier and based on the premise that we only have a positive class (RG), we consider only the recall metric: Recall = T P T P + F N (8) where T P (True Positives) represents the number of genes that the classifier correctly classified as RG, and F N (False Negatives) represents the amount of RG that the classifier classified as non-RG. This metric allows us to evaluate the ability of our classifier to recognize reference genes. Therefore, a value close to 1 indicates that the classifier has a good performance. This section presents the experiments and results for the identification of candidates to RG based on the use of GAN to synthetically increase the training set, and based on this dataset, build a one-class SVM classifier. The focus of this study was based on the creation of the GAN architecture and the one-class SVM classifier for the identification of possible RG candidates in Escherichia coli MG1655 dataset obtained from [2] . For this study, the dataset was cleaned taking into account two important stages. In the first stage, we removed the genes for which their expression value was equal to zero. In the second stage, based on the CV (Eq. 1), we eliminated all the genes for which the CV was not within the interquartile range (outliers). Table 1 shows the data processing to obtain the final dataset. Of the 21 RG identified in the literature for Escherichia coli, the idnT gene was not taken into account because it's CV was outside the interquartile range calculated for the reference genes. Note that were calculated two interquartile ranges, one based on the reference genes and the other based on the remaining genes. Table 2 shows the list of 20 RG used in this study. For the training of the GAN architecture (Fig. 3) , a parameter adjustment was made in the G and D networks to reach the global optimum (D(x) = 1 2 ). For this reason, we opted to update the weights of both networks (θ g and θ d ) based on all available reference genes (batch size = 20), because this allowed the network to converge faster. Note that the weights of network D are updated based on 20 reference genes and 20 synthetic genes generated by network G. Finally, 1700 epochs were used for the training of the proposed GAN architecture, considering that an epoch occurs when all the genes are iterated in the training set (one batch size). The parameters for training the GAN architecture are presented in Table 3 . The selection of the best initial weights taking the Xavier uniform initializer was based on a similarity metric S(x, x ) proposed in Eq. (9) , where x represents the training set given by a matrix ∈ R (20, 9) , x represents the set of synthetic data generated by the G network, andŷ represents the class predicted by network D for synthetic data, which takes values equal to 1 for real data and 0 for synthetic data. The parameters n f , n g , and m represent the number of characteristics of each gene, the number of synthetic genes, and the number of RG in the training set. Note that x (k) i represents the feature k in the i-th gene. The GAN network was trained on a hundred different occasions, for each of these occasions, the average of the similarity S(x, x ) was calculated based on 30 repetitions, wherein each repetition, were generated 300 synthetic genes. Thus, we chose the weights for which the GAN architecture presented the lowest similarity value. Figure 4 illustrates the convergence process in the training of the proposed GAN architecture, where Fig. 4(a) illustrates that the loss (based on binary cross entropy) tends to 0.7 and Fig. 4(b) illustrates that the accuracy of the Discriminator tends to 0.5 (global optimum). These results indicate that network G is generating synthetic genes similar to the real ones, making network D unable to distinguish them. Principal component analysis PCA [7] is used for the graphical representation of the augmented dataset, with this representation, we can visually observe the similarity between the original and the synthetic data generated by the G network. For the choice of the best set of synthetic data, we generate 5000 different sets of 300 synthetic genes and based on the metric E(x ) proposed in Eq. (10), where CV represents the coefficient of variation for the i-th synthetic gene, we chose the set of synthetic genes with the lowest value of this metric (E = 0.11). Figure 5 illustrates the 2-PCA representation of the RG (denoted by a green star) and the synthetic genes (denoted by a red circle), where we can see that the genes generated by the GAN architecture are not distant from the real RG. The selection of candidates to RG is based on the one-class SVM classifier. We train the proposed classifier based on the augmented training set that was built based on 70% of the RG and the 300 synthetic genes generated in the previous step. The validation of this classifier (Eq. (8)) was performed with four sets of data. Table 4 shows the results of the proposed classifier for each of these data sets. Where in the training set, we calculate two different recalls, the first for the augmented dataset, which consists of the augmented data used for classifier training, and the second based only in 70% of the RG chosen above. For the validation of the classifier in the test data, these were divided into two sets, the first set with the remaining 30% of the RG, and the second set based only on new synthetic RG generated by the proposed GAN architecture. Finally, to observe how the increase of the dataset through the GAN architecture improves the performance of the classifier, we trained the proposed classifier based only on the 20 reference genes (70% for training and 30% for testing) and compared with the trained classifier with augmented data. Table 5 compares the performance of both classifiers. After training the classifier with the data augmented and based on the 4168 unclassified genes in the dataset, we consider as possible candidates for RG the genes for which the f (x) function (Eq. (5)) is equal to 1. Note that this function takes values equal to 1 for data within the decision boundary (positive class). In this approach, the classifier detected 807 possible candidates to RG, reducing the amount of reference genes candidates to be tested in the laboratory by 80.64%. Figure 6 illustrates the 2-PCA representation of the RG and candidate genes selected by the proposed classifier. In this representation, we can perceive that the proposed classifier generates candidates close to the RG as we expected, which indicates that the mapped space created by the classifier is being consistent with the training data. Finally, based on the set of candidates for RG found in this approach and taking into account 27 candidates for RG found in [2] . We select the 11 candidates for RG that are common in both studies. Table 6 shows the common candidates in both studies. This paper presented a new approach for in-silico identification of candidate reference genes from the Escherichia coli MG1655 dataset. This approach consisted of two main steps. First, a proposed GAN architecture was trained to synthetically augment the reference gene set. Second, with the augmented dataset was trained a one-class SVM classifier which obtained an 85% recall score in the test data. This new approach allowed the identification of 807 possible candidate genes for reference genes out of a total of 4170 expressed genes, which reduces the number of genes to be analyzed in the laboratory by 80%. Finally, the approach demonstrated, as expected, that synthetically increasing the number of reference genes improves the classifier score, having in this research, an increase of 19% with respect to the trained classifier without the augmented data, making the selection of candidate genes more reliable. For future work, other GAN architectures will be tested to try to improve the global optimum (D(x) = 1 2 ). Thus, we could generate synthetic genes that are closer to the real distribution of the reference genes, being able to increase the score of the proposed novelty detector. Abnormal event detection in crowded scenes using one-class SVM RNAsequence data normalization through in silico prediction of reference genes: the bacterial response to DNA damage as case study Using neural networks for RSSI location estimation in LoRa networks Selection of reference genes for quantitative real-time PCR analysis of photosynthesis-related genes expression in Lilium regale A clustering approach to identify candidates to housekeeping genes based on RNA-seq data Generative adversarial nets Principal component analysis for surface reflection components and structure in facial images and synthesis of facial images for various ages Evaluation of reference genes for gene expression studies using quantitative real-time PCR in Drosophila melanogaster after chemical exposures Cluster validity measurement techniques Differential transcriptional profile of Corynebacterium pseudotuberculosis in response to abiotic stresses Bacterial reference genes for gene expression studies by RT-qPCR: survey and analysis Support vector method for novelty detection A computational approach using ratio statistics for identifying housekeeping genes from cDNA microarray data Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes Identification and evaluation of reference genes for quantitative real-time PCR analysis in Passiflora edulis under stem rot condition Anomaly detection in activities of daily living using one-class support vector machine Selection of stable reference genes for gene expression analysis in sweet potato (Ipomoea batatas L.) Selection and validation of reference genes for RT-PCR expression analysis of candidate genes involved in morphine-induced conditioned place preference mice Acknowledgments. This study was financed by the Coordenação de Aperfeiçoamento de Pessoal de Nivel Superior -Brasil (CAPES), under the Program PROCAD-AMAZÔNIA, process n o 88881.357580/2019-01.