key: cord-0074626-9lyl12ww authors: Zulfiqar, Hasan; Huang, Qin-Lai; Lv, Hao; Sun, Zi-Jie; Dao, Fu-Ying; Lin, Hao title: Deep-4mCGP: A Deep Learning Approach to Predict 4mC Sites in Geobacter pickeringii by Using Correlation-Based Feature Selection Technique date: 2022-01-23 journal: Int J Mol Sci DOI: 10.3390/ijms23031251 sha: 0fefd57ce48ebc0dfda68f4dd025ce1aca8774c2 doc_id: 74626 cord_uid: 9lyl12ww

4mC is a type of DNA modification that can coordinate multiple biological processes, for example, DNA replication, gene expression, and transcriptional regulation. Accurate prediction of 4mC sites can provide precise information about their genetic functions. The purpose of this study was to establish a robust deep learning model to recognize 4mC sites in Geobacter pickeringii. In the proposed model, two kinds of feature descriptors, namely, binary and k-mer composition, were used to encode the DNA sequences of Geobacter pickeringii. The features obtained from their fusion were optimized by using a correlation and gradient-boosting decision tree (GBDT)-based algorithm with the incremental feature selection (IFS) method. Then, these optimized features were fed into a 1D convolutional neural network (CNN) to classify 4mC sites from non-4mC sites in Geobacter pickeringii. The performance of the proposed model on independent data exhibited an accuracy of 0.868, which was 4.2% higher than the existing model.

Modifications of DNA play a significant role in gene expression and regulation, DNA replication, and transcriptional regulation. Methylcytosine is a key epigenetic mark at 5′-cytosine-phosphate-guanine-3′ (CpG) sites. Methylcytosine is closely correlated with cell growth and chromosomal protection [1, 2]. 5-Hydroxymethylcytosine (5hmC), 5-methylcytosine (5mC), and 4-methylcytosine (4mC) are the familiar cytosine methylations in multiple genomes of prokaryotes and eukaryotes [3, 4].
5mC is a common type of methylcytosine and is implicated in many neurodegenerative diseases and cancers [5]. 4mC is an important modification that protects genomic information from degradation by restriction enzymes [6]. Precise identification of 4mC sites can provide important clues for understanding the mechanisms of gene regulation. At present, several experimental techniques can recognize 4mC sites, for example, single-molecule real-time sequencing [7], mass spectrometry [8], and bisulfite sequencing [9], but these techniques are time-consuming and expensive when applied to next-generation sequencing data. Hence, computational models for identifying 4mC sites are urgently needed. A few computational and mathematical methods have been introduced to predict 4mC sites in multiple species. In 2017, Chen et al. [10] introduced the first computational model to predict 4mC sites in multiple species on the basis of an experimentally confirmed 4mC dataset. Subsequently, Wei et al. [11] designed an iterative feature representation algorithm for the prediction of 4mC sites. Tang et al. [12] introduced a linear integration method that merges existing models for the identification of 4mC sites. Afterwards, Manavalan et al. [13] established the tool Meta-4mCpred to recognize 4mC sites in six different species. Khanal et al. [14] introduced the first deep learning model, 4mCCNN, which uses numerous feature combinations [15–17] for the prediction of 4mC sites in multiple genomes [18]. Although 4mCCNN yields good results, there is still room for improvement. To address this, we constructed a 1D CNN model to recognize 4mC sites in Geobacter pickeringii. Figure 1 illustrates the flowchart of the whole study.
Binary and k-mer nucleotide composition descriptors were used to encode the DNA sequences of Geobacter pickeringii into feature vectors, and these features were then optimized by using a correlation and gradient-boosting decision tree (GBDT)-based algorithm with the incremental feature selection (IFS) method. After this, the optimized features were fed into a 1D CNN-based classifier using 10-fold cross-validation, and we obtained the best model for classifying 4mC from non-4mC sites.

We constructed a 1D CNN-based model named Deep-4mCGP for the identification of 4mC sites in Geobacter pickeringii. In the first step, we converted the sequence data into feature vectors by using k-mer nucleotide composition and binary encodings. Subsequently, these feature vectors were refined by means of a correlation and GBDT-based algorithm with the IFS method.
First correlation and then GBDT with IFS were used to pick the best features. Figure 2A,B displays the IFS curves of the top features. Afterward, these features were fed into the 1D CNN, using 10-fold cross-validation, to classify 4mC sites from non-4mC sites in Geobacter pickeringii. In this work, 10-fold cross-validation was employed to examine the performance of the model. The data were randomly divided into 10 segments of equal proportion. Each segment was tested independently by the model trained on the remaining nine segments. Thus, training and testing were executed 10 times, and the average of the outcomes was taken as the final result.
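The 10-fold procedure described above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' implementation: the function names `k_fold_splits` and `cross_validate`, the fixed random seed, and the round-robin assignment of shuffled indices to folds are all assumptions made for the example, and `evaluate_fold` stands in for whatever training and scoring routine is used.

```python
import random


def k_fold_splits(n_samples, n_folds=10, seed=0):
    """Return a list of (train_indices, test_indices) pairs (hypothetical helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random division of the data
    # Round-robin assignment keeps fold sizes within one sample of each other.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        # Train on the nine folds that are not held out.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validate(n_samples, evaluate_fold, n_folds=10, seed=0):
    """Average the per-fold score returned by evaluate_fold(train, test)."""
    scores = [evaluate_fold(train, test)
              for train, test in k_fold_splits(n_samples, n_folds, seed)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # 1138 is the size of the benchmark dataset used in this study.
    splits = k_fold_splits(1138)
    print([len(test) for _, test in splits])
```

Because 1138 is not divisible by 10, the folds can only be approximately equal in size; every sample is still tested exactly once.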
The AUROC of the proposed model was 0.986, which was 6.5% higher than that of the existing model. The accuracy, precision, recall, and F1 are shown in Table 1, and the ROC curve is shown in Figure 2C.

Table 1. Outcomes of single encodings and their fusion-based models on training and independent data by using different classification algorithms. Bold is used to highlight the best results.

The sequence pattern around a modification site is crucial for recognizing and understanding genomic differences [19]. In this work, we used Two Sample Logo [20] to inspect the distribution of nucleotides around the 4mC site. Figure 2D illustrates this distribution. Nucleotides 'A' and 'T' were enriched upstream and downstream in the positive sequences, e.g., five consecutive 'A' nucleotides (positions 30–34) and four consecutive 'A' nucleotides (positions 15–18 and 24–27) occurred in positive sequences. Nucleotides 'C' and 'G' were abundant upstream and downstream in the negative sequences, e.g., five consecutive 'G' nucleotides (positions 30–34), four consecutive 'G' nucleotides (positions 3–6 and 24–27), and four consecutive 'C' nucleotides (positions 15–18) were observed in negative sequences. Figure 2D shows that there was a significant difference between 4mC sequences and non-4mC sequences. These results suggest that the position-specific distribution of nucleotides is helpful for the precise identification of 4mC sites.

The fused features were also fed into LSTM [21], GBDT [22], and RF [23, 24] classifiers for comparison with the CNN-based model [25]. On the basis of AUROC, we obtained the best model for each predictor, as shown in Table 1 and Figure 2F. A comparison of the proposed model with 4mCCNN by 10-fold cross-validation is shown in Figure 2E. The performance of Deep-4mCGP was then checked on the independent data (200 positive and 200 negative sequences) and compared with that of the existing 4mCCNN.
The accuracy, precision, recall, F1, and AUROC of 4mCCNN were 0.826, 0.818, 0.823, 0.825, and 0.920, respectively. The accuracy, precision, recall, F1, and AUROC of Deep-4mCGP were 0.868, 0.876, 0.773, 0.859, and 0.961, respectively. The proposed Deep-4mCGP thus reached an accuracy of 0.868 on independent data, 4.2 percentage points higher than that of 4mCCNN (0.826). The performance comparison is shown in Table 2.

Reliable data are a significant requirement for the construction of a machine learning-based model [26, 27]. Thus, we acquired 1138 sequences (569 positive and 569 negative) of Geobacter pickeringii from the work of Chen et al. [10] for training and testing the model. Moreover, we obtained 400 sequences (200 positive and 200 negative) from the work of Manavalan et al. [13] for independent testing.

Selecting useful and ideal features is an important step in developing machine learning models [4, 28–37]. Converting DNA sequences into numerical feature vectors is key to the recognition of functional elements. Encodings such as physicochemical properties, natural vectors, binary composition, and k-mer nucleotide composition have been used in computational biology and bioinformatics [38, 39]. In this study, binary and k-mer composition were used to encode the DNA sequences of Geobacter pickeringii.

k-mer composition can capture interactions between neighboring nucleotides of DNA sequences [40]. The k-mers are extracted by sliding a window of a given size along the sequence with a given step. A sample F with a sequence length of n can be written as

$$F = S_1 S_2 S_3 \cdots S_i \cdots S_n$$

where $S_i$ indicates the i-th nucleotide of the DNA sequence. With the help of k-mer composition, F can be converted into a $4^k$-dimensional feature vector:

$$F = \left[ d_1^{k\text{-tuple}},\ d_2^{k\text{-tuple}},\ \ldots,\ d_i^{k\text{-tuple}},\ \ldots,\ d_{4^k}^{k\text{-tuple}} \right]^{\mathrm{T}}$$

where $d_i^{k\text{-tuple}}$ denotes the occurrence frequency of the i-th k-mer and T represents the transposition.
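As a hedged illustration of the k-mer encoding just defined, the following standard-library Python sketch computes the $4^k$-dimensional composition vector for one sequence and concatenates the vectors for several values of k. The function names, the window step of 1, and the normalization by the number of windows are assumptions made for this example; the original implementation may count or normalize differently. The default k values of 1 to 6 are the ones used in this work.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_composition(seq, k):
    """Return the 4**k-dimensional k-mer frequency vector of one sequence."""
    # Enumerate all k-mers in lexicographic order (AA..., AC..., ..., TT...).
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    n_windows = len(seq) - k + 1
    for start in range(n_windows):  # sliding window, step 1 (assumption)
        window = seq[start:start + k]
        if window in index:  # windows with non-ACGT symbols are skipped
            counts[index[window]] += 1
    total = max(n_windows, 1)
    # Normalizing by the window count is an assumption of this sketch.
    return [c / total for c in counts]


def fused_kmer_features(seq, ks=(1, 2, 3, 4, 5, 6)):
    """Concatenate the k-mer compositions for every k in ks."""
    features = []
    for k in ks:
        features.extend(kmer_composition(seq, k))
    return features


if __name__ == "__main__":
    example = "ACGT" * 10 + "A"  # a made-up 41-bp sequence
    print(len(fused_kmer_features(example)))
```

Enumerating the k-mers in a fixed lexicographic order keeps the feature positions identical across sequences, which is what allows the vectors to be stacked into a single feature matrix.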
If k is equal to 1, the DNA sequence is encoded as a 4D feature vector, and if k is equal to 2, the DNA sequence is encoded as a 16D feature vector. In this work, k was set to 1, 2, 3, 4, 5, and 6. Consequently, each DNA sequence was converted into a feature vector of $4^1 + 4^2 + 4^3 + 4^4 + 4^5 + 4^6 = 5460$ dimensions.

Binary encodings, i.e., strings of 0s and 1s, can represent any information. Therefore, a DNA sequence can be transformed into 0s and 1s. In this work, each DNA sequence of Geobacter pickeringii, with a length of 41 bp, was encoded into a (4 × 41 = 164)D feature vector.

Correlation is a familiar measure of the relationship between two features, e.g., if two features are linearly uncorrelated, the correlation is zero, whereas a perfect linear relationship gives ±1. Two broad approaches, classical linear correlation and correlation based on information theory, are used to compute the correlation between two variables. The linear correlation coefficient is the most familiar and widely used. The linear correlation coefficient r for a pair of variables (p, q) is given by

$$r = \frac{\sum_{i} (p_i - \bar{p})(q_i - \bar{q})}{\sqrt{\sum_{i} (p_i - \bar{p})^2}\ \sqrt{\sum_{i} (q_i - \bar{q})^2}}$$

where $\bar{p}$ and $\bar{q}$ are the means of p and q. Correlation gives good results on smaller datasets, but the raw correlation coefficient is less reliable on very large amounts of data. Therefore, it is necessary to determine whether the relationship between features is statistically significant. Thus, we used the t-test to investigate the statistical correlation between the features and picked the significant features. The value of t can be computed as

$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \tag{5}$$

where r signifies the coefficient of correlation, n represents the number of instances, and n − 2 denotes the degrees of freedom. The significance level is 0.05. If t is greater than the critical value at this significance level, the feature is selected.

GBDT is a popular machine learning-based classifier that has been used in various mathematical, cheminformatics, and bioinformatics tools [41, 42].
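Before the GBDT ranking step, the correlation-based t-test filter described above can be sketched as follows. This is a minimal standard-library Python sketch under stated assumptions, not the authors' code: the function names are hypothetical, each feature column is correlated with the class labels (the text does not spell out which pair of variables is compared), the test is treated as two-sided, and the default critical value of 1.96 is a large-sample approximation for a significance level of 0.05 rather than the exact Student-t quantile for n − 2 degrees of freedom.

```python
import math


def pearson_r(p, q):
    """Linear correlation coefficient between two equal-length sequences."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / math.sqrt(var_p * var_q)


def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    denom = 1.0 - r * r
    if denom <= 0.0:  # |r| == 1 (or rounding slightly past it)
        return math.copysign(math.inf, r)
    return r * math.sqrt(n - 2) / math.sqrt(denom)


def select_features(columns, labels, critical_value=1.96):
    """Return indices of feature columns whose |t| exceeds the critical value.

    critical_value=1.96 is an assumed large-sample, two-sided 0.05 threshold.
    """
    n = len(labels)
    selected = []
    for idx, column in enumerate(columns):
        t = t_statistic(pearson_r(column, labels), n)
        if abs(t) > critical_value:
            selected.append(idx)
    return selected
```

For small samples the critical value should instead be looked up from the t distribution with n − 2 degrees of freedom; with the 1138 training sequences used here the difference from 1.96 is negligible.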
GBDT builds a scalable and reliable prediction model from non-linear combinations of weak learners [43]. Given a training set

$$\{(x_1, y_1), \ldots, (x_n, y_n)\}, \quad x_i \in x \subseteq S^n,\ y_i \in y \subseteq S,$$

the model is the sum of K decision trees:

$$q_K(x) := \sum_{k=1}^{K} D_k(x; \theta_k)$$

where $\theta_k$ denotes the minimal-risk parameters of the k-th decision tree and $D_k(x; \theta_k)$ is the decision tree. GBDT computes the final estimate in a forward, stage-wise manner. The negative gradient of the loss function at $q_{k-1}$ is used to compute the residuals $S_{ki}$, and the k-th tree is then fitted to these residuals to obtain the minimal-risk parameters $\theta_k$. Such trees naturally represent the relations between variables by partitioning the input space X into J regions $S_1, \ldots, S_J$, with output $z_j$ for region $S_j$.

The IFS [44, 45] method was implemented in this work to pick the best features. IFS repeatedly evaluates the performance of the top q ranked features for q = 1, 2, 3, ..., n, where n is the total number of features. IFS frequently stops at the first performance peak. In IFS, features were picked incrementally from a randomly taken initial feature, and the best result from several randomly re-initialized IFS runs was reported. A brief explanation of the IFS technique can be found in [46].

Algorithm 1. First round (correlation-based selection; the preceding steps of this round are not recoverable here)
  for i = 1 to k do
    t = significance of (r, ρ) for L_i (using the t-test value from Equation (5))
    if t > critical value
      Q_best = Q_list
    end
  return Q_best

Algorithm 1 (cont.). Second round (GBDT)
  Input: $\{(x_i, y_i)\}$, where $x_i$ = data and $y_i$ = label; loss function LF := $P(y_i, q(x))$
  Initialize the model: $q_0(x) := \arg\min_z \sum_{i=1}^{n} P(y_i, z)$
  for k = 1, 2, ..., K do
    for i = 1, 2, ..., n do
      compute the pseudo-residual $S_{ki} = -\left[ \partial P(y_i, q(x_i)) / \partial q(x_i) \right]_{q = q_{k-1}}$
    on the basis of $S_{ki}$, build a decision tree $D_k(x; \theta_k)$ with $\theta_k = \{S_{kj},\ j = 1, 2, \ldots, J\}$
    for j = 1, 2, ..., J do
      $z_{kj} = \arg\min_z \sum_{x_i \in S_{kj}} P(y_i, q_{k-1}(x_i) + z)$
    update the model: $q_k(x) = q_{k-1}(x) + \sum_{j=1}^{J} z_{kj}\, I(x \in S_{kj})$
  $q(x) = \sum_{k=1}^{K} \sum_{j=1}^{J} z_{kj}\, I(x \in S_{kj})$
  Output: the decision tree function q(x)

LeCun et al.
[47] introduced the convolutional neural network (CNN), which has since been widely used in biology and bioinformatics [48–50]. The fundamental principle of a CNN is to learn many filters that extract hidden topological features from data through layer-wise convolutions and pooling operations. CNNs perform exceptionally well on 2D data such as images and matrices [51]. Subsequently, 1D CNNs have been used for biomedical sequence data and for research associated with natural language processing [41, 52]. In this work, we implemented a 1D CNN to identify 4mC sites in Geobacter pickeringii. We employed Keras 2.3.1 [53], TensorFlow 2.1.0, and Python 3.5.4 to perform this experiment. The best tuning parameters are recorded in Table 3.

Precision, accuracy, recall, and F1 [54–56] were employed to examine the effectiveness of the proposed prediction model and are formulated as

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP is the number of correctly predicted 4mC sequences, TN is the number of correctly predicted non-4mC sequences, FP is the number of non-4mC sequences predicted as 4mC sequences, and FN is the number of 4mC sequences predicted as non-4mC sequences.

4mC is a type of DNA modification that can coordinate multiple biological processes, for example, DNA replication, gene expression, and transcriptional regulation. Accurate prediction of 4mC sites can provide precise information about their genetic functions. Several machine learning models have been used to predict 4mC sites in multiple genomes [10, 12, 13, 57–60]. However, only one deep learning-based model, 4mCCNN [14], exists for Geobacter pickeringii. In this work, a deep learning model was constructed to recognize 4mC sites in Geobacter pickeringii.
In the proposed model, two kinds of feature descriptors, namely, binary and k-mer composition, were used to encode the DNA sequences of Geobacter pickeringii. The features obtained from their fusion were optimized by using a correlation and GBDT-based algorithm with the IFS method. Then, these optimized features were fed into a 1D CNN-based classifier using 10-fold cross-validation, and we obtained the best model for classifying 4mC from non-4mC sites. The proposed Deep-4mCGP reached an accuracy of 0.868 on independent data, which was 4.2% higher than that of 4mCCNN. The source code and data are available at GitHub: https://github.com/linDing-groups/Deep-4mCGP (accessed on 19 January 2022). In future work, we plan to release a web-based application to make the proposed model more convenient for users without programming or statistical knowledge.

References

Function and information content of DNA methylation
Prediction of bio-sequence modifications and the associations with diseases
3-methylcytosine in cancer: An underappreciated methyl lesion? Epigenomics
An Unbiased Predictive Model to Detect DNA Methylation Propensity of CpG Islands in the Human Genome
DNA methylation and human disease
Natural history of eukaryotic DNA methylation systems
Direct detection of DNA methylation during single-molecule, real-time sequencing
Exploring genome wide bisulfite sequencing for DNA methylation analysis in livestock: A technical assessment
Xanthomonas AvrBs3 family-type III effectors: Discovery and function
Identifying DNA N4-methylcytosine sites based on nucleotide chemical properties
Iterative feature representations improve N4-methylcytosine site prediction
DNA4mC-LIP: A linear integration method to identify N4-methylcytosine site in multiple species
Meta-4mCpred: A Sequence-Based Meta-Predictor for Accurate DNA 4mC Site Prediction Using Effective Feature Representation
Identification of N4-methylcytosine sites in prokaryotes using convolutional neural network
4mCpred-EL: An ensemble learning framework for identification of DNA N4-methylcytosine sites in the mouse genome
Improved identification of DNA N4-methylcytosine sites in the mouse genome using multiple encoding schemes
Computational identification of N4-methylcytosine sites in the mouse genome with machine-learning method
MethSMRT: An integrative database for DNA N6-methyladenine and N4-methylcytosine generated by single-molecular real-time sequencing
DNA methylation: Roles in mammalian development
A graphical representation of the differences between two sets of sequence alignments
Learning to forget: Continual prediction with LSTM
Stochastic gradient boosted distributed decision trees
Random forest for bioinformatics
Prediction of Protein-protein Interactions in Arabidopsis thaliana Using Partial Training Samples in a Machine Learning Framework
PSAC: Proactive Sequence-aware Content Caching via Deep Learning at the Network Edge
PPD: A Manually Curated Database for Experimentally Verified Prokaryotic Promoters
Protein Secondary Structure Prediction Using Character bi-gram Embedding and Bi-LSTM
NeuroPred-FRL: An interpretable prediction model for identifying neuropeptide using feature representation learning
StackIL6: A stacking ensemble model for improving the prediction of IL-6 inducing peptides
Deep-4mCW2V: A sequence-based predictor to identify N4-methylcytosine sites in Escherichia coli
Prediction of Neddylation Sites Using the Composition of k-spaced Amino Acid Pairs and Fuzzy SVM
An XGBoost-based predictor for identifying bioluminescent proteins
DeepIPs: Comprehensive assessment and computational identification of phosphorylation sites of SARS-CoV-2 infection using a deep learning-based approach
Multi-Information Sources of Features to RNA Binding Sites Prediction
Application of artificial intelligence and machine learning for COVID-19 drug discovery and vaccine design
Screening of prospective plant compounds as H1R and CL1R inhibitors and its antiallergic efficacy through molecular docking approach
HLPpred-Fuse: Improved and robust prediction of hemolytic peptide and its activity by fusing multiple feature representation
Extremely-randomized-tree-based Prediction of N(6)-Methyladenosine Sites in Saccharomyces cerevisiae
PsePSSM-based Prediction for the Protein-ATP Binding Sites
A computational platform to identify origins of replication sites in eukaryotes
A sequence-based deep learning approach to predict CTCF-mediated chromatin loop
Identification of cyclin protein using gradient boost decision tree algorithm
Lightgbm: A highly efficient gradient boosting decision tree
A Brief Survey of Machine Learning Methods in Protein Sub-Golgi Localization
Identification of hormone binding proteins based on machine learning methods
PoGB-pred: Prediction of Antifreeze Proteins Sequences Using Amino Acid Composition with Feature Selection Followed by a Sequential-based Ensemble Approach
Gradient-based learning applied to document recognition
Identifying sgRNA on-target activity in four crops using ensembles of convolutional neural networks
Deep-BSC: Predicting Raw DNA Binding Pattern in Arabidopsis thaliana
Electroencephalography based fusion two-dimensional (2D)-convolution neural networks (CNN) model for emotion recognition system
Integrated Analysis of mRNA-seq and miRNA-seq to identify c-MYC, YAP1 and miR-3960 as Major Players in the Anticancer Effects of Caffeic Acid Phenethyl Ester in Human Small Cell Lung Cancer Cell Line
Keras: Deep learning library for theano and tensorflow
ProLanGO: Protein function prediction using neural machine translation based on a recurrent neural network
Effective Classification of Melting Curve in Real-time PCR Based on Dynamic Filter-based Convolutional Neural Network
RFhy-m2G: Identification of RNA N2-methylguanosine modification sites based on random forest and hybrid features
4mCPred: Machine learning methods for DNA N4-methylcytosine sites prediction
iDNA-MS: An integrated computational tool for detecting DNA modification sites in multiple genomes
Identification of Potential Inhibitors Against SARS-CoV-2 Using Computational Drug Repurposing Study
DeepTorrent: A deep learning-based approach for predicting DNA N4-methylcytosine sites

Informed Consent Statement: Not applicable.

Data Availability Statement: All the data are available at https://github.com/linDing-groups/Deep-4mCGP (accessed on 19 January 2022).

We are very thankful to Hui Ding (Center for Informational Biology, University of Electronic Science and Technology of China) for their constructive suggestions and support on this work.

All the authors are in agreement and declare that there is no conflict of interest.