key: cord-0298125-p62rxpdx authors: Bonidia, Robson Parmezan; Sampaio, Lucas Dias Hiera; Domingues, Douglas Silva; Paschoal, Alexandre Rossi; Lopes, Fabrício Martins; de Leon Ferreira de Carvalho, André Carlos Ponce; Sanches, Danilo Sipoli title: Feature Extraction Approaches for Biological Sequences: A Comparative Study of Mathematical Models date: 2020-08-06 journal: bioRxiv DOI: 10.1101/2020.06.08.140368 sha: 03612e84b490d3e307c21b4a93dad864c7ba9f8a doc_id: 298125 cord_uid: p62rxpdx The number of available biological sequences has increased significantly in recent years due to various genomic sequencing projects, creating a huge volume of data. Consequently, new computational methods are needed to analyze and extract information from these sequences. Machine learning methods have shown broad applicability in computational biology and bioinformatics. The utilization of machine learning methods has helped to extract relevant information from various biological datasets. However, there are still several obstacles that motivate new algorithms and pipeline proposals, mainly involving feature extraction problems, in which extracting significant discriminatory information from a biological set is challenging. Considering this, our work proposes to study and analyze a feature extraction pipeline based on mathematical models (Numerical Mapping, Fourier, Entropy, and Complex Networks). As a case study, we analyze Long Non-Coding RNA sequences. Moreover, we divided this work into two studies, e.g., (I) we assessed our proposal with the most addressed problem in our review, e.g., lncRNA vs. mRNA; (II) we tested its generalization on different classification problems, e.g., circRNA vs. lncRNA. The experimental results demonstrated three main contributions: (1) An in-depth study of several mathematical models; (2) a new feature extraction pipeline and (3) its generalization and robustness for distinct biological sequence classification. tion about the lncRNAs function or evolutionary origin; nevertheless, it can be used to organize these broad categories [53] . Signal Processing (GSP), DNA Numerical Representation (DNR) [61, 26] , 172 and Complex Networks [66] . Nevertheless, the authors used these charac- characteristic descriptors. Therefore, we will not use the works shown for 178 comparison, but the most applied features. 179 In this section. we describe the methodological approach used to achieve 181 the proposed objectives, as shown in Figure 2 . Essentially, we divided our it has presented several works, mainly with ML, as explored in Section 2. However, we will also adopt other datasets to assess the generalization of 195 mathematical features. As preprocessing, we used only sequences longer 196 than 200nt [57] , and we also removed sequence redundancy. Moreover, the 197 sampling method was adopted in our dataset, since we are faced with the 198 imbalanced data problem [20] . Therefore, we applied random majority under-199 sampling, which consists of removing samples from the majority class (to 200 adjust the class distribution) [74] . Finally, we divided this paper into two 201 case studies. time domain data to frequency domain space [79] . According to Yin and 246 Yau [80] , the DFT of a signal with length N , x ∈ R N , at frequency k, can be defined by Equation (1): This method is has been widely studied in bioinformatics, mainly for 249 analysis of periodicities and repetitive elements in DNA sequences [81] and 250 protein structures [82] . This approach is shown in Figure 3 and was based 251 on [20] . To calculate DFT, we will use the Fast Fourier Transform (FFT), that 253 is a highly efficient procedure for computing the DFT of a time series [83] . 254 However, to use GSP techniques, a numeric representation should be used 255 for the transformation or mapping of genomic data. In the literature, dis-256 tinct DNR techniques have been developed [84] . According to Ruiz et al. [85] , these representations can be divided into three categories: 258 single-value mapping, multidimensional sequence mapping, and cumulative 259 sequence mapping. Thereby, we study 6 numerical mapping techniques (or 260 representations), which will be presented below: Voss [86] , Integer [85, 87] , Real [88] , Z-curve [89] , EIIP [90] and Complex Numbers [84, 91, 92] . (2) As a result, each row of matrix V may be seen as an array that marks each The power spectrum of a biological sequence can be obtained by Equation (4): This representation is one-dimensional [87, 85] . This mapping can be Equation (5). The DFT and power spectrum are presented in Equation (6). In this representation, Chakravarthy et al. [88] use real mapping based on 287 the complement property of the complex mapping of [81] . This mapping ap- (7) and Equation (8). The Z-curve scheme is a three-dimensional curve presented by [89] , to 295 encode DNA sequences with more biological semantics. Essentially, we can three (Z-curve) in a symmetrical way for all four components [93] . Therefore: Where the Z-curve consists of a series of nodes P 1 , P 2 , . . . , P N , whose The coordinates x[n], y[n], and z[n] represent three independent distri-306 butions that fully describe a sequence [84] . Therefore, we will have three dis- can be either 1 or −1 [89] . Therefore, we may define the following set of 314 equations in order to update the values of each dimension array considering Finally, the DFT and power spectrum may be defined as [94] : Nair and Sreenadhan [90] proposed EIIP values of nucleotides to repre- (17). This numerical mapping has the advantage of better translating some of 327 the nucleotides features into mathematical properties [92] and represents the , as shown in Equation (18) . The DFT 331 and power spectrum of this representation are presented in Equation (19) . semi-interquartile range, coefficient of variation, skewness, and kurtosis. According to [95] , the RNA has a statistical phenomenon known as period-3 340 behavior or 3-base periodicity, where the peak power will always be at the 341 sample N/3. The PAPR is defined as [96] : Information theory has been widely used in bioinformatics [97, 98] . Based 344 on this, we consider the study of [99] , which applied an algorithmic and 345 mathematical approach to DNA code analysis using entropy and phase plane. According to [98] , entropy is a measure of the uncertainty associated with a 347 probabilistic experiment. To generate a probabilistic experiment, we use a 348 known approach in bioinformatics, the k-mer (our pipeline - Figure 4 ). In this method, each sequence is mapped in the frequency of neighboring 350 bases k, generating statistical information. The k-mer is denoted by P k , 351 corresponding to Equation (21) . We applied this equation to each sequence with frequencies of k = 1, 2, Where k represents the analyzed k-mer, N the number of possible events 376 and p[n] the probability that event n occurs. Based on this, we consider the study of [66] , in which we propose a feature 382 extraction model based on complex networks, as shown in Figure 6 . Each sequence is mapped to the frequency of neighboring bases k (k = 3 -see Figure 5 ). This mapping is converted into an undirected graph rep-385 resented by an adjacency matrix, in which we applied a threshold scheme 386 for feature extraction, thus generating our characteristic vector. Fundamen-387 tally, we represent our structure by undirected weighted graphs. According 388 to [104] , a graph G = {V, E} is structured by a set V of vertices (or nodes) 389 connected by a set E of edges (or links). Each edge reflects a link between 390 two vertices, e.g., e p = (i, j) connection between the vertices i and j [104] . If there is an edge connecting the vertices i and j, the elements a ij are equal 392 to 1, and equal to 0 otherwise. In our case, the graph is undirected, that is, the adjacency matrix A 394 is symmetric, e.g., elements a ij = a ji for any i and j [104] . Furthermore, 395 we apply a threshold scheme presented by [66] , in which we extract weight ization procedure makes the ranges similar, reducing this problem [106] . We 407 used the min-max normalization, which reduces the data range to 0 and 1 408 (or -1 to 1, if there are negative values) [20] . The general formula is given as 409 (Equation (24)) [107] : Where x is the original value and x ij is its normalized version. Further-411 more, min(j) and max(j) are, respectively, the smallest and largest values of 412 a feature j [6, 107] . Next, we investigate three classification algorithms, such 413 as Random Forest (RF) [108] , AdaBoost [109] and CatBoost [110] . We chose 414 these ML algorithms because they induce interpretable predictive models 415 when humans can easily understand the internal decision-making process. Thus, domain experts can validate the knowledge used by the models for 417 the classification of new sequences [6] . Finally, to induce our models, we 418 used 70% of samples for training (with 10-fold cross-validation) and 30% for 419 testing, as shown in Table 2 . The methods were evaluated with four measures: Sensitivity (SE -Equa- communis, we randomly chose three datasets for evaluating the classifiers). Our initial goal is to choose the best classifier to follow in the testing phases. Thereby, to estimate the real accuracy, we applied 10-fold cross-validation, 439 as shown in Table 3 . Assessing each classifier, we noted that the best performance was of the 441 CatBoost with all mathematical models in A. trichopoda, followed by Ad-442 aBoost (6 best results) and RF (no better results). In A. thaliana, CatBoost 443 kept the best performance (7 best results), followed by RF (6 best results) 444 and AdaBoost (3 best results). In contrast, the RF classifier obtained the 445 best results (6) in R. communis, followed by CatBoost (5 best results) and 446 AdaBoost (3 best results). Based on this, we continued testing the models 447 with the CatBoost classifier. Thus, in Table 4 , we present the results of all 448 mathematical models using 4 evaluation metrics. Again, all showed robust results, in which, graph-based models are the best in 2 of the 3 problems analyzed, followed by entropy and GSP. In the 476 three datasets, our approaches have achieved relevant results with ACC, SE 477 and SPC, proving to be efficient and generalist, when exposed to different 478 problem scenarios. Furthermore, if we analyze at the last problem (circRNA 479 vs. lncRNA), our approaches were effective when compared to our references 480 that reached an ACC of 0.7780 [21] and 0.7890 [27] in their datasets against as well as p-values (see Table 5 ), using 95% of significance (α = 0.05). In addition, we also assessed the computational time cost of each tested 506 model. To do this, we ran three models, GSP (Fourier + complex numbers), 507 entropy (Tsallis) and graphs (complex networks)), in 1291 random sequences, 508 as shown in Figure 8 . This section discusses our findings in terms of whether they support our 517 hypothesis (feature extraction approaches based on mathematical models are 518 as efficient and generalist as biological approaches). Overall, several exper-519 imental tests were assumed in this research, in which all feature extraction 520 approaches based on mathematical models showed excellent results, as can 521 be seen in Table 4 and Figure 7 . Regarding its performance in distinct clas-522 sification problems, case study II, we used only three mathematical models 523 for generalization analysis, including GSP (Fourier + complex numbers), en-524 tropy (Tsallis) and graphs (complex networks). In which, entropy and graph-525 based models reported the best performance followed by GSP. Furthermore, 526 all models maintained robust results in different sequence classification prob-527 lems. Furthermore, to fully support our hypothesis, we also compare three 529 mathematical models shown in Figure 7 we fully support the suggested hypothesis. This work proposed to analyze feature extraction approaches for biolog-579 ical sequence classification. Specifically, we concentrated our work on the 580 study of feature extraction techniques using mathematical models. We ana-581 lyzed mathematical models to propose efficient and generalist techniques for 582 different problems. As a case study, we used lncRNA sequences. Moreover, 583 we divided this paper into two case studies. In our experiments, as a start-584 ing point, 9 mathematical models for feature extraction were analyzed: 6 585 numerical mapping techniques with Fourier transform; Tsallis and Shannon 586 entropy; Graphs (complex networks). Thereby, several biological sequence 587 classification problems were adopted to validate the proposed approach. In our experiments, all mathematical models presented relevant and ro-589 bust results, with performances (ACC) between 0.8901-0.9606 in case study I. In case study II, once more, all showed effective results with models based on 591 entropy and graphs showing the best performance, followed by GSP. Further-592 more, to validate our study, we compared three mathematical models against Evolution of k-mer frequencies and entropy in duplication and substitution mutation systems Deep learning in bioinformatics: Introduction, application, and perspective in the big data era Machine Learning Approaches to Biological Sequence and Phenotype Data Analysis Bioinformatic analysis and prediction of the function and regulatory network of long non-coding rnas in hepatocellular carcinoma Bioinformatics: an overview and its applications Selecting the most relevant features for the identification of long non-coding rnas in plants An introduction to deep learning on biological sequence data: examples and solutions pysster: classification of biological sequences by learning sequence and structure motifs with convolutional neural networks Deep learning in bioinformatics iLearn: an integrated platform and metalearner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data Machine learning workflows to estimate class probabilities for precision cancer diagnostics on dna methylation microarray data Machine learning for big data analytics in plants Puzzle of highly pathogenic human coronaviruses (2019-ncov) The 2019-new coronavirus epidemic: Evidence for virus evolution BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches A survey of modern questions and challenges in feature extraction Feature extraction in protein sequences classification: a new stability measure Feature extraction: foundations and applications lncrnanet: Long non-coding rna identification using deep learning Feature extraction of long non-coding rnas: A fourier and numerical mapping approach Predcircrna: computational classification of circular rna from other long non-coding rna using hybrid features PyFeat: a Python-based effective feature generation tool for DNA, RNA and protein sequences A review of computational methods for finding non-coding rna genes Evaluation of deep learning in non-coding rna classification Cpc2: a fast and accurate coding potential calculator based on sequence intrinsic features Lncfinder: an integrated platform for long non-coding rna identification utilizing sequence intrinsic composition, structural information and physicochemical property Discriminating cirrnas from other lncrnas using a hierarchical extreme learning machine (h-elm) algorithm with feature selection Unique features of long non-coding rna biogenesis and function Non-coding rna genes and the modern rna world Rna maps reveal new rna classes and a possible function for pervasive transcription Long noncoding rna: a crosslink in biological regulatory network A text feature-based approach for literature mining of lncrna-protein interactions Computational identification of human long intergenic noncoding rnas using a ga-svm algorithm A novel method for lncrna-disease association prediction based on an lncrna-disease association network The linear neighborhood propagation method for predicting long non-coding rna-protein interactions Bmncrnadb: a comprehensive database of non-coding rnas in the silkworm, bombyx mori Non-coding rnas: Epigenetic regulators of bone development and homeostasis Highly dynamic and sexspecific expression of micrornas during early es cell differentiation Unique signatures of long noncoding rna expression in response to virus infection and altered innate immune signaling Involvement of long noncoding rnas in diseases affecting the central nervous system The characteristic landscape of lncrnas classified by rbp-lncrna interactions across 10 cancers Long noncoding rnas in plants Characterization of stress-responsive lncrnas in arabidopsis thaliana by integrating expression, epigenetic and structural features Transposable elements (te s) contribute to stress-related long intergenic noncoding rna s in plants Genome-wide screening and functional analysis identify a large number of long noncoding rnas involved in the sexual reproduction of rice Roles, functions, and mechanisms of long noncoding rnas in cancer The gencode v7 catalog of human long noncoding rnas: analysis of their gene structure, evolution, and expression Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution On the classification of long non-coding rnas lncrnatargets: a platform for lncrna target prediction based on nucleic acid thermodynamics The steroid receptor rna activator is the first functional rna encoding a protein Long noncoding rnas: Novel insights into hepatocelluar carcinoma Long noncoding rnas: past, present, and future Cpc: assess the protein-coding potential of transcripts using sequence features and support vector machine Cpat: Coding-potential assessment tool using an alignment-free logistic regression model Utilizing sequence intrinsic composition to classify proteincoding and long non-coding transcripts Plek: a tool for predicting long non-coding rnas and messenger rnas based on an improved k-mer scheme lncrna-mfdl: identification of human long non-coding rnas by fusing multiple features and using deep learning Long noncoding rna identification using balanced random forests lncrscan-svm: a tool for predicting long non-coding rnas using support vector machine Lncrnapred: Classification of long non-coding rnas and protein-coding transcripts by the ensemble algorithm with a new hybrid feature Deeplnc, a long non-coding rna prediction tool using deep neural network Walter, Plantrna sniffer: a svm-based workflow to predict long intergenic non-coding rnas in plants Plncpro for prediction of long non-coding rnas (lncrnas) in plants and its application for discovery of abiotic stress-responsive lncrnas in rice and chickpea Pattern recognition analysis on long noncoding rnas: a tool for prediction in plants Basinet-biological sequences network: a case study on coding and non-coding rnas identification Prediction of plant lncrna by ensemble machine learning classifiers CNIT: a fast and accurate web tool for identifying protein-coding and long non-coding transcripts based on intrinsic sequence composition Plit: An alignment-free computational tool for identification of long non-coding rnas in plant transcriptomic datasets Predlncgfstack: A global sequence feature based on a stacked ensemble learning method for predicting lncrnas from transcripts Characterization and identification of long non-coding RNAs based on feature relationship DeepCPP: a deep neural network based on nucleotide bias information and minimum distribution similarity feature selection for RNA coding potential prediction Gapped blast and psi-blast: a new generation of protein database search programs The effect of oversampling and undersampling on classifying imbalanced text datasets Phytozome: a comparative platform for green plant genomics Greenc: a wiki-based database of plant lncrnas PlantNATsDB: a comprehensive database of plant natural antisense transcripts Plantcircbase: a database for plant circular rnas A measure of dna sequence similarity by fourier transform with applications on hierarchical clustering A fourier characteristic of coding sequences: origins and a non-fourier approximation Genomic signal processing Repetita: detection and discrimination of the periodicity of protein solenoid repeats by discrete fourier transform What is the fast fourier transform? Abd-Elrahman, Genomic analysis and classification of exon and intron sequences using dna numerical mapping techniques On dna numerical representations for genomic similarity computation Evolution of long-range fractal correlations and 1/f noise in dna base sequences Conversion of nucleotides sequences into genomic signals Autoregressive modeling and feature analysis of dna sequences Z curves, an intutive tool for visualizing and analyzing the dna sequences A coding measure scheme employing electron-ion interaction pseudopotential (eiip) Genomic signal processing Survey on encoding schemes for genomic data representation and feature learning-from signal processing to machine learning Snr of dna sequences mapped by general affine transformations of the indicator sequences A symmetrical theory of dna sequences and its applications Prediction of protein coding regions by the 3-base periodicity analysis of a dna sequence Peak-to-average power ratio Entropy and information within intrinsically disordered protein regions Information theory applications for biological sequence analysis rényie and tsallis entropy analysis of dna using phase plane Shannon entropy: a rigorous notion at the crossroads between probability, information theory, dynamical systems and statistical physics Image thresholding using tsallis entropy Inference of gene regulatory networks from time series by tsallis entropy Determining the entropic index q of tsallis entropy in images through redundancy Complex networks: the key to systems biology Complex networks: topology, dynamics and synchronization Investigations on impact of feature normalization techniques on classifier's performance in breast tumor classification Comparative study on normalization procedures for cluster analysis of gene expression datasets IEEE International Joint Conference on Random forests Multi-class adaboost Catboost: gradient boosting with categorical features support A coefficient of agreement for nominal scales The authors would like to thank UTFPR-CP, ICMC-USP, and CAPES 611 for the financial support given to this research.