Deep Learning in Mining Biological Data

Authors: Mufti Mahmud, M Shamim Kaiser, Amir Hussain. Date: 2020-02-28.

Recent technological advancements in data acquisition tools have allowed life scientists to acquire multimodal data from different biological application domains. Broadly categorized into three types (i.e., sequences, images, and signals), these data are huge in amount and complex in nature. Mining such an enormous amount of data for pattern recognition is a big challenge and requires sophisticated data-intensive machine learning techniques. Artificial neural network-based learning systems are well known for their pattern recognition capabilities, and lately their deep architectures, known as deep learning (DL), have been successfully applied to solve many complex pattern recognition problems. Highlighting the role of DL in recognizing patterns in biological data, this article provides: applications of DL to biological sequence, image, and signal data; an overview of open access sources of these data; a description of open source DL tools applicable to these data; and a comparison of these tools from qualitative and quantitative perspectives. At the end, it outlines some open research challenges in mining biological data and puts forward a number of possible future perspectives.

Understanding pathologies, diagnosing them early, and finding cures have driven life sciences research over the last two centuries [1]. This accelerated the development of cutting-edge tools and technologies that allow scientists to study biological systems holistically as well as to dig down, in unprecedented depth, to the molecular details of living organisms [2, 3].
Increasing technological sophistication has presented scientists with novel tools for DNA sequencing [4], gene expression [5], bioimaging [6], neuroimaging [7], and brain-machine interfaces [8]. These innovative approaches to studying living organisms produce huge amounts of data [9] and create a situation often referred to as 'Data Deluge' [10]. This biological big data is characterized by being hierarchical (i.e., data coming from different levels of a biological system, from molecules to cells to tissues to systems), heterogeneous (i.e., data acquired by different acquisition methods, from genetics to physiology to pathology to imaging), dynamic (i.e., data changing as a function of time), and complex (i.e., data describing nonlinear biological processes) [11]. These intrinsic characteristics pose an enormous challenge to data scientists, who must identify patterns in these data and analyze them to infer meaningful conclusions [12]. This triggered the development of rational, reliable, reusable, rigorous, and robust software tools [11] using machine learning (ML) based methods to facilitate recognition, classification, and prediction of patterns in biological big data [13].

Conventional ML techniques can be broadly categorized into two large sets: supervised and unsupervised. Methods in the supervised learning paradigm classify objects in a pool using a set of known annotations (alternatively called attributes or features), i.e., after learning from a few annotated data samples, the remaining data are classified using those annotations. In contrast, techniques in the unsupervised learning paradigm form groups (or clusters) among the objects in a pool by identifying their similarity, i.e., data annotations are first defined and then used for the data classification.
In addition, there is a special category called reinforcement learning, which allows a system to learn from the experience it gains by interacting with its environment; it is outside the scope of this work. Some of the popular supervised methods include: ANN and its variants, Support Vector Machines and other linear classifiers, Bayesian Statistics, k-Nearest Neighbors, Hidden Markov Model, and Decision Trees. On the other hand, a number of popular unsupervised methods are: Autoencoders, Expectation-Maximization, Information Bottleneck, Self-Organizing Maps, Association Rules, Hierarchical Clustering, k-Means, Fuzzy Clustering, and Density-based Clustering. Interested readers may refer to [14] [15] [16] for brief introductory reviews on many of the techniques mentioned above.

The literature abounds with reports of successful applications of the above-mentioned popular ML methods and their respective variants to biological data coming from various sources. For the sake of simplicity in this review, the biological data sources have been categorized into a few broad application domains: Omics (covers data from genetics and [gen/transcript/epigen/prote/metabol]omics [17]), Bioimaging (covers data from [sub-]cellular images acquired by diverse imaging techniques), Medical Imaging (covers data from [medical/clinical/health] imaging, mainly through diagnostic imaging techniques), and [Brain/Body]-Machine Interfaces or BMI (covers mostly electrical signals generated by the brain and the muscles and acquired using appropriate sensors). Each of these application domains (i.e., omics [18], bioimaging [19], BMI [20] [21] [22], medical imaging [23, 24]) has witnessed major contributions from the diverse ML methods mentioned above and their variants. In recent years, Deep Learning (DL), Reinforcement Learning (RL), and deep RL methods have been expected to reshape the future of ML (see the schematic diagram in Fig. 1 G) [25].
Despite their notable popularity and applicability to diverse disciplines, no comprehensive review in the literature focuses on their application to biological data. To fill this gap, this review provides: a brief overview of DL, RL, and deep RL concepts; the state-of-the-art applications of these techniques to biological data; and a comprehensive list of existing open source libraries and frameworks which can be utilized to harness the power of these techniques. Towards the end, some open issues are identified and some speculative future perspectives are outlined. Finally, working lists of available open access sources of datasets and databases from various application domains are supplied.

As for the organization of the rest of the article, section 1 provides a conceptual overview of the DL technique and introduces the reader to the underlying theory; section 4 presents brief descriptions of the popular open-source tools, software, and frameworks that implement DL techniques; section 6 provides a comparative study of the various tools' performances in implementing the defined DL architectures; section 7 presents some of the open issues and hints at future perspectives; and finally, the article is concluded in section 8.

In DL, data representations are learned with increasing levels of abstraction, i.e., at each level more abstract representations are learned by defining them in terms of less abstract representations at lower levels [26]. Through this hierarchical learning process a system can learn complex representations directly from the raw data [27]. Though many DL architectures have been proposed in the literature for various applications, the ones discussed below are most often used in mining biological data [28].

Figure 2. Architectures of autoencoder (a) and deep autoencoder (b).

Autoencoder is a data-driven unsupervised NN model mainly used for dimensionality reduction (see Fig. 2a).
It mainly projects high dimensional inputs to lower dimensional outputs. In other words, an input $x \in \mathbb{R}^{I}$ (from $I$ input units) is mapped to a hidden representation $a \in \mathbb{R}^{H}$ (from $H$ hidden units) using a nonlinear activation function $f(\cdot)$ with $a = f(W^{(1)} x + b^{(1)})$, where $W^{(1)} \in \mathbb{R}^{H \times I}$ is the encoding weight matrix and $b^{(1)} \in \mathbb{R}^{H}$ is the bias vector. The projected $a$ is then reconstructed by remapping it to an approximated value $y \in \mathbb{R}^{I}$ as $y = f(W^{(2)} a + b^{(2)}) \approx x$, where $W^{(2)} \in \mathbb{R}^{I \times H}$ is the decoding weight matrix and $b^{(2)} \in \mathbb{R}^{I}$ is the bias vector. Usually, autoencoders use equal numbers of input and output units with fewer hidden units. However, to represent complex relationships among data, more hidden units with sparsity criteria have also been used. In both cases, the (non)linear transformations incorporated in the hidden units mainly perform the compression [29]. In the learning process, the goal of an autoencoder is to minimize the reconstruction error between $x$ and $y$ for a given set of parameters $\Theta$. Thus, the objective function is given by:

$$J(\Theta) = E(X, Y) + \gamma \sum_{j=1}^{H} KL(\rho \,\|\, \rho_j)$$

where $\gamma$ is the sparsity parameter, $KL(\rho \,\|\, \rho_j) = \rho \log \frac{\rho}{\rho_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \rho_j}$ is the relative entropy measuring how the $j$th hidden unit's average activation ($\rho_j$) diverges from the target average activation ($\rho$), and $E(X, Y)$ is the reconstruction error computed over $N$ samples.

The Deep Autoencoder (DA) architecture, also known as 'Stacked Autoencoder' (see Fig. 2b), is obtained by stacking several autoencoders, where the activation values of one autoencoder's hidden units become the input to the next autoencoder, and backpropagation with a gradient-based algorithm is used to obtain the optimal weights. However, this suffers from the problem of poor local minima, which is overcome by pretraining the network with greedy layer-wise learning [30].
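The encoding and decoding maps and the sparsity-penalized objective described above can be illustrated with a minimal sketch in plain Python. Several choices here are assumptions made for illustration and are not taken from the article: the logistic sigmoid stands in for $f(\cdot)$, the reconstruction error is taken as the mean of half squared errors, and the function names, the target activation `rho = 0.05`, and the weight `gamma = 0.1` are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic sigmoid, used here as the assumed activation f."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def encode(x, W1, b1):
    """a = f(W1 x + b1): map I inputs to H hidden activations."""
    return [sigmoid(z) for z in affine(W1, x, b1)]


def decode(a, W2, b2):
    """y = f(W2 a + b2): map H hidden activations back to I outputs."""
    return [sigmoid(z) for z in affine(W2, a, b2)]


def kl_bernoulli(rho, rho_j):
    """Relative entropy between target activation rho and observed rho_j."""
    return rho * math.log(rho / rho_j) + (1 - rho) * math.log((1 - rho) / (1 - rho_j))


def sparse_autoencoder_loss(X, W1, b1, W2, b2, rho=0.05, gamma=0.1):
    """Reconstruction error plus gamma times the summed KL sparsity penalty."""
    hidden = [encode(x, W1, b1) for x in X]
    recon = [decode(a, W2, b2) for a in hidden]
    # Assumed error term: mean over samples of half the squared error.
    error = sum(
        0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y)) for x, y in zip(X, recon)
    ) / len(X)
    # Average activation of each hidden unit over the N samples.
    rho_hat = [sum(a[j] for a in hidden) / len(X) for j in range(len(b1))]
    penalty = sum(kl_bernoulli(rho, r) for r in rho_hat)
    return error + gamma * penalty


# Toy usage with I = 4 inputs, H = 2 hidden units, and N = 2 samples.
random.seed(0)
I, H = 4, 2
W1 = [[random.uniform(-0.5, 0.5) for _ in range(I)] for _ in range(H)]
b1 = [0.0] * H
W2 = [[random.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(I)]
b2 = [0.0] * I
X = [[0.1, 0.9, 0.2, 0.8], [0.7, 0.3, 0.6, 0.4]]
loss = sparse_autoencoder_loss(X, W1, b1, W2, b2)
```

In practice the parameters would be updated by a gradient-based optimizer to reduce this loss; the sketch only evaluates the objective for fixed weights.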
Despite the pretraining stage and the vanishing error problem [31], DA is a popular data-compressing DL architecture with quite a few variants, e.g., Denoising Autoencoder [32], Sparse Autoencoder [33], Variational Autoencoder [34], and Contractive Autoencoder [35].

Restricted Boltzmann Machine (RBM, Fig. 3a), also considered a nonlinear feature detector, is an undirected probabilistic generative model capable of representing specific probability distributions [36]. It contains one visible layer and one hidden layer with symmetric connections ($W \in \mathbb{R}^{V \times H}$) between them, with $a$ and $b$ as bias values for the visible and hidden layers, respectively. Generally, the visible layer contains $V$ units ($x \in \mathbb{R}^{V}$) for the input observations, and the hidden layer contains $H$ units ($y \in \mathbb{R}^{H}$) to model their relation with the observations. The symmetric connections make RBMs usable as autoencoders, and the joint probability of $(x, y)$ is given by [37]:

$$P(x, y; \Phi) = \frac{1}{Z(\Phi)} \exp\left(-E(x, y; \Phi)\right)$$

where $\Phi = \{W, a, b\}$, $Z(\Phi)$ is a partition function derived from all possible $(\forall x, \forall y)$ pairs, and $E(x, y; \Phi)$ is the energy function which, for the generic case of binary visible and hidden units, is described as:

$$E(x, y; \Phi) = -\sum_{i=1}^{V} \sum_{j=1}^{H} W_{ij} x_i y_j - \sum_{i=1}^{V} a_i x_i - \sum_{j=1}^{H} b_j y_j$$

Here the conditional probability distributions of the visible units given the hidden units and of the hidden units given the visible units are computed as $P(x_i = 1 \mid y; \Phi) = \sigma\left(a_i + \sum_{j=1}^{H} W_{ij} y_j\right)$ and $P(y_j = 1 \mid x; \Phi) = \sigma\left(b_j + \sum_{i=1}^{V} W_{ij} x_i\right)$ respectively, with $\sigma(\cdot)$ as the logistic sigmoid function. Now, as the hidden units of an RBM are unobservable, the objective function can be defined using the marginal distribution of the visible units only as:

$$P(x; \Phi) = \frac{1}{Z(\Phi)} \sum_{y} \exp\left(-E(x, y; \Phi)\right)$$

RBM parameters are trained by maximizing the log-likelihood of the observations through a contrastive divergence algorithm. The Gibbs sampling technique [38] is used to approximate the expected values of the distribution and to compute the gradient [37].

Stacking multiple RBMs as learning elements leads to a popular DL architecture known as Deep Belief Network (DBN, Fig. 3b), where one RBM's latent layer is connected to the subsequent RBM's visible layer. Therefore, a DBN contains one visible layer $x$ and $L$ hidden layers $y^{\ell}$, $\ell = 1 \ldots L$. With downward directed connections everywhere except between the top two layers, which are undirected, DBN is a hybrid model combining an undirected graphical model and a directed generative model [39]. The joint distribution of the visible units ($x$) and hidden layers ($y^{\ell}$, $\ell = 1 \ldots L$) is given by:

$$P(x, y^{1}, \ldots, y^{L}) = \left(\prod_{\ell=0}^{L-2} P(y^{\ell} \mid y^{\ell+1})\right) P(y^{L-1}, y^{L})$$

with $y^{0} = x$, and $P(y^{L-1}, y^{L})$ denoting the joint distribution between layers $L-1$ and $L$. Individual layers are pretrained in a greedy layerwise fashion using unsupervised learning, followed by generative fine tuning depending on the required outcome of the model [40]. Nonetheless, the training process remains computationally expensive.

Figure 4. Architecture of convolutional neural network.

CNN (Fig. 4) is a multilayer NN model, composed of convolutional layers (often interleaved with subsampling layers) followed by fully connected layers, that mimics the locally sensitive, orientation-selective neurons in the visual system [41]. CNN is designed to handle multidimensional locally correlated inputs, e.g., the 2D structure of an image or speech signal, and to avoid overfitting by sharing weights, which also makes it easier to train, with fewer parameters than a fully connected network with an equal number of hidden units. These properties facilitated the wide usage of CNN in problems with large numbers of hidden units and training parameters. A convolutional layer recognizes local patterns, in terms of features, from the input feature maps through learnable filter kernels $k_{ij}^{\ell}$. These convolution filters (CF) mainly represent connection weights between feature maps $i$ and $j$ belonging to the layers $\ell - 1$ and $\ell$, respectively.
The activations of a convolution layer's units ($A_j^{\ell}$) are computed by convolving activations of a vicinal subset of units from the preceding layer's feature maps ($A_i^{\ell-1}$) with the filter kernels ($k_{ij}^{\ell}$) as:

$$A_j^{\ell} = f\left(\sum_{i=1}^{N^{\ell-1}} A_i^{\ell-1} * k_{ij}^{\ell} + b_j^{\ell}\right)$$

where $N^{\ell-1}$ is the total number of feature maps in layer $\ell - 1$, $*$ is the convolution operator, $b_j^{\ell}$ is the bias at layer $\ell$, and $f(\cdot)$ is the nonlinear activation function [42]. A suitable pooling layer reduces the feature maps at every pooling step between the subsequent layers. These interspersed pooling layers thus reduce computation time and make CNN invariant to small spatial shifts. Also, because of the feature reduction at every applied step, only a limited number of features are eventually supplied to the fully connected network for classification. When a convolutional layer $\ell$ is followed by a pooling layer $\ell + 1$, a block of units in a feature map from layer $\ell$ is connected to a single unit of a feature map in layer $\ell + 1$. The associated sensitivity map $\delta_j^{\ell}$ for layer $\ell$ is calculated as:

$$\delta_j^{\ell} = k_j^{\ell+1} \left(f'(z_j^{\ell}) \circ \mathrm{up}(\delta_j^{\ell+1})\right)$$

where $f'(\cdot)$ is the activation function's derivative evaluated using the preactivations $z_j^{\ell}$ of convolutional layer $\ell$, $\circ$ denotes elementwise multiplication, and $\mathrm{up}(\cdot)$ is the upsampling operation. When a current layer (pooling or convolutional) is followed by a convolutional layer, it is important to identify the correspondence in the feature maps between the two layers, i.e., the mapping between the current layer's patch and the next layer's unit in the feature maps. The gradients for the kernel weights are calculated using the chain rule and, as the weights are shared across multiple connections, they are given by:

$$\frac{\partial E}{\partial k_{ij}^{\ell}} = \sum_{u,v} (\delta_j^{\ell})_{uv} \, (P_i^{\ell-1})_{uv}$$

where $E$ is the error being minimized and $(P_i^{\ell-1})_{uv}$ is the patch in the $i$th feature map ($A_i^{\ell-1}$) which is elementwise multiplied by the kernel ($k_{ij}^{\ell}$) during convolution to compute the element at $(u, v)$ in the output convolution feature map $A_j^{\ell}$ [37]. Nonetheless, for very large datasets, training even this kind of network can be daunting; sparsely connected networks can alleviate this.
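The forward pass of a convolutional layer followed by a pooling step, as described above, can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the implementation of any particular tool: it uses 'valid' convolution without padding or stride, `tanh` as a default activation, and max pooling over non-overlapping blocks; the function names and the layout `kernels[i][j]` (kernel linking input map i to output map j) are hypothetical.

```python
import math


def conv2d_valid(feature_map, kernel):
    """'Valid' 2D convolution of one feature map with one kernel.

    The kernel is flipped in both directions, which distinguishes true
    convolution from cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    flipped = [row[::-1] for row in kernel[::-1]]
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    return [
        [
            sum(
                feature_map[u + p][v + q] * flipped[p][q]
                for p in range(kh)
                for q in range(kw)
            )
            for v in range(out_w)
        ]
        for u in range(out_h)
    ]


def conv_layer(prev_maps, kernels, biases, f=math.tanh):
    """Compute A_j = f(sum_i A_i * k_ij + b_j) for every output map j."""
    outputs = []
    for j, b_j in enumerate(biases):
        per_input = [
            conv2d_valid(A_i, kernels[i][j]) for i, A_i in enumerate(prev_maps)
        ]
        rows, cols = len(per_input[0]), len(per_input[0][0])
        outputs.append(
            [
                [f(sum(m[u][v] for m in per_input) + b_j) for v in range(cols)]
                for u in range(rows)
            ]
        )
    return outputs


def max_pool(feature_map, size=2):
    """Reduce a feature map by taking the maximum of each size x size block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[u * size + p][v * size + q]
                for p in range(size)
                for q in range(size)
            )
            for v in range(cols)
        ]
        for u in range(rows)
    ]


# Toy usage: one 4 x 4 input map, one 2 x 2 kernel, one output map.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0], [0, 0]]
activations = conv_layer([image], [[kernel]], [0.0])
pooled = max_pool(image, size=2)
```

Because each kernel is reused at every spatial position, the layer has far fewer parameters than a fully connected layer over the same input, which is the weight-sharing property discussed above.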
Some of the popular CNN configurations include: AlexNet [43], VGGNet [44], and GoogLeNet [45].

Figure 5. Architecture of recurrent neural network.

RNN (Fig. 5) is a NN model that detects sequences in streams of data. It computes the current state's output ($h_t$) for a given input ($x_t$) depending on the outputs of the previous states (captured by $h_{t-1}$) [46]:

$$h_t = f(U x_t + W h_{t-1})$$

where $f(\cdot)$ is a nonlinear function (e.g., tanh, ReLU [47]), and $U$ and $W$ are shared weight matrices. In other words, RNN learns a distribution over classes for a sequence of inputs (e.g., $x_1, x_2, \ldots, x_T$). As for the classification, generally a softmax, following a few fully connected layers, is added for mapping the classes:

$$P(c \mid x_1, x_2, \ldots, x_T; \Phi) = \mathrm{softmax}(V h_T)$$

where $c$ denotes a class, $V$ is the output weight matrix, and $\Phi$ is the set of parameters shared across different states. Due to this 'memory'-like property, RNN gained popularity in many fields involving streaming data (e.g., text mining, time series, genomes, etc.). However, the gradients backpropagated from the output through time create learning problems similar to those of the conventional deep NN (e.g., vanishing and exploding gradients) [48]. In recent years, the development of specialized memory units allowed the expansion of classical RNN to useful variants including bidirectional RNN (BRNN) [49], long short-term memory (LSTM) [50], and gated recurrent units [51]. Though RNN's primary application remains with sequential data, it is also increasingly applied to other data, e.g., images [52].

Many studies have been reported in the literature which employ diverse DL architectures with related and varied parameter sets (see section 1) to analyze patterns in biological data. A summary of these studies which use open access data is reported in table 1. Stacked Denoising DA was employed by Danaee et al. to extract features for cancer diagnosis and classification, along with the identification of the related genes, from GE data [53].
A template-based DA learning model was proposed by Li et al. to reconstruct protein structures [54]. Lee et al. applied a DBN-based unsupervised method to perform the auto-prediction of splicing junctions at the DNA level [55]. Combining DBN with active learning, Ibrahim et al. devised a method to select feature groups from genes or microRNAs (miRNAs) based on expression profiles [56]. For translational research, bimodal DBNs were used by Chen et al. to predict responses of human cells using model organisms [57].

Jirayucharoensak et al. used PCA to extract power spectral densities from each EEG channel, which were then corrected by covariate shift adaptation; finally, a stacked DA was used to detect emotion [81]. DBN was applied to decode motor imagery (MoI) by classifying EEG signal frequency information [82]. For a similar purpose, CNN was used covering large frequency ranges with augmented common spatial pattern features [83]. In a rather different approach using DA, features based on combined selective location, time, and frequency attributes were classified [84]. Li et al. used DBN to extract low dimensional latent features and select critical channels to classify affective state using EEG signals [86]. Also, Jia et al. used active learning to train DBN and generative RBMs for the classification [87]. Tripathi et al. utilized a DNN- and CNN-based model for emotion classification using the DEAP dataset [88]. CNN was employed to predict seizures through the classification of synchronization patterns from the Freiburg dataset [89]. DBN [90] and CNN [91] were used to decode motion action from the Ninapro database. The latter approach was also used on the MIT-BIH, INCART, and SVDB repositories [91]. Moreover, ECG arrhythmias were classified using DBN [92, 93] from the data supplied by the MIT-BIH arrhythmia database.

Saccharomyces Genome Database (SGD) provides complete biological information for the budding yeast Saccharomyces cerevisiae.
It also provides an open source tool for searching and analyzing these data, thereby enabling the discovery of functional relationships between sequence and gene products in fungi and higher organisms. The main uses of the SGD are the study of genome expression, the transcriptome, and computational biology.

The PubChem database contains millions of compound structures and descriptive datasets of chemical molecules and their activities against biological assays. Maintained by the National Center for Biotechnology Information of the United States National Institutes of Health, it can be freely accessed through a web user interface and downloaded via FTP. It also contains software services (such as plotting and clustering). It can be used for [gen/prote]omics studies and drug design.

The Encyclopedia of DNA Elements (ENCODE) is a whole-genome database curated by the ENCODE Consortium, which is composed primarily of scientists funded by the US National Human Genome Research Institute. It contains human and mouse genome datasets (including metadata).

Molecular Biology Databases (MBD) at the UCI contains three molecular biology databases: i) Protein Secondary Structure [94], which is a bench repository that classifies the secondary structure of certain globular proteins; ii) Splice-Junction Gene Sequences [95], which contains primate splice-junction gene sequences (DNA) with associated imperfect domain theory; and iii) Promoter Gene Sequences [96], which contains E. coli promoter gene sequences (DNA) with partial domain theory. The objectives are: i) sequencing and predicting the secondary structure of certain proteins; ii) studying primate splice-junction gene sequences (DNA) with associated imperfect domain theory; and iii) studying E. coli promoter gene sequences (DNA) with partial domain theory.
The International Nucleotide Sequence Database Collaboration [97], popularly known as INSDC, consolidates biological data from three major sources: i) DNA Databank of Japan [98], ii) European Nucleotide Archive [99], and iii) GenBank [100]. These sources provide a spectrum of data from raw reads, through alignments and assemblies, to functional annotation, enriched with contextual information relating to samples and experimental configurations.

Nature Scientific Data (NSD) includes omics data; taxonomy and species diversity; mathematical and modelling resources; cytometry; organism-focused resources; and health science data. It can be used for studying and modelling different aspects of genomics.

The Small Molecule Pathway Database (SMPDB) includes 618 small molecule pathways found in humans. These data are used for drug design and for understanding gene, metabolite, and protein complex concentrations.

The Cancer Genome Atlas (TCGA) contains more than two petabytes of genomic data in the form of multi-dimensional maps of key genomic changes in 33 categories of cancer. These data are generated by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). This database is used to study genomic information for improving the prevention, diagnosis, and treatment of cancer.

Protein Data Bank (PDB) contains more than 135 thousand entries for proteins, nucleic acids, and complex assemblies. These can be used to understand all aspects of biomedicine and agriculture.

Gene Expression Model Selector (GEMS) includes microarray GeEx data. Cancer diagnosis and biomarker discovery are the two key objectives of this dataset.

Cancer Program Datasets (CPD) includes Nearest Template Prediction (NTP), Parallel sequencing, Subclass Mapping (PSSM), DNA Microarray, gene sequence, and different disease datasets.
Cancer gene expression (GeEx) contains different cancer datasets which can be employed for designing tools and algorithms for disease detection.

The iONMF dataset contains Yeast RPR and RNA binding protein datasets. These datasets are used for analyzing multiple RNA binding proteins.

The JASPAR database is a database of transcription factor DNA binding profiles.

SysGenSim includes a bioinformatics tool, and the Pula-Magdeburg single-gene knockout, StatSeq, and DREAM 5 benchmark datasets for studying gene sequences.

This dataset covers the genomes of eukaryotes containing at least 100 miRNAs. It is used for studying post-transcriptional gene regulation (PTGeR) and miRNA-related pathology.

The Indian Genetic Disease Database (IGDD) keeps track of mutations in the normal genes for genetic diseases reported in India. Retrieving and studying genetic disorders is the main objective of this database.

This resource consists of radiogenomics, genetic and chemical databases, cell and tissue phenotype databases, and bioimage processing tools. The targeted applications are the design of algorithms for feature extraction and anomaly detection.

This resource presents cell image datasets and the Cell Library app. The aim of this dataset is to study cell biology.

Berkeley Drosophila Transcription Network Project (BDTNP) contains 3D gene expression data, in-vivo DNA binding data, and Chromatin Accessibility data (ChAcD). Research on gene expression and anomaly detection are the key applications of this dataset.

This resource provides biological and biomedical imaging data. The analysis of image data in bioimaging is the prime objective of this dataset.

The Cell Centered Database (CCDB) provides an API for high resolution 2/3/4D data from electron microscopes and software tools to analyze the images.

JCB Data Viewer facilitates viewing, analysis, and sharing of multi-dimensional image data for analyzing cell biology.

The MITOS dataset contains breast cancer histological images (haematoxylin and eosin stained slides). The detection of mitosis and the evaluation of nuclear atypia are its key uses.
The Internet Brain Segmentation Repository (IBSR) gives segmentation results of MRI data. The development of segmentation methods is the main application of the IBSR.

The LONI Probabilistic Brain Atlas (LPBA40) contains maps of brain anatomic regions of 40 human volunteers. The maps were generated from a set of whole-head MRI, in each of which 56 brain structures were identified, most of them lying in the cortex. The study of skull-stripped MRI volumes, the classification of the native-space MRI, and probabilistic maps are the key uses of LPBA40.

The Attention Deficit Hyperactivity Disorder (ADHD) dataset includes 776 resting-state fMRI and anatomical datasets aggregated across 8 independent imaging sites. The phenotypic information includes: age, sex, diagnostic status, measured ADHD symptoms, intelligence quotient (IQ), and medication status. Imaging-based diagnostic classification is the main aim of the ADHD 200 dataset.

This resource contains MRI and EEG datasets to study brain regions and their functions.

The Open Access Series of Imaging Studies contains MRI datasets and an open source data management platform (XNAT) to study and analyze Alzheimer's Disease.

Neurosynth includes fMRI literature (with some datasets) and a synthesis platform to study brain structure, functions, and disease.

The fMRI dataset contains fMRI datasets which can be useful for studying brain tumour surgical planning.

The Autism Brain Imaging Data Exchange (ABIDE) includes autism brain imaging datasets for studying the autism spectrum.

The Open Neuroimaging dataset contains imaging modalities and brain disease data which can be used to study decision support systems for disease identification.

The Neuroimaging Informatics Tools and Resources Clearinghouse contains a range of imaging data from MRI to PET, SPECT, CT, MEG/EEG, and optical imaging for analyzing functional and structural neuroimages.
Alzheimer's Disease Neuroimaging Initiative (ADNI) includes diagnosis data of subjects with mild cognitive impairment (MCI) and early AD, and of elderly control subjects, for detecting and tracking Alzheimer's disease.

This resource provides neuroimaging data and toolkit software to identify normal, healthy subjects.

This is a web-based repository (with an API) for collecting and sharing statistical maps of the human brain to study human brain regions.

The Cancer Imaging Archive (TCIA) contains CT, MRI, and nuclear medicine (e.g., PET) images for clinical diagnostic, biomarker, and cross-disciplinary investigation.

The BCI Competition datasets include EEG datasets (such as cortical negativity or positivity, feedback test trials, self-paced key typing, P300 speller paradigm, motor/mental imagery data, continuous EEG, and EEG with eye movement), ECoG datasets (such as finger movement and motor/mental imagery data in EEG/ECoG), and an MEG dataset (such as wrist movement). These datasets can be used for signal processing and classification methods for BMI.

A Database for Emotion Analysis using Physiological Signals (DEAP) provides various datasets for analyzing human affective states. The EEG and sEMG of 32 volunteers were recorded while they watched music videos. These volunteers also rated the videos, and frontal face video was recorded for 22 volunteers with consent.

The NinaPro database includes the kinematic as well as the sEMG data of 27 subjects recorded while these subjects were moving their fingers, hands, and wrists. These data can be employed to study biorobotics.

This repository contains datasets of 2-lead ECG (m-HEALTH), ECG of heart-attack patients, arrhythmia, 64-electrode EEG, 2 mental states (Relax), EMG of the lower limb, and sEMG. Brain decoding and anomaly detection are the focused applications of this dataset.
This site contains neuroelectric and myoelectric databases (EEG, EHG, and ECG databases), waveform databases, multi-parameter databases, the CHB-MIT Scalp EEG Database, EOG datasets, EEG motor movement/imagery datasets, and ERP-based BCI recordings. The MIT-BIH Supraventricular Arrhythmia Database, the PhysioNet Normal Sinus Rhythm Database (NSRDB), and the PhysioNet Supraventricular Arrhythmia Database (SVDB) are also part of PhysioNet. Epileptic seizure onset detection and treatment, and the modelling and development of BMI instrumentation, are some of the targeted applications of this database.

This database contains more than 25 datasets, such as stimulated EEG datasets, ECoG-based BCI datasets, ERP-based BCI datasets, mental arithmetic and motor imagery datasets (extracted from EEG, EOG, fNIRS, and EMG), EEG/EOG-based neuroprosthetic control datasets, speller datasets, and so on. The modelling and design of BMI devices are the key applications of this database.

The MAHNOB-HCI dataset provides an ECG and EEG database for affect recognition and implicit tagging (stimulated by fragments of movies and pictures).

DECAF is a multimodal dataset for decoding user physiological responses to affective multimedia content. It contains magnetoencephalogram (MEG), horizontal electrooculogram (hEOG), ECG, trapezius muscle EMG, and near-infrared face video data to study physiological and mental states.

This dataset includes event-related potential (ERP), event-related desynchronization (ERD), epileptic seizure studies, and brain mapping (including fMRI data).

The TELE-ECG dataset includes 250 ECG records with annotated QRS and artifact masks. It also includes QRS and artifact detection algorithms to study QRS and artifact detection from the ECG signal.

This dataset includes raw EEG data, group-level covariates describing the age of the subjects, and channel locations describing all electrodes. It is a 128-channel EEG dataset which can be used to detect anomalies in the EEG signal.
The EEG database contains invasive EEG recordings of 21 intractable focal epilepsy patients.

This is a 128-channel EEG dataset of a single subject. This dataset can be used to study muscle potentials.

The open-source DL tools summarized in this section are well maintained, with a reasonable number of implemented algorithms. For the sake of brevity, the individual publication references of the tools are omitted, and interested readers may consult them at their respective websites from the provided URLs. Table 5 summarizes the main features and differences of the various tools. To measure the impact and acceptability of a tool in the community, we provide GitHub-based measures such as the numbers of Stars, Forks, and Contributors. These numbers are indicative of the popularity, maturity, and diffusion of a tool in the community.

Singa (https://singa.incubator.apache.org/) is a distributed DL platform written in C++, Java, and Python. Its flexible architecture allows synchronous, asynchronous, and hybrid training frameworks to run. It supports a wide range of DL architectures including CNN, RNN, RBM, and DBM.

Caffe (http://caffe.berkeleyvision.org/) is scalable, written in C++, and provides bindings for Python as well as Matlab. Dedicated to experimenting with, training, and deploying general purpose DL models, this framework allows switching between development and deployment platforms. Targeting computer vision applications, it is considered the fastest implementation of the CNN.

Chainer (http://chainer.org/) is a DL framework provided as a Python library. Besides the availability of popular optimization techniques and NN-related computations (e.g., convolution, loss, and activation functions), the dynamic creation of graphs makes Chainer powerful. It supports a wide range of DL architectures including CNN, RNN, and DA.

Deeplearning4j (DL4J, https://deeplearning4j.org/), written in Java with core libraries in C/C++, is a distributed framework for quick prototyping that targets mainly nonresearchers.
Compatible with JVM-supported languages (e.g., Scala/Clojure), it works on distributed processing frameworks (e.g., Hadoop and Spark). Through Keras (section 4.7) as a Python API, it allows importing existing DL models from other frameworks. It allows the creation of NN architectures by combining available shallow NN architectures.

The Python-based Keras (https://keras.io/) library is used on top of Theano or TensorFlow. Its models can be imported to DL4J (section 4.4). It was developed as a user-friendly tool enabling fast experimentation and easy, fast prototyping. Keras supports CNN, RNN, and DBN.

The Lasagne (http://lasagne.readthedocs.io) DL library is built on top of Theano. It allows multiple inputs, outputs, and auxiliary classifiers. It supports user-defined cost functions and provides many optimization functions. Lasagne supports CNN, RNN, and LSTM.

to mobile or even embedded devices (e.g., Raspberry Pi). Written in C++, it is memory efficient and supports Go, JavaScript, Julia, Matlab, Perl, Python, R, and Scala.

Neon (www.nervanasys.com/technology/neon/) is a DL framework written in Python. It provides implementations of various learning rules, along with functions for optimization and activation. Its supported DL architectures include CNN, RNN, LSTM, and DA.

PyTorch (http://pytorch.org/) provides Torch modules in Python. More than a wrapper, its deep integration allows exploiting the powerful features of Python. Inspired by Chainer, it allows dynamic network creation for variable workloads, and supports CNN, RNN, and LSTM.

TensorFlow (www.tensorflow.org), written in C++ and Python, is developed by Google and supports very-large-scale deep NN. Amended recently as 'TensorFlow Fold', its capability to dynamically create graphs made the architecture flexible, allowing deployment to a wide range of devices (e.g., multi-CPU/GPU desktop, server, mobile devices, etc.) without code rewriting.
It also contains a data visualization tool named TensorBoard and supports many DL architectures including CNN, RNN, LSTM, and RBM.

TF.Learn (www.tflearn.org) is a TensorFlow (section 4.13) based high-level Python API. It supports fast prototyping with modular NN layers and multiple optimizers, inputs, and outputs. Supported DL architectures include CNN, BRNN, and LSTM.

Theano (www.deeplearning.net/software/theano/) is a Python library that builds on core packages like NumPy and SymPy. It defines, optimizes, and evaluates mathematical expressions with tensors, and has served as the foundation for many DL libraries.

Veles (https://velesnet.ml/) is a Python-based distributed platform for rapid DL application development. It provides machine learning and data processing services and supports IPython notebooks. Developed by Samsung, one of its advantages is that it supports OpenCL for cross-platform parallel programming and allows execution across heterogeneous platforms (e.g., servers, PC, mobile, and embedded devices). The supported DL architectures include DA, CNN, RNN, LSTM, and RBM.

Fig. 6 (caption fragment): The effect of the community's participation on individual tools is shown by the bubble size, which is the product of the normalized numbers of GitHub forks and contributors (c). As for the interoperability among the DL tools (d), Keras allows model importing from Caffe, MCT (CNTK), Theano, and TensorFlow, and lets DL4J import its models. Regarding hardware-based scalability of the DL tools (e), most of the tools provide CPU and GPU support, whereas FPGA and ASIC can mainly execute pre-trained models.

To perform a relative comparison among the available open-source DL tools, we selected four assessment measures, detailed below: the trend in their usage, community participation in their development, interoperability among themselves, and their scalability (see Fig. 6).
To assess the popularity and trend of the various DL tools among DL consumers, we looked into two different sources of information on tool utilization. Firstly, we extracted globally generated search data from Google Trends for two years (July 2015 to June 2017) related to search terms consisting of [tool name] + Deep Learning. The data showed a progressive increase in searches about TensorFlow since its release, followed by Keras (see Fig. 6a). Secondly, we mined the content of around 2,000 papers submitted to arXiv's cs.[CV|CL|LG|AI|NE] and stat.ML categories during March 2017 for the presence of the tool names [154]. As seen in Fig. 6b, which shows a weighted percentage of each individual tool's mentions in the papers, the top six tools were TensorFlow, PyTorch, Caffe, Keras, Torch, and Theano.

The community-based development score for each tool discussed in Section 4 was calculated from the repository popularity parameters of GitHub (https://github.com/), i.e., stars, forks, and contributors. The bubble plot shown in Fig. 6c depicts community involvement in the development of the tools, indicating the year of initial stable release. Each bubble size in the figure, pertaining to a tool, represents the normalized combined effect of the forks and contributors of that tool. It is clearly seen that a very large part of the community effort is concentrated on TensorFlow, followed by Keras and Caffe.

In today's cross-platform development environments, an important measure of a tool's flexibility is its interoperability with other tools. In this respect, Keras is the most flexible one, whose high-level neural networks are capable of running on top of either TensorFlow or Theano. Alternatively, DL4J can import neural network models originally configured and trained using Keras, which provides abstraction layers on top of TensorFlow, Theano, Caffe, and CNTK backends (see Fig. 6d).
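The community score behind the bubble sizes in Fig. 6c can be illustrated with a minimal sketch. The text only states that fork and contributor counts are normalized and combined as a product, so dividing each count by the maximum across tools is an assumption made here. The function name `community_scores`, the tool names, and the counts are hypothetical placeholders, not GitHub figures.

```python
def community_scores(stats):
    """Return {tool: normalized forks * normalized contributors}.

    `stats` maps a tool name to a (forks, contributors) pair.
    Dividing by the maximum over all tools is an assumed normalization;
    the survey only says the counts are "normalized".
    """
    max_forks = max(forks for forks, _ in stats.values())
    max_contrib = max(contrib for _, contrib in stats.values())
    return {
        name: (forks / max_forks) * (contrib / max_contrib)
        for name, (forks, contrib) in stats.items()
    }


# Placeholder counts for illustration only; these are not real GitHub data.
example = {"tool_a": (1000, 80), "tool_b": (500, 100), "tool_c": (250, 25)}
scores = community_scores(example)

# Tools ordered from largest to smallest bubble.
ranking = sorted(scores, key=scores.get, reverse=True)
```

With this normalization every score lies between 0 and 1, and a tool reaches 1 only if it leads in both forks and contributors.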
Hardware-based scalability is an important feature of the individual tools (see Fig. 6e). Today's computing hardware is dominated by graphics processing units (GPUs) and central processing units (CPUs). But considering increased computing capacity and energy efficiency, the coming years are expected to witness an expanded role for other chipset types, including application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs). So far, DL has been predominantly used through software. The requirement for hardware acceleration, energy efficiency, and higher performance has driven the development of chipset-based DL systems.

The power of DL methods lies in their capability to recognize patterns for which they are trained. Despite the availability of several kinds of accelerating hardware (e.g., multicore [C/G]PUs), this training phase is very time consuming, cumbersome, and computationally challenging. Moreover, as each tool provides implementations of several DL architectures, often emphasizing separate components of them on different hardware platforms, selecting an appropriate tool for an application is getting increasingly difficult. Besides, different DL tools have different targets, e.g., Caffe targets applications, whereas Torch and Theano are more for DL research. To help scientists pick the right tool for their application, a handful of scientists benchmarked the performance of the popular tools with respect to their training times [155, 156]. Moreover, to the best of our knowledge, there exist two main efforts that publicly provide the benchmarking details of the various DL tools and frameworks [157, 158]. Summarizing those seminal works, below we provide the time required to complete the training process as a performance measure of four different DL architectures (i.e., FCN, CNN, RNN, and DA) among the popular tools (e.g., Caffe, CNTK, MXNet, Theano, TensorFlow, and Torch) on multicore [C/G]PU platforms.
Table 6 lists the experimental setups used in benchmarking the specified tools. Mainly three different setups, each with an Intel Xeon E5 CPU, were utilized during the process. Though the CPUs were similar, the GPU hardware differed: GeForce GTX Titan X, GTX 980, GTX 1080, Tesla K80, M40, and P100.

Stacked autoencoders or DA were benchmarked using experimental setup number 1 in Table 6. To estimate the performance of the various tools in implementing DA, three autoencoders (with 400, 200, and 100 hidden units, respectively) were stacked with tied weights and sigmoid activation functions. A two-step network training was performed on the MNIST dataset [159]. As reported in Fig. 7 (a, b), the performance of the various DL tools is evaluated using forward runtime and training time. The forward runtime refers to the time required to evaluate the information flow through the full network to produce the intended output for an input batch, dataset, and network. In contrast, the gradient computation time measures the time required to train the network with each tool. The results suggest that, regardless of the number of CPU threads used or the GPU, Theano and Torch outperform TensorFlow in both gradient and forward times (see Fig. 7 a, b).

Experimental setup number 2 (see Table 6) was used in benchmarking RNN. The adapted LSTM network [160] was designed with 10000 input and output units, two layers, and ∼13 million parameters. As the performance of RNN depends on the input length, an input length of 32 was used for the experiment. As the results indicate (see Fig. 7 c-f), MCT outperforms the other tools on the CPU and on all three GPU platforms. On CPUs, TensorFlow performs a little better than Torch (see Fig. 7 c). On GPUs, Torch is the slowest, with TensorFlow and MXNet performing similarly (see Fig. 7 d-f).
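To make the forward-runtime measure used in the DA benchmark above more concrete, the following is a minimal pure-Python sketch, not the benchmark code of the cited studies. It builds a stacked autoencoder with tied weights and sigmoid activations and times one forward pass over a batch. The function names, the random initialization, the omission of bias terms, and the use of `time.perf_counter` are all assumptions made for illustration. Gradient computation time would be measured the same way around a backward pass, which is not implemented here.

```python
import math
import random
import time


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(sizes, seed=0):
    """One matrix per encoder layer; matrix k has sizes[k+1] rows and sizes[k] columns."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(weights, x):
    """Encode x layer by layer, then decode with the transposed (tied) weights."""
    h = x
    for w in weights:  # encoder: h <- sigmoid(W h), biases omitted for brevity
        h = [sigmoid(sum(wi * hi for wi, hi in zip(row, h))) for row in w]
    for w in reversed(weights):  # decoder: h <- sigmoid(W^T h)
        h = [sigmoid(sum(wji * hj for wji, hj in zip(col, h))) for col in zip(*w)]
    return h


def forward_runtime(weights, batch, repeats=3):
    """Best wall-clock time in seconds for one forward pass over the whole batch."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for x in batch:
            forward(weights, x)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # 784 inputs (28 x 28 MNIST pixels) and hidden sizes 400, 200, 100 as in the text.
    net = init_weights([784, 400, 200, 100])
    batch = [[0.5] * 784 for _ in range(2)]  # placeholder inputs, not MNIST data
    print(f"forward runtime: {forward_runtime(net, batch, repeats=2):.3f} s")
```

Taking the minimum over several repeats is one common way to reduce timing noise. The cited benchmarks may aggregate their measurements differently.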
Since a large portion of pattern analysis is still done using CNN, we further focused on CNN and investigated how the leading tools performed and scaled in training different CNN networks on different GPU platforms. The time speedup of GPU over CPU is considered as a metric for this purpose. The individual values are calculated using the benchmark scripts of DeepMark [157] on experimental setup number 3 (see Table 6) for one training iteration per batch. The time needed to execute a training iteration per batch equals the time taken to complete a forward propagation operation followed by a backpropagation operation. Figure 8 summarizes the training time per iteration per batch for both CPU and GPUs (left y-axis), and the corresponding GPU speedup over CPU (right y-axis).

These findings for four different CNN network models (i.e., Alexnet [43], GoogLeNet [45], Overfeat [161], and VGG [44]) available in four tools (i.e., Caffe, TensorFlow, Theano, and Torch) [162] clearly suggest that the network training process is much faster on GPUs than on CPUs. Moreover, another important message is that not all GPUs are the same and not all tools scale up at the same rate. The time required to train a neural network strongly depends on which DL framework is being used. As for the hardware platform, the Tesla P100 accelerator provides the best speedup, with the Tesla M40 second and the Tesla K80 last among the three. On CPUs, TensorFlow achieves the lowest training time, indicating quicker training of the network. On GPUs, Caffe usually provides the best speedup over CPU, but TensorFlow and Torch train faster than Caffe. Though TensorFlow and Torch have similar performance (indicated by the height of the lines), Torch slightly outperforms TensorFlow in most of the networks. Finally, most of the tools outperform Theano.

The brain has the capability to recognize and understand patterns almost instantaneously.
For the last several decades, scientists have been trying to decode the biological mechanism of natural pattern recognition that takes place in the brain and to translate that principle into AI systems. The increasing knowledge about the brain's information processing policies enabled this analogy to be adopted and implemented in computing systems. Recent technological breakthroughs, seamless integration of diverse techniques, better understanding of learning systems, the decline in computing costs, and the expansion of computational power empowered computing systems to reach human-level intelligence in certain scenarios [25]. Nonetheless, many of these methods require improvement so that they no longer fall short in situations where they fail at present. In this line, we identify below shortcomings and bottlenecks of the popular methods, open research questions and challenges, and outline possible directions which require attention in the near future.

First of all, DL methods usually require large datasets. Though the computing cost is declining with increasing computational power and speed, it is not worthwhile to apply DL methods to small or moderately sized datasets. Besides, many of the DL methods perform continuous geometric transformations of one data manifold to another, under the assumption that there exist learnable transfer functions which can perform the mapping [163]. However, when the relationships among the data are causal or too complex to be learned by geometric transformations, the DL methods fail regardless of the size of the dataset [164]. Also, interpreting high-level outcomes of DL methods is difficult due to an inadequate in-depth understanding of DL theory, which causes many such models to be considered a 'black box' [165]. Moreover, like many other ML techniques, DL is also susceptible to misclassification [166] and over-classification [167].
Additionally, harnessing the full benefits offered by open access data repositories, in terms of data sharing and re-use, is often hampered by the lack of unified data reporting standards and the non-uniformity of reported information [168]. Data provenance, curation, and annotation of these biological big data are a huge challenge too [169]. Furthermore, except for a very few large enterprises, the power of distributed and parallel computation through cloud computing remains largely unexplored for DL techniques. Because DL techniques require retraining for different datasets, repeated training becomes a bottleneck in cloud computing environments. Also, in such distributed environments, data privacy and security concerns still prevail [170], and real-time processing capability for experimental data is underdeveloped [171].

To mitigate the shortcomings and address the open issues, the existing theoretical foundations of the DL methods need to be improved. DL models are required not only to describe specific data but also to generalize on the basis of experimental data, which is crucial for quantifying the performance of individual NN models [172]. These improvements should take place in several directions and address issues such as the quantitative assessment of an individual model's learning efficiency and associated computational complexity in relation to well-defined parameter tuning strategies, and the ability to generalize and topologically self-organize based on data-driven properties. Also, to facilitate intuitive and less cumbersome interpretation of the analysis results, novel tools for data visualization should be incorporated in the DL frameworks.

Recent developments in combined methods pertaining to deep reinforcement learning (deep RL) have been popularly applied to many application domains (for a review on deep RL, see [173]). However, deep RL methods have not yet been applied to biological pattern recognition problems.
For example, analyzing and aggregating dynamically changing patterns in biological data coming from multiple levels could help to remove data redundancy and discover novel biomarkers for disease detection and prevention. Also, novel deep RL methods are needed to reduce the large set of labeled training data currently required.

Renewed efforts are required for standardization, annotation, curation, and provenance of data and their sources, along with ensuring uniformity of information among the different repositories. Additionally, to keep up with the rapidly growing big data, powerful and secure computational infrastructures in terms of distributed, cloud, and parallel computing, tailored to such well-understood learning mechanisms, are badly needed. Lastly, there are many other popular DL tools (e.g., Keras, Chainer, Lasagne) and architectures (e.g., DBN) which need to be benchmarked to provide users with a more comprehensive list to choose from. Also, the currently available benchmarks are mostly performed on non-biological data, and how well they scale to biological data is not well established; thus, specialized benchmarking on biological data is needed.

The biological big data coming from different application domains are multimodal, multidimensional, and complex in nature. At present, a great deal of such big data is publicly available. The affordable access to these data came with the huge challenge of analyzing patterns in them, which requires sophisticated ML tools. As a result, many ML-based analytical tools have been developed and reported over the last decades, and this process has been facilitated greatly by the decrease in computational costs, the increase in computing power, and the availability of cheap storage. With the help of these learning techniques, machines have been trained to understand and decipher complex patterns and interactions of variables in biological data.
To facilitate a wider dissemination of DL techniques applied to biological big data and serve as a reference point, this article provides a comprehensive survey of the literature on the application of those techniques to biological data and of the relevant open access data repositories. It also lists existing open source tools and frameworks implementing various DL methods, and compares these tools for their popularity and performance. Finally, it concludes by pointing out some open issues and proposing some future perspectives.

Biology in the nineteenth century: problems of form, function, and transformation. Cambridge
A history of the life sciences
History of science. the revolution in the life sciences
Next-generation DNA sequencing
Sequencing technologies - the next generation
Bio-imaging: principles, techniques, and applications
Progress and challenges in probing the human brain
Brain-machine interfaces: From basic science to neuroprostheses and neurorehabilitation
Extracting biology from high-dimensional biological data
Computing: A vision for data science
Big Biological Data: Challenges and Opportunities
Biology: The big challenges of big data
Machine learning and its applications to biology
Neural Networks: A Review from a Statistical Perspective
Artificial neural networks: a tutorial
Machine learning: a review of classification and combining techniques
'Omic' technologies: genomics, transcriptomics, proteomics and metabolomics
Machine learning applications in genetics and genomics
Machine learning applications in cell image analysis
Machine-Learning-Based Coadaptive Calibration for Brain-Computer Interfaces
Feature Selection in Classification of Eye Movements Using Electrooculography for Activity Recognition
Processing and Analysis of Multichannel Extracellular Neuronal Signals: State-of-the-Art and Challenges
Introduction to machine learning for brain imaging
Human-level control through deep reinforcement learning
Learning deep architectures for ai
Deep Learning
Applications of Deep Learning and Reinforcement Learning to Biological Data
Autoencoders, unsupervised learning and deep architectures
Deep learning in medical image analysis
Learning long-term dependencies with gradient descent is difficult
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion
Efficient learning of sparse representations with an energy-based model
Auto-Encoding Variational Bayes
Contracting auto-encoders: Explicit invariance during feature extraction
Deep boltzmann machines
Deep Learning for Medical Image Analysis
Stochastic relaxation, gibbs distributions, and the bayesian restoration of images
A fast learning algorithm for deep belief nets
Deep Learning for Health Informatics
The handbook of brain theory and neural networks
Notes on Convolutional Neural Networks
ImageNet classification with deep convolutional neural networks
Very deep convolutional networks for large-scale image recognition
Going deeper with convolutions
Finding Structure in Time
On rectified linear units for speech processing
A Critical Review of Recurrent Neural Networks for Sequence Learning
Bidirectional recurrent neural networks
Long short-term memory
Learning phrase representations using RNN encoder-decoder for statistical machine translation
A survey on deep learning in medical image analysis
A deep learning approach for cancer detection and relevant gene identification
A template-based protein structure reconstruction method using da learning
Boosted categorical restricted boltzmann machine for computational prediction of splice junctions
Multi-level gene/mirna feature selection using deep belief nets and active learning
Trans-species learning of cellular signaling systems with bimodal deep belief networks
RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach
Deep modeling of gene expression regulation in erythropoiesis model
Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks
Convolutional neural network architectures for predicting dna-protein binding
Predicting effects of noncoding variants with deep learning-based sequence model
Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data
Protein secondary structure prediction using deep convolutional neural fields
Predicting the sequence specificities of dna-and rna-binding proteins by deep learning
deepMiRGene: Deep neural network based precursor microrna prediction
deepTarget: End-to-end learning framework for miRNA target prediction using deep recurrent neural networks
Mitosis detection in breast cancer histology images with deep neural networks
Deep neural nets segment neuronal membrane in electron microscopy images
Parallel multi-dimensional lstm, with application to fast biomedical volumetric image segmentation
Deep Learning Trends for Focal Brain Pathology Segmentation in MRI
Alzheimer's disease diagnostics by a deeply supervised adaptable 3d convolutional network
Hierarchical feature representation and multimodal fusion with deep learning for ad/mci diagnosis
A Robust Deep Model for Improved Classification of AD/MCI Patients
Deep MRI brain extraction: A 3D convolutional neural network for skull stripping
Efficient multi-scale 3d CNN with fully connected CRF for accurate brain lesion segmentation
Deep neural networks for fast segmentation of 3d medical images
Medical image deep learning with hospital pacs dataset
Classification on adhd with deep learning
Combining deep learning and level set for the automated segmentation of the left ventricle of the heart from cardiac cine mr
Eeg-based emotion recognition using deep learning network with principal component based covariate shift adaptation
A Deep Learning Scheme for Motor Imagery Classification based on Restricted Boltzmann Machines
On the use of convolutional neural networks and augmented csp features for multi-class motor imagery of eeg signals classification
A novel deep learning approach for classification of EEG motor imagery signals
Parallel convolutional-linear neural network for motor imagery classification
Affective state recognition from eeg with deep belief networks
A novel semi-supervised deep learning framework for affective state recognition on eeg signals
Using deep and convolutional neural networks for accurate emotion classification on deap dataset
Classification of patterns of EEG synchronization for seizure prediction
Classification of electrocardiogram signals with dbn
Deep learning with convolutional neural networks applied to electromyography data: A resource for the classification of movements for prosthetic hands
A novel method for classification of ecg arrhythmias using deep belief networks
A restricted boltzmann machine based two-lead electrocardiography classification
Uci molecular biology (uci mb) protein secondary structure data set
Uci molecular biology (uci mb) splice-junction gene sequences data set
Uci molecular biology (uci mb) promoter gene sequences data set
The international nucleotide sequence database collaboration
International nucleotide sequence database collaboration
The Cancer Genome Atlas Home Page
Cancer Program Legacy Publication Resources
iONMF: Integrative orthogonal non-negative matrix factorization
JASPAR 2018: An open-access database of transcription factor binding profiles
"The cell," now part of cell image library
NITRC: IBSR: Tool/Resource Info
Construction of a 3d probabilistic atlas of human cortical structures
Open fmri: A multi-subject, multi-modal human neuroimaging dataset
"Open access series of imaging studies (oasis)," (Accessed on: 17/12/2017)
Neuroimaging dataset of brain tumour patients
Autism brain imaging data exchange
Open neuroimaging datasets
Neuroimaging informatics tools and resources clearinghouse dataset
Alzheimer's disease neuroimaging initiative (adni datasets)
The cancer imaging archive
Bci competition datasets
Database for emotion analysis using physiological signals
Mahnob-hci
Meg-based multimodal database for decoding affective physiological responses
Tele ecg
Eeg single subject mismatch negativity (essmn)
Eeg dataset
Facial s-emg dataset
A Peek at Trends in Machine Learning
Comparative study of deep learning software frameworks
Benchmarking state-of-the-art deep learning software tools
The source code and experimental data of benchmarking state-of-the-art deep learning software tools
The MNIST database of handwritten digits
Recurrent neural network regularization
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
Deep learning benchmarks of NVIDIA Tesla P100 PCIe, Tesla K80, and Tesla M40 GPUs
The limitations of deep learning
An Algorithmic Information Calculus for Causal Discovery and Reprogramming Systems
Opening the Black Box of Deep Neural Networks via Information
Deep neural networks are easily fooled: High confidence predictions for unrecognizable images
Intriguing properties of neural networks
Standardizing data
Data management and data enrichment for systems biology projects
Service Oriented Architecture Based Web Application Model for Collaborative Biomedical Signal Analysis
A Web-Based Framework for Semi-Online Parallel Processing of Extracellular Neuronal Signals Recorded by Microelectrode Arrays
Challenges in deep learning
Deep Reinforcement Learning: A Brief Survey