key: cord-0322710-qevo7d0q authors: Coutinho, Maria G. F.; Câmara, Gabriel B. M.; Barbosa, Raquel de M.; Fernandes, Marcelo A. C. title: Deep learning based on stacked sparse autoencoder applied to viral genome classification of SARS-CoV-2 virus date: 2021-10-15 journal: bioRxiv DOI: 10.1101/2021.10.14.464414 sha: 1932b53397cca796ca03755dfdb4f45a98578e8c doc_id: 322710 cord_uid: qevo7d0q

Since December 2019, the world has been intensely affected by the COVID-19 pandemic, caused by the SARS-CoV-2 virus, first identified in Wuhan, China. In the case of a novel virus identification, the early elucidation of the taxonomic classification and origin of the virus genomic sequence is essential for strategic planning, containment, and treatments. Deep learning techniques have been successfully used in many viral classification problems associated with viral infection diagnosis, metagenomics, phylogenetics, and analysis. This work proposes an efficient viral genome classifier for the SARS-CoV-2 virus using a deep neural network (DNN) based on the stacked sparse autoencoder (SSAE) technique. We performed four different experiments to provide different levels of taxonomic classification of the SARS-CoV-2 virus. Confusion matrices are presented for the validation and test sets, and the ROC curve is presented for the validation set. In all experiments, the SSAE technique achieved high classification performance. In this work, we explored the utilization of image representations of the complete genome sequences as the SSAE input to provide a viral classification of SARS-CoV-2. For that, a dataset based on the k-mers image representation, with k = 6, was applied. The results indicate the applicability of this deep learning technique to genome classification problems.

Since the emergence of the SARS-CoV-2 virus at the end of 2019, many works have been developed aiming to provide a better understanding of this novel virus.
In March 2020, the World Health Organization (WHO) raised the level of contamination of COVID-19 to a pandemic, due to its geographical spread across several countries. On July 9, 2021, the disease had registered more than 185 million confirmed cases and more than 4 million confirmed deaths. In the case of a novel virus identification, the early elucidation of the taxonomic classification and origin of the virus genomic sequence is essential for strategic planning, containment, and treatments of the disease [1] [2] [3].

One of the fields of research in the bioinformatics area is the analysis of genomic sequences. In recent years, many strategies based on alignment-free methods have been explored as an alternative to the alignment-based methods, considering the limitations of the latter approach. Alignment-based programs assume that homologous sequences comprise a series of linearly arranged and more or less conserved sequence stretches, which is not always the case in the real world [4].

Among the alignment-free methodologies, there are some models based on deep learning (DL) techniques that can provide significant performance in applications of genome analysis [5] [6] [7]. Deep neural networks (DNN) can improve prediction accuracy by discovering relevant features of high complexity [7].

Figure 1 presents the genome analysis stages and how deep learning integrates into this process. The genome analysis stages include the primary analysis, the secondary analysis, and the tertiary analysis. The primary and secondary analyses compose the genome sequencing. The primary analysis receives the biological sample and generates genomic data information, called "reads", after the processing by the sequencer machine. Then, the secondary analysis processes the reads and produces the complete genome sequence.
Lastly, the tertiary analysis provides the genome interpretation, which can be performed by many algorithms and techniques [8] [9] [10]. Deep learning techniques have been successfully used for the tertiary analysis in many viral classification problems associated with the diagnosis of viral infections, metagenomics, pharmacogenomics, and others [11] [12] [13] [14] [15].

[...] art, such as one-hot encoding [13, 16-18], number representation [11, 12], digital signal processing [19], and other strategies, including multiple mapping strategies applied sequentially [20, 21]. The processing stage consists of the utilization of a DNN to perform classification, prediction, and other inferences about the genome information.

The mapping stage is crucial for the performance of the processing stage. The genome sequence length varies with the type of virus. Since a DNN only receives a fixed-size input, some researchers have not used the whole sequence or long sequence lengths. Nevertheless, longer sequences contain more information and thus are more convenient for making predictions [17]. In this work, we will explore the utilization of the [...] as we show in the last column of Table 1. Table 2 shows the details about the input and the output of the DNN, besides the biology fields and the bioinformatics area.

In the work presented in [11] [...]

In [13], an approach was proposed to provide viral classification using contigs (fragments of the genome sequence) and two different reverse-complement (RC) neural network architectures: an RC-CNN and an RC-LSTM. These models were also applied to the SARS-CoV-2 virus.

In the works presented in [14] and [15], a taxonomic classification for metagenomics applications is proposed. Both works used segments of the genome (reads) as DL input (see Figure 1), and the output is the number of classes.
In [14], two DL models were proposed, one to classify species and another to classify genus. In [15], a hierarchical taxonomic classification for viral metagenomic data via DL, called CHEER, was proposed. Similar to the work proposed in [14], the CHEER framework classifies the order, family, and genus.

Proposals presented in [16], [17], and [23] used contigs as DL input for viral prediction and classification. In [16] and [17], a DL virus identification framework was proposed, and both try to recognize whether the input is a virus or not.

In the work from [16], called ViraMiner, an approach was proposed to detect the presence of viruses in raw metagenomic contigs from different human samples. They [...]

In the proposal presented in [17], called DeepVirFinder, the output is a score between 0 and 1 for a binary classification between virus and prokaryote. They fragmented the genomes into non-overlapping sequences of different sizes (150, 300, 500, 1000, and 3000 bp). The sequences were mapped to the network input using the one-hot encoding method. As the length of the input (i.e., the sequence fragment) increases, they achieve better performance, measured by the area under the receiver operating characteristic curve (AUROC). The maximum AUROC achieved was 0.98, for the 3000 bp fragments.

The work presented in [23] identifies metagenomic fragments as phages, chromosomes, or plasmids using the CNN technique. The experiments were performed using artificial contigs and real metagenomic data. The network output, provided by a softmax layer, consists of 3 scores that indicate the probability that each fragment belongs to a specific class.

The works from [22] and [18] present DL architectures for host prediction and classification. In [22], a CNN was used to provide host and infectivity prediction of the SARS-CoV-2 virus.
In [18], an approach was proposed to predict the viral host of three different virus species (influenza A virus, rabies lyssavirus, and rotavirus A) from the whole or only fractions of a given viral genome.

The works from [19], [24], [25], and [26] proposed methodologies to predict or classify specific regions in the genome sequence. [19] presented a methodology for the [...] The authors in [24] proposed a DL framework to identify similar patterns in DNA [...]

In [25], a method based on CNN and BLSTM was provided for exploring the RNA recognition patterns of the CCCTC-binding factor (CTCF) and identifying candidate lncRNAs binding. The experiments, conducted with two different datasets (human U2OS and mouse ESC), were able to predict CTCF-binding RNA sites from nucleotide sequences. Moreover, [26] proposes a computational prediction approach for DNA-protein binding based on CNN and BLSTM.

We intend to provide viral classification using the whole genome sequences, as presented in [11] and [12]. However, these works used the length of the longest genome sequence of the dataset as the input size of the DNN, so it was necessary to add padding for the missing entries. In this work, we will explore the utilization of the k-mers image representation of the complete genome sequences as the DNN input, which makes feasible the use of genome sequences of any length and enables the use of smaller network inputs. The k-mers representation was used in many works that provide genome sequence classification, as presented in [31], which explores the spectral sequence representation based on k-mers occurrences. However, that work does not explore the k-mers image representation.

We also explore the utilization of the stacked sparse autoencoder (SSAE) technique as an efficient viral genome classifier.
The SSAE has been successfully applied in many [...]

Each d-th sequence stored in the dataset is expressed by

$$s_d = [s_{d,1}, s_{d,2}, \ldots, s_{d,N_d}]$$

where $N_d$ is the length of the d-th sequence and $s_{d,n}$ is the n-th nucleotide of the sequence. Each $s_{d,n}$ is a symbol belonging to an alphabet of 4 possible symbols, expressed by the set {A, T, C, G} for DNA or by the set {A, U, C, G} for RNA, that is,

$$s_{d,n} \in \{A, T, C, G\} \quad \text{or} \quad s_{d,n} \in \{A, U, C, G\}.$$

In the k-mers representation, each d-th nucleotide sequence, $s_d$, is grouped into k-mers sub-sequences [37, 38], and the matrix $H_d$ stores the k-mers associated with each d-th sequence $s_d$. The k-mers representations are based on each d-th matrix $H_d$ and on the matrix $\Gamma$, called here the symbol matrix. The k-mers count 1D representation [...] for k = 2, ..., 6. The k-mers count 2D representation for each d-th sequence, $s_d$, is described by the matrix $\Lambda_d$. Finally, the k-mers image representation for each d-th sequence is the image $\Phi_d$, where $\phi_{d,i,j}$ represents each pixel associated with the d-th image $\Phi_d$. Each pixel, $\phi_{d,i,j}$, is expressed in terms of $\max\{\cdot\}$, the maximum value in the d-th matrix $\Lambda_d$; $\lfloor\cdot\rfloor$, the greatest integer less than or equal to its argument; and $b$, the number of bits associated with the image pixels. Figure 3 shows [...]

For all experiments, the network architecture used three hidden layers (K = 3), containing 3000 neurons in the first hidden layer, $Q_1$, 1000 in the second hidden layer, $Q_2$, and 500 in the third hidden layer, $Q_3$. As the SSAE input, k-mers images with k = 6 were used, generating images (matrix $\Phi$) with 64 × 64 pixels (based on Equation [...]).

[Figure: network diagram with an input layer, hidden layers 1 to K, and an output layer]

[...] where each u-th output, $o_u$, represents a specific virus in a taxonomic level classification and is defined by [...]

The SSAE was implemented in the Matlab platform (License 596681) [40], adopting the Deep Learning Toolbox. All networks were trained with the Scaled Conjugate Gradient (SCG) algorithm.
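The k-mers image construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (the SSAE itself was implemented in Matlab). The function names `kmer_counts` and `kmer_image` are hypothetical, and the use of overlapping windows, the lexicographic A, C, G, T ordering of the k-mers, the row-major placement of the 4^k counts on a 2^k × 2^k grid, and the scaling by 2^b − 1 before taking the floor are all assumptions made for the example.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed symbol order; U (RNA) is mapped to T


def kmer_counts(seq, k):
    """Count overlapping k-mers of seq (assumed windowing scheme).

    Returns a list of 4**k counts, ordered lexicographically over ALPHABET.
    Windows containing symbols outside ALPHABET (e.g. N) are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0] * (4 ** k)
    seq = seq.upper().replace("U", "T")
    for n in range(len(seq) - k + 1):
        i = index.get(seq[n:n + k])
        if i is not None:
            counts[i] += 1
    return counts


def kmer_image(seq, k=6, bits=8):
    """Map k-mer counts to a 2**k x 2**k image with `bits`-bit pixels.

    Assumed quantization: pixel = floor(count / max_count * (2**bits - 1)).
    """
    counts = kmer_counts(seq, k)
    side = 2 ** k                      # 4**k counts fill a side x side grid
    peak = max(counts)
    levels = 2 ** bits - 1
    # Integer floor division gives the exact floor of count * levels / peak.
    pixels = [c * levels // peak if peak else 0 for c in counts]
    # Row-major layout is an assumption made for this sketch.
    return [pixels[r * side:(r + 1) * side] for r in range(side)]
```

With k = 6 the grid has 64 × 64 entries whatever the length of the input sequence, which is what allows genomes of different lengths to share one fixed-size network input.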
The loss function used for training each AE was the mean squared error with L2 and sparsity regularizers, which can be expressed as

$$E = \frac{1}{I}\sum_{i=1}^{I}\sum_{u=1}^{U}\left(x_{u,i} - \hat{x}_{u,i}\right)^{2} + \lambda\,\Omega_{weights} + \beta\,\Omega_{sparsity}$$

where $x_{u,i}$ and $\hat{x}_{u,i}$ are the target and estimated values, $I$ is the number of training examples, $U$ is the number of classes, $\Omega_{weights}$ is the L2 regularization term, $\lambda$ is the coefficient for the L2 regularization term, $\Omega_{sparsity}$ is the sparsity regularization term, and $\beta$ is the coefficient for the sparsity regularization term. The loss function applied for the softmax layer was the cross-entropy. In this work, after the training of each layer, fine-tuning was performed, which retrained the whole stacked network in a supervised way in order to improve the classification results. The fine-tuning process also used the cross-entropy as the loss function, as in the softmax layer.

[...] for the validation set we also present the receiver operating characteristic (ROC) curve. The ROC curve measures the classification performance, that is, the true positive rate and the false positive rate of each class, at various threshold settings.

In Experiment 1, we intended to classify the viruses into 14 different classes, as presented in Table 5, which consists of 10 families (Adenoviridae, Anelloviridae, Cir- [...] which is important to make decisions for the next experiments.

The confusion matrix from the test set of Experiment 1 is presented in Figure 8. In [...]

The confusion matrix from the test set of Experiment 2 is presented in Figure 11. The SSAE achieved 100% classification accuracy, i.e., all SARS-CoV-2 sequences applied in this experiment were correctly classified as Coronaviridae family sequences.

The three AEs were trained for 400 epochs each, and the softmax layer was trained for 2000 epochs or until reaching the minimum gradient.

Regarding the test set of Experiment 3, the confusion matrix is presented in Figure 14. The test phase of Experiment 3 achieved 98.9% classification accuracy.
In the validation phase of Experiment 3, the Betacoronavirus genus did not reach the highest performance, which probably explains this result in the test phase.

Table 6 presents the results for some popular classification performance metrics obtained from the validation set. The first column of the table indicates the experiment. The second column shows the overall accuracy for each experiment. The precision, recall, F1-score, and specificity are presented in the other columns and were obtained by averaging the values obtained for each class. All the metrics presented in Table 6 indicate that the viral classifier proposed [...]

[...] The second column shows the overall accuracy for each experiment. The last column shows the recall, or true positive rate, which was obtained only for the class that corresponds to the SARS-CoV-2 samples. The other metrics (precision, F1-score, and specificity) are not presented because in the tests we do not have false positive samples.

[...] cross-validation scheme. Besides, we also intend to study data balancing alternatives, based on the analysis of the results presented here.
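The class-averaged metrics reported for the validation set can be computed from a confusion matrix as sketched below. This is an illustrative sketch only, not the authors' code. The function name `macro_metrics` is hypothetical, the matrix orientation (rows are true classes, columns are predicted classes) is an assumption, and ratios with a zero denominator are set to 0 here, which is one possible convention among several.

```python
def macro_metrics(cm):
    """Overall accuracy plus class-averaged precision, recall, F1, specificity.

    cm[t][p] is the number of samples of true class t predicted as class p
    (assumed orientation).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)

    def ratio(num, den):
        # Assumed convention: an undefined ratio (zero denominator) counts as 0.
        return num / den if den else 0.0

    precision = recall = f1 = specificity = 0.0
    for c in range(n):
        tp = cm[c][c]                                  # true positives
        fn = sum(cm[c]) - tp                           # false negatives
        fp = sum(cm[t][c] for t in range(n)) - tp      # false positives
        tn = total - tp - fn - fp                      # true negatives
        p = ratio(tp, tp + fp)
        r = ratio(tp, tp + fn)
        precision += p
        recall += r
        f1 += ratio(2 * p * r, p + r)
        specificity += ratio(tn, tn + fp)

    return {
        "accuracy": ratio(sum(cm[c][c] for c in range(n)), total),
        "precision": precision / n,
        "recall": recall / n,
        "f1": f1 / n,
        "specificity": specificity / n,
    }
```

For example, `macro_metrics([[8, 2], [1, 9]])` treats 8 and 9 as the correctly classified samples of the two classes. When a set contains samples of a single class only, as in the SARS-CoV-2 test sets, the recall of that class is the informative value, which is consistent with the text reporting only recall for the tests.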
[1] [...]
[2] Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins
[3] The proximal origin of SARS-CoV-2
[4] Alignment-free sequence comparison: benefits, applications, and tools
[5] A primer on deep learning in genomics
[6] Recent advances of deep learning in bioinformatics and computational biology
[7] Deep learning: new computational modelling techniques for genomics
[8] Sequencing technologies and genome sequencing
[9] A survey of tools for variant analysis of next-generation genome sequencing data
[10] Recent advances in inferring viral diversity from high-throughput sequencing data
[11] Viral Genome Deep Classifier
[12] Accurate identification of SARS-CoV-2 from viral genome sequences using deep learning
[13] Interpretable detection of novel human viruses from genome sequencing data
[14] DeepMicrobes: taxonomic classification for metagenomics with deep learning
[15] CHEER: hierarCHical taxonomic classification for viral mEtagEnomic data via deep leaRning
[16] Deep learning on raw DNA sequences for identifying viral genomes in human samples
[17] Identifying viruses from metagenomic data using deep learning
[18] Viral host prediction with Deep Learning
[19] Deep Learning for the Classification of Genomic Signals
[20] DNA sequence classification by convolutional neural network
[21] DeepACLSTM: deep asymmetric convolutional long short-term memory neural models for protein secondary structure prediction
[22] Host and infectivity prediction of Wuhan 2019 novel coronavirus using deep learning algorithm
[23] PPR-Meta: a tool for identifying phages and plasmids from metagenomic fragments using deep learning
[24] Deep6mA: a deep learning framework for exploring similar patterns in DNA N6-methyladenine sites across different species
[25] Identification and analysis of consensus RNA motifs binding to the genome regulator CTCF
[26] DeepSite: bidirectional LSTM and CNN models for predicting DNA-protein binding
[27] A machine learning approach for viral genome classification
[28] Identifying viruses from metagenomic data using deep learning
[29] Machine learning techniques for sequence-based prediction of viral-host interactions between SARS-CoV-2 and human proteins
[30] Machine Learning for detection of viral sequences in human metagenomic datasets
[31] A deep learning approach to DNA sequence classification
[32] Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images
[33] Application of stacked sparse autoencoder in automated detection of glaucoma in fundus images
[34] A semi-supervised deep learning method based on stacked sparse auto-encoder for cancer prediction using RNA-seq data
[35] Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
[36] k-mers 1D and 2D representation dataset of SARS-CoV-2 nucleotide sequences. Mendeley Data, 2020
[37] KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies
[38] Genomic DNA k-mer spectra: models and modalities
[39] Deep Learning