key: cord-0293460-iwbx4pij
authors: Fan, Junyu; Chen, Chutao; Song, Chen; Pan, Jiajie; Wu, Guifu
title: A multi-class gene classifier for SARS-CoV-2 variants based on convolutional neural network
date: 2021-11-23
journal: bioRxiv
DOI: 10.1101/2021.11.22.469492
sha: c8dfbef7da653f2c42f987f44e3a68da45a78de8
doc_id: 293460
cord_uid: iwbx4pij

Surveillance of circulating variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is of great importance in controlling the coronavirus disease 2019 (COVID-19) pandemic. We propose an alignment-free in silico approach for classifying SARS-CoV-2 variants based on their genomic sequences. A deep learning model was constructed utilizing a stacked 1-D convolutional neural network and multilayer perceptron (MLP). The pre-processed genomic sequencing data of the four SARS-CoV-2 variants were first fed to three stacked convolution-pooling nets to extract local linkage patterns in the sequences. Then a 2-layer MLP was used to compute the correlations between the input and output. Finally, a logistic regression model transformed the output and returned the probability values. Learning curves and stratified 10-fold cross-validation showed that the proposed classifier enables robust variant classification. External validation of the classifier showed an accuracy of 0.9962, precision of 0.9963, recall of 0.9963 and F1 score of 0.9962, outperforming other machine learning methods, including logistic regression, K-nearest neighbor, support vector machine, and random forest. By comparing our model with an MLP model without the convolution-pooling network, we demonstrate the essential role of convolution in extracting viral variant features. Thus, our results indicate that the proposed convolution-based multi-class gene classifier is efficient for the variant classification of SARS-CoV-2.

Introduction

Coronavirus disease 2019 (COVID-19) has greatly impacted global public health and all aspects of social activities.
According to World Health Organization (WHO) reports, more than 246 million people were diagnosed with COVID-19 as of 31 October 2021, of whom nearly 5 million have died (WHO, 2021). The outbreak is caused by a highly transmissible coronavirus designated as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).

Like SARS-CoV and the Middle East respiratory syndrome coronavirus (MERS-CoV), SARS-CoV-2 is a positive-sense RNA virus (Zhou et al., 2020), with a higher mutation rate than double-stranded viruses (Peck and Lauring, 2018). Genetic changes may impact virus phenotypes, such as transmissibility, infectivity, pathogenicity and antigenicity (Harvey et al., 2021), giving the viral population a greater ability to adapt to various environmental conditions. For example, the spike protein substitution D614G enhances SARS-CoV-2 replication, affects susceptibility to antibody neutralization, and has become dominant during the pandemic (Plante et al., 2021 [...]

[...] encode the mix-bases symbol gene sequences.

Following common machine learning practice, the collected dataset was randomly split into three separate subsets in an 8:1:1 ratio for training, validation and testing, respectively.

We employed convolution networks to learn the underlying correlation and patterns in the sequence data (Al-Ajlan and El Allali, 2019). Convolutional neurons process data only within their receptive field, allowing them to capture the local patterns in the genomic sequences.

Because genomic sequence data are discrete, the convolution at the i-th position of the input I is defined as:

$$(I \otimes F)(i) = \sum_{j=-(k-1)/2}^{(k-1)/2} I(i+j)\,F(j) \tag{1}$$

where F is the 1-D convolutional filter with a filter size k (normally an odd number). Each convolutional layer comprises n convolutional filters, transforming the input by arranging the neurons in n dimensions. Each convolutional filter has a depth D, which is equal to the input depth.
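
To make the sliding-window operation concrete, the following minimal Python sketch applies one filter to a one-hot encoded sequence. It is an illustration under stated assumptions, not the authors' implementation: the four-letter alphabet, the function names `one_hot` and `conv1d`, the zero padding at the sequence ends and the ReLU activation are all choices made here. The sum is computed as a cross-correlation, which is what deep learning libraries usually call convolution.

```python
# Hypothetical one-hot encoding; the paper's own encoding of mixed-base
# symbols is not shown in this excerpt.
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a nucleotide string as a list of depth-4 vectors."""
    return [[1.0 if base == symbol else 0.0 for symbol in ALPHABET]
            for base in seq]


def conv1d(inputs, filt, bias=0.0):
    """Slide one filter of odd size k over `inputs` (length L, depth D).

    `filt` has shape k x D, so the filter depth equals the input depth.
    Positions outside the sequence are treated as zeros (assumed padding).
    Returns a feature map of length L after a ReLU activation.
    """
    k = len(filt)
    assert k % 2 == 1, "filter size is assumed to be odd"
    half = k // 2
    depth = len(inputs[0])
    feature_map = []
    for i in range(len(inputs)):
        total = bias
        for j in range(-half, half + 1):
            pos = i + j
            if 0 <= pos < len(inputs):
                for d in range(depth):
                    total += inputs[pos][d] * filt[j + half][d]
        feature_map.append(max(0.0, total))  # ReLU
    return feature_map


# A size-3 filter that responds to the local pattern "ACG".
acg_filter = [one_hot("A")[0], one_hot("C")[0], one_hot("G")[0]]
feature_map = conv1d(one_hot("TTACGTT"), acg_filter)
```

In this toy case `feature_map` is zero everywhere except at the centre of the `ACG` motif, where all three filter positions match and the response is 3.0. This is the sense in which a filter captures a local pattern within its receptive field.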

The m-th filter produces a feature map as in Eq (2):

$$O_m = f\left(I \otimes F_m + b_m\right) \tag{2}$$

where $f$ is a non-linear activation function (e.g. ReLU (Dahl et al., 2013), sigmoid (Langer, 2021), etc.), ⊗ represents the convolution in Eq (1) and $b_m$ is the bias.

A multilayer perceptron (MLP) is a fully connected feed-forward artificial neural network that maps the input to the output through hidden layers whose neurons apply non-linear activation functions. Typically, three types of activation functions are used: ReLU, sigmoid and tanh.

Since the output for variant classification is categorical, we use logistic regression as the output layer of our classifier to generate the results. For input feature $x$ and label $y$, the logistic regression model can be presented as shown in Eq (3) and Eq (4):

$$z = Wx + b \tag{3}$$

$$\hat{y}_c = P(y = c \mid x) = \frac{e^{z_c}}{\sum_{j} e^{z_j}} \tag{4}$$

where the weight matrix $W$ and the bias $b$ are trained to minimize the objective function.

The loss function represents the difference between the label $y$ and the predicted result $\hat{y}$. Different loss functions could be used in a machine learning model. The cross-entropy objective was used for our classifier:

$$L(y, \hat{y}) = -\sum_{c} y_c \log \hat{y}_c$$

where $y_c$ is 1 for the true class and 0 otherwise.

To build a predictive model for classifying virus variants, we constructed a deep learning model utilizing a stacked 1-D convolutional neural network and MLP. As shown in Figure 1 and in the pipeline (Figure 2), the proposed model comprises ten layers: three stacked convolution-pooling nets, a 2-layer MLP, one input layer and one output layer. The input layer takes the labeled gene sequence data. The output layer of the classifier is a logistic regression model that generates the probabilities of each type of variant. Stacked convolution-pooling nets are used as hidden layers, followed by MLP layers. The model was trained and optimized using a backpropagation algorithm with an Adam optimizer. We determined the hyperparameters using the grid search method.
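
The output layer and loss described above can be sketched in a few lines of plain Python. This is a hedged illustration rather than the authors' code: it assumes the multi-class form of logistic regression (a softmax over linear scores), and the helper names `softmax`, `output_layer` and `cross_entropy`, the toy weights and the two-feature input are all invented for the example.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer(features, weights, biases):
    """Multi-class logistic regression: softmax of the linear scores W x + b."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


def cross_entropy(one_hot_label, probs, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot_label, probs))


# Toy example: 2 features, 4 classes (one per variant); all numbers are made up.
W = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.0, 0.0]
probs = output_layer([1.0, 0.0], W, b)
loss_if_class_0 = cross_entropy([1, 0, 0, 0], probs)
loss_if_class_2 = cross_entropy([0, 0, 1, 0], probs)
```

With these toy numbers the highest probability falls on class 0, so the loss is smaller when the true label is class 0 than when it is class 2. Training adjusts the weights and biases to reduce this loss over the training data.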
We used cross-validation to select the optimal model and evaluated model performance on an independent dataset.

Because the dataset is balanced, macro-averaged precision, recall and F1 scores were used in this study. Each macro-average was calculated as the arithmetic mean of the corresponding per-class scores.

In the external validation, a confusion matrix, also called an error matrix, was used to visualize the predictive performance of the model on the testing data by showing the number of correct and incorrect predictions for each viral variant. t-distributed stochastic neighbor embedding (t-SNE) is a widely used statistical method for converting the distance of high-dimensional data into conditional probabilities representing similarities (Van der Maaten and Hinton, 2008). We used t-SNE to visualize the separability of different variants before and after training, with the perplexity set to 38. The predictive ability of this multi-class classifier on the testing data is also illustrated by receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC).

Results and discussion

This paper proposes an in silico approach for classifying SARS-CoV-2 variants based on their genomic sequences. We first used the learning curves of the CNN gene classifier to observe its training process. Next, the proposed classifier was evaluated using stratified 10-fold cross-validation. We then used external validation and visualization to test the performance of the classifier. Finally, we compared the performance of the proposed CNN classifier with that of other machine learning methods to further validate the ability of the classifier.

Stratified 10-fold cross-validation was applied to the combined training and validation data to obtain an unbiased evaluation over multiple train-test splits.
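
A stratified split of this kind can be sketched as follows. The sketch is only one simple way to build such folds and is not taken from the paper: samples of each class are shuffled and dealt round-robin into ten groups, so that every group keeps roughly the same class proportions, and each group is then held out once for validation. The function names, the fixed seed and the placeholder labels `"A"` to `"D"` are assumptions for illustration; scikit-learn's `StratifiedKFold` provides comparable functionality.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Return n_folds lists of sample indices with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)  # deal round-robin
    return folds


def train_validation_splits(folds):
    """Yield (train_indices, validation_indices), holding out each fold once."""
    for held_out in range(len(folds)):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


# Toy data: four placeholder variant labels, 50 samples each.
labels = [name for name in ("A", "B", "C", "D") for _ in range(50)]
folds = stratified_folds(labels)
splits = list(train_validation_splits(folds))
```

Each of the ten toy folds holds 20 samples, five per class, and every sample is used for validation exactly once across the ten splits.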
The joined dataset was shuffled and split into ten groups containing the same proportions of the four variants. Within each fold, nine groups were used to train the model, while the remaining group was held out for validation. The performance metrics obtained in each fold are listed in Table 1, with the best result for each evaluation indicator highlighted in bold.

The 10% holdout subset of the original dataset was then used for the final evaluation, to check that the CNN gene classifier generalizes well to new, unseen data. The accuracy, precision, recall and F1 score for each variant are shown in Table 2. The predictive performance of the proposed classifier differs little between variants. The overall accuracy for multi-class classification is 0.9962. Figure 4 shows the ROC curves of the classifier. They further indicate that the CNN classifier performed well for all the variants, as the area under the ROC curve (AUC) is approximately 1.00. Figure 5 shows the confusion matrix of the classifier, which visualizes the correct and incorrect classifications. It indicates that the proposed classifier has a low misclassification rate.

The performance of our CNN classifier was also visualized using t-SNE. The two-dimensional maps generated from the raw SARS-CoV-2 genome data showed strong overlap between variants, which did not appear to be easily separable (Figure 6A, 6C). After processing by the trained CNN classifier, the visible separation among variant clusters was greatly improved in both the training and testing data (Figure 6B, 6D). This result indicates that our proposed model can effectively extract features from the four viral variants for classification.

Figure 6. [...] (D) t-SNE distribution of the testing data after being processed by the classifier.
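
The per-variant and macro-averaged scores reported for the external validation can all be derived from a confusion matrix. The sketch below shows the arithmetic under the usual definitions of precision, recall and F1. The function names are hypothetical and the 4 x 4 matrix is made up for illustration; it does not reproduce the paper's results.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, F1) per class.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp
        fn = sum(confusion[c]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics


def macro_average(metrics):
    """Arithmetic mean of each score over the classes."""
    n = len(metrics)
    return tuple(sum(m[k] for m in metrics) / n for k in range(3))


def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(confusion[c][c] for c in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Made-up 4-class confusion matrix with 50 samples per class.
confusion = [
    [48, 2, 0, 0],
    [1, 49, 0, 0],
    [0, 0, 50, 0],
    [0, 0, 0, 50],
]
scores = per_class_metrics(confusion)
macro_precision, macro_recall, macro_f1 = macro_average(scores)
overall_accuracy = accuracy(confusion)
```

Because every class in this toy matrix has the same number of samples, the macro-averaged recall equals the overall accuracy, which is one reason macro-averaging is a natural choice for a balanced dataset.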

Finally, as shown in Table 3 [...]

This study reports an automated method based on CNN for SARS-CoV-2 variant classification with an accuracy of 0.9962. By visualizing the learning process and data separability, and by comparing the model with other machine learning methods, we demonstrated that the proposed CNN gene classifier could effectively extract features from the genomic data of the four collected SARS-CoV-2 variants and enable robust variant classification.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We would like to express our gratitude to all the frontline healthcare workers for their dedication to human health during the COVID-19 pandemic.

The datasets analyzed in this study are available at https://github.com/chotiu5/CNN.

References

TensorFlow: Large-scale machine learning on heterogeneous distributed systems
CNN-MGP: Convolutional Neural Networks for Metagenomics
Basic local alignment search tool
Deep learning for computational biology
RNA regulatory processes in RNA virus biology
Improving deep neural networks for LVCSR using rectified linear units and dropout
Establishment and lineage dynamics of the SARS-CoV-2 epidemic in the UK
COVID-19 vs influenza viruses: A cockroach optimized deep neural network classification approach
SARS-CoV-2 variants, spike mutations and immune escape
Approximating smooth functions by deep neural networks with sigmoid activation function
Deep learning
Automated medical diagnosis of COVID-19 through EfficientNet convolutional neural network
Rapid implementation of SARS-CoV-2 sequencing to investigate cases of health-care associated COVID-19: a prospective genomic surveillance study
Weekly epidemiological update on COVID-19 -2
The next phase of SARS-CoV-2 surveillance: real-time molecular epidemiology
Complexities of Viral Mutation Rates
Scikit-learn: Machine Learning in Python
Spike mutation D614G alters SARS-CoV-2 fitness
Continuous and Discontinuous RNA Synthesis in Coronaviruses
Visualizing data using t-SNE
DL-CNV: A deep learning method for identifying copy number variations based on next generation target sequencing
A pneumonia outbreak associated with a new coronavirus of probable bat origin
Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure