key: cord-0455571-py9h4k6f authors: Ali, Sarwan; Sahoo, Bikram; Ullah, Naimat; Zelikovskiy, Alexander; Patterson, Murray; Khan, Imdadullah title: A k-mer Based Approach for SARS-CoV-2 Variant Identification date: 2021-08-07 journal: nan DOI: nan sha: ff6cb2b13cf10b47f1d70ba2fc3ce293430a7729 doc_id: 455571 cord_uid: py9h4k6f With the rapid spread of the novel coronavirus (COVID-19) across the globe and its continuous mutation, it is of pivotal importance to design a system to identify different known (and unknown) variants of SARS-CoV-2. Identifying particular variants helps to understand and model their spread patterns, design effective mitigation strategies, and prevent future outbreaks. It also plays a crucial role in studying the efficacy of known vaccines against each variant and modeling the likelihood of breakthrough infections. It is well known that the spike protein contains most of the information/variation pertaining to coronavirus variants. In this paper, we use spike sequences to classify different variants of the coronavirus in humans. We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance. We also show that we can train our model to outperform the baseline algorithms using only a small number of training samples ($1%$ of the data). Finally, we show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC). The novel coronavirus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), that emerged in late 2019 belongs to the coronaviridae family. SARS-CoV-2 has an unparalleled capacity for human-to-human transmission and became the reason for the COVID-19 pandemic. Having witnessed two recent pandemics caused by coronaviridae, namely SARS-CoV in 2002 and MERS-CoV in 2012, there was an immediate research interest in studying the zoonotic origin, transmissibility, mutation and variants of SARS-COV-2 [22, 23] . SARS-CoV-2 has a positive-strand RNA genome of about 30kb and encodes two categories of proteins: structural and non-structural (see Figure 1 ). The spike protein is one of the substantial structural proteins of the virus, having 1160 to 1542 amino acids. The spike protein's primary function is to serve as a mechanism for the virus to enter inside the human cell by binding the ACE2 receptor. Detailed study of the structure of the spike glycoprotein unveils the molecular mechanism behind host cell recognition, attachment, and admission. Notably, the spike glycoprotein of SARS-CoV-2 has two subunits, S 1 and S 2 , belonging to the N and C terminals, respectively [16, 22] . The receptor binding domain (RBD) (≈ 220 amino acids) of the S 1 subunit helps the virus attach to the host cell by binding the ACE2 receptor protein, and the S 2 subunit helps to insert into the cell. SARS-CoV-2 continues to mutate over time, resulting in changes in its amino acid sequences. The change in the spike protein's amino acids, specifically in the RBD, makes the virus more transmissible and adaptable to the human immune system. In the language of phylogenetics and virology, the virus is creating new variants and strains by accruing new amino acid changes in the spike protein and its genome [16, 29, 37, 38] . The state-of-the-art mRNA vaccines train the host immune system to create specific antibodies that can bind to the spike protein, which leads to preventing the virus from entering inside the host cell. Therefore, changing amino acids in the spike protein generates new variants which could potentially be more contagious and more resistant to vaccines [20] . 5 3 Fig. 1 : The SARS-CoV-2 genome is around 29-30kb encoding structural and non-structural proteins. ORF1ab encodes the non-structural proteins and the four structural proteins: spike, envelope, membrane, and nucleocapsid encoded by their respective genes. The spike protein has 1160 to 1542 amino acids. In-depth studies of alterations in the spike protein to classify and predict amino acid changes in SARS-CoV-2 are crucial in understanding the immune invasion and host-to-host transmission properties of SARS-CoV-2 and its variants. Knowledge of mutations and variants will help identify transmission patterns of each variant that will help devise appropriate public health interventions to prevent rapid spread [2, 3, 35, 1] . This will also help in vaccine design and efficacy. A massive amount of genomic sequences of SARS-CoV-2 are available from different parts of the globe, with various clinical, epidemiological, and pathological information from the GISAID database 3 . In this study, we design sophisticated machine learning (ML) models which will leverage the available genomic data and metadata to understand, classify and predict the changes in amino acid in SARS-CoV-2, most notably in its spike protein [20, 23, 25] . When sequences have the same length and they are aligned, i.e., a one-to-one correspondence between positions or indices of the sequences is established, ML methods devised for vectors in metric spaces can be employed for sequence analysis. This approach treats sequences as vectors, considering each character (e.g., amino acid or nucleotide) as the value of the vector at a coordinate, e.g., using one-hot encoding [22] . In this case, the order of positions loses its significance, however. Since the order of indices in sequences is a defining factor, ignoring it may result in performance degradation of the underlying ML models. In representation based data analysis, on the other hand, each data object is first mapped to a vector in a fixed-dimensional vector space, taking into account important structure in the data (such as order). Vector space ML algorithms are then used on the vector representations of sequences. This approach has been highly successful in the analysis of data from various domains such as graphs [18, 17] , nodes in graphs [7] , and electricity consumption [5, 6] . This approach yields significant success in sequence analysis, since the feature representation takes into account the sequential nature of the data, such as texts [30, 32, 31] , electroencephalography and electromyography sequences [11, 36] , Networks [4] , and biological sequences [24, 15, 21, 9, 8, 10] . For biological sequences (DNA and protein), a feature vector based on counts of all length k substrings (called k-mers) occurring exactly or inexactly up to m mismatches (mimicking biological mutations) is proposed in [24] . The kernel value between two sequences -the dot product between the corresponding feature vectors -serves as a pairwise proximity measure and is the basis of kernel based ML. We provide the technical definition of feature maps and the computational aspects of kernel computation in Section 3. In this paper, our contributions are the following: 1. We propose a method based on k-mers (for feature vector generation) and kernel approximation (to compute pairwise similarity between spike sequences) that classify the variants with very high accuracy. 2. We show that spike sequences alone can be used to efficiently classify different COVID-19 variants. 3. We show that the proposed method outperforms the baseline in terms of accuracy, precision, recall, F1-measure, and ROC-AUC. 4. We show that with only 1% of the data for training, we can achieve high prediction accuracy. The rest of the paper is organised as follows: Section 2 contains the previous work related to our problem. Section 3 contains the proposed approach of this paper in detail, along with the description of the baseline model. Section 4 contains the information related to different datasets. We present our results in Section 5. Finally, we conclude the paper in Section 6. Phylogeny based inference of disease transmission [13] and sequence homology (shared ancestry) detection between a pair of proteins are important tasks in bioinformatics and biotechnology. Sequence classification is a widely studied problem in both of these domains [20] . Most sequence classification methods for viruses require the alignment of the input sequence to fixed length predefined vectors, which enables the machine learning algorithm to compare homologous feature vectors [15] . Pairwise local and global alignment similarity scores between sequences were used traditionally for sequence analysis. Alignment-based methods are computationally expensive, especially for long sequences, while heuristic methods require a number of ad-hoc settings such as the penalty of gaps and substitutions, and alignment methods may not perform well on highly divergent regions of the genome. To address these limitations, various alignment-free classification methods have been proposed [15] . The use of k-mer (substrings of length k) frequencies for phylogenetic applications started with [12] , which reported success in constructing accurate phylogenetic trees from several coding and non-coding nDNA sequences. Typically, k-mer frequency vectors are paired together with a distance function to measure the quantitative similarity score (kernel matrix) between any pair of sequences [15] . However, the basic bottleneck for these techniques is the quadratic (in the lengths of the sequences) runtime of kernel evaluation. With the global outbreak of Covid-19 in 2020, different mutations of its variants were discovered by the genomics community. Massive testing and largescale sequencing produced a huge amount of data, creating ample opportunity for the bioinformatics community. Researchers started exploring the evolution of SARS-CoV-2 [14, 27] to vaccine landscapes [33] and long-term effects of covid to patients [34] . In [23] , the authors indicate how the coronavirus spike protein is fine-tuned towards the temperature and protease conditions of the airways, to enhance virus transmission and pathology. After the global spread of the coronavirus, researchers started exploring ways to identify new variants and measuring vaccine effectiveness. In [28] , the authors study genome structures of SARS-CoV-2 and its previous versions, while [25] explores the genomics and proteomic variants of the SARS-CoV-2 spike glycoprotein. Given a set of spike sequences, our goal is to find the similarity score between each pair of sequences (kernel matrix). The resultant kernel matrix is given as input to the kernel PCA method for dimensionality reduction. The reduced dimensional principal components-based feature vector representation is given as input to the classical machine learning models. We discuss each step in detail below. For mapping protein sequences to fixed-length vectors, it is important to preserve order information of the amino acids within a sequence. To achieve this, we use substrings (called mers) of length k. For each spike sequence, the total number of k-mers are "N − k + 1", where N is the total number of amino acids in the spike sequence (1274), and k is a user-defined parameter for the size of each mer. An example of k-mers (where k = 4) is given in Figure 2 . However, since we do not know the total unique k-mers in all spike sequences, we need to consider all possible pairs of k-mers to design a general purpose feature vector representation for spike sequences in a given dataset. In this way, given an alphabet Σ (a finite set of symbols), we know that a spike sequence X ∈ Σ (X contains a list of "ordered" amino acids from Σ). Similarly, we can extract sub-strings (mers) from X of length k, which we called k-mers. Given X, k, and Σ, we have to design a frequency feature vector Φ k (X) of length |Σ| k , which will contain the exact number of occurrences of each possible k-mer in X. The distance between two sequences X and Y is then simply the hamming distance d H (count the number of mismatched values). After computing the feature vector, a kernel function is defined that measures the pairwise similarity between pairs of feature vectors (usually using dot product). The problem we consider so far is the huge dimensionality of the feature vector |Σ| k that can make kernel computation very expensive. Therefore, in the so-called kernel trick, kernel values are directly evaluated instead of comparing indices. The algorithm proposed in [15] takes the feature vectors (containing a count of each k-mers) as input and returns a real-valued similarity score between each pair of vectors. Given two feature vectors A and B, the kernel value for these vectors is simply the dot product of A and B. For example, given a k-mer, if the frequency of that k-mer in A is 2 and B is 3, its contribution towards the kernel value of A and B is simply 2 · 3. The process of kernel value computation is repeated for each pair of sequences and hence we get a (symmetric) matrix (kernel matrix) containing a similarity score between each pair of sequences. Note that k is a user-defined parameter -in our experiments we use k = 9. O(k 2 n log n) , where k is the length of k-mers and n is the length of sequences. Due to a high-dimensional kernel matrix, we use Kernel PCA (K-PCA) [19] to select a subset of principal components. These extracted principal components corresponding to each spike sequence act as the feature vector representations for the spike sequences (we selected 50 principal components for our experiments). Various ML algorithms have been utilized for the classification task. K-PCA output, which is 50 components fed to different classifiers for prediction purposes. We use Support Vector Machine (SVM), Naive Bayes (NB), Multi-Layer Perceptron (MLP), K-Nearest Neighbour (KNN) (with K = 5), Random Forest (RF), Logistic Regression (LR), and Decision Tree (DT) classifiers. All experiments are done on a Core i5 system with Windows 10 OS and 32 GB RAM. Implementation of our algorithm is done in Python. Our code and pre-processed datasets are available online 4 . The evaluation metrics that we are using are average accuracy, precision, recall, weighted and macro F1, and ROC area under the curve (AUC). We consider the approach of [22] as a baseline model. The authors of [22] convert spike sequences into one-hot encoding vectors that are used to classify the viral hosts. We have the 21 amino acids "ACDEFGHIKLMNPQRSTVWXY " (unique alphabets forming Σ). The length of each spike sequence is 1273 (with * at the 1274 th location). After converting sequences into one-hot encoding vectors we will get a 26, 733 dimensional vector (21 × 1273 = 26, 733). Principal Component Analysis (PCA) on these vectors is applied to reduce dimensionality for the underlying classifiers. For reference, we use the name "One-Hot" for this baseline approach in the rest of the paper. For PCA, we select 100 principal components. We sampled three subsets of spike sequences from the largest known database of human SARS-CoV-2, GISAID 5 . We refer to those 3 subsets as GISAID 1, GISAID 2, and GISAID 3, having 7000, 7000, and 3029 (aligned) spike sequences, respectively, each of length 1274 from 5 variants. For GISAID 1 and GISAID 2 datasets, we preserve the proportion of each variant as given in the original GISAID database. For the GISAID 3 dataset, we use a comparatively different proportion of variants to study the behavior of our algorithm (see Table 1 ). To visualize the local structure of spike sequences, we use t-distributed stochastic neighbor embedding (t-SNE) [26] that maps input sequences to 2D real vectors. The t-SNE helps to visualize (hidden) clusters in the data. The visualization results are shown in Figure 3 , revealing that variants are not well separated, making variant classification a challenging task. It is clear from Figure 3 that the dominant Alpha variant is not in a single cluster, and smaller variants are scattered around (e.g., the least frequent variant, B.1.427, appears in most clusters). In this section, we first report the performance of different classifiers using multiple performance metrics. Then we analyze the importance of the positions of each amino acid in the spike sequence using information gain. Results for the GISAID 1, 2 and 3 datasets are given in Tables 2-4 . We present results for each classifier separately for the baseline method and compare it with our proposed method. We can observe that for most of the classifiers, our proposed method is better than the baseline. For example, in the case of SVM classifier, the One-Hot method got 0.962 F1-Macro score for the GISAID 1 dataset while our proposed model got 0.973, which is a significant improvement considering that all values are on the higher side. Similar behavior is observed for other classifiers also. For all of these results, we use 1% data for training and 99% for testing purposes. Since we are getting such high accuracies, we can conclude that with a minimum amount of available data, we can train a classifier that can classify different variants very efficiently. Also, we can observe that the SVM classifier is consistently performing best for all the datasets. Note that results in Tables 2-4 are averaged over all variants. We also show the variant-wise performance of our best classifier (SVM). One-Hot approaches for GISAID 1. Clearly, the kernel-based approach performs better than the one-hot approach for most of the variants. To evaluate each individual position's importance, we measure the Information Gain (IG) for each position with respect to the variant defined as IG(Class, position) = Table 5 : Confusion matrices for SVM classifiers using Kernel Approx. approach (left) and using One-Hot approach (right) for the GISAID 1 dataset. , where H = i∈Class −p i log p i is the entropy, and p i is the probability of the class i. Figure 4 depicts how informative a position is to determine variants (higher value is better). We observe that positions such as 452, 570 and 681 are more informative across all datasets. The USA's CDC also declared mutations at these positions from one variant to the other, which validates our feature selection algorithm. For instance, R452L is present in B.1.427(Epsilon) and B.1.617 (Kappa, Delta) lineages and sub-lineages. The combination of K417N, E484K, and N501Y substitutions are present in B.1.351 (Beta). Similarly, K417T, E484K, and N501Y substitutions are present in P.1(Gamma) 6 (they can be seen having higher IG in Figure 4 ). We propose an approach to efficiently classify SARS-CoV-2 variants using spike sequences. Results show that the k-mer based sequence representation outperforms the typical one-hot encoding approach since it preserves order information of amino acids. We showed the importance of specific amino acids and demonstrate that it agrees with the CDC definitions of variants. In the future, we will work towards detecting new (unknown) variants based on whole genome sequences. Another exciting future work is considering other attributes like countries, cities, and dates to design richer feature vector representations for spike sequences. Combinatorial trace method for network immunization Who should receive the vaccine? In: Australasian Data Mining Conference (AusDM) Spectral methods for immunization of large networks Detecting DDoS attack on SDN due to vulnerabilities in OpenFlow Short term load forecasting using smart meter data Shortterm load forecasting using AMI data Predicting attributes of nodes using network structure Simpler and faster development of tumor phylogeny pipelines Effective and scalable clustering of SARS-CoV-2 sequences Spike2vec: An efficient and scalable embedding approach for covid-19 spike sequences Electromyography data for non-invasive naturallycontrolled robotic hand prostheses A measure of the similarity of sets of sequences not requiring sequence alignment Tnet: Phylogeny-based inference of disease transmission networks using within-host strain diversity Targeted self supervision for classification on a small Covid-19 CT scan dataset Efficient approximation algorithms for strings kernel based sequence classification Emergence of SARS-CoV-2 b. 1.1. 7 lineage Computing graph descriptors on edge streams Estimating descriptors for large graphs Kernel PCA for novelty detection Predicting vaccine hesitancy and vaccine sentiment using topic modeling and evolutionary optimization Generalized similarity kernels for efficient sequence classification Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone The SARS-Cov-2 and other human coronavirus spike proteins are fine-tuned towards temperature and proteases of the human airways Mismatch string kernels for svm protein classification Exploring the genomic and proteomic variations of SARS-CoV-2 spike glycoprotein: A computational biology approach. Infection Visualizing data using t-SNE Clustering based identification of SARS-CoV-2 subtypes Genotype and phenotype of covid-19: Their roles in pathogenesis Phylogenetic relationship of SARS-Cov-2 sequences from amazonas with emerging brazilian variants harboring mutations e484k and n501y in the spike protein Language independent sentiment analysis A multi-cascaded deep model for bilingual sms classification A multi-cascaded model with data augmentation for enhanced paraphrase detection in short texts Learning from the past: development of safe and effective COVID-19 vaccines Critical illness myopathy as a consequence of Covid-19 infection Scalable approximation algorithm for network immunization Effect of analysis window and feature selection on classification of hand movements using EMG signal Neutralization potential of covishield vaccinated individuals sera against b. 1.617. 1 Emergence of a novel SARS-Cov-2 variant in Southern California