key: cord-0910356-sfuat5vo authors: Lebatteux, Dylan; Soudeyns, Hugo; Boucoiran, Isabelle; Gantt, Soren; Diallo, Abdoulaye Baniré title: Machine learning-based approach KEVOLVE efficiently identifies SARS-CoV-2 variant-specific genomic signatures date: 2022-02-07 journal: bioRxiv DOI: 10.1101/2022.02.07.479343 sha: 91b866c0f66fe8ac7dc3911cbe714f85576b6201 doc_id: 910356 cord_uid: sfuat5vo
Machine learning has proven to be a powerful tool for the identification of distinctive genomic signatures among viral sequences. Such signatures are motifs present in the viral genome that differentiate species or variants. In the context of SARS-CoV-2, the identification of such signatures can contribute to taxonomic and phylogenetic studies, help in recognizing and defining distinct emerging variants, and focus the characterization of functional properties of polymorphic gene products. Here, we study KEVOLVE, an approach based on a genetic algorithm with a machine learning kernel, to identify several genomic signatures based on minimal sets of k-mers. In a comparative study, in which we analyzed a large SARS-CoV-2 genome dataset, KEVOLVE performed better in identifying variant-discriminative signatures than several gold-standard reference statistical tools. Subsequently, these signatures were characterized to highlight potential biological functions. The majority were associated with known mutations among the different variants, with respect to functional and pathological impact based on available literature. Notably, we found evidence of new motifs, specifically in the Omicron variant, some of which include silent mutations, indicating potentially novel, variant-specific virulence determinants. The source code of the method and additional resources are available at: https://github.com/bioinfoUQAM/KEVOLVE.
Author summary
Advances in cloning and sequencing technologies have yielded a vast repository of viral genomic sequence data.
To analyze this complex and massive data, machine learning, which refers to the development and application of computer algorithms that improve with experience, has proven to be efficient. Although many methods have been developed to classify viruses into different characteristic groups, it is often difficult to explain the predictions of these methods. To overcome this, we are working in our laboratory on the design of machine learning-based methods for the identification of discriminative signatures within viral genomic sequences. These signatures, which are motifs specific to groups of viruses and pervasive in their genomes, are used to 1) build accurate and explainable prediction tools for pathogens and 2) highlight mutations potentially associated with functional changes. In this paper, we present the potential of our latest approach, KEVOLVE. We first compare it to three discriminating motif identification tools with data sets covering several SARS-CoV-2 variant genomes. We then focus on the motifs identified by KEVOLVE to analyze the mutations associated with the different variants and the potential changes in biological functions that they may involve.
[...] single-stranded RNA of 29,903 nucleotides (Fig 1 from [1]). Its sequence identity with SARS-CoV and MERS-CoV is 79.5% and 50% at the nucleotide sequence level, respectively [2, 3]. The SARS-CoV-2 genome contains 11 genes encoding 15 Open Reading Frames (ORF), which result in between 29 and 33 viral protein products [1]. SARS-CoV-2 is associated with a very high mutation rate ranging from 5.2 to 8.1 × 10⁻³ substitutions/site/year [4, 5], higher than human immunodeficiency virus (HIV), which has a mutation rate of 3 to 8 × 10⁻³ substitutions/site/year [6].
Many of these mutations, principally in the spike gene, are associated with increased SARS-CoV-2 transmission rates [7], and the development of new variants associated with reduced efficacy of current COVID-19 vaccines and antibody-based treatments [8].
Given this rapid rate of evolution, it is important to be able to efficiently identify genomic signatures that discriminate between the different variants of SARS-CoV-2 and highlight potential functional changes. These signatures are defined as species- or variant-specific motifs that are pervasive throughout the viral genome [9]. In the context of SARS-CoV-2, the identification of this type of signature can contribute to taxonomic [10] and phylogenetic [11] studies to differentiate distinct groups of variants, provide an explanation for their evolutionary history [9], and facilitate mechanistic studies to elucidate the functional basis of variant-specific differences in virulence [12].
To identify discriminating motifs that constitute genomic signatures among different groups of biological sequences, the traditional approach is to first compute a multiple sequence alignment [13] with tools such as MUSCLE [14], Clustal W/X [15] or MAFFT [16]. These alignments can then be analyzed to identify the divergent genomic regions that constitute the discriminating motifs. However, the use of multiple alignment approaches has significant limitations, particularly when applied to viral genomes [12].
First, alignment-based approaches are generally computationally and time-intensive and are therefore less well suited to dealing with the very large viral sequence datasets that are increasingly available [17]. Indeed, computing an accurate multi-sequence alignment is an NP-hard problem with $(2N)!/(N!)^2$ possible alignments for two sequences of length N [18], which means that in some cases, the alignment cannot be solved within a realistic time frame [19]. Even with dynamic programming, the time requirement is on the order of the product of the lengths of the input sequences [20].
Second, alignment algorithms assume that homologous sequences consist of a series of more or less conserved, linearly arranged sequence segments. However, this assumption, named collinearity, is often questionable, especially for RNA viruses [19]. This is because RNA viruses show extensive genetic variation due to high mutation rates, as well as high frequencies of genetic recombination, horizontal gene transfer, and gene duplication, leading to the gain or the loss of genetic material [21].
Finally, performing multiple alignments often requires adjusting several parameters (e.g., substitution matrices, deviation penalties, thresholds for statistical parameters) that are dependent on prior knowledge about the evolution of the compared sequences [19]. The adjustment of these parameters is therefore sometimes arbitrary and requires a trial-and-error approach. Many experiments have shown that minor variations in these parameters can significantly affect the quality of alignments [22].
To overcome the limitations of discriminative motif identification among different groups of biological sequences using multiple sequence alignment, specialized statistical-based tools have been developed. The most popular of these methods is MEME [23, 24], which is dedicated to motif identification. MEME has a discriminative mode [25] that considers two sets of sequences and identifies the enriched motifs that discriminate the first set (primary) from the second (control). A suite of other MEME tools has been developed, of which STREME [26] is the latest and most powerful for motif discovery in sequence datasets.
The STREME algorithm is based on a generalized suffix tree and evaluates motifs using a statistical test of the enrichment of matches to the motif in a primary set of sequences compared to a set of control sequences [26].
In parallel, machine learning methods have been widely used in the field of genomics over recent years and have proved to be highly effective for solving complex and massive data analysis problems [27]. For viral genomic sequence classification, CASTOR [28] has shown the relevance of RFLP (restriction fragment length polymorphism) signatures coupled with machine learning models. In cross-validation evaluations, these models achieved F1-scores > 99% for the prediction of viral genomes of hepatitis B and human papillomavirus. However, these signatures showed some limitations for HIV sequence prediction, where the F1-score dropped below 0.90.
Subsequently, KAMERIS [29] addressed this problem by using k-mers (nucleotide subsequences of length k) to characterize the sequences given to the learning model. To tackle the problem of the exponential number of features (4^k) associated with k-mers, KAMERIS performs a dimensionality reduction using truncated singular value decomposition. However, this transformation significantly affects the ability to explain the predictions of the model.
For this reason, CASTOR-KRFE [30] focuses on the identification of genomic signatures based on minimal sets of k-mers to discriminate among several groups of genomic sequences. During cross-validation evaluations covering a wide range of viruses, CASTOR-KRFE successfully identified minimal sets of motifs. Subsequently, these motifs, coupled with supervised learning algorithms, made it possible to build prediction models with an average F1-score > 0.96 [30].
However, this study is limited to identifying an optimal set of motifs, instead of exploring the [...] showed that KEVOLVE-identified motifs allowed the construction of models that out-performed specialized HIV prediction tools.
Here, we evaluate KEVOLVE, whose search function has been improved in order to identify smaller sets of motifs while trying to respect the same discriminative performance criteria. We compared it with several reference tools (MEME, STREME and CASTOR-KRFE) for identifying discriminating motifs among SARS-CoV-2 genome sequences. The motifs were first identified in a restricted set of nucleotide sequences associated with different variants of SARS-CoV-2. Second, the motifs were used to build prediction models that were assessed through the classification of a large set of SARS-CoV-2 sequences. Third, the motifs identified by KEVOLVE were analyzed in order to highlight the potential biological functions of the sequences/motifs in question. Finally, a specific analysis was dedicated to the new variant of concern, Omicron, that [...]
To assess the relative accuracy of KEVOLVE in identifying discriminating motifs, we performed a comparative study with specialized tools. Each tool was first used to identify a subset of discriminating motifs in a set of training sequences of SARS-CoV-2 variants. These sets of motifs were designed to provide genomic signatures specific to each SARS-CoV-2 variant. In a second step, these signatures were combined with a supervised learning algorithm and the training sequences to fit a prediction model. Then, the quality of the signatures was assessed through the predictions of the trained models on a large test set of unknown sequences. Finally, we analyzed, in line with the literature, the variant-discriminating motifs identified by KEVOLVE according to their location in the genome, to assess the potential functional impact of these mutations.
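The evaluation protocol described above (motifs as features, a model fitted on training sequences, predictions on held-out sequences) can be illustrated with a minimal Python sketch. It is not KEVOLVE's implementation: the toy sequences, the two 4-nucleotide motifs, and the helper names `encode`, `fit_profiles` and `predict` are hypothetical, and a simple nearest-profile rule stands in for the supervised learning algorithm used in the study.

```python
from collections import defaultdict


def encode(sequence, motifs):
    """Presence/absence vector: 1 if the motif occurs in the sequence, else 0."""
    return tuple(int(motif in sequence) for motif in motifs)


def fit_profiles(training_set, motifs):
    """Average the presence vectors of each class (stand-in for model fitting)."""
    sums = defaultdict(lambda: [0] * len(motifs))
    counts = defaultdict(int)
    for sequence, label in training_set:
        for i, bit in enumerate(encode(sequence, motifs)):
            sums[label][i] += bit
        counts[label] += 1
    return {label: [total / counts[label] for total in sums[label]]
            for label in sums}


def predict(sequence, profiles, motifs):
    """Assign the class whose profile is closest to the sequence's vector."""
    vector = encode(sequence, motifs)

    def distance(label):
        return sum((v - p) ** 2 for v, p in zip(vector, profiles[label]))

    return min(sorted(profiles), key=distance)


# Toy data (assumption): two hypothetical variants "A" and "B".
MOTIFS = ["ACGT", "TTGA"]
TRAIN = [("AAACGTAA", "A"), ("CCACGTCC", "A"),
         ("GGTTGAGG", "B"), ("TTTTGACC", "B")]
TEST = [("CACGTG", "A"), ("ATTGAT", "B")]

PROFILES = fit_profiles(TRAIN, MOTIFS)
predictions = [predict(seq, PROFILES, MOTIFS) for seq, _ in TEST]
accuracy = sum(p == label for p, (_, label) in zip(predictions, TEST)) / len(TEST)
print(predictions, accuracy)
```

Because each feature is the presence or absence of one motif, a prediction can be traced back to the motifs that drove it, which is the explainability argument made for k-mer signatures in the text.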
Discriminative motif identification tools
The first tool that was evaluated was KEVOLVE [31]. KEVOLVE is a new method based on a genetic algorithm including a machine learning kernel. The KEVOLVE implementation is based on two main units: 1) an identification unit that provides subsets of features that are minimal and likely to provide the best performance metrics; and 2) a prediction unit that applies an ensemble classifier using the subsets of features.
The second tool that was evaluated was CASTOR-KRFE [30]. It is an alignment-free machine learning approach for identifying a set of genomic signatures based on k-mers to discriminate between groups of nucleic acid sequences. The core of CASTOR-KRFE is based on recursive feature elimination using SVM (SVM-RFE). CASTOR-KRFE identifies an optimal length of k to maximize classification performance and minimize the number of features. This method also provides a solution to the problem of identifying the optimal length of k-mers for genomic sequence classification [32].
The third tool that was evaluated was MEME (discriminative mode) [25], a tool from the MEME suite [24] specialized in motif identification. MEME takes two sets of sequences as input and identifies enriched motifs that discriminate the primary set from the control set. In discriminative mode, the algorithm first calculates a position-specific prior from the two sets of sequences. It then searches the first set of sequences for motifs using the position-specific prior to inform the search based on the discriminative prior D [33]. In addition, MEME takes as a parameter the expected distribution type of the motifs to be identified, which improves the sensitivity and quality of the motif search.
In discriminative mode, the two available options are: 1) zero or one occurrence per sequence (zoops), where MEME assumes that each sequence may contain at most one occurrence of each motif; and 2) one occurrence per sequence (oops), where MEME assumes that each sequence in the dataset contains exactly one occurrence of each motif.
The last tool evaluated was STREME [26], which during a recent comparative study of motif identification was found to be more accurate, sensitive and thorough than several widely used algorithms [26]. The STREME algorithm makes use of a data structure called a generalized suffix tree and evaluates motifs using a one-sided statistical test of the enrichment of matches to the motif in a primary set of sequences compared to a set of control sequences. STREME assumes a zoops distribution in each primary sequence, but motif discovery will not be negatively affected if a primary sequence contains more than one occurrence of a motif.
[...] section was dedicated to it. We specify that only complete genomes with high coverage were included in our data set (Table 1), and the list of accession IDs of the sequences used in our different datasets is available on our GitHub repository.
Subsequently, this initial dataset was partitioned into two independent subsets. The first (training subset) was composed of 2,250 randomly selected sequences (250 sequences for each of the 9 variant types). The second (testing subset) was composed of the remaining sequences (224,282 sequences).
Setting the length of k
A preliminary step of this comparative study consists in setting the parameter k for the length of the motifs to be discovered by the respective identification tools. For this purpose, we used CASTOR-KRFE, giving it as input the training sequence set.
For the associated parameters, we set the performance threshold to be maintained while reducing the number of features to T = 0.99 and the minimum/maximum length of k-mers to be explored to k-min = 1 and k-max = 20 respectively. As output, CASTOR-KRFE identified the following subset, which is composed of 9 motifs of length k = 8: [AACTAAAA, ATATCTGG, AATTTCTC, ATAGAATG, CCGGTATA, CATAGCGC, TAGTGAAT, TCTTGCAT, CAAAGTAG]. During the CASTOR-KRFE identification process, this subset of motifs, coupled with a supervised prediction model based on a linear SVM, was evaluated by 5-fold cross-validation on the training set and obtained a weighted F1-score > 0.99. In addition, the length of k-mers that was identified was consistent with other studies using k-mers for viral sequence classification [9, 30, 32].
Benchmarking
To assess the relevance of the discriminating motifs identified by each tool, we followed an evaluation inspired by the one conducted in [30]. For CASTOR-KRFE, the previously identified subset of motifs, coupled with the set of training sequences and an SVM, was used to fit a prediction model. This model was then used to predict the testing sequence set. Regarding KEVOLVE, the identification unit was run with the training set of sequences as input and the following parameters: n iterations = 1000, n solutions = 100, n chromosomes = 2500, n genes = 1, objective score = 0.99, crossover rate = 0.2, mutation rate = 0.1 and variance threshold = 0.01. Initially, KEVOLVE was designed to identify multiple discriminating subsets to build a single ensemble prediction model. However, in this evaluation, for each identified subset, a model was trained and evaluated by predicting the test set.
Table 2. Summary of the evaluated motif identification tools and their associated parameters. The Tools column provides information about the different tools evaluated in this study.
The column Number of motifs indicates the number of discriminating motifs that each tool was asked to identify. For CASTOR-KRFE and KEVOLVE, no value is given because these tools automatically try to minimize the number of motifs. The column Motifs width corresponds to the length of the discriminating motifs to be identified. The column Site distribution refers, for MEME, to how the discriminating motifs are assumed to be distributed in the sequences, which improves the sensitivity and quality of the search. The column Discovery mode also indicates, for MEME, the type of search to perform. In this context we selected the mode for identifying motifs that discriminate between the groups of sequences given as input. The Performance Threshold column refers to a quality criterion that the identified motifs must satisfy. For MEME and STREME, the n best motifs in terms of p-value are selected, where n corresponds to the number in the Number of motifs column. For CASTOR-KRFE and KEVOLVE, the algorithms search until they obtain a set of motifs that satisfies an F1-score > 0.99 during their internal evaluation.
To evaluate MEME, considering its limitation to take as input a binary set (primary set and control set), we set up the following process: for each variant v present in the training set V, we selected all sequences belonging to v to form the primary set. All other sequences in V were used to form the control set. Then, we applied MEME to discover the motifs that discriminated the primary set from the control set. This process was repeated for each v belonging to V in order to build a set of motifs that can discriminate each variant from the others. Then, this set was used to train a model and predict the testing set in the same configuration as CASTOR-KRFE and KEVOLVE. For the associated site distribution parameter, both zoops and oops options were evaluated.
In addition, for the motifs to be identified, we ran the experiments to discover 1, 2 and 3 motifs of width 8 for each variant. For MEME, this involved the training and evaluation of six prediction models.
Finally, for STREME, we applied the same iterative process to identify the motifs as for MEME. As mentioned before, for the motif distribution type, STREME does not require an input parameter and handles this automatically. Finally, as for MEME, we [...]
CASTOR-KRFE identified a set composed of 9 discriminating motifs, which is slightly more than KEVOLVE. To describe the results of MEME and STREME, we use the name of the tool followed by the distribution type and the number of motifs to identify. If this is not specified, we discuss the tool overall. With the option of 1 motif per variant, MEME zoops, MEME oops and STREME were each able to constitute a subset of 9 discriminative motifs, which is similar to CASTOR-KRFE. For the 2 motifs per variant option, MEME zoops, MEME oops and STREME identified 14, 17 and 18 discriminative motifs respectively. Finally, for the 3 motifs per variant configuration, the identified subsets reached sizes of 18, 24 and 26 for MEME zoops, MEME oops and STREME, respectively. These results show that when the number of motifs to be discovered increases, MEME zoops tends to identify more redundant motifs, unlike MEME oops, STREME, CASTOR-KRFE and KEVOLVE.
In summary, KEVOLVE identified, from a small portion of the dataset, multiple motifs that can discriminate by their absence or presence between different groups of SARS-CoV-2 variants. The discriminative potential of these motifs can be generalized to larger data sets, and the motifs can constitute genomic signatures associated with SARS-CoV-2 variants.
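The one-versus-rest procedure applied to the binary tools, and the merging of per-variant motifs into a single de-duplicated set, can be sketched as follows. This is an illustration under stated assumptions only: `top_kmers` is a hypothetical stand-in that ranks k-mers by the difference in presence frequency between the primary and control sets (it is not the MEME or STREME algorithm), and the toy sequences and labels are invented.

```python
def kmers(sequence, k):
    """Set of all substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def one_vs_rest(dataset):
    """Yield (variant, primary set, control set) for each variant label."""
    variants = sorted({label for _, label in dataset})
    for variant in variants:
        primary = [seq for seq, label in dataset if label == variant]
        control = [seq for seq, label in dataset if label != variant]
        yield variant, primary, control


def top_kmers(primary, control, k, n):
    """Hypothetical stand-in for a discriminative motif finder (not MEME)."""
    def freq(kmer, sequences):
        return sum(kmer in seq for seq in sequences) / len(sequences)

    candidates = set().union(*(kmers(seq, k) for seq in primary))
    # Rank by enrichment in the primary set; break ties alphabetically.
    ranked = sorted(candidates,
                    key=lambda km: (-(freq(km, primary) - freq(km, control)), km))
    return ranked[:n]


def merged_motifs(dataset, k, n):
    """Union of the n motifs found per variant, dropping redundant motifs."""
    merged = []
    for _, primary, control in one_vs_rest(dataset):
        for motif in top_kmers(primary, control, k, n):
            if motif not in merged:
                merged.append(motif)
    return merged


# Toy data (assumption): three hypothetical variants with one sequence each.
TOY = [("AAAC", "x"), ("AAAG", "y"), ("TTTT", "z")]
motif_set = merged_motifs(TOY, k=3, n=2)
print(motif_set, len(motif_set))
```

In this toy run the k-mer AAA is returned for both x and y, so the merged set is smaller than the number of motifs requested. The same effect explains why the merged sets reported above contain fewer motifs than the number of variants multiplied by the number of motifs per variant.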
Biological interests of the identified motifs
We analyzed the variant-discriminating motifs identified by KEVOLVE according to their location in the genome, to assess the potential functional impact of the associated mutations.
Preliminary sequence analyses
To study the motifs, we first used the UGENE bioinformatics software [35] to perform [...] 500 sequences that were aligned by the MUSCLE algorithm [14] in large alignment mode. From this alignment, we calculated the dissimilarity matrix based on Hamming distances. Finally, the matrix representing the percentages of nucleotide dissimilarity between the different groups of SARS-CoV-2 variants, as well as a phylogenetic tree (Fig 3) based on the neighbour-joining method [36], was computed.
From this matrix we can observe that the divergence between the genomes of the several clusters of SARS-CoV-2 variants is less than 1% and the mean divergence between all the sequences is 0.29%. Focusing on the phylogenetic tree on the right of [...] Considering the columns related to Omicron as well as the phylogenetic tree, we observe that Omicron is the most divergent. It diverges by 0.44% compared to the other variants and shows an intra-variant divergence of 0.30%. Lastly, the Alpha, Zeta and Iota variants are the least divergent (0.26%, 0.24% and 0.26% respectively compared to the other variants, and 0.05%, 0.007% and 0.14% intra-variant divergence).
[...] Table 6, the different mutations contained in the identified motifs, where they are located and the associated variants.
Concerning the Alpha variant, the identified motifs highlighted the D1118H mutation located in the Spike glycoprotein, the SGF3675-3677 deletions located in ORF1ab (NSP6), which are also present in the Beta, Gamma, Eta and Iota variants, and the substitutions R203K / G204R, which are shared with the Zeta and Omicron variants.
A recent study [37] showed that the 203K/204R mutation located in ORF9 (Protein N) is associated with increased COVID-19 infectivity. Thus, this mutation is potentially a major contributor to the high contagiousness of Omicron.
For the Beta variant, the motifs pointed out the K1655N mutation in ORF1ab (NSP3), the Q57H mutation located in ORF3a, which is present in Epsilon and Iota, as well as the T205I mutation, which is shared with Epsilon and Eta and is located in ORF9 (Protein N). Regarding the Gamma variant, KEVOLVE identified motifs that contain three characteristic substitutions of this variant [38]: K1795Q in ORF1ab (NSP3), and R190S and L18F in ORF2 (Spike Protein S1).
The motifs associated with the Delta variant highlighted the D63G (ORF9 (Protein N)), G5063S (ORF1b (NSP12)), D950N (ORF2 (Protein Spike S2)), 156del / 157del (ORF2 (Protein Spike S1)), and T19R (ORF2 (Protein Spike S1)) mutations that are specific to Delta [39, 40].
In the motifs, we also identified the I82T mutation located in ORF5 (membrane protein), which has been proposed to increase replication fitness through alteration of cellular glucose uptake during viral replication [41]. Our analysis also confirms the presence of this mutation in the Eta variant [42]. The L452R mutation located in the spike protein, which increases fusogenicity and promotes viral replication and [...] [43], was also found in motifs within the Delta, Epsilon and Kappa genomes.
Finally, three substitutions constituting unique features of Omicron were highlighted by KEVOLVE: I3758V in ORF1ab (NSP6), and N679K and D796Y in ORF2 (Spike protein) [44]. The functional implications of these Omicron variant mutations are unknown, leaving many questions about how they may affect viral fitness and vulnerability to natural and vaccine-mediated immunity [45].
However, the combination of N679K with H655Y and P681H, due to their proximity to the furin cleavage site, could increase the cleavage of spike, enhancing fusion and viral transmission [46].
In this study, we compared the ability of machine learning-based tools to classify SARS-CoV-2 variants with that of statistical tools specialized in discriminative motif identification. We found that the identification of motifs in SARS-CoV-2 genome sequences readily discriminates different groups of variants. However, the machine learning-based approaches, CASTOR-KRFE and KEVOLVE, were generally more efficient. The predictive models based on the motifs (8 for KEVOLVE and 9 for CASTOR-KRFE) identified by these two approaches predicted a large set of SARS-CoV-2 variant sequences (over 225,000) with an average F1-score greater than 0.98. In contrast, the model involving the most motifs (26), using STREME, which was the best performing approach after KEVOLVE and CASTOR-KRFE, only obtained an average F1-score of 0.836. In addition, unlike the statistical approaches, KEVOLVE and CASTOR-KRFE can deal with multi-class sets and are not limited to binary sets. Moreover, KEVOLVE is distinguished by its ability to identify multiple discriminative sets, unlike other tools that are limited to a single optimal set.
Subsequently, we analyzed the motifs identified by KEVOLVE with respect to their recognized or potential functional importance from the existing literature. Not surprisingly, we found that the majority of SARS-CoV-2 motifs identified by KEVOLVE were associated with known mutations among the different viral variants. However, of interest, several motifs derived from CASTOR-KRFE and KEVOLVE did not correspond to recognized variant-specific mutations.
With respect to Omicron, 4 motifs contained what appear to be silent mutations, indicating potentially novel variant-specific virulence determinants [47]. Interestingly, although Omicron displays increased transmissibility and evades vaccine-induced and naturally acquired neutralizing antibodies through its numerous spike mutations, it may also cause less severe disease, perhaps due to altered tissue tropism [48, 49]. As the genetic basis of SARS-CoV-2 virulence remains incompletely understood, variant-discriminating mutations represent valuable targets for understanding differences in viral phenotypes and clinical outcomes.
These results suggest that KEVOLVE is a robust tool for the rapid and accurate determination of SARS-CoV-2 variants. The identified motifs provide genomic signatures that can be used to build peptide or oligonucleotide libraries for rapid and accurate pathogen detection using tools such as VirScan [50]. The identification of motifs by KEVOLVE is automatic and independent of multiple sequence alignments, in contrast to traditional methods by which mutations are associated with variant-discriminating motifs. Indeed, such analyses require manual verification based on annotated reference sequences and multiple sequence alignment, making them impractical for variant discrimination of diverse viruses with large and complex genome structures, such as cytomegalovirus [51]. KEVOLVE and CASTOR-KRFE can also be adapted to allow the automatic analysis of previously identified motifs, further increasing their efficiency.
A SARS-CoV-2 protein interaction map reveals targets for drug repurposing
COVID-19 pneumonia: what has CT taught us? The Lancet Infectious Diseases
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. The Lancet
The first two cases of 2019-nCoV in Italy: Where they come from?
Temporal increase in D614G mutation of SARS-CoV-2 in the Middle East and North Africa
What does the future hold for yellow fever virus? (II)
SARS-CoV-2 genomic variations associated with mortality rate of COVID-19
Emergence of drift variants that may affect COVID-19 vaccine development and antibody treatment
Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
Classification and specific primer design for accurate detection of SARS-CoV-2 using deep learning
Supporting pandemic response using genomics and bioinformatics: A case study on the emergent SARS-CoV-2 outbreak. Transboundary and Emerging Diseases
Design of genomic signatures for pathogen identification and characterization
Benchmarking of alignment-free sequence comparison methods
MUSCLE: multiple sequence alignment with high accuracy and high throughput
Clustal W and Clustal X version 2.0. Bioinformatics
MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution
Alignment-free inference of hierarchical and reticulate phylogenomic relationships
Mathematical and statistical methods for genetic analysis
Alignment-free sequence comparison: benefits, applications, and tools
What is dynamic programming? Nature Biotechnology
Rates of evolutionary change in viruses: patterns and determinants
Alignment uncertainty and genomic analysis
Fitting a mixture model by expectation maximization to discover motifs in bipolymers
The MEME suite
The value of position-specific priors in motif discovery using MEME
STREME: accurate and versatile sequence motif discovery
Machine learning applications in genetics and genomics
A machine learning approach for viral genome classification
An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes
Toward an alignment-free method for feature extraction and accurate classification of viral sequences
Combining a genetic algorithm and ensemble method to improve the classification of viruses
Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer
Nucleosome occupancy information improves de novo motif discovery
disease and diplomacy: GISAID's innovative contribution to global health. Global Challenges
Unipro UGENE: a unified bioinformatics toolkit
The neighbor-joining method: a new method for reconstructing phylogenetic trees
Nucleocapsid mutations R203K/G204R increase the infectivity, fitness, and virulence of SARS-CoV-2
Genomic monitoring unveil the early detection of the SARS-CoV-2 B.1.351 (beta) variant (20H/501Y.V2) in Brazil
Clinical characterization and Genomic analysis of COVID-19 breakthrough infections during second wave in different states of India. medRxiv
Evolutionary analysis of the Delta and Delta Plus variants of the SARS-CoV-2 viruses
Emerging variants of concern in SARS-CoV-2 membrane protein: a highly conserved target with potential pathological and therapeutic implications. Emerging Microbes & Infections
Evolution, mode of transmission, and mutational landscape of newly emerging SARS-CoV-2 variants
SARS-CoV-2 spike L452R variant evades cellular immunity and increases infectivity
Omicron SARS-CoV-2 variant: Unique features and their impact on pre-existing antibodies
SARS-CoV-2 Omicron variant: characteristics and prevention
Positive selection within the genomes of SARS-CoV-2 and other Coronaviruses independent of impact on protein function
SARS-CoV-2 Omicron variant replication in human respiratory tract ex vivo
The SARS-CoV-2 B.1.1.529 Omicron virus causes attenuated infection and disease in mice and hamsters
Comprehensive serological profiling of human populations using a synthetic human virome
Islands of linkage in an ocean of pervasive recombination reveals two-speed evolution of human cytomegalovirus genomes. Virus Evolution