key: cord-0740659-45g9morh authors: Ikemura, Toshimichi; Iwasaki, Yuki; Wada, Kennosuke; Wada, Yoshiko; Abe, Takashi title: AI-based search for convergently expanding, advantageous mutations in SARS-CoV-2 by focusing on oligonucleotide frequencies date: 2022-05-16 journal: bioRxiv DOI: 10.1101/2022.05.13.491763 sha: 7d33dc3f4739018a246840c106171723e8773475 doc_id: 740659 cord_uid: 45g9morh

Among mutations that occur in SARS-CoV-2, efficient identification of mutations advantageous for viral replication and transmission is important to characterize and defeat this rampant virus. Mutations that rapidly expand in frequency in a viral population are candidates for advantageous mutations, but neutral mutations hitchhiking with advantageous mutations are also likely to be included. To distinguish these, we focus on mutations that appear to occur independently in different lineages and expand in frequency in a convergent evolutionary manner. Batch-learning SOM (BLSOM) can separate SARS-CoV-2 genome sequences by lineage when given only the oligonucleotide composition. Focusing on remarkably expanding 20-mers, each of which is represented by only one copy in the viral genome, allows us to correlate the expanding 20-mers to mutations. Using visualization functions in BLSOM, we can efficiently identify mutations that have expanded remarkably both in the Omicron lineage, which is phylogenetically distinct from other lineages, and in other lineages. Most of these mutations involved changes in amino acids, but there were a few that did not, such as an intergenic mutation.

To confront the global threat of COVID-19 [1] [2] [3], a massive number of SARS-CoV-2 genome sequences have been rapidly decoded and promptly released through the GISAID database [4].
To understand this formidable virus from multiple perspectives, we must implement diverse research methods such as artificial intelligence (AI) that can efficiently analyze big data even on a personal computer and support efficient knowledge discovery through intuitive visualization. Unsupervised machine learning can provide new information without relying on particular models or assumptions. We previously established an unsupervised AI, batch-learning self-organizing map

By analyzing oligonucleotide composition to study viral adaptation, we previously found time-series directional changes (i.e., monotonic increases or decreases) in mono- and oligonucleotide composition of four zoonotic RNA viruses (influenza virus [19] [20] [21] [22] were detectable even on a monthly basis.

In the present study, we used a similar approach to analyze the recently prevalent Omicron subvariants and compare them with previously prevalent lineages. The predicted that a certain G-derived virus once spread to a nonhuman animal and then reinvaded the human population [28] [29] [30]. If a mutation that occurs independently in different lineages remarkably increases its population frequency in distantly related lineages alike, this convergent increase is highly likely to reflect an increase in fitness caused by the mutation. Omicron is therefore believed to be especially well suited for searching for and studying convergently expanding, advantageous mutations that have rapidly spread in the human population [31].

We previously analyzed time-series changes in the composition of short and long oligonucleotides from SARS-CoV-2 strains isolated mainly in the first year of the pandemic and found many long oligonucleotides (e.g., 20-mers) that had expanded remarkably in the viral population, which allowed us to predict candidates for advantageous mutations that had spread among humans [9, 23, 24].
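The idea of tracking long oligonucleotides that expand over time can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the per-month grouping, the "never decreases" test, and the `min_gain` threshold are all assumptions made for illustration, and the toy usage below uses 4-mers rather than 20-mers to keep the example small.

```python
from collections import defaultdict


def kmers(seq, k=20):
    """Return the set of k-mers in seq, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    }


def kmer_fraction_by_month(records, k=20):
    """records: iterable of (month, sequence) pairs.

    Returns {kmer: {month: fraction of that month's genomes containing the k-mer}}.
    """
    totals = defaultdict(int)                      # genomes per month
    hits = defaultdict(lambda: defaultdict(int))   # genomes containing each k-mer
    for month, seq in records:
        totals[month] += 1
        for km in kmers(seq, k):
            hits[km][month] += 1
    return {
        km: {m: hits[km].get(m, 0) / totals[m] for m in totals}
        for km in hits
    }


def expanding_kmers(fractions, months, min_gain=0.5):
    """k-mers whose fraction never decreases over `months` and rises by >= min_gain.

    `min_gain` is a hypothetical threshold, not a value taken from the paper.
    """
    selected = []
    for km, by_month in fractions.items():
        series = [by_month[m] for m in months]
        monotonic = all(a <= b for a, b in zip(series, series[1:]))
        if monotonic and series[-1] - series[0] >= min_gain:
            selected.append(km)
    return sorted(selected)
```

With toy records in which a single C-to-T substitution spreads over three months, only the k-mers that span the substituted base are reported, which is how an expanding k-mer can be traced back to a mutation.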
In the present study, we analyzed 20-mer frequencies to search for mutations that expanded markedly in the Omicron population.

In the case of Omicron, the pandemic is still ongoing during the course of this study and preparation of this paper, and a large number of genome sequences have been released by GISAID. This appears to make research difficult but offers a distinct advantage in that the results obtained at an early stage of the study can be verified in the same publication with new sequences obtained later, increasing the reliability of the analysis. We adopted this strategy in the case of Omicron. On analysis, which was intended to provide an overall picture of the Omicron lineage, the

The self-organizing map (SOM) developed by Kohonen [32, 33] is an unsupervised neural network algorithm that implements a characteristic nonlinear projection from the high-dimensional space of input data onto a two-dimensional array of weight vectors.

own territory (red) on the right side of the map (Fig 2A).

to Jan. 2022, so we can examine territories of these three sublineages on the BLSOM by mapping the strains in the three sublineages separately onto the BLSOM (Fig 2B). In more detail, using the vectorial data of the occurrence frequency in each genome, we searched for the node at the smallest Euclidean distance, attributed each genome to that node and colored the node for each sublineage, as described previously [7, 9]. Separation

In this BLSOM for heatmaps, the number of nodes was set to 24 (approximately 50 heatmap patterns per node). Here, nodes having multiple 20-mer patterns are shown in black, while nodes with no pattern are left blank (Fig 3A). The number of patterns that were attributed to each node is shown by the height of a colored column in a 3D display (Fig 3B), and for each column, a representative example of heatmaps attributed to the corresponding node is presented.
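The mapping step described above (find the node with the closest Euclidean distance, attribute the genome to it, then color the node by sublineage) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the tiny 2 x 2 grid, the labels, and the rule of marking a node shared by several labels as "mixed" are hypothetical and are not taken from the authors' BLSOM software.

```python
import math


def nearest_node(vector, weights):
    """Return (row, col) of the node whose weight vector is closest to `vector`.

    `weights` is a 2D grid (list of rows) of weight vectors; distance is Euclidean.
    """
    best, best_dist = None, math.inf
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = math.dist(vector, w)
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def territories(genomes, weights):
    """Attribute each (label, vector) genome to its nearest node.

    Returns {(row, col): label}; a node hit by more than one label becomes
    "mixed" (an assumption made for this sketch).
    """
    nodes = {}
    for label, vector in genomes:
        node = nearest_node(vector, weights)
        current = nodes.get(node)
        nodes[node] = label if current in (None, label) else "mixed"
    return nodes


# Toy 2 x 2 map with 2-dimensional weight vectors (hypothetical values).
weights = [[(0.0, 0.0), (1.0, 0.0)],
           [(0.0, 1.0), (1.0, 1.0)]]
genomes = [("A", (0.1, 0.0)), ("A", (0.0, 0.2)), ("B", (0.9, 0.9)),
           ("C", (0.9, 0.1)), ("B", (0.8, 0.2))]
node_labels = territories(genomes, weights)
```

In practice the input vectors would be the oligonucleotide frequency vectors of the genomes and the weight vectors would come from a trained map; the toy values here only exercise the assignment logic.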
First, the green column on the rightmost will be

case where a portion of the red region is clearly missing in a particular territory, as in the example numbered 1 in Fig 3B, the corresponding mutation is marked with ++ for the territory. If the red region was extremely limited, as in the example numbered 2 in Fig 3B, the corresponding mutation is marked with +; even for the + category, a few to several hundred genome sequences were usually found because the condition of BLSOM (Fig 2A) was set so that an average of 100 sequences belonged to each node, indicating that the corresponding mutations expanded to a non-negligible level in the corresponding lineages. The lineages in Table 1

Since the mutations in Table 1

World Health Organization. Coronavirus Disease (COVID-2019)
Characteristics of SARS-CoV-2 and COVID-19
A Review of Coronavirus Disease-2019 (COVID-19)
disease and diplomacy: GISAID's innovative contribution to global health
Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes
Informatics for unveiling hidden genome signatures
Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples
Self-Organizing Map (SOM) unveils and visualizes hidden sequence characteristics of a wide range of eukaryote genomes
AI for the collective analysis of a massive number of genome sequences: various examples from the small genome of pandemic SARS-CoV-2 to the human genome
García-Sastre A. Inhibition of interferon-mediated antiviral responses by influenza A viruses and other negative-strand RNA viruses
Interferons and viruses: an interplay between induction, signalling, antiviral responses and virus countermeasures
Cellular host factors for SARS-CoV-2 infection
The SARS-CoV-2 Spike Implications for the Design of Spike-Based Vaccine Immunogens
Geographic and Genomic Distribution of SARS-CoV-2
The SARS-CoV-2 RNA-protein interactome in infected human cells
Potential role of cellular miRNAs in coronavirus-host interplay
The emerging role of microRNAs in the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection
Das R. RNA genome conservation and secondary structure in SARS-CoV-2 and SARS-related viruses: a first look
Prediction of directional changes of influenza A virus genome sequences with emphasis on pandemic H1N1/09 as a model case
Novel bioinformatics strategies for prediction of directional sequence changes in influenza virus genomes and for surveillance of potentially hazardous strains
Directional and reoccurring sequence change in zoonotic RNA virus genomes visualized by time-series word count
Time-series oligonucleotide count to assign antiviral siRNAs with long utility fit in the big data era
Time-series analyses of directional sequence changes in SARS-CoV-2 genomes and an efficient search method for candidates for advantageous mutations for growth in human cells
Human cell-dependent, directional, time-dependent changes in the mono- and oligonucleotide compositions of SARS-CoV-2 genomes
Pandemic SARS-CoV-2 Variants Visualized Using Batch-Learning Self-Organizing Map for Oligonucleotide Compositions
Heavily mutated Omicron variant puts scientists on alert
From Alpha to Omicron SARS-CoV-2 variants: What their evolutionary signatures can tell us
OMICRON (B.1.1.529): A new SARS-CoV-2 variant of concern mounting worldwide fear
Where did Omicron come from? Three key theories
Evidence for a mouse origin of the SARS-CoV-2 Omicron variant
The emergence and ongoing convergent evolution of the SARS-CoV-2 N501Y lineages
The self-organizing map
Engineering applications of the self-organizing map
Omicron BA.2 (B.1.1.529.2): high potential to becoming the next dominating variant
First cases of infection with the 21L/BA.2 Omicron variant in Marseille
CoVariants: SARS-CoV-2 Mutations and Variants of Interest
Ultsch A. Self organized feature maps for monitoring and knowledge acquisition of a chemical process
A new coronavirus associated with human respiratory disease in China
The establishment of reference sequence for SARS-CoV-2 and variation analysis
Rampant C→U Hypermutation in the Genomes of SARS-CoV-2 and Other Coronaviruses: Causes and Consequences for Their Short- and Long-Term
Potential APOBEC-mediated RNA editing of the genomes of SARS-CoV-2 and other coronaviruses and its impact on their longer term evolution
In vitro evolution predicts emerging CoV-2 mutations with high affinity for ACE2 and cross-species binding