key: cord-0869054-x3f9usoj
authors: Abe, Takashi; Furukawa, Ryuki; Iwasaki, Yuki; Ikemura, Toshimichi
title: Time-series trend of pandemic SARS-CoV-2 variants visualized using batch-learning self-organizing map for oligonucleotide compositions
date: 2021-04-15
journal: bioRxiv
DOI: 10.1101/2021.04.15.439956
sha: c7cb6981037b276faee90e31de1fdade4892cfdd
doc_id: 869054
cord_uid: x3f9usoj

To confront the global threat of coronavirus disease 2019, a massive number of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome sequences have been decoded, with the results promptly released through the GISAID database. Based on variant types, eight clades have already been defined in GISAID, but the diversity can be far greater. Owing to the explosive increase in available sequences, it is important to develop new technologies that make it easy to grasp the whole picture of big sequence data and that support efficient knowledge discovery. An ability to efficiently clarify the detailed time-series changes in genome-wide mutation patterns will enable us to promptly identify and characterize dangerous variants that rapidly increase their population frequency. Here, we collectively analyzed over 150,000 SARS-CoV-2 genomes to understand their overall features and time-dependent changes using a batch-learning self-organizing map (BLSOM) for oligonucleotide composition, which is an unsupervised machine learning method. BLSOM can separate clades defined by GISAID with high precision, and each clade is subdivided into clusters, which show differential increase/decrease patterns based on geographic region and time. This allowed us to identify prevalent strains in each region and to show the commonality and diversity of the prevalent strains.
Comprehensive characterization of the oligonucleotide composition of SARS-CoV-2 and elucidation of time-series trends of the population frequency of variants can clarify the viral adaptation processes after invasion into the human population and the time-dependent trend of prevalent epidemic strains across various regions, such as continents.

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread rampantly worldwide since it was first reported in December 2019, and its momentum is still ongoing (WHO. Simmonds 2020).

Considering this clade-independent tendency, we performed BLSOM analysis of not only the pentanucleotide composition but also its odds ratio, which can reduce the effects caused by changes in the mononucleotide composition. Additionally, to check robustness with respect to sequence accuracy, we used datasets with different sequence accuracies: 167,905 sequences with less than 10% unknown nucleotides (characters other than A, T, G, and C) in the genome sequence and 130,753 sequences with less than 1% unknown nucleotides; for each sequence dataset, the number of cases by region and clade is shown in Table 1.

First, we constructed BLSOMs for sequences with less than 10% unknown nucleotides, using the pentanucleotide composition and its odds ratio (Figure 1A and B). BLSOM utilizes unsupervised machine learning, and the genome sequences are clustered (self-organized) on a two-dimensional plane, based only on differences in the vector data in a 1024 (= 4^5)-dimensional space. Lattice points that include sequences from more than one clade are indicated in black, those that contain no genomic sequences are left blank, and those containing sequences from a single clade are indicated in the color representing the clade.
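As a rough illustration of the input vectors described above, the following sketch computes a 1024-dimensional pentanucleotide frequency vector and a corresponding odds-ratio vector for one genome. The odds ratio is assumed here to be the observed pentanucleotide frequency divided by the product of its mononucleotide frequencies; the text does not give the exact definition, so this is an assumption rather than the authors' implementation. The function names `unknown_fraction` and `pentanucleotide_vectors` are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# All 4**5 = 1024 pentanucleotides in a fixed order.
PENTAMERS = ["".join(p) for p in product(BASES, repeat=5)]


def unknown_fraction(seq: str) -> float:
    """Fraction of characters that are not A, C, G, or T."""
    seq = seq.upper()
    return sum(ch not in BASES for ch in seq) / len(seq)


def pentanucleotide_vectors(seq: str):
    """Return (frequencies, odds_ratios), each a list of 1024 floats.

    Assumption: odds ratio = observed frequency / product of the
    mononucleotide frequencies of the five bases in the word.
    """
    seq = seq.upper()
    # Count only windows made up entirely of A/C/G/T.
    windows = (seq[i:i + 5] for i in range(len(seq) - 4))
    counts = Counter(w for w in windows if set(w) <= set(BASES))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence has no unambiguous pentanucleotide")

    mono = Counter(ch for ch in seq if ch in BASES)
    mono_total = sum(mono.values())
    base_freq = {b: mono[b] / mono_total for b in BASES}

    freqs, odds = [], []
    for word in PENTAMERS:
        observed = counts[word] / total
        expected = 1.0
        for b in word:
            expected *= base_freq[b]
        freqs.append(observed)
        # Words containing a base absent from the genome get 0.0.
        odds.append(observed / expected if expected > 0 else 0.0)
    return freqs, odds
```

Under these assumptions, a filter such as `unknown_fraction(seq) < 0.01` would correspond to the stricter of the two datasets, and either vector could serve as the per-genome input to a self-organizing map.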
The odds ratio (Figure 1B) gave more accurate separations (a smaller percentage of black grid points), possibly by excluding effects owing to the clade-independent time-series change in the mononucleotide composition (Iwasaki, Abe & Ikemura 2021), which affected all SARS-CoV-2 clades. Even for the sequences with low sequence accuracy, clade-dependent separation occurs, allowing us to understand characteristics of the oligonucleotide composition that are specific to each clade; thus, oligonucleotide-BLSOM is thought to be a robust method. However, it is clear that BLSOMs for sequences with less than 1% unknown nucleotides (Figure 1C and D) gave more accurate separation than those shown in Figure 1A and B, and the highest resolution was obtained for the BLSOM for the odds ratio (Figure 1D).

BLSOM is a sequence alignment-free analysis that is suitable for the analysis of massive data. Because sequences at different locations on BLSOM have different oligonucleotide compositions, clustering according to clades means that sequences belonging to different clades have different oligonucleotide combinations, that is, differential combinations of mutations.

3D display of the data for different continents

Using the BLSOM for the pentanucleotide odds ratio (Figure 1D), Figure 1E examines the classification according to four continents (Asia, Europe, North America, and Oceania) that have large numbers of sequences. Here, the lattice points containing sequences of different continents are displayed in black, and those containing only sequences of a single continent are displayed in the color specifying each continent. Although not as clear as clade-dependent separations, regional differences were observed, which should reflect differential shares of prevalent variants among continents.
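The display rule for lattice points (black for mixed, a single color for one label, blank for empty) can be sketched as below. This is a minimal illustration that assumes each sequence has already been mapped to a lattice position by a trained map; the names `classify_lattice`, `mixed_fraction`, and `MIXED` are hypothetical. The "percentage of black grid points" is computed over occupied lattice points, which is an assumption because the text does not state the denominator.

```python
from collections import defaultdict

MIXED = "mixed"  # stands for a black lattice point


def classify_lattice(assignments, grid_shape):
    """Label every lattice point from (position, label) pairs.

    assignments: iterable of ((row, col), label), one per sequence,
                 where label is a clade or a continent.
    Returns a 2D list holding None (blank), a single label, or MIXED.
    """
    rows, cols = grid_shape
    labels_at = defaultdict(set)
    for position, label in assignments:
        labels_at[position].add(label)

    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            labels = labels_at.get((r, c), set())
            if not labels:
                row.append(None)                 # no sequences: blank
            elif len(labels) == 1:
                row.append(next(iter(labels)))   # single label: its color
            else:
                row.append(MIXED)                # several labels: black
        grid.append(row)
    return grid


def mixed_fraction(grid):
    """Share of occupied lattice points that are mixed (black)."""
    occupied = [cell for row in grid for cell in row if cell is not None]
    if not occupied:
        return 0.0
    return sum(cell == MIXED for cell in occupied) / len(occupied)
```

A lower `mixed_fraction` for one map than for another would correspond to what the text calls a more accurate separation.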
However, it is evidently difficult to obtain sufficient information from the results shown in Figure 1E alone. BLSOM is equipped with various tools for visualizing analysis results; therefore, we next show the number of sequences belonging to each lattice point with a 3D display.

Again, using the BLSOM shown in Figure 1D, Figure 2 shows the number of sequences belonging to each lattice point for each clade in each continent as a vertical bar, which is colored by continent, as in Figure 1E. Looking laterally at a particular clade, each clade consists of several subclusters, each consisting of several high peaks surrounded by many low peaks. Different subclusters observed in each clade are distinguished by numbering in each figure, but if they are located in the same zone on BLSOM, the same number is given even if they are from different continents. Looking vertically at a particular continent, sequences of different subclusters of different clades exist in different amounts, and some subclusters are found only in a particular continent; that is, the prevalent variants for each continent can be visualized in an easy-to-understand manner. In Supplementary Figure S1, the data shown in Figure for a certain month is more than 100, the data for that month is indicated by a thick horizontal bar. We focused mainly on such months.

In the clades S/L/V detected in the early stage of the epidemic (December 2019-March 2020), three major subclusters of each clade were observed and distinguished by suffix numbers, and most sequences belonged to two subclusters: S1/L1/V1 and S2/L2/V2. In Asia, many sequences belonging to S1/L1/V1 were detected in December 2019, but in Europe and other regions, S2/L2/V2 were detected more abundantly than S1/L1/V1 in March and April 2020, and the difference in proportion was more pronounced in April than in March.
In March and April in Europe, a remarkable number of sequences belonging to S3/L3/V3 were also detected, showing that three different variants were prevalent at the beginning of the epidemic in Europe. Far fewer than 100 sequences were detected after May; sequences belonging to S1/L1/V1 were mainly detected in Asia and those belonging to S2/L2/V2 were found in other regions, presenting differential trends in prevalent variants among continents.

For clade G, whose epidemic started in Europe in February, we defined five subclusters. In February, roughly equal numbers of sequences belonging to G1 and G2 were detected in Europe and North America, but as the epidemic progressed, those belonging to G2 were mainly detected in Europe, whereas those belonging to both G1 and G2 were prevalent in North America. In Asia, only sequences belonging to G1 were detected; in Oceania, those belonging to G2 accounted for about 10% in the early stage, but afterward, those belonging to the Oceania-specific G5 accounted for the majority.

For GH, we defined seven subclusters, including GH1 and GH2, which dominated in North America and Europe, respectively. In North America, in addition to GH1, several months contain approximately 20% of the sequences belonging to GH3, GH5, and GH6. In Asia, only GH1 was detected. In Oceania, only GH4 and GH7, which were specific to this region, were detected; initially, GH4 was dominant, but after July, GH7 was primarily detected.

For GR, we defined five subclusters, including GR1 and GR2, which dominated in North America and Europe, respectively. Moreover, in Europe, GR1 was detected to the same extent as GR2 in February, but as the epidemic progressed, GR2 began to predominate. In North America, the occupancy of GR1 and GR2 varied to some extent depending on the collection month. In Asia, GR1 was mainly detected, and in Oceania, only region-specific subclusters were detected.
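The monthly comparisons above amount to counting sequences per subcluster for each continent and month and converting the counts to shares. The sketch below illustrates this under two assumptions: each sequence is available as a `(continent, month, subcluster)` record, and the "more than 100" criterion mentioned earlier refers to the number of sequences in that continent and month. The function name `monthly_shares` and the record layout are hypothetical.

```python
from collections import Counter, defaultdict


def monthly_shares(records, min_total=100):
    """Subcluster shares per (continent, month).

    records: iterable of (continent, month, subcluster) tuples,
             one tuple per sequence.
    Only months with more than min_total sequences are returned
    (assumed reading of the 'more than 100' criterion).
    """
    counts = defaultdict(Counter)
    for continent, month, subcluster in records:
        counts[(continent, month)][subcluster] += 1

    shares = {}
    for key, per_subcluster in counts.items():
        total = sum(per_subcluster.values())
        if total > min_total:
            shares[key] = {
                name: n / total for name, n in per_subcluster.items()
            }
    return shares
```

With such a table, a statement like "GH3 accounts for approximately 20% in several months" would correspond to checking `shares[("North America", month)]["GH3"]` for each retained month.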
These temporospatial changes in subclusters show that the subclusters are separations (self-organization) that reflect biological significance and provide fundamental information for understanding the overall picture of the SARS-CoV-2 variants.

The authors declare that there is no conflict of interest regarding the publication of this paper.

Figure S1. 2D display of the classification by clade and continent shown in Figure 2. Each subcluster territory is circled by a dotted line. In clades G, GH, GR and GV, lattice points where fewer than 5 sequences exist are not shown. The sequences belonging to each territory defined here are used for the analysis in Figure 4.

References

Geographic and Genomic Distribution of SARS-CoV-2 Mutations
Viral population analysis of the taiga tick, Ixodes persulcatus, by using Batch Learning Self-Organizing Maps and BLAST search. The Journal of veterinary medical science
Rampant C→U Hypermutation in the Genomes of SARS-CoV-2 and Other Coronaviruses: Causes and Consequences for Their Short- and Long-Term Evolutionary Trajectories
COVID-19: Epidemiology, Evolution, and Cross-Disciplinary Perspectives
On the origin and continuing evolution of SARS-CoV-2
Time-series analyses of directional sequence changes in SARS-CoV-2 genomes and an efficient search method for candidates for advantageous mutations for growth in human cells

[Table: number of sequences per subcluster by month (2019/12 to 2020/10; the month header is repeated four times). Only nonempty cells survive, listed in source order; the last value in each complete row is the row total.]

Clade S
S1: 112 95 122 59 34 17 5 1 4 52 8 3 1 1 2 47 603 276 91 67 2 1 1 2 46 1 1 | 1,654
S2: 1 31 34 134 2 3 1 1 6 449 163 16 7 1 4 2 1 7 1,293 323 39 7 1 2 1 244 54 2 | 2,829
S3: 1 191 40 6 1 | 239
#Total: 1 143 129 256 61 37 18 6 1 11 692 211 22 11 2 4 3 3 54 1,896 599 130 74 3 1 3 3 290 55 2 1 | 4,722

Clade L
L1: 21 172 169 144 9 3 3 1 3 1 14 297 92 11 4 8 88 113 19 1 13 3 | 1,189
L2: 27 34 65 10 5 6 3 7 562 297 40 9 2 1 2 1 180 68 19 3 | 1,341
L3: 174 99 | 273
#Total: 21 199 203 209 19 8 9 4 3 1 21 1,033 488 51 9 2 1 2 4 9 268 181 19 1 32 6 | 2,803

Clade V
V1:

Clade G
G1:

Clade GH
GH1:
GH7: 14 8 3 3 13 8 9 1 | 59
#Total: 4 39 142 239 310 97 6 2 995 959 455 151 150 838 1,095 295 1 3,406 2,280 1,495 1,948 617 471 550 38 25 24 32 5 13 8 9 1 | 16,700

Clade GR
GR1: 21 42 72 121 74 9 4 13 284 378 94 53 52 62 115 0 3 241 586 6 60 5 | 2,295
GR2: 61 54 14 18 18 38 34 0