key: cord-0838271-l3gik4v3
authors: Miyake, Jun; Sato, Takaaki; Baba, Shunsuke; Nakamura, Hayao; Niioka, Hirohiko; Nakazawa, Yoshihisa
title: Cluster Analysis of SARS-CoV-2 Gene using Deep Learning Autoencoder: Gene Profiling for Mutations and Transitions
date: 2021-03-16
journal: bioRxiv
DOI: 10.1101/2021.03.16.435601
sha: e101f8ba9f964cc52e8f88606609f27862d30f1c
doc_id: 838271
cord_uid: l3gik4v3

We report on a method for analyzing variants of coronavirus genes using an autoencoder. Since coronaviruses have mutated rapidly and generated a large number of genotypes, an appropriate method for understanding the entire population is required. The autoencoder-based method meets this requirement and is suitable for understanding how and when variants emerge and disappear. For more than 30,000 SARS-CoV-2 ORF1ab gene sequences sampled globally from December 2019 to February 2021, we were able to summarize their characteristics in a 3D plot and show the expansion, decline, and transformation of the virus types over time and by region. Based on ORF1ab genes, the SARS-CoV-2 viruses were classified into five major types (A, B, C, D, and E in the order of appearance): the virus type that originated in China at the end of 2019 (type A) practically disappeared in June 2020; two virus types (types B and C) have emerged in the United States and Europe since February 2020, and type B has become a global phenomenon. Type C was prevalent only in the U.S. and is suspected to be associated with high mortality, but this type also disappeared at the end of June. Type D is found only in Australia. Currently, the epidemic is dominated by types B and E.

it would be helpful to conceptualize these viral mutations and visualize the spatiotemporal transition.

We have been studying the application of a deep-learning autoencoder to the analysis of gene sequences (Miyake et al. 2018). The feature extraction capability of an autoencoder is useful for this kind of analysis.
There is no need to organize the potentially characteristic sites in the gene beforehand. In our previous study of the human leukocyte antigen A (HLA-A) gene, we discovered that an autoencoder can correctly represent and classify differences in HLA-A alleles (Miyake et al. 2018). An autoencoder has the potential to extract the genetic characteristics of a gene at a level close to human recognition, so a brand-new method of classification could be realized.

By using a deep learning autoencoder, various analyses of genes can be performed in a limited period of time on a GPU computer, as long as the target is about tens of thousands of genes with the length of a coronavirus genome (tens of thousands of base pairs). An autoencoder requires neither gene pre-processing, such as alignment and marking of characteristic gene sequences, nor supervised learning data prepared in advance. Despite this, gene types can be classified and displayed as clusters in three-dimensional space. Genes with similar sequences form a single cluster, and the group can be grasped intuitively. The spatial distances between genes and clusters can serve as an indicator of genetic relationships and may contribute to a sophisticated understanding of evolutionary processes.

In this paper, we used the ORF1ab gene sequences of the new coronaviruses (collected between December 2019 and February 2021), which were obtained from the NCBI Virus and NCBI GenBank databases, to extract the self-contained features of about 30,000 genes and display them in three-dimensional space to investigate how the SARS-CoV-2 virus mutated over time.

in this study. Namely, we applied the document vector method: without alignment, each nucleotide sequence was replaced by a vector (4^5 = 1,024 dimensions) holding a normalized histogram of the 1,024 possible words, where each word is a short 5-mer nucleotide sequence.
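The document vector step described above can be illustrated with a minimal sketch. It is not the authors' code: the overlapping sliding window with a stride of one, the skipping of windows that contain ambiguous bases, and the normalization so that entries sum to 1 are all assumptions, and the names `kmer_histogram`, `WORDS`, and `WORD_INDEX` are hypothetical.

```python
from itertools import product

K = 5               # word length used in the text (5-mers)
ALPHABET = "ACGT"

# Fixed ordering of all 4**5 = 1,024 possible 5-mer words.
WORDS = ["".join(p) for p in product(ALPHABET, repeat=K)]
WORD_INDEX = {word: i for i, word in enumerate(WORDS)}


def kmer_histogram(sequence):
    """Return a 1,024-dimensional normalized 5-mer histogram of a sequence."""
    seq = sequence.upper()
    counts = [0] * len(WORDS)
    # Assumption: overlapping windows, advancing one base at a time.
    for start in range(len(seq) - K + 1):
        index = WORD_INDEX.get(seq[start:start + K])
        # Assumption: windows with non-ACGT characters (e.g. N) are skipped.
        if index is not None:
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(WORDS)
    # Assumption: normalize so that the entries sum to 1.
    return [c / total for c in counts]


vector = kmer_histogram("ACGTACGTAC")
print(len(vector))  # 1024
```

Because the vector has the same length regardless of the input sequence length, sequences can be fed to a fixed-size autoencoder input layer without prior alignment.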
In this research, the hierarchy was compressed to four layers and three dimensions. In order to visualize the obtained 3D data, we plotted them as x, y, z coordinates in 3D space. Each dot corresponds to a variant nucleotide sequence. The spatial distance from the center of all the dots plotted in 3D space was calculated and used to represent the gene profile in a time series.

Phylogenetic trees were constructed using maximum likelihood phylogenetic analysis (RAxML) with 1000 bootstraps (GENETYX ver. 15, GENETYX Co., Tokyo, Japan). Alignment of nucleotide sequences was performed using the above software.

The ORF1ab genes, extracted from the genomes of 33,915 novel coronaviruses (12/19/2019-02/16/2021), were categorized into eight clusters in 3D space (Fig. 1). The variation in ORF1ab gene sequence length was small, so the separation of the clusters was clear. The similar characteristics in 3D coordinates and distances from the center. The close proximity of three pairs of neighboring clusters suggested similarities in their respective mutation profiles. The eight clusters of the ORF1ab genes were categorized into five major groups (Fig. 1). These groups were named A, B, C, D, and E in the order of appearance.

In order to investigate the temporal changes, we replotted the 3D dots monthly or bimonthly for the collection period (Fig. 2). The ORF1ab genes collected during the two months of December 2019 and January 2020 showed a predominance of the type A cluster in the center of the 3D-plotted genes (Fig. 2a, b). Type B became the dominant genotype from February to March, and type E became the dominant genotype from April to May. Type C appeared in February, then declined and disappeared in June. Type D appeared in June-July and disappeared in October.
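The distance-from-center measure mentioned in the methods above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the center is taken to be the arithmetic mean of all dots and the distance is Euclidean, neither of which the text specifies, and the function names are hypothetical.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of (x, y, z) points (assumed definition of 'center')."""
    n = len(points)
    return tuple(sum(p[axis] for p in points) / n for axis in range(3))


def distances_from_center(points):
    """Euclidean distance of every point from the centroid of all points."""
    center = centroid(points)
    return [math.dist(p, center) for p in points]


# Toy 3D coordinates standing in for autoencoder outputs (made-up values).
dots = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]
print(distances_from_center(dots))
```

Plotting each distance against the collection date of the corresponding sequence would give a one-dimensional profile of the kind used for the time series.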
The time series of type C obtained by autoencoder analysis seems to be consistent with the emergence, disappearance, and geolocation of coevolving variant group 4 (CEVg4) reported by Chan et al. (Chan et al. 2020). Based on genome frequencies and geolocations, our classification of types A1, A2, B1, and D seemed to correspond to the wild type, CEVg3, CEVg1, and CEVg6, respectively. The B2 cluster is in a different location from the B1 cluster and is a group of similar size to the B1 cluster (Figs. 1 and 2). In contrast, there is no CEVg similar to CEVg1.

The distance of each dot from the center of all the dots in the 3D space was calculated and used to represent the genotype profile in a time series by country/region. The data were color-coded by cluster and displayed separately by geographic region (Figs. 3 and 4).

The expansion and extinction of genotypes were quite frequent, with a new species emerging and disappearing approximately every two months. It is unclear whether each was derived from a single species or whether a species that already existed expanded.

The maximum likelihood phylogenetic trees of 88 ORF1ab genes and their corresponding full-length genomes are shown in Fig. 5a and b.

The cluster classification by the autoencoder method (shown in Fig. 1) showed a certain correlation with the classification by the phylogenetic tree method (Fig. 5). In both ORF1ab and the whole genome, a certain degree of cohesion was observed for genotypes A1, A2, B1, B2, C, D, E1, and E2, and we judged that there is considerable correlation in gene sequencing. Because classification by autoencoder and phylogenetic tree analysis are both based on sequence homology and differences, the methods do not always match perfectly in principle, but they complement each other in understanding classification. As shown in Fig.
5, they can be considered essentially distantly correlated as classification methods for gene sequences.

We found eight clusters using over 30,000 SARS-CoV-2 ORF1ab genes in the NCBI Virus database, whereas Chan et al. identified nine CEVg using 86,450 genomes in the GISAID database (Chan et al. 2020). There is not yet enough data to rigorously compare the differences between the ABCDE and CEVg classifications. Autoencoder-based classification is considered to be a useful method for scanning the entire SARS-CoV-2 virus for variations or for rapid genetic classification of viral genes and viruses with a certain genetic distance.

With regard to the new coronavirus, more than 40,000 gene sequencings were performed worldwide in one year. In order to take advantage of the vast information space made possible by next-generation sequencing technology, we believe that we need technology to grasp the entire picture of genetic variation and its distribution patterns. We hope that artificial intelligence will contribute to the development of methods for rapid recognition and classification of genetic mutations. We believe that being able to explain the direction of mutations and the principles that constrain them will make a significant contribution to this field. A better understanding of viral evolution will allow us to respond more effectively and quickly to pandemics.

clusters (A1, red; A2, pink; B1, green; B2, light green; C, purple; D, orange; E1, blue; E2, light blue). To prevent mixing, dots in border regions between the clusters were omitted. Along the axis of the distance from the center, the B2 cluster had some overlap with the B1 cluster.

Conserved genomic terminals of SARS-CoV-2 as coevolving functional elements and potential therapeutic targets
Coronaviridae Study Group of the International Committee on Taxonomy of Viruses. The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2
Mapping genome variation of SARS-CoV-2 worldwide highlights the impact of COVID-19 super-spreaders
Quantitative phylogenomic evidence reveals a spatially structured SARS-CoV-2 diversity
Graphical classification of DNA sequences of HLA alleles by Deep learning
A pneumonia outbreak associated with a new coronavirus of probable bat origin