key: cord-0193754-u6ewlh16 authors: Wang, Rui; Hozumi, Yuta; Yin, Changchuan; Wei, Guo-Wei title: Decoding SARS-CoV-2 transmission, evolution and ramification on COVID-19 diagnosis, vaccine, and medicine date: 2020-04-29 journal: nan DOI: nan sha: 6f9f3822c92c26963496411548e43cb7939fdd4f doc_id: 193754 cord_uid: u6ewlh16

Tremendous effort has been devoted to the development of diagnostic tests, preventive vaccines, and therapeutic medicines for coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Much of this development has been based on the reference genome collected on January 5, 2020. Based on the genotyping of 6156 genome samples collected up to April 24, 2020, we report that SARS-CoV-2 has accumulated an alarming 4459 mutations, which can be clustered into five subtypes. We introduce the mutation ratio and the mutation $h$-index to characterize protein conservativeness and unveil that the SARS-CoV-2 envelope protein, main protease, and endoribonuclease protein are relatively conservative, while the SARS-CoV-2 nucleocapsid protein, spike protein, and papain-like protease are relatively non-conservative. In particular, the nucleocapsid protein has accumulated more than one mutation for every two of its residues in the past few months, signaling devastating impacts on the ongoing development of COVID-19 diagnosis, vaccines, and drugs.

The ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has posed a crucial threat to public health and the world economy since it was detected in Wuhan, China in December 2019 [1] . As of April 24, 2020, more than 2.6 million cases of COVID-19 have been reported in 185 countries and territories, resulting in more than 184,000 deaths [2] . Tragically, there is no sign of slowing down or relief at this moment, partially because there are no specific anti-SARS-CoV-2 drugs or effective vaccines.
SARS-CoV-2 is a positive-strand RNA virus and belongs to the beta coronavirus genus. Its genomic information underpins the development of antiviral medical interventions, preventive vaccines, and viral diagnostic tests. The first SARS-CoV-2 genome was reported on January 5, 2020 [3] . It has a genome size of 29.99 kb, which encodes multiple non-structural and structural proteins. The leader sequence and ORF1ab encode non-structural proteins for RNA replication and transcription. Among the various non-structural proteins, the viral papain-like (PL) proteinase, main protease (or 3CL protease), RNA polymerase, and endoribonuclease are common targets in antiviral drug discovery. However, it typically takes more than ten years to bring an average drug to market. The downstream regions of the genome encode structural proteins, including the spike (S) protein, envelope (E) protein, membrane (M) protein, and nucleocapsid (N) protein. Notably, the S protein uses one of its two subunits to bind directly to the host receptor angiotensin-converting enzyme 2 (ACE2), enabling virus entry into host cells [4] . The nucleocapsid (N) protein, one of the most abundant viral proteins, can bind to the RNA genome and is involved in replication processes, assembly, and the host cellular response during viral infection [5] . The E protein is a small integral membrane protein and a virulence factor that regulates the cell stress response and apoptosis and promotes inflammation [6] . The structural proteins, especially the S protein and the N protein, are the candidate antigens for vaccine development. Safe and effective vaccines are urgently needed to prevent the spread of SARS-CoV-2. However, it typically takes over one year to design and test a new vaccine. Furthermore, the SARS-CoV-2 genome undergoes rapid mutations, partially stimulated by the challenging immunological environments arising from COVID-19 patients of different races, ages, and medical conditions.
SARS-CoV-2 exists as heterogeneous and dynamic populations because of its error-prone replication [7] . A vaccine developed at one time may not be effective against infection by newly mutated virus isolates. An alarming fact is that many of these mutations may devastate the ongoing effort to develop effective medicines, preventive vaccines, and diagnostic tests. Accurate identification of the antigens and their mutations represents the most important roadblock in developing effective vaccines against COVID-19. For example, different vaccines are needed for different geographic locations due to the predominant mutations in the corresponding regions. In COVID-19 diagnosis, the diagnosis kits are designed using two major methods, i.e., specific serological tests and molecular tests. Serological tests detect specific COVID-19 proteins. Molecular diagnoses test for specific COVID-19 pathogenic genes and usually rely on the polymerase chain reaction (PCR). Because of the fast mutations of the SARS-CoV-2 genome, genotyping analysis of SARS-CoV-2 may optimize the PCR primer design to detect SARS-CoV-2 accurately and reduce the risk of false negatives caused by genome sequence variations. In addition, the genotyping analysis may also reveal regions that are highly conserved with very few mutations, which can be selected as target sequences for reliable drug therapy and general diagnosis. The evolution pattern arising from the highly frequent mutations of SARS-CoV-2 is observable on short time scales. In the early infection period (i.e., February 2020), the SARS-CoV-2 variants were clustered as S and L types [8] . Recent genotyping analysis reveals a large number of mutations in various essential genes encoding the S protein, the N protein, and RNA polymerase in the SARS-CoV-2 population [9] . Monitoring the evolutionary patterns and spread dynamics of SARS-CoV-2 is of great importance for COVID-19 control and prevention.
Although mutations occur randomly, most preserved mutations can be regarded as the virus's response to host immune system surveillance. As a result, the faster and the wider SARS-CoV-2 spreads, the more frequent and diverse the mutations will be. The tracking and analysis of COVID-19 dynamics, transmission, and spread are of paramount importance for winning the ongoing battle against COVID-19. Genetic identification and characterization of the geographic distribution, intercontinental evolution, and global trends of SARS-CoV-2 is the most efficient approach for studying COVID-19 genomic epidemiology and offers the molecular foundation for region-specific SARS-CoV-2 vaccine design, drug discovery, and diagnostic development [10] . For example, different vaccines for the shell can be designed according to predominant mutations. This work provides the most comprehensive genotyping to date to reveal the transmission trajectory and spread dynamics of COVID-19. Based on genotyping 6156 SARS-CoV-2 genomes from around the world as of April 24, 2020, we trace the COVID-19 transmission pathways and analyze the distribution of the subtypes of SARS-CoV-2 across the world. We use K-means methods to cluster SARS-CoV-2 mutations, which provides updated molecular information for the region-specific design of vaccines, drugs, and diagnoses. Our clustering results show that globally there are at least five distinct subtypes of SARS-CoV-2 genomes. In the U.S., there are at least three significant SARS-CoV-2 genotypes. We introduce the mutation h-index and the mutation ratio to characterize conservative and non-conservative proteins and genes. We unveil unexpected non-conservative genes and proteins, which raises an alarming warning for the current development of diagnostic tests, preventive vaccines, and therapeutic medicines. Tracking the SARS-CoV-2 transmission pathways and analyzing the spread dynamics are critical to the study of genomic epidemiology.
Temporospatially clustering the genotypes of SARS-CoV-2 in transmission provides insights into diagnostic testing and vaccine development for disease control. In this work, we retrieve and genotype 6156 SARS-CoV-2 isolates from around the world as of April 24, 2020. There are 4459 single mutations in the 6156 SARS-CoV-2 isolates. Based on these mutations, we classify and track the geographical distributions of the 6156 genotyped isolates by K-means clustering. The SARS-CoV-2 genotypes, represented as SNP variants, are clustered into five groups in the world (Table 2). The genotypes in the U.S. are clustered into three groups (Table 2). The optimal number of clusters is established using the Elbow method in the K-means clustering algorithm (see Supporting Material). The detailed distribution of the SNP variants from the world for each cluster is provided in the Supporting Material. The SNP variant clusters from the 11 countries with the highest number of recorded cases are listed in Table 2 (see also Figure 1). The geographic distribution of the SNP variant clusters reflects the approximate transmission pathways and spread dynamics across the world. Several findings can be read from Table 2:

1. Two early subtypes of SARS-CoV-2 (Clusters I and II) are epidemic in the Asian countries (CN, JP, KR).
2. The subtypes of SARS-CoV-2 in Cluster III are not spreading in the European countries (UK, DE, FR, IT).
3. All of the subtypes of SARS-CoV-2 in the five different clusters can be found in the US, AU, and Canada.

Moreover, we analyze the statistics of SNP variants located in different states of the United States. In Table 3 , we list the number of cases in three different clusters with respect to the west coast states. We note that Cluster C in the U.S. is derived from Cluster III in the world, with an additional mutation at position 241 of the leader sequence. The high spread in New York is consistent with the high transmission of SARS-CoV-2 in the European countries, where the subtype in Cluster III is predominant.
Table 5 presents the statistics of single mutations on various SARS-CoV-2 proteins that occurred in the recorded genomes between January 5, 2020, and April 24, 2020. The papain-like protease has the highest number of mutations, 599, while the envelope protein has the lowest, 13. Since the sizes of the proteins vary dramatically, from 1945 residues for the papain-like protease to 75 residues for the envelope protein, it is useful to consider the mutation ratio, i.e., the number of mutations per residue. In this category, the envelope protein still has the lowest score of 0.17, whereas the nucleocapsid protein has the highest score of 0.56, i.e., 235 mutations on its 419 residues. Note that the 3CL protease has the second-lowest mutation ratio of 0.22, indicating its conservative nature. Another relatively conservative protein judged by the mutation ratio is the RNA-dependent RNA polymerase. It has 223 mutations over its 932 residues. Counting the number of single mutations and the mutation ratio does not reflect the fact that some mutations occur numerous times across genome samples while other mutations appear in only a few genome samples. To account for the frequency effect of mutations, we introduce a mutation h-index to measure both the number of mutations and the frequency of mutations of a given protein or genetic section. It is defined as the maximum value of h such that the given protein or genetic section has h single mutations that have each occurred at least h times. It is very interesting to note from Table 5 that the mutation h-index correlates very well with the number of mutations per residue. Specifically, the nucleocapsid protein has both the highest number of mutations per residue, 0.56, and the highest h-index, 27, suggesting that it is the most non-conservative protein in SARS-CoV-2 genomes. In contrast, the envelope protein has the lowest number of mutations per residue, 0.17, and the lowest h-index, 5, indicating its relatively conservative nature.
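To make the two conservation measures concrete, the following Python sketch computes the mutation ratio and the mutation h-index from a list of per-mutation frequencies. It is only an illustration under stated assumptions: the function names are hypothetical and not taken from the authors' code, and the frequency list is the eleven most frequent main protease mutations reported in Table 7 of this work rather than the full mutation record.

```python
def mutation_ratio(num_mutations, num_residues):
    """Number of distinct single mutations per residue of a protein."""
    return num_mutations / num_residues


def mutation_h_index(frequencies):
    """Largest h such that h mutations have each occurred at least h times."""
    h = 0
    ranked = sorted(frequencies, reverse=True)
    for rank, freq in enumerate(ranked, start=1):
        if freq < rank:
            break  # every later mutation is even less frequent
        h = rank
    return h


# Total frequencies of the top 11 main protease mutations (Table 7).
# Assumption: this partial list is used only as a worked example.
main_protease_top11 = [52, 51, 44, 19, 15, 11, 11, 9, 9, 8, 8]

print("h-index:", mutation_h_index(main_protease_top11))
print("mutation ratio:", round(mutation_ratio(68, 306), 2))
```

For this list, nine mutations have each occurred at least nine times, while only seven have occurred ten or more times, so the sketch gives an h-index of 9. Mutations omitted from the top-11 list have a frequency of at most 8 and cannot raise the value, which agrees with the main protease h-index of 9 reported in this work.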
By combining the number of mutations per residue and the mutation h-index, we report that the three most conservative SARS-CoV-2 proteins are 1) the envelope protein, 2) the main protease, and 3) the endoribonuclease. The most non-conservative SARS-CoV-2 proteins are 1) the nucleocapsid protein, 2) the spike protein, and 3) the papain-like protease. Real-time RT-PCR (rRT-PCR) is routinely used for the qualitative detection of nucleic acid from SARS-CoV-2 in diagnostic testing of COVID-19 [11, 12] . The primers used in rRT-PCR are critical for the precise diagnosis of COVID-19 and the discovery of new strains. The primer sequences are specially designed to amplify regions conserved across the existing strains for high specificity and sensitivity, and they are also subject to genotype changes as SARS-CoV-2 evolves. In diagnostic testing of COVID-19, many rRT-PCR primers are designed to detect three regions of SARS-CoV-2 that are perceived to be conservative: (1) the RNA-dependent RNA polymerase (RdRP) gene in the ORF1ab region, (2) the E protein gene, and (3) the N protein gene [11] . Our genotyping statistics given in Table 5 indicate that the nucleocapsid protein is the worst choice. Among the four structural proteins of SARS-CoV-2, namely the spike surface glycoprotein (S) of 1273 amino acid residues, the nucleocapsid protein (N) of 419 amino acid residues, the membrane protein (M) of 222 amino acid residues, and the envelope protein (E) of 75 amino acid residues, the S protein is the most divergent, with 385 unique mutations among the 6156 SARS-CoV-2 genomes. The N protein has 235 unique mutations, and the E protein has 13 mutations. Considering the lengths of the proteins, all four structural proteins have undergone many mutations. The RdRP gene, which is often used in diagnostic testing of COVID-19, also has 223 mutations. Therefore, all three regions targeted in routine rRT-PCR, namely the RdRP gene, the N protein gene, and the E protein gene, have significant mutations.
Precise and robust diagnosis tools must be re-established according to the conserved regions and predominant mutations in the SARS-CoV-2 genomes detailed in the Supporting Material. Notably, SARS-CoV-2 has a unique furin cleavage site, where four amino acid residues (PRRA) are inserted into the S1-S2 junction region 681-684 of the S protein [13] . The furin cleavage site is crucial for zoonotic transmission of SARS-CoV-2 [14] . This study reveals crucial mutations near the S1-S2 junction region in the S protein, including 23403A>G- Moreover, these mutations of the SARS-CoV-2 S protein are located in the epitope region (corresponding to the regions 469-882 and 599-620 in SARS-CoV) [15] . Additionally, many mutated amino acids are on the surface of the S protein, as shown in Fig. 5 . Unfortunately, the S protein is the second most non-conservative protein in the genome based on the number of mutations per residue and the mutation h-index. In fact, about half of the receptor-binding domain residues of the S protein have had mutations in the past few months, as shown in Fig. 6 . Because the surface accessibility of an epitope is also important for the interaction between antibody and antigen, these mutations are critical for the antigenicity of the S protein. Convalescent COVID-19 patients show a neutralizing antibody response after infection, which is directed against the S protein or the N protein [16] . The neutralizing antibody responses against SARS-CoV-2 could give some defense against SARS-CoV-2 infection and thus have implications for preventing SARS-CoV-2 outbreaks. The divergence and the non-conserved regions of the spike proteins might contribute to the antigenicity. The highly frequent mutations identified in the S protein and the N protein must be considered when designing a vaccine. Unfortunately, there is no specific effective drug for SARS-CoV-2 at this point. Much of the drug discovery effort focuses on SARS-CoV-2 non-structural proteins.
Among the major non-structural proteins of SARS-CoV-2, the main protease of 306 amino acids has 68 mutations with 0.22 mutations per residue and a mutation h-index of 9, the RNA polymerase of 932 amino acids has 223 mutations with 0.24 mutations per residue and a mutation h-index of 13, and the papain-like protease of 1945 amino acids has 599 mutations with 0.31 mutations per residue and a mutation h-index of 15. The main protease is the most popular drug target because there are no similar known genes in the human genome, which implies that SARS-CoV-2 main protease inhibitors are likely to be less toxic [17] . The present study suggests that the main protease is the second most conservative protein. Therefore, it remains the most attractive target for drug discovery.

The SARS-CoV-2 spike glycoprotein, or S protein, is comprised of two subunits, S1 and S2, with very different properties [13] , see Fig. 5 . The S1 subunit, as shown in Fig. 5 , contains the receptor-binding domain (RBD) responsible for binding to the host cell receptor angiotensin-converting enzyme 2 (ACE2). The RBD is also the common binding domain for antibodies. The S2 subunit offers structural support for the S protein and mediates fusion between the viral and host cell membranes. After the fusion, the virus releases the viral genome into the host cell. The S1 RBD plays key roles in the induction of neutralizing-antibody and T-cell responses, as well as protective immunity. However, S2 and the extracellular domain (ECD) of the spike protein and their combination are commonly used in recombinant proteins in SARS-CoV-2 antibody development. As shown in Table 5 , the S protein is the most heterogeneous structural protein, with a significant number of mutations as shown in Figs. 5 and 6 and Table 6 . The divergence and the non-conserved regions of the spike protein might contribute to the antigenicity difference among SARS-CoV-2 isolates.
We found that most of the highly frequent mutations of the S protein are located in the S1 subunit. Figure 6 indicates that nearly half of the amino acid residues have had mutations since January 5, 2020. One of the important mutations in S1 is 23010 (V483A), within the RBD for ACE2 binding. A structural study revealed that the amino acids 442-487 in the S1 subunit may impact viral binding to human ACE2 [18, 19] . The mutations identified in this study imply changes in ACE2 binding affinity and the transmissibility of SARS-CoV-2, as well as negative impacts on preventive vaccine and diagnostic test development.

Table 7 (top 11 mutations of the SARS-CoV-2 main protease):

Rank   Position  Mutation  Total frequency  Cluster I  Cluster II  Cluster III  Cluster IV  Cluster V
Top1   10323     K90R      52               1          8           43           0           0
Top2   10097     G15S      51               0          1           0            50          0
Top3   10851     A266V     44               0          2           16           0           26
Top4   10582     D176D     19               0          0           5            0           14
Top5   10771     Y239Y     15               15         0           0            0           0
Top6   10507     N151N     11               0          10          1            0           0
Top7   10948     R298R     11               0          0           0            11          0
Top8   10265     G71S      9                0          0           0            9           0
Top9   10870     L272L     9                0          0           1            4           4
Top10  10319     L89F      8                0          1           4            0           3
Top11  10450     P132P     8

Main protease

The SARS-CoV-2 main protease, or 3CL protease, is essential for cleaving the polyproteins that are translated from the viral RNA [17] . It operates at multiple cleavage sites on the large polyprotein through the proteolytic processing of replicase polyproteins and plays a pivotal role in viral gene expression and replication. The SARS-CoV-2 main protease is one of the most attractive targets for anti-CoV drug design because its inhibition would block viral replication and because inhibitors are unlikely to be toxic, as there are no known similar human proteases. Another reason for the focused drug discovery efforts in developing SARS-CoV-2 main protease inhibitors is that this protein is relatively conservative, as shown in Table 5 . Figure 7 illustrates the main protease mutation patterns. Figure 8 further highlights the inhibitor binding domain (BD). Indeed, the main protease is relatively conservative compared to the spike protein. Table 7 lists the top 11 mutations and their frequencies in our dataset.
It is interesting to see that many mutations, such as Y239Y, N151N, R298R, L272L, and P132P, are degenerate ones. One possible explanation is that nondegenerate mutations are non-silent and are likely to cause unsurvivable disruption to the virus. Note that mutation G15S mostly occurs in Cluster IV. Mutation Y239Y is restricted to Cluster I. Some other mutations, such as R298R, G71S, and P132P, are specific to certain clusters. Nonetheless, some mutations at the BD shown in Fig. 8 are worth noting. They can undermine the ongoing drug discovery effort.

Table 8 (top ten mutations of the SARS-CoV-2 papain-like protease):

Rank   Position  Mutation  Total frequency  Cluster I  Cluster II  Cluster III  Cluster IV  Cluster V
Top1   3037      F106F     3889             0          80          1800         964         1045
Top2   2891      A58T      120              0          119         0            1           0
Top3   3177      P153L     72               0          69          2            1           0
Top4   4540      Y607Y     60               0          60          0            0           0
Top5   7011      A1431V    45               0          43          2            0           0
Top6   6312      T1198K    44               0          42          1            1           0
Top7   7438      Y1573Y    34               0          9           21           4           0
Top8   3373      D218D     29               0          3           0            26          0
Top9   4002      T428I     26               0          1           0            25          0
Top10  6040      F1107F    26               0          10          12           0           4

Figure 9 : Illustration of SARS-CoV-2 papain-like protease mutations using 6W9C as a template [20] .

Papain-like protease

The SARS-CoV-2 papain-like protease (PLPro) is a cysteine cleavage protein located within the non-structural protein 3 (NS3) section of the viral genome [20] . Like the main protease, PLPro activity is required to cleave the viral polyprotein into functional, mature subunits and thereby contributes to the biogenesis of the virus replication. Additionally, PLPro possesses a deubiquitinating activity. The SARS PLPro is also a major therapeutic and diagnostic target. As shown in Table 5 , the SARS PLPro is prone to mutations. Figure 9 shows that mutations are spread all over PLPro. Table 8 lists the top ten mutations in PLPro. Five of these mutations are degenerate ones, including the mutation with the highest frequency. Note that none of the top mutations occurred in Cluster I. In contrast, Cluster II has many different mutations.

[21] . As one of the non-structural proteins, RdRP is located in the early part of the ORF1b section.
As in most other RNA viruses, the SARS-CoV-2 RdRP is considered to be highly conserved in order to maintain viral functions and is thus targeted in antiviral drug development as well as diagnostic tests. On the other hand, the SARS-CoV-2 RNA polymerase lacks proofreading capability, and thus mutations are bound to happen, as shown in Table 5 . Figure 10 illustrates the SARS-CoV-2 RdRP mutations since January 5, 2020. Surprisingly, there are many mutations in SARS-CoV-2 RdRP. Table 9 describes the top ten mutations. As in other cases, five of these mutations are degenerate ones. Cluster I has no nondegenerate mutations.

The endoribonuclease (NendoU) protein is a nidoviral RNA uridylate-specific enzyme that cleaves RNA [22] . It contains a C-terminal catalytic domain belonging to the EndoU family of RNA processing enzymes. The NendoU protein is present among coronaviruses, arteriviruses, and toroviruses. Many aspects of the detailed function and activity of the SARS-CoV-2 NendoU protein are yet to be revealed. Figure 11 depicts SARS-CoV-2 NendoU protein mutations. As in most other SARS-CoV-2 proteins, mutations have occurred in different parts of the protein. Table 5 shows that NendoU is relatively conservative. Table 10 lists the top twelve high-frequency mutations of the SARS-CoV-2 NendoU protein that occurred in the past few months. Three of these mutations are degenerate ones. The frequencies of these mutations range from 38 to 6. Note that Cluster I does not have any of these mutations.

Table 11 (top mutations of the SARS-CoV-2 envelope protein):

Rank   Position  Mutation  Total frequency  Cluster I  Cluster II  Cluster III  Cluster IV  Cluster V
Top1   26319     V25V      8                0          7           1            0           0
Top2   26340     A32A      7                0          5           2            0           0
Top3   26326     L28L      5                0          5           0            0           0
Top4   26256     F4F       4                0          0           2            2           0
Top5   26301     L19L      3                0          2           1            0           0
Top6   26433     K63K      2                0          0           1            0           1
Top7   26370     Y42Y      1                0          1           0            0           0
Top8   26392     S50G      1                0          1           0            0           0
Top9   26313     F23F      1

The SARS-CoV-2 envelope (E) protein is one of the four structural proteins of SARS-CoV-2.
As a transmembrane protein, it is involved in ion channel activity and thus facilitates viral assembly, budding, envelope formation, pathogenesis, and release of the virus [23] . The E protein may not be essential for viral replication, but it is essential for pathogenesis. Figure 12 illustrates the E protein as a very small pentamer with a few mutations. Table 11 shows its top thirteen mutations. Note that the first 7 mutations are degenerate ones. All other mutations have very low frequencies. As shown in Table 5 , the SARS-CoV-2 E protein is very conservative.

Table 12 (top fourteen mutations of the SARS-CoV-2 nucleocapsid protein):

Rank   Position  Mutation  Total frequency  Cluster I  Cluster II  Cluster III  Cluster IV  Cluster V
Top1   28881     R203K     989              1          20          4            964         0
Top2   28882     R203R     983              0          18          1            964         0
Top3   28883     G203R     983              0          18          1            964         0
Top4   28657     D128D     125              1          124         0            0           0
Top5   28311     P13L      102              0          101         1            0           0
Top6   28688     L139L     91               0          90          1            0           0
Top7   29045     P258T     67               0          65          2            0           0
Top8   29046     P258R     67               0          65          2            0           0
Top9   29047     P258P     67               0          65          2            0           0
Top10  29049     R259L     67               0          65          2            0           0
Top11  29050     R259R     67               0          65          2            0           0
Top12  29051     Q260E     67               0          65          2            0           0
Top13  29052     Q259R     67               0          65          2            0           0
Top14  29053     Q260H     67               0

Its primary function is to encapsidate the viral genome. To do so, it is heavily phosphorylated (or charged) and can thereby bind to RNA. Additionally, the SARS-CoV-2 N protein tethers the viral genome to the replicase-transcriptase complex (RTC) and plays a crucial role in viral genome encapsulation. Therefore, it may function completely differently at different stages of the viral life cycle. The SARS-CoV-2 N protein is considered in the literature to be one of the most conservative SARS-CoV-2 proteins and is a popular target for diagnosis and vaccine development [11] . The present work, as shown in Table 5 , indicates that the SARS-CoV-2 N protein is the worst target for any drug, vaccine, or diagnostic development. Table 12 presents the top fourteen mutations of the SARS-CoV-2 N protein since January 5, 2020.
Note that only three out of the fourteen top mutations are degenerate ones, which is a significantly lower ratio than that of other proteins. The frequency of the 14th mutation is 67, which suggests that there are many mutations associated with this medium-sized protein. Most top mutations occurred in Clusters II and IV. Cluster V has none of the top fourteen mutations.

Membrane protein

The SARS-CoV-2 membrane (M) protein is another structural protein and plays a central role in viral assembly and viral particle formation. It exists as a dimer in the virion and has certain geometric shapes that enable certain membrane curvature and binding to nucleocapsid proteins. Similar to other SARS-CoV proteins, the M protein is also a popular target for viral diagnosis and vaccines. Table 5 gives the SARS-CoV-2 M protein the middle ranking for its conservation. Table 13 details the top eleven mutations in the SARS-CoV-2 M protein that occurred in the past few months. Seven of these mutations are degenerate. Clusters I and V have relatively few of these mutations.

On January 5, 2020, the complete genome sequence of SARS-CoV-2 was first released on GenBank (accession number: NC_045512.2) by Zhang's group at Fudan University [3] . Since then, there has been a rapid accumulation of SARS-CoV-2 genome sequences. In this work, 6156 complete, high-coverage genome sequences of SARS-CoV-2 strains from infected individuals around the world are downloaded from the GISAID database [25] (https://www.gisaid.org/) as of April 24, 2020. Records in GISAID without an exact submission date are not taken into consideration. To rearrange the 6156 complete genome sequences according to the reference SARS-CoV-2 genome, multiple sequence alignment (MSA) is carried out using Clustal Omega [26] with default parameters. SNP genotyping measures the genetic variations between different members of a species.
Establishing the SNP genotyping method for the investigation of the genotype changes during the transmission and evolution of SARS-CoV-2 is of great importance. By analyzing the rearranged genome sequences, SNP profiles, which record all of the SNPs in terms of the nucleotide changes and their corresponding positions, can be constructed. The SNP profile of a given genome from a COVID-19 patient captures all the differences from a complete reference genome sequence and can be considered as the genotype of the individual SARS-CoV-2.

The Jaccard distance measures the dissimilarity between sample sets. The Jaccard distance of SNP variants is widely employed in the phylogenetic analysis of human or bacterial genomes [9] . In this work, we utilize the Jaccard distance to compare the SNP variant profiles of SARS-CoV-2 genomes. The Jaccard similarity coefficient, also known as the Jaccard index, is defined as the size of the intersection divided by the size of the union of two sets A, B [27] :

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

The Jaccard distance of two sets A, B is scored as the difference between one and the Jaccard similarity coefficient and is a metric on the collection of all finite sets:

$$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}.$$

Therefore, the genetic distance of two genomes corresponds to the Jaccard distance of their SNP variants. In principle, the Jaccard distance measure of SNP variants takes account of the ordering of SNP positions, i.e., the transmission trajectory, when an appropriate reference sample is selected. However, one may fail to identify the infection pathways from the mutual Jaccard distances of multiple samples. In this case, the dates of the sample collections offer useful information. Additionally, clustering techniques, such as K-means described below, enable us to characterize the spread of COVID-19 in communities.
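The construction of SNP profiles and their Jaccard distances can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the original pipeline: the function names are hypothetical, the sequences are short toy strings rather than real genomes, the sample is assumed to be already aligned to the reference (with `-` marking gaps), only single-nucleotide substitutions are recorded, and two empty profiles are given distance 0 by convention.

```python
def snp_profile(reference, aligned_sample):
    """Set of (position, ref_base, sample_base) substitutions.

    Positions are 1-based coordinates on the reference; alignment columns
    where the reference has a gap are skipped (assumption of this sketch).
    """
    profile = set()
    ref_pos = 0
    for ref_base, sample_base in zip(reference, aligned_sample):
        if ref_base == "-":
            continue  # insertion relative to the reference
        ref_pos += 1
        if (ref_base in "ACGT" and sample_base in "ACGT"
                and ref_base != sample_base):
            profile.add((ref_pos, ref_base, sample_base))
    return profile


def jaccard_distance(a, b):
    """d_J(A, B) = 1 - |A & B| / |A | B| for two SNP profiles."""
    union = a | b
    if not union:
        return 0.0  # convention: two empty profiles are identical
    return 1.0 - len(a & b) / len(union)


def distance_matrix(profiles):
    """Symmetric matrix of pairwise Jaccard distances."""
    return [[jaccard_distance(p, q) for q in profiles] for p in profiles]


# Toy example (not real SARS-CoV-2 data).
reference = "ACGTACGT"
samples = ["ACATACGT", "ACATACTT", "ACGTACGT"]
profiles = [snp_profile(reference, s) for s in samples]
for row in distance_matrix(profiles):
    print(row)
```

In the toy data the first two samples share one of two distinct SNPs, so their distance is 0.5, while the third sample is identical to the reference and is at distance 1 from both.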
K-means clustering is one of the fundamental unsupervised algorithms in machine learning. It aims at partitioning a given data set $X = \{x_1, x_2, \cdots, x_n, \cdots, x_N\}$, $x_n \in \mathbb{R}^d$, into $k$ clusters $\{C_1, C_2, \cdots, C_k\}$, $k \le N$, such that specific clustering criteria are optimized. More specifically, the standard K-means clustering algorithm starts by picking $k$ points as cluster centers at random and then allocates each data point to its nearest cluster. The cluster centers are updated iteratively by minimizing the within-cluster sum of squares (WCSS), which is defined by

$$\mathrm{WCSS} = \sum_{k} \sum_{x_i \in C_k} \| x_i - \mu_k \|_2^2,$$

where the outer sum runs over all clusters, $\mu_k = \sum_{x_i \in C_k} x_i / n_k$ is the mean of the points located in the $k$-th cluster $C_k$, and $n_k$ is the number of points in $C_k$. Here, $\| \cdot \|_2$ denotes the $L_2$ distance.

The algorithm above only provides a way to obtain the optimal partition for a fixed number of clusters. However, we are interested in finding the best number of clusters for the SNP variants. Therefore, the Elbow method is applied. By varying the number of clusters $k$, a set of WCSS values can be calculated in the K-means clustering process, and the WCSS can then be plotted against the number of clusters $k$. The location of the elbow in this plot is taken as the optimal number of clusters. Note that the WCSS measures the variability of the points within each cluster and is influenced by the number of points $N$. Therefore, as the total number of points $N$ increases, the value of the WCSS becomes larger. Additionally, the performance of K-means clustering depends on the selection of the specific distance.

In this work, we propose to implement K-means clustering with the Elbow method for analyzing the optimal number of subtypes of SARS-CoV-2 SNP variants. The Jaccard distance-based and location-based representations are considered as the input features for the K-means clustering method. Suppose we have a total of N SNP variants concerning a reference genome in a SARS-CoV-2 sample.
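The K-means iteration and the Elbow method described above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the input is a small set of 2D points rather than SNP feature vectors, the random choice of initial centers described in the text is replaced by a deterministic farthest-point initialization so that the example is reproducible, and the elbow is picked by a simple largest-change-in-slope heuristic instead of being read from a plot.

```python
def squared_distance(p, q):
    """Squared L2 distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic start (assumption): repeatedly add the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centers)))
    return centers


def kmeans_wcss(points, k, max_iter=100):
    """Run Lloyd's iteration and return (centers, WCSS)."""
    centers = farthest_point_init(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[j].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    wcss = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, wcss


def elbow_curve(points, k_max):
    """WCSS for k = 1, ..., k_max."""
    return {k: kmeans_wcss(points, k)[1] for k in range(1, k_max + 1)}


def pick_elbow(wcss_by_k):
    """Heuristic (assumption): k with the largest drop in WCSS improvement."""
    ks = sorted(wcss_by_k)
    return max(ks[1:-1], key=lambda k: (wcss_by_k[k - 1] - wcss_by_k[k])
               - (wcss_by_k[k] - wcss_by_k[k + 1]))


# Toy data: three well-separated groups of four points each.
toy_points = [(x + dx, y + dy)
              for x, y in [(0, 0), (10, 0), (0, 10)]
              for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]
curve = elbow_curve(toy_points, 4)
print(curve)
print("elbow at k =", pick_elbow(curve))
```

On the toy data the WCSS falls steeply up to k = 3 and barely changes afterwards, so the heuristic selects three clusters. For real SNP data the rows of the Jaccard distance-based or location-based representation would take the place of the 2D points.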
The locations of the mutation sites for each SNP variant are saved in the set $S_i$, $i = 1, 2, \cdots, N$. The Jaccard distance between two different sets (or samples) $S_i$, $S_j$ is denoted as $d_J(S_i, S_j)$. Therefore, the $N \times N$ Jaccard distance-based representation will be:

$$\begin{pmatrix} d_J(S_1, S_1) & d_J(S_1, S_2) & \cdots & d_J(S_1, S_N) \\ d_J(S_2, S_1) & d_J(S_2, S_2) & \cdots & d_J(S_2, S_N) \\ \vdots & \vdots & \ddots & \vdots \\ d_J(S_N, S_1) & d_J(S_N, S_2) & \cdots & d_J(S_N, S_N) \end{pmatrix}.$$

Suppose we have N SNP variants with respect to a reference genome in a SARS-CoV-2 sample. Among them, M different mutation sites can be counted. For the $i$-th SNP variant, $V_i = [v_i^1, v_i^2, \cdots, v_i^M]$, $i = 1, 2, \cdots, N$, is a $1 \times M$ vector which satisfies:

$$v_i^j = \begin{cases} 1, & \text{if the } j\text{-th mutation site belongs to } S_i, \\ 0, & \text{otherwise.} \end{cases}$$

Therefore, an $N \times M$ location-based representation will be:

$$\begin{pmatrix} v_1^1 & v_1^2 & \cdots & v_1^M \\ v_2^1 & v_2^2 & \cdots & v_2^M \\ \vdots & \vdots & \ddots & \vdots \\ v_N^1 & v_N^2 & \cdots & v_N^M \end{pmatrix}.$$

Hundreds of complete genome sequences are deposited in GISAID every day, which results in an ever-growing quantity of high-dimensional data representations for the K-means clustering. For example, if the dataset of an organism involves 10,000 SNPs, the initial representation will be a 10,000-dimensional vector for each sample, which can be computationally difficult for a simple K-means clustering algorithm. Therefore, a dimensionality reduction method is used to pre-process the data. The essential idea of PCA-based K-means clustering is to invoke principal component analysis (PCA) to obtain a reduced-dimensional representation of each sample before performing the K-means clustering. In practice, one can select the first few principal components as the K-means input for each sample. In Ref. [28] , the authors proved that the principal components are the continuous solution of the cluster indicators in the K-means clustering method, which provides us with a rigorous mathematical tool to embed our high-dimensional data into a low-dimensional PCA subspace.

The rapid global transmission of coronavirus disease 2019 (COVID-19) has offered some of the most heterogeneous, diverse, and challenging mutagenic environments to stimulate dramatic genetic evolution and response from severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
This work provides the most comprehensive genotyping of SARS-CoV-2 transmission and evolution to date, based on 6156 genome samples, and reveals five clusters of the COVID-19 genomes and associated mutations on eight different SARS-CoV-2 proteins. We introduce the mutation $h$-index and the mutation ratio to quantify the degree of non-conservativeness of each protein. We unveil that the SARS-CoV-2 envelope protein, main protease, and endoribonuclease protein are the most conservative, whereas the SARS-CoV-2 nucleocapsid protein, spike protein, and papain-like protease are the most non-conservative. We report the alarming fact that all of the SARS-CoV-2 proteins have undergone intensive mutations since January 5, 2020, and that some of these mutations may seriously undermine ongoing efforts on COVID-19 diagnostic testing, vaccine development, and drug discovery.

The nucleotide sequences of the SARS-CoV-2 genomes used in this analysis are available, upon free registration, from the GISAID database (https://www.gisaid.org/).

Eighteen tables are provided in the Supporting Material for SNP variants of 6156 SARS-CoV-2 samples across the world, SNP variants of 1625 SARS-CoV-2 samples in the US, SNP variants in five global clusters, SNP variants in three US clusters, and mutation records for eight SARS-CoV-2 proteins. The acknowledgments of the SARS-CoV-2 genomes are also given in the Supporting Material.

The species severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2
WHO. Coronavirus disease 2019 (COVID-19) situation report 93. Coronavirus Disease (COVID-2019) Situation Reports
A new coronavirus associated with human respiratory disease in China
The SARS-CoV S glycoprotein: expression and functional characterization
The coronavirus nucleocapsid is a multifunctional protein
Severe acute respiratory syndrome coronavirus envelope protein regulates cell stress response and apoptosis
The population genetics and evolutionary epidemiology of RNA viruses
Origin and evolution of the 2019 novel coronavirus
Genotyping coronavirus SARS-CoV-2: methods and implications
Models of RNA virus evolution and their roles in vaccine design
Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR
Diagnosing COVID-19: The disease and tools for detection
Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein
Furin cleavage of the SARS coronavirus spike glycoprotein enhances cell-cell fusion but does not affect virion entry
A strategy for searching antigenic regions in the SARS-CoV spike protein
Coronavirus infections: Epidemiological, clinical and immunological features and hypotheses
Structure of Mpro from COVID-19 virus and discovery of its inhibitors. bioRxiv
SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor
Receptor recognition by the novel coronavirus from Wuhan: an analysis based on decade-long structural studies of SARS coronavirus
The crystal structure of papain-like protease of SARS-CoV-2
Structure of the RNA-dependent RNA polymerase from COVID-19 virus
Structural model of the SARS coronavirus E channel in LMPG micelles
Crystal structure of RNA binding domain of nucleocapsid phosphoprotein from SARS coronavirus 2. Center for Structural Genomics of Infectious Diseases (CSGID)
GISAID: Global initiative on sharing all influenza data - from vision to reality
Clustal Omega. Current Protocols in Bioinformatics
Distance between sets
K-means clustering via principal component analysis

This work was supported in part by NIH grant GM126189, NSF Grants DMS-1721024, DMS-1761320, and IIS-1900473, Michigan Economic Development Corporation, Bristol-Myers Squibb, and Pfizer. The authors thank the IBM TJ Watson Research Center, the COVID-19 High Performance Computing Consortium, and NVIDIA for computational assistance.