key: cord-0841620-iszr6nrb
authors: Li, Yawei; Liu, Qingyun; Zeng, Zexian; Luo, Yuan
title: Using an Unsupervised Clustering Model to Detect the Early Spread of SARS-CoV-2 Worldwide
date: 2022-04-07
journal: Genes (Basel)
DOI: 10.3390/genes13040648
sha: 8754804dddff197c84e6770d3f98a80f0d5d0a8f
doc_id: 841620
cord_uid: iszr6nrb

Deciphering the population structure of SARS-CoV-2 is critical to inform public health management and reduce the risk of future dissemination. With the continuous accruing of SARS-CoV-2 genomes worldwide, discovering an effective way to group these genomes is critical for organizing the landscape of the population structure of the virus. Taking advantage of recently published state-of-the-art machine learning algorithms, we used an unsupervised deep learning clustering algorithm to group a total of 16,873 SARS-CoV-2 genomes. Using single nucleotide polymorphisms as input features, we identified six major subtypes of SARS-CoV-2. The proportions of the clusters across the continents revealed distinct geographical distributions. Comprehensive analysis indicated that both genetic factors and human migration factors shaped the specific geographical distribution of the population structure. This study provides a different approach using clustering methods to study the population structure of a never-seen-before and fast-growing species such as SARS-CoV-2. Moreover, clustering techniques can be used for further studies of local population structures of the proliferating virus.

The COVID-19 pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1, 2] , has spread throughout the world since 2019. To understand the molecular characteristics of the virus, a large collaborative project known as the Global Initiative on Sharing All Influenza Data (GISAID) was developed to collect and publish the sequenced genomes of SARS-CoV-2 worldwide. The database has helped researchers in investigating comprehensive patterns of SARS-CoV-2 with high efficiency.

As a newly emerging pandemic virus, exploring the genetic diversity, evolutionary trajectory and possible routes of transmission of SARS-CoV-2 from its natural reservoir to humans is crucial for supporting clinical studies. For that reason, studies have implemented phylogenetic trees to investigate the evolutionary trajectory of SARS-CoV-2 and its strain diversification [3] [4] [5] . A phylogenetic tree is a graph showing the dissimilarities and distances between various biological entities based on their pairwise distance metrics [6, 7] . Nevertheless, given the continuous accumulation of entities, the human mind is limited in effectively handling and fully utilizing the accumulation of such enormous amounts of entities through the phylogenetic tree. As a result, grouping similar entities into the same cluster and then extracting the major characteristics of each cluster greatly facilitated the efficiency of our analysis. Typically, entities can be handily grouped in terms of the distance matrix and the bifurcations between branches of leaves on the phylogenetic tree. However, when the entity size is enormous, manually partitioning the clades of the phylogenetic tree becomes much more challenging.

With the development of machine learning algorithms, machine learning-based approaches have played a critical role in phylogenetic analysis. Compared with other methods, machine learning-based methods can find a better balance between accuracy and running time [8] . To our knowledge, machine learning methods have been widely implemented in various aspects of phylogenetic analyses, including phylogenetic tree construction [8] [9] [10] [11] [12] , ancestral relationship identification [13] , evolutionary rate estimation [14, 15] , gene evolutionary mechanisms research [16] , and population structure analysis [17] . The unsupervised clustering models, a subgroup of machine learning models, have acted as more productive and robust solutions to effectively and accurately cluster these entities. The objective of clustering is to simultaneously minimize intra-cluster distances and maximize inter-cluster distances [18] . Although finding the globally optimal solution is near-impossible, a valuable and effective clustering method allows us to easily extract the inner differences between clusters within the dataset and inform downstream analyses.

In this study, to elucidate the genetic structure and evolution of early SARS-CoV-2, we downloaded 16,873 genomes from the published database, and used a clustering model based on the recently published state-of-the-art deep embedding clustering (DEC) method [19] to cluster these genomes into six major clusters. Interestingly, the proportions of clusters across the six continents (Africa, Asia, Europe, North America, Oceania, and South America) displayed distinct geographical distributions. Our comprehensive analysis revealed that the unique geographical distributions across the clusters are influenced by both intrinsic genetic factors and the migration of humans. This study provides an example of using machine learning algorithms to assist phylogenetic analysis, supporting the exploration of the spread of SARS-CoV-2 across humans worldwide.

The genomes of SARS-CoV-2 uploaded before 15 May 2020 were downloaded from GISAID (https://db.cngb.org/gisaid/, accessed on 15 May 2020). Only the strains marked as "high coverage" were maintained in our study. In terms of GISAID, the definition of "high coverage" was strains with <1% Ns and <0.05% unique amino acid mutations (not seen in other sequences in databases), and no insertion/deletion unless verified by the submitter. Moreover, non-human host strains and assemblies of total genome length less than 29,000 bps strains were removed. A total of 16,873 strains passed the filtration. The Accession IDs with demographic information for the strains are listed in Supplementary Table S1 . The distribution of the number of collected strains per day is shown in Supplementary Figure S1.

We followed the Nextstrain pipeline [20] to map the downloaded SARS-CoV-2 strains to the reference genome (GenBank Accession Number: NC_045512.2). We used CLUSTALW 2.1 [21] for multiple and pairwise sequencing alignments. Considering putatively artefactual substitutions and gaps located at the beginning and end of the sequencing, we masked the first 130 bps and last 50 bps in the mutation calling step based on the Nextstrain pipeline [20] .

To calculate the mutation rate of the strains, we defined the collection date of the reference genome as the reference date. The mutation rate (substitutions per year) of a given strain can be expressed as: mutation rate = mutation counts (nucleotide substitations) × 365 days from the collection date of the given strain to the collection date of the reference strain The average mutation rate was the mean mutation rate of all strains within a population.

A phylogenetic tree was generated using all 16,873 strains. We employed FastTree 2 [22] to reconstruct the phylogenetic tree of SARS-CoV-2 using the nucleotide substitutions of these strains. The phylogeny tree was rooted using FigTree v1.4.4 and visualized using the online tool Interactive Tree Of Life (iTOL v5) [23] .

We achieved a total of 7970 unique substitutions across 16,873 strains through mutation calling. Herein, we used the aggregated substitution profiles to train our unsupervised clustering model. The input of the clustering model was a matrix with dimensions 16,873 × 7970, where each row represented a strain, and each column represented a unique substitution. If a substitution "s" was detected in the strain "i", the cell with the coordinate (i, s) was marked as "1", whereas no mutation was marked as "0".

We implemented a clustering model based on the recently published DEC model [19] to iteratively cluster the strains. We first used K-means clustering to initialize the centroids of clusters. To update the cluster assignments, we used Student's t-distribution as a kernel to measure the distance (q ij ) of each strain (h i ) to each centroid (u j ):

Based on the equation, the distance q ij can be interpreted as the probability of assigning strain i to cluster j. The α was the degree of freedom of the distribution. For convenience, we assumed that α = 1 in this study. Next, we defined an auxiliary target distribution P by raising each q ij to the second power to enhance strains assigned with high confidence:

where the denominator was used to normalize the loss contribution of each centroid to prevent large clusters from distorting the feature space. Finally, we defined the Kullback-Leibler (KL) divergence loss as the objective function:

where the parameters and cluster centroids were jointly optimized by minimizing L using the Stochastic Gradient Descent (SGD) with momentum. Notably, the DEC model needs to pre-specify the number of clusters. To this end, we tested different cluster numbers from 2 to 20 and plotted the sum of squared errors (SSE) and Bayesian information criterion (BIC) [24] curves. The number of clusters was determined using the elbow method [25] . Moreover, considering the possible overfitting, we tested the non-negative matrix factorization (NMF) method for dimensionality reduction [26] [27] [28] . However, the mean intra-cluster pairwise genetic distances indicated that directly using the entire substitution binary matrix as the input feature achieved a minimal mean pairwise genetic distance.

Pairwise genetic distance is a measure of the genetic divergence between two entities. Herein, we used Hamming distance (d(i, j)) to measure the pairwise genetic distance between two strains (i, j) based on their nucleotide substitutions. Thus, the pairwise genetic distance is the number of nucleotide positions in which the two strains are different. For a cluster (A), N(A) denotes the number of strains of the cluster, and the mean intra-cluster pairwise genetic distance (D(A) intra ) can be computed as:

Likewise, the mean inter-cluster pairwise genetic distance (D(A) inter ) is:

In this study, we used the mean overall intra-cluster pairwise genetic distance (D(overall) intra ) as the metric to evaluate the performance of a clustering model. For the six clusters, D(overall) intra can be expressed as:

where i k and j k denote the two strains in cluster k, respectively. In terms of the equation, a smaller D(overall) intra represents better clustering performance.

Simpson's Diversity Index (D sim ) is a measure of diversity that considers the number of entities as well as their abundance. The index measures the probability that two randomly selected individuals are the same. The formula to calculate the value of the index is:

where n is the number of individuals displaying one trait and N is the total number of all individuals. The value of D sim ranges between 0 and 1. With this index, 1 represents infinite diversity and 0 denotes no diversity.

To test which position was under selective pressure, we used a set of programs available in HyPhy to calculate nonsynonymous (dN) and synonymous (dS) substitution rates on a per-site basis to infer pervasive selection. Fast Unconstrained Bayesian AppRoximation (FUBAR) was applied to detect overall sites under positive selection. The positively selected sites were identified using a probability larger than 0.95 using the FUBAR method.

Pairwise mutation dependency scores can measure the order in which genetic mutations are acquired within a cluster. For two selected mutations X and Y, the score S(X|Y) represents the proportion of strains that accumulated X among the strains that accumulated mutation Y. S(X|Y) and S(Y|X) can be calculated using the following functions:

where S X = 1 denotes that the sequence has a mutation X. The pairwise mutation dependency score displays the correlation and the timescale relationship of the two mutations. The value of S(X|Y) and S(Y|X) ranges between 0 and 1. With this index, S(X|Y) = 1 with S(Y|X) < 1 represents that mutation Y occurs after mutation X. In contrast, S(X|Y) = 1 with S(Y|X) = 1 represents that the two mutations occur simultaneously and are genetically linked. Statistical analyses and data presentations were generated using Python 3.7.6.

For each country with SARS-CoV-2 data available, clustering proportions were calculated and plotted on the world map using the tool Tableau Desktop 2020.2. The Tdistributed Stochastic Neighbor Embedding (t-SNE) plot was generated using the aggregated substitution profiles for the 16,873 strains. We used the sklearn.manifold.TSNE package in Python for t-SNE training and the seaborn package in Python for visualization. The other figures were plotted using the ggplot2 in R 4.0.1, the seaborn package in Python 3.7.6 and GraphPad Prism 8.0.2. Statistical analyses, and the clustering models were implemented in Python 3.7.6.

We downloaded the whole genome sequencing of SARS-CoV-2 from GISAID, aligned the sequences, and called mutations. Across the 16,873 strains (Africa: 98, Asia: 1324, Europe: 9527, North America: 4765, Oceania: 1040 and South America: 119), we identified 7970 substitutions (4908 non-synonymous substitutions, 2748 synonymous substitutions, and 314 intronic substitutions), with 6.99 mutation counts per strain (Supplementary Figure S2 ). In terms of the frequency spectrum of the mutations, we found that more than 99% of mutations had frequencies smaller than 1% (Supplementary Figure S3 ). Of these mutations, 54.05% were singletons (can be observed in one strain) and 15.35% were doubletons (can be observed in two strains). The high proportions of these low-frequency mutations suggest a rapid proliferating pattern of SARS-CoV-2 [29] . Of the downloaded 16,873 strains, we identified a total of 8706 unique strains, 7078 of which were singletons (Supplementary Figure S4 ). The Simpson's diversity index was 0.8222, indicating that two random strains would have a high probability of being genetically different. In summary, the distributions of these mutations and strains demonstrated the high genetic diversity of SARS-CoV-2.

There were 8706 unique strains within the entire SARS-CoV-2 population; thus, grouping these strains into multiple major subtypes was necessary to informatively formulate their population structure. Making use of the machine learning (ML) algorithms, we applied unsupervised ML models to cluster SARS-CoV-2 strains using their substitution profiles, as SARS-CoV-2 exhibited limited substitutions per strain as well as little ongoing horizontal gene exchange [4] .

Our initial work [30] used an unsupervised clustering model based on DEC [19] (see Section 2), to group the strains into six major clusters. To test the accuracy and biological interpretation of the clustering results, we mapped the six clusters against the phylogenetic tree. Each cluster of the SARS-CoV-2 strains were compact in the phylogenetic tree ( Figure 1A ). In addition, the pairwise genetic distances between intra-clustering and inter-cluster and t-SNE plot demonstrated that these strains were adequately isolated between clusters ( Figure 1B ,C and Supplementary Figure S5 ). Mean intra-cluster and inter-cluster pairwise genetic distances across the six clusters. The blue bars represent mean pairwise genetic distances between pairs of isolates within the corresponding clusters, and the red bars represent mean pairwise genetic distances between pairs of isolates outside the corresponding clusters. The error bars display the standard deviations. For all six clusters, the mean intra-cluster pairwise genetic distances are significantly lower than the corresponding mean inter-cluster pairwise genetic distances (p-value < 0.001, Wilcoxon rank-sum test). (C) The t-SNE plot displays the clustering of the strains.

We investigated the distributions of clustering results across continents and found that the clusters diverged in their geographical distributions ( Figure 2 , Table 1 ). For example, European strains were the dominant strains in cluster A (81.92%), cluster C (71.97%), and cluster F (85.73%); North American strains were the major strains in cluster D (66.39%) and cluster E (61.88%); and Asian and European strains occupied 76.47% in cluster B. We investigated the distributions of clustering results across continents and found that the clusters diverged in their geographical distributions ( Figure 2 , Table 1 ). For example, European strains were the dominant strains in cluster A (81.92%), cluster C (71.97%), and cluster F (85.73%); North American strains were the major strains in cluster D (66.39%) and cluster E (61.88%); and Asian and European strains occupied 76.47% in cluster B. It is noted that approximately 85% of the strains were collected from North America and Europe (Supplementary Figure S6) . Thus, directly comparing the proportion of continents in each cluster was not informative in terms of sampling bias. We next considered the distribution of clusters in each continent. Most continents were concentrated in only one or two clusters, including Africa (cluster C: 66%), Asia (cluster B: 49%), Europe (cluster C and cluster F: 64%), North America (cluster D plus cluster E: 74%), and South America (cluster C plus cluster F: 78%). In contrast, the distribution of the strains in Oceania was uniformly dispersed across the six clusters, suggesting that the diversity of SARS-CoV-2 was higher in Oceania than in the other continents. Note that, of all clusters, cluster C dispersed globally, and was the only cluster with frequencies exceeding 10% across all six continents, suggesting that cluster C might be more disseminated than the other clusters.

The different geographical distributions for the six clusters could be due to intrinsic genetic factors, extrinsic factors such as the migration of humans, or both. Hence, we next aimed to explore the genomic characteristics of these clusters, as well as the transmission and human migration of the virus across the globe.

If the different geographical distributions for the six clusters were due to intrinsic genetic factors, there would be high genetic variance between the clusters. The average mutation counts for the six clusters were 6.38, 3.49, 6.57, 7.09, 7.89, and 8.96 (Supplementary Figure S7 ), respectively. Considering the different collection dates ( Figure 3A ) of the strains, mutation rates as opposed to mutation counts were more effective for describing the genetic variations between clusters. The average mutation rates for the six clusters were 25.55, 15.91, 25.44, 31.64, 30.99, and 34.12 substitutions per year, respectively. Specifically, the average mutation rate in cluster B was significantly lower (p-value < 0.001, Wilcoxon rank-sum test) than those in other clusters. In contrast, the average mutation rate It is noted that approximately 85% of the strains were collected from North America and Europe (Supplementary Figure S6) . Thus, directly comparing the proportion of continents in each cluster was not informative in terms of sampling bias. We next considered the distribution of clusters in each continent. Most continents were concentrated in only one or two clusters, including Africa (cluster C: 66%), Asia (cluster B: 49%), Europe (cluster C and cluster F: 64%), North America (cluster D plus cluster E: 74%), and South America (cluster C plus cluster F: 78%). In contrast, the distribution of the strains in Oceania was uniformly dispersed across the six clusters, suggesting that the diversity of SARS-CoV-2 was higher in Oceania than in the other continents. Note that, of all clusters, cluster C dispersed globally, and was the only cluster with frequencies exceeding 10% across all six continents, suggesting that cluster C might be more disseminated than the other clusters.

The different geographical distributions for the six clusters could be due to intrinsic genetic factors, extrinsic factors such as the migration of humans, or both. Hence, we next aimed to explore the genomic characteristics of these clusters, as well as the transmission and human migration of the virus across the globe.

If the different geographical distributions for the six clusters were due to intrinsic genetic factors, there would be high genetic variance between the clusters. The average mutation counts for the six clusters were 6.38, 3.49, 6.57, 7.09, 7.89, and 8.96 (Supplementary Figure S7 ), respectively. Considering the different collection dates ( Figure 3A ) of the strains, mutation rates as opposed to mutation counts were more effective for describing the genetic variations between clusters. The average mutation rates for the six clusters were 25.55, 15.91, 25.44, 31.64, 30.99, and 34.12 substitutions per year, respectively. Specifically, the average mutation rate in cluster B was significantly lower (p-value < 0.001, Wilcoxon ranksum test) than those in other clusters. In contrast, the average mutation rate in cluster F was significantly higher (p-value < 0.001, Wilcoxon rank-sum test) than those in other clusters. The Simpson's diversity indexes for the six clusters were 0.7616, 0.7608, 0.8398, 0.8466, 0.8082 and 0.8502, respectively. Both the average mutation rate and Simpson's index were highest in cluster F, suggesting that the diversity of cluster F was higher than the other clusters. The nucleotide diversity per site for the six clusters was 0.0196%, 0.0222%, 0.0171%, 0.0256%, 0.0131%, and 0.0132%. The high mutation rates but low nucleotide diversity in cluster E and cluster F suggest that these two clusters may have more fixed mutations than the other clusters. The nucleotide diversity values for each gene across all clusters are displayed in Figure 3B -G. Except for some short genes that were unlikely to be informative, the diversity of most genes was close to the diversity of their genome-wide variants. Our analysis showed that intra-cluster genetic diversity differed between clusters, suggesting that selective pressures might be different between clusters. These different selective pressures will affect the geographical distribution of each cluster. in cluster F was significantly higher (p-value < 0.001, Wilcoxon rank-sum test) than those in other clusters. The Simpson's diversity indexes for the six clusters were 0.7616, 0.7608, 0.8398, 0.8466, 0.8082 and 0.8502, respectively. Both the average mutation rate and Simpson's index were highest in cluster F, suggesting that the diversity of cluster F was higher than the other clusters. The nucleotide diversity per site for the six clusters was 0.0196%, 0.0222%, 0.0171%, 0.0256%, 0.0131%, and 0.0132%. The high mutation rates but low nucleotide diversity in cluster E and cluster F suggest that these two clusters may have more fixed mutations than the other clusters. The nucleotide diversity values for each gene across all clusters are displayed in Figure 3B -G. Except for some short genes that were unlikely to be informative, the diversity of most genes was close to the diversity of their genome-wide variants. Our analysis showed that intra-cluster genetic diversity differed between clusters, suggesting that selective pressures might be different between clusters. These different selective pressures will affect the geographical distribution of each cluster. 

The high genetic diversity between clusters indicated that the frequencies of the mutations across clusters were very different. In order to explore whether there are mutations that affect the genetic structure within the clusters, we applied ANOVA to identify the statistically significant mutations that were strongly associated with clusters. Across the 7970 substitutions, 26.27% (2094 substitutions) of them achieved p-values < 0.05 (Supplementary Table S2 ). We found that some of these mutations were fixed or nearly fixed (frequency > 95%) in one or several clusters ( Figure 4A and Supplementary Table S2 ). Cluster C, cluster E and cluster F shared four common fixed substitutions: A23403G, C241T, C3037T and C14408T. Cluster E had two additional fixed substitutions: C1059T and G25563T. Cluster F had three additional substitutions from position 28,881 to position 28,883. For the remaining three clusters, there were two fixed substitutions (C8782T, T28144C) in cluster D and three fixed substitutions (G11083T, G14805T, and G26144T) in cluster A. It is noteworthy that the fixed mutation numbers in cluster E (six) and cluster F (seven) were higher than in any of the other clusters, which was consistent with our conclusion of the high mutation rates but low nucleotide diversity in cluster E and cluster F. 

The high genetic diversity between clusters indicated that the frequencies of the mutations across clusters were very different. In order to explore whether there are mutations that affect the genetic structure within the clusters, we applied ANOVA to identify the statistically significant mutations that were strongly associated with clusters. Across the 7970 substitutions, 26.27% (2094 substitutions) of them achieved p-values < 0.05 (Supplementary Table S2 ). We found that some of these mutations were fixed or nearly fixed (frequency > 95%) in one or several clusters ( Figure 4A and Supplementary Table S2 ). Cluster C, cluster E and cluster F shared four common fixed substitutions: A23403G, C241T, C3037T and C14408T. Cluster E had two additional fixed substitutions: C1059T and G25563T. Cluster F had three additional substitutions from position 28,881 to position 28,883. For the remaining three clusters, there were two fixed substitutions (C8782T, T28144C) in cluster D and three fixed substitutions (G11083T, G14805T, and G26144T) in cluster A. It is noteworthy that the fixed mutation numbers in cluster E (six) and cluster F (seven) were higher than in any of the other clusters, which was consistent with our conclusion of the high mutation rates but low nucleotide diversity in cluster E and cluster F. We selected the 2% (42/2094) substitutions that achieved the lowest p-values (Table  2 ) and analyzed their distributions in the clusters. Of the 42 substitutions, there were 26 nonsynonymous mutations (mutation G28882A was in a trinucleotide mutation from position 28,881 to 28,883, spanned two codons and resulted in an RG (arginine-glycine) to KR (lysine-arginine) amino acid change). We focused on these nonsynonymous mutations, as these mutations may comprise those that affect the population structure [31] . Indeed, We selected the 2% (42/2094) substitutions that achieved the lowest p-values (Table 2) and analyzed their distributions in the clusters. Of the 42 substitutions, there were 26 nonsynonymous mutations (mutation G28882A was in a trinucleotide mutation from position 28,881 to 28,883, spanned two codons and resulted in an RG (arginine-glycine) to KR (lysine-arginine) amino acid change). We focused on these nonsynonymous mutations, as these mutations may comprise those that affect the population structure [31] . Indeed, many of these substitutions have been reported to impact the evolution of SARS-CoV-2 [32, 33] . Previous studies have reported that recombination is common in coronavirus [3, 32, 33] . Given that recombinations in SARS-CoV-2 may perturb the clustering, we used Haploview [34] to analyze the linkage disequilibrium (LD) by calculating standardized disequilibrium coefficients (D ) and squared allele-frequency correlations (r 2 ) of the 42 substitutions. D is affected solely by recombination and not by differences in allele frequencies between sites, and r 2 is also affected by differences in allele frequencies at the two sites. In the heatmap of D' and r 2 (Supplementary Figure S8) , no obvious LD blocks were discovered, indicating that our clustering of SARS-CoV-2 strains using substitutions was not distorted by recombination.

Selection usually affects the distribution of the mutations in the population. Purifying selection tends to remove amino acid-altering mutations, while positive selection tends to increase the frequency of the mutations. Considering the rapidly proliferating pattern of SARS-CoV-2 that strengthened the power of drift relative to the power of purifying selection [35] [36] [37] , we mainly focused on the positive selective sites. We applied HyPhy [38] to infer the probabilities of the extracted 26 nonsynonymous mutations under positive selection. There were nine mutations (asterisks in Table 2 ) with a positive probability >0.95.

In particular, mutations G2891A, G11083T, C14408T, C17747T, and A23403G (D614G) were reported as recurrent mutations [32, 39] . The recurrence of these mutations agrees with the assumption that they may confer selective advantages in the population. These possible positively selected cluster-specific mutations may result in higher genetic distances between clusters of SARS-CoV-2 across geographical regions.

Regardless of genetic factors, the travel of humans can also lead to unique geographical distributions in today's highly globalized world. By analyzing the frequencies of the extracted 42 mutations in each cluster ( Figure 4A ) and their collected daily counts ( Figure 4B ), we can trace the dynamics of the substitutions in the SARS-CoV-2 genome. The four genetically linked mutations-A23403G (D614G), C241T, C3037T, and C14408T-that were fixed across three clusters (C, E and F) were the highest frequency mutations in the world, with a high frequency on all continents in our downloaded sequences, including South America (87%), Africa (86%), Europe (75%), North America (65%), Oceania (55%), and Asia (32%). The earliest time when sequences carrying these mutations was collected was in late January 2020. About a month later, these mutations were discovered worldwide. Though the mutation A23403G (D614G) has been reported and estimated to be a positive selective mutation, it is almost impossible to spread across the world without human migration in such a short time. Besides these high-frequency mutations, some lower-frequency mutations also provided some evidence of human migration. We explored the geographical distributions of mutations with global frequencies < 0.05 in Table 2 . Though most of these low frequency mutations were mainly collected within a single continent, we still found two mutations-T28688C and G1397A-that were discovered in Asia, Europe, and Oceania with high proportion. In addition, the spatial geographical distributions of some substitutions also provide evidence that human migration may have influenced the spread of the virus. For example, on the west coast of the USA, most strains accumulated the mutations C8782T and T28144C (cluster D), and these mutations were also found in high frequencies in East Asia. In contrast, on the east coast of the USA, most strains accumulated the mutations A23403G, C241T, C3037T, C14408T, C1059T, and G25563T (cluster E), and similar strains were mainly discovered in Europe (Supplementary Figure S9) . Note that, with the rapid growth of SARS-CoV-2, government policies were implemented, including travel bans, to restrict the spread of the virus. As a result, some mutations may have experienced weaker selective pressure by human intervention, which also affected the geographical distributions of SARS-CoV-2.

Deciphering the genetic structure and subtypes of SARS-CoV-2 is critical for reducing the risk of future dissemination and assessing efficacy of antibodies on different strains [40] . In this study, we employed a clustering model along with phylogenetic analysis to illuminate the subtypes of 16,873 SARS-CoV-2 strains. Using the substitution profiles of the strains as input features, we identified six major clusters of the strains. To our surprise, we found specific geographical distributions of the clusters, most of which, except Oceania, were mainly composed of one to two clusters. We used complementary approaches to evaluate whether the geographical distributions for the clusters were due to genetic factors or human travel. The varied intra-cluster genetic diversity across the clusters suggested different selective pressures between clusters, which would affect the geographical distribution of the clusters. By analyzing the statistically significant mutations that were strongly associated with the clusters, we identified that some mutations might be under positive selection, indicating different geographical distributions between the clusters were partially affected by these mutations. In addition, the dynamics and the spatial geographical distributions of some substitutions suggested that human migration may also have affected the different geographical distributions. In general, our findings indicate that the geographical distributions of the clusters are the result of both genetic factors and the migration of humans.

One limitation of our study is the sampling bias of SARS-CoV-2 collections. Indeed, approximately 85% of the strains were from North America and Europe (Supplementary Figure S4) . In particular, the proportion of strains from the United Kingdom and the USA was over 60%. As a comparison, the summation of proportions from the two continents of Africa and South America was smaller than 2% (Supplementary Figure S4) . The sampling bias may lead to the underestimation of certain mutations that are frequent in continents with fewer collected strains. For example, the frequency of mutation C15324T reached 41.84% in Africa, but was only 2.21% outside Africa. The frequency of mutation T29148C reached 15.13% in South America, but was only 0.12% outside South America. Another mutation T27299C with a frequency of 10.92% in South America was only found with a frequency of 0.08% in other regions. In fact, all three mutations were mostly grouped in single clusters, indicating these mutations were highly concentrated. However, due to the small proportion of the strains from these two continents, these mutations were unable to affect the clustering of samples. To address this issue, more strains were needed to be collected from these continents. In addition, we found that in cluster B, there were no fixed mutations. We calculated the pairwise dependency scores (see Section 2) of all the mutations with frequencies > 0.05 in cluster B and discovered five main subclusters (Supplementary Figure S10) . Other than the mutation G11083T that was discovered in two subclusters, there were no common mutations between either of the five clusters. As shown in Figure 3A , strains in cluster B were grouped in one cluster mainly because these strains had smaller mutation counts than strains in other clusters. The genetic distance between two strains was still small, though they shared no common mutations. To address this issue, another clustering assessment can be used for further analyses. Moreover, considering that our clustering model assumed that every nucleotide had the same rate of changing into other nucleotides (JC69 [41] ), this model is more suitable for fast-growing species within a short period. For long-term population clustering, the input features need to include a more complicated substitution model such as TN93 [42] , GTR [43] , and more.

Despite these limitations, our discovery of high genetic diversity in early SARS-CoV-2 is consistent with other published studies [44] . In addition, mapping the clustering results to the phylogenetic tree suggested that the emergence and the spread of SARS-CoV-2 occurred in short intervals, similar to previous studies [2, [45] [46] [47] . Our identification of the main substitutions that drove the clustering results has also been reported by many studies concerned with the evolution of SARS-CoV-2. For example, mutation A23403G (D614G, Asparticacid to Glycine) in the spike protein domains was reported to show significant variation in cytopathic effects and viral load, and substantially change the pathogenicity of SARS-CoV-2 [48] . This mutation is accompanied by a mutation (T14408C) that results in an RNA-dependent RNA polymerase (RdRp) amino acid change [49] . In addition, Tang et al. [50] used mutation T28144C to define "L" types (defined as "L" type because T28,144 is in the codon of Leucine) and "S" types (defined as "S" type because C28,144 is in the codon of Serine) of SARS-CoV-2. They found that the "L" type was more transmissible and aggressive than the "S" type.

In summary, our current study, as well as previously published unsupervised clusteringbased SARS-CoV-2 analyses [51, 52] , have showcased the advancement and efficiency of using clustering models in supporting phylogenetic analysis, especially when investigating the population structures of these never-seen-before species.

The SARS-CoV-2 virus has been spreading rapidly and across the globe since first being reported in December 2019. To understand the evolutionary trajectory of the coronavirus, phylogenetic analysis is needed to study the population structure of SARS-CoV-2. As sequencing data worldwide is accruing rapidly, grouping them into clusters helps to organize the landscape of population structures. To effectively group these data, we utilized clustering models to partition the viral sequences into clusters. The partition of the viral sequences revealed six major clusters, and these clusters differed in their geographical distributions. Through multiple approaches, we found that the unique geographical distributions across the clusters were influenced by both intrinsic genetic factors and the migration of humans. Identifying the mutations that are strongly associated with specific clusters has potential implications for the diagnosis and pathogenesis of COVID-19. In addition, our application of clustering techniques has proven to be a valuable method for studying the structure of new and unknown populations.

Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/genes13040648/s1, Figure S1 : The distribution of the number of collected SARS-CoV-2 strains per day. Figure S2 : The distribution of the mutation counts of the 16,873 SARS-CoV-2 strains. Figure S3 : Frequency spectra of SARS-CoV-2. Figure S4 : Normalized allele frequency of 16,873 SARS-CoV-2 strains. Figure S5 : The t-SNE plot by geographic groups. Figure S6 : The bar chart showing the proportions of collected genomes from the six continents in the entire SARS-CoV-2 population. Figure S7 : The distribution of the mutation counts of the strains for the six clusters. Figure S8 : The D and r 2 of the 42 mutations. Figure S9 : Geographic distribution of six clusters in the United States. Figure S10 : The pairwise dependency score (see Section 2) of the mutations with frequency > 0.05 within cluster B. Table S1 : Accession IDs with demographic information for the strains in this study. Funding: This study is supported in part by NIH grant 1R01LM013337 and 5UL1TR001422.

Informed Consent Statement: Not applicable.

The publicly available SARS-CoV-2 genomes in this study are available at GISAID (https://www.gisaid.org, accessed on 15 May 2020). The reference SARS-CoV-2 genome is available at the NCBI GenBank (GenBank Accession Number: NC_045512.2, https://www.ncbi. nlm.nih.gov/nuccore/NC_045512.2, accessed on 15 May 2020).

Coronaviridae Study Group of the International Committee on Taxonomy of Viruses. The species Severe acute respiratory syndrome-related coronavirus: Classifying 2019-nCoV and naming it SARS-CoV-2

A Novel Coronavirus from Patients with Pneumonia in China

Evolutionary Trajectory for the Emergence of Novel Coronavirus SARS-CoV-2. Pathogens

Phylogenetic network analysis of SARS-CoV-2 genomes

Variant analysis of SARS-CoV-2 genomes

Phylogenetic Tree Construction for DNA Sequences using Clustering Methods. Procedia Eng

Review of Clustering Methods: Toward Phylogenetic Tree Constructions

Harnessing machine learning to guide phylogenetic-tree search algorithms

Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices

Phylogenetic approaches to microbial community classification

Phylogenetic clustering increases with elevation for microbes

Phylogenetic convolutional neural networks in metagenomics

A novel approach to phylogenetic tree construction using stochastic optimization and clustering

A hidden Markov Model approach to variation among sites in rate of evolution

Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes

A systematic computational analysis of biosynthetic gene cluster evolution: Lessons for engineering biosynthesis

Clustering of 770,000 genomes reveals post-colonial population structure of North America

Clustering to minimize the maximum intercluster distance

Unsupervised deep embedding for clustering analysis

Nextstrain: Real-time tracking of pathogen evolution

Clustal W and Clustal X version 2.0

FastTree 2-Approximately maximum-likelihood trees for large alignments

Interactive Tree Of Life (iTOL): An online tool for phylogenetic tree display and annotation

Estimating the Dimension of a Model

Who Belongs in the Family? Psychometrika

Cancer classification and pathway discovery using non-negative matrix factorization

Integrating hypertension phenotype and genotype with hybrid non-negative matrix factorization

Recent Advances in Supervised Dimension Reduction: A Survey

Decoding the evolution and transmissions of the novel pneumonia coronavirus (SARS-CoV-2/HCoV-19) using whole genomic data

Unsupervised clustering analysis of SARS-CoV-2 population structure reveals six major subtypes at early stage across the world

Emergence of genomic diversity and recurrent mutations in SARS-CoV-2

Genotyping coronavirus SARS-CoV-2: Methods and implications

Analysis and visualization of LD and haplotype maps

The linear process of somatic evolution

The Ecology and Evolution of Cancer: The Ultra-Microevolutionary Process

A New Formulation of Random Genetic Drift and Its Application to the Evolution of Cell Populations

HyPhy 2.5-A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies

Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant

Circulating ACE2-expressing extracellular vesicles block broad strains of SARS-CoV-2

CHAPTER 24-Evolution of Protein Molecules

Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees

Some probabilistic and statistical problems in the analysis of DNA sequences

Transmission dynamics and evolutionary history of 2019-nCoV

A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: A study of a family cluster

COVID-19: Epidemiology, Evolution, and Cross-Disciplinary Perspectives

A pneumonia outbreak associated with a new coronavirus of probable bat origin

Patient-derived mutations impact pathogenicity of SARS-CoV-2. medRxiv 2020

Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus

On the origin and continuing evolution of SARS-CoV-2

Mutation landscape of SARS-CoV-2 reveals three mutually exclusive clusters of leading and trailing single nucleotide substitutions

Tracking the COVID-19 pandemic in Australia using genomics

We thank Xin Wu for the comments and suggestions during the preparation of the manuscript.

The authors declare no conflict of interest.