key: cord-1056650-rqnho47m authors: Mullick, Baishali; Magar, Rishikesh; Jhunjhunwala, Aastha; Farimani, Amir Barati title: Understanding Mutation Hotspots for the SARS-CoV-2 Spike Protein Using Shannon Entropy and K-Means Clustering date: 2021-10-05 journal: Comput Biol Med DOI: 10.1016/j.compbiomed.2021.104915 sha: 5c134668575fcbe1f96c4cc260b3ab040814691b doc_id: 1056650 cord_uid: rqnho47m

The SARS-CoV-2 virus, like many other viruses, has changed continually, giving rise to new variants through mutations, commonly substitutions and indels. In some cases these mutations give the virus a survival advantage, making the mutants dangerous. In general, laboratory investigation must be carried out to determine whether the new variants have any characteristics that can make them more lethal and contagious. Therefore, complex and time-consuming analyses are required to delve deeper into the exact impact of a particular mutation. The time required for these analyses makes it difficult to understand the variants of concern, thereby limiting the preventive action that can be taken against their rapid spread. In this analysis, we deploy a statistical technique, Shannon entropy, to identify the positions in the SARS-CoV-2 spike protein sequence that are most susceptible to mutations. Subsequently, we use machine-learning-based clustering techniques to cluster known dangerous mutations based on similarities in properties. This work utilizes embeddings generated using language modeling, the ProtBERT model, to identify mutations of a similar nature and to pick out regions of interest based on proneness to change. Our entropy-based analysis successfully predicted the fifteen hotspot regions, among which we were able to validate ten known variants of interest. As the SARS-CoV-2 situation rapidly evolves, we believe that the remaining nine mutational hotspots may contain variants that can emerge in the future.
We believe that this may be promising in helping the research community to devise therapeutics based on probable new mutation zones in the viral sequence and resemblance in properties of various mutations.

Journal Pre-proof

The SARS-CoV-2 virus has rapidly evolved by continually mutating, affecting more than 180 million people across the globe. Ever since the genome sequence of SARS-CoV-2 became available, mutations at several sites in the genome have been identified, raising concerns regarding enhanced transmissibility of the virus (Wu et al., 2020). The mutating nature of the virus has inspired global efforts from the research community to actively track and understand the emergence of variants of concern (Alam et al., 2021; Chen et al., 2021; Xing et al., 2020). One of the first mutations to spread rapidly throughout the world, D614G, was first reported in April 2020 (Korber et al., 2020). This mutation has now been classified under several lineages and is found to be a factor in increased transmission of the virus (Laha et al., 2020; Tomaszewski et al., 2020; Volz et al., 2021a). The discovery of this mutation was followed by the identification of a series of mutations in the virus belonging to the B.1.1.7 lineage, which was first found in the Southeast of England (Volz et al., 2021b). The mutations A222V, S477N, N501, H69, N439K, Y453F, S98F, D80Y, A626S, and V1122L have been noted as variants of interest in many studies (Elbe and Buckland-Merrett, 2017; Hadfield et al., 2018; Hayashi et al., 2021; Volz et al., 2021b) and are the focus of this work as well. These variants were selected because they were marked as the Variant Under Investigation SARS-CoV-2 VUI 202012/01 (Variant Under Investigation, year 2020, month 12, variant 01) by different studies done in the United Kingdom (Galloway SE, Paul P, MacCannell DR, et al.).
The mutation A222V belongs to the B.1.177 lineage and has been noted to have a dominating presence in European countries (COVID-19 Genomics UK consortium, 2021; Hodcroft et al., 2020). N439K and Y453F have been found to have a higher binding affinity to the hACE2 receptor and are noted to reduce the neutralizing potential of antibodies specific to SARS-CoV-2 (Bayarri-Olmos et al., 2021; Starr et al., 2020; Thomson et al., 2021). N439K often co-occurs with the 69-70 deletion in the spike protein; the effect of this combined double mutation is being investigated by researchers (COVID-19 Genomics UK consortium, 2021; Meng et al., 2021). N501Y is the causative factor in the increased infectiousness of the disease (Tegally et al., 2020). The numerous effects of such mutations on the increased transmissibility and lethality of SARS-CoV-2 make it imperative to study these mutations and understand their effects (Callaway, 2020).

To tackle the COVID-19 pandemic, researchers have explored both the traditional paradigm of in-vitro experimentation and data-analysis-based methodologies such as machine learning. Data-driven modelling techniques, with their ability to analyze large amounts of data, build a functional mapping between the input parameters and the output. This paper explores the use of data-driven methodologies to understand the mutations in the SARS-CoV-2 spike proteins. To understand and identify the mutation hotspots, we have examined the sequence entropy and its correlation with experimentally identified variants of concern. Tomaszewski et al. defined mutational entropy as a measure of molecular heterogeneity of the SARS-CoV-2 proteome, estimated from the positional variance in these sequences (Tomaszewski et al., 2020). In our work, we measure the positional variance in the sequence of the SARS-CoV-2 spike proteins by calculating Shannon entropy.
In the case of proteins, Shannon entropy has been shown to correlate strongly with protein structural entropy (Koehl and Levitt, 2002) and can provide insights into the compositional stability of proteins. The Shannon entropy is also directly proportional to the inverse packing density of proteins (Liao et al., 2005), and the packing density is further related to increased mutagenesis. Regions of higher local flexibility have an increased value of entropy and are prone to mutations (Tegally et al., 2020). Our study explores these relationships of Shannon entropy to estimate the mutational hotspots in the SARS-CoV-2 spike protein. A higher value of entropy at a position in the sequence indicates increased randomness at that site, whereas a low value of entropy indicates increased stability and decreased randomness at that location.

Apart from identifying the hotspots of interest, we also analyze the similarity of these mutations by employing a k-means clustering algorithm. To generate the embeddings for the clustering algorithm, we leverage the protein sequence data using language modeling approaches. Through transfer learning, some of the highly successful models in the Natural Language Processing (NLP) domain have been applied to protein sequences to generate meaningful representations that can be used in tasks like structure prediction (Rao et al., 2019). We used the Prot-BERT language model to represent these spike protein sequences in the form of semantically rich embeddings (Elnaggar et al., 2020). The Prot-BERT model has been trained on 80 billion amino acids, representing a wide variety of protein sequences. The embeddings generated via the Prot-BERT model can be used for different downstream tasks. In our work, we use the embeddings to determine the similarities between mutations using unsupervised machine learning techniques.
This analysis will help in understanding the relationships between the mutations and assist the research community in tackling the virus.

Machine learning models have been used in many ways to study and understand different aspects of the COVID-19 pandemic. These models have previously been used for forecasting COVID-19 cases (ArunKumar et al., 2021; Chimmula and Zhang, 2020; Sarkar et al., 2020), proposing potential antibodies (Magar et al., 2021), understanding the possible evolutions of the virus, understanding the economic and social effects of social distancing (Memon et al., 2021; Silva et al., 2020), understanding the efficiency of lockdowns (Sharov, 2020), and studying the transmission and spread of the virus (Cooper et al., 2020; Ndaïrou et al., 2020). Data-driven models have also been used to analyze the SARS-CoV-2 mutations. Wang et al. (2021) use topological techniques such as persistent homology to understand the SARS-CoV-2 mutations and uncover some underlying patterns. Another study develops Informative Subtype Markers (ISM) to visualize and analyze the spread of different mutated SARS-CoV-2 sequences.

To understand the effect of the mutations, we focus only on the spike protein of the virus sequence. We select the spike protein region because it is the major component of the SARS-CoV-2 virus responsible for eliciting host immune responses of neutralizing antibodies. It is the presence of this spike protein that allows the virus to interact with and penetrate the host cells. Therefore, the spike protein has received more attention in the analysis of SARS-CoV-2 mutations. To this end, we collect the spike protein data from the GISAID server to analyze the effect of spike protein mutations on transmissibility.
We downloaded 311,256 spike protein sequences from the GISAID server (https://www.gisaid.org/) on January 3, 2020 (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017). The comprehensive dataset also had sequences related to the SARS-CoV-1 virus; therefore, the first stage of preprocessing involved the elimination of sequences that were not from 2020. This resulted in a dataset comprising 310,510 sequences. Most of these sequences comprise 1273 amino acids, with the maximum length being 1278 amino acids. To ensure uniformity in our calculation of the positional entropy, sequences shorter than 1278 were padded to length 1278 by appending the required number of 'X' characters to the end of the sequence for the entropy analysis. The original spike protein sequence found in Wuhan is referenced from Zhao et al. (Wu et al., 2020), and the mutations in all the collected sequences are analyzed with respect to this sequence. Many spike protein sequences were repeated across different countries, so we curated the data further to keep only the unique sequences, as featurizing the same sequence twice using Prot-BERT would have been redundant. Of these unique sequences, 53,898 belonged to the prime variants of interest in this study. Subsequently, this dataset was used to generate embeddings via the ProtBERT model. These embeddings were further used to carry out unsupervised machine learning analysis. To understand and visualize the spread of the data, we generated a t-SNE plot (van der Maaten and Hinton, 2008), shown in Figure 1.

Figure 1: t-SNE plot capturing the distribution of the data collected from the GISAID server.
Some of the variants of concern like N439K, N501Y are clustered near each other. From the t-SNE, we can easily infer that the SARS-CoV-2 mutations have unique characteristics.

Further, we also analyze the geographical locations and the general distribution of the countries that were a part of the dataset. We found that the United Kingdom and Denmark contributed over 50 percent of the mutations in the dataset, with 140458 mutations in the United Kingdom and 20346 in Denmark. These two countries have proactively studied the different mutations and made the data available for public use via the GISAID server. To analyze the mutation sequence data from other countries, a distribution of the dataset comprising countries with more than 200 but fewer than 5000 mutation sequences is shown in Figure 2.

Figure 2: ... and Denmark, the other countries actively tracking the variants of concern include the USA, Australia, South Africa, and Switzerland.

The positional entropy is a measure of the randomness at a given position in the sequence (Crooks, 2004). To calculate the positional entropy for our dataset, we use the Shannon entropy formulation stated in Equation 1 (Shannon, 1948):

$$H = -\sum_{k \in L} p(k)\,\log p(k) \qquad (1)$$

where L is the list of all possible amino acids in all the sequences and p(k) is the probability of finding the kth amino acid at that position. We use Equation 1 to find the positional entropy for all the positions in the SARS-CoV-2 spike protein sequence. Using the dataset obtained from the GISAID server, we first pre-process the data using Biopython (Cock et al., 2009) to extract the sequences from the FASTA file. We found that the length of the spike protein sequence varied from 1270 to 1278; the distribution of the sequence lengths is shown in Figure S1. We also observed that positions containing ambiguous sites or unidentified amino acids in the spike protein sequence are denoted with the character "X" in the dataset.
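Equation 1 and the sliding-window smoothing that this section uses to locate hotspots can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the logarithm base (2) is an assumption because the source does not state it, positions are 0-based, and all sequences are assumed to be padded to the same length already. The window length of 15 and the exclusion of the first and last 60 positions are taken from the method description.

```python
import math
from collections import Counter


def positional_entropy(sequences, mask_char="X"):
    """Shannon entropy of each column, ignoring masked characters.

    Assumes every sequence has the same length and uses log base 2
    (an assumption; the source does not state the base).
    """
    length = len(sequences[0])
    entropies = []
    for i in range(length):
        # Count residues at column i, skipping ambiguous 'X' entries.
        counts = Counter(seq[i] for seq in sequences if seq[i] != mask_char)
        total = sum(counts.values())
        h = 0.0
        for c in counts.values():
            p = c / total
            h -= p * math.log2(p)
        entropies.append(h)  # stays 0.0 if the column is fully masked
    return entropies


def running_mean(values, window=15, trim=60):
    """Mean over each window, keyed by the window's first (0-based) index.

    The first and last `trim` positions are left out before windowing.
    """
    core = values[trim:len(values) - trim]
    return {
        trim + start: sum(core[start:start + window]) / window
        for start in range(len(core) - window + 1)
    }
```

Windows with the largest running mean would then be the candidate hotspot regions; how many windows to keep is a separate choice that this sketch does not make.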
These positions with character "X" are handled by a masking operation that calculates the entropy without considering them. We proceed by calculating the positional entropy values using Equation 1, and all the values are stored in an array. To identify the regions of high entropy that can possibly be associated with harmful mutations, we use a running mean (window length = 15, step size = 1); the first positional index of the window is assigned the value of the running mean. In the running mean calculation, we do not consider the first 60 and last 60 amino acids in the sequences because of sequencing uncertainty. After ... (Table S1).

The Prot-BERT model trained on the UniRef100 dataset was used to generate sequence embeddings (Elnaggar et al., 2020). The Prot-BERT model has 30 layers, 16 attention heads, and an embedding hidden size of 1024. The Prot-BERT model was chosen because the embeddings it generates have been used successfully for different downstream tasks, increasing our confidence in using it. We generate the embeddings for the spike proteins of the mutated sequences using the pre-trained model on the Hugging Face website (Wolf et al., 2020). The Hugging Face interface allows users to easily apply pre-trained models to various Natural Language Processing (NLP) tasks. The curated data containing the unique spike protein sequences were fed into the pre-trained Prot-BERT model, and an embedding of size 1024 was generated for every sequence. These embeddings are then used to study similarities and understand distributions between the mutations via k-means clustering.

Clustering is an unsupervised learning technique used to group a collection of unlabeled data sharing similarities. Each cluster comprises data sharing common traits that are distinct from members of other clusters, thereby resulting in clusters with high internal homogeneity and high external heterogeneity (Bustamam et al., 2017).
Clustering can be broadly classified into two categories, hierarchical and non-hierarchical clustering. The clustering technique deployed in this study is k-means clustering, which is a non-hierarchical approach. This technique involves defining the number of clusters 'k'. Each cluster is ...

Clustering is one of the most important data mining techniques to group unlabeled data based on common traits. In this work, we used k-means clustering to group the different mutations based on similarities in properties. The embeddings generated using the ProtBERT model were used as features for the clustering model. To perform k-means clustering we use the scikit-learn library, which builds the k-means model from the supplied model parameters (Buitinck et al., 2013; Pedregosa et al., 2011). The number of clusters chosen for this task was 10, because there are 10 different mutation types and because 10 clusters gave the highest silhouette score, 0.7228 (Rousseeuw, 1987). We also implemented the MST-kNN clustering technique, but the algorithm did not perform well: it had a very low silhouette score of -0.7638 and hence was not used for any further clustering analysis. We use the silhouette score because it measures how well an algorithm can differentiate between clusters in the data. The score varies from -1 to +1, and a high silhouette score indicates that the datapoints have been clustered appropriately, with similar datapoints clustered together and dissimilar datapoints clustered separately. For the other k-means parameters, the maximum number of iterations was set to 1000 and the number of initializations to 50, after multiple trials with other values, in order to stabilize the cluster formation.
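To make the silhouette criterion concrete, the sketch below computes the mean silhouette score for a labelled set of points in plain Python. It is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, Euclidean distance is assumed, and the toy points stand in for the 1024-dimensional embeddings. In scikit-learn, the settings reported above would presumably correspond to `KMeans(n_clusters=10, max_iter=1000, n_init=50)` scored with `sklearn.metrics.silhouette_score`, although the source does not show the exact calls.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette score over all points (Euclidean distance assumed)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from the point to itself contributes 0 to the sum).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

On two well-separated pairs of points, a labelling that follows the pairs scores close to +1, while a labelling that mixes them scores below zero, which matches the interpretation of the -1 to +1 range given above.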
The advantage of analyzing the entropy lies in the fact that sequence entropy is correlated with molecular motility, which is an important factor for mutation (Koehl and Levitt, 2002; Liao et al., 2005; Tomaszewski et al., 2020). Furthermore, studies have found a significant relationship between these high-entropy hotspot regions of the viral sequence and enhanced virulence in the mutations associated with these regions, which have had a crucial role in the evolution of this disease. Hence, these sites are regions of interest in vaccine development and medicine formulation. We calculated the positional entropy for all positions of the spike protein sequence and have estimated the mutational hotspot regions in these viral sequences.

... and was later found to be the dominant mutation in April and May 2021. Similarly, another mutation, E484K, belonging to the B.1.25 family, was recognized as a variant of concern in South Africa in April 2021 (Wise, 2021). This mutation lies in the region 473-487, which includes another mutation of significance, S477N (Hodcroft et al., 2020; Liu et al., 2021). This emergence of variants of concern from hotspot regions identified by our methodology demonstrates the accuracy of the Shannon-entropy-based analysis. To further illustrate the positional entropy hotspots, we have plotted the positional entropy for the entire sequence of the SARS-CoV-2 spike protein in Figure 3. Based on our analysis, the nine other hotspot regions that we found are 329-343, 386-400, 425-439, 530-544, 700-714, 763-777, 905-919, 955-968, and 1172-1186. Based on the validation analysis presented in Table 1, it is likely that new mutations of concern may emerge in these hotspot regions. Table 2.
Region | Mutations
Receptor-Binding Domain | N439K, L452R, Y453F, S477N, T478K, E484K, N501Y
Heptapeptide repeat sequence | V1122L

Table 2: Location of the mutations in the spike protein of SARS-CoV-2; we have 3 regions of the spike protein where mutations can be located.

We also validate the mutations in Table 1 using the EVmutation methodology (Hopf et al., 2017), which determines the favorability of a mutation by calculating the prediction epistatic score. The EVmutation data on mutation effects for SARS-CoV-2 are available on the server created by Rollins et al. (Nathan Rollins*, Kelly Brock*, Joshua Rollins* et al., 2020); we used the data from this server to analyze the predicted epistatic mutation effect for the mutations presented in Table 1. The novel aspect of the EVmutation method is its ability to account for epistasis by considering the interactions between all pairs of amino acid residues in the neighborhood when quantifying mutational effects. A higher prediction score from EVmutation indicates a more favorable mutation. The analysis using EVmutation is presented in Table 3.

Table 3: Analysis of the SARS-CoV-2 mutations using EVmutation. The prediction epistatic score is an indicator of whether a mutation is fit or not; a higher score indicates that the mutation is a better fit. The third column indicates the rank among all the possible mutations at the site. The possible values for the rank range from 1 to 19, as there are 20 amino acids and a single amino acid can mutate into 19 others. The rank depends on the EVmutation score: the highest score gets rank 1, which indicates that the mutation is highly favorable, and the lowest score gets rank 19, which indicates that the mutation is not favorable according to the EVmutation calculations.

Among the ten different mutations in Table 1, Table 3 presents the EVmutation score for seven different mutations.
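The ranking convention in the Table 3 caption (rank 1 for the highest score at a site) can be illustrated with a short Python sketch. The function name and the example scores are hypothetical and chosen only for illustration; they are not values from the EVmutation server.

```python
def rank_substitution(site_scores, mutant):
    """Rank of `mutant` among the scored substitutions at one site.

    `site_scores` maps each substituted amino acid to its score;
    rank 1 is the highest (most favorable) score.
    """
    ordered = sorted(site_scores, key=site_scores.get, reverse=True)
    return ordered.index(mutant) + 1


# Made-up scores for three of the 19 possible substitutions at one site.
example_scores = {"V": -1.0, "S": -2.0, "G": -3.5}
```

With all 19 substitutions scored at a site, the returned rank would range from 1 to 19 as described in the caption.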
The data for S477N, E484K and N501Y are unavailable on the server (Nathan Rollins*, Kelly Brock*, Joshua Rollins* et al., 2020) and hence are not presented in Table 3. We observe that A222V and T478K are highly favorable mutations, as they have the highest possible prediction epistatic score among all mutations for the wild-type residue (A for site 222 and T for site 478). The D614G mutation is also highly favorable, and the mutations Y453F, V1122L and N439K may be considered moderately favorable. On the other hand, the mutation L452R may not be as favorable based on its prediction epistatic score. The EVmutation scores validate most mutations identified in the hotspots from our methodology in Table 1, further indicating that calculating the positional entropy of the sequence can be a useful metric for identifying future mutation hotspots. The positional entropy formulation developed in this work used data from the year 2020 and yet was able to identify some of the mutations that emerged later, in April and May 2021, such as E484K and L452R, validating our methodology further. We believe that our method may potentially be used to identify dangerous mutations in advance and aid in the fight against the pandemic.

The clustering analysis was done on the embeddings generated from the Prot-BERT model. The embedding for each sequence is a 2D array of shape (sequence length, 1024), where 1024 is the hidden dimension of the model. Subsequently, we applied mean pooling over the sequence length dimension of the embeddings to generate a vector of dimension 1024 for each sequence. This 1024-dimensional vector is used for the k-means clustering analysis. The cluster centers resulting from k-means clustering correspond to the different mutation types, thereby verifying our assumption that the different mutation types get grouped separately. We find that 7 out of 10 different mutations are identified as cluster centers, with a few repeats.
On analyzing the spike protein sequences that form the clusters and the sequence representative of the cluster center, we find that in most cases the majority of the sequences are of the same type as the cluster center, whereas in most of the remaining cases the mutation type of the cluster center is among the top 3 mutation types present in the cluster, the other two types possibly having similar characteristics (Table 4). For example, the plots (Figure 4) show that the clusters of S477N and N439K have a majority of S477N and N439K components, respectively. Furthermore, A222V has the second highest count in the cluster representing S477N (Figure 4), indicating similarities between them. D80Y is one of the major components of the N439K cluster, thereby implying similarity in characteristics. A study by Jacob et al. (2020) found that A222V and S477N are both stabilizing mutations, supporting our finding that these two mutations may have some similar characteristics. This similarity analysis between the mutations is significant because, when designing therapeutics that can counter new mutations, understanding the characteristics of mutations computationally can save a lot of experimental time and accelerate the therapeutic development process.

In this study, we developed a methodology to determine the hotspots for mutations in spike protein sequences of SARS-CoV-2. This study can enable us to know variants of interest beforehand so that therapeutics can be developed for them. We found nineteen regions of interest in the sequence of the spike protein that may be potential hotspots for novel mutations in SARS-CoV-2. Ten of these hotspots contain mutations that have already been flagged as possibly more transmissible by previous research.
Interestingly, some of the newly emerging variants from India and South Africa that were marked dangerous in April 2021 and May 2021 were identified by our methodology, even though we use the sequence data available on the GISAID server before December 2020. Identifying hotspots beforehand may have implications for the development of therapeutics and for awareness of the potential threats posed by mutations in the virus. We also use the unsupervised learning-based clustering technique k-means to find the similarities between the variants of interest that have previously been found to be dangerous. To encode the protein sequences we use the Prot-BERT model, and we use the features it generates for the k-means analysis. Clustering the mutation variants based on similarity saves time and resources, since similar treatment techniques can be implemented for mutations that fall into the same cluster. One of the results of our analysis was the similarity between the S477N and A222V mutations; it implies that these mutations share common traits and occurrences and may be subjected to similar treatment strategies.

CovMT: an interactive SARS-CoV-2 mutation tracker, with a focus on critical variants
Forecasting of COVID-19 using deep layer Recurrent Neural Networks (RNNs) with Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) cells
The SARS-CoV-2 Y453F mink variant displays a pronounced increase in ACE-2 affinity but does not challenge antibody neutralization
API design for machine learning software: experiences from the scikit-learn project
Application of k-means clustering algorithm in grouping the DNA sequences of hepatitis B virus (HBV)
The coronavirus is mutating - does it matter?
COVID-19 CG enables SARS-CoV-2 mutation and lineage tracking by locations and dates of interest
Time series forecasting of COVID-19 transmission in Canada using LSTM networks
Biopython: freely available Python tools for computational molecular biology and bioinformatics
A SIR model assumption for the spread of COVID-19 in different communities
COVID-19 Genomics UK consortium, 2021. COG-UK report on SARS-CoV-2 Spike mutations of interest in the UK 15th
WebLogo: A Sequence Logo Generator
Data, disease and diplomacy: GISAID's innovative contribution to global health: Data, Disease and Diplomacy
ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Learning
Emergence of SARS-CoV-2 B.1.1.7 Lineage - United States
Nextstrain: real-time tracking of pathogen evolution
Effect of RBD mutations in spike glycoprotein of SARS-CoV-2 on neutralizing IgG affinity (preprint)
Emergence and spread of a SARS-CoV-2 variant through Europe in the summer of
Mutation effects predicted from sequence co-variation
Structural and functional properties of SARS-CoV-2 spike protein: potential antivirus drug development for COVID-19
Evolutionary tracking of SARS-CoV-2 genetic variants highlights an intricate balance of stabilizing and destabilizing mutations (preprint)
Sequence Variations within Protein Families are Linearly Related to Structural Variations
Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus
Characterizations of SARS-CoV-2 mutational profile, spike protein stability and viral transmission
Visualizing Data using t-SNE
Protein sequence entropy is closely related to packing density and hydrophobicity
Identification of SARS-CoV-2 spike mutations that attenuate monoclonal and serum antibody neutralization
Potential neutralizing antibodies discovered for novel corona virus using machine learning
Encyclopedia of Machine Learning
Assessing the role of quarantine and isolation as control strategies for COVID-19 outbreak: A case study
SARS-CoV-2 Proteins [WWW Document]
Mathematical modeling of COVID-19 transmission dynamics with a case study of Wuhan
Scikit-learn: Machine Learning in Python
Spike mutation D614G alters SARS-CoV-2 fitness
Evaluating Protein Transfer Learning with TAPE
Silhouettes: A graphical aid to the interpretation and validation of cluster analysis
Modeling and forecasting the COVID-19 pandemic in India
A mathematical theory of communication
Creating and applying SIR modified compartmental model for calculation of COVID-19 lockdown efficiency
GISAID: Global initiative on sharing all influenza data - from vision to reality
COVID-ABS: An agent-based model of COVID-19 epidemic to simulate health and economic effects of social distancing interventions
Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding
Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations
Sensitivity of SARS-CoV-2 B.1.1.7 to mRNA vaccine-elicited antibodies
Circulating SARS-CoV-2 spike N439K variants maintain fitness while evading antibody-mediated immunity
New Pathways of Mutational Change in SARS-CoV-2 Proteomes Involve Regions of Intrinsic Disorder Important for Virus Replication and Release
The COVID-19 Genomics UK (COG-UK) consortium 1.1.7 in England: Insights from linking epidemiological and genetic data (preprint)
Analysis of SARS-CoV-2 mutations in the United States suggests presence of four substrains and novel variants
Bio-informed Protein Sequence Generation for Multi-class Virus Mutation Prediction
Covid-19: The E484K mutation and the risks it poses
HuggingFace's Transformers: State-of-the-art Natural Language Processing
A new coronavirus associated with human respiratory disease in China
MicroGMT: A Mutation Tracker for SARS-CoV-2 and Other Microbial Genome Sequences
The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity (preprint)
Emergence of a Novel SARS-CoV-2 Variant in Southern California
Genetic grouping of SARS-CoV-2 coronavirus sequences using informative subtype markers for pandemic spread visualization

The authors would like to thank Prakarsh Yadav and Parisa Mollaei for their useful inputs and comments on the paper. This work is supported by the start-up fund provided by CMU Mechanical Engineering.