key: cord-0803154-eijwpdw3 authors: Stojanov, Done title: Phylogenicity of B.1.1.7 surface glycoprotein, novel distance function and first report of V90T missense mutation in SARS-CoV-2 surface glycoprotein date: 2021-08-27 journal: Meta Gene DOI: 10.1016/j.mgene.2021.100967 sha: 89fc9275cf73bc54a17488a5eb7e0063ef431f4a doc_id: 803154 cord_uid: eijwpdw3 Phylogenicity of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) B.1.1.7 surface glycoproteins reported from Europe to the National Center for Biotechnology Information (NCBI) virus database by the mid of April 2021 is analyzed. Novel function for computing phylogenetic distance is proposed for that purpose. Proposed distance function resulted in better-fitted clusters than Jaccard and Sorensen-Dice and accurate evolutionary links were predicted for B.1.1.7 spike variants. Most B.1.1.7 spike variants were linked to their likely direct predecessors at single amino acid change, that in many cases resulted in loss of the key mutations that are associated to the higher B.1.1.7 SARS-CoV-2 infectivity. There were also cases where second mutation was introduced to compensate for the missing mutation. Unreported V90T SARS-CoV-2 surface glycoprotein mutation was also identified, that contributes towards escaping 2–51 neutralizing antibody. In December 2020 Public Health England published a report named as "Variant of Concern 202012/01" that described preliminary findings for a newly detected variant of SARS-CoV-2 with alarming epidemiological properties (Chand et al., 2020). Emerging lineage, that was named B. . Of all mutations, mutations found in the surface glycoprotein (S gene) are of the highest concern because surface glycoprotein (spike protein) mediates SARS-CoV-2 binding and entrance to the host cell (Shang et al., 2020) . SARS-CoV-2 spike protein is synthesized as a 1273-amino acid trimer on the rough endoplasmic reticulum Duan et al., 2020) and is composed of two subunits, S1 and S2 (Huang et al., 2020) . Subunit S1 contains: N-terminal domain (NTD), receptor-binding domain (RBD) and two subdomains (SD1 and SD2) of total size of 672 amino acids (residues 14-685), while: N-terminal hydrophobic fusion peptide (FP), two heptad repeats (HR1/HR2), transmembrane domain (TM) and cytoplasmic tail (CT) are contained in subunit S2 of total size of 588 amino acids (residues 686-1273), Wrapp et al., 2020 . Current research showed that spike protein mutations affect cell biding mechanism that results in higher infectivity (Starr et al., 2020) . Some B.1.1.7 surface glycoprotein mutations, such as: N501Y, P681H, ∆HV[69-70], ∆Y144 attracted special attention, as they have been associated to specific properties. N501Y mutation in the receptorbinding domain has been identified as the key mutation that increases virus binding affinity to human angiotensin-converting enzyme 2 (hACE2), Gu et al., 2020 . P681H mutation, that is found next to the furin cleavage site, enhances virus transmission by facilitating spike protein conformational change (Lasek-Nesselquist et al., 2021). Hoffmann et al. (2020) found that cellular protease furin, cleavages surface glycoprotein at S1/S2 site, that is essential for cell fusion and entrance into the host cell. Multiple-Arginine S1/S2 cleavage site, which as such is not found in animal coronaviruses (Peacock et al., 2020) , was reported as the key factor for fast transmission of B.1.1.7 SARS-CoV-2 variant. As for ∆HV[69-70] recurrent deletion, links to immune escape in immunocompromised patients were made and it was proved that ∆HV[69-70] enhances viral infectivity in vitro (Volz et al., 2021) . ∆Y144 also contributes to immune escape, as Wang et al. (2021) found that ∆Y144 plays pivotal role in resistance to NTD-directed monoclonal antibodies (mAbs). Since the start of pandemics, SARS-CoV-2 has evolved in several known VOCs (Variant of Concern) that served as a main drive for tracking virus polymorphisms and automated data classification. Software tool named PANGOLIN-Phylogenetic Assignment of Named Global Outbreak Lineages (https://cov-lineages.org/) designates isolates to the most likely lineage applying machine learning. On the other hand, Global Initiative on Sharing Avian Influenza Data-GISAID's SARS-CoV-2 database (https://www.gisaid.org/hcov19-mutationdashboard/) tracks complete and up to date list of worldwide SARS CoV-2 spike protein mutations, organized by: most recent accession, isolate (also including geo-location data), frequency and date of collection. Even designated under the same lineage, variants still profile specific polymorphisms. Deep analysis of mutual similarities and dissimilarities provided insights on transmission paths and dynamics of SARS-CoV-2 surface glycoprotein alteration. On the other hand, having computed variants profiling mutations, helped me track and report rare and unreported SARS-CoV-2 surface glycoprotein mutations. Appendix A shows the complete list of computed SARS-CoV-2 surface glycoprotein variants. Missense mutations and deletions were found only among the European population. Each variant matched between 508 (max) and 1 (min) samples in the NCBI virus database. As expected, most European SARS-CoV-2 spike proteins include D614G mutation. This mutation was found as a single present mutation in 508 records (Europe top-ranked variant, Appendix A, Fig. 1 ). On the other hand, 103 variants included D614G in addition to another mutation and 20 variants did not include D614G mutation at all. In terms of deletion, 40 variants included deletion and 84 variants contained only missense mutations. Most of the variants containing deletion (25) were designated to B.1.1.7 lineage. From computed SARS-CoV-2 spike protein variants (Appendix A), I extracted 177 different mutations that circulate among the European population. Among them I found three extremely rare missense mutations: C379F, A93Y and V90T. GISAID's hCoV-19 spike mutations table (https://mendel.bii.astar.edu.sg/METHODS/corona/current/MUTATIONS/hCoV-19_Human_2019_WuhanWIV04/hcov19_Spike_mutations_table.html) lists one worldwide occurrence of A93Y mutation, while V90T has not been reported as SARS-CoV-2 spike protein mutation so far. I found second worldwide occurrence of A93Y among the European population and this is the first report of V90T as SARS-CoV-2 surface glycoprotein mutation. None of the occurrences I report here have been evidenced in GISAID's hCoV-19 spike mutations table. According to GISAID hCoV-19 Spike Glycoprotein mutation surveillance dashboard, missense mutation C379F occurred two times worldwide in spike protein from Switzerland: hCoV-19/Switzerland/SO-ETHZ-490065/2020 and it was recently detected in spike protein from Indonesia: hCoV-19/Indonesia/JK-NIHRD-WGS07001/2021. The same missense mutation I found also in NCBI reported SARS-CoV-2 spike protein from France, accession: QJT72386.1, which was collected in March 2020. QJT72386.1 spike protein contains only: C379F and D614G mutations (Appendix A). On the other hand, missense mutations: A93Y and V90T were found in a highly-polymorphic B. Crystallographic structure of neutralizing antibody 2-51 in complex with SARS-CoV-2 spike Nterminal domain (NTD) (DOI: 10.2210/pdb7L2C/pdb, https://www.rcsb.org/structure/7l2c) was used to predict the effect of V90T missense mutation upon protein-to-protein interaction, Fig. 2 . mCSM-PPI2 web server (Rodrigues et al., 2019: http://biosig.unimelb.edu.au/mcsm_ppi2/) was used to compute the change in binding affinity between SARS-CoV-2 spike N-terminal domain (NTD) and 2-51 neutralizing antibody upon Valine (non-polar, aliphatic) to Threonine (polar, non-charged) substitution at position 90 in N-terminal domain. The change in biding affinity is primarily computed upon residues' interactions which are affected upon missense mutation (different amino acids interact differently with surrounding residues, Fig. 2 ) that leads to protein conformation change inside the motif of interactions, that may increase or decrease protein biding affinity towards another protein, Fig. 2 . Negative change of binding affinity ∆ = − . kcal/mol for V90T missense mutation was computed what means that newly detected V90T missense mutation in the N-terminal domain of SARS-CoV-2 spike protein, even though minimally, contributes towards escaping 2-51 neutralizing antibody. Table 1 . The aim was to find clusters of closely related variants upon their mutations (Fig. 3) and thereby predict likely evolutionary links. Ward's method (Ward, 1963 ) applied on distance matrix was used to compute dendrograms. A novel distance function for computing phylogenetic distance was proposed and evaluated relative to Jaccard (Jaccard, 1912; Tanimoto, 1958 ) and Sorensen-Dice (Dice, 1945; Sørensen, 1948; Nei and Li, 1979 ) distance functions. The distance function is defined in general and it can be applied to any two arbitrary sets. When applied to the subset of B.1.1.7 spike variants on Fig. 3 , proposed distance function resulted in better-fitted clusters (for higher number of clusters ) than Jaccard and Sorensen-Dice and highly-polymorphic variants were accurately linked to corresponding clusters. Multidimensional scaling was used to visualize clusters' structure in 2D space. The complete analysis was performed in Python Software Foundation for Python 3.9 (Python 3.9). Austria (20), Spain (1) 2020; Hensley et al., 2021), previously analyzed B.1.1.7 spike protein QSQ87331.1 is supposed to be found in such patient. In terms of computation, QSQ87331.1 is considered as an obvious outlier. On the other hand, there is no abrupt change in the rate of mutations for the remaining 24 B.1.1.7 spike variants on Fig. 3 and an assumption that those variants were isolated from non-immunocompromised patients was made. Clues towards the dynamics and type of B.1.1.7 spike protein alteration are aimed to be identified, as SARS-CoV-2 was transmitted among non-immunocompromised patients in Europe, based on the available data ( Fig. 3 ). Comparing B.1.1.7 spike protein variants on Fig. 3 (Fig. 3 ). Variants V1(V2) experience no additional mutation, Fig. 3 . As Fig. 3 shows, non-synonymous single nucleotide polymorphisms (SNPs) at codons for defining amino acid substitutions in B.1.1.7 S genes did not result in random protein mutations, but on the contrary, for more than half variants on Fig. 3 , the amino acid in the referent Wuhan SARS-CoV-2 spike protein (YP_009724390.1) was reversedan event that is regarded as deletion of defining amino acid substitution. In 13 out of 24 variants: V5, V [7] [8] [9] [10] [11] [12] , V16, V [20] [21] [22] [23] [24] (Fig. 3) , deletion of defining amino acid substitution can be observed. On the other hand, non-defining amino acid substitutions also occurred, Fig. 3 . Non-defining amino acid substitution and no deletion of defining substitution can be observed in 8 variants: V4, V6, V [13] [14] [15] , V [17] [18] [19] (Fig. 3 ). Some variants: V11, V16, V [20] [21] [22] [23] [24] included both events (1 and 2) (Fig. 3 ). Uncommon alterations: ∆I68 and V70I were found in variant V3 (Fig. 3) and they are discussed later in the paper. Here I consider changes at protein level, given that B.1.1.7 spike proteins underwent most of the time edit events of type: (1) and (2), as SARS-CoV-2 was transmitted among non-immunocompromised patients. Following these suggestions, cluster analysis towards identifying edit history is performed. While "absolute edit history" connects variants to their common ancestor (defined by the list of B.1.1.7 spike protein defining mutations), "relative edit history" and especially minimal relative edit history between variants, boosted with date of collection and geo-location data may reveal likely changes in strains that occurred as the virus was transmitted form one host to another. The first step of computational analysis is to define function for computing phylogenetic distance between variants. Given that variants are expressed as sets of mutations at protein level (Fig. 3) , applicable distance functions compute phylogenetic distance upon dissimilarity between sets. The way dissimilarity is computed may impact classification output or different clusters may be obtained given that different distance functions are applied. However, regardless of the distance function being used, a real number in the range between 0 and 1 is always returned, such 1 is returned for completely different sets and 0 for identical sets. Even though Jaccard (Jaccard, 1912; Tanimoto, 1958 ) distance function and Sorensen-Dice (Dice, 1945; Sørensen, 1948; Nei and Li, 1979 ) distance function are the defaults when it comes to measure phylogenetic distance, here I propose a novel distance function that can be used to measure the dissimilarity between any two random sets in general and thereby it can be also applied to compute phylogenetic distance between variants on Fig. 3 . Proposed distance function can be formulated as: such as: ( , ) denotes distance between sets and , || denotes cardinality of set, \ is the relative complement of in and \ is the relative complement of in . Under set cardinality || we assume the number of elements contained in the set, \ denotes elements contained in set but not in set and \ denotes elements contained in set B but not in set A. The sum of cardinalities of sets \ and \ multiped by cardinality of sets A and B then divided by the sum of squared cardinalities of sets and , is how the distance is computed following proposed metric -Eq. (1). In Appendix B, I prove that proposed distance function according Eq. (1) satisfies properties: P1, P2, P3 and P4. Note that properties where equivalence: ⇔ or "if and only if" (abbreviated by "iff") is suggested, are proved in "both directions". Even though distance functions share some common properties, different distance may be computed for the same samples by different distance function. That will be demonstrated for variants V1 and V3 on Fig. 3 , given that: Jaccard Eq. (2), Sorensen-Dice Eq. (3) and proposed Eq. (1) are applied. Jaccard distance function is defined by Eq. (2) and the distance between sets is computed upon cardinality of intersection and union. Cardinality of intersection is also considered in Sorensen-Dice distance function Eq. To compute Jaccard distance between variants V1 and V3, intersection and union of mutations for V1 and V3 needs to be computed, such as common mutations are assigned to the intersection and common plus different mutations to the union: . Initially each sample is considered as a sperate cluster and clusters are merged together based on → criterion, until one cluster is obtained in the final phase of the process. In other words, given that there are candidate clusters: = { 1 , 2 , … , } that are to be merged with cluster at some phase of the clustering, Ward's algorithm will select cluster from the list of candidates in , such as: ( , ) = { ( , 1 ), ( , 2 ), … , ( , )} or minimum increase of the error sum-of-squares is attained. Joining clusters together upon intra-cluster disturbance minimization criteria, usually results in best-differentiated clusters, what was the main reason why Ward's method was selected over the other available linkage methods in Python. Given that clusters and Ward's algorithm was applied to distance matrix that was computed for B.1.1.7 spike variants on Given that parameter measures 's mean distortion inside its own cluster and measures 's mean distance to the nearest cluster , is considered to be well-fitted to its own cluster for low intra-cluster mean distance ( ) and high mean distance to the nearest cluster ( ). A real number in the range: -1 to +1 is returned for the silhouette coefficient, such as: ( ) ≈ +1 denotes that fits well its own cluster , ( ) ≈ 0 that is on the border between its own and neighboring cluster and ( ) ≈ −1 that is assigned to the wrong cluster or fits better neighboring cluster than its own cluster . Silhouette score for particular classification equals the mean silhouette coefficient for all samples, such as: positive silhouette score indicates differentiated clusters, zero that clusters touch each other and negative that clusters overlap. Silhouette score was taken as a measure for the quality of clusters obtained by Ward's algorithm on three different distance functions: Eq. Table 2 lists clusters' structure for each , for each dendrogram and given that Jaccard and Sorensen-Dice produced same structures of clusters for all 's, Eq. (1) did not, Fig. (4,5) . Multidimensional scaling was used to visualize clusters in ℝ 2 space, Fig. 6 . In general, multidimensional scaling uses pairwise distances from distance matrix to compute coordinates for each sample in lower ℝ space ( usually equals 2 or 3), such that distance between samples in ℝ space approximates pairwise distance from distance matrix. Given that transformation error is controlled by parameter in Table 2 , Fig. 5 . Clusters that were differently computed from Jaccard (Sorensen-Dice) are marked in red color in Table 2 . Silhouette score analysis - Table 3 , Fig. 7 shows that Eq. (1) improved the quality of clusters for higher number of clusters (Table 3 , Fig. 7 ) and more accurate links were predicted for variants that are on the edge of classification, compared to Jaccard and Sorensen-Dice. For = 8 clusters and Jaccard (Sorensen-Dice) distance matrix, Ward's algorithm clustered variants: 18 and 22 together, while for proposed Eq. (1) they were classified separately, Table 2 , Fig. (5,6) . Variant 18 was clustered with variants: 12, 19, 15, 1, 13, while variant 22 formed another cluster with: 9, 23, Table 2 , Fig. (5,6) . Linkage change for V18(V22) due to Eq. (1) resulted in better-fitted clusters compared to Jaccard and Sorensen-Dice, Table 3 , Fig. 7 . Silhouette score of 0.3348169603112492 was computed for = 8 clusters given that distance matrix was computed by Eq. (1), while silhouette scores given that Jaccard and Sorensen-Dice distance functions were used, also for = 8 clusters, were: 0.30333844570840973 and 0.3249632576792248 respectively, Table 3 , Fig. 7 . An independent assessmentminimum edit distance for V18(V22) inside predicted clusters; was used to decide which distance function linked variants V18 and V22 better. Having linked variants V18 and V22 together, as Ward's algorithm applied on Jaccard and Sorensen-Dice did, minimum edit distance equals three, as P812L non-defining substitution and deletion of defining substitutions: P681H, S982A mutually differ V18/V22 (Fig. 3) . On the other hand, Ward's algorithm applied on Eq. (1) linked variants V18(V22) in two separate clusters ( Table 2, Fig. (5,6) ), such as the minimum edit distance relative to V18(V22) inside clusters was improved. Two edit events: L18F non-defining substitution and deletion of defining S982A substitution are required to transform variant V9 into V22 (Fig. 3) and also two edit events (non-defining amino acid substitutions): L18F and P812L are required to transform variant V1 into V18 (Fig. 3) . Since Eq. (1) implied variants at the best edit distance to be classified together (V22 and V9 in cluster C6; V18 and V1 in cluster C8, Table 2 , Fig. (5,6) ), while by Jaccard and Sorensen-Dice they were not, Eq. (1) is suggested as an improved option for predicting more accurate structural and evolutionary links between samples for higher number of clusters , where usually other distance metrics fail. Based on computed silhouette scores in Table 3 for Ward's clustering output for: Eq. (1), Jaccard Eq. (2), Sorensen-Dice Eq. (3) distance matrix, for = 2, 3, 4, 8, 10 clusters, one can easily find that Eq. (1) generated better-fitted clusters than Jaccard in all cases, while compared to Sorensen-Dice, better-fitted clusters were computed for higher values of : = 8 and = 10, Table 3 , Fig. 7 . For = 10 clusters, silhouette score for Eq. (1) equals 0.32145288135845923, while for Jaccard and Sorensen-Dice: 0.29527962158658155 and 0.3146922924108971 respectively, Table 3 , Fig. 7 . The same discussion can be applied for = 8 clusters, where silhouette score of 0.3348169603112492 for Eq. (1) overcame silhouette scores: 0.30333844570840973 and 0.3249632576792248 for Jaccard and Sorensen-Dice, Table 3 , Fig. 7 . For lower number of clusters = 2,3,4, Eq. (1) resulted in a bit worse-fitted clusters than Sorensen-Dice, Table 3 , Fig. 7 . Since Eq. (1) Journal Pre-proof outperformed Jaccard in all cases and Sorensen-Dice for higher number of clusters (Table 3 , Fig. 7) , Eq. (1) is suggested as better option than Jaccard and Sorensen-Dice for accurate prediction of closely-tied phylogenetic relations. Given that B.1.1.7 spike protein underwent most of the time: deletion of defining amino acid substitution and non-defining amino acid substitution - Fig. 3 , in this section I try to find likely phylogenetic links for B.1.1.7 spike protein variants on Fig. 3 , based on Ward's classification output for Eq. (1) and = 8 clusters ( Table 2, Fig. (5,6) ). Variants' structure ( Fig. 3) and date(place) of collection are also considered when suggesting likely linkage. Highly-polymorphic variants V3 and V20 were accurately predicted as single-variant clusters, Table 2 , Fig. (5,6) . In particular, variant V3 replaced ∆ 70 with V70I and most likely the absence of ∆ 70 in the commonly found B.1.1.7 ∆ [69,70] spike deletion pair was compensated by deletion of Isoleucine at the nearby position 68: ∆ 68, Fig. 3 . I hypothesize this, because both Valine and Isoleucine are non-polar aliphatic amino acids with similar physical-chemical properties (most hydrophobic amino acids, no hydrogen donor or acceptor atom in side chains, very close: pK a , pK b , pI,…etc.) and next to each other, so the one could easily compensate for the missing deleterious effect of the other. In the case of variant V3, ∆ 68 probably compensates for the missing ∆ 70, Fig. 3 . On the other hand, variant V20 profiles most deletions of defining amino acid substitutions: A570D, D614G and T716I, Fig. 3 . As mentioned before, deletion of defining substitution means that amino acids: {D(570), G(614), I(716)} were reversed to amino acids: {A(570), D(614), T(716)}, the same that are contained in Wuhan SARS-CoV-2 referent spike protein (YP_009724390.1). As variant V20 lost D614G mutation which is the key mutation that is associated to enhanced SARS-CoV-2 replication in human lung epithelial cells and primary human airway tissues (Plante et al., 2020), reduced infectivity may be associated to V20. Reduced J o u r n a l P r e -p r o o f Journal Pre-proof infectivity of variant V20 is also affirmed by the fact that only one spike protein (QTC10682.1) matched variant V20, Table 1 . Cluster C2 consists of variants: V14, V2, V6, Fig. (5,6) , 18.12.2020 in Finland. Variants assigned to cluster C2 experience no deletion of defining amino acid substitution and they are profiled by ∆ 145 instead of ∆ 144 deletion (Fig. 3) . In this case, variant V2 is suggested as likely parental strain for variant V6, as variant V6 might have been transformed from variant V2 upon S155G substitution (Fig. 3) . Given that the earliest V2 spike protein (QTH25075.1) was collected in Germany and all others were later collected in Austria, variant V2 might have been transmitted from Germany in Austria. Due to ∆ 145 deletion in V14 (Fig. 3) , V14 was also linked to V2 and V6, Fig. (5,6) . Variants: V17, V4 and V11 were linked together to form cluster C4, Fig. (5,6) , (Fig. 3) . Variant V11 is predicted to emerged from V4 upon deletion of defining P681H substitution (Fig. 3) . The absence of P681H mutation in this case can be associated to decreased SARS-CoV-2 binding affinity towards host cell, what is expected to result in reduced infectivity and mild up to moderate clinical presentation. Four variants: V24, V21, V8 and V16 form cluster C5, Fig. ( Fig. 3 . Since A570D mutation eases close to open spike conformational update (Warwicker, 2021) , no amplification of infectivity than already introduced by D614G mutation can be attached to variants: V24, V21, V8 and V16 (Fig. 3) . In this case variant V8 is suggested as likely parental strain for variants: V16, V21 and V24, such as: V16 is predicted to emerged from V8 upon S943I substitution in V8, V21 upon E96D substitution in V8 and V24 upon A653V substitution in V8 (Fig. 3) . Ward's algorithm applied on Eq. (1) classified variants: V22, V9, V23 together, Fig. (5,6) , Table 2 . Variants V22, V9 and V23 are missing P681H defining substitution, Fig. 3 . Given that P681H mutation enhances SARS-CoV-2 binding affinity towards the cell that results in increased infectivity, the absence of P681H mutation for V22, V9 and V23 may result in reduced infectivity. Following variants' structure and collection dates, variant V9 is about to be suggested as likely predecessor for variants: V22 and V23 (Fig. 3) . V9 spike proteins: QTC11006.1 and QTC10454.1 were collected on 13.01.2021 in Spain, while V22 and V23 spike proteins: QTC10598 and QTC10538.1 were collected on 18.01 and 13.01.2021 at the same geo-location. Suggestion that variant V23 might have been transformed from V9 upon E96D substitution is made (Fig. 3) , while V22 can be also linked to V9, but two alterations must had happened: L18F non-defining substitution and deletion of S982A defining substitution (Fig. 3) . Although V5 was found nearest to V10 than V7, when it comes to cluster C7 (Fig. (5,6) , Table 2 ), it is equally likely that V10 emerged either from V5 or V7 (Fig. 3) . Given that V10 emerged from V5, then D1118H deletion must had occurred in V5 (Fig. 3) . On the other hand, given that V10 emerged from V7, then S982A deletion must had occurred in V7 (Fig. 3) . As reversal of defining amino acid substitution upon its deletion is considered unlikely to happen, variants V5 and V7 could not have emerged one from other, Fig. 3 Fig. 3 . Due to Eq. (1) a bit more distant linkage for variant V18 was also correctly predicted by assigning V18 to cluster C8, such as variant V18 might have evolved (directly or indirectly) from V1 upon substitutions: L18F and P812L, Fig. 3 . Variant V18 spike protein QTC10862 was collected on 18.01.2021 in Spain. Reduced infectivity may be associated to V12, as variant V12 lost D614G mutation, Fig. 3 . Given that reliable evolutionary paths were traced for the majority of B.1.1.7 spike variants that underwent single amino acid alteration, suggestions for direct predecessors could not have been identified for variants that underwent severe alteration. One of the possibilities is that their parental strains were not reported to the NCBI virus database and therefore they could not have been linked properly. In this category one can enumerate variants: V3, V18, V20 and V22. Severe polymorphism of variant V20 and specific alterations found in variant V3, suggested these variants as outliers since the very beginning of the study and no significant cluster association for these variants could have been identified. On the other hand, variants V18 and V22 were linked to more specific clusters, but likely direct predecessors for V18(V22) could not have been also identified. Other weakness of the study is that proposed phylogenetic formula Eq. (1) underperformed Sorensen-Dice in clusters' quality for low number of clusters . Ward's algorithm applied on Sorensen-Dice metric for B.1.1.7 spike variants generated more compact classification for = 2,3,4 clusters than Eq. (1) ( Table 3 , Fig. 7) , that suggests Eq. (1) as improved option only for high number of clusters . As B.1.1.7 SARS-CoV-2 evolved in many other forms, B.1.1.7 spike protein lost some of the key amino acid mutations that are associated to the higher B.1.1.7 infectivity or less contagious B.1.1.7 variants compared to their likely predecessors were identified. Most variants were linked to their likely direct predecessors at single amino acid change or only one codon in B.1.1.7 S gene was found to be affected. Even though this change in some variants resulted in loss of mutation that is associated to the higher B.1.1.7 infectivity, in some cases a novel mutation was introduced to compensate for the missing mutation. Such is the case of variant V3, where Isoleucine deletion at position 68 -∆ 68 is supposed to be compensation for the missing deleterious effect of Valine at position 70 -∆ 70 in the N-terminal domain. Unreported V90T missense mutation in the N-terminal domain of SARS-CoV-2 spike protein was identified. V90T mutation showed positive affirmation towards escaping 2-51 neutralizing antibody. Novel function for computing phylogenetic distance was also proposed, that did better structural and evolutionary links predictions than Jaccard and Sorensen-Dice. Silhouette score analysis demonstrated that proposed distance function resulted in better-fitted clusters for higher number of clusters, what makes it suitable for more accurate predictions of closely-tied relations. As for further research, proposed phylogenetic formula will be applied and evaluated to other biological sequences. In addition to popular clustering algorithms, more reliable evolutionary trajectories are expected to be automatically predicted than custom metrics, that is expected to bring better understanding of some evolutionary changes to the scientific community. Predicting automatically the accurate history of alterations is crucial for building reliable epidemiological models, that may provide better understanding of the phylogenicity of many pathogens, including highly alarming SARS-CoV-2, that is of the highest priority nowadays. Case study: prolonged infectious SARS-CoV-2 shedding from an asymptomatic immunocompromised individual with cancer Shedding of viable SARS-CoV-2 after immunosuppressive therapy for cancer Persistence and evolution of SARS-CoV-2 in an immunocompromised host Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England Increased mortality in community-tested cases of SARS-CoV-2 lineage B. 1.1. 7 Measures of the amount of ecologic association between species Establishment and lineage dynamics of the SARS-CoV-2 epidemic in the UK The SARS-CoV-2 spike glycoprotein biosynthesis, structure, function, and antigenicity: Implications for the design of spike-based vaccine immunogens Emergence of SARS-CoV-2 b.1.1.7 lineage-united states Adaptation of SARS-CoV-2 in BALB/c mice for testing vaccine efficacy Intractable COVID-19 and Prolonged SARS-CoV-2 Replication in a CAR-T-cell Therapy Recipient: A Case Study A multibasic cleavage site in the spike protein of SARS-CoV-2 is essential for infection of human lung cells Structural and functional properties of SARS-CoV-2 spike protein: potential antivirus drug development for COVID-19 The distribution of the flora in the alpine zone A tale of three SARS-CoV-2 variants with independently acquired P681H mutations in New York State Composition and divergence of coronavirus spike proteins and host ACE2 receptors predict potential intermediate hosts of SARS-CoV-2 Neutralization of SARS-CoV-2 lineage B.1.1.7 pseudovirus by BNT162b2 vaccine-elicited human sera Mathematical model for studying genetic variation in terms of restriction endonucleases Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations COBALT: constraint-based alignment tool for multiple protein sequences The furin cleavage site in the SARS-CoV-2 spike protein is required for transmission in ferrets Spike mutation D614G alters SARS-CoV-2 fitness Investigation of novel SARS-COV-2 variant: Variant of Concern 202012/01 mCSM-PPI2: predicting the effects of mutations on proteinprotein interactions Cell entry mechanisms of SARS-CoV-2 A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding Elementary mathematical theory of classification and prediction Evaluating the effects of SARS-CoV-2 spike mutation D614G on transmissibility and pathogenicity Antibody resistance of SARS-CoV-2 variants B.1.351 and B.1.1.7 Hierarchical grouping to optimize an objective function A model for pH coupling of the SARS-CoV-2 spike protein open/closed equilibrium Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation SARS-CoV-2 spike-protein D614G mutation increases virion spike density and infectivity Condensed distance matrix for variants on Fig. 3 applying Eq