key: cord-0719541-w7036388
authors: Zhao, Lue Ping; Lybrand, Terry P.; Gilbert, Peter B.; Hawn, Thomas R.; Schiffer, Joshua T.; Stamatatos, Leonidas; Payne, Thomas H.; Carpp, Lindsay N.; Geraghty, Daniel E.; Jerome, Keith R.
title: Tracking SARS-CoV-2 Spike Protein Mutations in the United States (January 2020—March 2021) Using a Statistical Learning Strategy
date: 2021-12-21
journal: Viruses
DOI: 10.3390/v14010009
sha: a5d587c926da15125757ff6c008b6b2a40d29e54
doc_id: 719541
cord_uid: w7036388

The emergence and establishment of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants of interest (VOIs) and variants of concern (VOCs) highlight the importance of genomic surveillance. We propose a statistical learning strategy (SLS) for identifying and spatiotemporally tracking potentially relevant Spike protein mutations. We analyzed 167,893 Spike protein sequences from coronavirus disease 2019 (COVID-19) cases in the United States (excluding 21,391 sequences from VOI/VOC strains) deposited at GISAID from 19 January 2020 to 15 March 2021. Alignment against the reference Spike protein sequence led to the identification of viral residue variants (VRVs), i.e., residues harboring a substitution compared to the reference strain. Next, generalized additive models were applied to model VRV temporal dynamics and to identify VRVs with significant and substantial dynamics (false discovery rate q-value < 0.01; maximum VRV proportion >10% on at least one day). Unsupervised learning was then applied to hierarchically organize VRVs by spatiotemporal patterns and identify VRV-haplotypes. Finally, homology modeling was performed to gain insight into the potential impact of VRVs on Spike protein structure. We identified 90 VRVs, 71 of which had not previously been observed in a VOI/VOC, and 35 of which have emerged recently and are durably present. Our analysis identified 17 VRVs ~91 days earlier than their first corresponding VOI/VOC publication. Unsupervised learning revealed eight VRV-haplotypes of four VRVs or more, suggesting two emerging strains (B1.1.222 and B.1.234). Structural modeling supported a potential functional impact of the D1118H and L452R mutations. The SLS approach equally monitors all Spike residues over time, independently of existing phylogenic classifications, and is complementary to existing genomic surveillance methods.

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the pathogen responsible for the global coronavirus disease 2019 (COVID- 19) pandemic, is an RNA virus and thus prone to replication errors [1] . Replication errors that yield nonsynonymous amino acid (AA) substitutions, or nucleotide insertions or deletions that cause a frame shift and alter the subsequent coding sequence, can lead to a variety of outcomes. If the resulting mutations have detrimental effects on fitness, or if they have neutral effects on fitness and undergo stochastic extinction, variants harboring these mutations fail to become established in the population. However, mutations that confer a fitness advantage can rapidly become dominant in a population. For SARS-CoV-2, there are four classes of variant: variant being monitored (VBM), variant of interest (VOI), variant of concern (VOC), and variant of high consequence (VOHC). The CDC is closely monitoring ten VBMs (Alpha, Beta, Gamma, Epsilon, Eta, Iota, Kappa, 1.617.3, Mu, and Zeta) and two VOCs (Delta and Omicron) in the United States [2] . VOCs show specific attributes such as increased transmissibility [3] [4] [5] [6] , increased resistance to neutralization by antibodies elicited through natural infection [3, [7] [8] [9] , and/or increased resistance to neutralization by vaccine-elicited antibodies [8, 10, 11] , and have already influenced vaccine development, evidenced by the current planning of clinical trials to test variant-adapted vaccines [12] . While no VOHCs have yet been identified, it remains possible that such variants-i.e., variants that can effectively evade natural or vaccine-induced immunity-may yet emerge [13, 14] . The identification of VOHCs could necessitate the introduction of more stringent public health guidelines and/or spur further treatment and vaccine development.

Genomic surveillance is critical for tracking the emergence and spread of new variants. Such surveillance can be accomplished via a variety of approaches, such as phylogenic analysis [3, 15] . In this approach, new viral sequences are classified to existing lineages identified by PANGO [16] , subsets of samples with the same branches are identified, and variant frequencies are counted to identify new variants. The NextStrain methodology [17] can model dynamic changes in variant proportions, while an alternative approach aligns sequence data to a matrix of binary indicators for the presence of variants, and systematically evaluates each mutant as a potential variant [18] . Leveraging the analytic approach of single nucleotide polymorphisms (SNPs), variants have been identified by assessing linkage-disequilibrium [19] or similar SNP-based identification and analysis [20] . However, with the exception of the NextStrain methodology [3] , these methods do not directly take into account sequence collection time, nor explicitly incorporate highly granular geographic information. Moreover, these methods take a holistic view of the viral genome. Thus, there is a need for complementary approaches for detecting and characterizing Spike mutations of potential public health importance that may be missed, or detected later, by existing genomic surveillance methods.

To meet this need, we describe a statistical learning strategy (SLS) using generalized additive models, unsupervised learning techniques, and single nucleotide polymorphism (SNP) methodologies for identifying and spatiotemporally characterizing viral residue variants (VRVs), a term we use to describe AA positions in the Spike protein where a mutation is significantly present in a given geographic area. The SLS method generates pertinent statistics for reproducible scientific inference and facilitates visual representation of results for intuitive interpretation. Using publicly available SARS-CoV-2 sequences from US COVID-19 cases that were not assigned to a VOI or VOC lineage, we apply our method to identify and spatiotemporally characterize VRVs in the Spike protein, within individual US states/territories. We also apply standard homology modeling methods to highlight individual AA mutations with the potential to impact Spike protein structure and/or function.

We first applied the SLS method to identify VRVs separately in each state/territory ( Figure S1 ). The decision to compartmentalize VRV discovery by state/territory was partially based on the fact that domestic travel restrictions have varied over the course of the pandemic, with nearly half of all states having imposed some type of interstate travel restriction [21] , leading to the hypothesis that VRVs may follow state/territory-specific temporal dynamics. The identified VRVs showed a range of dynamic patterns across the five different trajectories taken  by the V382, L452, T478, P681 , and T732 VRVs in California ( Figure 1A ). The relative abundance of V382 started rising on day 250, exceeded 10% on day 259, and fell below 10% on day 275. L452 emerged on day 310, exceeded 10% on day 390, and exhibited a positive trajectory thereafter. Three other VRVs (T478, P681, T732) had similar trajectories to L452. Note that we denote the VRV of interest using the reference amino acid position alone (e.g., P681), given that some reference amino acids show point mutations to more than one amino acid (see examples below).

Viruses 2021, 13, x 3 of 17 pandemic, with nearly half of all states having imposed some type of interstate travel restriction [21] , leading to the hypothesis that VRVs may follow state/territory-specific temporal dynamics. The identified VRVs showed a range of dynamic patterns across the different states/territories (Figure S2), exemplified by the five different trajectories taken by  the V382, L452, T478, P681 , and T732 VRVs in California ( Figure 1A ). We refer to the combination of a VRV and a state/territory in which it was identified as a "geo-VRV". A total of 267 geo-VRVs, consisting of combinations of 90 VRVs identified among the 52 state/territory classifications, were identified (Table S3) . Fifty-eight VRVs were only observed in one state/territory, whereas 32 were observed in two or more (Table S4) .

Unsupervised learning was then applied to organize the 267 geo-VRVs into 10 clusters (TP1 through TP10) ( Figure 1B , Table S3 ). The cluster most strikingly different from the others was "TP2", which was composed of 47 geo-VRVs, each of which contained the D614 VRV at a maximum relative abundance of 100%, showing the early dominance of the D614 VRV in these states/territories. Clusters TP3 and TP5 included geo-VRVs of potential concern, as they included VRVs that appear to have emerged within the last few months in their specific states/territories. In contrast, most VRVs in the remaining clusters tended to expand and contract within relatively short times in a given state/territory, making such VRVs likely less important from a public health perspective. We termed these 35 VRVs that were uniquely identified in Clusters TP2, TP3, and TP5 as "pressing VRVs".

We next compared the 90 VRVs and the 35 pressing VRVs with the 12 and 13 AA positions that have been shown to harbor substitutions (AA-subs) within US-circulating VOIs and VOCs, respectively [2] . The 90 VRVs included nine and eight AA-subs in VOIs and VOCs, respectively; the 35 pressing VRVs included four and eight AA-subs in VOIs and VOCs, respectively ( Figure 1C ), even though all VOI/VOC sequences were excluded from the current analysis. Notably, 25 of the VRVs that have not been previously identified as an AA-sub in a VOI or VOC appear to have emerging trajectories, demonstrating the potential of the SLS method to identify novel Spike AA positions that may warrant further investigation/observation.

Five VOI/VOC AA-subs (Y144, F888, V1176, H69, K417) were not identified as a VRV. Figure S3 shows the state/territory-specific relative abundances over time for states/territories where substitutions were identified at these five positions (albeit without meeting the statistical significance criteria for identification as a VRV). Our data suggest that, individually, these AA positions may be of less interest in the US.

Timely detection of potentially fast-emerging VRVs, and conversely, identification of VRVs likely not of concern, are both important for informing public health guidelines and for influencing research priorities. Given the importance of timely detection, we use the first time when the maximum proportion (Pmax) of a VRV exceeded 10% as the first reportable time (see the formula for calculating Pmax in Section 3.3.1 of Materials and Methods). For each out of the set of AA-subs within VOIs/VOCs that were also identified as VRVs, Table 1 compares within each state/territory the time of detecting an emerging VRV as calculated by the SLS method vs. the first appearance of the AA-sub in the scientific literature. The SLS method identified emerging VRVs in an average of 207 days, while the average of reported values in the literature was 299 days. E484, an AA-sub in the B.1.1.7, P.1, and B.1.351 variants, was an exception as it was not detectable in the US until day 370 when it was first detected as a VRV in Rhode Island.

SARS-CoV-2 is a single-stranded ("haploid") RNA virus. The presence of multiple VRVs found in a patient form a VRV-haplotype. The accumulation of multiple VRVs on a single RNA strand could affect protein function more than a single VRV. To identify VRV-haplotypes, we performed unsupervised learning of selected VRVs and cases through a two-way hierarchical cluster analysis state/territory-by-state/territory. As shown in Figure S4 , some VRV-haplotypes are shared across states/territories, but most are not. Figure 2 , for example, shows the results of the unsupervised case and VRV clustering for Washington state. The heatmap shows that multiple VRVs tend to aggregate among subsets of cases, the inspection of which can reveal VRV-haplotypes. For example, the case cluster "PG7", which included 208 cases, had VRVs from the "RG4" and "RG3" clusters, which included the VRVs (G142-S155-F157-E180-K444-L452-T478-D614-Q677-P681-T732-T859-S940-D950) (see Table S5 ). -405  371-384  356  -40  358-398  379  -South Carolina  -----------46  405  368-389  -Tennessee  50-141  ----------50  318  --Texas  -----------23  360  378  -Utah  -159-173  ---98-190  -----44  -358  -Virginia  --329-331  -----397  --47  384  359  176-185  Washington  -406  --411  -403  --410  -50  373  --Wisconsin  ------409  --265-271  -12  299-407  411  -Wyoming  -381  --381  -381  ----51  -392  -Other States  --------393  404  404  34  -368  375-381 For each of 15 amino acid positions shown to harbor a substitution in a VOI or VOC, the top two rows show the first reported date in the literature of a VOI or VOC harboring a substitution at the designated site vs. the date of VRV detection at the same amino acid position by the SLS method (across all states/territories. * "Reporting Day" was set to the 15th day in each month in which the relevant publication appeared. "SLS Detection Day" was set to the day at which the locally averaged proportion of the specific VRV exceeded 10% based on temporality models fitted in each states/territory. If the locally averaged proportion of the VRV later declined below 10%, the second day is shown after a dash. All numbers in the Through algorithmic comparison and merging of different case clusters, six VRVhaplotypes (W1 through W6) were identified in Washington, while the "W6" cluster (7130 cases) carried only a single VRV, D614 (Table 2) . Comparison across VRV-haplotypes suggested that W6 evolved to W3, W4, and W5 via the acquisition of an additional mutation at T732, Q677, and D178, respectively. Similarly, both W3 and W4 could have evolved to W2 via the acquisition of an additional mutation at Q677 or T732, respectively.

VRV-haplotype blocks are identified from unsupervised learning. Within each block, there can be multiple VRV-haplotypes that consist of polymorphic residues; individual VRVs may take either the reference residue or a substitution. For example, VRV-haplotype W1 had 10 haplotypes (Table 2) , where the number after the hyphen indicates the number haplotypes, we performed unsupervised learning of selected VRVs and cases through a two-way hierarchical cluster analysis state/territory-by-state/territory. As shown in Figure  S4 , some VRV-haplotypes are shared across states/territories, but most are not. Figure 2 , for example, shows the results of the unsupervised case and VRV clustering for Washington state. The heatmap shows that multiple VRVs tend to aggregate among subsets of cases, the inspection of which can reveal VRV-haplotypes. For example, the case cluster "PG7", which included 208 cases, had VRVs from the "RG4" and "RG3" clusters, which included the VRVs (G142-S155-F157-E180-K444-L452-T478-D614-Q677-P681-T732-T859-S940-D950) (see Table S5 ). Unsupervised learning was used to organize the 10 VRVs into 5 residue groups (RG1 through RG5) and to organize the 9877 cases into 12 patient groups (PG1 through PG12).

Through algorithmic comparison and merging of different case clusters, six VRVhaplotypes (W1 through W6) were identified in Washington, while the "W6" cluster (7130 cases) carried only a single VRV, D614 (Table 2) . Comparison across VRV-haplotypes suggested that W6 evolved to W3, W4, and W5 via the acquisition of an additional mutation at T732, Q677, and D178, respectively. Similarly, both W3 and W4 could have evolved to W2 via the acquisition of an additional mutation at Q677 or T732, respectively. Unsupervised learning was used to organize the 10 VRVs into 5 residue groups (RG1 through RG5) and to organize the 9877 cases into 12 patient groups (PG1 through PG12). Table 2 . VRV-haplotypes identified in Washington and in New York: state-specific frequencies of cases, number of VRVs per VRV-haplotype, and haplotypic polymorphisms (state-specific frequencies). Unimputable residues are denoted with an "X". D614-Q677-T732 12 

As all sequences corresponding to VOI/VOC were excluded, the strains with detected VRVs are not currently undergoing special monitoring or characterization. We were thus interested in naming the identified VRV-haplotypes and the PANGO lineages assigned by GISAID. To this end, we selected VRV-haplotype blocks including four or more pressing VRV mutations, resulting in eight VRV-haplotype blocks. Table 3 cross-tabulates these VRVhaplotypes by their assigned lineages. Of particular interest, viruses with the haplotype "KGHA" of T478-D614-P681-T732 were observed 2132 times, and 2029 of them were assigned to the strain B.1.1.222. It is natural to name the haplotype T478K-D614G-P681H-T732A as a B.1.1.222. Another noteworthy strain is B.1.234, which corresponds to "SVGHF" and "SVGHS" of G142-E180-D614-Q677-S940 with exceptionally high frequencies (353 and 262). The remaining VRV-haplotypes mostly correspond to B.1. Fourteen other strains were found in more than 10 occurrences and may also be of potential interest. Table 3 . VRV-haplotypes. Cross-tabulation of individual VRV-haplotypes with GISAID-assigned lineages in all 167,893 sequences, excluding lineages with fewer than 10 occurrences. "Freq", corresponding haplotype frequencies; "Unknown", sequences not assigned to any lineage. 

The SLS method includes homology modeling of Spike mutations, to predict possible consequences on Spike structure/function and to guide laboratory research. Inspection of the temporal dynamics of the VRV-haplotypes may be useful for identifying VRVs of interest. We performed homology modeling on two potentially interesting VRV-haplotypes, one of which we refer to as "UK-VRV" (N501-A570-D614-P681-T716-S982-D1118, from the UK variant cluster B.1.1.7) and another VRV-haplotype, S13-W152-L452-D614 (from the US variant cluster (B. 1.94; B.1.427; B.1.429) ).

The D614G mutation observed in the UK-VRV haplotype has been associated with increased infectivity/transmissibility [22] [23] [24] . Cryo-electron microscopy structures have recently been reported [25, 26] that reveal the structural consequences of this mutation and provide a plausible mechanistic explanation for the increased infectivity of D614G-carrying variants. The D614 VRV has predominated in all US cases for which sequence information is available in the TP2 cluster ( Figure 1B , Table S2 ). The N501Y mutation (present in the B.1.1.7 variant) is located in the receptor-binding domain (RBD) and has been reported to enhance binding affinity to the angiotensin-converting enzyme-2 [8, 27] . N501Y has also been shown to reduce susceptibility to some neutralizing antibodies (nAbs), although the B.1.1.7 variant appears to remain susceptible to some extent to natural infection-acquired and vaccine-induced nAbs [8] .

Of the five remaining VRVs in the UK-VRV haplotype, A570, T716, and S982 seem relatively benign in that mutations at these positions are already decreasing in certain states/territories (this trend is also true to some extent for N501Y). While this observation may simply reflect inadequate sequencing efforts in recent months, it may also indicate that mutations at these positions do not confer any fitness advantage to the virus.

The two remaining VRVs in the UK-VRV haplotype, P681 and D1118, are more intriguing. Mutations at these two sites, particularly at P681, appear to persist in multiple states/territories. The P681H mutation occurs in the S1/S2 cleavage segment of the Spike protein, which is typically not resolved in cryo-electron microscopy or X-ray diffraction experiments. Thus, we cannot speculate on potential structural consequences of this mutation. However, the continued presence of this mutation in many states and its location in the Spike protein S1/S2 cleavage segment is most interesting. It is plausible that the P681H mutation increases the flexibility of the S1/S2 cleavage segment, which might lead to enhanced cleavage and infectivity. However, there are no experimental studies to date that demonstrate a significant increase in S1/S2 cleavage for this mutant relative to other variants. Likewise, we are not aware of any reports that D1118H impacts transmissibility or morbidity, but the location of this mutation in the Spike protein trimer assembly ( Figure 3A ,B) suggests it could impact trimer assembly structure/stability/dynamics.

The US variants also carry the D614G mutation. The VRV-haplotype S13I-W152C-L452R (ICR-3) appeared in Fall 2020 and is rapidly becoming dominant in states on the West Coast, as well as appearing in selected southwestern and southeastern states ( Figure 3C-F) . The S13I and W152C mutations, which are situated in the N-terminal domain (NTD) of the Spike protein, have been implicated in escape from NTD-targeting monoclonal antibodies [28] . The L452R mutation is situated in the RBD; homology modelling of the RBD-ACE2 complex shows that while R452 does not directly contact ACE2, the guanidinium side chain of R452 is surface-exposed and thus could potentially impact nAb binding ( Figure 3G ). The L452R mutation was recently shown to reduce binding affinity to some RBD-targeting monoclonal antibodies, as well as to reduce susceptibility to nAbs [28] . Thus, structural modeling of mutations in the S13-W152-L452 VRV-haplotype yields results that are consistent with the temporal dynamics of this VRV-haplotype. The US variants also carry the D614G mutation. The VRV-haplotype S13I-W152C-L452R (ICR-3) appeared in Fall 2020 and is rapidly becoming dominant in states on the West Coast, as well as appearing in selected southwestern and southeastern states ( Figure  3C-F) . The S13I and W152C mutations, which are situated in the N-terminal domain (NTD) of the Spike protein, have been implicated in escape from NTD-targeting 

Spike AA sequences (genome position: 21563-25384) from 189,727 COVID-19 cases in the US and selected US territories, along with their associated metadata, were retrieved from GISAID [29] (https://www.gisaid.org/ accessed on 23 March 2021). Geographic origin (one of the 50 US states, Washington DC, Puerto Rico, or the Virgin Islands) was available for 189,284 of the sequences. For 443 of the cases, no US state/territory origin information was available. To ensure adequate sample size, Spike sequences from North Dakota, South Dakota, and the Virgin Islands were combined with these 443 sequences, forming an "Other States" category (728 sequences). Among them, 21,391 sequences were classified as a VOI or VOC (Table S1 ). These sequences were excluded, leaving 167,893 sequences for analysis (see Table S2 for monthly case numbers by state/territory).

Spike protein sequences were aligned to the Wuhan reference sequence [30] using MAFFT [31] , yielding a complete "rectangular residue sequence matrix". Sequences with at least one AA mutation (compared to the reference) were identified, enabling transformation of the residue sequence matrix to a matrix of binary VRV (mutant) indicators. Monomorphic residues led to columns of zeros and were eliminated from further analysis. We use VRV in this work to refer to a single AA position that harbors a substitution. We reserve the term "variant" in this work for identified VOIs and VOCs.

To model non-linear temporal dynamics, a generalized additive model (GAM) was used to regress the VRV indicator over sample collection time via the following probability model for the jth binary VRV indicator,

where α is a constant coefficient and s j (t) is a non-linear function of time t, and both are estimated by the restricted maximum likelihood method [32] . Upon completing the estimation, the above function was used with the estimated coefficient and non-linear function to compute locally averaged proportions of each jth VRV, yielding a p-value that measures if the function s j (t) deviates from zero. Also produced is the maximum proportion as Pmax = max[Pr VRV j = 1 t . To correct for multiple comparisons, VRVs p-values were converted to q-values (false discovery rates) [33] . The function "gam" was used to fit the GAM and the function "qvalue" was used to compute q-values (R packages MGCV [34] and qvalue [33] , respectively). The smoothing parameter k = 7 was chosen, and VRVs present in fewer than 10 sequences (or >90% of all sequences, as for D614), or whose span between emergence and disappearance from circulation was less than seven days were assigned a p-value (and q-value) of one. Upon fitting the GAM, the fitted values were used as locally averaged VRV proportions daily from the first to the last reporting day. Computed proportions over time describe VRV temporal dynamics. A VRV is identified by both statistical significance (q-value < 0.01) and being present in a substantial proportion (Pmax > 10%).

Within-state/territory: Temporal dynamics of <8 VRVs within a given state/territory were visualized with a line plot. For visualizing temporal dynamics of ≥8 VRVs within a given state/territory, unsupervised learning was applied, grouping VRVs with similar temporal patterns. Results were visualized with a heatmap.

Spatially integrated: To visualize spatiotemporal VRV dynamics, all state-specific temporal dynamics were integrated and unsupervised learning (one-way hierarchical clustering with the Euclidean distance with weights in favor of recent temporal trajectories and the "ward.D2" agglomeration method) [35] was applied.

Given the physical proximity and constrained functionalities, residues are in high disequilibrium, forming haplotypes on a single-stranded RNA sequence ("VRV-haplotypes") [36] . Haplotype structures enable imputation similarly to SNP imputation. Ten residues were selected to form a "imputation set". For each selected VRV, its missing values were imputed based on their empirical haplotype frequencies. Remaining missing residues (if any) are denoted by a lower case "x". While technically, a missing residue sequence value cannot be distinguished from an AA deletion, we conjectured that AA deletions would be rare in Spike and treated all such cases as missing values.

A viral strain harboring multiple VRVs is referred to as a "VRV-haplotype". To identify VRV-haplotypes, unsupervised learning was used to organize both cases and VRVs through a two-way hierarchical analysis [35] .

The two-way hierarchical analysis used the parallel function "parDist" with the "tanimoto" distance measure and the function "hclust" with "ward.D2" algorithm in the R package. VRV-haplotypes were identified separately within each state/territory.

Hierarchically organized COVID-19 cases and VRVs are visually represented by a heatmap for every state/territory, which can be divided in grids by R geographic case groups (case1, case2, . . . , caseR) and by C VRV groups (vrv1, vrv2, . . . , vrvC).

As a matrix of VRV indicators was used to identify blocks of multiple VRVs among subsets of cases, these blocks are referred to as VRV-haplotype blocks, since individual VRVs within blocks may vary. Within each block, actual residues were used to construct VRV-haplotypes for estimating haplotype frequencies and linking haplotypes with viral strains. For evaluating the relationship of VRV-haplotypes with known variants, variant information for case sequences was retrieved from GISAID [29] .

After identifying specific Spike protein mutants of interest from VRVs and related VRV-haplotypes, standard homology modeling methods were applied to generate 3D models. Since all mutants examined in this study occurred in haplotypes harboring the D614G substitution, reference structures harboring this mutation (PDB codes 6ZWV; 7KRR; 7KRS) [25, 26] were used as templates. All model building and analysis was performed using the ChimeraX interactive molecular graphics package (Resource for Biocomputing, Visualization, and Informatics at UC San Francisco, CA, USA) [37] . Mutated side chain conformations were generated from the backbone-dependent rotamer library of Shapovalov and Dunbrack [38] and selected to minimize steric clash with neighboring residues. No additional structural refinement was performed. For homology modeling using the Spike receptor binding domain (RBD)-ACE2 complex, PDB code 6M0J [39] was used.

The continuous evolution of SARS-CoV-2 has already impacted public health guidelines and research priorities, with the potential of even more clinically consequential variants still to emerge. Here, we leveraged a public data resource and described a statistical learning strategy for analyzing large, complex SARS-CoV-2 sequence datasets while incorporating temporal and spatial information. We provided detailed information on the emergence and persistence (or disappearance) of specific mutations in US states/territories, helping identify mutations that may warrant further observation/investigation. Our approach can be applied to other pathogens for which sufficient genomic surveillance data are available, generating important, statistically rigorous, and visually interpretable information for the biomedical research community, clinicians and public health officials. Our approach can also provide insight on the evolution of mutants and linkages with known viral strains.

By applying the SLS method to 167,893 US sequences not classified as any VOI/VOC, we identified 77 novel individual VRVs, including 25 pressing VRVs that appear to have emerged in the US. Among these pressing VRVs, the haplotype (T478-D614-P681-T732) links with the strain B.1.1.222 and the haplotype (G142-E180-D614-Q677-S940) links with the strain B.1.234, both of which do not correspond to any current VOI/VOC. Additionally of note, if the SLS method is applied to all US sequences, all circulating VOIs/VOCs are identified (results not shown). Indeed, timely discovery of novel VRVs is an important feature of the SLS method, which directly identifies emerging VRVs without relying on established classification of viral strains, as reliable classification of new strains by phylogenic analyses requires a sufficient number of accumulated viral sequence data. We should emphasize that our analyses were based on samples collected from January 2020 through March 2021, so it is not surprising that many of the VRVs we identified involve mutations associated most closely with the Alpha variant (B.1.1.7), which was predominant in the U.S. and world-wide during this time period. However, we note that our analyses also identified a haplotype N6 ( Table 2 ) that exhibits the subsequently well-documented mutation P681R, characteristic of the Delta variant that has dominated infections world-wide during the latter half of 2021. During our analysis time period, this N6 haplotype was a minor, but statistically significant, fraction of the observed haplotypes in patient samples. Recent experimental studies [40, 41] indicate that this mutation enhances S1/S2 segment cleavage and makes a major contribution to the enhanced infectivity observed for the Delta variant. The fact that our analysis protocol identified this specific mutation long before the Delta variant became clinically relevant in the U.S. appears to support the contention that our method may be useful for the early recognition of potentially important mutations.

As part of the assessment of immune correlates of protection, many randomized, placebo-controlled COVID-19 vaccine efficacy trials measure Spike protein sequences from symptomatic COVID-19 endpoint cases, and sometimes also from SARS-CoV-2 asymptomatic infections. Sieve analysis of these viral sequences can be conducted to assess whether and how vaccine efficacy depends on Spike protein sequence features, including differential vaccine efficacy across the levels of VRVs and of VRV-haplotypes [42] . The graphical tools proposed here for spatiotemporal tracking of VRVs and VRV-haplotypes can be useful for sieve analysis. Firstly, they help define and communicate the set of VRVs and VRV-haplotypes of study endpoint cases that have sufficient variability to be able to assess whether vaccine efficacy depends on the feature. For example, given that most vaccines use the Wuhan strain as the vaccine-insert, VRVs that meet our Pmax > 0.10 criterion would readily have the level of variability required for sieve analysis, whereas VRVs with Pmax < 0.02 would likely not. Secondly, including the assignment to vaccine or placebo as a factor in the unsupervised clustering graphics applied to the vaccine efficacy trial sequence data sets may help communicate results of sieve analysis. Thirdly, many of the vaccine efficacy trials have been offering the vaccine to placebo recipients, such that the placebo arm is lost and long term follow-up occurs only in individuals originally vaccinated or newly (deferred) vaccinated [43] . The graphical tools may be applied to track study participant vaccine breakthrough virus VRVs and VRV-haplotypes over time, and to similarly track VRVs and VRV-haplotypes in GISAID data bases of unvaccinated persons matched by geography and time. A comparison of these two tracking results may aid sieve analysis during the long-term follow-up period of the vaccine efficacy trials.

Evidence is mounting that neutralizing antibodies acquired by natural infection [44, 45] or through vaccination [46] [47] [48] [49] are a correlate of protection against COVID-19. Therefore, it will be critical to assess whether and how VRVs and/or VRV-haplotypes in the infecting strains impact neutralizing antibody titers attained by natural infection [50] , as well as whether and how they impact neutralization sensitivity to vaccine-induced neutralizing antibodies [10] and/or monoclonal antibodies [51] . One possibility is that the graphical tools used here could annotate VRVs and VRV-haplotypes according to impact on neutralization.

Moreover, a subset of sieve analyses is designed to restrict to VRVs and VRV-haplotypes that are known to impact neutralization response to the given vaccine under study, to improve power and to contribute to understanding neutralizing antibody-based correlates of protection. Once VRVs or VRV-haplotypes that impact vaccine efficacy are identified and their impact is quantified, this information can be applied to inform models for predicting vaccine efficacy against circulating virus populations and to aid optimization of vaccine strain selection.

A limitation of our approach is that it is constrained by intrinsic sampling limitations because all sequences were collected and contributed by laboratories (Table S6) without consistent sampling protocols. Hence, despite the large size of our dataset, the analyzed sequences were not nationally representative. Further, it is important to interpret our results in terms of VRV proportions among reported sequence data, rather than incidences or prevalence of VRVs, in the absence of reliably estimated denominators. To overcome this limitation, public health agencies need to consider a uniformly developed surveillance protocol, to sequence COVID-19 cases from well-defined populations.

Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/v14010009/s1, Figure S1 : For all Spike residues with sufficient variation, scatterplots of the maximum proportion (Pmax) of sequences from a given state/territory harboring a mutation at a given amino acid position vs. q-value. Points in red represent residues that meet both criteria for classification as a VRV. Points in black represent residues that do not, Figure S2 : Temporal patterns of VRVs identified in each state/territory, Figure S3 : Locally averaged proportions over time for substitutions at 3 AA-subs in a VOI (Y144, F888, V1176) and at 3 AA-subs in a VOC (H69, Y144, K417) that were not detected by the SLS method in states/territories where at least three sequences had a substitution at the designated AA position. AA-sub, amino acid that has been shown to harbor a substitution in a US-circulating VOI or VOC. VOI, variant of interest; VOC, variant of concern. The "TP1" through "TP10" labels, along with the vertical color bar on the left side of the plot, refer to the 10 individual State-VRV clusters identified in the plot, based on their temporal patterns ("TP" = temporal pattern), Figure S4 : Presence of VRVs among all cases in each state/territory. A gray cell means the VRV was not identified in the given case; a black cell means that it was. Both cases and VRVs were clustered by two-way hierarchical cluster analysis. The "PG" ("patient group") and "RG" ("residue group") labels in the lower left corner of each plot correspond to the row (case) and column (VRV) clusters identified by two-way hierarchical cluster analysis, also identified by the vertical and horizontal color bars at the left and top, respectively, of each plot, Table S1 : Distribution of the 21,391 VOI/VOC sequences by specific variant and by state/territory, Table S2 : Distribution of the 167,893 SARS-CoV-2 sequences by state/territory and by GISAID submission month, along with state/territory-specific distribution of the 21,391 VOI/VOC sequences that were excluded from the analysis, Table S3 : The 10 identified geo-VRV clusters (TP1 through TP10), based on temporal profiles, Table S4 : Frequencies of the 90 viral residue variants (VRVs) by state/territory, from an unsupervised learning from bi-clustering of all states and VRVs, Table S5 : VRV-haplotypes identified within each state/territory, along with state/territory-specific frequencies. The "positivity" column indicates the proportion of mutations in each haplotype block, Table S6 : List of individuals from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based. Funding: This study was supported by the National Institutes of Health/National Institute of Allergy and Infectious Diseases (https://www.niaid.nih.gov/) through award UM1 AI068635 to PBG. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Institutional Review Board Statement: Not applicable.

Data Availability Statement: All sequence data analyzed here are publicly available at GSIAD (https://www.gisaid.org/, accessed on 23 March 2021). For this analysis, data were retrieved on 23 March 2021.

Why are RNA virus mutation rates so damn high?

SARS-CoV-2 Variant Classifications and Definitions

Transmission, infectivity, and neutralization of a spike L452R SARS-CoV-2 variant

Detection of a SARS-CoV-2 variant of concern in South Africa

Genomic characterization of a novel SARS-CoV-2 lineage from Rio de Janeiro, Brazil

Assessing transmissibility of SARS-CoV-2 lineage B.1.1.7 in England

Sensitivity of SARS-CoV-2 B.1.1.7 to mRNA vaccine-elicited antibodies

Reduced neutralization of SARS-CoV-2 B.1.1.7 variant by convalescent and vaccine sera

Multiple SARS-CoV-2 variants escape neutralization by vaccine-induced humoral immunity

Efficacy of the ChAdOx1 nCoV-19 Covid-19 Vaccine against the B.1.351 Variant

COVID-19 vaccines vs variants-determining how much immunity is enough

Immunity to SARS-CoV-2 variants of concern

Monitor for COVID-19 vaccine resistance evolution during clinical trials

Genetic characteristics and phylogeny of 969-bp S gene sequence of SARS-CoV-2 from Hawaii reveals the worldwide emerging P681H mutation

A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology

Nextstrain: Real-time tracking of pathogen evolution

Variant analysis of SARS-CoV-2 genomes

Variant analysis of 1040 SARS-CoV-2 genomes

Variant analysis of SARS-CoV-2 genomes in the Middle East

Partitioning the curve-interstate travel restrictions during the Covid-19 pandemic

Mutation D614G increases SARS-CoV-2 transmission

Tracking changes in SARS-CoV-2 spike: Evidence that D614G increases infectivity of the COVID-19 virus

SARS-CoV-2 spike-protein D614G mutation increases virion spike density and infectivity

Structures and distributions of SARS-CoV-2 spike proteins on intact virions

Structural impact on SARS-CoV-2 spike protein by D614G substitution

The basis of a more contagious 501Y.V1 variant of SARS-CoV-2

SARS-CoV-2 immune evasion by the B.1.427/B.1.429 variant of concern

Global initiative on sharing all influenza data-from vision to reality

The establishment of reference sequence for SARS-CoV-2 and variation analysis

Parallelization of MAFFT for large-scale multiple sequence alignments

Generalized Additive Models

Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: A unified approach

Smoothing parameter and model selection for general smooth models

Ward's hierarchical agglomerative clustering method: Which algorithms implement Ward's criterion?

A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase

Meeting modern challenges in visualization and analysis

A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions

Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor

SARS-CoV-2 Delta Variant Displays Moderate Resistance to Neutralizing Antibodies and Spike Protein Properties of Higher Soluble ACE2 Sensitivity, Enhanced Cleavage and Fusogenic Activity

Delta spike P681R mutation enhances SARS-CoV-2 fitness over Alpha variant

Sieve analysis to understand how SARS-CoV-2 diversity can impact vaccine protection

A deferred-vaccination design to assess durability of COVID-19 vaccine effect after the placebo group is vaccinated

Neutralizing antibodies correlate with protection from SARS-CoV-2 in humans during a fishery vessel outbreak with a high attack rate

SARS-CoV-2 seropositivity and subsequent infection risk in healthy young adults: A prospective cohort study

Evidence for antibody as a protective correlate for COVID-19 vaccines

Correlates of protection against symptomatic and asymptomatic SARS-CoV-2 infection

Immune correlates analysis of the mRNA-1273 COVID-19 vaccine efficacy clinical trial

Correlates of protection against SARS-CoV-2 in rhesus macaques

Evidence of escape of SARS-CoV-2 variant B.1.351 from natural and vaccine-induced sera

Resistance of SARS-CoV-2 variants to neutralization by monoclonal and serum-derived polyclonal antibodies

We gratefully acknowledge the individuals from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based. A full list of these individuals is provided in Table S6 .

The authors declare no conflict of interest.