key: cord-0519007-7bdfyutp authors: Natale, Joseph L.; Viswanath, Varun; Acevedo, Oscar Trujillo; Giottonini, Sophia Pérez; Hernández, Sandy Ihuiyan Romero; Millán, Diana G. Cruz; Palacios-Puga, A. Montserrat; Mandvi, Ammar; Khan, Brian M.; Lilik, Martin; Park, Jay; Smarr, Benjamin L. title: Dynamical clustering of U.S. states reveals four distinct infection patterns that predict SARS-CoV-2 pandemic behavior date: 2021-12-10 journal: nan DOI: nan sha: 8e45a84cf96367d0e9a0eba92a904d2361550e2a doc_id: 519007 cord_uid: 7bdfyutp

The SARS-CoV-2 pandemic has so far unfolded diversely across the fifty United States of America, reflected both in different time progressions of infection "waves" and in magnitudes of local infection rates. Despite a marked diversity of presentations, most U.S. states experienced their single greatest surge in daily new cases during the transition from Fall 2020 to Winter 2021. Popular media also cite additional similarities between states, often despite disparities in governmental policies, reported mask-wearing compliance rates, and vaccination percentages. Here, we identify a set of robust, low-dimensional clusters that 1) summarize the timings and relative heights of four historical COVID-19 "wave opportunities" accessible to all 50 U.S. states, 2) correlate with geographical and intervention patterns associated with those groups of states they encompass, and 3) predict aspects of the "fifth wave" of new infections in the late Summer of 2021. In particular, we argue that clustering elucidates a negative relationship between vaccination rates and subsequent case-load variabilities within state groups. We advance the hypothesis that vaccination acts as a "seat belt," in effect constraining the likely range of new-case upticks, even in the context of the Summer 2021, variant-driven surge.

The SARS-CoV-2 pandemic has claimed millions of lives globally, and directly affected hundreds of millions more, doing so in a highly unequal manner [1, 2].
Early foci for COVID-19, such as northern Italy, coincided with populations with high proportions of elderly, susceptible people [3]; the availability of hospital beds, treatments, and clinical personnel is again proving to be a major problem in many geographical areas [4, 5]; and new variants are emerging at different rates in different locales [6, 7]. Internal to the United States of America (at the time of writing, the nation with the highest recorded numbers of both "total" COVID-19 cases and COVID-19-related deaths), disparities also permeate local infection rates [8, 9]. This has prompted a number of explicit comparisons at the state level [10, 11]. In principle, such collations should be of great interest to public health decision-makers: to the extent that different states can be expected to behave in categorically different ways, information that supports early planning for interventional responses can be gleaned by mapping the states' diverse progressions over time. Publicly available data make it unambiguous that all fifty U.S. states saw their greatest sustained incidence of new cases during the country's last winter season (2020-2021) [12]. Furthermore, media outlets have directed attention to other, ostensible similarities: in the case of Florida and California, news coverage emphasized an attainment of "similar COVID-19 case rates" [13], or even allegedly "identical outcomes" [14], despite the drastically different measures taken by the independently-operating, state-level governments at the time of reporting. Still, in complex systems it is always possible that multiple parts or subsystems temporarily exhibit superficial similarities without sharing their underlying dynamics [15]. Whether or not the U.S.
state infection histories fall into universal similarity classes with relevance to pandemic planning (and, indeed, questions of how reliably government policy, and even vaccination rates, determine their ultimate case statistics) remains unanswered by these prior analyses. In this manuscript we consider one specific approach to demonstrate that the historical "infection trajectories" for all fifty U.S. states can be grouped according to some reasonable measure of similarity, such that resulting groups form a low-dimensional set (i.e., one much smaller than the number of individual states), and also prove useful for predicting the present, or even future, of this pandemic, not merely its past. As such groupings are established, any relevant information supporting early planning around responses to the states' diverse presentations can be contextualized and made available to public health decision-makers. The data referenced in the subsequent paragraphs were aggregated from sources developed by Johns Hopkins and Oxford Universities. The infection data, reported at daily resolution, were obtained from Johns Hopkins [16]. These included reports of total (confirmed and probable) cases, expressed as cumulative counts for each U.S. state. In order to transform these cumulative records into "daily new" infection counts for each state, we subtracted the previous day's total-cases record from each day's value, and then replaced the raw counts with the corresponding 14-day moving (rolling) averages attained by each day considered.
In so doing, we therefore neglected the effects of deaths and recoveries on the "daily new" case numbers; we justified this approximation by noting that a) the latter were often as small as a full order of magnitude less than the total cases on a given day, and b) isolated "negative" days (those for which deaths and recoveries presumably exceeded the new, emerging case count) would be absorbed, without a strong effect, by the aforementioned temporal averaging. Training data were amassed between April 2020 and July 2021. Since the different U.S. states encompass populations with markedly different sizes and densities, but our temporal picture seeks to distill all the states' histories into a form for which their synchrony and structure can be compared on equal footing, we normalized the aforementioned, raw estimates of the states' daily new infection counts by their respective maximum values (all observed to occur in the same time frame, Winter 2020-2021). This yielded a new set of case records for every state, with possible values between 0 and 1. Treating each as a time series, and applying 14-day moving (rolling) averages between April 13, 2020 and July 2, 2021 (a total of 446 calendar days), defined our U.S. "daily new" infection trajectories. An example trajectory, for New York state, is shown in Fig. 1, in blue. Daily new infection records for dates prior to April 13, 2020 are depicted for temporal context in Section III, but were not used for developing or training any models: in these early stages of the pandemic, we could not be confident that the reporting standards across states were of uniform or sufficiently high quality [17]. For instance, a lack of PCR testing supplies might have led to a greater proportion of "probable," vs. "confirmed," cases so early on.
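The preprocessing described above (differencing cumulative counts, smoothing with a 14-day window, and rescaling by each state's maximum) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `daily_new_trajectory`, the synthetic input, and the order of the smoothing and normalization steps are all assumptions made for the example.

```python
import numpy as np

def daily_new_trajectory(cumulative, window=14):
    """Turn cumulative case counts into a normalized 'daily new' trajectory.

    Hypothetical helper; the order of smoothing and normalization is assumed.
    """
    cumulative = np.asarray(cumulative, dtype=float)
    # Daily new cases: each day's cumulative total minus the previous day's.
    daily_new = np.diff(cumulative)
    # Trailing moving average, defined only once `window` points are available.
    kernel = np.ones(window) / window
    smoothed = np.convolve(daily_new, kernel, mode="valid")
    # Rescale by the state's own maximum (assumed positive) so the peak equals 1.
    return smoothed / smoothed.max()

# Synthetic cumulative counts (illustrative only, not real state data).
rng = np.random.default_rng(0)
toy_cumulative = np.cumsum(rng.poisson(lam=50, size=60))
trajectory = daily_new_trajectory(toy_cumulative)
print(trajectory.shape)
```

Because the moving average is only defined once a full window has been observed, the output is shorter than the input, which mirrors the loss of the earliest dates noted in the text.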
Some numerical artifacts resulting from performing 14-day averages on data that begins midway through the initial surge of cases in Spring 2020 were removed from the figures in Section III, by adjusting the plotting limits across the respective x-axes. During the course of this work, data were re-scraped to extend between March 6, 2020 and August 3, 2021. The example infection trajectory for New York (in Fig. 1) includes periods of "low" relative case loads, punctuated by several dramatic, and persistent, elevations. A central question in the present work is whether the temporal patterns associated with these periodic surges in the daily new cases fall into stereotypes across states. As a starting point for comparing the time progressions realized by different states, we modify a previously established convention, dividing the pandemic's history between the Spring of 2020 and Spring 2021 into consecutive "waves" [18]. Here, we reconceptualize these divisions in terms of annual, calendar seasons:

1. First wave: an initial case accumulation, roughly between March and May ("Spring") of 2020
2. Second wave: a subsequent resurgence, roughly between July and September ("Summer") of 2020
3. Third wave: the aforementioned "Winter peak," lasting from around November 2020 through March 2021, but reaching its apex between December 2020 and January 2021
4. Fourth wave: a recalcitrant surge during the Spring of 2021 that was noted in, but had not necessarily resolved by the time of publication of, Ref. [18]

In addition, we consider a "fifth" wave, encompassing more recent case load rises beginning in late July 2021 (Fig. 1). This upswing was reflected in the daily new case records for all fifty U.S. states, but the same was not universally true of the previous four waves. To emphasize explicitly that not all states participated equally in each of those four, we adopt the nomenclature "wave opportunity" to refer to the potentially-eventful spans of time with persistent elevations.
As one example, New York expressed strongly the "first" wave opportunity, but the "second wave" seen in other parts of the country, as defined above, was so small relative to the former that New York effectively progressed without participation in it. To describe the climbing infection rates in the Winter months, we refer here not to "New York's second wave," but to "New York's (degree of) participation in the third U.S. wave opportunity" (see again Fig. 1). Since we are excluding data before April 13, 2020, and our rolling averages are fully-defined only after fourteen points have been observed (i.e., from April 26, 2020 on), the first wave opportunity is not considered in our analyses. In addition to infection data, we aggregated data on vaccine administration rates and government policy "stringencies" during the pandemic. We obtained vaccination data from repositories published by the Johns Hopkins University Coronavirus Resource Center [16], and curated by Our World in Data [19]. We incorporated government policy data compiled by the Blavatnik School of Government, University of Oxford [18]. The percentage of "fully-vaccinated" individuals residing in each state (our chosen proxy for progress toward a local "herd immunity" status) was estimated by, first, extracting the number of people recorded as having received either two doses of a two-dose series (Pfizer-BioNTech or Moderna) or one dose of the single-shot (J&J/Janssen) vaccine against COVID-19 available in the U.S., at each calendar day between February 1, 2021 and July 22, 2021, inclusive. This time range, distinct from (but still overlapping with) that of our infection data, consisted of 172 days. We aggregated tabulated counts by state, and then divided by the number of residents in each state. Although U.S. state population levels do change over time, we made the tacit approximation of slow change over the course of the pandemic: dividing by the "2021" population counts given in Ref.
[20] allowed us to arrive at a cumulative percentage of fully-vaccinated residents in each state as a time series -a vaccination-rate analog of our infection trajectories. The vaccination-rate trajectory for New York is presented as a red curve, normalized between 0 and 1, in Fig. 1 . A "Stringency Index" score for comparing government policies across states was developed by previous authors, to summarize the "strictness of 'lockdown style' closure and containment policies that primarily restrict people's behavior" [18] . Stringency scores could range, in principle, from a minimum value of 0 to a maximum of 100 for a given state; rescaling the stringency time series values provided a third, government-policy trajectory, normalized to the same vertical scale as the infection and vaccine trajectories (between 0 and 1). Policy trajectories began much earlier than daily-infection and vaccination-rate trajectories, on January 1, 2020 (before the "start" of the pandemic in the U.S.) and ended on July 2, 2021 -a total of 549 days. A gray curve denotes the policy trajectory in Fig. 1 . Given the above considerations, we hypothesized that i) if placed on some even footing, the n = 50 U.S. "daily new infection" trajectories should collapse into a set C of archetypal forms with cardinality |C| ≪ n. That is, we set out to verify whether the records for all 50 states could be summarized by -and compressed into -a handful of discrete clusters, each associated with a particular timing and structure (relative peak heights) for all those distinct "wave opportunities" encompassed by the available data, between April 2020 and July 2021. We then investigated ii) whether one such partitioning might prove useful, not only for summarizing the states' dynamics in the past, but also for predicting aspects of the subsequent, ongoing surge, attributed largely to the B.1.617.2 ("Delta") variant [21] . 
"Daily new infection" trajectory values spanning the time range between April 12th, 2020 and July 2nd, 2021 were clustered in Python 3, using the SciPy Clustering package. Specifically, we performed hierarchical / agglomerative clustering via the named "linkage" function in the hierarchy module of SciPy (v. 1.4.1). A total of 446 data points per state, each representing a trajectory value for 1 calendar date, served as primary input to this function. We used a "complete" (i.e., Farthest Point, or Voor Hees, Algorithm) linkage method to compute cluster distances. In what follows, this distance was measured by the (Pearson) correlation (see [22] ) between respective pairs of time series. Using the named "fcluster " function of the same SciPy module, we obtained the set of "flat" clusters that would result from "slicing" the discovered hierarchy at a specific threshold value of our correlation-based distance metric. This same process could be repeated to cluster other types of data -e.g., percent-fully-vaccinated "trajectories". Using the named function "pca," we extracted the first 50 principal components. Since these principal components are essentially linear combinations of the input features, each represents a weighted sum of contributions from the rolling, "daily new infections" on different dates; therefore, we used their feature importance rankings to reveal the subsets of dates during the pandemic associated with the greatest degrees of variation across all 50 U.S. states. Feature importance, in the context of one given PCA-derived eigenvector (or, "principal component"), was scored according to absolute values: the vector components with highest (relative) absolute values were ranked as the most important. In order to arrive at a more all-encompassing set of dates that took into account the importances from the first several principal components k ∈ {1, . . . 
k 0 }, we also took a step back from the usual abstraction of principal components as (ideally, orthogonal) eigenvectors, and treated each instead as a general "importance-amplitude" vector, having a component for each of the 446 dates. Summing these importance-amplitude vectors, component-wise, resulted in a new curve that we used to represent "overall," daily importances. In order to select an appropriate value for k 0 , we examined the cumulative percentage of "explained variance" associated with the first several principal components k ∈ {1, 2, . . . , 10}; these values are reported in Table I, along with the "errors" corresponding to reconstructions of the 50 original observations (our trajectories) that were built using all principal components from 1 to k 0 , inclusive. The "error" at each time point was measured via the mean absolute deviation between the original and reconstructed trajectory value, averaged over states. The cutoff for choosing k 0 in terms of the cumulatively explained variance was set at a value of 90%. In the context of the "overall" scheme above, dates were considered "important" if the absolute value of a given cumulative importance curve exceeded the mean value of that curve. An upper limit of k = 50 (the minimum between the number of observations, 50, and the number of features, 446) principal components was permitted by the "pca" function. The results for the most important date ranges inferred via PCA are depicted in Section IVc, contextualized by their superposition over all n = 50 original state trajectories, with a common axis. There is no intrinsic guarantee that PCA should discover date ranges that are contiguous, for a generic time series decomposed in this manner. In the course of performing trajectory reconstructions, we verified that those reconstructions built using just the first, k = 1 principal component collapsed to a single curve.
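The PCA bookkeeping described above (choosing k 0 from the cumulative explained variance, scoring dates by summed component amplitudes, and measuring reconstruction error) can be sketched with a plain SVD. This is an illustrative sketch only: the random matrix `X` stands in for the real 50 × 446 trajectory matrix, the text does not say which library supplied its "pca" function, and the way the "importance" threshold is applied here is one possible reading of the description.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_dates = 50, 446
# Hypothetical stand-in for the 50 x 446 matrix of normalized trajectories.
X = rng.random((n_states, n_dates))

# PCA via SVD of the mean-centered data (rows = states, columns = dates).
mean_curve = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean_curve, full_matrices=False)
explained = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative explained variance reaches 90%.
k0 = int(np.searchsorted(cumulative, 0.90) + 1)

# "Overall" importance: component-wise sum of the first k0 component vectors,
# taken in absolute value and compared with the mean of the resulting curve
# (assumed interpretation of the thresholding rule).
overall_importance = np.abs(Vt[:k0].sum(axis=0))
important_dates = np.flatnonzero(overall_importance > overall_importance.mean())

# Reconstruction from the first k0 components, with the mean absolute
# deviation per date averaged over states.
X_hat = mean_curve + (U[:, :k0] * S[:k0]) @ Vt[:k0]
mad_per_date = np.abs(X - X_hat).mean(axis=0)
print(k0, important_dates.size, mad_per_date.mean())
```

On random data the value of k0 is large and the "important" dates are scattered; the contiguous ranges reported in the text are a property of the real trajectories, not of the procedure.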
At this "coarsest" level of resolution, PCA thus led to a description of the system that was identical for all n states. Isolating just the most prominent peaks within this curve ("find_peaks" function, signal processing toolbox, SciPy v.1.4.1, with the "prominence" parameter set to the minimum-observed value of that single, k = 1 reconstruction curve, 0.05) allowed us to compress our "k = 1 reconstruction" into a handful of values representing its most pronounced excursions above its minimum. The dates associated with these maxima were then cross-referenced with those of the greatest overall "importance," according to the first k 0 components cumulatively (see Sec. III C 3). Since the k = 1, reconstructed trajectory represented a single, unifying summary of all n original trajectories, we referred to this curve as the "master model." The master model served as our minimal model of the archetypal daily new infection trajectory, for an arbitrary U.S. state. We quantified the errors associated with assuming that all states followed a given master-model behavior exactly. These included i) the extent to which the full, n = 50 master model failed to capture nuances in the n individual trajectories from which it was inferred, and ii) how well the aforedescribed compression, consisting exclusively of the master model maxima, captured information conveyed by the master itself. To do so, we computed the mean absolute deviations between i) the 446-day, n = 50 master and all the individual state trajectories used in its creation, and ii) the identified maxima in the master and the corresponding trajectory heights at the associated dates for the individual states. For both cases, we report the averages of these deviations, across all included states, in Section III C 3.
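The peak-isolation and error measurements above can be illustrated with `scipy.signal.find_peaks`. The sketch below uses a made-up two-bump "master" curve and noisy synthetic copies of it in place of the real state trajectories; only the prominence value of 0.05 is taken from the text, and everything else is an assumption for illustration.

```python
import numpy as np
from scipy.signal import find_peaks

# Hypothetical "master" curve with two bumps, plus noisy per-state copies.
t = np.arange(446)
master = 0.3 * np.exp(-((t - 100) / 20.0) ** 2) + np.exp(-((t - 270) / 30.0) ** 2)
rng = np.random.default_rng(2)
states = np.clip(master + 0.05 * rng.standard_normal((50, t.size)), 0.0, None)

# Keep only maxima whose prominence is at least 0.05.
peak_idx, props = find_peaks(master, prominence=0.05)

# (i) Mean absolute deviation between the master and each full trajectory.
mad_full = np.abs(states - master).mean(axis=1)
# (ii) The same deviation, evaluated only at the master's peak dates.
mad_peaks = np.abs(states[:, peak_idx] - master[peak_idx]).mean(axis=1)

print(peak_idx, mad_full.mean(), mad_peaks.mean())
```

Repeating the same two measurements with a separate master curve per cluster would give the within-cluster errors that the text compares against the all-state values.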
To characterize improvements afforded by invoking our dynamical clustering, we repeated this process (creation and compression of a master model, and an evaluation of the errors accrued when naïvely assuming this archetypal behavior for all the states it summarizes) separately for each cluster, for both the 446-day and maxima-only cases. Averages of government policy stringency trajectories and vaccination-rate trajectories (introduced in Section II A) were created for each cluster having at least two members by taking the means, across all cluster members, over time. That is, we separated states according to a learned cluster partitioning C and computed the mean, standard deviation and standard error (of the mean) over the constituent states for each cluster separately, by day. To develop a framework for discerning whether the ∼ |C| resulting policy (or vaccination-rate) "average trajectories" exhibited statistically significant differences from one another (as well as to reduce overfitting and more reasonably approximate sample independence), we chose not to compare all daily stringency values, but rather made the approximation that the trajectory values from the first day of each month could themselves serve as independent observations on a given cluster-system. We then performed a Kruskal-Wallis test, with the number of samples equal to the number ∼ |C| of clusters with at least two state members, drawing observations from the first day of each month between May 2020 and May 2021 (inclusive, so as to associate a total of exactly 13 observations with each group). As detailed below, vaccination trajectories can be shown to become (visually) distinguishable only after a certain calendar date in the history of the pandemic. We first performed a dynamical clustering for all vaccination-rate trajectories in the manner described in Sec.
II B for infection trajectories; we then created "average vaccination trajectories" for all the (previously identified) infection-trajectory clusters with at least two states as members, and identified where the vaccination-trajectory slopes were changing by studying their first and second derivatives (approximated with the "diff" functions of NumPy and Pandas in Python 3). Finally, we performed a Kruskal-Wallis test to determine whether the vaccination-rate values on July 22, 2021 (a date by which the rates for all states were as well-separated as they would be by the end of our data) differed significantly between the aforementioned infection-derived clusters. We report both test statistics H and p-values, assessed at the 5% significance level, in Section III C 4, as well as the results of a post-hoc test (Python implementation of Conover's test as part of the scikit-posthocs package, using "holm" step-down method with Sidak adjustments to p-values) to infer which of the distinct pairs of groups are statistically distinguishable. All the methods discussed to this point have been aimed at establishing that all n = 50 U.S. daily new infection trajectories fall into a low-dimensional set, C; in Section III, we infer an appropriate value for |C| (i.e., a partitioning that sufficiently distinguishes coarse differences in "wave" histories). We also wish to assess the usefulness of a cluster set C for efficiently summarizing U.S. history (and its potential grounding in reality, beyond its practical use for data compression) according to how it recapitulates patterns in geography, government stringency, and vaccination rates. In Section III C 4, we see the vaccination-rate trajectories (percentages of fully-vaccinated individuals represented in each U.S. state's population) diverge from one another markedly (and then remain divergent) only following the late Spring of 2021.
Furthermore, those vaccines against COVID-19 available in the period described by our study (April 2020 to July 2021) require several weeks' time to accomplish even partial seroconversion [23, 24]. We therefore studied the degree to which the average vaccination rates within each cluster, approximately two weeks before the end of our data, predicted daily-new-case statistics in the corresponding clusters on the last full day of data. We performed tabulations of both the July 22, 2021 vaccination rates and the August 3, 2021 "daily new infection" records, by cluster, according to the set C learned and fixed in Section III B 1. We then investigated the relationship between within-cluster vaccination rates and i) the within-cluster mean of states' infection trajectory values and ii) the within-cluster standard error of the mean for these (relative) new-case loads, at an approximate two-week lag, by means of both Spearman rank-order correlations and two linear regressions. For both types of analyses, we considered only states that belonged to clusters encompassing at least two states, omitting "singleton" clusters that consisted of only a solitary state (see Section III B 1). To further quantify linear trends, we read off the output "rvalue" of the "linregress" function in SciPy's statistical functions module, which is the linear (Pearson) correlation coefficient r; its square gives the r 2 value.

The state of New York exhibited a strong participation in the first, third, and fourth wave opportunities (Fig. 1), with essentially no reprieve between the latter two; New York was also among those states exhibiting a relative

As the cluster threshold is varied from its minimum (0) to its maximum possible value (1), the initial hierarchical clustering of our infection trajectories converges rapidly to resolve just a handful of large clusters. At threshold 0.2, we resolve only 8 multi-state clusters (Fig.
2a); approaching threshold 0.43 and beyond, clusters tend to grow in size, while |C| decreases concomitantly. Once the threshold values reach 0.6, representing a temporal cross-correlation of at least 1 − 0.6 = 0.4 (40%) between any pair of constituent trajectories internal to a cluster, there remain only three big clusters and two outlying states: California and Hawaii both form their own, standalone, singleton "clusters." The 0.43 threshold represented the highest threshold for which non-singleton clusters were still of comparable size. In particular, whereas the three large clusters at the 0.6 threshold are composed of 12, 12, and 24 states, those at threshold 0.43 have sizes 12, 12, 8, and 16. Although it is possible for the distinct temporal patterns characteristic of the 50 states to fall into differently-sized clusters, we expect that power can be maximized by comparing the statistics of uniformly-sized clusters (the size distribution associated with highest entropy [25]); by this heuristic, we chose to work predominantly with the clusters resolved at this intermediate threshold value. To visualize salient disparities in the states' temporal dynamics that separate the corresponding, |C| = 4 major clusters at this "locked" threshold, we plot mean infection trajectories for each cluster, color-coded by proximity in correlation-space (Fig. 2b); these color-coded data, taken from the training set that ran from April 2020 to July 2021, were superimposed upon gray curves representing the same mean trajectories, but updated from Ref. [16] to run between March 2020 and August 2021. Commonalities among the mean, cluster-wise trajectories include the aforementioned Winter 2020-2021 maxima, as well as the presence of smaller peaks in the Spring and Summer seasons.
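The thresholding procedure discussed above can be sketched with SciPy's `linkage` and `fcluster` functions, using complete linkage on a correlation distance (one minus the Pearson correlation between rows). The trajectories below are synthetic: three hypothetical archetypes with differently timed peaks, each copied ten times with added noise. Only the linkage method, the distance metric, and the 0.43 threshold are taken from the text.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 446)
# Three hypothetical archetypes with differently timed peaks.
archetypes = [np.exp(-((t - c) / 0.08) ** 2) for c in (0.2, 0.5, 0.8)]
# Ten noisy "states" per archetype (30 synthetic trajectories in total).
X = np.vstack([a + 0.05 * rng.standard_normal(t.size)
               for a in archetypes for _ in range(10)])

# Complete (farthest-point) linkage on correlation distance between rows.
Z = linkage(X, method="complete", metric="correlation")

# Slice the hierarchy at a fixed distance threshold to obtain flat clusters.
labels = fcluster(Z, t=0.43, criterion="distance")
sizes = np.bincount(labels)[1:]
print(sizes)
```

With complete linkage, every pair of members in a flat cluster lies within the threshold distance, which is why a threshold can be read as a lower bound on the pairwise correlation inside each cluster.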
Although our correlation-based clustering was intended primarily to address the timing, or synchronicities, of realized wave opportunities, it also highlights some salient differences in the magnitudes, or relative heights, of those opportunities. The distinguishing properties follow. Cluster 1 alone saw a major, initial surge of new cases in Spring of 2020 -the first wave opportunity -while the states in Clusters 2 and 3 saw their first collective spike only weeks later, during the second wave opportunity in the Summer of 2020. Cluster 4 is distinguished from Clusters 1-3 by its effective lack of either a Spring or a Summer 2020 surge, with the third, Winter wave opportunity playing host to its first major elevation. At a finer resolution, this principal, Winter maximum is observed to concentrate early (vicinity of December) for Cluster 4, with only a modest resurgence of new infections shortly thereafter; other clusters exhibited a different, more multiply-peaked structure, as well as comparatively later peaks, in that season. The closest pair in correlation-space -Clusters 2 and 3 -share most of their coarse features, with aspects of this multiply-peaked Winter setting the latter apart from the former. All U.S. states were observed to have partial resurgences in Spring 2021, and every single state experienced an upswing in Summer 2021 (not explicitly depicted), with pronounced variability even within clusters by August (Fig. 2b) . Collectively, these patterns in the realized heights for each wave opportunity -that is, the (relatively) mediumsized first wave, essentially absent second wave, and very large third wave expressed by Cluster 1; then, the absent first wave, medium-sized second wave, and very large third wave for Clusters 2 and 3; and then, the absent first and second waves, but large third wave for Cluster 4 -provide some degree of intuitive explanation for why these particular clusters were resolved. 
We quantify this intuition by focusing on specific, "important dates," in Section III C 3. Even given the robust set of qualitative dynamical patterns teased out by means of ocular analysis in the previous section, the learned partitioning C still eludes practical explanation: are there other, non-dynamical patterns associated with the states that were grouped together that recapitulate the learned clustering? In order to visualize the spatial arrangement of those states whose trajectories clustered together at various thresholds, as one such pattern, we colored the states on a U.S. geopolitical map, according to cophenetic similarity (correlation-space distance) as in Fig. 2b . Adjacent, or neighboring, U.S. states were much more likely to be clustered together (Fig. 2c) . With a few exceptions (including southern states, at the 0.43 threshold, that are separated by land but still connected by water), the majority of the clusters at all three thresholds unite states that lie along relatively uninterrupted, existing geographical paths. As those clusters resolved at small thresholds merge to form larger clusters at larger thresholds, the latter begin to reveal clear correspondences to established U.S. geopolitical regions; these are visible at our "locked," 0.43 threshold. In particular, our Cluster 1 predominantly describes states that are traditionally subsumed under "the Northeast". Cluster 2 describes "the South," encompassing both the coastal southeastern border and several states along the southern rim of the country. Cluster 4 represents "the Midwest," only exchanging the states of that region's traditional, southernmost border for the three on its western border. Notably, the two outliers -California and Hawaii -are also geographical extremes. 
The remaining Southern, Western, and Central states form one, comparably-sized cluster that (with the exception of Colorado) sweeps a clean arc from roughly the latitudinal center of the country to Washington State; this "Rest" of the states, comprising Cluster 3, is geospatially contiguous (and, at higher thresholds, merges) with its most proximal neighbor in correlation-space, Cluster 2. It might be reemphasized that all clustering operations were done dynamically, that is, according to correlative similarities between the respective time courses of "daily new case" loads associated with each U.S. state. No explicit spatial or geographical clustering was considered; rather, any geographical patterns emerged as consequences of the dynamical clustering described in Section IIIb.

2. Date ranges most explanatory for telling states apart coincide with peak times for whole-U.S. infection "waves"

Analyzing the (cumulative) percentages of variability that can be "explained" by incorporating exclusively the first k principal components in reconstructions of the daily new infection trajectories (Table I), it is apparent that k 0 = 6 principal components suffice to capture over 90% of the total variance considered by PCA. Reconstructions using the first k = 10 components produce trajectories almost indistinguishable from the originals (not depicted). Feature importance analysis suggests that the key dates for explaining trajectory variance encompassed long periods from November to December 2020, and January to mid-February 2021, with weaker contributions in the Spring of 2021 and even late Summer 2020 (that is, all the "important dates" comprised continuous ranges) for the k = 1 component (Fig. 3a). The same is true for the successive components k ∈ {2, 3, 4, 5}. For instance, the high-prominence ranges for the k = 2 component coincide with the time-continuous Summer 2020 and Spring 2021 waves.
Two lower-prominence peaks in the k = 2 importance plot describe a more subtle "fine-tuning" to temporally briefer, Winter fluctuations. The overall importances imputed for each date, based on the first k 0 = 6 components, also manifest as continuous stretches in time. Above-average values spanned only three, broad date ranges (Fig. 3b ) across the second, third, and fourth wave opportunities. These three ranges all corresponded closely to established date ranges for whole-U.S. peaks in daily new infection records (see discussion of the master curve in Sec. III C 3 below, or Ref. [18] ), implying that the time periods surrounding global, national peaks, were also those of the greatest local, state-level variation. Excluding the Spring of 2020 for the reasons discussed in Sections II A 1 -II A 2, all the "wave opportunities" are represented by maxima in the master curve (red vertical lines in Fig. 3c) . At the prominence level used to detect peaks here (Section II C 2), the multiply-peaked, third wave opportunity splits into two separate maxima for consideration. Separating the n states by cluster and inferring a master model for each cluster separately recovers the "average" trajectories for each cluster (see Fig. 2b ). 
Reconstructing individual state trajectories using their respective master models can reduce reconstruction error, from at least two perspectives: 1) the compressed, maxima-only representation and the full, 446-day trajectories should be reconstructed more accurately when a master-model construction is done separately for each cluster, as opposed to all n states together (mean absolute deviations between the states' original and reconstructed trajectories should tend to decrease); 2) the error between an original trajectory and the "tailored" master made for its containing cluster should diminish, even if no explicit reconstruction is done (the mean absolute deviation between any cluster-averaged trajectory and the original trajectory for a given state in the corresponding cluster should be smaller than that between the original trajectory and the full, "n = 50" master model). Measurements for the latter (Table II), based on either the full, 446-day curves or the 4 key, "important" dates discerned above, confirm that errors are smaller within clusters, but also suggest that encoding the 446-day trajectories as a mere 4-vector of amplitudes at the important dates accomplishes a fair compression of states' infection histories: errors for the 4-date encoding are only slightly higher. Since all error values are of order ∼ 0.1, and cluster memberships are the only inputs needed to create a master model (beyond the data themselves), specifying a state's cluster membership (or, equivalently, its geographical region, according to Fig. 2c) amounts to a prediction of that state's normalized trajectory heights at each wave opportunity with a collective, net error of about 10% of the height of the corresponding Winter peak, on average. The geographical considerations above (Fig. 2c) support the notion that there might be "real-world" correspondences to the four abstracted clusters identified in Sec. III B 1.
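The second error comparison described above (cluster-specific master versus the all-state master) can be sketched as follows. This is a minimal illustration under the assumption that each master model is a simple mean trajectory; the state labels, cluster labels, and numbers are hypothetical toy values, not our data.

```python
def mean_curve(trajectories):
    """Point-by-point mean of equally long trajectories (assumed master model)."""
    length = len(trajectories[0])
    return [sum(t[k] for t in trajectories) / len(trajectories) for k in range(length)]


def mean_abs_dev(a, b):
    """Mean absolute deviation between two equally long trajectories."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def reconstruction_errors(trajectories, clusters):
    """Average per-state error against the global master and the cluster master.

    trajectories: dict mapping state name -> list of normalized heights
    clusters:     dict mapping state name -> cluster label
    Returns (global_error, cluster_error).
    """
    global_master = mean_curve(list(trajectories.values()))
    cluster_masters = {}
    for label in set(clusters.values()):
        members = [trajectories[s] for s in trajectories if clusters[s] == label]
        cluster_masters[label] = mean_curve(members)

    states = list(trajectories)
    global_error = sum(mean_abs_dev(trajectories[s], global_master) for s in states) / len(states)
    cluster_error = sum(
        mean_abs_dev(trajectories[s], cluster_masters[clusters[s]]) for s in states
    ) / len(states)
    return global_error, cluster_error


# Hypothetical toy data: heights at three "important dates" for four states.
TOY_TRAJECTORIES = {
    "S1": [1.0, 0.0, 1.0],
    "S2": [1.0, 0.2, 0.8],
    "S3": [0.0, 1.0, 0.0],
    "S4": [0.2, 1.0, 0.2],
}
TOY_CLUSTERS = {"S1": "A", "S2": "A", "S3": "B", "S4": "B"}

global_err, cluster_err = reconstruction_errors(TOY_TRAJECTORIES, TOY_CLUSTERS)
```

The same functions apply unchanged to full-length daily trajectories or to short vectors of amplitudes at the important dates, since only the list length differs.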
A real-world relevance for these clusters is further supported by their ability to find discriminatory patterns in two other metrics, neither of which was included in their definition: government policy stringency scores and vaccination rates (shown for temporal context in Fig. 3d; see Section II A 3). For all Clusters 1–4, the Spring of 2020 begins with a rapid increase in stringency that reaches an all-time high around April, despite the fact that not all states realized an equally strong "first wave" opportunity (see Fig. 2b). It is clear that i) average stringency scores are highest for the Northeast, and lowest within the Midwest, through time; ii) the remaining clusters are less distinct from each other than they are from Cluster 1; iii) Clusters 2 and 3, the most overlapping, sustained stringencies intermediate to those of the Northeast and Midwest for the majority of the time. Repeating this analysis for the vaccination-rate trajectories again highlights the Northeastern states as the highest-scored cluster, and reemphasizes the high degree of similarity between the Southeastern rim and the "Rest" of the Southern, Central and Northwestern regions that we had initially established via infection-trajectory clustering. In contrast to the overall pattern observed for stringencies, these latter groups are the lowest-ranked for vaccination. Statistical hypothesis testing (four-sample Kruskal-Wallis, with 13 observations in each group) suggests that the monthly values for the mean stringency scores (H = 17.6, p = 5 · 10^{-4}) and vaccination rates (H = 31.9, p = 5 · 10^{-7}) are not identical across the four clusters. Adjusted p-values for the various possible pairings suggest that Cluster 1 was distinct from both Clusters 2 and 3 (p = 2 · 10^{-2}) as well as from Cluster 4 (p = 5 · 10^{-5}), but that Clusters 2 and 3 were statistically indistinguishable from each other (p = 9 · 10^{-1}), and possibly even from Cluster 4 (p = 1 · 10^{-1}).
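For readers who wish to check how an H statistic of this kind maps to a p-value, a minimal hand-rolled sketch of the four-sample Kruskal-Wallis test is given below. It is an illustration only: the toy groups are hypothetical, the p-value uses the closed-form chi-square tail for exactly three degrees of freedom (four groups), and in practice a library routine such as `scipy.stats.kruskal` would normally be preferred.

```python
import math


def average_ranks(values):
    """1-based ranks with ties sharing their average rank; also returns tie sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        tie_sizes.append(end - start + 1)
        start = end + 1
    return ranks, tie_sizes


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H (undefined if all pooled values are equal)."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks, tie_sizes = average_ranks(pooled)
    weighted = 0.0
    offset = 0
    for g in groups:
        rank_sum = sum(ranks[offset:offset + len(g)])
        weighted += rank_sum ** 2 / len(g)
        offset += len(g)
    h = 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)
    correction = 1.0 - sum(t ** 3 - t for t in tie_sizes) / (n_total ** 3 - n_total)
    return h / correction


def chi2_sf_3dof(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Hypothetical toy data: four fully separated groups of 13 "monthly" values.
TOY_GROUPS = [[100 * g + month for month in range(13)] for g in range(4)]
toy_h = kruskal_wallis_h(TOY_GROUPS)
toy_p = chi2_sf_3dof(toy_h)
```

Evaluating `chi2_sf_3dof` at the reported H = 17.6 and H = 31.9 gives values of roughly 5 · 10^{-4} and 5 · 10^{-7}, consistent with the p-values quoted above for three degrees of freedom.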
The vaccination trajectories can be observed to diverge appreciably only around the start of April, and even then a hierarchical, correlation-based clustering (Section II B) unifies all n = 50 trajectories at a "distance" of ∼ 0.025, a full order of magnitude smaller than that of the most tightly-correlated merge (Cluster 2 in Fig. 2b) among the infection trajectories. Also, numerical differentiation (not depicted) reveals that the only major, dramatic slope changes occur around April 1 and May 1, 2021, so that later divergences are largely explained via higher vaccine "adoption rates," on average, in the early stages of availability among states belonging to certain clusters. All this evidence for a "low

Consider, as an analogy, the adoption of seat belts as a measure of protection against severe injury. Seat belts do not insure against the acquisition of all wounds, but are designed to be good at decreasing the probability of fatalities. One cannot guarantee survival in the worst of accidents by using a seat belt, nor is one ever bound to acquire severe injuries when failing to fasten it; yet the range of appreciably likely injuries that one might receive, in the context of common accident configurations, can be constrained by wearing a seat belt. The pattern suggested by Fig. 4 is similar: among those states belonging to clusters for which vaccination was more widespread in late July, the range of relative trajectory heights two weeks later, at the beginning of August, is observed to be smaller. In that short time period during which vaccination trajectory values differed significantly among state clusters, these values were also predictive of yet unseen, "future" infection rates.

Here, we demonstrate that dynamical clustering allows separation of all 50 states into a small number of clusters (we settled on four, with two outliers).
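The notion of a merge "distance" in correlation-based hierarchical clustering can be made concrete with the short sketch below. It assumes a dissimilarity of one minus the Pearson correlation and average linkage; the linkage rule is an assumption made for illustration, and the three toy trajectories are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """Dissimilarity in [0, 2]: 0 for perfectly correlated trajectories."""
    return 1.0 - pearson(x, y)


def agglomerate(trajectories):
    """Naive average-linkage clustering.

    trajectories: dict mapping name -> list of values
    Returns a list of merges (members_a, members_b, height), in merge order.
    """
    clusters = [(name,) for name in trajectories]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [
                    correlation_distance(trajectories[a], trajectories[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ]
                height = sum(dists) / len(dists)
                if best is None or height < best[0]:
                    best = (height, i, j)
        height, i, j = best
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical toy data: A and B rise and fall together, C does the opposite.
TOY = {
    "A": [0.0, 1.0, 2.0, 1.0, 0.0],
    "B": [0.0, 1.1, 2.0, 0.9, 0.0],
    "C": [2.0, 1.0, 0.0, 1.0, 2.0],
}
toy_merges = agglomerate(TOY)
```

The height of the final merge is the "distance" at which all trajectories are unified; a very small final height, as reported above for the vaccination trajectories, indicates that every trajectory is almost perfectly correlated with the others.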
Meanwhile, Principal Components Analysis corroborated the existence of several distinct "waves" of COVID-19 infections nationally. These encompassed four key dates, which served to build accurate compressions of cluster "average trajectories" with little other information, showing that a low-dimensional model can capture much of the variation in relative COVID-19 case loads over time. The fact that our set of clusters C happened to fall into established geographical regions, and showed correlations to state stringencies and vaccination rates, further supports the possibility that the clusters we discovered accurately reflect reality. Models trained using states' data through June 2021 were useful in predicting standard deviations of cluster case loads in the subsequent "fifth wave" from July through September 2021. One parsimonious explanation is that the vaccination rates do correlate with case loads, but that greater usefulness lies in their (positive) correlation with the predictability of case loads in our period of interest. Based on all these results, we developed the "seat belt" hypothesis, founded on the comparison that, while seat belts don't prevent accidents, they do restrict the likely range of adverse outcomes. In this vein, high vaccination rates may not prevent COVID-19 from affecting a state, but states with lower vaccination rates might expect a higher potential ceiling for their daily new case loads. Our results are consistent with previous analyses that identified four historical waves [18], and a role for geography in identifying states with analogous pasts [26], although our findings recapitulated both without explicit geographic input to our models. Nevertheless, there are several important considerations for interpreting our own results in comparison with other analyses. First, we have normalized all case rates to one peak at the end of 2020, so that our results relate to relative change and not absolute case loads.
That is, we clustered states not based on how many cases they have recorded, but on the synchronous changes-of-shape in their infection patterns over time. Additionally, we developed models only of reported cases, and different results might have arisen had we focused on other metrics, such as deaths or hospitalizations. Similarly, many more data sources could conceivably be added to this model and might change the specific predictions. We chose not to pursue a more exhaustive data-scraping approach, instead focusing on the capacity to make useful predictions with a simple model without either wide-spread exploration or explicit notions of causality, both of which could consume arbitrary amounts of time and effort. As yet more new data become available, it will be possible to "test" the hypothesis that higher vaccination rates translate to a restricted "likely range" of new case loads generally. Here, too, future researchers must exercise caution. For example, whereas our vaccination-rate trajectories remain similar over long stretches of time, a more exhaustive analysis would have to rule out the possibility that infection dynamics bifurcate into qualitatively different behaviors if local vaccination rates reach a certain (even sub-herd immunity) value. On this note, we emphasize that our state- and state-cluster-level results might be superseded by different insights at, for instance, the level of U.S. counties. In addition to these considerations, as well as the overwhelming likelihood that vaccination prevalence alone does not control new-infection dynamics, there remains the possibility that other complex influences (such as geographic spread [27] or climate patterns [28]) could translate to different clusters "taking the lead" in emerging new-case rates at different times. We stress that our "seat belt" measurements were evaluated while U.S.
states were experiencing a clear upswing, ostensibly due to the spread of a novel SARS-CoV-2 variant. We emphasize again that, while our models do seem to provide useful predictions and to correspond to other real-world features, they are not causal models. The fact that states' dynamics can be identified and tied into clusters with few inputs is efficient, but does not resolve the issue of why compact patterns exist. Explorations along these lines do remain important, and we hope our model might be useful in enabling others to add more variables in search of causality. A practically unlimited number of other data sources could be explored in this endeavor, but given the behavior of the clustering we observed, some possibilities suggest themselves, including mask-use policies and climate. While we do not assess political data [29] explicitly in our model, the regional clustering and relative differences in stringency suggest that they may be worth exploring in these contexts as well. While our pipeline is not the only existing approach to compressing infection-rate information in a way that is useful for prediction, it offers distinct advantages. Several analyses elaborated previously, as in Ref. [18], focused predominantly upon tracking and categorizing variations in state-level policies, and provided the basic terminology for describing the three completed "waves" of COVID-19 cases in the United States (as well as "concerns of a fourth wave" at the time of its publication in Spring 2021) [18]. This four-wave structure was discerned from observations on the overall patterns of COVID-19 cases across the U.S. (the sum of contributions from all 50 states); instead, we impute a data-driven number of waves, and find that the encompassed dates agree well with those of Ref. [18]. This latter sentiment is also echoed within two earlier studies that had explored multiple clustering methods [30] and attempted to classify U.S. states as being within their first or second "surge" [31].
Unlike our work here, the "optimality" of a given partitioning C in Ref. [30] is not evaluated on whether clusters predict external, ancillary variables, but by analytic means. The analyses in Ref. [31] also focused on "daily new cases," which we chose over other metrics, such as cumulative cases or daily death rates, as a snapshot of the evolving transmission situation within the U.S. The authors of the present work were unaware of both works while conducting the research described here, but our contribution can nonetheless be considered an extension or generalization of the former: aside from the benefit of a somewhat more retrospective look at the progression of the pandemic, our emphasis on external validation provides for more direct application of clustering to future decisions and research. In conclusion, our model supports the idea that gross fluctuations in the U.S. states' COVID-19 case rates are more regionally coordinated than is revealed by a focus on raw case numbers alone. As a result, while others have reported occasions on which California and Florida may have seen similar statistics [13, 32], our findings suggest that a neighbor like Georgia would be more likely to provide actionable information for what is likely to happen to Florida (especially in the context of a new wave) than would California or New York. It remains to be shown, in future work, whether these same states would unite and cluster together on other grounds (e.g., demographic, political, or climatological), and whether the specific groupings discovered here prove useful in future infectious disease containment or mitigation strategies.

We have demonstrated that the fifty U.S. states' "wave" histories fall into just four distinct infection patterns, and that each pattern can be associated with a specific geographical region. Grouping states by infection pattern revealed that within-group vaccination rates predict key new-case statistics during the early stages of the Summer 2021 surge.
COVID-19 exacerbating inequalities in the US
A country level analysis measuring the impact of government actions, country preparedness and socioeconomic factors on COVID-19 mortality and related health outcomes
COVID-19 in Italy: An Analysis of Death Registry Data
Locally Informed Simulation to Predict Hospital Capacity Needs During the COVID-19 Pandemic
A Closer Look Into Global Hospital Beds Capacity and Resource Shortages During the COVID-19 Pandemic
A closer look at U.S. COVID-19 vaccination rates and the emergence of new SARS-CoV-2 variants: It's never late to do the right thing
Facing the wrath of enigmatic mutations: a review on the emergence of severe acute respiratory syndrome coronavirus 2 variants amid coronavirus disease-19 pandemic
Disparities in case frequency and mortality of coronavirus disease 2019 (COVID-19) among various states in the United States
Disparities in COVID-19 mortality by county racial composition and the role of spring social distancing measures
Racial and Ethnic Disparities in COVID-19 Infection and Mortality in the United States: A state-wise update
Understanding the spatial patchwork of predictive modeling of first wave pandemic decisions by US governors
Prevention, et al. Covid data tracker weekly review
Why California, Florida Have Similar Number of COVID-19 Cases
Lockdowns: Data Shows FL, CA With Nearly Identical Outcomes in COVID Cases Despite Opposite Approach
Nonlinear dynamics and chaos with student solutions manual: With applications to physics, biology, chemistry, and engineering
Repository of COVID-19 data underlying the analysis and visualizations created by the Johns Hopkins Centers for Civic Impact for the Coronavirus Resource Center (CRC). Johns Hopkins University Coronavirus Resource Center
Substantial Underestimation of SARS-CoV-2 Infection in the United States
Variation in US states' Responses to COVID-19
COVID-19 Dataset by Our World in Data
World Population Review. US States - Ranked by population 2021
Household Transmission of COVID-19 Cases Associated with SARS-CoV-2 Delta Variant (B.1.617.2): National case-control study. The Lancet Regional Health-Europe
The SciPy community. scipy.spatial.distance.pdist
Mini Review Immunological Consequences of Immunization With COVID-19 mRNA Vaccines: Preliminary Results
Adaptive Immunity to SARS-CoV-2 and COVID-19
Elements of information theory
The New York Times. Coronavirus in the U.S.: Latest Map and Case Count
Insights from the COVID-19 pandemic
Effects of Weather on Coronavirus Pandemic
Oxford Covid-19 Government Response Tracker (OxCGRT)
Cluster-based Dual Evolution for Multivariate Time Series: Analyzing COVID-19
COVID-19 in the United States: Trajectories and Second Surge Behavior
One year in, how four states fared in combating the coronavirus

The authors would like to extend a special thanks to the CaliBaja Center for Resilient Materials and Systems, and to those at the San Diego Supercomputer Center, for hosting the ENLACE 2021 program.