key: cord-0923807-6aeljw9f
authors: James, Nick; Menzies, Max
title: COVID-19 in the United States: Trajectories and second surge behavior
date: 2020-09-03
journal: Chaos
DOI: 10.1063/5.0024204
sha: c1f0f2044360ea5c01e1ddcf11eba7e9242e16f1
doc_id: 923807
cord_uid: 6aeljw9f

This paper introduces a mathematical framework for determining second surge behavior of COVID-19 cases in the United States. Within this framework, a flexible algorithmic approach selects a set of turning points for each state, computes distances between them, and determines whether each state is in (or over) a first or second surge. Then, appropriate distances between normalized time series are used to further analyze the relationships between case trajectories on a month-by-month basis. Our algorithm shows that 31 states are experiencing second surges, while four of the 10 largest states are still in their first surge, with case counts that have never decreased. This analysis can aid in highlighting the most and least successful state responses to COVID-19.

Understanding the trajectory of COVID-19 case counts assists governments in responding to the impact of the pandemic. Given the highly infectious nature of the disease, an increasing number of new daily cases may overwhelm the healthcare system and prompt further restrictions on businesses and social gatherings. Conversely, a decreasing number of new cases is a good sign but should be carefully monitored. Thus, the turning points in the new case counts are crucial to identify. Given the randomness of human behavior, however, turning points are difficult to predict and predictive modeling of COVID-19 requires frequent observation of real-time data. This paper focuses on a retrospective analysis of case trajectories in different U.S. states to determine which public policy responses were most and least effective. While we focus on the U.S., our methodology may be applied more broadly to understand the impact of public policies on countries with similar federative structures such as Brazil and India. These three countries have the highest COVID-19 case counts in the world, 1 with government responses differing between states and yielding differing results. 6, 7 For this goal, we use existing and new techniques from time series analysis. Time series analysis has been widely applied to epidemiology, 8, 9 including COVID-19. [10] [11] [12] Existing methods of time series analysis are diverse, including power-law models 13 and nonparametric methods such as distance analysis, 14 distance correlation, [15] [16] [17] and network models. 18 In this paper, we aim to develop a new mathematical framework of identification and comparison of turning points of time series to study the spread of COVID-19 in the U.S. Such turning points classify the behavior of states' trajectories throughout the pandemic as being in (or over) their first surge or second surge.

In addition to the aforementioned state-by-state determination of turning points, this paper uses a new application of semi-metrics to measure distance between the states' behaviors and performs clustering based on this. The paper implements hierarchical clustering, 19, 20 which has previously been used in various epidemiological applications. These include inflammatory diseases, 21 airborne diseases, 22 Alzheimer's disease, 23 Ebola, 24 SARS, 25 and COVID-19. 11 The paper is structured as follows: in each of Secs. II and III, we introduce portions of our methodology and then present our results. Section II describes our framework for identifying turning points, determining which states are in a second surge, and clustering based on similar behavior. Section III analyzes the trajectories of COVID-19 counts in each state on a month-by-month basis. Section IV summarizes the results and new findings regarding the spread of COVID-19 in the United States. In Appendix A, we briefly apply our analysis to the Brazilian states to demonstrate the generality of our method.

In this section, we develop a mathematical framework and procedure to determine whether a state has experienced a second surge. Through the careful selection of turning points, we formulate a definition applicable to an individual time series and develop a method for comparing differing surge behaviors among a collection of time series.

A. Second surge methodology: Determination of turning points

Let $x_i(t) \in \mathbb{R}$, $i = 1, \dots, n$, be a collection of real-valued time series over a common time interval $t = 1, \dots, T$. In this paper, the analyzed time series are the daily counts of new cases in the 50 U.S. states and the District of Columbia (D.C.), ordered alphabetically, $i = 1, \dots, 51$. Our data span from 01/21/2020 to 07/31/2020, a period of $T = 193$ days across $n = 51$ regions.
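As a quick arithmetic check of the stated period length, the sketch below counts calendar days with both endpoints included. The helper name `inclusive_days` is hypothetical and introduced only for illustration.

```python
from datetime import date


def inclusive_days(start, end):
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


# Study period quoted in the text: 01/21/2020 to 07/31/2020.
T = inclusive_days(date(2020, 1, 21), date(2020, 7, 31))
print(T)  # 193
```

Counting both endpoints matters here: the plain difference between the two dates is 192 days, one fewer than the number of daily observations.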

There are several irregular features in the dataset, including lower case counts on the weekends, some negative daily counts due to adjustments of previous figures, and general noise. In addition, there are small disparities between different data sources. In order to isolate the signal in a dataset and between different datasets, we first apply a Savitzky-Golay filter 26 to the counts to produce a collection of smoothed time series $\hat{x}_i(t)$, $t = 1, \dots, T$, $i = 1, \dots, n$. This combines a moving average calculation with polynomial smoothing. Through its moving average computations, it eliminates nearly all negative counts, except when there are very few cases. In these instances, we replace any negative smoothed count with a zero. For the remainder of the paper, we analyze these smoothed cases $\hat{x}_i(t)$. Due to the smoothing process, the values $\hat{x}_i(t) \in \mathbb{R}_{\ge 0}$ are not necessarily integers but are all non-negative.
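The smoothing step can be sketched as follows. The text does not state the window length or polynomial order, so the classic five-point quadratic Savitzky-Golay kernel $(-3, 12, 17, 12, -3)/35$ used here is an assumption, as is the choice to pad the series by repeating its endpoint values. The function name `smooth_counts` is hypothetical.

```python
# Five-point quadratic Savitzky-Golay smoothing weights (assumed window and order).
SG5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def smooth_counts(counts):
    """Smooth daily counts and replace any negative smoothed value with zero."""
    half = len(SG5) // 2
    # Assumption: extend the series by repeating its first and last values.
    padded = [counts[0]] * half + list(counts) + [counts[-1]] * half
    smoothed = []
    for t in range(len(counts)):
        window = padded[t:t + len(SG5)]
        value = sum(w * c for w, c in zip(SG5, window))
        smoothed.append(max(value, 0.0))  # clip negatives, as described in the text
    return smoothed


# A downward data correction (-10) no longer yields a negative smoothed count.
print(smooth_counts([0, 0, -10, 0, 0]))
```

In practice a library routine such as `scipy.signal.savgol_filter` would likely be preferable; the pure-Python version above is only meant to make the two operations (local polynomial smoothing, then clipping at zero) explicit.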

Our identification of turning points and second surge behavior of a smoothed time series $\hat{x}(t)$ proceeds in two steps. First, we identify a sequence of potential local maxima or peaks, and local minima or troughs. Second, we appropriately refine this sequence according to chosen conditions. The final sets P and T of peaks and troughs, respectively, determine whether the time series is said to be in (or over) its first or second surge. It will be essential that P and T are non-empty.

For the first step, the basic idea is to designate a time $t_0$ a peak or trough, respectively, if

$$\hat{x}(t_0) = \max\{\hat{x}(t) : \max(1, t_0 - l) \le t \le \min(t_0 + l, T)\}, \tag{1}$$

$$\hat{x}(t_0) = \min\{\hat{x}(t) : \max(1, t_0 - l) \le t \le \min(t_0 + l, T)\}, \tag{2}$$

for a parameter $l < T/2$, the length over which we look. In this paper, we select $l = 17$ to account for the 14-day incubation period of the virus 27 and reduced testing on weekends. These naïve definitions of (1) and (2) have two flaws. First, equal values of $\hat{x}(t)$ may determine consecutive values of $t$ as peaks or troughs when only one should be counted. More subtly, it is possible that two troughs may be detected at two points that are far apart, with no peak between them, when the time series has been largely monotonic between the two. For example, in Fig. 1(a), troughs are naïvely detected at $t_0 = 1$ and $126$, corresponding to the start of the time series and 05/26/2020, respectively. What follows is a method to exclude the latter.

We implement variants of (1) and (2) by sequentially examining the values of $\hat{x}(t)$. The first peak or trough is assigned at the first value of $t_0$ such that (1) or (2) holds, respectively. For each U.S. state, this is an initial trough at $t_0 = 1$ corresponding to zero cases there after smoothing. Having determined a peak at $t_0$, we search in the period $t > t_0$ for one of two elements: if we find a trough at $t_1 > t_0$ according to (2), we add it to the set of troughs and proceed from $t_1$ as normal. If we find a peak at $t_1 > t_0$ according to (1) such that $\hat{x}(t_0) \ge \hat{x}(t_1)$, we ignore this lesser peak as redundant; if we find a peak at $t_1 > t_0$ according to (1) such that $\hat{x}(t_0) < \hat{x}(t_1)$, we remove peak $t_0$ and replace it with $t_1$ and continue from there. An analogous process applies from a trough at $t_0$. With this process, we generate an alternating sequence of troughs and peaks, starting with a trough at $t_0 = 1$. Every time series is assigned at least one peak and trough at its global maximum and minimum, respectively. If the global maximum is not unique, a peak is assigned at the first maximum. This concludes the first step.
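A minimal sketch of this first step is given below. It assumes that a candidate peak (trough) is a point equal to the maximum (minimum) of the series over a window of half-width $l$ truncated at the ends of the series. The handling of a locally constant window, where a point is simultaneously a maximum and a minimum, is also an assumption: such a point is counted only as an initial trough. The function name `turning_points` is hypothetical and indices are 0-based.

```python
def turning_points(x, l=17):
    """Return an alternating list of (index, kind) pairs, kind 'T' or 'P'."""
    n = len(x)
    seq = []
    for t in range(n):
        window = x[max(0, t - l):min(n, t + l + 1)]
        is_peak = x[t] == max(window)
        is_trough = x[t] == min(window)
        if is_peak and is_trough:
            # Flat window (assumption): only allowed to start the sequence as a trough.
            if not seq:
                seq.append((t, "T"))
            continue
        if not (is_peak or is_trough):
            continue
        kind = "P" if is_peak else "T"
        if not seq or seq[-1][1] != kind:
            seq.append((t, kind))  # type changes: extend the alternating sequence
            continue
        # Same type as the last turning point: keep only the more extreme one.
        last_t = seq[-1][0]
        more_extreme = x[t] > x[last_t] if kind == "P" else x[t] < x[last_t]
        if more_extreme:
            seq[-1] = (t, kind)
    return seq


# Two surges separated by a clear dip.
print(turning_points([0, 1, 3, 5, 3, 1, 2, 4, 6], l=2))
# A small bump on a rising curve: the early peak is replaced by the later, higher one.
print(turning_points([0, 1, 2, 3, 2.9, 2.95, 4, 5, 6], l=2))
```

The second call illustrates why redundant turning points need to be merged: the dip after index 3 is too shallow to register as a trough within the window, so the two candidate peaks collapse into one.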

So far, every time series in our collection is assigned an alternating sequence of peaks and troughs beginning with a trough at t 0 = 1. One could naïvely define a state as being in its second surge if its sequence so far is TPTP. However, some detected peaks and troughs are immaterial and should be excluded. We describe a flexible approach to excluding trivial peaks or troughs, in which we apply two conditions to do so.

Let $t_1 < t_3$ be two peaks, necessarily separated by a trough. We select a parameter $\delta$ and if the peak ratio, defined as $\hat{x}(t_3)/\hat{x}(t_1)$, is less than $\delta$, we remove the peak $t_3$. If two consecutive troughs $t_2, t_4$ remain, we remove $t_2$ if $\hat{x}(t_2) > \hat{x}(t_4)$, otherwise remove $t_4$. That is, if the second peak is smaller than a fraction $\delta$ of the first peak, we remove it, deciding not to term it a second surge.

Finally, we define the log-gradient between times $t_1 < t_2$ as

$$\operatorname{log-grad}(t_1, t_2) = \frac{\log \hat{x}(t_2) - \log \hat{x}(t_1)}{t_2 - t_1}.$$

The numerator coincides with $\log\left(\hat{x}(t_2)/\hat{x}(t_1)\right)$ and is a more appropriate substitution for the "rate of increase" given by $\hat{x}(t_2)/\hat{x}(t_1) - 1$. Indeed, a "rate of increase" is asymmetrically bounded, taking values in $(-1, \infty)$, while the logarithmic rate takes values in $(-\infty, \infty)$.

The log-grad function measures the average rate of logarithmic increase over the period $[t_1, t_2]$. Now, let $t_1, t_2$ be an adjacent peak and trough. We select a parameter $\epsilon = 0.01$. If

$$|\operatorname{log-grad}(t_1, t_2)| < \epsilon, \tag{3}$$

that is, the average logarithmic increase or decrease is well-defined and less than 1%, we remove both $t_1$ and $t_2$ from our sets of peaks and troughs. For example, this step removes a peak and trough from Fig. 1(b), where the local maximum at 04/17/2020 is immaterial. This condition always preserves the trough at $t_0 = 1$, where $\hat{x}(t_0) = 0$, and the peak at the global maximum. This concludes the selection of P and T.
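The log-gradient condition can be illustrated with a short sketch. It assumes the log-gradient is the difference of logarithms divided by the elapsed time, consistent with the description of its numerator and of an average rate over $[t_1, t_2]$. Returning `None` when a value is zero is an illustrative way to encode "not well-defined". The names `log_grad` and `is_immaterial` are hypothetical.

```python
import math


def log_grad(x, t1, t2):
    """Average logarithmic rate of change of x between indices t1 < t2.

    Returns None when either value is non-positive, so the logarithm is undefined.
    """
    if x[t1] <= 0 or x[t2] <= 0:
        return None
    return (math.log(x[t2]) - math.log(x[t1])) / (t2 - t1)


def is_immaterial(x, t1, t2, eps=0.01):
    """True if the change between adjacent turning points t1, t2 is below eps per day."""
    g = log_grad(x, t1, t2)
    return g is not None and abs(g) < eps


series = [100.0] + [0.0] * 9 + [105.0]    # only the endpoints matter here
print(is_immaterial(series, 0, 10))        # 5% change over 10 days: immaterial
print(is_immaterial([0.0, 50.0], 0, 1))    # undefined log-gradient: never removed
```

Unlike the plain rate of increase, the logarithmic rate treats a doubling and a halving symmetrically: `log_grad([100, 200], 0, 1)` and `log_grad([200, 100], 0, 1)` differ only in sign.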

To quantify distance between time series' turning points, we apply the semi-metrics of Ref. 28 (with $p = 1$). Given two nonempty finite sets $S_1, S_2$, this is defined as

$$d(S_1, S_2) = \frac{1}{2}\left( \frac{\sum_{b \in S_2} d(b, S_1)}{|S_2|} + \frac{\sum_{a \in S_1} d(a, S_2)}{|S_1|} \right),$$

where $d(b, S_1)$ is the minimal distance from $b \in S_2$ to the set $S_1$. The values $d(S_1, S_2)$ are symmetric, non-negative, and zero if and only if $S_1 = S_2$. Then, we define the $n \times n$ surge behavior matrix between turning point sets by

$$D^{TP}_{ij} = d(P_i, P_j) + d(T_i, T_j),$$

where $P_i$ and $T_i$ denote the sets of peaks and troughs of $\hat{x}_i(t)$. Then, $D^{TP}_{ij} = 0$ if and only if time series $\hat{x}_i(t)$ and $\hat{x}_j(t)$ have equal sets of peaks and troughs, hence identical surge behavior. These procedures are presented in an algorithmic format in Appendix B.
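The distance between turning point sets can be sketched as follows. Two assumptions are made: that the $p = 1$ semi-metric of Ref. 28 averages nearest-neighbor distances in both directions and then takes half their sum, and that each entry of the surge behavior matrix adds the distance between peak sets to the distance between trough sets. The function names are hypothetical.

```python
def set_distance(s1, s2):
    """Assumed p = 1 semi-metric between two nonempty finite sets of time indices."""
    def nearest(point, target):
        return min(abs(point - other) for other in target)

    to_s1 = sum(nearest(b, s1) for b in s2) / len(s2)
    to_s2 = sum(nearest(a, s2) for a in s1) / len(s1)
    return 0.5 * (to_s1 + to_s2)


def surge_matrix(peak_sets, trough_sets):
    """Assumed surge behavior matrix: peak-set distance plus trough-set distance."""
    n = len(peak_sets)
    return [
        [
            set_distance(peak_sets[i], peak_sets[j])
            + set_distance(trough_sets[i], trough_sets[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three illustrative series: the first two share identical turning points.
peaks = [{80, 193}, {80, 193}, {90}]
troughs = [{1, 140}, {1, 140}, {1}]
for row in surge_matrix(peaks, troughs):
    print(row)
```

Under these assumptions the matrix is symmetric with a zero diagonal, and two series with identical peak and trough sets are at distance zero, which is what allows such series to merge at the bottom of a dendrogram.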

Our methodology assigns one of four possible sequence types to each state. Thirteen states, including Georgia, California, Texas, and North Carolina [Figs. 1(b), 1(c), 1(d), and 1(e), respectively] are assigned the sequence TP (that is, one trough, then one peak) and deemed to be in their first surge. All 13 of these have their unique peak and global maximum on the final day and form a cluster of identical similarity in Fig. 2, where we implement hierarchical clustering on $D^{TP}$. Identical results are obtained with any value $\delta \in [0.1, 0.2]$, so we select $\delta = 0.2$. That is, we exclude any second surge that is less than a fifth of the first. Then, 31 states plus D.C. are assigned TPTP; we deem these to be in their second surge. The three largest of these second surge states are Florida, Pennsylvania, and Ohio, displayed in Figs. 1(f), 1(g), and 1(h), respectively. Of these, all but Florida and South Carolina have a peak on the final day, with 19 exhibiting their global maximum on that day. These second surge states form the majority cluster in Fig. 2. Four states are assigned the sequence TPT, of which New York and New Jersey [Figs. 1(i) and 1(j)] have a local maximum removed due to a peak ratio of less than 0.2. Their curves have been flattened, and the first surge is completely over. Arizona [Fig. 1(k)] and Utah are also assigned TPT, with their latter trough on the final day, indicating they are still coming down from a first surge. Finally, Maine [Fig. 1(l)] and Vermont are assigned the sequence TPTPT. For Maine, this final trough is at the end of the period, indicating it is still coming down from its second surge, while Vermont's final trough is before the end, indicating it has flattened the curve on its second surge. Hierarchical clustering on $D^{TP}$ distinguishes all these behaviors into separate clusters.
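The mapping from a refined turning point sequence to a surge status, as described in this paragraph, can be summarized in a short sketch. The function name `surge_status` and the wording of the returned labels are hypothetical; the rule that a final trough on the last day means "still coming down" follows the descriptions of Arizona, Utah, and Maine.

```python
def surge_status(kinds, last_turning_day, final_day):
    """Classify a sequence such as 'TPTP' given the day of its last turning point."""
    if kinds == "TP":
        return "in first surge"
    if kinds == "TPTP":
        return "in second surge"
    if kinds in ("TPT", "TPTPT"):
        which = "first" if kinds == "TPT" else "second"
        if last_turning_day == final_day:
            return f"coming down from {which} surge"
        return f"{which} surge over"
    raise ValueError(f"unexpected sequence type: {kinds}")


# Illustrative inputs with T = 193 days; the day 150 value is made up.
print(surge_status("TP", 193, 193))
print(surge_status("TPT", 193, 193))
print(surge_status("TPTPT", 150, 193))
```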


In this section, we further analyze the new case counts in the 50 states and D.C., examining the trajectories of smoothed case counts on a month-by-month basis. Restricting these smoothed counts to a particular month gives a sequence $f_i = (f_i(1), f_i(2), \dots, f_i(m)) \in \mathbb{R}^m$, where $m \in \{29, 30, 31\}$ is the number of days in that month and $i = 1, \dots, 51$.

Let $\|f_i\| = \sum_{t=1}^{m} |f_i(t)|$ be the $L^1$ norm of the vector $f_i$. As all $f_i(t)$ are non-negative, this counts the total number of new cases (up to smoothing) observed in a month and is non-zero for every state and month after March. Thus, we may define $g_i = f_i / \|f_i\|$. The vectors $g_i$ reflect relative changes of new case counts within a month. For example, a state whose new cases in a month vary between 1000 and 1100 will present a relatively flat normalized trajectory, whereas a state whose new cases in a month rise from 0 to 100 will present a more steeply increasing normalized trajectory, as a reflection of the relative change. We define trajectory distance matrices $D_{ij} = \|g_i - g_j\|$ that measure distance between normalized trajectories. This distance differs from the frequently used distance correlation, 10, [15] [16] [17] which is a more suitable measure between cumulative cases. First, distance correlation is equal to 1 when two sequences have the greatest possible similarity, whereas the distance $D_{ij}$ between two identical sequences is 0. Second, whereas sequences $(1, 2, 3, 4)$ and $(4, 3, 2, 1)$ have distance correlation equal to 1, they have significantly different normalized trajectories, reflecting the fact that one sequence is increasing while the other is decreasing. Specifically, $\|g_i - g_j\| = 0$ if and only if $f_i = \alpha f_j$ for some $\alpha > 0$, while two sequences $(X_k)$, $(Y_k)$ have distance correlation 1 if and only if $X_k = a + bY_k$ for constants $a, b$. That is, distance correlation does not distinguish between positive and negative gradients.
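The normalized trajectory distance can be sketched directly from these definitions. The function names `normalize` and `trajectory_distance` are hypothetical, and raising an error for an all-zero month is an illustrative choice.

```python
def normalize(f):
    """Divide a monthly sequence by its L1 norm so that its entries sum to one."""
    total = sum(abs(v) for v in f)
    if total == 0:
        raise ValueError("normalization is undefined for an all-zero month")
    return [v / total for v in f]


def trajectory_distance(f_i, f_j):
    """L1 distance between the normalized trajectories of two equal-length sequences."""
    g_i, g_j = normalize(f_i), normalize(f_j)
    return sum(abs(a - b) for a, b in zip(g_i, g_j))


# An increasing and a decreasing sequence are far apart (approximately 0.8) ...
print(trajectory_distance([1, 2, 3, 4], [4, 3, 2, 1]))
# ... while a sequence and a positive multiple of it are at distance zero.
print(trajectory_distance([1, 2, 3], [2, 4, 6]))
```

Since both normalized vectors are non-negative and sum to one, this distance always lies between 0 and 2, which makes months with very different case totals directly comparable.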

In Figs. 3(a)-3(d), respectively, we implement hierarchical clustering on these matrices for the months of April, May, June, and July. There is consistent similarity in the dendrogram structure: every figure has three clusters, namely two small clusters and one majority cluster that contains several subclusters of high internal similarity. The two small clusters generally consist of states that are experiencing steeply increasing or decreasing trajectories, while the larger cluster exhibits more heterogeneity. We describe the common features of the dendrograms in Table I. There, we also include the Frobenius norm of each distance matrix. For an $n \times n$ matrix $A$, this is defined as $\|A\| = \left( \sum_{i,j=1}^{n} |a_{ij}|^2 \right)^{1/2}$ and quantifies the total spread of all distances in a month.
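The Frobenius norm used to summarize each monthly distance matrix is straightforward to compute. The sketch below uses a plain list-of-lists matrix, and the function name `frobenius_norm` is hypothetical.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a matrix given as nested lists."""
    return math.sqrt(sum(value * value for row in matrix for value in row))


# A small symmetric distance matrix with zero diagonal (illustrative values).
D = [
    [0.0, 0.3, 0.4],
    [0.3, 0.0, 0.0],
    [0.4, 0.0, 0.0],
]
print(frobenius_norm(D))
```

Because every off-diagonal distance appears twice in a symmetric matrix, a month in which many states follow similar normalized trajectories produces a visibly smaller norm.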

In April, hierarchical clustering determines the existence of three clusters of U.S. states, displayed in Fig. 3(a) . The first contains Alaska, Hawaii, Idaho, Louisiana, Montana, and Vermont, all of which have declining new case trajectories. Idaho, seen in Fig. 4(a) , displays behavior typical of the cluster, experiencing a peak in early April and steadily decreasing for the remainder of the month. The second cluster consists of Iowa, Kansas, Minnesota, and Nebraska, all of which have steep increases in new case counts. Iowa and Minnesota are depicted in Figs. 4(b) and 4(c), respectively. The final cluster contains all 41 remaining states and two subclusters of high self-similarity. The first subcluster contains states whose trajectories are concave down with a local peak in April. Georgia, Pennsylvania, and Connecticut depicted in Figs. 1(b) , 1(g), and 4(d), respectively, Fig. 3(b) ], as their smoothed trajectories mostly consist of zeroes and very low counts. Again, three clusters are observed: the first and most anomalous cluster contains only New York and New Jersey, whose trajectories are significantly decreasing, as seen in Figs. 1(i) and 1(j), respectively. The second cluster contains Arkansas [ Fig. 4(e) ] and Vermont; their trajectories are relatively flat at the start of May, with an uptick in the second half. All other states are contained in the final cluster, with several observable subclusters. One notable subcluster contains northeastern states Connecticut, Delaware, Massachusetts, Pennsylvania, Rhode Island, and D.C. All these states' trajectories are steadily decreasing during May, as seen in Figs. 4(d) and 4(f) for Connecticut and Massachusetts, respectively. By contrast, another subcluster, containing North Carolina and Arizona [ Figs. 1(e) and 1(k) , respectively], is characterized by moderate and consistent increase in May.

In June, three clusters are again observed in Fig. 3(c) . The first consists of Florida, Idaho and Montana, which display significantly increasing new cases. Figures 1(f) and 4(a) show that Florida and Idaho, respectively, experienced high growth from the beginning of June, after a prior month of moderate decrease and flat cases, respectively. In the second cluster, we again observe almost all Northeastern states, including Connecticut , and 4(f), respectively, all of which experience a rapid increase in early July that begins to level off toward the end of July. Even New York and New Jersey experience slight increases in their cases in July, although with much lower absolute counts. Another subcluster contains Georgia, Pennsylvania and Ohio, [1(b), 1(g), 1(h), respectively], which all experience approximately linear growth in new cases. As seen in Table I , the reduced Frobenius norm for the month of July reflects less spread in the matrix as a whole, due to the large number of states with similarly increasing trajectories.

In this paper, we propose a new method for analyzing turning points and trajectories among a collection of time series. Our mathematical framework defines the characteristics of a state experiencing (or over) a second surge in COVID-19 cases. The use of semi-metrics between sets of turning points clusters states according to their differing surge behaviors and provides immediate and visible insight into the behavior of the time series collection as a whole.

This classification of behaviors is then accompanied by a close examination of the trajectories on a month-by-month basis. Here, we separate and cluster trajectories according to the relative rates of increase or decrease of new case counts. Our methodology is flexible: different smoothing techniques, metrics between data, (semi-)metrics between sets, parameters in the algorithmic framework, and clustering methods can be used to study collections of time series and identify differing surge behavior in greater generality than this application. We demonstrate this with a brief application to the Brazilian federative units in Appendix A.

Clustering the states' trajectories on a month-by-month basis reveals consistent similarity in the cluster structure: there are always three clusters, that is, one majority cluster and two smaller clusters. The stable cluster structure over time allows one to easily observe changes in the cluster membership of individual states and determine the time frame under which new case counts in different states changed direction. For example, in May, New York and New Jersey move into a separate cluster characterized by sharply falling numbers of new cases as they introduced mask mandates. 29 Our analysis provides insights into the evolution of COVID-19 in the United States. While previous papers have studied counts of different countries over shorter time windows, 10,11 this paper studies the U.S. on a state-by-state basis over seven months. Within our framework, we determine that 31 states plus D.C. are experiencing second surges, of which 21 are more severe than the first surge. Thirteen states, including four of the 10 largest, are still in their first surge, with new case counts that have never materially decreased. Only two states are completely over and two partially over their first surge with no second surge as of yet. Just two states are over their second. As of the end of July, all other state counts are increasing and 32 exhibited their greatest case counts (after smoothing) on the final day of analysis. All these features are visible in Fig. 2 , where five (sub)clusters correspond to these five possible surge behaviors.

The similarities in Fig. 2 can help identify common characteristics of the states that have most and least successfully managed COVID-19. New York and New Jersey, like many of the Northeastern states, experienced peaks in new COVID-19 cases in early April. Unlike other Northeastern states, these two reduced their new cases substantially and have avoided a second surge. Massachusetts and Delaware have experienced small second surges in July, 28.5% and 55.7% of their first peak, respectively.

By contrast, California, Texas, Florida, and Georgia are the four states that managed the growth of new COVID-19 cases poorly: their case counts are the highest in the nation. California and Texas limited restrictions despite cases that never stopped increasing and then reinstated them amid record counts. [30] [31] [32] With cases per capita greater than California and Texas, Georgia remains in its first surge, having overturned local mask mandates in July. 33 After an early first surge, Florida reduced restrictions and has since experienced a long and steep second surge, including the highest single day counts of any state. 34 Overall, this paper introduces a new method for analyzing second surge behavior in a collection of time series and provides new insights into the spread of COVID-19 in the U.S. Early in 2020, many states believed that COVID-19 would resolve quickly. Few predicted that Florida's second surge, for example, would be so much more severe than its first. Nonetheless, this is a highly infectious virus, and even countries that reported zero counts have since observed recurrences. 35 As further surges jeopardize both citizen safety and economic recovery, individual states must closely observe the trajectory of their cases and react swiftly to minimize the potential for increasing case counts. We predict that many state governments will learn their lesson from the second surge and be cautious in observing new case counts. Vigilance going forward is necessary, and we hope states learn this in their response.

The authors thank Kerry Chen and Orri Ganel for helpful comments and edits.

In this brief section, we demonstrate the generality of our method by applying it to the 27 federative units of Brazil. Our data span from 02/25/2020 to 07/31/2020, a period of 158 days. In Fig. 5

ALGORITHM 2. Refinement of the alternating sequence of peaks and troughs.

    ▷ Indexing begins from 1
    Initialize: CurrentPeakIndex = 2;
    ▷ Begin the peak ratio refinement
    while CurrentPeakIndex ≤ Length(TPSet) − 2 do
        i = CurrentPeakIndex, $t_1$ = TPSet(i), $t_3$ = TPSet(i + 2);
        if $\hat{x}(t_3)/\hat{x}(t_1) \ge \delta$ then
            CurrentPeakIndex = i + 2;
        else if $\hat{x}(t_3)/\hat{x}(t_1) < \delta$ and i + 2 = Length(TPSet) then
            Remove $t_3$ from PeakSet;
            TPSet = Sort(PeakSet ∪ TroughSet);
        else if $\hat{x}(t_3)/\hat{x}(t_1) < \delta$ and i + 2 ≠ Length(TPSet) then
            $t_2$ = TPSet(i + 1), $t_4$ = TPSet(i + 3);
            if $\hat{x}(t_2) \le \hat{x}(t_4)$ then
                Remove $t_4$ from TroughSet;
            else
                Remove $t_2$ from TroughSet;
            Remove $t_3$ from PeakSet;
            TPSet = Sort(PeakSet ∪ TroughSet);
    Initialize: CurrentIndex = 1;
    ▷ Begin the log-grad refinement
    while CurrentIndex < Length(TPSet) do
        i = CurrentIndex, $t_0$ = TPSet(i), $t_1$ = TPSet(i + 1);
        if $|\operatorname{log-grad}(t_0, t_1)| < \epsilon$ then    ▷ See Equation (3)
            Remove $t_0$ and $t_1$ from both TroughSet and PeakSet;
            TPSet = Sort(PeakSet ∪ TroughSet);
        else
            CurrentIndex = i + 1;
    Output PeakSet and TroughSet.

that are decreasing at the end of the data period, six are in the north region: Acre, Amapá, Amazonas, Pará, Rondônia, and Roraima. On the final day of analysis, 15 states exhibited their greatest new case counts (after smoothing).

In this section, we provide an algorithmic presentation of the computational steps taken for the determination of second surge behavior, described in Sec. II. Algorithm 1 describes the first step of Sec. II A, where an alternating sequence of peaks and troughs is determined. Algorithm 2 describes the second step, where this list is refined. In our implementation, we choose three parameters $l = 17$, $\delta = 0.2$, $\epsilon = 0.01$. Equations (1) and (2) define necessary conditions in the first step, while (3) defines a necessary condition in the second step.
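The second step can be sketched in Python as below. This is one reading of the refinement procedure, not a definitive implementation: it uses 0-based indices, compares $\hat{x}(t_3)$ against $\delta \hat{x}(t_1)$ rather than dividing, and assumes the log-gradient is the difference of logarithms divided by the elapsed time, treated as undefined when either value is zero. The function name `refine` is hypothetical.

```python
import math


def refine(x, peaks, troughs, delta=0.2, eps=0.01):
    """Refine an alternating trough/peak sequence (starting with a trough)."""
    peaks, troughs = set(peaks), set(troughs)
    tp = sorted(peaks | troughs)

    # Peak ratio refinement: position 1 holds the first peak.
    i = 1
    while i <= len(tp) - 3:
        t1, t3 = tp[i], tp[i + 2]
        if x[t3] >= delta * x[t1]:
            i += 2  # later peak is material: move on to it
            continue
        if i + 2 < len(tp) - 1:
            # A trough follows t3: of the two adjacent troughs, keep the lower one.
            t2, t4 = tp[i + 1], tp[i + 3]
            troughs.discard(t4 if x[t2] <= x[t4] else t2)
        peaks.discard(t3)
        tp = sorted(peaks | troughs)

    # Log-gradient refinement over adjacent turning points.
    i = 0
    while i < len(tp) - 1:
        t0, t1 = tp[i], tp[i + 1]
        defined = x[t0] > 0 and x[t1] > 0
        if defined and abs(math.log(x[t1] / x[t0]) / (t1 - t0)) < eps:
            peaks -= {t0, t1}
            troughs -= {t0, t1}
            tp = sorted(peaks | troughs)
        else:
            i += 1
    return sorted(peaks), sorted(troughs)


# Illustrative series: a second peak worth only 15% of the first is dropped.
x = [0, 50, 100, 60, 30, 20, 10, 12, 15]
print(refine(x, peaks=[2, 8], troughs=[0, 6]))  # ([2], [0, 6])
```

In the example, the sequence TPTP becomes TPT, which is the same pattern the text reports for states whose later local maximum falls below the peak ratio threshold.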

The data that support the findings of this study are openly available in Refs. 36 and 37.

Thinking globally, acting locally-The U.S. response to Covid-19

All 50 U.S. states have taken steps toward reopening in time for Memorial Day weekend

A devastating new stage of the pandemic

First and second waves of coronavirus

Scrutinizing the heterogeneous spreading of COVID-19 outbreak in Brazilian territory

India's policy response to COVID-19

The mathematics of infectious diseases

Mathematical models to characterize early epidemic growth: A review

Strong correlations between power-law growth of COVID-19 in four continents and the inefficiency of soft quarantine strategies

Rare and extreme events: The case of COVID-19 pandemic

Cluster-based dual evolution for multivariate time series: Analyzing COVID-19

Polynomial growth in branching processes with diverging reproductive number

Measuring the distance between time series

Measuring and testing dependence by correlation of distances

Distance correlation detecting Lyapunov instabilities, noise-induced escape times and mixing

Decay of the distance autocorrelation and Lyapunov exponents

Growing networks with communities: A distributive link model

Hierarchical grouping to optimize an objective function

Hierarchical clustering via joint between-within distances: Extending Ward's minimum variance method

Contribution of hierarchical clustering techniques to the modeling of the geographic distribution of genetic polymorphisms associated with chronic inflammatory diseases in the Québec population

Contact profiles in eight European countries and implications for modelling the spread of airborne infectious diseases

The application of unsupervised clustering methods to Alzheimer's disease

Application of hierarchical clustering ordered partitioning and collapsing hybrid in Ebola virus phylogenetic analysis

Hierarchical clustering using the arithmetic-harmonic cut: Complexity and experiments

Smoothing and differentiation of data by simplified least squares procedures

The incubation period of Coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: Estimation and application

Novel semi-metrics for multivariate change point analysis and anomaly detection

New York orders residents to wear masks in public

California tells six additional counties to close indoor businesses, all bars

Greg Abbott orders Texans in most counties to wear masks in public

COVID-19 deaths near 130,000; Florida and Texas report record case numbers

Georgia's governor issues order rescinding local mask mandates

Florida shatters single-day infection record with 15 300 new cases

New Zealand has 69 active Covid cases after 13 more diagnosed

Painel coronavírus