key: cord-0932317-nvbudtmd
authors: James, Nick; Menzies, Max; Bondell, Howard
title: Understanding spatial propagation using metric geometry with application to the spread of COVID-19 in the United States
date: 2021-08-05
journal: Epl
DOI: 10.1209/0295-5075/ac2752
sha: a6c78af5c0a9a22fbf6daeee1a951fa6b8ea365f
doc_id: 932317
cord_uid: nvbudtmd

This paper introduces a novel approach to spatio-temporal data analysis using metric geometry to study the propagation of COVID-19 across the United States. Using a geodesic Wasserstein metric, we analyse discrepancies between the density functions of new case counts on any given day, incorporating the geographic spread of cases. First, we apply this to identify the periods during which the changes in the geographic distribution of COVID-19 were most profound. The greatest shift occurred between May and June of 2020, when COVID-19 shifted from mostly dominating the Northeastern states to a wider distribution across the country. We support our findings with a new measure of the extent of geodesic variance of a distribution, demonstrating that the geographic imprint of COVID-19 was most concentrated in May 2020. Next, we investigate whether the epidemic exhibited meaningful patterns of spatial reversion, where similar geographic distributions return later. We identify broad similarity between the spread of COVID-19 across the US between the second and third waves, and to a lesser extent, the reemergence of the first wave's Northeastern dominance closer to the present day. This methodology could provide new insights for analysts to monitor the dynamical spread of epidemics and enable regional policymakers to protect their localities. More broadly, the framework we introduce could be applied to a variety of problems evolving over space and time.

Introduction. -COVID-19 remains an ongoing threat to the public health of the United States (US).
While an aggressive vaccination program has reduced cases and deaths considerably from their highs in January 2021 [1], the vaccination rate is highly non-uniform across the country [2, 3]. Communities with low vaccination rates remain a risk both to themselves and to the rest of the nation, through the spread of the virus to vulnerable people elsewhere and the potential for further virus mutation. Throughout the pandemic, the imposition of restrictions on community activity and businesses, including mask mandates and lockdowns, has mostly fallen to state and local governments. Thus, it remains of considerable interest to local lawmakers to track the ongoing spatial propagation of COVID-19 across the US. This may reveal trends in the spread of the virus, allow for forewarning before local outbreaks, identify locations that performed unusually well or poorly in containment measures, and provide opportunities to learn from other localities.

There are numerous existing analyses of the propagation of COVID-19 across the US on a state-by-state or finer basis. Previous research has used a variety of techniques, including state-by-state time series analysis [4, 5], SEIR models [6], regression models and feature selection [7, 8], Markov chain Monte Carlo models [9], and other Monte Carlo simulations [10]. Similar studies have been carried out in other countries, such as Brazil [11-13], China [14] and the United Kingdom [15], or comparing numerous countries simultaneously [16-19]. Most of these studies are qualitatively descriptive in their summary of the spread of COVID-19 across the geography of a country. Prior physically-inspired research on COVID-19 has focused on fluids and aerosols [20-24], characteristics of optimal antigens [25], contagion and percolation models [26, 27] and network models [28-30].
Methods from statistical physics have been used widely to study both COVID-19 and prior epidemics [25, 30-32].

arXiv:2108.02516v2 [physics.soc-ph] 5 Nov 2021

Geodesic Wasserstein distance. -This paper introduces a new mathematical method for studying the spatial propagation of an epidemic over time. Our approach applies metric geometry to perform an in-depth analysis of the time-varying spatial distribution of COVID-19 cases across US states and the District of Columbia (DC). Let $(X, d)$ be any metric space, $\mu, \nu$ two probability measures on $X$, and $q \geq 1$. The Wasserstein metric between $\mu, \nu$ is defined as

$$W_q(\mu, \nu) = \inf_{\gamma} \left( \int_{X \times X} d(x, y)^q \, d\gamma \right)^{1/q}, \tag{1}$$

where the infimum is taken over all probability measures $\gamma$ on $X \times X$ with marginal distributions $\mu$ and $\nu$. Henceforth, let $q = 1$. By the Kantorovich-Rubinstein formula [33], there is an alternative formulation when $X$ is compact (for example, finite):

$$W_1(\mu, \nu) = \sup_{F} \left( \int_X F \, d\mu - \int_X F \, d\nu \right), \tag{2}$$

where the supremum is taken over all 1-Lipschitz functions $F : X \to \mathbb{R}$.

Throughout this paper, let $X$ be the set of the 50 US states and DC, ordered alphabetically and indexed $i = 1, ..., 51$. As this is a finite set, measures $\mu, \nu$ may be reinterpreted as probability vectors $f, g \in \mathbb{R}^{51}$ such that $f_i \geq 0$ and $\sum_{i=1}^{51} f_i = 1$. Such probability vectors are our central object of study, as they define the spatial distribution over the US.

The simplest metric we can equip $X$ with is the discrete metric, where $d(x, y) = 1$ if $x \neq y$, and 0 otherwise. In this case, the formulation (2) of the Wasserstein metric is optimised with the choice of function

$$F(x) = \begin{cases} 1, & f(x) > g(x), \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$

and thus greatly simplifies to a relatively simple metric

$$W_{disc}(f, g) = \frac{1}{2} \| f - g \|_1. \tag{4}$$

A proof of this statement is included in the appendix (Proposition 1). We refer to this as the discrete Wasserstein distance; it will not be our primary object of study but will be utilised later to demonstrate the robustness of our results.
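As a minimal sketch of the discrete case, the snippet below computes half the $L^1$ distance between two probability vectors and checks it against the Kantorovich-Rubinstein objective. The three-region vectors are toy values, not case data, and the function names are illustrative. Under the discrete metric every 0/1-valued function is 1-Lipschitz, so a brute-force search over those functions is a cheap sanity check; the indicator of the set where $f > g$ is one maximiser.

```python
from itertools import product

def discrete_wasserstein(f, g):
    """Half the L1 distance between two probability vectors."""
    if len(f) != len(g):
        raise ValueError("f and g must have the same length")
    return 0.5 * sum(abs(fi - gi) for fi, gi in zip(f, g))

def dual_value(F, f, g):
    """Kantorovich-Rubinstein objective: sum over x of F(x) * (f(x) - g(x))."""
    return sum(Fx * (fx - gx) for Fx, fx, gx in zip(F, f, g))

# Toy example with three hypothetical regions (not real case data).
f = [0.5, 0.3, 0.2]
g = [0.2, 0.3, 0.5]

# Indicator of {f > g}; 1-Lipschitz with respect to the discrete metric.
F_opt = [1.0 if fx > gx else 0.0 for fx, gx in zip(f, g)]

# Brute force over all 0/1-valued functions on the three regions.
best = max(dual_value(F, f, g) for F in product([0.0, 1.0], repeat=len(f)))

w_disc = discrete_wasserstein(f, g)
print(w_disc, dual_value(F_opt, f, g), best)
```

All three printed values should agree up to floating-point rounding.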
Our primary methodological contribution is obtained by setting the metric $d$ on $X$ to be the real-world geodesic distances between (the centroids of) the 50 US states and DC, measured in meters. We term the associated Wasserstein distance the geodesic Wasserstein distance and notate it as $W_G(f, g)$ for two probability vectors $f$ and $g$. Wasserstein geodesics have been utilised in optimal transport problems [34, 35], but are novel in epidemic research, which frequently neglects the spatial aspect of a virus' spread.

Consider the illustrative case where $f$ and $g$ are Dirac delta functions supported on single elements $x$ and $y$, respectively. By the formulation (2), it follows that $W_1(f, g) = d(x, y)$. For the discrete Wasserstein distance, this means that $W_{disc}(f, g) = 1$ if $x \neq y$, whereas $W_G(f, g)$ is the physical distance between (the centroids of the) states indexed by $x$ and $y$. Thus, the geodesic Wasserstein distance allows us to factor in the physical spread of COVID-19 across the US. That is, the discrete Wasserstein distance (or $L^1$ norm) between probability vectors does not take into account distance between states; the geodesic Wasserstein distance assigns a greater distance between two probability vectors if the proportion of cases has shifted further away geographically. Intuitively, the geodesic Wasserstein metric $W_G(f, g)$ is the cost or work (in the sense of physics) to transform the distribution $f$ into $g$, taking real-world distances into account. For example, a shift in distribution from the Northeast to the Appalachian states would be assigned a smaller discrepancy than a shift from the Northeast to Pacific states.

We use this new metric for two aims. First, we wish to determine the periods with the most rapid change in spatial propagation of COVID-19 across the US. Second, we investigate whether there is any reversion behaviour in this spatial propagation.
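The following is a minimal sketch of how a geodesic Wasserstein distance between two probability vectors could be computed; it is not the authors' implementation. Several things are assumptions: the haversine formula with a mean Earth radius stands in for the geodesic distance between centroids, the transport problem is solved with a generic successive-shortest-path min-cost flow, and the three centroids are hypothetical coordinates chosen only for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption, not from the paper

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))

def wasserstein1(f, g, dist, tol=1e-12, eps=1e-9):
    """W_1 between probability vectors f and g for the cost matrix dist."""
    n = len(f)
    S, T = 2 * n, 2 * n + 1          # source node and sink node
    graph = [[] for _ in range(2 * n + 2)]
    edges = []                        # [target, residual capacity, cost]; e ^ 1 is the reverse of e

    def add(u, v, cap, cost):
        graph[u].append(len(edges)); edges.append([v, cap, cost])
        graph[v].append(len(edges)); edges.append([u, 0.0, -cost])

    for i in range(n):
        add(S, i, f[i], 0.0)          # supply of region i
        add(n + i, T, g[i], 0.0)      # demand of region i
        for j in range(n):
            add(i, n + j, math.inf, dist[i][j])

    total = 0.0
    while True:
        # Bellman-Ford shortest path from S in the residual graph.
        d = [math.inf] * len(graph)
        prev = [-1] * len(graph)
        d[S] = 0.0
        for _ in range(len(graph) - 1):
            changed = False
            for u in range(len(graph)):
                if d[u] == math.inf:
                    continue
                for e in graph[u]:
                    v, cap, cost = edges[e]
                    if cap > tol and d[u] + cost < d[v] - eps:
                        d[v], prev[v], changed = d[u] + cost, e, True
            if not changed:
                break
        if d[T] == math.inf:          # no mass left to move
            return total
        # Bottleneck capacity along the path, then augment.
        push, v = math.inf, T
        while v != S:
            push = min(push, edges[prev[v]][1])
            v = edges[prev[v] ^ 1][0]
        v = T
        while v != S:
            e = prev[v]
            edges[e][1] -= push
            edges[e ^ 1][1] += push
            v = edges[e ^ 1][0]
        total += push * d[T]

# Three hypothetical region centroids (lat, lon in degrees); illustrative only.
centroids = [(40.0, -75.0), (38.0, -80.0), (36.0, -120.0)]
D = [[haversine_m(p, q) for q in centroids] for p in centroids]

near = wasserstein1([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], D)  # mass moves to the nearby region
far = wasserstein1([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], D)   # mass moves to the distant region
```

In the Dirac case the result reduces to the pairwise distance, so `near` equals `D[0][1]` and `far` equals `D[0][2]`. For 51 regions a dedicated linear-programming or optimal-transport library would likely be a more practical choice than this hand-rolled solver.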
While existing studies have typically analysed the behaviour of the different waves of the pandemic on a state-by-state basis, our approach will identify distinct times where a similar geographic distribution of COVID-19 returned across the US. In doing so, it can identify similarities in successive waves of the pandemic.

Temporal changes in spatial propagation. -Our data spans March 12, 2020 to June 30, 2021 across $n = 51$ regions (50 states and DC), a period of $T = 476$ days. We begin here to avoid periods of sparse reporting early in the pandemic. Our analysis includes Alaska and Hawaii; due to their small contribution to total US case counts, their exclusion has a minimal impact on results. In order to reduce irregularities in daily counts, such as lower reporting of tests on weekends, we first apply a simple 7-day smoothing operator to the counts. This yields a multivariate time series of smoothed new cases $x_i(t)$, $i = 1, ..., n$, $t = 1, ..., T$.

For the first experiment, we consider grouped probability vectors of 30-day periods. Specifically, let $f[a:b]$ be the probability vector of (smoothed) new cases in each state, observed across an interval $a \leq t \leq b$, divided by the total number of US cases across this period:

$$f[a:b]_i = \frac{\sum_{t=a}^{b} x_i(t)}{\sum_{j=1}^{n} \sum_{t=a}^{b} x_j(t)}, \quad i = 1, ..., n. \tag{5}$$

When $a = b = t$, we simply notate this as $f_i(t)$. In fig. 1, we display the distance function between adjacent 30-day periods,

$$t \mapsto W_G(f[t-30:t-1], f[t:t+29]). \tag{6}$$

For example, $t = 51$ corresponds to May 1, and the associated function value is the geodesic Wasserstein distance between the probability vector of cases across April 1-30 and May 1-30. Within this figure, we also mark local maxima (over a 30-day window). The primary finding of this figure is a drastic peak on June 6, 2020. This reveals that the spatial distribution of cases throughout the month of May (broadly speaking) was significantly different to that throughout the month of June. Other more moderate peaks are observed on August (6) displays approximately periodic intervals of increase and decrease.
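The preprocessing described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the text does not say whether the 7-day smoothing is trailing or centred, so a trailing mean is assumed; indices are 0-based rather than the 1-based $t$ of the text; and `half_l1` is a stand-in distance so the example runs without an optimal transport solver. Any function of two probability vectors can be passed in its place.

```python
def smooth(series, window=7):
    """Trailing moving average; the first days average over the data available."""
    out = []
    for t in range(len(series)):
        chunk = series[max(0, t - window + 1): t + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def grouped_probability_vector(x, a, b):
    """Share of each region in total cases over days a..b (inclusive, 0-based).

    x is a list of per-region series, x[i][t]."""
    per_region = [sum(xi[a:b + 1]) for xi in x]
    total = sum(per_region)
    if total <= 0:
        raise ValueError("no cases in the requested interval")
    return [c / total for c in per_region]

def adjacent_window_distances(x, distance, width=30):
    """Distance between the windows [t-width, t-1] and [t, t+width-1] for each admissible t."""
    T = len(x[0])
    return {
        t: distance(grouped_probability_vector(x, t - width, t - 1),
                    grouped_probability_vector(x, t, t + width - 1))
        for t in range(width, T - width + 1)
    }

def half_l1(f, g):
    """Stand-in distance (half the L1 norm); swap in a geodesic distance if available."""
    return 0.5 * sum(abs(a - b) for a, b in zip(f, g))

# Toy data: two hypothetical regions, four days, window width 2.
toy = [[10, 10, 0, 0],
       [0, 0, 10, 10]]
curve = adjacent_window_distances(toy, half_l1, width=2)
smoothed = smooth([7, 0, 0, 0, 0, 0, 0, 7])
```

In the toy data all cases move from the first region to the second between the two windows, so the only admissible split point, `t = 2`, attains the maximal stand-in distance of 1.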
To complement our primary finding, we take a deeper analysis of the 30-day periods prior to and following our peak date, June 6, 2020. The primary difference between the spatial distribution of COVID-19 cases across the US during the months of May and June 2020 is that May was dominated by high numbers of new cases in the Northeastern states, while cases spread much wider in June, impacting larger states such as California, Texas and Florida with significant case counts. We elucidate and quantify this with an individual state-by-state analysis of the distributional changes between May and June. In fig. 2, we plot the time-varying probability densities $f_i(t)$ for select states over the period May 7, 2020 to July 5, 2020. These figures show the substantial change in the distribution of COVID-19 across the country during this 60-day period. In figs. 2a, 2b, 2c, 2d respectively, we see that four Northeastern states, New York, New Jersey, Massachusetts, and Pennsylvania, exhibit a considerable decline in this period in their proportion of the total new COVID-19 cases across the US. While not a Northeastern state, the similarly urbanised and politically inclined state of Illinois (2e) also exhibits a drastic decline. On the other hand, California (2f), Texas (2g), Florida (2h) and Arizona (2i) exhibit a considerable increase. Altogether, fig. 2 shows the decline in the densities in the Northeastern states and the relative increase in all other areas of the country in the month of June. We provide more quantitative detail in Table 1 of the appendix, where we compute normalised integrals of the time-varying daily density function $f_i(t)$ for each state over May 7 to June 5 and over June 6 to July 5. That is, we compute the following for each state, expressed as a percentage: The (4). This is a substantial difference (its maximal possible difference is 1), and validates the sizeable change in the distribution of cases between these two 30-day periods.
We complete this first analysis with a novel mathematical approach to quantify the spread of a probability distribution across a metric space, while taking the spatial structure into account. Given a distribution $f$ corresponding to a measure $\mu$ on a finite metric space $(X, d)$, let

$$\mathrm{Var}(f) = \int_{X \times X} d(x, y)^2 \, d\mu(x) \, d\mu(y) = \sum_{x, y \in X} d(x, y)^2 f(x) f(y), \tag{8}$$

where the second equality is valid when the metric space is discrete, as in this example. We term (8) the geodesic variance of the distribution $f$. We explain how this is a generalisation of the classic notion of variance in the appendix (Proposition 2). The geodesic variance is zero for a Dirac delta $f = \delta_x$, and greater when the distribution is more spread out across the space. In fig. 4, we plot the time-varying variance of grouped 30-day distributions, $t \mapsto \mathrm{Var}(f[t:t+29])$. This figure reveals that the geographic variance of COVID-19 across the US is globally minimal when $t$ corresponds to May 3, 2020, reflecting the 30-day period until June 1, 2020. From this point, it sharply increases, corresponding to the spread of the virus across the country, complementing the findings already observed.

Reversion of spatial propagation over time. -For our second aim, we investigate the extent of spatial reversion over time. We wish to identify, for some values of $t$, if there is a non-trivial similarity with subsequent times $s > t$. For each $t$, we consider intervals $\{s : s \geq t + 30\}$ to avoid trivial similarity with neighbouring times and record both $\min\{W_G(f[s:s+29], f[t:t+29]) : s \geq t + 30\}$ and $\mathrm{argmin}\{W_G(f[s:s+29], f[t:t+29]) : s \geq t + 30\}$. Ignoring the trivial region of being within 30 days of $t$, these determine the time $s$ whose COVID-19 distribution is most similar to time $t$ as well as the magnitude of that similarity. We plot both the argmin function, in days, and the min function, in meters, in fig. 5a. To provide more details, we also plot $W_G(f[s:s+29], f[t:t+29])$ for all non-trivial values of $s \geq t + 30$ in fig. 5b.
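The two quantities of this section can be sketched in a few lines. The sketch reads the geodesic variance as the double sum of $d(x, y)^2 f(x) f(y)$ over pairs of regions, which is the form consistent with the factor of 2 stated in Proposition 2. The argmin is reported as the offset $s - t$ in days, which is how the surrounding text appears to use it. The function names, the stand-in distance and the toy vectors are all illustrative assumptions.

```python
def geodesic_variance(f, dist):
    """Sum over ordered pairs (i, j) of dist[i][j]**2 * f[i] * f[j]."""
    n = len(f)
    return sum(dist[i][j] ** 2 * f[i] * f[j] for i in range(n) for j in range(n))

def reversion(vectors, distance, gap=30):
    """For each window start t, find the most similar window starting at s >= t + gap.

    vectors[t] is the grouped probability vector of the window starting at t.
    Returns {t: (minimum distance, offset s - t)}; ties go to the smaller offset."""
    out = {}
    for t in range(len(vectors) - gap):
        candidates = [(distance(vectors[s], vectors[t]), s - t)
                      for s in range(t + gap, len(vectors))]
        out[t] = min(candidates)
    return out

def half_l1(f, g):
    """Stand-in distance (half the L1 norm); a geodesic distance can be passed instead."""
    return 0.5 * sum(abs(a - b) for a, b in zip(f, g))

# Points 0, 1, 3 on the real line with |x - y| as the metric.
points = [0.0, 1.0, 3.0]
line = [[abs(p - q) for q in points] for p in points]
f = [0.5, 0.5, 0.0]
mean = sum(p * w for p, w in zip(points, f))
classical = sum(p * p * w for p, w in zip(points, f)) - mean ** 2
var_g = geodesic_variance(f, line)        # twice the classical variance

# Toy sequence of window distributions over two hypothetical regions, with gap = 2.
windows = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0]]
rev = reversion(windows, half_l1, gap=2)
```

Here the first window is closest to the third (offset 2, distance about 0.1), and the second window recurs exactly at offset 2.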
In fig. 5a, non-trivial spatial reversion is characterised by the argmin function being greater than 30 (the nearest permissible day under consideration). With this in mind, two main findings are observed. First, April 2020 has both a high min and argmin, seen in fig. 5a. The value of the argmin is 350 days, revealing maximal proximity to a period about one year later; the value of the min is approximately 400 km. Thus, this period is not very similar to any other period, but it is weakly similar to April 2021. April 2020 is characterised by the early dominance of the Northeastern US states. This reflects that this period was quite distinct from any other time, but that April 2021 reflects a weak reemergence of the Northeast as a significant contributor to the nation's cases.

Secondly, June 2020 is characterised by a relatively low value in the minimum, revealing substantial non-trivial similarity to a subsequent period. The argmin here is approximately 200 days and then immediately after about 360 days. This reflects that June 2020 is highly similar to the period comprising December 2020 and January 2021, as well as the period of June 2021. Indeed, June 2020 is characterised by the second wave of COVID-19 in the US overall, and the first wave in the large states of California, Texas and Florida, while the beginning of 2021 is characterised by the third wave of COVID-19 in the US overall, and the second wave in the aforementioned large states. The sudden increase in the argmin function around this date reveals additional similarity with June 2021, in which a fourth wave of COVID-19 in the US is beginning. That is, the argmin values around June 2020 reveal two different subsequent reversions in the geographic distribution of COVID-19. This strong similarity is also visible in the darker regions of fig. 5b, to the right of the y-axis value of $t$ corresponding to June 1, 2020.

Conclusion.
-This paper has introduced a new methodological framework to analyse the spatio-temporal propagation of a process, such as an epidemic, across a physical region. Applying this to the spread of COVID-19 across the US, we have identified periods of rapid change and spatial reversion regarding the country's distribution of COVID-19. In May 2020, new COVID-19 cases were mostly concentrated in the Northeastern states; in June 2020, new cases were spread across the country. During June 2020, the majority of the US was experiencing its second wave of COVID-19; however, California, Texas and Florida, the three largest US states and all outside the Northeast, were experiencing their individual first waves [4]. Our spatial reversion analysis can capture the broad similarity between June 2020, the second wave of the country as a whole and the first in the aforementioned three states, and January 2021, the third wave in the country as a whole and the second in the three states. A weak similarity is observed between April 2020 and April 2021, signifying a return of the prominence of the Northeastern states.

Future work could modify our framework, replacing geographical distance with other notions of affinity. For example, one could consider two locations as "close", and assign them a high similarity, if there is a substantial volume of people travelling between them. Our methodology could be combined with existing methods of network analysis [36, 37], which have also been applied to particular regions [38], or applied on a more granular basis to individual US counties. Whether combined with other mathematical approaches or used in isolation, our methodology may provide analysts and policymakers alike with additional insights to protect their localities in advance.
During the first wave of COVID-19 in the US, the Northeastern states were severely impacted; during this time, several Southern states did not take precautions, with governors drawing considerable differences between their state and New York [39]. However, our analysis reveals that the geographic distribution of COVID-19 was changing rapidly by the day, and that other locations could have prepared more thoroughly. Combating COVID-19 requires a collaborative effort between regions that share geographic borders, such as US states and European countries. Studying the time-varying distribution of new COVID-19 cases combined with existing network analyses could provide opportunities for such neighbouring states to work together in combating the ongoing pandemic [40, 41] and new variants [42], and incorporating the strategic use of vaccinations [43, 44].

Data availability statement. -COVID-19 data is sourced from the New York Times [45] and US location data is sourced from Google [46].

* * *

The authors would like to acknowledge the reviewers for helpful comments and insights.

Proposition 1. Let $(X, d)$ be a finite discrete metric space, with $d(x, y) = 1$ for all $x \neq y$ and zero otherwise. Let $W_1(\mu, \nu)$ be the $L^1$-Wasserstein metric between two probability measures $\mu, \nu$ on $X$, with corresponding distribution functions $f, g$, expressed as in (2) of the manuscript. That is,

$$W_1(\mu, \nu) = \sup_{F} \left( \int_X F \, d\mu - \int_X F \, d\nu \right) = \sup_{F} \sum_{x \in X} F(x) \, (f(x) - g(x)).$$

Then, this supremum is optimized by the choice of $F$ as in (3) of the manuscript, namely

$$F(x) = \begin{cases} 1, & f(x) > g(x), \\ 0, & \text{otherwise}. \end{cases} \tag{11}$$

As such, $W_1(f, g)$ reduces to the simple form of (4) of the manuscript, namely

$$W_1(f, g) = \frac{1}{2} \| f - g \|_1.$$

Proof. Let $F$ be an arbitrary 1-Lipschitz function on $X$. That is, $F : X \to \mathbb{R}$ and $|F(x) - F(y)| \leq d(x, y) = 1$ for $x \neq y$. Let $M = \sup_{x \in X} F(x)$, $m = \inf_{y \in X} F(y)$. By taking the supremum over $x$ and the infimum over $y$, the Lipschitz condition ensures that $M - m \leq 1$. So

$$\int_X F \, d\mu - \int_X F \, d\nu = \sum_{x \in X} (F(x) - m)(f(x) - g(x)) + m \sum_{x \in X} (f(x) - g(x)) \leq (M - m) \sum_{x : f(x) > g(x)} (f(x) - g(x)),$$

using the fact that $\sum_x f(x) = \sum_x g(x) = 1$ to eliminate the second sum, and noting that $0 \leq F(x) - m \leq M - m$, so that terms with $f(x) \leq g(x)$ are non-positive. Now, let

$$P = \sum_{x : f(x) > g(x)} (f(x) - g(x)).$$

Then

$$\int_X F \, d\mu - \int_X F \, d\nu \leq (M - m) P \leq P.$$

Taking the supremum over $F$, we deduce $W_1(f, g) \leq P$. Finally, let $F$ be as in (11).
Then $\int_X F \, d\mu - \int_X F \, d\nu$ immediately coincides with $P$, by definition. Thus, the supremal value coincides with $P$. Since $\sum_x (f(x) - g(x)) = 0$, the positive and negative parts of $f - g$ have equal total mass, so $P = \frac{1}{2} \| f - g \|_1$, as required.

Proposition 2. Let $f$ be a probability distribution on a finite metric space $(X, d)$, and consider its geodesic variance, defined by (8) of the manuscript. That is,

$$\mathrm{Var}(f) = \sum_{x, y \in X} d(x, y)^2 f(x) f(y). \tag{20}$$

When $(X, d)$ is a finite subset of the real numbers $\mathbb{R}$ with its Euclidean metric, this quantity reduces to the classical notion of variance, up to a factor of 2.

Proof. Let $Y$ be a random variable on $\mathbb{R}$. The classical notion of variance is the quantity $\mathrm{var}(Y) = \mathbb{E}(Y^2) - (\mathbb{E} Y)^2$. If $Y$ has a distribution $f$ over a finite set $X$, this can be expressed

$$\mathrm{var}(Y) = \sum_{x \in X} x^2 f(x) - \left( \sum_{x \in X} x f(x) \right)^2.$$

On the other hand, (20) can be expanded

$$\sum_{x, y \in X} (x - y)^2 f(x) f(y) = \sum_{x, y \in X} (x^2 - 2xy + y^2) f(x) f(y) = 2 \sum_{x \in X} x^2 f(x) - 2 \left( \sum_{x \in X} x f(x) \right)^2.$$

Thus, up to a factor of 2, the geodesic variance reduces to the classical notion of variance on the real line.

With more care, the above propositions hold if $X$ is an arbitrary metric space, with $\mu, \nu$ appropriately integrable measures on $X$.

In Table 1, we record all values of $\Delta(f)_i$, as defined in eq. (7) and depicted in fig. 3 of the manuscript, measuring the percentage change in each state's contribution to new US cases between the adjacent 30-day periods of May 7 to June 5 and June 6 to July 5, 2020.

Uneven vaccination rates across the US linked to Covid-19 case trends, worry experts. CNN.
Alabama governor won't issue stay-at-home order because 'we are not California.' by population, it's worse. The Washington Post.
Covid-19) data in the United States