key: cord-0465095-po8itjt0
authors: Nielsen, Frank; Marti, Gautier; Ray, Sumanta; Pyne, Saumyadipta
title: Clustering patterns connecting COVID-19 dynamics and Human mobility using optimal transport
date: 2020-07-21
journal: nan
DOI: nan
sha: 17bad738e246bf423f2b098be3283a64b1211365
doc_id: 465095
cord_uid: po8itjt0

Social distancing and stay-at-home are among the few measures that are known to be effective in checking the spread of a pandemic such as COVID-19 in a given population. The patterns of dependency between such measures and their effects on disease incidence may vary dynamically and across different populations. We described a new computational framework to measure and compare the temporal relationships between human mobility and new cases of COVID-19 across more than 150 cities of the United States with relatively high incidence of the disease. We used a novel application of Optimal Transport for computing the distance between the normalized patterns induced by bivariate time series for each pair of cities. Thus, we identified 10 clusters of cities with similar temporal dependencies, and computed the Wasserstein barycenter to describe the overall dynamic pattern for each cluster. Finally, we used city-specific socioeconomic covariates to analyze the composition of each cluster.

The phenomenal spread of the COVID-19 pandemic will have unprecedented consequences for human life and livelihood. In the absence of a treatment or vaccine to develop immunity against the disease, governments around the world have used non-pharmaceutical, risk mitigation strategies such as lockdowns, shelter-inplace, school and business closures, travel bans or restrictions to limit movement and prevent contagion. The magnitude and effectiveness of such mitigation strategies in preventing contagion and reducing the number of deaths is shown in Europe where such mitigation strategies have reduced the reproduction number over time (R t ) below 1, which means that the virus will gradually stop spreading. Since the beginning of the epidemic, an estimated 3.1 million deaths were averted across 11 European countries attributable to these risk mitigation strategies [8] .

In the United States, the adoption, and enforcement of non-pharmaceutical, risk mitigation strategies have varied by state and across time. The first confirmed COVID-19 case was reported on January 21, 2020, in Washington State [11] . While transmissions were documented since, a national emergency was declared later on March 13 [13] . At that time, international travel restrictions were enforced. By March 16, six bay area counties declared shelter-in-place orders and on March 19, California was the first state to issue a state-wide order. Since then, several communities and states implemented stay-at-home orders and social distancing measures. As of March 30, there were 162,600 confirmed COVID-19 cases in the U.S. [13] and 30 states had announced shelter-in-place orders. On April 1, two additional states and the District of Columbia issued statewide shelter-in-place orders followed by 7 more states by April 6.

Historically, among the U.S. cities that were hit by the 1918 Spanish u, social distancing played a pivotal role in attening the pandemic curve. In fact, the cities which delayed enforcing social distancing saw the highest peaks in new cases of the disease. Policies aimed at reducing human transmission of COVID-19 included lockdown, travel restrictions, quarantine, curfew, cancellation and postponing events, and facility closures. Measuring the dynamic impact of these interventions is challenging [1, 6] and confounded by several factors such as differences in the specific modes and dates of the policy-driven measures adopted by or enforced across states, regions, and countries, and, of course, the actual diversity of human behaviors at these locations.

Given the current ubiquitous usage of mobile devices among the U.S. populations, social mobility as measured by aggregating the geospatial statistics of their daily movements could serve as a proxy measure to assess the impact of such policies as social distancing on human transmission. In the particular context of the current pandemic, human mobility data could be estimated using geolocation reports from user smartphones and other mobile devices that were made available by multiple providers including Google and Apple, among others. In this study, we obtained such data from Descartes Labs, which made anonymized location-specific time series data on mobility index freely available to researchers through their GitHub site: https://github.com/descarteslabs/DL-COVID-19. Thus, we obtained location-specific bivariate time series on daily mobility index and disease incidence, i.e., new cases of COVID-19 in the U.S.

In this study, we are interested to (a) measure and compare the temporal dependencies between mobility (M ) and new cases (N ) across 151 cities in the U.S. with relatively high incidence of COVID-19 by May 31, 2010. We believe that these dependency patterns vary not only over time but across locations and populations. For this purpose, we proposed a novel application of Optimal Transport to compute the distance between patterns of (N , mobility, time) and its variants for each pair of cities. This allowed us to (b) group the cities into different hierarchical clusterings, and (c) compute the barycenter to describe the overall dynamic pattern of each identified cluster. Finally, we also used city-specific socioeconomic covariates to analyze the composition of each cluster. A pipeline for our analytical framework is described in the following section.

Based on cumulative COVID-19 cases data from the Johns Hopkins Coronavirus Resource Center (https: //coronavirus.jhu.edu/), for this study, we compiled time series data on daily new cases of the disease for more than 300 U.S. counties from 32 states and the District of Columbia and matched by five-digit FIPS code or county name to dynamic and static variables from additional data sources. Since a single county may consist of multiple individual cities, we include the list of all city labels within each aggregate group to represent a greater metropolitan area. A total of 151 of such metropolitan areas that had at least 1,000 reported cases of COVID-19 by May 31, 2020, were selected for this study. Population covariates for these areas were collected from the online resources of the U.S. Census Bureau and the U.S. Centers for Disease Control and Prevention (CDC) (https://www.census.gov/quickfacts/, https://svi.cdc.gov/).

Anonymized geolocated mobile phone data from several providers including Google and Apple, timestamped with local time, were recently made available for analysis of human mobility patterns during the pandemic. Based on geolocation pings from a collection of mobile devices reporting consistently throughout the day, anonymous aggregated mobility indices were calculated for each county at Descartes Lab. The maximum distance moved by each node, after excluding outliers, from the first reported location was calculated. Using this value, the median across all devices in the sample is computed to generate a mobility metric for select locations at county level. Descartes Labs further defines a normalized mobility index as a proportion of the median of the maximum distance mobility to the "normal" median during an earlier time-period multiplied by a factor of 100. Thus, the mobility index provides a baseline comparison to evaluate relative changes in population behavior during COVID-19 pandemic. [22] .

Below we list the steps of the overall workflow of our framework, and briefly describe the same in the following paragraphs of this section. and HC3 were obtained based on Ward's linkage method and 3 variants of mobility: M , ∆M , and M respectively. 5: Apply HCMapper to compare the dendrograms of different clusterings (HC1, HC2 and HC3). Select the clustering (HC3) that yields the most spatially homogeneous clusters. 6: Compute Wasserstein barycenter for each cluster of the selected clustering (HC3). 7: Analyze the composition of the clusters by applying random forest classifier on 15 city-specific covariates as feature set. Identify the contributions of the covariates to discriminate among the clusters.

To better understand the temporal patterns of mobility, in addition to the given non-negative mobility index M , we also use two variants: delta mobility ∆M and M defined as follows:

and

Here ∆M is the first difference, and M approximately the local derivative [14] , of the time series M , and yet, unlike M , these are not restricted to be non-negative.

With the above definitions, the temporal relationship between mobility (and its variants) and new cases of each city in our data can be depicted as tuples (M/∆M/M , N , t). We represent the time series by performing a normalized ranking of the variables so as to represent each city by a discrete set of points in unit cube [0, 1] 3 . This normalized ranking is frequently used as a estimator for empirical copulas with good convergence properties [7] . The cities can have different representations by considering the three definitions of mobility metrics, and in each case, we can have different groupings of cities. A comparative analysis of all groupings can provide a correlation structure between groups of cities from different perspectives. We jittered the overlapping RT points for easy visualization.

To distinguish between the temporal dependence between mobility and new cases of two cities, we use Wasserstein distance from optimal transport theory. We compute Wasserstein distance between two discrete sets of points in unit cube, corresponding to two cities, as the minimum cost of transforming the discrete distribution of one set of points to the other set. It can be computed without the need of such steps as fitting kernel densities or arbitrary binning that can introduce noise to data. Wasserstein distance between two distributions on a given metric space M is conceptualized by the minimum "cost" to transport or morph one pile of dirt into another -the so-called 'earth mover's distance'. This "global" minimization over all possible ways to morph takes into consideration the "local" cost of morphing each grain of dirt across the piles [20] . Given a metric space M, the distance optimally transports the probability µ defined over M to turn it into ν:

where p ≥ 1, τ (µ, ν) denotes the collection of all measures on M × M with marginals µ and ν. The intuition and motivation of this metric came from optimal transport problem, a classical problem in mathematics, which was first introduced by the French mathematician Gaspard Monge in 1781 and later formalized in a relaxed form by L. Kantorovitch in 1942.

Upon computing optimal transport based distances for each pair of cities, hierarchical clustering of the cities is performed using Ward's minimum variance method [18] . For the 3 variants of mobility (M/∆M/M ), we obtained 3 different hierarchical clusterings: HC1, HC2 and HC3 respectively. Based on visual inspection of the distance matrix seriated by the hierarchical clustering, and looping over the number of clusters, we take a relevant flat cut in the dendrogram. For each case, we got 10 clusters, each consisting of cities that are similar with respect to their dependence between mobility and new cases.

The resulting clusters are compared using a visualization tool called HCMapper [17] . HCMapper can compare a pair of dendrograms of two different hierarchical clusterings computed on the same dataset. It aims to find clustering singularities between two models by displaying multiscale partition-based layered structures. The three different clustering results are compared with HCMapper to sought out the structural instabilities of clustering hierarchies. In particular, the display graph of HCMapper has n columns, where n represents the number of hierarchies we want to compare (here n = 3). Each column consists of the same number of flat clusters, which is depicted as rectangles within the column. Rectangle size is proportional to the number of cities within the clusters, while an edge between two clusters tells the number of shared cities between them. Thus, a one-to-one mapping between the clusters of two columns preferably depicts a similar perfect clustering while too many edges crossing between two columns describe a dissimilar structure. We also checked the spatial homogeneity of a clustering in terms of the average number of clusters in which the cities of each state were assigned to, over all states that are represented in our data. Moran's I to assess the spatial correlation among the cluster labels was also computed.

We summarize the overall pattern of each identified cluster by computing its barycenter in Wasserstein space. It efficiently describes the underlying temporal dependence between the measures of mobility (here we use M ) and incidence within each cluster. Wasserstein distances have several important theoretical and practical properties [21, 19] . Among these, a barycenter in Wasserstein space is an appealing concept which already shows a high potential in different applications such as, in artificial intelligence, machine learning and Statistics [4, 15, 3, 5] .

A Wasserstein barycenter [2, 5] of n measures ν 1 . . . ν n in P ∈ P (M) is defined as a minimizer of the function f over P, where

A fast algorithm [5] was proposed to minimize the sum of optimal transport distances from one measure (the variable) to a set of fixed measures using gradient descent. These gradients are computed using matrix scaling algorithms in a considerable lower computational cost. We have used the method proposed in [5] and implemented in the POT library (https://pythonot.github.io/) to compute the barycenter of each cluster.

To understand the composition of the identified clusters, i.e., what could explain the similarity in the temporal dependence between mobility and new cases of the cities that belong to a cluster, we used different city-specific population covariates, while checking their relative contributions to discriminating the clusters. These covariates include (a) date of Stay-at-home order, (b) population size, (c) persons per household, (d) senior percentage, (e) black percent, (f) hispanic percent, (g) poor percent, (h) population density 2010, (i) SVI ses (j) SVI minority, (k) SVI overall, and (l) Gini index. Here SVI stands for Social Vulnerability Index of CDC, and "ses" socioeconomic status. In addition, we also compute the 'reaction time' (RT) of each city as the number of days between the stay-at-home-order at a given city and a common reference stating point date (taken as 15 March, 2020).

This step also provides a form of external validation of the clustering results as none of the above covariates were used for clustering. To demonstrate, we conducted this step with the clustering HC3 obtained from the time series M .

Using the covariates as features of the cities, a random forest classifier is trained to learn the cluster labels. The aim is to see how the clustering could be explained by the covariates. To find which of the features contribute most to discriminate the clusters of cities we computed the mean Shapley values [16] . A Shapley value quantifies the magnitude of the impact of the features on the classification task. The ranking of the covariates/features based on the mean Shapley values determines the most relevant features in this regard.

In this study, we used bivariate time series on daily values of mobility index and COVID-19 incidence over a 3-month time-period (March 1 -May 31, 2020) for 151 U.S. cities that have reported at least 1,000 cases by the end of this period. By transforming the data for each city to a corresponding discrete set of ranked points on the unit cube, we computed the Optimal Transport distance as measure of temporal dependency between mobility and new cases for each pair of cities. Three definitions of mobility (M /∆M /M ) allowed us to generate 3 hierarchical clusterings: HC1, HC2 and HC3, as shown in Figure 1 and Table 1 . Each of the clusterings yielded 10 clusters of cities, which were compared for their sizes, singularities and divergences by the tool HCMapper, as shown in Figure 3 .

Among the clusterings, HC3 appeared to have clusters of consistent sizes, and also the fewest singularities and divergences. Further, when we mapped the counties representing the cities with cluster-specific colors, as shown in Figure 4 , we observed that the HC3 clusters showed high spatial correlation (Moran's I p-value of 0). They also showed the least disagreements among the cluster assignments of cities with each state, although some states like California and Florida contained cities from more than one cluster (see table 1). We looked into possible explanations of such cluster-specific differences using local covariates, as described below. Given the assumption of this study is that there are dynamic relationships between mobility and COVID-19 incidence that change not only over time but also across locations and populations, we computed Wasserstein barycenters of the 10 identified clusters, as shown in Figure 5 , to describe the overall dependency structure that is specific to each cluster. The temporal changes in the dependencies are shown in 3-dimensional plots, as the shading changes from light (early points) to dark green (later points) along the z-axis (time).

Finally, we sought to understand the factors that possibly underlie the dynamic patterns of each cluster as described above. Towards this, our results from Random Forest classification identified socioeconomic characteristics (or covariates) of the cities that could discriminate among the assigned cluster labels. The 8 most significantly discriminating covariates are shown in Figure 6 along with their cluster-specific contributions measured by the mean Shapley values. Notably, none of these covariates were used for clustering, and are yet able to discriminate among the clusters. Figure 2 shows the distinctive distributions of these covariates across the 10 identified clusters as boxplots. Reaction time is robustly the first and major contributor, which is indicative of the effects of stay-at-home on the different patterns of COVID-19 dynamics. 

The U.S. is alone among the countries in the industrialized world where the expected flattening of the curve did not take place yet. By May 31, 2020, there were 1.8 million confirmed COVID-19 cases and 99,800 deaths. 45 states were in various phases of re-opening and 5 states did not have shelter-in-place orders. By mid-June, cases had started to rise and as of June 26, there were 2.5 million confirmed cases and over 120,000 deaths. Some states that had begun to re-open parts of their economy have paused or delayed opening in the face of a surge of new cases. Estimating the impact of mitigation strategies on cases and deaths in the U.S. is challenging particularly due to the lack of uniformity in timing, implementation, enforcement, and adherence across states. Nevertheless, early observations point to the utility of such measures, particularly shelter-in-place orders in reducing infection spread and deaths (per data from California and Washington State) [9] . Counties implementing shelter-in-place orders were associated with a 30.2% reduction in weekly cases after 1 week, 40% reduction after 2 weeks, and 48.6% reduction after 3 weeks [10] Conversely, model projections estimate a steady rise in cases and over 181,000 deaths if such mitigation strategies were to be eased and not re-enforced before October 1 [12] .

Many researchers worldwide are currently investigating the changes in social and individual behaviors in response to the sudden yet prolonged outbreaks of COVID-19, e.g., [1, 6] . As the pandemic progresses, and until medical treatments or vaccination are available, new and diverse patterns of mobility, be they voluntary or via interventions, may emerge in each society. It is, therefore, of great importance to epidemiologists and policy-makers to understand the dynamic patterns of dependency between human mobility and COVID-19 incidence in order to precisely evaluate the impact of such measures. In this study, we have shown that such dependencies not only change over time but across locations and populations, and are possibly determined by underlying socioeconomic characteristics. Our analytical approach is particularly relevant considering the high socioeconomic costs of such measures.

We understand that our study has some limitations. We note that each step of our framework could be improved in isolation or as a pipeline, which we aim to do in our future work. We have also developed a prototype of an interactive tool to run online the steps of our analytical pipeline. It will be made publicly available shortly upon completion.

Here it is important to note the so-called ecological fallacy in inferring about individual health outcomes based on data or results that are obtained at either city or county levels. Such inference may suffer from incorrect assumptions and biases, which, however unintentional, must be avoided. Any views that might have reflected on the analysis or results of our study are those of the authors only, and not the organizations they are associated with.

Interplay of global multiscale human mobility, social distancing, government interventions, and COVID-19 dynamics. medRxiv

Barycenters in the Wasserstein space

Iterative Bregman projections for regularized transportation problems

Numerical methods for matching for teams and Wasserstein barycenters

Fast computation of Wasserstein barycenters

Critical community size for COVID-19: A model based approach for strategic lockdown policy

Non parametric tests of independence

Estimating the effects of non-pharmaceutical interventions on COVID-19 in europe

Social distancing for coronavirus is flattening the curve, California and Washington data show -the washington post

The effect of stay-at-home orders on COVID-19 cases and fatalities in the united states. medRxiv

First known person-to-person transmission of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in the USA

COVID-19 projections. institute for health metrics and evaluation. Website, Accessed

Proclamation on declaring a national emergency concerning the novel coronavirus disease (COVID-19) outbreak. White House

Derivative dynamic time warping

Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields

A unified approach to interpreting model predictions

HCMapper: An interactive visualization tool to compare partition-based flat clustering extracted from pairs of dendrograms

Hierarchical clustering

Fast and robust earth mover's distances

Computational optimal transport: With applications to data science. Foundations and Trends R in Machine Learning

Optimal transport: old and new

Mobility changes in response to COVID-19