key: cord-0155589-y3v3cnxp
authors: Self, Stella; Overby, Anna; Zgodic, Anja; White, David; McLain, Alexander; Dyckman, Caitlin
title: A Generalization of Ripley's K Function for the Detection of Spatial Clustering in Areal Data
date: 2022-04-22
journal: nan
DOI: nan
sha: 2641aa5a02f8d8591136fe2c3022c11e69270f05
doc_id: 155589
cord_uid: y3v3cnxp

Spatial clustering detection has a variety of applications in diverse fields, including identifying infectious disease outbreaks, assessing land use patterns, pinpointing crime hotspots, and identifying clusters of neurons in brain imaging applications. While performing spatial clustering analysis on point process data is common, applications to areal data are frequently of interest. For example, researchers might wish to know if census tracts with a case of a rare medical condition or an outbreak of an infectious disease tend to cluster together spatially. Since few spatial clustering methods are designed for areal data, researchers often reduce the areal data to point process data (e.g., using the centroid of each areal unit) and apply methods designed for point process data, such as Ripley's K function or the average nearest neighbor method. However, since these methods were not designed for areal data, a number of issues can arise. For example, we show that they can result in loss of power and/or a significantly inflated type I error rate. To address these issues, we propose a generalization of Ripley's K function designed specifically to detect spatial clustering in areal data. We compare its performance to that of the traditional Ripley's K function, the average nearest neighbor method, and the spatial scan statistic with an extensive simulation study. We then evaluate the real world performance of the method by using it to detect spatial clustering in land parcels containing conservation easements and US counties with high pediatric overweight/obesity rates.

The rapid growth in the use of geographic information system (GIS) software over the past thirty years has led to an explosion of spatial data and associated analytical methods. Spatial data generally falls into one of two categories. Point process data are associated with a specific latitude-longitude location, while areal data are associated with a spatial region (such as a county or census tract). The locations of trees in a forest, the addresses of cases of an infectious disease, and locations of violent crimes are all examples of point process data. Researchers frequently wish to determine if such data exhibit spatial clustering, loosely defined as an excess of events in one or more areas. Areal data can also exhibit clustering. Census blocks having a case of a rare cancer, land parcels with development restrictions, or counties which required individuals to wear a mask in public indoor settings during the COVID-19 pandemic are all examples of areal data which could be clustered. While a variety of methods have been developed to assess point process data for clustering, less attention has been paid to the areal case.

In this paper, we consider the problem of assessing binary areal data for spatial clustering and dispersion. We stress that we are here considering clustering in areal data based on location only, that is, we are attempting to answer the question, 'is there an excess of areal units having some binary characteristic of interest in certain part(s) of the study area?'. The terms 'clustered data' or 'clustering' are sometimes use to describe data which exhibits positive spatial autocorrelation in some numeric attribute. For example, census tracts with a high incidence of an infectious disease might tend to be closer to other tracts with high incidence. A number of methods exist for assessing areal data for this sort of clustering, including the Getis Ord Gi* statistic and the local Moran's I statistic, but 'clustering' in the sense of positive spatial autocorrelation among data attributes is not the focus of this work. For our purposes, the term 'clustered data' or 'clustering' will refer clustering based on location only, that is, we will say data is clustered if the locations of the observations are clustered, regardless of any spatial patterns present in their other attributes.

Most cluster detection methods are either designed for point process data or intended to assess spatial autocorrelation in numeric data attributes derived from aggregating data over an areal unit. One notable exception is the spatial scan statistic, initially developed in Kulldorff (1997) . Under the original formulation, the number of observations in the study area was assumed to follow either a binomial or a Poisson distribution. The method identifies the 'most likely cluster' by considering a large number of possible zones (usually circles) and performing a series of likelihood ratio tests to determine if probability of observing an event (under the binomial likelihood) or the event rate (under the Poisson likelihood) is larger inside the zone than outside. The spatial scan statis-tic may be used to assess areal data for the presence of clustering by assuming a binomial likelihood where each areal unit has a population of 1, with units having the characteristic of interest considered to have 1 'success' (Kulldorff et al., 2006) . When the spatial scan statistic is applied in this way, areal units are considered part of a zone if their centroid falls in the zone. However, representing an areal unit with its centroid is not without problems. If an areal unit is not convex, the centroid may lie outside the unit. Two large adjacent units might have centroids which are quite distant from each other, even though the units themselves share a border. The spatial scan statistic method has been extended beyond the original binomial and Poisson cases to handle normally distributed data (Huang et al., 2009; Shen and Jiang, 2014) , time-to-event data (Huang et al., 2007; Bhatt and Tiwari, 2016) , and ordinal data (Jung et al., 2007) ; a non-parametric version has also be developed (de Carvalho et al., 2021) . Kulldorff et al. (2006) develop a more flexible spatial scan statistic based on elliptical zones, while Tango and Takahashi (2005) use irregularly shaped windows built from the underlying areal units. Typically Monte Carlo simulations are used to simulate the null distribution of the spatial scan statistic, though exact or asymptomatic null distributions are available for a few related variants (Soltani and Aboukhamseen, 2015) .

One of the earliest methods for the detection of spatial clustering was the average nearest neighbor (ANN) method developed in Clark and Evans (1954) . This method was developed for point process data. The ANN method computes the average distance between each observed point and the point closest to it and compares this average to the expected distance under a null hypothesis of complete spatial randomness. Small ANN ratio values indicate spatial clustering, while large values indicate dispersion. Several extensions to the original ANN approach have been proposed (Clark and Evans, 1955; Clark, 1956; Clark and Evans, 1979) , and the method has been widely used in a variety of disciplines, including geography and ecology; for a nice overview, see Philo and Philo (2021) . While the ANN method was developed for point process data, it is commonly applied to areal data by using the centroids of areal units as the observed latitude and longitude locations. ArcGIS software does this by default when applying the method to areal (polygon) data (ESRI, 2021a) . In such scenarios, the centroid of an areal unit is considered an 'observed point' if the areal unit has some binary characteristic of interest.

Spatial clustering can occur at various geographical scales. For example, cases of an infectious disease may be clustered within a household while infected households themselves are clustered at the neighborhood level. Ripley's K function is a popular method for assessing spatial clustering because it allows researcher to assess clustering at a specific geographical scale (or multiple scales) (Ripley, 1976 (Ripley, , 1977 (Ripley, , 1981 . For a given distance r, Ripley's K function provides a means of comparing the number of pairs of observations located within a distance of r of each other and the number of such observations we would expect to find if the data were randomly scattered (i.e. no clustering). If the overall density of observations in the space is λ and there is no clustering, then we would expect a circle of radius r centered at any given observation to contain approximately πr 2 λ observations. Values much larger than this indicate spatial clustering, i.e. the number of nearby points is larger than expected, while values much smaller than this are indicative of spatial dispersion. Ripley's K function has been widely used in a variety of fields, including ecology (Haase, 1995) , microbiology (Yunta et al., 2014) , cancer detection (Martins et al., 2009) , image analysis (Amgad et al., 2015) , and archaeology (Sayer and Wienhold, 2013) . After the initial development of Ripley's K function in Ripley's seminal works (Ripley, 1976 (Ripley, , 1977 (Ripley, , 1981 , a variety of extensions and adjustments were developed. Many of these adjustments are designed to address the problem of edge effects caused by the underestimation of K(r) near the study area boundary if some of the points within distance r of a point in question fall outside of the study area. Edge corrections for circular (Diggle, 1983) , rectangular (Diggle, 1983) and irregular (Goreaud and Pélissier, 1999) study areas have been developed, as well as several more complex methods of edge correction (Sterner et al., 1986; Szwagrzyk and Czerwczak, 1993; Upton and Fingleton, 1985; Getis and Franklin, 1987; Andersen, 1992) .

Inference based on Ripley's K function rests heavily on the theory of spatial point processes. For example, the null hypothesis for statistical tests based on Ripley's K function is that the points arise from a two dimensional homogeneous Poisson process, which is inherently violated by areal data. However, the lack of cluster detection methods designed specifically for areal data have caused many researchers to use Ripley's K function on areal data, typically using the centroids and computing Ripley's K function using the resulting set of points. For example, a number of researchers have attempted to assess spatial patterns in land parcel data via Ripley's K function or Ripley's L function (a scaled version of Ripley's K function) (Lee and Lee, 2013; Siordia, 2013; Zipp et al., 2017; Qiao et al., 2019) . Other researchers have taken public health data associated with a geographical region such as a city or health division and computed Ripley's K function using the centroids of these larger geographical regions (Wade, 2014; Karunaweera et al., 2020; Skog et al., 2014 ). Ripley's K function has also been used to assess areal data for clustering in a variety of ecological and geological applications (Kretser et al., 2008; Davarpanah et al., 2018; Marj and Abellan, 2013) . In fact, ArcGIS software computes Ripley's K function for areal (polygon feature) data by mapping each areal unit to its centroid by default (ESRI, 2021b) . To our knowledge, the performance of Ripley's K function on areal data has never been evaluated. We show in our simulation studies that Ripley's K function often has a severely inflated type I error rate when applied to areal data.

In this paper, we propose an extension of Ripley's K function, which we refer to as Ripley's K function for areal data (RKAD). The interpretation of RKAD is similar to that of the traditional Ripley's K function, but it possesses improved performance for areal data. In Section 2, we define Ripley's K function and RKAD. Section 3 presents the results of an extensive simulation study to compare performance of RKAD, Ripley's K function, the ANN method, and the spatial scan statistic when assessing areal data for clustering or dispersion. In Section 4, we assess the performance of RKAD on two real datasets. First, we use it to determine if land parcels in Boulder County, Colorado which contain a conservation easement are spatially clustered. Next, we apply RKAD to determine if US counties with high childhood overweight rates are spatially clustered. Section 5 provides concluding remarks.

Suppose we are observing a spatial point process on a two dimensional region A , with density λ. For a distance r > 0, Ripley's K function is defined as K(r) = λ −1 E(number of points within a distance of r of any given point).

If we have observed a collection of n points, we can estimate Ripley's K function asK (r) =λ −1 n i=1 n j=1 w ij n whereλ = n/|A |, |A | denotes the area of A , and w ij is a weight associated with points i and j. In the traditional approach, w ij = 1 if the distance between points i and j is less than r and 0 otherwise (Ripley, 1976 (Ripley, , 1977 . However, many variants of Ripley's K function exists which modify these weights to account for edge effects (Ripley, 1976; Sterner et al., 1986; Diggle, 1983; Szwagrzyk and Czerwczak, 1993; Upton and Fingleton, 1985; Getis and Franklin, 1987; Andersen, 1992) . In practice, Ripley's K function is often re-scaled to Ripley's L function,L(r) = {K(r)/π} 1/2 .

Ripley's K function is often used to determine if an observed collection of points exhibits complete spatial randomness (CSR) (i.e. the points arise from a two-dimensional homogeneous Poisson process). For a homogeneous Poisson process on an infinite study area, K(r) = πr 2 . The distribution ofK(r) under CSR for a finite study area A can be approximated with Monte Carlo simulations, which are used to perform a hypothesis test with the null hypothesis being that the observed data arises from a homogeneous Poisson process with rate parameterλ. Large values ofK(r) indicate spatial clustering, that is, the number of points within a distance of r of any given point is larger than would be expected if the data exhibited CSR. Small value ofK(r) indicate dispersion, that is, the number of points within a distance of r of any given point is smaller than would be expected under CSR. For an overview of Ripley's K function, edge correction methods, and hypothesis testing via Monte Carlo simulations, see Dixon (2014).

While Ripley's K function is designed for point process data, in practice, it is often used to assess areal data for clustering. Areal units with some binary characteristic of interest are mapped to their centroids and Ripley's K function is applied to to the resulting set of points. The lack of suitable alternative methods to assess clustering in areal data as well as the fact that ArcGIS applies Ripley's K to polygon centroids by default both contribute to this misuse (ESRI, 2021b) . Applying Ripley's K to the centroids of areal units is particularly problematic when the units are vastly different sizes. For example, in one of our motivating data applications we wish to determine if land parcels having conservation easements (CEs) are clustered. Under the null hypothesis, all parcels are equally likely to have a CE. Under the null hypothesis, a portion of the study area with many small parcels (such as a metropolitan area), will have more parcels with CEs than portions of the study area with many large parcels, simply because there are more parcels per unit area. Put another way, centroids of smaller parcels will appear clustered relative to centroids of larger parcels simply because the size of the small parcels allows the centroids to be closer together.

To surmount these difficulties, we propose a modification to Ripley's K function. Suppose we have a study area A divided into N areal units a 1 , a 2 , ..., a N , and that each unit a i possesses some characteristic of interest with probability p i . For each unit, we define the Bernoulli(p i ) random variable Y i to be 1 if unit i possesses the trait of interest and 0 otherwise. For example, A might be a county, a 1 ,...a N be the parcels of land in the county, and Y i = 1 if parcel i contains a CE. We will refer to the subset of the units for which Y i = 1 as the 'observed units' or the 'observations', and denote them by

. For a given study area A , set of areal units a 1 , ..., a N , and observed data y, for r > 0 and each i for which y i = 1, define

where c i (r) denotes the circle of radius r centered at the centroid of a i . Thus m(r, i, y) is the proportion of the circle of radius r centered at the centroid of a i which falls into the observed units (see Figure 1 ). Note that m(r, i, y) is only defined if a i ∈ B(Y ). Additionally, define m(r, y) as the sample average (taken over all n y = N i=1 y i observed units) of the amount of area within a distance of r of an observed unit centroid which falls into observed units.

Finally, define Ripley's K function for areal data (RKAD) by

1}}. Therefore, m(r) = π −1 r −2 E(observed area within a distance of r of an observed unit centroid).

Also define While m(r) and m(r; n y ) may be calculated explicitly provided the vector of probabilities p = (p 1 , ..., p N ) is known, doing so involves computing (respectively) a 2 N and an N ny dimensional sum, and Monte Carlo approximations are likely to be more practical when N is large.

A test for clustering or dispersion in the observed units may be derived by comparing the observed m(r, i, y) values to m(r; n y ) calculated in the absence of spatial dependence. Consider the following test statistic

where m(r; n y ) is calculated under an assumed null distribution which lacks spatial dependence. Values of T (r, y) which are large or small relative to the null distribution indicate a departure from the null distribution. When units exhibit clustering, certain part(s) of the study area will contain more units than expected under the null distribution. As a consequence, other parts of the study area will have fewer units than expected (as the areas in between the clusters must be sparse in comparison). Therefore values of m(r, i, y) which are much larger or much smaller than m(r; n y ) are both indicative of clustering. Consequently, a large value of T (r, y) (i.e. a large average absolute difference between the observed m(r, i, y)'s and the estimated value m(r, n y ) under the null distribution) indicates spatial clustering. When the areal units are dispersed, that is, distributed at more regular intervals than expected in the absence of spatial dependence, then variability in the m(r, i, y)'s will be decreased and the average absolute difference between the m(r, i, y)s and m(r, n y ) under the null distribution will be small. Therefore, small values of T (r, y) indicate dispersion.

While the sampling distribution of T (r, ·) is not amenable to direct evaluation, it can be approximated via Monte Carlo simulations. The same Monte Carlo simulations can be used to approximate m(r; n y ). Such an approximation involves specification of the null distribution. The class of distributions which lack spatial dependence is too broad to be practical, and the class must be narrowed in some way. As is typical for hypothesis tests for spatial dependence, we take the case of independent and identically distributed observations as our null distribution (for exact specifications, see Section 3). An approximately α-level hypothesis test for clustering can be conducted by rejecting the null hypothesis if T (r, y) exceeds the 1 − αth quantile of the Monte Carlo sample of T (r, ·). Similarly, a test for dispersion can be conducted by rejecting the null hypothesis if T (r, y) is less than the αth quantile of the Monte Carlo sample.

In this section, we perform an extensive simulation study to compare the performance of RKAD to that of the ANN method, the spatial scan statistic, and the traditional Ripley's K function. We consider the performance of our proposed hypothesis testing procedure using two study areas which are shown in Figure  2 :

A 1 : A 20 by 20 regular grid of N 1 = 400 cells A 2 : The N 2 = 3, 108 counties (and county-equivalents) in the contiguous United States For each study area A j , j = 1, 2, we generate data under 19 different scenarios. First we consider the case of no spatial pattern in the locations of the observed units via the following three scenarios:

Here Y ∼SWoR(N, k, p) indicates that the random variable Y = (Y 1 , ..., Y N ) arises by selecting k elements from {1, 2, .., N } via sampling without replacement (SWoR) where p = (p 1 , ..., p N ) gives the probability of selecting each element, and Y i = 1 if i was selected and 0 otherwise. Areal unit a i is observed if and only if y i = 1. For j = 1, 2, we set p jI = (N −1 j , . . . , N −1 j ) . For each study area, we also consider 12 scenarios in which the locations of observed units are clustered. First, we assess the ability of our hypothesis test to detect large-scale spatial clustering which exists in only one part of the study area. We consider the following 6 scenarios:

The N j -dimensional vectors p jC l = (p jC l 1 , . . . , p jC l Nj ) , j = 1, 2, l = 1, ..., 6 are defined as follows: an entry of p jC l is equal to q/D if the unit is shown in blue in Figure 3 and equal to 1/D otherwise where D is such that i p jC l i = 1. For C 1 , C 3 and C 5 , we take q = 5 and for C 2 , C 4 and C 6 we take q = 10. Thus the blue units are 5 times more likely to be observed than the white units under data generation mechanisms C 1 , C 3 and C 5 and 10 times more likely to be observed under data generation mechanisms C 2 , C 4 and C 6 .

Next, we assess the ability of our hypothesis test to detect spatial clustering at a smaller scale when it occurs at multiple clusters spread through the study area by considering the following 6 scenarios:

Here, Y ∼ C(k, m, q) denotes that Y is generated as follows. First, m elements of {1, 2, ..., N } are randomly selected via SWoR(N j , m, p jI ). Next, k − m elements of {1, 2, ..., N } are selected via SWoR(N, k −m, p 2 ), where p 2i = 0 if i was one of the first m elements selected, p 2i = q/D for i such that a i shares a border with at least one of the initially selected k elements (but i itself was not initially selected) and p 2i = 1/D otherwise, where D is chosen so that i p 2i = 1; Y i = 1 if element i was selected in either the first or second step, and Y i = 0 otherwise, and areal unit a i is observed if and only Y i = 1. Note that data generation mechanisms C 7 , C 9 and C 11 correspond to 'weaker' clustering, in the sense that they tend to select fewer adjacent units than mechanisms C 8 , C 10 and C 12 Finally, we assess the ability of our hypothesis testing procedure to detect spatial dispersion under the following 4 scenarios. D 1 : Y ∼ C N j /10 , N j /100 , 1 10

Here, in D 1 and D 3 adjacent units are one tenth as likely to be observed as nonadjacent units, creating a mild dispersion effect. Under D 2 and D 4 adjacent units cannot be selected at all, creating a stronger dispersion effect. Finally, we consider only two samples sizes when assessing dispersion (N j /10 and N j /6) because it becomes increasingly difficult or impossible to select only non-adjacent units as number of selected units increases. Examples of data generated each scenario for A 1 is shown in Figure 4 . Supplementary Figure 1 provides similar examples for A 2 .

For each data generation scenario, we compare four methods: the average nearest neighbor method applied to the areal unit centriods, the spatial scan statistic under a binomial likelihood with each areal unit is assumed to have population 1 and zone membership is determined using areal unit centriods, traditional Ripley's K function with Ripley's isotropic edge correction (Ripley, 1988) , and our RKAD. For each study area, 10 radii r 1 , . . . , r 10 were selected for evaluation of the spatial scan statistic, Ripley's K, and RKAD. The set of radii used for each study area are shown in Figure 2 . The smallest radius was equal the smallest distance between any areal unit centriods, and the radii were increased incrementally, with the largest radii being approximately one fourth of the width of the study area.

The average nearest neighbor test statistic was computed using the nni function in the spatialEco R package (Evans, 2021) , and the spatial scan statistic was computed using the scan.test function in the spatstat R package (Baddeley et al., 2015). Ripley's K function was computed using the Kest function in the R spatstat package. The envelope function (also in the spatstat package) with 100 simulations was used to perform hypothesis testing for Ripley's K function.

To perform the hypothesis testing procedure based on Ripley's K function for areal data, Monte Carlo simulations were used. For each study area A j , j = 1, 2 and each sample size n ∈ { N j /10 , N j /6 , N j /4 , N j /2 }, 1,000 datasets were simulated under the null hypothesis of equal probability sampling without replacement. For each study area A j and n, y g was generated from SWoR(N j , n, p j ) for g = 1, ..., 1000, and m(r l ; n) = 1 1000 1000 g=1   1 n i:ygi=1 |c i (r l ) ∩ B(y g )| πr 2 l   was calculated. To approximate the null distribution of the test statistic T for each study area A j , observation size n, and radius r l T g (r l , y g ) = 1 n i:ygi=1 |m(r l , i, y g ) − m(r l ; n)| was calculated, and the quantiles of the set {T 1 (r l , y 1 ), ..., T 1000 (r l , y 1000 )} were used as approximate critical values for the hypothesis test. After approximating the each null distribution, 500 instances of y were generated under each of the 19 data generating mechanisms. For each radius r l and each y, T (r l , y) was computed and compared to critical values from the null distribution with the same observation size n and radius r l . For data generation mechanisms I 1 , I 2 and I 3 , an α = 0.05 two-tailed test was performed for each method. Note that the spatial scan statistic method, the Ripley's K method, and the Ripley's K for areal data method all require Monte Carlo approximation of the null distribution, rendering the level of these tests approximate. For data generation mechanisms C 1 −C 12 , a one-tailed test was performed with clustering as the alternative hypothesis. Finally, for data generation methods D 1 − D 4 , a one tailed test was performed with dispersion as the alternative hypothesis.

Tables 1 and 2 summarize the results from study area A 1 (the regular grid) and study area A 2 (the US counties), respectively. Each table reports the empirical rate of rejection for the null hypothesis. For scenarios I 1 , I 2 and I 3 , this quantity is the empirical type I error rate; for the other scenarios, this quantity is the empirical power. As the Ripley's K method and the RKAD were performed at 10 different radii, the rejection rate for each radius is reported separately.

Under the null scenarios (I 1 , I 2 and I 3 ), the empirical type I error rate of RKAD is within a Monte Carlo margin of error of its nominal level for all radii considered. The empirical type I error rate of the spatial scan statistic is at (or slightly below) its nominal level for these scenarios. The empirical type I error rate of the point process methods were markedly inflated, rejecting 100% of the null tests in some cases. The ANN method was generally worse for the regular grid than for the US counties, while Ripley's K was generally worse for the US counties and situations with more observations.

Under the large-scale clustering scenarios (C 1 − C 6 ), the RKAD method has high empirical power to detect clustering for all radii except the smallest radius, with performance improving as the strength of the clustering and sample size increase. The empirical power of the spatial scan statistic method is also quite high. Under the small-scale clustering scenarios (C 7 − C 12 ), the empirical power of the RKAD method is generally high (> 85%) for radii r 2 − r 3 , with power declining somewhat for larger radii. As the clustering in these scenarios occurs at a close geographic scale (i.e. first degree neighbors), we would expect the highest power at the smaller radii. The exception is scenario C 7 , which corresponds to the weakest clustering and the smallest sample size, for which empirical power is notably lower. The empirical power of RKAD at radii r 2 and r 3 is greater than or equal to that of the spatial scan statistic for almost all multiple clustering scenarios.

Under the dispersion scenarios (D 1 -D 4 ), the empirical power of RKAD varies considerably by scenario. For the regular grid, performance is rather poor for the smaller sample size, but improves as the strength of the dispersion and the number of observed units increases. Performance is much better for the US counties. The empirical power of the spatial scan statistic is quite low (< 5%) for all dispersion scenarios.

In summary, of the four methods considered, only RKAD and the spatial scan statistic had satisfactory type I error rates. As expected, the empirical power of RKAD varied with the choice of radius, with empirical power being higher for radii which correspond to the type of clustering present in the data (i.e. larger radii for the large-scale clustering in scenarios C 1 − C 6 and smaller radii for the small-scale clustering in scenarios C 7 −C 12 ). In almost all scenarios, the empirical power of RKAD with the ideal radius was higher than that of the spatial scan statistic.

In this Section, we consider the performance of our method on real world applications from two different fields. First, we use the method to determine if land parcels held as CEs are clustered in Boulder County, Colorado, using the 112,819 distinct land parcels in Boulder County as the areal structure. Next, we apply our method to determine if US counties with high childhood overweight rates are spatially clustered, using the 3,108 county and county-equivalents in the contiguous US as the areal structure.

CEs are a private and generally perpetual form of land conservation that legally severs aspects of private landownership (e.g., development rights, resource extraction, etc.) from a parcel of land (McLaughlin and Weeks, 2009 ). Although a landowner makes an individual decision to place a CE, there is evidence of spatial clustering of CEs over time, throughout the US (Lamichhane et al., 2021) . Cumulative and clustered CE use may impact regional ecosystem character by altering the degree of CE parcels' isolation or connectivity with other ecologically valuable parcels and may change ecologic quality on the CE parcel itself (Graves et al., 2019) . The greater the mass of clustering and ecological systems integrity, the more impact there may be on the land conversion rates at the county level, and on the decision to leave a parcel in open space (or not), potentially affecting placement of other socially valuable land uses as well. Furthermore, recognizing if and where CEs are clustered and linking the social, political, biological, and geographical characteristics to the clustered areas may help elucidate the factors driving CE placement (Baldwin and Leonard, 2015) .

The Boulder County data consists of 112,819 land parcels in place in 2008. Of these land parcels, 817 were held as CEs. A parcel was considered to be part of a CE if any part of the parcel was part of an easement. Figure 5 depicts the land parcels; parcels which are part of CE are shown in blue. The method was applied at 10 different radii, also depicted in Figure 5 .

In order to apply our method, the distribution of the RKAD test statistic under the null hypothesis was estimated using 1,000 Monte Carlo simulations. In each Monte Carlo simulation, 817 parcels were selected via simple random sampling without replacement. The observed RKAD test statistic was larger than the 95th quantile of the estimated null distribution for all radii, indicating that parcels which contain CEs are significantly clustered for all radii.

Human land conversion has caused widespread habitat loss and fragmentation (Haddad et al., 2015) . In the context of CEs with a purpose of biological conservation, clustering easements close to one another is one reserve design principle to improve landscape connectivity and combat the adverse effects of habitat fragmentation (Diamond, 1975) . Larger and higher quality habitats (easements) increase the size and stability of source populations and subsequently increase species dispersal capabilities (Hodgson et al., 2009) . Clustering and structural connectivity between conservation areas are not always positive, however, as clustering may also leave these areas vulnerable to spatially autocorrelated extinction pressures, such as diseases, invasive species, stochastic environmental events, or negative effects from localized urban growth (Donaldson et al., 2016) . Given that RKAD indicated spatial clustering of CEs in Boulder County, more detailed landscape connectivity studies focused on functional connectivity may be warranted (Balbi et al., 2019; Tischendorf and Fahrig, 2000) .

Next, we use the RKAD method to determine if US counties with high childhood overweight/obesity rates are spatially clustered. County-level childhood overweight rates were estimated from data collected in the 2016 National Survey of Children's Health using a multilevel small area estimation approach as described in Zgodic et al. (2021) . A county was considered to have a high overweight rate if its estimated rate exceeded the 75th percentile of all county overweight rates. There are 3,108 counties and county-equivalents in the contiguous US, and 786 of these counties were found to have a high rate of childhood overweight. These counties are shown in blue in Figure 5 , along with the radii at which RKAD was applied. The distribution of the RKAD test statistic under the null hypothesis was estimated using 1,000 Monte Carlo simulations. In each Monte Carlo simulation, 786 counties were selected via simple random sampling without replacement. The observed RKAD test statistic was larger than the 95th quantile of the estimated null distribution for all radii, indicating that counties with high rates of childhood overweight are significantly clustered. As Southeastern and Midwest states tend to have higher overweight and obesity rates than the rest of the country (Gartner et al., 2016; CDC, 2021) , these results are not surprising.

The problem of assessing areal data for spatial clustering has been the subject of relatively little attention. While the ANN method and the traditional Ripley's K method are often used to assess areal data for clustering by mapping each areal unit to its centroid, these methods were not designed for areal data. Our simulation study shows that applying these methods in this manner results in a highly inflated type I error rate. In fact, in many settings these methods rejected 100% of the null hypotheses. Since such an approach is the default method used by ArcGIS software, these results are concerning. Among the existing methods assessed in our simulation study, only the spatial scan statistic maintained its nominal type I error rate.

To address the relative lack of methods to assess areal data for clustering, we developed RKAD, an extension of Ripley's K function which is capable of assessing areal data for the presence of clustering or dispersion at specific geographic scales. The RKAD method quantifies the average amount of observed area within a specified distance of each areal unit centroid. RKAD can be used to perform a hypothesis test for the presence of spatial clustering or dispersion by comparing the observed RKAD test statistic to the distribution of the RKAD test statistic in the absence of spatial dependence. Simulation studies demonstrated that RKAD hypothesis testing procedure maintains its nominal type I error rate and has high power to detect a variety of spatial patterns, including small and large scale clustering and dispersion. RKAD generally displayed higher empirical power than the spatial scan statistic, especially when data were dispersed.

To facilitate the use of our method, R code which implements the RKAD method and performs the necessary Monte Carlo simulations has been made available online at https://github.com/scwatson812/RKAD. When the number of observed areal units is large, the Monte Carlo simulations may be run in parallel to reduce computation time. The development of faster methods for approximating the null distribution is an excellent area for future work. 

The Web Appendix contains contains Supplementary Figure 1 Figure 3: Illustration of spatial dependence structures C 1 − C 6 for study areas A 1 (left), and A 2 (right). Blue units are q times more likely to be selected than white units under spatial dependence configurations C 1 − C 6 . Figure 4 : Examples of observed units generated under each scenario for study area A 1 . The first row displays examples of data generated under the null hypothesis of equal probability sampling without replacement (left to right: I 1 , I 2 , I 3 ). The second row displays examples of data generated with a single cluster (left to right C 1 , C 2 , C 3 , C 4 , C 5 , C 6 ). The third row displays examples of data generated with multiple clusters (left to right C 7 , C 8 , C 9 , C 10 , C 11 , C 12 ). The fourth row displays examples of data generated with dispersion (left to right D 1 , D 2 , D 3 , D 4 ). Table 2 : Simulation study results for study area A 2 (the US counties). Results displayed include the empirical rejection rate of the average nearest neighbor method (ANN, column 2), the spatial scan statistic method (SST, column 3), the Ripley's K method at each of the 10 radii (RK, columns 5-14) and the Ripley's K for areal data method at each fo the 10 radii (RKAD, columns 5-14). For data generation mechanisms (DGMs) I 1 , I 2 , and I 3 , the reported rejection rates correspond to the type I error of an α ≈ 0.05 two-tailed test. For DGMs C 1 −C 12 , the rejection rates correspond to the power of an α ≈ 0.05 single-tailed test indicative of clustering. For DGMs D 1 − D 4 , the rejection rates correspond to the power of an α ≈ 0.05 single-tailed test indicative of dispersion. The radii at which the RKAD method was evaluated are shown in red. Note that areas with many small parcels appear black. The bottom pane displays the 3,108 counties in the contiguous US. Counties with a high rate of childhood overweight/obesity are shown in blue. The radii at which the RKAD method was evaluated are shown in red.

Extending ripley's k-function to quantify aggregation in 2-d grayscale images

Spatial analysis of two-species interaction

Spatial Point Patterns: Methodology and Applications with R

Title: Ecological relevance of least cost path analysis: An easy implementation method for landscape urban planning

Interacting social and environmental predictors for the spatial distribution of conservation lands

A spatial scan statistic for survival data based on generalized life distribution

Trends and maps

Grouping in spatial distributions

Distance to nearest neighbor as a measure of spatial relationships in populations

On some aspects of spatial pattern in biological populations

Generalization of a nearest neighbor measure of dispersion for use in k dimensions

Spatial autocorrelation of neogene-quaternary lava along the snake river plain, idaho, usa

Spatial scan statistics based on empirical likelihood

The island dilemma: lessons of modern biogeographic studies for the design of natural reserves

Statistical analysis of spatial point patterns

Ripley's K Function

Old concepts, new challenges: adapting landscape-scale conservation to the twenty-first century

Average nearest neighbor

Multi-distance spatial cluster analysis (ripley's k function) (spatial statistics)

The spatial distribution of gender differences in obesity prevalence differs from overall obesity prevalence among us adults

Second-order neighborhood analysis of mapped point patterns

On explicit formulas of edge effect correction for ripley's k-function

Quantifying the contribution of conservation easements to large-landscape conservation

Spatial pattern analysis in ecology based on ripley's k-function: Introduction and methods of edge correction

Habitat fragmentation and its lasting impact on earth's ecosystems

Climate change, connectivity and conservation decision making: back to basics

A spatial scan statistic for survival data

Weighted normal spatial scan statistic for heterogeneous population data

A spatial scan statistic for ordinal data

Spatial epidemiologic trends and hotspots of leishmaniasis, sri lanka

Housing density as an indicator of spatial patterns of reported human-wildlife interactions in northern new york

A spatial scan statistic

An elliptic spatial scan statistic

Spatial dependence and determinants of conservation easement adoptions in the united states

Assessing the appropriateness of the spatial distribution of standard lots using the l-index

Rockfall detection from terrestrial lidar point clouds: A clustering approach using r

Detection of breast masses in mammogram images using growing neural gas algorithm and ripley's k function

In defense of conservation easements: A response to the end of perpetuity

2.15 or not 2.15? an historical-analytical inquiry into the nearest-neighbor statistic

The identification and use efficiency evaluation of urban industrial land based on multi-source data

Spatial Statistics

Statistical Inference for Spatial Processes

The second-order analysis of stationary point processes

Modelling spatial patterns

A gis-investigation of four early anglo-saxon cemeteries: Ripley's k-function analysis of spatial groupings amongst graves

Multivariate normal spatial scan statistic for detecting the most severe cluster of a disease

Benefits of small area measurements: a spatial clustering analysis on medicare beneficiaries in the usa

Spatiotemporal characteristics of pandemic influenza

An alternative cluster detection test in spatial scan statistics

Testing for life historical changes in spatial patterns of four tropical tree species

Spatial patterns of trees in natural forests of east-central europe

A flexibly shaped spatial scan statistic for detecting clusters

On the usage and measurement of landscape connectivity

Point Pattern and Quantitative Data

Spatial analysis of global prevalence of multiple sclerosis suggests need for an updated prevalence scale. Multiple Sclerosis International

A statistical analysis of spatial clustering along cell filaments using ripley's k function