key: cord-0195597-oz8kjmff authors: Vidanapathirana, Nadeesha; Wang, Yuan; McLain, Alexander C.; Self, Stella title: Cluster Detection Capabilities of the Average Nearest Neighbor Ratio and Ripley's K Function on Areal Data: an Empirical Assessment date: 2022-04-22 journal: nan DOI: nan sha: 402d757b63db1dca6ba7233865a366e8ffec8473 doc_id: 195597 cord_uid: oz8kjmff

Spatial clustering detection methods are widely used in many fields of research including epidemiology, ecology, biology, physics, and sociology. In these fields, areal data is often of interest; such data may result from spatial aggregation (e.g. the number of disease cases in a county) or may be inherent attributes of the areal unit as a whole (e.g. the habitat suitability of a conserved land parcel). This study aims to assess the performance of two spatial clustering detection methods on areal data: the average nearest neighbor (ANN) ratio and Ripley's K function. These methods are designed for point process data, but their ease of implementation in GIS software and the lack of analogous methods for areal data have contributed to their use for areal data. Despite the popularity of applying these methods to areal data, little research has explored their properties in the areal data context. In this paper we conduct a simulation study to evaluate the performance of each method for areal data under different types of spatial dependence and different areal structures. The results show that the empirical type I error rates are inflated for the ANN ratio and Ripley's K function, rendering the methods unreliable for areal data.

Researchers have been using spatial clustering analysis for many years to analyze data for spatial patterns. One of the earliest examples of spatial clustering analysis in public health occurred in 1854 when Dr. John Snow mapped the location of cholera cases to identify the source of an outbreak in London (Moore & Carpenter, 1999).
Over the past few decades, the popularization of geographical information systems (GIS) software has fueled the development and use of many new methods of spatial analysis. A Google Scholar search for 'spatial clustering' returns 15,600 results dated between 2001 and 2010, with an increase to 28,000 results between 2011 and 2020. Both point process (latitude-longitude) and areal data can exhibit spatial patterns, and a variety of methods have been developed to analyze spatial patterns in both types of data. One of the oldest spatial clustering techniques is the average nearest neighbor (ANN) ratio, which was developed as a statistical test for spatial clustering (Clark & Evans, 1954). This method computes the distance between each observation location and its nearest neighbor, and the average 'nearest neighbor distance' (where the average is taken over all locations of observed data) is used to compute a test statistic. The ANN ratio has been widely used to detect clustering in various types of point process data, including clustering of disease cases (Aziz et al., 2012; Khademi et al., 2016; Melyantono, Susetya, Widayani, Tenaya, & Hartawan, 2021), crime hotspots (Brookman-Amissah, Wemegah, & Okyere, 2014; Wing & Tynon, 2006; Z. Zhang et al., 2020), and clustering of archeological artifacts (Kıroğlu, 2003; Whallon, 1974). The ANN ratio has also been used to detect clustering in areal data (after mapping to centroids) in disease case data (Mollalo, Alimohammadi, Shirzadi, & Malek, 2015) aggregated to the municipality level. Ripley's K function was first developed in 1976 and can describe spatial patterns at different scales simultaneously. For a given distance $t$, Ripley's K function returns the expected number of observation locations within a distance of $t$ from a randomly selected observation location, scaled by the inverse of the density of observation locations.
Given a study area, a researcher can use Ripley's K function to determine if observation locations are clustered, dispersed, or randomly distributed throughout the study area. Ripley's K function has been applied to point process data for a variety of purposes including detecting disease clusters (Hohl, Delmelle, Tang, & Casas, 2016; Lentz, Blackburn, & Curtis, 2011; Ramis et al., 2015), analyzing the spatial distribution patterns of plant communities (Haase, 1995; Moeur, 1993; Wolf, 2005), and detecting crime hotspots (Lu & Chen, 2007; Vadlamani & Hashemi, 2020). It has also been applied to areal data (after mapping the areal units to their centroids) in a variety of applications including disease case data aggregated at the municipality level (Karunaweera et al., 2020; Mollalo et al., 2015; Skog, Linde, Palmgren, Hauska, & Elgh, 2014), locations of conserved land (Zipp, Lewis, & Provencher, 2017), locations of land parcels in industrial use (Qiao, Huang, & Tian, 2019), human-wildlife interactions (Kretser, Sullivan, & Knuth, 2008), and rockfall events (Tonini & Abellan, 2014). The ANN ratio and Ripley's K function were developed for point process data. However, Environmental Systems Research Institute (ESRI) ArcGIS software allows the user to implement both these methods on areal data using the average nearest neighbor tool (ANN ratio) and the multi-distance spatial cluster analysis tool (Ripley's K function); both tools are in the spatial statistics toolbox. When the user applies either method to areal data (i.e. polygon features), ArcGIS automatically maps each polygon feature to its centroid and applies the method to the resulting set of points (Esri, 2021a; Esri, 2021b). To our knowledge, the performance of the ANN ratio and Ripley's K function under such circumstances has never been evaluated. Both methods assume under the null hypothesis that the observation locations arise from a homogeneous Poisson process.
However, for areal data the centroids of smaller units will inherently be closer to the centroids of their neighbors when compared to the centroids of larger units, and thus unless all the units are the same size it is impossible for their centroids to arise from a homogeneous Poisson process. As the homogeneous Poisson process assumption is violated for areal data, it is advisable to assess the performance (i.e. empirical type I error rate and empirical power) of the ANN ratio and Ripley's K function when applied to areal data. In this paper, we conduct a simulation study to evaluate the performance of each method for areal data under different types of spatial dependence and three different areal structures. Section 2 presents the detailed description of the two spatial clustering methods. Section 3 describes the simulation study and its results. Section 4 provides some theoretical results and Section 5 contains concluding remarks.

The ANN ratio and Ripley's K function can be used to perform a hypothesis test for the presence of spatial clustering or dispersion. The null hypothesis for each of these tests is that the observed locations exhibit complete spatial randomness (CSR), that is, the observation locations arise from a two-dimensional homogeneous Poisson process. Formally, a stochastic process is said to be a homogeneous Poisson process with rate $\lambda$ if the number of events in any bounded region $A$, denoted $N(A)$, is Poisson distributed with mean $\lambda|A|$, that is, $\Pr(N(A) = n) = e^{-\lambda|A|}(\lambda|A|)^{n}/n!$, where $|A|$ denotes the area of $A$. Given that there are $n$ events in $A$, those events form an independent random sample from a uniform distribution on $A$ (Cressie, 1994).

The average nearest neighbor (ANN) ratio was initially developed to classify spatial patterns in plant populations (Clark & Evans, 1954). In this context, the observed data consists of locations of observed plants, measured as coordinates in two-dimensional space.
This method quantifies the randomness (or lack thereof) among the observed point locations by measuring the distance from each point to its nearest neighbor and using these distances to compute the average nearest neighbor (ANN) ratio given by

$$R = \frac{\bar{d}_O}{\bar{d}_E},$$

where $\bar{d}_O = \frac{1}{n}\sum_{i=1}^{n} d_i$, $d_i$ denotes the distance from the $i$th individual to its nearest neighbor, $n$ denotes the total number of observations, $\bar{d}_E = \frac{1}{2\sqrt{\lambda}}$ is the expected value of $\bar{d}_O$ under CSR for an infinite study area, $\lambda = n/|A|$ is the density of the observed distribution, and $|A|$ is the size of the study area. Under CSR, $E(R) = 1$, and for a perfectly clustered distribution (i.e., all points fall at the same location), $E(R) = 0$. Values of $R$ greater than 1 indicate that the points are distributed more uniformly than expected under CSR. To use the ANN ratio for hypothesis testing, it is assumed that the distribution of $\bar{d}_O$ under the null hypothesis of CSR is approximately normal. A z-score for the statistic is calculated by

$$z = \frac{\bar{d}_O - \bar{d}_E}{\sigma_{\bar{d}}},$$

where $\sigma_{\bar{d}}$ is the standard error of the mean distance to the nearest neighbor under CSR. It can be shown that $\sigma_{\bar{d}} = 0.26136/\sqrt{n\lambda}$. A significantly negative z-score indicates clustering, and a significantly positive score indicates dispersion. See Figure 1 for examples of clustering, dispersion, and complete spatial randomness. Clark and Evans (1954) note some limitations of their procedure, such as the sensitivity to the chosen study area and the inability to distinguish between certain types of spatial dependence (e.g., tightly clustered points in one place vs pairs of points scattered throughout the population). In this situation, they suggest an extension to this measure by constructing a circle for each observation with an infinite radius, dividing the circle into equal sectors, and measuring the distance from the individual to its nearest neighbor for each of the sectors.
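As an illustration of the ANN ratio and its z-score described above, the following is a minimal sketch in plain Python. The function name `ann_ratio` and the brute-force nearest-neighbor search are illustrative choices, not taken from the paper or from any GIS package; the study area is passed in directly as a number, which mirrors the sensitivity of the statistic to the chosen window.

```python
import math

def ann_ratio(points, area):
    """Return (R, z) for the Clark-Evans average nearest neighbor test.

    points: list of (x, y) tuples; area: size of the study window.
    Hypothetical helper for illustration; brute force, O(n^2).
    """
    n = len(points)
    nearest = []
    for i, (xi, yi) in enumerate(points):
        # distance from point i to its nearest neighbor (excluding itself)
        nearest.append(min(math.hypot(xi - xj, yi - yj)
                           for j, (xj, yj) in enumerate(points) if j != i))
    d_obs = sum(nearest) / n                # observed mean nearest-neighbor distance
    lam = n / area                          # density of observations
    d_exp = 1.0 / (2.0 * math.sqrt(lam))    # expected mean distance under CSR
    se = 0.26136 / math.sqrt(n * lam)       # standard error under CSR
    return d_obs / d_exp, (d_obs - d_exp) / se

# Centroids of a fully observed 4 x 4 regular grid with unit spacing.
grid = [(i + 0.5, j + 0.5) for i in range(4) for j in range(4)]
R, z = ann_ratio(grid, area=16.0)
print(R, z)  # R is 2.0 here; the large positive z points to dispersion
```

In this toy example every centroid is exactly one unit from its nearest neighbor, so the ratio is twice the CSR expectation even though no spatial process generated the locations.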
Clark and Evans (1954) also point out that problems may arise when the data consist of large areal units rather than points and the centroid of each unit is used to calculate the ANN ratio.

The ANN ratio discussed above is based on first-order statistics, i.e., the mean of the distances between observation locations. One of the limitations of the method is the inability to test point patterns at different scales simultaneously (Ripley, 1977). For example, it is possible for data to be clustered at a small scale and clustered at a larger scale (i.e., the clusters are clustered) or clustered at a small scale but dispersed at a large scale (i.e., the clusters occur at somewhat regular intervals). Ripley's $K$ function is a second-order spatial analysis tool (i.e., uses variances of the distances between observations) that can address the issue of scale-dependent spatial patterns. Here we only consider Ripley's $K$ function for univariate spatial patterns in two dimensions, but it can also be extended for multivariate spatial patterns (e.g., comparing spatial patterns of two species) (Dixon, 2001). Let $N_t(s)$ denote the number of points within radius $t$ of point $s$. The $K$ function is given by

$$K(t) = \lambda^{-1} E[N_t(s)],$$

where $\lambda$ is the density (number of points per unit area) and can be estimated as $\hat{\lambda} = n/|A|$, where $n$ is the observed number of points and $|A|$ is the size of the study area. The expectation is taken over the point locations. If the points follow a homogeneous Poisson process (i.e., exhibit CSR), then $K(t) = \pi t^2$, which is the area of a circle of radius $t$. An approximately unbiased estimator for $K(t)$ was proposed by Ripley (Dixon, 2001; Ripley, 1976):

$$\hat{K}(t) = \hat{\lambda}^{-1} \sum_{i=1}^{n} \sum_{j \neq i} w_{ij}^{-1} \, I(d_{ij} < t) / n,$$

where $d_{ij}$ is the distance between the $i$th and $j$th points, $I(d_{ij} < t)$ is the indicator function with the value of 1 if $d_{ij} < t$ and 0 otherwise, and $w_{ij}^{-1}$ is a weighting factor associated with locations $i$ and $j$ that corrects for edge effects. Correction for edge effects is required if any distance $d_{ij}$ is greater than the distance between point $i$ and the boundary.
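The estimator of Ripley's K function described above can be sketched as follows. This is a simplified illustration, assuming every edge weight equals 1 (that is, no edge correction), which is not the choice made in the paper; the function name `ripley_k` is hypothetical.

```python
import math

def ripley_k(points, area, t):
    """Naive estimate of Ripley's K at distance t.

    Assumption for brevity: all edge-correction weights are 1,
    so the estimate is biased downward near the window boundary.
    """
    n = len(points)
    lam_hat = n / area  # estimated density of points
    count = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            # ordered pairs (i, j), i != j, closer than t
            if i != j and math.hypot(xi - xj, yi - yj) < t:
                count += 1
    return count / (lam_hat * n)

# Centroids of a fully observed 10 x 10 regular grid with unit spacing.
grid = [(i + 0.5, j + 0.5) for i in range(10) for j in range(10)]
t = 1.5
print(ripley_k(grid, area=100.0, t=t))  # 684 ordered pairs / 100 = 6.84
print(math.pi * t ** 2)                 # value of K(t) under CSR
```

Because centroid-to-centroid distances on a grid take only a few discrete values, the estimate changes in jumps as the radius grows, whereas the CSR reference value grows smoothly.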
Because the points outside the boundary are not included in the calculation of $\hat{K}(t)$, edge effects can lead to a biased estimator of $K(t)$. Various authors have proposed different edge corrections, such as buffer zones (Sterner, Ribic, & Schatz, 1986; Szwagrzyk, 1990) or toroidal edge corrections (Ripley, 1979; Upton & Fingleton, 1985). One of the most commonly used edge corrections assigns $w_{ij}$ a value of 1 if the circle centered at point $i$ which passes through point $j$ is entirely inside the study area, and otherwise sets $w_{ij}$ equal to the proportion of the circumference of the circle that falls in the study area (Dixon, 2001). To test for CSR, the estimator $\hat{L}(t) = [\hat{K}(t)/\pi]^{1/2}$ is sometimes used in practice, and $E(\hat{L}(t)) = t$ under CSR (Ripley, 1979). If the observed value of $K(t)$ is larger than the expected value of $K(t)$ for a given distance, the distribution is more clustered than CSR at that distance. If the observed value of $K(t)$ is smaller than the expected value of $K(t)$, the distribution is more dispersed than the random distribution at that distance. See Figure 1.

Data is generated by selecting the $n$ observed units from among the $N$ total units present in the areal structure. We consider two sample sizes for each areal structure: $n = \lfloor N/10 \rfloor$ and $n = \lfloor N/4 \rfloor$, and generate data under the null hypothesis of no spatial pattern (data generation mechanism (DGM) $D_1$).

The ANN ratio is calculated by using the centroids of the observed units as the observation locations. As the ANN ratio is known to be sensitive to the choice of the study area (often referred to as the window), we consider two different methods for selecting the window: window 1 is the entire study area and window 2 is the smallest possible rectangle that encloses all the observed locations (note that window 2 is dependent on the observed data, while window 1 is not).

We assess the performance of the ANN ratio and Ripley's K function via empirical type I error rate and empirical power.
For simulations under the null hypothesis ($D_1$), we report the global empirical type I error rate for the ANN ratio and the empirical type I error rate at each radius for Ripley's K function. For simulations under alternative hypotheses ($D_2$ and $D_3$), we report the global empirical power for the ANN ratio and the empirical power at each radius for Ripley's K function. Table 1 summarizes the empirical type I error rate (i.e., empirical probability of rejecting the null hypothesis of CSR when units are generated under CSR) and empirical power (i.e., the empirical probability of rejecting the null hypothesis of CSR when units are clustered) of the ANN ratio. When using window 1, the empirical type I error rate for all three areal structures is greater than 0.05 and is highest (0.99) for the Canadian FSAs, where the units are of vastly different sizes. When the sample size increases, the empirical type I error of each areal structure also increases. Under the alternative hypothesis, the empirical power of detecting a single cluster is 1.00 for the FSAs, which is greater than for both the regular grid and the counties. When the sample size increases, the empirical power of detecting a single cluster decreases to 0 for the regular grid and the US counties. When using window 2, the empirical type I error rate is severely inflated for all scenarios except the US counties under the smaller sample size. Furthermore, while there are appreciable differences in the empirical type I error rate when using the different windows, there is no clear pattern of increase or decrease. The empirical power for window 2 is lower for all scenarios except the US counties under the larger sample size. Table 2 summarizes the empirical type I error and the empirical power of Ripley's K function. The empirical type I error rate is far above its nominal level in almost all cases. For the US counties, the empirical power is high (close to 1) for all radii except the smallest radius R1.
For the FSAs, the empirical power is 1.0 for all radii except for R1. In general, the empirical power for detecting clusters is high in all three areal structures under all radii except for the minimum radius R1 under the US counties and CA FSAs.

It is possible to obtain an analytical expression for $K(t)$ when areal data is generated on an infinite regular grid with each unit being observed with probability $p$; we denote this quantity $K_{\mathrm{grid}}(t)$. For simplicity and without loss of generality, we assume a distance of one unit between centroids of adjacent grid cells. We begin by counting the number of centroids within a distance of $t$ of a fixed centroid, denoted $N(t)$. This problem is equivalent to the well-studied Gauss Circle Problem of counting the number of points on an integer lattice within a distance of $t$ of the origin. It can be shown that $N(t)$ is approximately equal to $\pi t^2$, where the error term $E(t) = N(t) - \pi t^2$ is such that $|E(t)| \le C t^{\theta}$ for some constant $C$, where $1/2 < \theta \le 131/208$ (Hardy, 1915, 1999). Recalling that the units are observed with probability $p$, we expect $pN(t)$ observed units within a distance of $t$. Note that our assumption that the distance between centroids is equal to one unit implies a density of $\lambda = p$, and thus $K_{\mathrm{grid}}(t) = N(t)$. Under the null distribution of CSR, we have the usual $K_{\mathrm{CSR}}(t) = \pi t^2$. Therefore, there are infinitely many values of $t$ such that $|K_{\mathrm{grid}}(t) - K_{\mathrm{CSR}}(t)| = |N(t) - \pi t^2| = |E(t)| > t^{1/2}$. Therefore even as $t$ becomes large, the distribution of $K_{\mathrm{grid}}(t)$ is not well approximated by the distribution of $K_{\mathrm{CSR}}(t)$.

This study aimed to evaluate the performance of the ANN ratio and Ripley's K function on areal data using an extensive simulation study. Three areal structures, three types of spatial patterns and two different sample sizes were considered. As previously mentioned, the ANN ratio and Ripley's K function are intended for point process data but in practice these methods are often used for areal data by mapping each areal unit to its centroid.
Our results show that the empirical type I error rates of the ANN ratio and Ripley's K function are inflated for the simulated data regardless of the sample size. We also see that the ANN ratio gives different results depending on which window is used for the ANN ratio calculation. The highly inflated empirical type I error rate makes both these methods unreliable for detecting spatial clustering in areal data. In the case of Ripley's K function applied to a regular grid, the inflated empirical type I error rate observed in our simulation study is confirmed by the theoretical divergence of the grid-based $K(t)$ from its value under CSR, $\pi t^2$. These findings have important implications for the use of these methods on areal data. An inflated type I error rate implies a high false discovery rate, meaning that researchers who have applied these methods to areal data may have wrongly concluded that their data exhibited a spatial pattern. In epidemiology, this may manifest as the discovery of non-existent disease clusters or outbreaks. In ecology, the misapplication of these methods may lead researchers to conclude that parcels of conserved land are clustering together to create larger, higher quality habitats, when in fact they are not. In civil engineering and urban planning, the use of these methods could cause researchers to conclude that parcels of land designated for certain uses (parking, recreation, open space, etc.) are well dispersed throughout an urban area when such is not the case. As ESRI ArcGIS software automatically applies these methods to areal data by mapping units to their centroids, these results are particularly concerning for ESRI users. The development of spatial clustering detection methods which are better suited for areal data is an excellent area for future work. In the absence of such methods, clustering analysis should be applied only to point process data unless the reliability of these methods for a particular areal data application can be demonstrated via simulation studies.
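The lattice-point count behind the regular-grid argument can be checked numerically. The sketch below counts integer lattice points within distance $t$ of the origin and compares the count with $\pi t^2$; the helper name `lattice_count` is hypothetical, and the count is exact for integer radii (for non-integer radii it is subject to floating-point rounding of $t^2$).

```python
import math

def lattice_count(t):
    """Number of integer lattice points (x, y) with x^2 + y^2 <= t^2."""
    t2 = t * t
    m = math.floor(t)
    total = 0
    for x in range(-m, m + 1):
        # largest |y| such that x^2 + y^2 <= t^2
        y_max = math.isqrt(math.floor(t2 - x * x))
        total += 2 * y_max + 1
    return total

# Compare the count on the unit grid with the CSR value pi * t^2.
for t in (1, 2, 5, 10, 50, 100):
    n_t = lattice_count(t)
    err = n_t - math.pi * t ** 2
    print(t, n_t, round(err, 3))
```

The printed differences do not shrink toward zero as the radius grows, which is in line with the error term discussed in the theoretical section.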
0.06 0.30 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Single cluster: D2 0.08 0.38 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Multiple clusters: D3 0.25 0.49 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

Spatial pattern of 2009 dengue distribution in Kuala Lumpur using GIS application
Spatstat: an R package for analyzing spatial point patterns
Package 'spatstat'
Package 'splancs'. R package version
Crime mapping and analysis in the Dansoman police subdivision
Distance to nearest neighbor as a measure of spatial relationships in populations
Models for spatial processes
Ripley's K function
Spatial pattern analysis in ecology based on Ripley's K-function: Introduction and methods of edge correction
On the expression of a number as the sum of two squares
Ramanujan: twelve lectures on subjects suggested by his life and work
Accelerating the discovery of space-time patterns of infectious diseases using parallel computing
Spatial epidemiologic trends and hotspots of Leishmaniasis
Identifying HIV distribution pattern based on clustering test using GIS software
A GIS based spatial data analysis in Knidian amphora workshops in Reşadiye
Housing density as an indicator of spatial patterns of reported human-wildlife interactions in Northern New York
Evaluating patterns of a white-band disease (WBD) outbreak in Acropora palmata using spatial analysis: a comparison of transect and colony clustering
On the false alarm of planar K-function when analyzing urban crime distributed along streets
The rabies distribution pattern on dogs using average nearest neighbor analysis approach in the Karangasem District
Characterizing spatial patterns of trees using stem-mapped data. Forest Science
Geographic information system-based analysis of the spatial and spatio-temporal distribution of zoonotic cutaneous leishmaniasis in Golestan Province, north-east of Iran
Spatial analytical methods and geographic information systems: use in health research and epidemiology
The identification and use efficiency evaluation of urban industrial land based on multi-source data
Spatial analysis of childhood cancer: a case/control study
The second-order analysis of stationary point processes
Modelling spatial patterns
Tests of 'randomness' for spatial point patterns
Spatiotemporal characteristics of pandemic influenza
Testing for life historical changes in spatial patterns of four tropical tree species
Natural regeneration of forest related to the spatial structure of trees: a study of two forest communities in Western Carpathians
Massively parallel spatial point pattern analysis: Ripley's K function accelerated using graphics processing units
Rockfall detection from terrestrial LiDAR point clouds: A clustering approach using R
Spatial data analysis by example
Studying the impact of streetlights on street crime rate using geostatistics
Optimizing and accelerating space-time Ripley's K function based on Apache Spark for distributed spatiotemporal point pattern analysis
Spatial analysis of occupation floors II: the application of nearest neighbor analysis
Crime mapping and spatial analysis in national forests
Fifty year record of change in tree spatial patterns within a mixed deciduous forest. Forest Ecology and Management
Enabling point pattern analysis on spatial big data using cloud computing: optimizing and accelerating Ripley's K function
Spatiotemporal patterns and driving factors on crime changing during black lives matter protests
Does the conservation of land reduce development? An econometric-based landscape simulation with land market feedbacks

SS and AM were partially supported by National Institutes of Health grant NIGMS P20GM130420.