key: cord-0302168-z0p8zwsk authors: Xie, Yiqun; Shekhar, Shashi; Li, Yan title: Statistically-Robust Clustering Techniques for Mapping Spatial Hotspots: A Survey date: 2021-03-22 journal: nan DOI: 10.1145/3487893 sha: c9c3f4400d091aba2a29c8b32d66291e3eb16779 doc_id: 302168 cord_uid: z0p8zwsk Mapping of spatial hotspots, i.e., regions with significantly higher rates of generating cases of certain events (e.g., disease or crime cases), is an important task in diverse societal domains, including public health, public safety, transportation, agriculture, environmental science, etc. Clustering techniques required by these domains differ from traditional clustering methods due to the high economic and social costs of spurious results (e.g., false alarms of crime clusters). As a result, statistical rigor is needed explicitly to control the rate of spurious detections. To address this challenge, techniques for statistically-robust clustering (e.g., scan statistics) have been extensively studied by the data mining and statistics communities. In this survey we present an up-to-date and detailed review of the models and algorithms developed by this field. We first present a general taxonomy for statistically-robust clustering, covering key steps of data and statistical modeling, region enumeration and maximization, and significance testing. We further discuss different paradigms and methods within each of the key steps. Finally, we highlight research gaps and potential future directions, which may serve as a stepping stone in generating new ideas and thoughts in this growing field and beyond. Identification of spatial hotspots, i.e., geographic regions with significantly higher rates of generating cases for certain types of events (e.g., disease, crime, pollution), has become an important task and standard research methodology in a diverse range of applications (e.g., public health, public safety, transportation, agriculture). In public health, for example, epidemiologists use spatial hotspots to find outbreaks of diseases and make timely interventions to reduce health risks to the public [78] . Similarly, in public safety, hotspots reveal areas with abnormally high rates of crimes (e.g., theft, assault), which can help narrow the search space for serial criminals (e.g., arsonists) or other underlying factors [131] . Broad applications of hotspot mapping will be detailed in Sec. 2. While spatial hotspots are naturally clusters of events, mapping of hotspots poses many unique challenges. Here we summarize these challenges in four different aspects. First, spurious results often have very high social and economic costs in the application domains of hotspot mapping [16, 183] . For example, if a city falsely claims a neighborhood as a crime cluster, there can be many unnecessary negative impacts such as reducing property values in that region, hurting local small businesses, increasing mental pressure on residents and potentially pushing them to move. Unfortunately, spurious clusters are very common in real-world datasets and can easily happen as a consequence of natural randomness. Thus, decision makers in important application domains require statistical rigor as a necessary component to robustly control of the rate of spurious patterns. Second, mapping of spatial hotspots needs explicit consideration of underlying risk or known contributing factors, which are typically not considered in traditional clustering approaches [103, 134] . 
For example, having a hundred crime cases in a densely-populated urban area is different from having the same in a rural area of the same size. Third, hotspots are geographically contiguous regions and should not be fragmented pieces in space [192, 195]. As a result, spatial attributes cannot be simply mixed with other non-spatial attributes during the clustering process and need specific modeling. Finally, while a hotspot is often a geographically contiguous region, the observed events inside it may not necessarily form a continuous or smooth distribution of density, especially when the total number of observations is small (i.e., higher variance across locations). This undermines basic assumptions on contiguous density used in many traditional clustering approaches [29, 66].

Clustering has long been a core topic in data mining and machine learning, providing a vast set of techniques in various families. Partitioning-based clustering methods (e.g., k-means and Gaussian mixture models, CLARANS [154]) split input data into groups to minimize dissimilarities within clusters. Density-based approaches (e.g., DBSCAN [66], OPTICS [14], HDBSCAN [29]) apply local or global density criteria to separate out clusters from noise. These approaches do not rely on a pre-defined number of clusters and can be made flexible for different shapes and varying densities. Many techniques also utilize the hierarchical structure of clusters, e.g., by first splitting data into small units and then sequentially merging them into final clusters based on similarity and adjacency, or in the opposite direction (e.g., Chameleon [100], BIRCH [225], CURE [80], HDBSCAN [29]). These families of approaches are not necessarily mutually exclusive (e.g., HDBSCAN leverages both density and hierarchy). Moving further, there have also been many other types such as grid-based (e.g., STING [206], CLIQUE [9]), graph-spectrum-based (e.g., spectral clustering [203]), kernel-based (e.g., kernel k-means [178], SVC [20]), etc. Given that these techniques have been well summarized in other surveys on general clustering (e.g., [218, 219]), here we skip the details and concentrate on the topic of statistically-robust clustering for hotspot mapping.

Traditional clustering techniques are not sufficient for responding to the unique challenges posed by hotspot mapping. These techniques, for example, often do not consider the high costs of spurious results and are prone to return many clusters formed by natural randomness (e.g., k-means always returns k partitions, and DBSCAN and its variations have difficulty in avoiding spurious high-density regions in random point distributions [29, 134, 212, 214, 215]). In addition, these methods normally do not consider underlying risk factors (e.g., population at risk as control), geographic contiguity of outputs, or non-contiguous within-cluster density. To address these challenges, in the last decades there has been a blossoming of research focusing on statistically-robust clustering techniques with explicit consideration and modeling of these various needs of hotspot mapping. As many of the methods are based on formulations from scan statistics, which were first developed by the statistics community, statistically-robust clustering is also known as scan statistics. Since both "clustering" and "scan statistics" have been widely used in the literature, we use "clustering" in the rest of the survey to better connect to the intended audience from data mining.
First, statistical significance of cluster candidates was modeled and incorporated into the detection process to provide robust control of the rate of spurious results. New algorithms were also developed to reduce the computational cost of significance testing, which often requires Monte-Carlo-based estimation due to the complexity of test statistics. Second, underlying risk factors were explicitly modeled into the test statistics used for candidate evaluation, allowing the scores to reflect differences in statistical processes rather than directly observed densities of events. Third, statistically-robust clustering techniques can also guarantee that the output clusters are contiguous in the geographic space by modeling the problem as a region-maximization problem. In this paradigm, the enumeration space of region candidates is clearly defined (e.g., circles, rings or linear paths formed by subsets of data) and a wide variety of efficient region-maximization algorithms have been developed. This enumeration scheme also addresses the fourth challenge that within-cluster density may not be smooth or continuous due to the effect of natural randomness. By directly enumerating regions instead of forming them by local or hierarchical density criteria, regions can be evaluated regardless of their inner-density distribution.

There are several related surveys on traditional and statistically-robust clustering. For traditional clustering, multiple taxonomies are provided in [22, 125, 161, 218, 219] to summarize a broad spectrum of techniques, including partition-based, density-based, hierarchy-based, graph-spectrum-based, kernel-based, spatial and non-spatial, etc. These surveys do not cover the models and techniques developed for statistically-robust clustering. A few surveys also exist on statistically-robust clustering from different perspectives. Kulldorff (1999) [104] provides a summary of early methods and applications of spatial scan statistics, as well as basic enumeration algorithms; an overview of earlier domain applications was also discussed later in [46]. [148] describes a few extensions of the spatial scan statistic, including efficient computational strategies for rectangular-shaped clusters and an expectation-based approach to improve space-time cluster detection. Later, two sequential handbooks on scan statistics were developed mainly by the statistics community [73, 75], covering both new and existing theoretical results on distribution estimation (e.g., closed-form solutions, approximations), extreme values, sequences, etc. The handbooks also included a few chapters that are related to data mining, including summaries on Bayesian scan statistics [140] and irregular-shaped cluster detection [59]. The topic of statistically-robust clustering (a.k.a. hotspot detection) has also been briefly discussed (e.g., as a paragraph or section) in broad spatio-temporal data mining and data science surveys [16, 183, 210]. It is worth mentioning that a hotspot is also related to another common pattern - spatial anomaly/outlier - in the literature [10, 16, 35, 183, 210]. One key difference between the two patterns is that an anomaly or outlier is often used to describe rare observations, e.g., as defined in existing surveys such as "interesting but rare phenomena" [16], "rare occurrences" [10], "rare patterns" [183], etc. In contrast, significant clusters or hotspots often represent a substantial/large portion or sometimes the majority of observations.
Given the richness of application contexts and the large variety of techniques developed along different dimensions of statistically-robust clustering, there is a need for developing a taxonomy from the computer science or data mining perspective to decompose the complex modeling and detection process into a set of key steps, highlight their roles and mutual relationships, and discuss the advances in each key step by extracting corresponding contributions from the vast (and often versatile) literature. This can promote cross-pollination and synergistic integration of ideas across various research communities, especially considering that the developments of general clustering and statistically-robust clustering have largely run on siloed tracks. This can also help engage the broader data mining community in addressing the challenges and fostering future directions in statistically-robust clustering. This survey aims to fulfill this need as follows. First, we develop a set of taxonomies to highlight the key components of statistically-robust clustering and its sub-areas (e.g., data models, spatial point processes, test statistics, enumeration space, computational algorithms, test paradigms). As research papers in the field often touch on multiple aspects of the overall taxonomy, we further decompose a broad set of the papers into contributions to each component of the taxonomy, which are otherwise difficult to distill and hinder cross-discipline dissemination of the research advances. This process also helps extract ideas that are applicable, and potentially valuable, to general studies in the field beyond their original scopes. Finally, based on the identified research landscape, we expose several open questions and future research directions to harness emerging sources of spatial big data and promote synergistic integration of ideas across multiple domains.

Scope and outline: This review article surveys the research progress from a data mining perspective for a computer science audience, and does not intend to provide a comprehensive survey of the related statistics literature, which is another important line of research that often focuses more on statistical theories (e.g., distribution approximation) and has been well summarized in [73, 75]. The rest of the survey is outlined as follows: Sec. 2 introduces the application domains of hotspot mapping; Sec. 3 summarizes techniques for statistically-robust clustering, with Sec. 3.1 providing general key concepts, Sec. 3.2 providing an overview of building blocks, Sec. 3.3 providing a taxonomy of data models, Sec. 3.4 providing a taxonomy of various statistical processes and test statistics, Sec. 3.5 providing a taxonomy of different region definitions as well as their enumeration and maximization algorithms, Sec. 3.6 summarizing different paradigms and methods for significance testing, and Sec. 3.7 discussing evaluation methods and metrics. Finally, Sec. 4 and 5 discuss challenges and opportunities for future research, and conclude the paper with highlights.

Statistically significant clustering for hotspot mapping has been widely used with an ever-growing variety of spatiotemporal data sources from different domains. In this section, we highlight example applications from several domains where statistically interpretable outputs and robustness against spurious results are especially important for both research investigation and policy making.
Public health: The wide adoption of hotspot mapping started in public health with the popularity of the spatial scan statistic [103]. The most typical use is to identify the outbreaks of diseases, including cancer (e.g., childhood leukemia [84], cervical cancer [41]), infectious diseases (e.g., malaria [130], dengue [78, 189], COVID-19 [44, 85, 110]), and many others [63, 78, 117]. With the awareness of both disease cases and the population at risk, detected hotspots of diseases reveal abnormal underlying factors that are otherwise hidden from public health researchers and policy makers. Many of these applications have been considered as standard practices in public health studies (e.g., National Cancer Institute [1]). In addition to case data, other sources of related disease information, including symptoms (e.g., diarrhoea and fever [107]) and increases in pharmacy visits [141], have been used to improve the timeliness of outbreak alarms. With the growing variety of spatiotemporal datasets, recent applications have also explored the incorporation of proxy data from social networks (e.g., geotagged tweets about dengue fever [188]) and trajectories.

Public Safety: Hotspot detection is popularly used in public safety applications to identify statistically alarming clusters of criminal activities, accidents, suicides, severe disaster impacts, etc. In crime analysis, such significant clusters are used to reveal patterns of homicide and drug traffic [223], violence-related deaths [127], police deaths in the line of duty [99] and more, with respect to many underlying factors including natural disasters, weather, new construction, etc. Geographic profiling of serial criminals, characterized as ring-shaped hotspots of serial crimes (e.g., arson), is also used to help narrow down potential residence locations of criminals [64]. Similarly, adoptions can be found in early warning of terrorism [71], minefield detection [202], and identification of significant concentrations of accidents (e.g., falls, fire injuries, shark attacks) [13, 54, 207].

Urban Planning and Transportation: Statistically significant clustering helps urban policy makers and designers better monitor, manage and plan infrastructure changes. Typical uses include locating breakage of distribution pipes [51], mapping city areas with low accessibility to critical infrastructures or housing difficulties [164], identifying disparities in air quality [81], revealing regions with aging problems [220], and many others [33, 83]. Generalized hotspots can also reflect levels of diversity (e.g., trees in urban forests, types of business services) and have been used to detect non-resilient designs or distributions of infrastructures within cities [208]. The methods have also been extended to cover network-based applications in transportation. In transportation-safety related use cases, road segments - including both intersections and linear paths - have been used to identify subspaces in road networks that have significantly higher rates of traffic accidents (e.g., pedestrian fatalities) [157, 197]. With emerging data sources such as on-board-diagnostic data (i.e., vehicle trajectories with hundreds of engine measurements), hotspot detection is also used to find sub-paths along routes that cause divergent behaviors in emission or energy consumption [116].

Others: Forestry studies use hotspot mapping to analyze forest health at large scales [172] and detect clusters of forest fires [158], tree regeneration [68], etc.
It is also used in both plant and animal agriculture. In plant agriculture, for example, spatial concentrations of high and low ratios of production-to-environment-cost are used to suggest alternative plans for agricultural conservation planning [179]. Hotspots of crop diseases were also used to map disease spread across plants [48]. On the agricultural operation safety side, the approach is used to find spatial clusters of field accidents (e.g., tractor overturns [177]). In animal agriculture, hotspot mapping has been widely employed to identify outbreaks of diseases in stock herds [194]. Finally, a great variety of applications exist in many other fields, including astronomy [32], medical imaging [221], environment [102], climate change [217], fishery [193], entomology [19], and many more.

In this section, we first provide key concepts and a high-level taxonomy to illustrate the building blocks of statistically-robust clustering. Then we will dive deeper into each building block and discuss the detailed taxonomy and summary of developments. In the following, we summarize the key concepts and the general problem definition for statistically-robust clustering. Here we keep the definitions at a high level given the variety of formulations when fine details are considered, which will be discussed in later sections.

Definition 3.1. Case and control. A case is an observed incident that is related to a real-world event or phenomenon of interest (e.g., the spread of a disease). The number of cases is the number of observed incidents (e.g., the number of people with positive test results for a disease). In contrast, control represents the base population of candidates for the same event, where each candidate is subject to being a case of the event (e.g., all people being tested for a disease or at risk of being infected). Data on case and control can be presented at either individual (e.g., points) or aggregated levels (e.g., polygons, each of which contains the number of case or control points inside).

Definition 3.2. Point distribution and point process. A point distribution is a spatial distribution of cases related to an event of interest. A point process governs the generation of random point distributions given a control distribution. In hotspot detection, it - for example - can determine the probability of each control point being a case of the event based on its location. If the spatial distribution of control is not discrete but continuous, the point process can then determine the probability density across space. Here a "point" refers to a general spatial data object, which can be a real point (location), a trajectory, etc.

Definition 3.3. Hotspot. A hotspot is a sub-region of a given input space, where the rate of generating case points inside (or the probability of each individual control point being a case point) is higher compared to the outside of the region. The existence of a hotspot means the point process is not homogeneous in the input space but clustered at the hotspot region.

Given case and control distributions as inputs, statistically-robust clustering aims to identify clusters formed by true hotspots rather than by pure random chance. The output clusters can either be represented using sub-regions of the input space or subsets of case points. Different from traditional clustering, output clusters are not merely regions with a high intensity of case points.
In contrast, the intensity is statistically adjusted by an underlying control distribution and output clusters must pass significance testing defined by a point process.

Fig. 1 shows a high-level taxonomy of the key steps in statistically-robust clustering for hotspot mapping. Since hotspots can be considered as one type of cluster, in the rest of the paper the two names are used in an interchangeable manner. At the beginning of the process is the input spatial data model (Fig. 1). One of the key characteristics of data in hotspot mapping - case vs. control - is highlighted under the step. A variety of commonly used data types will be discussed and categorized in detail in Sec. 3.3. The second key step is the statistical modeling needed for rigorous control of spurious results. The first essential element within the step is a spatial (or spatiotemporal) point process needed to model the generation of a point distribution. This is often determined by assumptions in domain applications (e.g., Poisson point processes are common choices for monitoring disease outbreaks). Then, a hypothesis is needed to provide a clear mathematical definition of a "hotspot" (e.g., the inside probability density of generating events is higher than the outside). This definition is important as it spells out the criterion of a true hotspot, providing a mathematical ground to generate ground-truth synthetic data during evaluation (Sec. 3.7). Finally, a test statistic is needed to compute a score for each candidate cluster, which is necessary to allow comparison and ranking among a large number of candidates. The third key step focuses on finding "where" a hotspot is based on the statistical modeling. This is achieved by clearly defining an enumeration space (e.g., a parameter space containing all candidates) and searching over regions to maximize test statistic values. The definitions of enumeration spaces can be used to guarantee that all candidate regions are spatially contiguous and meaningful to decision makers. Additional constraints can be applied to the enumeration space to avoid undesired outputs. The last step before a hotspot can be added to the output is significance testing. This is the final guard in this process to enforce robust control of the rate of spurious clusters. The testing can be performed using either the frequentist or the Bayesian view depending on the application need. The view should be consistent with the point process modeled in the second key step. Due to the typical need of Monte-Carlo-based estimation during the test, acceleration techniques are important to reduce the high cost by leveraging unique characteristics of the simulation process. Finally, in Fig. 1 there is an arrow connecting back to the beginning of this process, which is data update. In this step, any detected significant hotspot is removed from the data before continuing to identify the other hotspots. This is often necessary due to the shadowing effects between hotspots in the data, which will be illustrated in Sec. 3.6.1.

Mapping of hotspots covers a wide variety of data types and there are many different criteria that can be used to classify these data types into different groups. Table 1 shows a detailed taxonomy of various ways of classification.
Table 1. Classification of input data for hotspot mapping, with examples.
• Type of data models:
- Point: crime cases, injuries, fires, geo-tagged tweets, sampling sites [13, 64, 102, 158, 188, 189, 207, 216, 223]; traffic accidents and vehicle tows [92, 157, 197, 200, 215]
- Polygon: county or city district maps with statistics of events (e.g., disease), shorelines [13, 41, 85, 103, 117, 141]; grid cells [91, 136, 146, 147, 151, 152, 217, 221]
- Trajectory: vehicle trajectories, smartphone or geo-tagged social media trajectories [15, 91, 116, 123, 188, 189]
- Time-series: hospital visits over the past days/weeks [95, 133, 136, 137, 141, 152]
• Type of spaces:
- Euclidean: a state, city or country [41, 85, 117]; a forest area [68, 158]
- Spatial network or graph: road or river network [92, 116, 157, 197, 200]; utility network or graph representation of space [28, 51]; social network with geo-tags [36, 188, 189]
• Level of uncertainty:
- Individual: locations of crimes, traffic accidents or crop disease incidents (e.g., point-type examples above); perturbed or inaccurate locations and timestamps [91, 121, 156]
- Aggregated: county maps with counts of cases and population, where uncertainty comes from both the aggregation and inaccuracies (e.g., [25, 52, 118])

A key characteristic of data used in statistically-robust clustering is the need for both case and control data, which is not the case in traditional clustering approaches (i.e., no control). In many critical societal applications (Sec. 2), the emergence of events is closely linked to other related distributions. In transportation, for example, a road segment with a higher traffic volume is also more likely to have a higher density of traffic accidents, but such "clusters" are not interesting to decision/policy makers as they are expected and often do not call for additional investigation and intervention [116, 197]. To have a more holistic view of this process, we can consider two levels of an observed dataset: (1) incidents of one or more events (e.g., crime cases); and (2) underlying contributing factors to the events (i.e., control). When control is ignored, the resulting clusters are representations of an aggregation of all the underlying contributing factors - which together make the cluster regions have higher rates of generating the observed data samples. Having control data as an input allows users to take out the contributions from known factors (e.g., traffic volume) - either interesting or not - and let resulting clusters reveal unknown factors causing higher rates of the events. More examples are provided in Table 1.

Statistically-robust clustering takes a wide variety of input data models, ranging from points, polygons, grids and trajectories to time-series data. Examples of data models are shown in Fig. 2 as well as Table 1, and illustrated as follows:
• Point: Point is the most popular and widely used vector data model for mapping hotspots [63, 107, 127, 130, 223]. Each point has geographic coordinates, observation type (i.e., case, control), and event information (e.g., event type such as disease or crime, event subtype, and event value such as air pollution indices if data is continuous).
• Polygon and grid: Polygons are typically used when observations are given at aggregated levels [30, 41, 78, 81, 84, 89, 105, 106, 220]. For example, population and demographic information is normally only available through census tracts, which may be shared at higher aggregation levels (e.g., census blocks or census block groups) due to privacy concerns.
Similarly, certain types of events (e.g., cancer data) may only be available at city- or county-levels. Such datasets are often given in the form of polygons representing the boundaries of the corresponding aggregation units. Some techniques take inputs in the grid format mainly for computational purposes (e.g., it is easier to enumerate contiguous rectangular-shaped candidate clusters) [155, 172]. In each grid cell, a count is recorded for different types of observations and events. When the original dataset is not provided in a grid format, grid-based methods will preprocess it into one, which may require additional assumptions for data given in polygon formats (e.g., a homogeneous distribution within each polygon).
• Trajectory: As trajectory data is becoming available in greater varieties and volumes, it is also an important source of inputs for hotspot mapping. Examples of trajectory data that have been used in this field include social-media (e.g., a spatiotemporal sequence of geo-tagged tweets [188]) or smartphone trajectories of users, and location traces of vehicles and properties (e.g., taxi trajectories [91, 123]; on-board-diagnostic data with hundreds of engine measurements [116]). Trajectory data can be used in two different ways. First, it can be used to identify hotspots that have a linear-path shape in a network space (e.g., a path with abnormally high energy consumption). Second, it can be used to identify hotspots as geographic regions [123], in which case events are no longer spatially static points but moving objects. This is especially valuable in epidemiology, in which a trace of locations provides much richer information than a static point (e.g., where an infected individual resides or is hospitalized) [188, 189]. One limitation of this type of data is its availability and quality (e.g., representativeness) for different types of applications.
• Time-series: In some application scenarios such as infectious disease monitoring, decision makers may be more interested in knowing about short-term dynamics reflected by the data rather than long-term phenomena. Time-series data is very helpful for such purposes [107]. In time-series data, a point, polygon or grid cell becomes a sequence of observations at different timestamps (e.g., numbers of visits at a hospital on a daily basis [141]). This allows finding emerging hotspots, i.e., regions that were not but are becoming significant hotspots based on the latest observations.
Sometimes the input datasets are a mixture of different data models based on their availability. For example, case data may be points of crime cases while control data may be polygons (e.g., census blocks) with population information. Based on application contexts and characteristics of the phenomenon, different types of spaces may be used with input data. The two most common spaces used in hotspot mapping are Euclidean and network spaces. Euclidean space is best suited when the goal of an application is to find geographic regions (e.g., a zone in a city) as hotspots or when the phenomenon may propagate and evolve with high degrees of freedom (e.g., diseases that spread by air) [103, 151]. Network space should be used when the goal is to find a road segment, a path or a sub-network as a hotspot [51, 62, 92, 157, 200]. For example, traffic accidents mostly happen only on road networks, so a road segment or route often best represents the spatial footprint of the phenomenon.
Similarly, in crime hotspot detection, if police officers are more interested in the travel distance of a serial criminal from a potential point (e.g., a residence location, which is unknown and needs to be enumerated) to a crime location [62], spatial networks should be used to model input data. In addition to road networks, other examples of networks include river networks, utility networks (e.g., powerlines, pipes [51]), communication networks, and social-spatial networks [36]. Uncertainty may be introduced when data points are inaccurate either due to errors or privacy-related perturbation [91, 121, 156]. Inputs to statistically-robust clustering may also be provided at various aggregation levels, including no aggregation (original point samples), aggregation of nearby cases within a certain distance, aggregation at census-block or county levels, etc. Such aggregations also introduce additional uncertainty into the clustering process. The effects of these uncertainties on hotspot mapping have been examined and visualized by many studies [25, 52, 91, 118, 121, 156].

Statistically-robust clustering uses a variety of spatial and spatiotemporal point processes to model different types of hotspots and their corresponding hypotheses, and has developed various test statistics to evaluate hotspot candidates. In the following, we first show a taxonomy of statistical models and then discuss formulations of test statistics as well as design strategies. It is worth noting that while many of the formulations (e.g., test statistics) are originally proposed as a part of a technique (e.g., together with maximization algorithms), they can be generally used for different problems; an analogy is the modeling of loss functions in deep learning.

Point processes: Point processes have been developed based on a variety of statistical models to fit different types of input data models and application needs in hotspot detection. A point process mainly concerns point distributions over space, and the following summary includes more generalized definitions that may also consider point attributes (e.g., related to marked point processes).
• Bernoulli and Poisson: In the Bernoulli model, control data (e.g., the population at risk for a disease) is a collection of discrete points in some space (e.g., Euclidean, network), and each point follows a Bernoulli distribution with parameter p, where p is the probability of being an instance of an event (e.g., disease [103], crime [64], accident [197]). Statistical hypotheses are used to define hotspots (i.e., true clusters). The null hypothesis H0 is that all points follow the same Bernoulli distribution (same p) in the study area, and the alternative hypothesis H1 states that there exists a hotspot, where points inside follow Bernoulli(p1) and points outside follow Bernoulli(p2), with p1 > p2. Instead of making assumptions on individual points, Poisson point processes assume expectations at regional levels [103].
• Multivariate: This model takes an integrated view of multiple types of events that are related to a common parent type. Examples include multiple symptoms that relate to a single disease [107, 145], multiple groups in a population [184], multiple actions (e.g., visiting a hospital, purchasing OTC medications) following an infection [141], and different types of crimes with similar motives. There are mainly two approaches to model the alternative hypotheses (H0 still assumes a homogeneous process).
The first approach considers a single H1 that assumes there is a single parent event causing the increase of sub-event instances [107]. In contrast, the second considers a set of alternative hypotheses H1, each relating to a different subset of sub-events [141]. For example, both hospital and pharmacy visits are related to a disease outbreak. However, pharmacy visits can also be caused by promotional events that are not related to an outbreak. Multiple alternative hypotheses enable such fine-grained characterization.
• Multinomial: This process also considers multiple types of events. However, these events are typically independent, unlike the multivariate case [98, 208]. Examples of such types include races, species, unrelated diseases, etc. This modeling is used when the goal is to find sub-regions where the proportions of different types differ significantly from the rest. Thus, H0 states that the composition of proportions is the same across the study area, whereas H1 states otherwise. Here control data is typically not used.
• Continuous: This differs from the other processes by having continuous attributes on geolocated data points (e.g., air quality, house prices) rather than binary values or aggregated counts. To model continuous values, methods typically use normal [89, 105], exponential [86] or other continuous distributions [49, 97]. Using the normal distribution as an example, H0 states that the mean is the same across the whole study area and H1 states that points in hotspot regions follow distributions with higher means.
• Expectation-based: All the processes above are mainly designed for mapping spatial hotspots but can be extended to spatiotemporal use cases. In contrast, the expectation-based model is designed to focus on emerging (i.e., spatiotemporal) hotspots [136, 152]. It considers each location in a grid partition of space as a time-series of counts (e.g., the number of visits to a hospital) following a Poisson process. Unlike the "inside vs. outside" view typically used in the hypotheses of other processes, here it compares the most recent time period to the history of each spatial sub-region. H0 states that the Poisson mean is the same for both the recent time window and the past history, whereas H1 says otherwise.
• Bayesian: This model also focuses on finding emerging hotspots on grid-aggregated data [95, 133, 137, 141, 142, 204]. As an example, [133] uses Bayes' rule with the conjugate Gamma-Poisson distribution to model the prior and posterior, where the mean of the Poisson is q · b, with b the control population at a location and q the expected probability, and q further follows a Gamma prior. H0 assumes that q follows the same Gamma distribution across the entire study area, whereas H1 states that there exists a sub-region where the prior is different from the rest.

In the context of statistically-robust clustering, test statistics are used to assign a score to each candidate cluster for the purpose of comparison and ranking. Early non-probabilistic test statistics are often used to find clusters of a fixed size (e.g., squares with a predetermined side length) without an underlying control distribution. With these simplifying assumptions, the pure cardinality of data points in a candidate is often sufficient as a test statistic. For more general cases with variable candidate sizes and an underlying control distribution, simple extensions such as density or density ratio are used for normalization [92, 157, 197].
However, both density and density ratio are well known to be strongly biased towards small candidates due to their monotonicity (e.g., any partitioning of a sub-region will yield a partition with a density greater than or equal to that of the sub-region). As a result, techniques based on these statistics lose detection power quickly as the size of a true hotspot increases [147, 214]. To mitigate the bias issues, most commonly used test statistics in the frequentist framework adopt the likelihood ratio formulation (examples shown in Table 2) [63, 65, 103, 106, 147]. With the data generation process being defined by a parametric distribution, the general form of the likelihood ratio for a region R is LR(R) = P(Data | H1(R)) / P(Data | H0), where P(Data | H1(R)) and P(Data | H0) are the likelihoods of the data being generated by the alternative hypothesis H1 and the null hypothesis H0, respectively. A larger likelihood ratio indicates that an observation is more likely to exist under the alternative hypothesis. Different parametric distributions are used for different application scenarios (e.g., examples from Sec. 3.4.1), resulting in various forms of likelihood ratios. For example, Poisson-based likelihood ratios can be used to identify outbreaks of diseases or hotspots of crime activities [63]; continuous versions can be used to find regions with high pollution; and expectation-based extensions are more sensitive to temporal changes [136]. Various studies have also explored the integration of prior knowledge using regularizers in test statistics. For example, penalty terms are added to favor higher spatial resolution [70], discourage irregularly shaped and disconnected clusters [30], and promote dynamic clusters that change smoothly over time [192]. Methods were also developed to reduce the bias of test statistics across scales (cluster sizes) and improve detection power (e.g., the average likelihood ratio [34, 69], spatial mixture index [208], and nondeterministic normalization index [214]). For computational efficiency, simpler formulations of test statistics are used, such as the elevated mean scan statistic [168]. All the above-mentioned test statistics assume the underlying probability distribution obeys a parametric model. In scenarios where prior knowledge on underlying distributions does not exist, nonparametric-model-based test statistics are introduced, such as Berk-Jones [21], nonparametric scan statistics [36, 143], and Higher Criticism [56]. Without assumptions on statistical distributions, nonparametric test statistics often rely on empirical p-values for significance testing. There are also works that use a proxy to represent the underlying distributions. For example, for the continuous-value model, the Mann-Whitney statistic is used to measure the distribution of the rankings of point values instead of the actual values [49]. Finally, these test statistics can also be applied to identify significant sub-graphs (e.g., based on Steiner-connectivity) in network or graph data, as summarized in [25]. Many Bayesian approaches have also been developed to complement the frequentist view for significant clustering [140]. Advantages of Bayesian measures include the capacity to incorporate informative prior information, when available; the flexibility of considering multiple hypotheses; and the direct calculation of posterior probabilities.
In general, Bayesian-based methods do not rely on test statistics; rather, they directly calculate posterior probabilities using Bayes' rule (e.g., univariate [133, 149], multivariate [141, 150], PANDA-CDCA Bayesian network [43, 96]). With that being said, the Bayesian posterior probability can also be considered as a test statistic, and empirical thresholds can be applied as an approximation of significance levels in the frequentist setup. There are also frequentist methods that use Bayesian modeling to design test statistics, such as the Bayes factor [227] and anomalous group detection [50].

Table 2. Examples of commonly used test statistics.
• A log likelihood ratio (LR) using the Poisson point process (e.g., [103, 147]): log LR = c log(c / e) + (C − c) log((C − c) / (C − e)), applied together with the indicator function 1(c / e > (C − c) / (C − e)), which guarantees the candidate is dense rather than sparse. Here c and b are the counts of case and control in a candidate cluster, respectively; C and B are the total counts of case and control in the study area; and the denominators e = C · b / B and C − e represent the expected numbers of cases inside and outside the candidate under H0 (i.e., no hotspot).
• A log LR using the continuous normal model (e.g., [89, 105]), where H0 states that all samples follow the same mean value and H1 states that there exists a candidate region following a process with a higher mean. Here n and N are the numbers of samples in a candidate cluster and in the study area, respectively; x is the value of a sample in the candidate; μ and σ are the mean and standard deviation of all samples; and σ′ is a standard deviation estimated assuming samples inside and outside a cluster follow different means (i.e., H1).
• A log LR using the multinomial model (e.g., [98, 208]), where H0 states that p_k = q_k for all k = 1, ..., K across the whole space, whereas H1 states that some regions have different compositions. Here K is the number of types (e.g., disease types); c_k and C_k are the numbers of type-k samples in a candidate cluster and in the study area, respectively; p_k and q_k are the probabilities of a point being of type k inside and outside the candidate; and p′_k = q′_k = C_k / C, with C the total number of samples in the study area, is the probability of a point being of type k in the whole data.
• An expectation-based log LR (e.g., [136, 152]), where c is the number of cases observed in a candidate cluster at the current timestamp t, b is the baseline learned from historical data, and the max(·, ·) function is used to filter out scenarios where c < b. Compared to the Poisson model shown above, this log LR compares the concentration of samples in a candidate cluster at t to its historical expectation from t′ < t rather than to the concentration outside the candidate.

A well-designed test statistic should be unbiased and ideally have the optimal asymptotic detection power under a significance level. Many theoretical insights and results have been developed to reveal major issues to consider during the design and to recommend strategies to address them both statistically and computationally. For example, most existing test statistics used in research and applications are based on ratios between the maximum likelihoods of hypotheses H1 and H0. Many of these standard formulations, including Kulldorff's likelihood ratio, have also been shown to be scale-sensitive. Specifically, these test statistics can be strongly biased towards small candidates [34, 175, 205, 214, 216] (albeit much less so than density [146, 214]). In significance testing, statistically-robust clustering techniques commonly estimate the distribution of test statistics using the best candidate (i.e., the maximum score) from the H0 point distribution in each Monte Carlo trial, which is needed to avoid the multiple testing problem (Sec. 3.6.1).
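To make this Monte Carlo procedure concrete, the sketch below estimates the empirical p-value of the observed best score by repeatedly redistributing the observed cases over the control units under H0 and recording the maximum score in each replicate. It is a minimal illustration under assumed aggregated inputs; the array `control_counts` and the callable `score_of_best_candidate` (i.e., whatever region-maximization routine from Sec. 3.5 is in use) are hypothetical names, not taken from any specific surveyed tool.

```python
import numpy as np

def monte_carlo_p_value(observed_best_score, control_counts, total_cases,
                        score_of_best_candidate, num_trials=999, seed=0):
    """Monte Carlo estimate of the p-value of the best candidate's score.

    Under H0, the total_cases observed cases are redistributed over the
    control units proportionally to their control counts (a conditional
    Bernoulli/multinomial randomization).  In each trial, the maximum
    test-statistic value over *all* candidates is recorded, which accounts
    for multiple testing over the enumeration space.
    """
    rng = np.random.default_rng(seed)
    probs = control_counts / control_counts.sum()
    exceed = 0
    for _ in range(num_trials):
        sim_cases = rng.multinomial(total_cases, probs)   # one H0 replicate
        sim_best = score_of_best_candidate(sim_cases)     # max score in replicate
        if sim_best >= observed_best_score:
            exceed += 1
    # Standard Monte Carlo p-value, counting the observed dataset itself.
    return (exceed + 1) / (num_trials + 1)
```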
Thus, an intuitive interpretation behind the bias is that the distribution of the maximum score - as a random variable - is scale-dependent [61, 173, 175, 205]. In other words, the mean of the maximum score of smaller candidates tends to strongly dominate that of larger candidates. As a remedy, a penalty term was constructed to correct the bias and restore the optimal asymptotic detection power for hotspots of different sizes [61, 167, 175, 205]. A further improvement uses square roots of log likelihood ratios to construct the penalty term, requiring less case-specific design [173]; the extension also provides additional computational efficiency (near linear time). Another approach is to correct the bias during significance testing, where a block criterion is developed to adjust critical values (i.e., thresholds on test statistic values to remove insignificant results) for candidates belonging to different sizes or size-groups [175, 205], guaranteeing optimal asymptotic power of detection. Average likelihood ratio approaches have also been developed to reduce the bias, where each score is a weighted average of those of neighboring candidates [34, 69]. According to theoretical results in [34], the original likelihood ratio in the spatial scan statistic has optimal detection power only for hotspots with the smallest footprints, whereas the average likelihood ratio has optimal power at all scales. The analyses were carried out for the discrete one-dimensional scenario. Another view of the average likelihood ratio approach is that it uses marginal likelihoods instead of maximum likelihoods [140], which may better utilize secondary cluster information as in the Bayesian test statistics [133, 149, 150]. In addition to the pure-scale interpretation of the bias, a spatial interpretation has been developed [214, 216]. The analysis shows that the likelihoods or probabilities used to score a candidate in many approaches are based on a space bi-partitioning, where one partition represents the candidate and the other the outside. Likelihoods calculated using the bi-partition implicitly assume that the best candidate in a random dataset still lies in the candidate's partition, which is rarely the case. This leads to a large underestimation, especially for the likelihood under H0, since the location of the best candidate should be nondeterministic in a purely random point process. To correct the bias, spatial-nondeterminism-aware test statistics were proposed [208, 214], which can be computed using dual-level Monte Carlo simulation [208]. The use of another simulation introduces additional computation overhead but can be generally applied to different point processes (e.g., Bernoulli, Poisson, multinomial), test statistics (e.g., density, likelihood ratio, spatial mixture index), candidate shapes (or enumeration spaces), maximization algorithms, etc.

Region enumeration is the core computational step in statistically-robust clustering, which examines a mathematically well-defined candidate space of clusters to identify the best candidate that maximizes the test statistic for further significance testing. While this routine only returns one cluster due to maximization, techniques in the literature often repetitively apply it as a sub-routine to find an arbitrary number of clusters, i.e., by removing the best cluster (if significant) from the data after each round and then starting a new round for the next (e.g., [103, 146, 208]).
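This one-cluster-at-a-time scheme can be outlined as follows. The sketch is a generic illustration rather than any specific published algorithm; the three callables passed in stand for the region enumeration/maximization step (Sec. 3.5), the significance test (Sec. 3.6), and the data update step, and are hypothetical names.

```python
def detect_hotspots(cases, controls, find_best_candidate, estimate_p_value,
                    remove_region, alpha=0.05, max_clusters=10):
    """One-cluster-at-a-time detection loop (sketch).

    find_best_candidate(cases, controls) -> (region, score): enumeration and
        test-statistic maximization over the candidate space (Sec. 3.5).
    estimate_p_value(score, cases, controls) -> float: Monte-Carlo-based
        significance testing of the best candidate (Sec. 3.6).
    remove_region(cases, controls, region) -> (cases, controls): data update
        removing a detected hotspot to reduce shadowing effects (Sec. 3.6.1).
    """
    hotspots = []
    for _ in range(max_clusters):
        region, score = find_best_candidate(cases, controls)
        p_value = estimate_p_value(score, cases, controls)
        if p_value > alpha:
            break  # the best remaining candidate is not significant: stop
        hotspots.append((region, score, p_value))
        cases, controls = remove_region(cases, controls, region)
    return hotspots
```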
If the best cluster at the current step is not significant, the algorithm terminates and returns all significant clusters. This enumeration-and-maximization (or one-cluster-at-a-time) scheme is widely used for statistically-robust clustering because it reduces the mutual statistical influence (a.k.a. shadow effects) [113, 140] between multiple clusters in the same dataset (Sec. 3.6.1). In the following, we will first summarize definitions of enumeration spaces and constraints that are popularly used in the literature, followed by maximization algorithms as well as acceleration techniques of various types. A summary taxonomy is provided in Table 3.

Fig. 3. (a) Examples of hotspot shapes [208]: ring-shaped hotspots for finding serial crimes and the source of a disease [63, 64]; and arbitrarily-shaped hotspots by Significant DBSCAN, which also shows its robustness against spurious clusters (top: random data without clusters; bottom: clustered data) [212, 215]. (b) Linear hotspots of traffic accidents (pedestrian fatalities) [157, 197] and iso-distance hotspots of crimes on road networks [65].

Definitions of enumeration spaces often take into consideration both application needs and computational feasibility/efficiency. These definitions can be well classified using cluster shapes (i.e., predefined, flexible), with additional consideration of data types and data spaces (e.g., Euclidean, network).

Predefined geometric shapes: The majority of statistically-robust clustering techniques were developed for a specific type of geometric shape (e.g., circle, ellipse, rectangle, ring, linear path) that can represent real-world phenomena in target application domains. The corresponding enumeration space then covers all valid sub-regions of a desired shape. The circle, for example, is the most widely used shape in epidemiology and public health research [1, 42, 65, 78, 84, 103, 117, 123] due to natural diffusion mechanisms, which tend to make infectious diseases spread in an approximately circular shape, and due to computational convenience. Circles were extended to ellipses to better model effects of spatial anisotropy and provide more flexibility beyond circles [106]. Rectangles were also used as proxies for circles/ellipses because of the computational convenience they offer on gridded inputs [146, 147]. Ring shapes were developed for applications in public safety and criminology research. According to the routine activity theory, serial criminals (e.g., arsonists) tend to commit crimes neither too close to nor too far away from their home to lower both the risk of being recognized by neighbors and the cost of travel [64]. Thus, ring-shaped zones [62-64] were used to depict the footprints of serial criminals. While these shapes are more natural for point data (Sec. 3.3), they have also been used with polygon data (e.g., county polygons with total numbers of disease cases and population). In such cases, polygon data are first preprocessed into center points to perform region enumeration, and then in the output phase, the polygons containing points from each significant cluster are merged to generate the final results. In transportation or city infrastructure related applications, the vast majority of techniques use network space because the data (e.g., locations of traffic accidents or pedestrian fatalities) or phenomena are based on road, water or utility networks. Correspondingly, linear paths or intersections along network edges are used to model shapes of candidate clusters [51, 92, 157, 185, 197, 200].
In addition, network space has also been used to identify "areal" clusters (e.g., a circle becomes a sub-network where the network distances from all locations to a center are less than or equal to a radius threshold), especially in urban areas where network distance is a better measure of spatial connectivity than Euclidean distance [62, 199]. For example, two locations separated by a river can be very close in Euclidean space but far apart by network distance. Fig. 3 shows several common examples of shapes used to define the enumeration space. From a computational perspective, further simplifying constraints (e.g., two-point circles, shortest paths) are often needed to guarantee tractable search spaces; otherwise, a simple circular shape in a continuous space can lead to an infinite number of candidates. This will be detailed in Sec. 3.5.2.

Flexible shapes: A variety of techniques have been developed to detect arbitrarily shaped spatial hotspots to better capture the complex boundaries of real-world patterns caused by underlying physical and social contexts. Due to the non-parametric nature of arbitrary shapes, candidate enumeration spaces used for this paradigm are typically not well-defined. Instead, the vast majority of techniques define specific local search heuristics to iteratively go over a set of arbitrarily shaped candidates until certain criteria are met. A simple and popular formulation of this type is called the Upper Level Set (ULS) scan statistic [129, 162, 163]. ULS requires a tessellation of space (e.g., by geographic unit boundaries or grid-partitioning), and starts by calculating a test statistic for each cell. Then, a cell belongs to the upper level set of a threshold if its test statistic value is greater than or equal to that threshold. All connected components in an upper level set form the set of candidate clusters, with potentially arbitrary shapes. A ULS tree is constructed by applying a series of thresholds, and significant clusters are merged to generate the final results. Similarly, a minimum-spanning-tree (MST) based approach enumerates connected sub-trees that are part of an MST of tessellated input data, where the edge weights are inferred directly or indirectly from test statistic values [45]. There also exists a FlexScan approach that enumerates all possible connected subsets in a tessellation in a brute-force manner [195, 201]. However, due to the exponential nature of the search space, it can only identify small clusters (e.g., <30 cells [59, 139, 201]). A trajectory-based method was developed to relax the requirement of a tessellation [53], where data points are sequentially linked to form a trajectory using test statistic values; clusters are then enumerated by going through connected segments. Meta-heuristics have also been explored to generate irregularly shaped candidates, and most still require a tessellated input space. A simulated annealing formulation [57] was used to enumerate candidates by iteratively searching over neighboring choices, each defined as another candidate that differs by one cell from the current candidate. As a simulated annealing approach, the probability of moving into an inferior neighboring choice gradually decreases as the iteration number increases. Many genetic-algorithm-based formulations [58, 176] were also developed to generate arbitrarily shaped candidates, where offspring are evaluated by a choice of test statistic. In addition, a random-walk-based method was developed to search free-form regions using a heuristic metric [93].
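As an illustration of the upper-level-set idea described above, the sketch below enumerates flexible-shaped candidates on a gridded input by sweeping a set of thresholds and taking the connected components of each upper level set. It is a simplified, assumed implementation (4-connectivity only, without the ULS-tree construction or the scoring and merging of significant clusters); function and variable names are illustrative.

```python
import numpy as np
from scipy.ndimage import label

def upper_level_set_candidates(cell_scores, thresholds):
    """Flexible-shaped candidate clusters from upper level sets (sketch).

    cell_scores: 2-D array of per-cell test-statistic values on a grid
                 tessellation of the study area.
    thresholds:  iterable of level-set thresholds to sweep.
    Returns a list of candidates, each a frozenset of (row, col) grid cells
    forming one connected component of an upper level set.
    """
    candidates = set()
    for t in thresholds:
        mask = cell_scores >= t                # cells in the upper level set of t
        labeled, n_components = label(mask)    # default: 4-connected components
        for comp_id in range(1, n_components + 1):
            rows, cols = np.nonzero(labeled == comp_id)
            candidates.add(frozenset(zip(rows.tolist(), cols.tolist())))
    return list(candidates)
```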
To reduce the over-complicated "tree shapes" of clusters that are generated as a result of the freedom of arbitrary shapes but are spatially not meaningful, regularization functions were developed to penalize such shapes based on geometric compactness defined by various criteria (e.g., the ratio between a cluster's area and that of its convex hull [58, 60], sparsity [141, 180]). The regularization terms were also used in multi-objective formulations to generate Pareto sets, where clustering results are provided under different levels of shape-regularity constraints [30]. While traditional clustering techniques have provided many efficient schemes to detect irregularly shaped clusters (e.g., DBSCAN [66], OPTICS [14], spectral clustering [203], Chameleon [100], HDBSCAN [29, 153], etc.), these methods, as introduced earlier, do not incorporate the statistical rigor required by domain applications of spatial hotspot detection. Recently, a statistically robust formulation of DBSCAN - Significant DBSCAN - was developed to explicitly control the rate of spurious clusters [212, 215], though it currently does not model the distribution of control points described in Table 1, Sec. 3.3.1.

There are two types of constraints widely used for spatial hotspot detection. The first type mostly concerns computational feasibility, i.e., mathematically constraining the enumeration space to a finite and tractable set of candidates (e.g., non-exponential), whereas the second type is used to improve solution quality.

Computation-driven constraints: Given that even an enumeration space of predefined shapes may have an infinite number of candidates, techniques rely on constrained parameterization of shapes to make the enumeration feasible. In the Euclidean space, circles are often constrained by two points - one at the center and the other on the circumference - corresponding to an O(N²) search space [103]. A center point may come from the set of event incidents (e.g., disease cases), underlying control data (e.g., population points), or an additional input set of candidate centers (e.g., centers of cells in a grid [65]). This two-point definition has a natural explanation in epidemiology or similar domains, where the spread of an event (e.g., an infectious disease) propagates from a center source. Similarly, a ring-shaped cluster is parameterized by two concentric circles [63, 64]. Since the inner circle may have a low density of points (e.g., serial criminals often do not commit crimes too close to their residence locations), its enumeration space uses three-point circles where all three points lie on the circumference. The outer circle is still defined by one point on the circumference, as the center location has been determined by the inner circle; this leads to an O(N⁴) search space. For rectangular shapes, candidates are constrained to rectangles whose boundaries on all sides touch at least one data point or grid cell, yielding the same O(N⁴) space [146, 147, 205]. Many other similar constraints are used for ellipses, non-orthogonal ellipses, etc. Since flexibly shaped patterns can hardly be parameterized, there are often no further computation-driven constraints beyond the implicit constraints implied by different heuristic search strategies (e.g., local constraints). In the network space, the shortest-path constraint is commonly used for linear path enumeration [51, 157, 197].
While this may be a reasonable approximation of people's travel behaviors, many real-world paths such as bus routes do not always follow shortest paths. Instead, a relaxed simple-path constraint, i.e., any path without self-intersections, quickly increases the potential space to O(N!) [200]. Another extension beyond shortest paths is to directly define the enumeration space using real-world trajectories and their connected sub-components [116]. For network-distance based circles [199], rings [62], etc., the same shortest-path constraint is used to define the space that is reachable from a center (e.g., within a network-distance threshold r).

Solution-quality-driven constraints: This type of constraint is developed to further reduce non-meaningful outputs beyond statistical modeling, irrespective of computational considerations. The most important one is the spatial contiguity constraint, requiring that any two locations inside a cluster are mutually reachable through its interior. This typically increases the computational cost [128]; in fact, a linear algorithm exists if this constraint is taken out [138]. A relaxed version of this contiguity constraint is the proximity constraint [138, 190-192], which requires segments inside a candidate to be within a local neighborhood defined either by a spatial distance or by k-nearest neighbors; the segments are allowed to be spatially disjoint. Similarly, density-contiguity within each hotspot can also be enforced via local constraints [212, 215]. In the network space, contiguity constraints are commonly defined by sub-graph connectivity or its relaxed formulations [11, 25]. Another widely used constraint is a maximum population threshold, which eliminates clusters that cover more than 50% of the underlying population used as control data (e.g., [103]); otherwise, a cluster is considered a general phenomenon rather than a "hotspot". As discussed in Sec. 3.5.1, regularity constraints have also been used in arbitrarily-shaped cluster detection to avoid overly complicated tree-like patterns [30, 45, 58, 60].

Statistically robust clustering techniques typically detect only the best cluster in each round, and remove it before searching for the next cluster to avoid shadowing effects between patterns (detailed in Sec. 3.6.1). Thus, after an enumeration space is defined, maximization algorithms are used to identify the cluster candidate with the highest test statistic value. For example, the test statistics F(Z) in Table 2 can be used as objective functions, where Z represents a cluster candidate for which cluster-specific inputs (e.g., counts of cases and controls for the Poisson-based test statistic) are calculated.
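As a concrete illustration of this enumeration-plus-maximization step, the minimal sketch below enumerates two-point circular candidates in Euclidean space and scores each with a Poisson-based log likelihood ratio, returning the highest-scoring candidate. The point coordinates, the brute-force per-candidate counting (which the pre-sorting and integral-image tricks discussed next would accelerate), and the hotspot-only scoring convention are illustrative assumptions rather than any cited implementation.

```python
import math

# Hypothetical case (event) points and control (baseline, e.g., population) points.
cases = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.2), (5.0, 5.0)]
controls = [(0.8, 1.1), (1.1, 1.3), (4.9, 5.2), (5.1, 4.8), (2.0, 2.0), (3.0, 4.0)]

def poisson_llr(c_in, b_in, c_tot, b_tot):
    """Poisson-based log likelihood ratio; returns 0 unless the candidate is a hotspot
    (the inside case rate per unit of control exceeds the outside rate)."""
    c_out, b_out = c_tot - c_in, b_tot - b_in
    if c_in == 0 or b_in == 0 or (b_out == 0 and c_out > 0):
        return 0.0
    rate_in = c_in / b_in
    rate_out = (c_out / b_out) if b_out > 0 else 0.0
    if rate_in <= rate_out:
        return 0.0
    llr = c_in * math.log(rate_in) - c_tot * math.log(c_tot / b_tot)
    if c_out > 0:
        llr += c_out * math.log(rate_out)
    return llr

def best_two_point_circle(cases, controls):
    """Enumerate the O(N^2) two-point circles (center at any point, radius fixed by any
    other point) and return the candidate maximizing the test statistic."""
    points = cases + controls
    c_tot, b_tot = len(cases), len(controls)
    best = (0.0, None)
    for cx, cy in points:                       # candidate centers
        for px, py in points:                   # points defining the radius
            r = math.hypot(px - cx, py - cy)
            if r == 0.0:
                continue
            c_in = sum(math.hypot(x - cx, y - cy) <= r for x, y in cases)
            b_in = sum(math.hypot(x - cx, y - cy) <= r for x, y in controls)
            score = poisson_llr(c_in, b_in, c_tot, b_tot)
            if score > best[0]:
                best = (score, (cx, cy, r))
    return best

print(best_two_point_circle(cases, controls))
```

The returned score is what the significance test in Sec. 3.6 later compares against a null distribution of maximum scores.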
Summary of region enumeration and maximization approaches (table excerpt):
Enumeration space. Predefined shape: network paths [51, 92, 157, 197, 200], sub-networks with shape constraints [62, 199]. Flexible shape: problem-specific heuristics such as ULS [129, 162, 163], MST [45], FlexScan [195, 201].
Constraints. Computation-driven: shortest paths [51, 157, 197, 199], simple paths [200], or real-world trajectories [116]; relaxed sub-graph connectivity [11]. Quality-driven: spatial contiguity/connectivity [11, 25, 174, 190-192]; density contiguity [212, 215]; shape regularity [30, 45, 58, 60]; maximum population [63, 103].
Algorithms. Exact: bound based [63, 64, 146, 147, 151, 197, 198, 200, 215]; incremental updates [134, 141, 208]; linear-time subset scanning [25, 36, 138, 145, 191]; shortest-path-tree [157, 197]. Approximation: approximation-ratio based [7, 8, 123, 174, 229]; asymptotic-bound based [173, 175, 205]. Heuristic: data sampling [122], space reduction [63, 208], metaheuristics [57, 58, 93].

Many new computational techniques have been developed by the data mining community to reduce the cost of maximization. In the following, we summarize these acceleration algorithms in three broad categories: exact, approximation and heuristic algorithms.

Exact algorithms: The vast majority of maximization algorithms fall into this category, meaning they guarantee that the output cluster indeed achieves the maximum test statistic value within an explicitly defined search space (i.e., the cluster is the same as that of a brute-force algorithm). As a result, exact methods mainly focus on pre-defined shapes (Sec. 3.5.1 and 3.5.2). The initial algorithmic improvement was introduced by SaTScan [4]. The algorithm, applicable only to circular-shaped candidate enumeration, pre-sorts all points by distance to each circle center. The pre-sorting frees the candidate evaluation step from performing range queries to identify which case or control points are inside a candidate (this is inferred from the sorted order instead), reducing the overall enumeration and evaluation cost from O(N^3) to O(N^2 log N), assuming the cardinalities of both center and circumference points are proportional to N. Since then, many significant improvements have been developed to further reduce the cost. The first major type of strategy uses pruning, i.e., eliminating subsets of the enumeration space where no candidate can attain the maximum test statistic value. An overlap-kd-tree based approach was first developed for finding square-shaped clusters in a two-dimensional grid [146]. Overlaps are added to the original kd-tree partitions so that candidates spanning across partitions are not missed. Then, a branch-and-bound pruning strategy is used to enumerate through the tree. Since only the optimal cluster is selected in each round, the current maximum score is used as a threshold to prune branches with a smaller upper-bound score. Later works extended the overlap-kd-tree and branch-and-bound strategies to cover rectangular-shaped clusters [147] and multi-dimensional data [151]. Similarly, filter-and-refine strategies were designed for various pre-defined shapes (e.g., circles [65], rings [63, 64], ellipses [198], linear paths [157, 197, 200], iso-distance sub-networks [199]) in non-tessellated input spaces. Leveraging the fact that counting the number of points in a subgrid (e.g., in O(1) time using an integral image [215]) is much faster than in a continuous space, these techniques typically use a discretized grid space to compute upper bounds of the test statistic values of candidates in continuous spaces.
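The integral image (summed-area table) mentioned above is a simple data structure worth spelling out; the sketch below (with made-up cell counts) precomputes prefix sums over a grid of point counts so that the number of points in any axis-aligned subgrid is obtained in O(1), which is exactly what filter-and-refine methods use to bound candidate scores cheaply.

```python
def build_integral_image(grid):
    """Summed-area table S where S[r][c] = sum of grid[0..r-1][0..c-1]."""
    rows, cols = len(grid), len(grid[0])
    S = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += grid[r][c]
            S[r + 1][c + 1] = S[r][c + 1] + row_sum
    return S

def rect_count(S, r0, c0, r1, c1):
    """O(1) count of points in the inclusive cell range [r0..r1] x [c0..c1]."""
    return S[r1 + 1][c1 + 1] - S[r0][c1 + 1] - S[r1 + 1][c0] + S[r0][c0]

# Hypothetical per-cell point counts (e.g., rasterized case points).
counts = [
    [1, 0, 2, 0],
    [0, 3, 1, 1],
    [2, 0, 0, 4],
]
S = build_integral_image(counts)
print(rect_count(S, 0, 1, 2, 2))  # points in columns 1..2 of all rows -> 6
```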
A refining step is only carried out to check exact candidates whose upper bounds exceed a threshold (either user-specified or the current maximum). These bounds are often designed specifically for each shape based on its characteristics. In addition, incremental update rules have been developed to efficiently compute the test statistic scores of neighboring candidates (e.g., in O(1) amortized cost [134, 141, 208]); the update rules depend on the specific test statistic. Another widely adopted strategy is based on the linear-time subset scanning (LTSS) property first introduced in [138] for tessellated data. LTSS only applies to scenarios where the components (e.g., grid cells) of a cluster are unconstrained by spatial contiguity, i.e., a cluster can be any arbitrary subset of data elements. By relaxing the contiguity constraint, LTSS states that, for many test statistic functions F, there exists a priority (ordering) function G such that, among all subsets of cardinality k, the maximum score is always achieved by the subset whose members have the top-k individual priority scores in the data [138]. Assuming the individual components have been sorted by G, LTSS allows finding the optimal subset in linear time out of an exponential enumeration space (a minimal sketch of this prefix-scanning idea is given at the end of this subsection). It has been shown that the LTSS property is satisfied by a variety of test statistic functions [124, 145, 191], including Kulldorff's scan statistic, expectation-based Poisson or Gaussian functions, separable exponential families and many quasi-convex functions. The method has also been used in non-parametric scan statistics [36], GraphScan [25], etc. A few special cases have been proven to allow LTSS to output contiguous subsets as clusters (e.g., non-existence of break-ties [36]). As mentioned above, one key limitation of LTSS is the lack of spatial contiguity in each cluster. To mitigate this, a proximity-constrained extension was developed [190, 192]. Instead of applying LTSS on the entire dataset, it sequentially goes through all data components (e.g., grid cells) and applies LTSS only on the k-nearest neighbors of each component, or on the neighbors within a radius-r neighborhood. Discontiguity is still allowed for subsets within each local neighborhood, but this avoids long-distance separation.

In the network space, shortest-path trees were used to efficiently enumerate candidate paths between all pairs of points [157, 197]. Since data points may add a large number of low-degree (two-degree) nodes to a network, dynamic segmentation was employed to construct the shortest-path tree with only the original network nodes and then dynamically retrieve shortest paths between data points. For the all-simple-path enumeration space, the number of candidates can be factorial [200]. Thus, more aggressive pruning methods were developed. Instead of directly using network nodes to build a simple-path tree, fragmentation-multi-graph trees are used, where each tree node represents a sub-network and paths within each node can be dynamically computed and reused. The pruning and filter-and-refine strategies described above for Euclidean-space methods were also used in network-based hotspot detection [199]. To handle the high computational cost (e.g., NP-hardness [11, 25], high-order combinatorics [181, 182]) incurred by constraints on sub-graph connectivity, compactness or sparsity, methods often use relaxed reformulations such as semi-definite programming [11] or spectral graph theory based relaxations [181, 182], where results (e.g., detection power) can be rigorously analyzed.
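The promised LTSS sketch follows. It pairs an expectation-based Poisson score with a rate-based priority ordering, which is one combination consistent with the references above; the observed counts and expected baselines are made-up inputs. Only the N prefix subsets of the sorted order are evaluated instead of all 2^N subsets.

```python
import math

# Hypothetical observed counts and expected baselines for individual locations.
observed = [12, 3, 9, 1, 7, 2]
expected = [4.0, 3.1, 5.0, 1.2, 2.5, 2.4]

def ebp_score(C, B):
    """Expectation-based Poisson scan score for a subset with total count C and baseline B."""
    return C * math.log(C / B) + B - C if C > B > 0 else 0.0

def ltss_best_subset(observed, expected):
    """Linear-time subset scanning: sort locations by a priority function (here the ratio
    observed/expected), then only the N prefix subsets of that order need to be scored."""
    order = sorted(range(len(observed)),
                   key=lambda i: observed[i] / expected[i], reverse=True)
    best_score, best_subset = 0.0, []
    C, B, prefix = 0.0, 0.0, []
    for i in order:
        C += observed[i]
        B += expected[i]
        prefix.append(i)
        score = ebp_score(C, B)
        if score > best_score:
            best_score, best_subset = score, list(prefix)
    return best_score, best_subset

print(ltss_best_subset(observed, expected))
```

Note how the returned subset need not be spatially contiguous, which is exactly the limitation that the proximity-constrained extension addresses by applying this scan only within local neighborhoods.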
Approximation and heuristic algorithms: We discuss algorithms in these two categories together, as it is hard to draw an exact boundary given the various definitions used by different techniques. In computational theory, an approximation algorithm should guarantee that it always returns a solution whose objective value F(Z) (i.e., the test statistic value in hotspot mapping) is within a bounded gap ε of the optimal value F(Z*). For hotspot mapping, one challenge in clearly defining "approximation" is that, from an application perspective, it is as important to know "where" and "how large" a hotspot is as it is to know its test statistic value; this is further complicated by the large number of shapes in use. In addition, the problem has attracted broad interest from researchers in statistics and computer science, making the use of "approximation" more diversified. A major difficulty in approximating the solution of spatial hotspots is the combinatorial nature of candidate enumeration in a finite and discrete space of parameters. To bridge this gap, an approximation scheme was provided in [8] to search for the maximum-scoring region over a set of axis-parallel rectangles. This approach requires the scoring function to be a convex discrepancy function, and one key idea was to simplify the optimization using a set of linear functions to construct the approximation. Using Kulldorff's spatial scan statistic as an example of the scoring function, the approach reduces the search cost from O(N^4) to O((1/ε) N^2 log^2 N), where ε is the gap to optimality. The linear-function based approximation was further improved in [7], leveraging the fact that only the maximum-scoring region is needed in each round. It also provides a new approximation algorithm for grid-based input data with an O(g^3 · min(log g, 1/ε)) complexity, where g is the number of columns and rows in a g × g grid. For rectangular-shaped hotspot detection, an approximation version of the exact branch-and-bound algorithm was also given in [147]. This extension uses a relaxed instead of an exact upper bound and thus allows more aggressive pruning. Probabilistic results were provided to describe how likely each individual relaxed bound is to remain correct, though no distance to global optimality was given. A scalable sampling-based approach was also developed for trajectory data, which established approximation relationships among error bounds, cluster shape, sample size and execution time [123]; it also considered various ways to spatially sample and approximate the input trajectories. In the network space, approximation algorithms have been developed by analyzing and exploiting the submodularity of objective functions [174]. For sparsity constraints, efficient stochastic gradient descent algorithms have been developed, with linear convergence rates for various objective functions [229]. Another type of approximation algorithm aims to maintain optimality with respect to asymptotic detection power, guaranteeing that the approach gives the optimal solution with a sufficiently large number of samples [173, 175, 205]. Related results were first achieved for one-dimensional data, where regions are defined as intervals [175]. This approximation only needs an O(N log N) search over the O(N^2) enumeration space. The analysis was later extended to two-dimensional space [205], where the best candidate in a set of O(N^4) axis-parallel rectangles can be approximated by an economical set of rectangles, with an O(N log^2 N) cost.
Additional assumptions typically used by these results include a continuous surface of control data, independence between samples, and certain data types and test statistic functions. Several other methods have a more clearly heuristic nature. In ring-shaped hotspot detection, for example, a Delaunay triangulation was used to generate guesses of ring centers instead of exhaustively going over the centers of all three-point circles, reducing the center space from O(N^3) to O(N) [63]. For multinomial-model-based hotspot detection (e.g., hotspots of high diversity), a heuristic was developed to gradually narrow the search volume as candidate sizes grow, based on the observation that candidate scores tend to converge (i.e., stabilize) as their sizes get larger [208]. An empirical sampling-based approach was also developed to use only a subset of all data points to perform hotspot mapping [122]. Finally, the metaheuristics reviewed for flexible-shaped hotspot enumeration (e.g., [57, 58, 93]) can also be considered heuristic-based optimization strategies. While heuristic approaches in general do not guarantee solution quality, many have demonstrated high power and performance through empirical results.

Hypothesis testing is the final step in each round of hotspot detection, which determines whether the optimal cluster is a true hotspot or a spurious pattern. There are two major reasons for selecting and testing only the optimal region in a single round [103, 113, 140, 147, 216, 226]. First and most importantly, returning only the optimal region allows robust control of the false positive rate at a given significance level α; otherwise, testing multiple regions (e.g., candidates at different locations or with different sizes) at the same time may be susceptible to multiple testing, i.e., an increased chance of a type-I error. As suggested by many studies, this maximization can be viewed as an automatic adjustment of significance levels for different groups of candidates (e.g., candidates of the same size) [173, 175, 205]. The intuition can also be explained from a data perspective [208, 216]. Given datasets generated under a null hypothesis H_0, placing a critical value, i.e., a score threshold derived to determine a candidate's significance at level α, on the maximum score ensures that a spurious cluster can only be falsely reported as a hotspot in fewer than a fraction α of such datasets; in other words, to get a spurious detection in a dataset under H_0, its maximum score has to pass this threshold on the maximum score. The second reason is that this helps reduce the mutual shadowing effect between true hotspots [113, 226]; e.g., eliminating a true hotspot from the data will make the signal of a secondary true hotspot statistically stronger.
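The critical value on the maximum score can be illustrated with a minimal sketch: simulate datasets under H_0, record each dataset's maximum candidate score, and use the (1 - α) quantile of those maxima as the single reporting threshold. The score function below is a placeholder standing in for the full enumeration-and-maximization pipeline of Sec. 3.5.

```python
import random

random.seed(0)
ALPHA = 0.05

def max_score_under_h0():
    """Placeholder: maximum candidate score of one dataset simulated under H_0.
    A real implementation would run the full enumeration + maximization of Sec. 3.5."""
    return max(random.gauss(0, 1) for _ in range(200))

# Simulate the null distribution of the maximum score and take its (1 - alpha) quantile.
null_maxima = sorted(max_score_under_h0() for _ in range(999))
critical_value = null_maxima[int((1 - ALPHA) * len(null_maxima))]
print(f"critical value at alpha={ALPHA}: {critical_value:.3f}")

# A detected cluster is reported only if its score exceeds this single threshold,
# which keeps the dataset-level rate of spurious hotspots near alpha.
```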
In order to detect multiple clusters, a significant hotspot is removed from the data (both its case and control points), and the same detection and testing procedure is repeated on the updated data. While this may increase the risk of reporting a spurious hotspot at the dataset level, the effect is very slight because the execution of any following round is conditional on the detection of a significant cluster in the previous round (i.e., a strict unidirectional dependence between rounds). There have also been many studies trying to increase the detection power in multi-hotspot scenarios by adjusting the test statistic functions [113, 226].

3.6.2 Two paradigms. Both frequentist and Bayesian paradigms have been explored to test the significance of a cluster. The frequentist approach is used for both spatial and spatiotemporal hotspot detection. While the Bayesian variation was originally proposed and mainly used for the latter (e.g., emerging hotspot detection using time-series data over space), it can still be used whenever prior information is available. Table 4 shows a summary of the two paradigms:
Frequentist. Modeling: fixed locations with random case/control labels [64, 103, 147, 212] or fixed locations with random values [89, 90, 105, 208]; scale-independent [103, 157, 195], scale-dependent [175, 205, 214], or unified [216] testing; Bayesian models as sub-models [50, 124, 227]; nonparametric empirical p-values [36, 143]. Computation: Monte-Carlo simulation accelerated by bounds [64, 147, 212], shared computation [157, 200], reduction [216], etc.; test statistic distribution approximation [39, 40, 75].
Bayesian. Modeling: direct priors [141, 150] or priors as parametric distributions [120, 140]; direct posterior probabilities as results [133, 141, 149, 150] or with added tests (e.g., empirical) [137, 180]; rank-based [169, 170]. Computation: fast subset sums [138, 144, 180]; conjugate priors [133].

Frequentist Approach. Frequentist-based testing is the classic approach in this field and still dominates in related research and applications, partially because of tradition and the popularity of the SaTScan software. The frequentist view considers the data as an aggregation of the past, so null hypotheses based on it typically assume an entire dataset is generated by a completely random process (e.g., a homogeneous Poisson process). The exact definitions of the hypotheses differ by the statistical models discussed in Sec. 3.4.1. Fig. 4 shows several examples of random point distributions following different point processes in the Euclidean space. Fig. 4 (a1) and (a2) show example datasets generated by Bernoulli-based spatial point processes (SPP) under H_1 and H_0, respectively, where the black and orange points are control and case points, respectively. Under the null hypothesis H_0, each control point has an identical probability of being a case point (e.g., a COVID-19 case). In contrast, H_1 in (a1) has a higher rate inside hotspot regions and a lower rate outside. Fig. 4 (b1) and (b2) are example point distributions from continuous Poisson point processes. Here H_0 assumes a homogeneous underlying probability density, which is also known as complete spatial randomness. H_1 contains a region with a higher probability density of generating points (i.e., a hotspot). In Fig. 4 (c2), H_0 of a normal-model-based point process assumes the values observed at all data points (e.g., PM-2.5 readings on air quality sensors) are generated by the same distribution. Instead, a hotspot under H_1 has a higher mean (of the normal model), as shown in Fig. 4 (c1). Finally, Fig. 4 (d1) and (d2) show example distributions from a multinomial-model-based point process, where H_1 in (d1) contains a region with significantly higher diversification of point types (e.g., biodiversity) and (d2) contains a random redistribution of the point types over space. Fig. 5 further shows examples of Bernoulli-based and Poisson point processes in a network space with linear hotspots. Based on the hypotheses, the test produces a p-value for the optimal cluster Z* in the observed data D: p = Pr( F(Z_0*, D_0) ≥ F(Z*, D) ) under H_0, where F is the test statistic function and Z_0* is the optimal cluster in a dataset D_0 following H_0.
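The Bernoulli-based null hypothesis above ("each control point has an identical probability of being a case point") translates directly into a simple randomization scheme; the sketch below (with made-up coordinates and case counts) keeps all locations fixed and shuffles the case labels to produce one H_0 replicate of the kind used in the Monte Carlo testing described next.

```python
import random

random.seed(1)

# Hypothetical study data: locations with binary labels (True = case, False = control).
locations = [(random.random(), random.random()) for _ in range(500)]
is_case = [i < 60 for i in range(500)]  # 60 observed cases

def bernoulli_h0_replicate(locations, is_case):
    """One Monte Carlo replicate under the Bernoulli null hypothesis:
    keep all locations fixed and randomly re-assign which of them are cases,
    so every location has an identical probability of being a case."""
    labels = list(is_case)
    random.shuffle(labels)
    return list(zip(locations, labels))

replicate = bernoulli_h0_replicate(locations, is_case)
print(sum(lab for _, lab in replicate), "cases after relabeling (total unchanged)")
```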
Given the combined complexity of point processes, test statistics, enumeration spaces and maximization algorithms, the distribution of F(Z_0*, D_0) is in general unknown in closed form, except in over-simplified special cases that are rarely useful in practice [147]. Thus, two lines of strategies have been developed to estimate the p-value. The most common strategy is to approximate the distribution via Monte Carlo simulation, i.e., generating a large number of randomized datasets (e.g., 999 or 9999) following H_0 and running the same maximization algorithm on all of them to simulate the distribution of F(Z_0*, D_0) (e.g., [103, 175, 189, 205]). Fig. 4 (bottom row) shows randomization examples generated under H_0, where each can be considered the simulated dataset of one Monte Carlo trial. Note that the spatial distribution of control points or base locations can either remain the same or be redistributed for Bernoulli, normal and multinomial point processes [103, 105, 147, 208]. In contrast, the control for the continuous Poisson and many other processes is assumed to be a continuous surface, so the case point locations need to be redistributed in each trial. This Monte Carlo approach is broadly applicable as long as H_0 is well defined, at the price of greatly increased computational cost. In addition to the computational enhancements discussed in Sec. 3.5.3, which can be used as efficient sub-routines in Monte Carlo trials, a variety of dedicated algorithms have been developed to reduce the simulation cost (e.g., [64, 147, 151, 197]). One convenient characteristic leveraged by many approaches is that we do not need to know the exact maximum value F(Z_0*, D_0) for each random dataset D_0; instead, to calculate the p-value, all we need to know is whether F(Z*, D) ≤ F(Z_0*, D_0) in each trial. Using this characteristic, the maximum score from the observed data, F(Z*, D), is directly used as the bound in branch-and-bound [151] and filter-and-refine algorithms [64, 116, 157, 197, 200] to aggressively narrow search spaces. In addition, once a candidate is found to have a greater score than F(Z*, D), the rest of the trial can be skipped. This also allows the use of non-exact but low-cost data structures during the simulation. As an example, a discrete-DBSCAN was used to replace the original DBSCAN in Monte Carlo trials to pre-estimate upper and lower bounds of F(Z_0*, D_0) [212, 215]; the original DBSCAN is re-applied only for the subset of trials where the bounds are insufficient to deduce the relationship. Another widely adopted trick is early termination [4]. As its name suggests, this method terminates the whole Monte Carlo simulation once the results from the completed trials are sufficient to conclude the significance. This is extremely efficient if a detection is insignificant; for example, at a significance level α = 0.01, the minimum number of trials needed to accept H_0 is 9 out of 999. It was also shown in [215] that early termination is highly efficient for general hotspot detection formulations. Shared computation has also been used for tasks that remain stationary across Monte Carlo trials. For example, many expensive operations for candidate enumeration and maximization (Sec. 3.5.3) only need to be pre-computed once, such as the sorting of points [88, 89, 105, 208], and the shortest-path trees [157, 197] or fragmentation-multi-graph trees [200] constructed for the underlying road networks.
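A minimal sketch of Monte-Carlo p-value estimation with early termination is given below; the null-score generator is a placeholder for the full detection pipeline, and significance is assumed to require p < α (conventions differ, so the 9-out-of-999 stopping count should be treated as illustrative).

```python
import math
import random

random.seed(2)

def max_score_under_h0():
    """Placeholder for running the full detection pipeline on one H_0-randomized dataset."""
    return max(random.gauss(0, 1) for _ in range(100))

def monte_carlo_p_value(observed_max, alpha=0.01, max_trials=999):
    """Estimate p = Pr( F(Z_0*, D_0) >= F(Z*, D) ) by Monte Carlo simulation, with early
    termination: once enough H_0 replicates beat the observed maximum, the p-value can no
    longer fall below alpha, so the remaining trials are skipped."""
    exceed_needed = math.ceil(alpha * (max_trials + 1)) - 1  # 9 for alpha=0.01, 999 trials
    exceed = 0
    for t in range(1, max_trials + 1):
        if max_score_under_h0() >= observed_max:
            exceed += 1
            if exceed >= exceed_needed:
                return None, False   # cannot be significant at alpha; exact p not computed
    p = (exceed + 1) / (max_trials + 1)
    return p, p < alpha

print(monte_carlo_p_value(observed_max=3.5))
```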
The possibility of using only a small set of samples from the entire dataset to construct upper bounds of p-values has also been explored. Specifically, a reduction algorithm was developed in [216]. It shows that the original significance-enforced clustering problem can be mapped to a series of reduced problems on fewer points, where the p-values of the reduced clusters in these problems can be used to upper-bound the p-values of clusters in the original problem. Moreover, in scenarios where significance testing needs to be performed simultaneously for a collection of clusters, a dual-convergence algorithm was used to gradually narrow the gap between the upper and lower bounds used in pruning as the simulation progresses [212, 215].

In addition to Monte Carlo simulation, another line of efforts has explored a variety of mathematical approximations of the distribution of F(Z_0*, D_0). Similar to many of the approximation algorithms summarized in Sec. 3.5.3, the approximation schemes here often aim to approximate the distribution asymptotically, i.e., narrowing the error bound as the number of samples grows to infinity. Related work started with one-dimensional data [74, 132] or fixed cluster sizes (e.g., equal-size 1D or 2D windows [38, 39]) due to the complexity of the analysis, and was later extended to two- and multi-dimensional spaces for several types of test statistics (e.g., the maximum normalized scan score) [40, 76, 77]. Strategies have also been explored to select the best-performing parameters in the approximation models [39], and many approximations have demonstrated good performance through comparisons with simulated results [39, 76]. As discussed in Sec. 1, a detailed summary of these mathematical approximations is provided in [75]. Compared to Monte-Carlo based techniques, accurate closed-form approximations of test statistic distributions may greatly reduce the computational overhead. On the other hand, Monte Carlo simulation tends to be more stable (e.g., fewer assumptions) and more flexible (e.g., generally applicable to hotspot detection problems). Some efforts have also tried to combine Monte Carlo simulation with mathematical approximation as a middle ground between the two types of approaches [4]. Finally, as discussed in Sec. 3.4.3, p-values are commonly calculated in a scale-independent manner (i.e., the same critical value for all hotspot sizes), but scale-aware methods tend to be more robust [175, 205, 216].

Bayesian Approach: Bayesian testing schemes were developed to better model uncertainty during hotspot detection by incorporating informative prior information (e.g., the expected number of observations in a spatial region, shape regularity, and size) when available. As explained in [140], prior information may also be added into frequentist approaches as constraints, weights or penalty functions in test statistics [30, 173, 191], but the Bayesian approach offers more flexibility (e.g., priors as continuous distributions). Moreover, a fundamental difference between the two frameworks is that the frequentist approach tests the significance of clusters by checking whether H_0 can be rejected, whereas the Bayesian method (e.g., [133]) compares the posterior probabilities of the null and a set of different alternative hypotheses.
In the multivariate Bayesian scan statistic [150], for example, each alternative hypothesis H_1(S, E) is defined by a pair of a spatial footprint S (i.e., a candidate region) and an event type E (e.g., an increase in pharmacy visits may be caused by a disease outbreak, promotional sales of OTC medicines, etc.). A posterior probability is calculated for each hypothesis as P(H_1(S, E) | D) = P(D | H_1(S, E)) P(H_1(S, E)) / P(D), where D is the observed data, H_0 is the null hypothesis and H_1(S, E) is the alternative hypothesis for a region S and event type E (there is typically a large number of (S, E) pairs). Informative prior distributions and their parameters over space are the most critical components for calculating the large number of posteriors in Bayesian scan statistics [133, 141, 149, 150]. Uninformative priors may reduce the Bayesian approach to certain versions of frequentist approaches, such as the spatial scan statistic where candidates are ranked by their likelihood ratios [140] (many other types of test statistic functions exist, as summarized in Sec. 3.4.2). Original forms of Bayesian scan statistics [141, 150] estimate the priors by directly learning their distributions (e.g., a prior conditional probability of an event occurring in each region) from past data within a relevant temporal window. Due to the large number of candidate regions (Sec. 3.5.1), this nonparametric learning requires a large amount of training data to work well. To mitigate this issue, some of the prior distributions are further abstracted into parametric distributions governed by a small number of parameters [120], which can be learned well with a smaller amount of training data. The downside is that the training is done through a computationally expensive expectation-maximization process [140]. In addition, the Bayesian approach normalizes the individual posteriors by the overall probability of the data, P(D) = P(D | H_0) P(H_0) + Σ over (S, E) of P(D | H_1(S, E)) P(H_1(S, E)), which means all combinations of candidate regions and events need to be enumerated to calculate the summation. As a result, this method is constrained to simple regular shapes (e.g., circles, rectangles). A remedy is the fast subset sums approach [144, 180], which is an alternative to the above region-based Bayesian scan statistics. This method focuses on individual locations and subsets of locations rather than contiguous regions. It was found in [137] that the computationally expensive terms in estimating the final posterior probability of each individual location, such as the sums of posteriors over all location subsets that contain the location, can be exactly computed without enumerating all the subsets. The result is a map of posterior probabilities over individual locations. To improve the spatial proximity of locations with high posterior probabilities, a generalized fast subset sums method was developed [180], in which the proximity and regularity can be controlled by a new sparsity parameter. Another computational enhancement used in Bayesian approaches is the construction of a conjugate prior [133]; otherwise, expensive simulations such as Markov Chain Monte Carlo (MCMC) are needed to calculate the posterior. The randomization test summarized under the frequentist approach is not used here for significance testing. Instead, the posterior probabilities (e.g., a regular-shaped cluster with the highest posterior for H_1, or a map of posterior probabilities) are directly given to users for decision making.
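Following the posterior structure just described, the sketch below normalizes posteriors over a handful of (region, event) hypotheses; the marginal likelihoods and priors are hard-coded, made-up numbers standing in for the models that would be learned from historical data, and the hypothesis names are purely illustrative.

```python
# Hypothetical marginal likelihoods P(D | H) and priors P(H) for the null hypothesis and a
# few (region, event) alternatives; in a multivariate Bayesian scan statistic these come
# from models learned on historical data rather than being hard-coded.
likelihood = {
    ("H0", None): 1.0e-12,
    ("S1", "outbreak"): 6.0e-12,
    ("S1", "promotion"): 2.0e-12,
    ("S2", "outbreak"): 0.8e-12,
}
prior = {
    ("H0", None): 0.95,
    ("S1", "outbreak"): 0.02,
    ("S1", "promotion"): 0.02,
    ("S2", "outbreak"): 0.01,
}

# P(D) = sum over all hypotheses of P(D | H) P(H); this normalization term is what makes
# the region-based approach costly and motivates the fast subset sums alternative.
p_data = sum(likelihood[h] * prior[h] for h in likelihood)

posterior = {h: likelihood[h] * prior[h] / p_data for h in likelihood}
for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(h, round(p, 4))
```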
To enforce a false positive rate more explicitly, historical and synthetic data were used in [137, 180] to select a proper threshold on the posterior probabilities or their sums, where the number of false positives under different thresholds can be empirically evaluated. Many variations have also been developed. A Bayesian network model was developed to combine the spatially-based multivariate Bayesian scan statistic [141] and the population-based PANDA [43] to better capture details of individual data with spatial considerations [96]. A temporal extension was later proposed [95], where the increase of density is modeled using a linear function. A rank-based spatial clustering algorithm uses the Bayesian scan statistic [169, 170] to compute posterior probabilities at individual locations, which are then used to search for clusters. A dedicated Bernoulli version of the Bayesian scan statistic was developed to improve performance on binary labeled points (e.g., a crime case or not, which is not aggregated) [171]. Integrated Bayesian-frequentist approaches have also been explored. A grid-based Bayesian variable window approach first uses the Bayesian model to define the test statistic and then employs a frequentist randomization test to perform significance testing [227]. Similarly, methods such as anomalous group detection [50] and fast generalized subset scan [124] use a Bayesian network to evaluate and select the best candidates, and randomization tests to compute critical values.

One additional benefit of this statistically-robust clustering formulation is that ground truth hotspots are mathematically well-defined using point-process-based hypotheses (Sec. 3.6). This allows rigorous synthetic data generation, where true hotspots with varying properties can be inserted into point processes to generate validation datasets (e.g., Fig. 4 and 5). Such datasets have been widely used to evaluate results from different clustering approaches [6, 60, 87, 108, 130, 135, 173] under a variety of assumptions (e.g., data size, effect size, spatial scale). Typical metrics used in the literature include detection power (i.e., 1 - β, where β is the Type-II error) [60, 108, 135]; the rate of spurious results (i.e., Type-I error) or the total number of spurious clusters detected [150, 216]; precision, recall and F1-scores [208, 214, 216]; and the number of days in advance in detecting an emerging hotspot in the spatio-temporal case [96, 141, 142, 150, 180]. While counting the number of spurious detections in H_0 data is straightforward, measuring whether a true hotspot is successfully detected can be tricky, and different strategies have been used. For tessellated data (e.g., grid or polygon inputs), success is often measured by the overlap ratio (i.e., intersection over union, IoU) between the true pattern and the detection [96, 180, 192]. This can also be generalized to the IoU between point sets [212]. At the object level, IoU may be less robust; for example, the IoU between two circles decreases quickly even with small differences in center locations and radii [216]. Thus, separate criteria on center accuracy, which may be important in applications (e.g., sources of infection or pollution [103]; locations of serial criminals [64]), and on size similarity have been used as replacements [208, 216]. To make the validation process more holistic, incompatible assumptions have been used to generate synthetic data and test the sensitivity of techniques (e.g., regular vs. irregular shapes [45, 60, 201], normal vs.
non-normal processes for continuous-value based clustering [88, 89, 105]). In addition to quantitative evaluation, qualitative evaluation (e.g., real or simulated case studies, visual inspection) is also commonly employed [60, 63, 157, 163, 197, 212, 215]. In general, many empirical results (e.g., [147, 175, 205, 214, 216]) have shown that spatial hotspots are easier to detect with larger data sizes, effect sizes and spatial footprints; a smaller number of true hotspots; less uncertainty (e.g., due to aggregation or positional inaccuracy); regular shapes; etc. Related theoretical results have also been developed (e.g., on the detectability of a clustering signal) [34, 167, 205]. Validation using real-world datasets (e.g., crime cases, traffic accidents, COVID-19 symptoms) often requires deep involvement of domain experts. Many studies have used combinations of real-world base data (e.g., spatial domains such as city boundaries, county maps and road networks [60, 87, 103, 108, 157, 197, 200]; control data such as population or traffic volume [116, 141, 147]) and synthetic event data (e.g., disease cases) in evaluation. Moreover, to further improve synthetic data's representativeness of real-world phenomena, studies have gone beyond simple statistical assumptions to generate more realistic distributions (e.g., using imitation learning), such as synthetic populations and contact networks [23, 82, 228], human mobility maps [18], trajectories [160], etc. These strategies for realistic spatial data generation can help better estimate the performance of techniques in real scenarios. Finally, we list examples of open datasets across various domains that can be leveraged for evaluation and comparison in Table 5:
Synthetic data with known ground truth: [109, 201]; arbitrarily-shaped (with customizable generation code) [212, 215], etc.
Real data with known ground truth: Battle of the water sensor networks [25-27, 37, 159]; New York Bronx Legionnaires' disease outbreak [63]; disease infection at airports (social network data) [188], etc.
Real data without known ground truth*: New York cancer incidence [5, 24, 79], New York birth defects [5, 186], Sudden Infant Death Syndrome (SIDS) occurrences in North Carolina [47, 128]; Metro Transit bus trajectories with engine measurements [116]; Los Angeles highway traffic [26, 27]; Orlando pedestrian fatalities [197], etc.
* When ground truth is unknown, the score of the best significant cluster and the computational time are often used to compare methods, and domain knowledge is used to identify explanations and surprises.

There are many opportunities lying ahead to: (1) further enhance existing hotspot detection techniques, (2) incorporate statistical rigor into general clustering methods, (3) harness emerging problems and new forms of spatial data, and (4) explore synergistic integration with learning for mutual benefits.

As discussed in Sec. 3.4.2, the design of a test statistic is a critical component that directly determines the solution quality. Moreover, the design is non-trivial, as localized test statistics (e.g., likelihood ratios [64, 103, 146], density ratios [157, 185, 197, 200], entropy measures [208]) can easily be biased when used to compare and rank candidates in hotspot detection, especially across scales [34, 173, 175, 205].
While approaches have been developed to correct the bias and restore the optimal detection power (e.g., [34, 173, 175, 205]), most of them require sophisticated analysis and have only been explored for special cases of hotspot mapping (e.g., Gaussian white noise, one-dimensional intervals, rectangular regions). This limits the methods' adoption in the general scenarios summarized in Sec. 3, despite their importance. An interesting direction would be to explicitly explore the trade-off between flexibility and optimality, and to develop more generally applicable frameworks that may not necessarily fully restore the optimal power. In addition, recent methods that try to correct the bias by incorporating spatial-nondeterminism-awareness are less dependent on the specification of problems [208, 214, 216], but they introduce additional computational complexity and require more efficient algorithms and data structures. Scalability-wise, existing algorithmic building blocks (Sec. 3.5.3) have not yet considered applications in the context of spatial big data [67, 166, 222]. Enabling frameworks for large-scale applications need to be developed to model the additional complexity (e.g., I/O, network communication, indexing) of distributed environments (e.g., Apache Sedona [222]).

The trade-off between detection power and robustness against spurious patterns has also not been sufficiently explored. While statistical rigor is certainly important to suppress false alarms, especially in real-world applications where resources are limited, missing a true hotspot may turn out to be more costly in certain scenarios (e.g., the spread of COVID-19). Thus, to better inform decision making, an integration of the two indicators may be needed [141, 216]. This integration can be non-trivial, as the relative weights of detection power and false positive rate may be nonstationary across different stages of an event (e.g., a disease spread) and can be affected by additional dynamic factors (e.g., cost of remedy, impact, policies, forecasts, geographical heterogeneity, local demographics). Consideration of these complex factors in hotspot recommendation may require measures beyond a single test statistic and simple criteria based on significance levels. Such new modeling will also require new computational frameworks. Moreover, the vast majority of techniques use the simplifying assumption that there is a single hotspot in the input data under H_1 [113, 140]. Although this does not necessarily limit the use of the approaches in detecting an arbitrary number of clusters [4, 64, 208, 216] (e.g., by performing rounds of detection until the best candidate is no longer significant, as discussed in Sec. 3.6.1), it may cause the methods to work sub-optimally due to shadowing effects between multiple hotspots [226]. For example, the existence of other hotspots increases the baseline score outside one true hotspot, making it harder to confirm its significance. While several methods have been developed to mitigate this issue [113, 226], they are again limited to special cases and have not been widely adopted (e.g., not explored for Bayesian modeling [140]). There are opportunities to consider this issue more broadly in hotspot detection and to develop general frameworks to address it. The uncertainty in input datasets has also not been well modeled in general, except in a few efforts (e.g., [25, 91, 156]).
Given that existing datasets in many problems (e.g., disease surveillance) are often provided at an aggregated level (e.g., by county or zip code) due to privacy concerns, such positional uncertainty of individual points can substantially affect solution quality [52, 118, 121]. Explicit techniques (e.g., Dempster-Shafer based modeling) are needed in future research to improve the robustness of the techniques in the face of uncertainty.

Expanding to broad problems and emerging spatial data. A major advantage of the wide variety of techniques summarized in this paper is their statistical robustness, which allows explicit control of the rate of spurious detections in the output. This modeling has also been used recently to improve the statistical robustness and automate the parameter selection of general clustering techniques. Significant DBSCAN [212, 215], for example, can effectively filter out a large number of spurious clusters generated by DBSCAN in data containing heavy noise and clusters of varying densities. The incorporation of statistical modeling can also greatly reduce the effort of parameter selection. Similarly, a significant spatial data partitioning approach based on non-parallel split lines has been developed [72]. However, in general, statistically robust formulations have not been explored for most general clustering and data partitioning approaches (e.g., HDBSCAN [29, 153], spectral clustering [203], deep clustering [126], and many others [218]). Many opportunities lie ahead to incorporate statistical rigor into these broad techniques. In addition, there are many newly emerging spatial data types and problems where hotspot detection techniques may be valuable. As an example, location-enriched social network data (e.g., geo-tagged tweets) open up opportunities for monitoring disease spread and many other events (e.g., disasters) using novel formulations of hotspot detection [188, 189]. In the context of COVID-19, business points-of-interest data and their visits (e.g., SafeGraph [3]) can also be leveraged in hotspot detection to generate timely warnings of high-risk regions and inform policy making. The use of these new types of datasets in hotspot detection is also challenging due to well-known issues including data quality (e.g., positional accuracy and uncertainty [52, 118]), nonstationarity (e.g., algorithm dynamics, which are considered major factors behind the erroneous predictions of Google Flu Trends in 2013 [111]) and representativeness (e.g., selection bias [101]). Many new types of data are also available in the network space. First, traditional GPS trajectories are currently underutilized in hotspot detection despite their potential importance (e.g., for transportation safety, environment, smart cities, public health) [15]. Second, attribute-enriched trajectories, including on-board-diagnostic data (i.e., vehicle trajectories with hundreds of real-time engine measurements [114, 115]) and connected vehicle data [12], provide many new possibilities for novel hotspot detection formulations (e.g., routes with abnormally high emissions or energy consumption [116]). Also, many events in urban areas happen and propagate through road networks rather than in Euclidean space, so current research, which is mostly focused on the Euclidean space, may be expanded to the network space to take full advantage of new high-resolution datasets.
Advancements in these directions may require new statistical formulations (e.g., physics-aware point processes) and computational frameworks. New opportunities also exist to go beyond existing hotspot formulations to address emerging societal problems. For example, the lack of diversity in many domains has raised major concerns in recent years about the resilience of our communities (e.g., low crop diversity in agriculture poses a high risk to global food security [165]). The research landscape in spatial hotspot detection can be broadened to help address these critical issues faced by our society [208].

One fundamental property of spatial data is spatial heterogeneity, i.e., data distributions are nonstationary over space, violating the common identical-distribution assumption of machine learning methods. Recently, the incorporation of statistically-robust formulations (e.g., multivariate scan statistics) into deep learning models has shown promising results in tackling spatial data heterogeneity and substantially improving model performance [211, 213]. On the other hand, statistically robust clustering techniques can also leverage advances from related communities to improve their capacity. For example, the growing maturity of deep learning [112], especially in computer vision and natural language processing, may provide a useful way to further enrich the variety of datasets that can be used for hotspot detection. While it can be risky to directly use deep learning to find hotspots due to concerns such as non-interpretability and overfitting [31], these models can help preprocess large volumes of visual and textual information from remote sensing data (satellite or aerial imagery, LiDAR point clouds) [94, 209], social platforms (e.g., Twitter, Flickr), traffic cameras, connected vehicles, etc. to generate valuable new data sources for hotspot detection, which would otherwise be tedious, time-consuming and not scalable. Moreover, learning approaches (e.g., generative adversarial networks) may be leveraged to generate more realistic simulation data under various scenarios to improve evaluation [18, 160, 228]. From a computational perspective, data-driven approaches may also be used to reduce the computational load and speed up the enumeration process [119] in scenarios where computing resources are limited on the application side or real-time performance (e.g., interaction) is needed. Machine learning methods, for example, may be used to heuristically prune low-potential enumeration spaces or recommend a list of high-potential regions for the search to prioritize [17, 224]. Other settings such as reinforcement learning can also be helpful in this scenario. To reduce the need for training data (e.g., in Bayesian hotspot detection techniques [141, 150]), novel transfer learning methods [196] can be explored to reuse relevant parameters based on the similarity of domain distributions across spatial locations, time periods, data or problems.

In this article, we provided an overview of statistically robust clustering techniques for spatial hotspot mapping. Given the wide application of hotspot mapping techniques in societal applications and their potential importance (e.g., infectious disease surveillance in COVID-19 scenarios), the goal is to present a taxonomy of the vast literature across the key sub-fields (e.g., statistical modeling, maximization algorithms, significance testing) to trigger more thoughts from the broader computing communities.
On the other hand, the statistically robust formulations may provide new paths for improving general clustering (e.g., removing spurious clusters) and learning techniques (e.g., handling heterogeneity); similarly, the validation strategies described (e.g., statistical-definition-based ground-truth generation) can potentially be used to expand existing evaluation methods for general clustering. To serve these purposes, we selected the terms clustering and hotspot mainly due to their common use and adoption in the computing community. We hope that the taxonomies and summaries of techniques provided in this article can serve as a stepping stone for generating new ideas and approaches (e.g., the future directions discussed in Sec. 4) in computing research, and help practitioners select appropriate paradigms based on their data and application needs.

There exists a variety of point processes (Table 6) that can be used to model point distributions, and the choice often depends on specific application needs or assumptions. For data with time-series information or unknown statistical distributions, alternative formulations with Bayesian and nonparametric models can be leveraged. The selection of test statistics requires more consideration. Specifically, scale sensitivity or bias (e.g., in density or likelihood ratio based formulations) should be explicitly considered when clusters of different sizes are being compared. Many correction methods (e.g., penalized or average likelihood ratios [34, 175, 205]) have been proposed for certain test statistics and formulations (e.g., rectangular-shaped clusters, one-dimensional space). For more general formulations, flexible frameworks (e.g., dual-level Monte Carlo simulation [208]) can be employed at additional computational cost.

A.2 Summary on region enumeration and maximization. Both regular and irregular shape based formulations are common in the literature. As regular-shaped clusters can be well parameterized (e.g., two-point circles), their enumeration algorithms often use a top-down search strategy where candidates described by different parameters are directly enumerated. As a result, this paradigm is more robust when low-density gaps exist in true clusters, especially when the number of samples is limited. In contrast, irregular-shaped formulations often rely on a bottom-up approach using local search criteria; otherwise the enumeration space is often intractable. That said, fast linear algorithms have been proposed if the spatial footprints of clusters are allowed to be non-contiguous (e.g., with the LTSS property [138, 191]). In terms of maximization algorithms, approximation approaches tend to achieve lower worst-case complexity than exact algorithms. On the other hand, such results so far exist only for a limited set of hotspot detection problems, and often require substantially more effort to migrate to other variations (e.g., arbitrarily shaped hotspots; evolving hotspots; hotspots based on trajectory data or other point processes; etc.). In contrast, the general computational paradigms (e.g., filter-and-refine, pruning) used for exact acceleration are easier to extend to different scenarios. Finally, heuristic algorithmic designs can be combined with both to further reduce the computational cost.

The comparison between frequentist and Bayesian approaches for hypothesis testing in statistically robust clustering inherits their general philosophical differences (e.g., conclusions vs.
probabilities for hypotheses, no prior vs. prior, values vs. distributions for parameters, etc.). In practice, the choice often depends on whether there exists a reasonable or agreeable prior distribution on the parameters. If such a prior exists, then the Bayesian paradigm may offer more flexibility, such as probabilities for various hypotheses; otherwise, the inclusion of a prior may be considered subjective. In terms of computation, algorithms for the frequentist paradigm often require randomization tests with Monte-Carlo simulation, except in special cases (e.g., certain choices of test statistics, cluster shapes, etc.) where approximations are available. Similarly, general Bayesian formulations may require expensive simulation (e.g., Markov Chain Monte Carlo) unless suitable conjugate priors can be identified to directly estimate the posterior probability.

REFERENCES
National Cancer Institute SaTScan datasets
A simulation study of three methods for detecting disease clusters
Spatial scan statistics: approximations and performance study
The hunting of the bump: on maximizing statistical discrepancy
Automatic subspace clustering of high dimensional data for data mining applications
Graph based anomaly detection and description: a survey
Connected subgraph detection with mirror descent on SDPs
Future connected vehicles: challenges and opportunities for spatio-temporal computing
A geospatial analysis of shark attack rates for the coast of California: 1994-2010
OPTICS: ordering points to identify the clustering structure
New Frontiers for Scan Statistics: Network, Trajectory, and Text Data
Spatio-temporal data mining: A survey of problems and methods
Learning to branch
Covid-gan: Estimating human mobility responses to covid-19 pandemic through spatio-temporal conditional generative adversarial networks
Spatiotemporal dynamics of the Southern California Asian citrus psyllid (Diaphorina citri) invasion
Support vector clustering
Goodness-of-fit test statistics that dominate the Kolmogorov statistics. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete
A survey of clustering data mining techniques
EpiFast: a fast algorithm for large scale realistic epidemic simulations on distributed memory systems
Public domain small-area cancer incidence data for New York State
Graph scan statistics with uncertainty
Near-optimal and practical algorithms for graph scan statistics
Near-optimal and practical algorithms for graph scan statistics with connectivity constraints
Discovery of under immunized spatial clusters using network scan statistics
Hierarchical density estimates for data clustering, visualization, and outlier detection
Penalized likelihood and multi-objective spatial scans for the detection and inference of irregular clusters
Can we open the black box of AI?
A new method for unveiling open clusters in Gaia-New nearby open clusters confirmed by DR2
Cannabis cultivation in Quebec: Between space-time hotspots and coldspots
Detection with the scan and the average likelihood ratio
Anomaly detection: A survey
Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs
A Generalized Matching Pursuit Approach for Graph-Structured Sparsity
Two-dimensional discrete scan statistics. Stats. & probability lett.
Approximations for a conditional two-dimensional scan statistic. Statistics & probability letters
Approximations for two-dimensional variable window scan statistics
Geovisual analytics to enhance spatial scan statistic interpretation: an analysis of US cervical cancer mortality
Using the SaTScan method to detect local malaria clusters for guiding malaria control programmes
A Bayesian algorithm for detecting CDC Category A outbreak diseases from emergency department chief complaints
Spatial analysis of COVID-19 clusters and contextual factors in New York City
Constrained spanning tree algorithms for irregularly-shaped spatial clustering
Applications of spatial scan statistics: a review
Spatial modeling of regional variables
Vector transmission alone fails to explain the potato yellow vein virus epidemic among potato crops in Colombia
A Mann-Whitney scan statistic for continuous data
Detecting anomalous groups in categorical datasets
Detection of patterns in water distribution pipe breakage using spatial scan statistics for point events in a physical network
Visualizing the impact of space-time uncertainties on dengue fever patterns
Arbitrarily shaped multiple spatial cluster detection for case event data
Automated monitoring of clusters of falls associated with severe winter weather using the BioSense system
Evaluating the Evaluation Metrics for Spatial Disease Cluster Detection Algorithms
Special invited paper: Higher criticism for large-scale inference, especially for rare and weak effects
A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters
A genetic algorithm for irregularly shaped spatial scan statistics
Extensions of the Scan Statistic for the Detection and Inference of Spatial Clusters
Evaluation of spatial scan statistics for irregularly shaped clusters
Multiscale testing of qualitative hypotheses
Mining network hotspots with holes: A summary of results
Ring-shaped hotspot detection
Ring-shaped hotspot detection: a summary of results
Geographically robust hotspot detection: a summary of results
A density-based algorithm for discovering clusters in large spatial databases with noise
Enabling spatial big data via CyberGIS: Challenges and opportunities
Applying hotspot detection methods in forestry: a case study of chestnut oak regeneration
A weighted average likelihood ratio test for spatial clustering of disease
Likelihood-based tests for localized spatial clustering of disease
Early detection of terrorism outbreaks using prospective space-time scan statistics
Oblique decision trees for spatial pattern detection: optimal algorithm and application to malaria risk
Handbook of scan statistics
Tight bounds and approximations for scan statistic probabilities for discrete data
Scan statistics: Methods and applications
Multiple window discrete scan statistics
Maximum scan score-type statistics
Daily reportable disease spatiotemporal cluster detection
Local fuzzy geographically weighted clustering: a new method for geodemographic segmentation
CURE: An efficient clustering algorithm for large databases
Socioeconomic disparities and air pollution exposure: a global review
Creating realistic synthetic populations at varying spatial scales: A comparative critique of population synthesis techniques
Beyond postsuburbia? Multifunctional service agglomeration in Vienna's urban fringe. Tijdschrift voor economische en sociale geografie
Childhood leukaemia in Sweden: using GIS and a spatial scan statistic for cluster detection
Daily surveillance of COVID-19 using the prospective space-time scan statistic in the United States
A spatial scan statistic for survival data
Evaluating spatial methods for investigating global clustering and cluster detection of cancer cases
Covariate adjusted weighted normal spatial scan statistics with applications to study geographic clustering of obesity and lung cancer mortality in the United States
Weighted normal spatial scan statistic for heterogeneous population data
Weighted normal spatial scan statistic for heterogeneous population data
Detecting regions of disequilibrium in taxi services under uncertainty
Ls 3: A linear semantic scan statistic technique for detecting anomalous windows
Random walks to identify anomalous free-form spatial scan windows
Incremental dual-memory lstm in land cover prediction
A Bayesian spatio-temporal method for disease outbreak detection
A Bayesian network model for spatial event surveillance
A nonparametric spatial scan statistic for continuous data
A spatial scan statistic for multinomial data
A spatial analysis of American police killed in the line of duty
Chameleon: Hierarchical clustering using dynamic modeling
Measuring mobility to monitor travel and physical distancing interventions: a common framework for mobile phone data analysis. The Lancet Digital Health
Microbial source tracking and spatial analysis of E. coli contaminated private well waters in southeastern Ontario
A spatial scan statistic
Spatial scan statistics: models, calculations, and applications
A scan statistic for continuous data based on the normal probability model
An elliptic spatial scan statistic
Multivariate scan statistics for disease surveillance
Power comparisons for disease clustering tests
Benchmark data and power calculations for evaluating disease outbreak detection methods
Size and duration of COVID-19 clusters go along with a high SARS-CoV-2 viral load: A spatio-temporal investigation in Vaud state
The parable of Google Flu: traps in big data analysis
Deep learning
A spatial scan statistic for multiple clusters
Physics-guided energy-efficient path selection using on-board diagnostics data
Physics-guided energy-efficient path selection: a summary of results
Significant lagrangian linear hotspot discovery
Investigation of geographic disparities of pre-diabetes and diabetes in Florida
Impact of spatial aggregation error on the spatial scan analysis: a case study of colorectal cancer
Surrogate-assisted evolutionary framework for data-driven dynamic optimization
Learning outbreak regions in bayesian spatial scan statistics
Inaccuracy, uncertainty and the space-time permutation scan statistic
Scalable spatial scan statistics through sampling
Scalable spatial scan statistics for trajectories
Fast generalized subset scan for anomalous pattern detection
Geographic data mining and knowledge discovery
A survey of clustering with deep learning: From the perspective of network architecture
Spatial clusters of violent deaths in a newly urbanized region of Brazil: highlighting the social disparities
A Fast Algorithm for Combinatorial Hotspot Mining Based on Spatial Scan Statistic
Hotspot detection with bivariate data
Hot spot or not: a comparison of spatial statistical methods to predict prospective malaria infections
Visualising crime clusters in a space-time cube: An exploratory data-analysis approach using space-time kernel density
estimation and scan statistics Some probabilities, expectations and variances for the size of largest clusters and smallest intervals A Bayesian spatial scan statistic Detection of spatial and spatio-temporal clusters An empirical comparison of spatial scan statistics for outbreak detection Expectation-based scan statistics for monitoring spatial time series data Fast Bayesian scan statistics for multivariate event detection and visualization Fast subset scan for spatial pattern detection Subset Scanning for Event and Pattern Detection Bayesian Scan Statistics A multivariate Bayesian scan statistic for early event detection and characterization Bayesian network scan statistics for multivariate pattern detection A nonparametric scan statistic for multivariate disease surveillance Generalized fast subset sums for Bayesian detection and visualization Fast subset scan for multivariate event detection A fast multi-resolution method for detection of significant spatial disease clusters Rapid detection of significant spatial clusters Methods for detecting spatial and spatiotemporal clusters. Handbook of biosurveillance A Bayesian scan statistic for spatial cluster detection A multivariate Bayesian scan statistic Detecting Significant Multidimensional Spatial Clusters Detection of emerging spacetime clusters Efficient computation of multiple density-based clustering hierarchies CLARANS: A method for clustering objects for spatial data mining Hotspots of vascular plant endemism in a global biodiversity hotspot in Southwest Asia suffer from significant conservation gaps Assessing the outline uncertainty of spatial disease clusters A kmain routes approach to spatial network activity summarization Cluster recognition in spatialtemporal sequences: the case of forest fires The battle of the water sensor networks (BWSN): A design challenge for engineers and algorithms 2020. 
xGAIL: Explainable Generative Adversarial Imitation Learning for Explainable Human Decision Analysis A survey on density based clustering algorithms for mining large spatial databases Spatially constrained clustering and upper level set scan hotspot detection in surveillance geoinformatics Upper level set scan statistic for detecting arbitrarily shaped hotspots Identification of "Hot Spots" of inner areas in Italy: Scan statistic for urban planning policies Declining biodiversity for food and agriculture needs urgent global action Parallel processing over spatial-temporal datasets from geo, bio, climate and social science communities: A research roadmap Multiscale scanning in inverse problems Connected sub-graph detection A multi-level spatial clustering algorithm for detection of disease outbreaks Rank-based spatial clustering: an algorithm for rapid outbreak detection A Bayesian approach to the Bernoulli spatial scan statistic Hot spots of perforated forest in the eastern United States Optimal detection of a jump in the intensity of a Poisson process or in a density with likelihood ratio statistics Event detection in activity networks The block criterion for multiscale inference about a density, with applications to other multiscale problems Applying niching genetic algorithms for multiple cluster discovery in spatial analysis A spatial cluster analysis of tractor overturns in Kentucky from Nonlinear component analysis as a kernel eigenvalue problem Conservation planning in agricultural landscapes: hotspots of conflict between agriculture and nature A generalized fast subset sums framework for bayesian event detection Detecting anomalous activity on networks with the graph Fourier scan statistic Changepoint detection over graphs with the spectral scan statistic Spatiotemporal data mining: A computational perspective Multivariate normal spatial scan statistic for detecting the most severe cluster of a disease Anomalous window discovery through scan statistics for linear intersecting paths (SSLIP) Confidence intervals for spatial scan statistic Power evaluation of disease clustering tests Detecting spatial clusters of disease infection risk using sparsely sampled social media mobility patterns Where did I get dengue? 
Detecting spatial clusters of infection risk with social network data Scalable detection of anomalous patterns with connectivity constraints Penalized fast subset scanning Dynamic pattern detection with temporal consistency and connectivity constraints Spatial analysis of pallid sturgeon Scaphirhynchus albus distribution in the Missouri River, South Dakota Spatio-temporal epidemiology of Tritrichomonas foetus infection in Texas bulls based on state-wide diagnostic laboratory data A flexibly shaped space-time scan statistic for disease outbreak detection and monitoring A survey on deep transfer learning Significant linear hotspot discovery Elliptical hotspot detection: a summary of results Detecting isodistance hotspots on spatial networks: A summary of results Linear Hotspot Discovery on All Simple Paths: A Summary of Results A flexibly shaped spatial scan statistic for detecting clusters A patterned and unpatterned minefield detection in cluttered environments using Markov marked point process A tutorial on spectral clustering A Bayesian model for cluster detection Optimal and fast detection of spatial clusters with scan statistics STING: A statistical information grid approach to spatial data mining Comparison of Poisson and Bernoulli spatial cluster analyses of pediatric injuries in a fire district Discovering Spatial Mixture Patterns of Interest A TIMBER framework for mining urban tree inventories using remote sensing datasets Transdisciplinary foundations of geospatial data science Rahul Ghosh, and Praveen Ravirathinam. 2021. A Statistically-Guided Deep Network Transformation and Moderation Framework for Data with Spatial Heterogeneity Significant DBSCAN+: Statistically Robust Density-based Clustering Rahul Ghosh, and Praveen Ravirathinam. 2021. A Self-Adaptive and Model-Agnostic Deep Learning Framework for Spatially Heterogeneous Datasets A Nondeterministic Normalization based Scan Statistic (NN-scan) towards Robust Hotspot Detection: A Summary of Results Significant DBSCAN towards Statistically Robust Clustering A unified framework for robust and efficient hotspot detection in smart cities Discovering Interesting Sub-Paths with Statistical Significance from Spatio-temporal Datasets A comprehensive survey of clustering algorithms Survey of clustering algorithms Investigation of multi-scale spatio-temporal pattern of oldest-old clusters in China on the basis of spatial scan statistics Anatomical organization of forward fiber projections from area TE to perirhinal neurons representing visual long-term memory in monkeys Spatial data management in apache spark: The geospark perspective and beyond Homicide as infectious disease: Using public health methods to investigate the diffusion of homicide Data-Driven Spatial Branch-And-Bound Algorithms For Black-Box Optimization BIRCH: an efficient data clustering method for very large databases Spatial scan statistics adjusted for multiple clusters Bayesian variable window scan statistics Simnest: Social media nested epidemic simulation via online semi-supervised deep learning Stochastic iterative hard thresholding for graph-structured sparsity optimization