key: cord-0516839-4jycd6c8 authors: Deepak, P; Sundaram, Sowmya S title: FiSH: Fair Spatial Hotspots date: 2021-06-01 journal: nan DOI: nan sha: 8d1957ecbf64cf44d71fcde7cb0a88d2e71806cd doc_id: 516839 cord_uid: 4jycd6c8 Pervasiveness of tracking devices and enhanced availability of spatially located data has deepened interest in using them for various policy interventions, through computational data analysis tasks such as spatial hot spot detection. In this paper, we consider, for the first time to our best knowledge, fairness in detecting spatial hot spots. We motivate the need for ensuring fairness through statistical parity over the collective population covered across chosen hot spots. We then characterize the task of identifying a diverse set of solutions in the noteworthiness-fairness trade-off spectrum, to empower the user to choose a trade-off justified by the policy domain. Being a novel task formulation, we also develop a suite of evaluation metrics for fair hot spots, motivated by the need to evaluate pertinent aspects of the task. We illustrate the computational infeasibility of identifying fair hot spots using naive and/or direct approaches and devise a method, codenamed {it FiSH}, for efficiently identifying high-quality, fair and diverse sets of spatial hot spots. FiSH traverses the tree-structured search space using heuristics that guide it towards identifying effective and fair sets of spatial hot spots. Through an extensive empirical analysis over a real-world dataset from the domain of human development, we illustrate that FiSH generates high-quality solutions at fast response times. With sensing and tracking devices such as mobile phones and IoT becoming pervasive in this web-driven era, there is an abundance of spatial data across real-world settings. Within such spatial datasets, it is often of interest to identify geographically localized groups of entities that are of sufficient size and express a distinctive character so strongly that it is unlikely to have occurred by chance. To illustrate an example from our times, COVID-19 contact tracing apps accumulate large amounts of spatial data of people, of which some are known to have a COVID-19 infection. It would be of interest to automatically identify localized regions of high COVID-19 incidence, referred to as hot spots in contemporary reporting 1 , so that the information could be channelized to health experts to identify causal reasons, or to public policy experts to develop a mitigation strategy for those regions. While COVID-19 hot spots are characterized by high disease incidence rates, other web and new age data scenarios may call for different formulations of hot spot character, viz., high crime rates (law enforcement), intense poverty (development studies), high mobile data usage (mobile network optimization) and so on. For example, Figure 1 illustrates hot spots of educational underachievement in India as identified from a human development dataset. In each case, identifying a set of hot spots would be of use so they could be subjected to an appropriate policy action. The unsupervised learning task of detecting spatial hot spots was pioneered by the spatial scan statistic (SSS) [21] . The spatial scan statistic and its variants within the SaTScan 2 toolkit have remained extremely Figure 1 : An illustration of hot spots of Low Educational Achievement in India popular for detecting spatial hot spots over the past two decades. While health and communicable diseases form the most popular application area of SSS (e.g., [30] ), they have been used within domains as diverse as archaeology [35] and urban planning [16] . In scenarios where spatial hot spots are to be used to inform government and public sector action, especially in sensitive policy domains (e.g., law enforcement [25] , development), it is often important to ensure that the collective population subject to the policy action is diverse in terms of protected attributes such as ethnicity, caste, religion, nationality or language, among others. Consider hot spot detection on a crime database to inform policy action that could include sanctioning higher levels of police patrols for those regions. This would likely lead to higher levels of stop and frisk checks in the identified hot spots, and would translate to heightened inconvenience to the population in the region. Against this backdrop, consider a sensitive attribute such as ethnicity. If the distribution of those who land up in crime hot spots is skewed towards particular ethnicities, say minorities as often happens [23] , it directly entails that they are subject to much more inconvenience than others. These, and analogous scenarios in various other sectors, provide a normatively compelling case to ensure that the inconvenience load stemming from crime hot spot detection (and other downstream processing) be proportionally distributed across ethnicities. The same kind of reasoning holds for groups defined over other sensitive attributes such as religion and nationality. It is also notable that ethnically skewed hot spot detection and patrolling would exacerbate the bias in future data. Minor crimes are recorded in the data only when committed as well as observed. Thus, majority and minority areas with similar real crime prevalance, alongside minority-oriented patrolling, would lead to data that records higher crime prevalance in the latter. Second, even in cases where the intended policy action is positive (e.g., setting up job support centres for unemployment hot spots), the policy being perceived as aligned to particular ethnicities could risk social solidarity and open avenues for populist backlash [15] , which could ultimately jeopardize the policy action itself. While considerations as above are most felt in policy domains such as policing and human development, these find expression in hot spot based prioritization in provisioning any common good. Ensuring fair distribution of the impact of any policy action, across sensitive attributes such as ethnicities, is aligned with the theory of luck egalitarianism [18] , one that suggests distributive shares (of inconvenience or benefits) be not influenced by arbitrary factors, especially those of 'brute luck' that manifest as membership in sensitive attribute groups such as ethnicity, religion and gender (since individuals do not choose those memberships are are often just born into one). Such notions have been interpreted as a need for orthogonality between groups in the output and groups defined on sensitive attributes, and has been embedded into machine learning algorithms through the formulation of statistical parity (e.g., [1] ). In summary, there is an compelling case, as in the case of other machine learning tasks, for hot spot detection to be tailored in a way that the population covered across the chosen hot spots be diverse along protected demographic groups such as ethnicity, gender religion, caste and similar. We now outline our contributions in this paper. First, we characterize the novel task of detection of fair spatial hot spots, for the first time. In particular, we outline a task formulation for enumerating a diverse sample of trade-off points in the noteworthiness-fairness spectrum, to suit diverse scenarios that require different trade-off points between noteworthiness and fairness. We note that straightforward solutions for the task would be computationally infeasible for even moderate dataset sizes. Second, we develop a method, FiSH, short for Fair Spatial Hot Spots, for efficiently enumerating sets of hot spots along the quality-fairness trade-off. FiSH works as a layer over any chosen fairness-agnostic spatial hot spot detection method, making it available across diverse scenarios and existing methodologies for those scenarios. Third, we outline a suite of evaluation measures to assess the quality and fairness of results. Lastly, we perform an extensive empirical evaluation over realworld datasets which illustrates the effectiveness and efficiency of FiSH in identifying diverse and fair hot spots. Given that fairness in spatial hot spots is a novel problem, we consider related work across two streams. We start by considering work on identifying outliers and spatial hot spots. These tasks are starkly different in terms of how the results are characterized. Outliers are determined based on neighborhood density, whereas hot spots are determined based on hotness on a chosen attribute (e.g., diseased, poor etc.). In particular, the notion of a hotness attribute is absent in the formulation for outlier detection making them fundamentally different tasks. Despite being non-identical tasks, there are similarities in the overall spirit of the tasks, which makes outlier identification relevant to the interested reader. We start with a discussion on methods for the tasks of outlier detection and spatial hot spots, and then move on to work on fairness in machine learning as applied to tasks allied to ours. There have been a variety of problem settings that seek to identify objects that are distinct from either their surroundings or the broader dataset. The more popular formulations use the former notion, that of measuring contrast from the surroundings of the data object, i.e., making use of neighborhood density. LOF [5] (and improvements [20] ) consider identifying individual objects, aka outliers, which are interesting due to their (relatively sparser) spatial neighborhoods. It is noteworthy that these make object-level decisions informed purely by spatial attributes (without reference to non-spatial attributes like diseased/non-diseased, as required for COVID-19 hot spot determination). SLOM [6] extends the objectlevel local neighborhood-based decision making framework to incorporate information from non-spatial attributes. Among outlier detection methods that assess the contrast of individual data objects with the dataset as a whole, the popular paradigm is to build a dataset level statistical model, followed by assessing the conformance of individual objects to the model; those that are less conformant would be regarded as outliers. Such statistical models could be a clustering [38] , dirichlet mixture [13] , or more recently, auto-encoders [7, 22] . Spatial Scan Statistics (SSS), pioneered by Kulldorff [21] , are methods that identify localized regions that encompass multiple objects (in contrast to making decisions on individual objects, as in LOF) which collectively differ from overall dataset on chosen non-spatial hotness attributes (e.g. diseased, poor etc.). The focus is on characterizing regions which may be interpreted as hot spots due to the divergence of their character from the overall dataset. This makes SSS a markedly different task from outlier identification in specification, input data requirements, internal workings and output format. SSS spatial hot spots are vetted using a statistical likelihood ratio test to ascertain significant divergence from global character. This makes SSS as well as its various variants, as implemented within SaTScan, a statistically principled family of methods to detect spatial hot spots. While Kulldorff's initial proposal is designed to identify circular hot spots, the framework has been generalized to identify arbitrary shapes in several ways; ULS [29] is a notable work along that direction. Other methods such as bump hunting [14] and LHA [33] address related problems and leverage data mining methods. Despite an array of diverse research in identifying spatial hot spots, SSS methods have remained extremely popular. Just since 2020, there have been 1000+ papers 3 that make use of SSS and other scan statistics within SaTScan. Our technique, FiSH, can work alongside any method that can provide an ordered output of hot spots, such as SaTScan methodologies. While most attention within the flourishing field of fairness in machine learning [9] has focused on supervised learning tasks, there has been some recent interest in fairness for unsupervised learning tasks [28] . Among the two streams of fairness explored in ML, viz., individual and group fairness (refer [3] for a critical comparative analysis), most work on fair unsupervised learning has focused on group fairness. Group fairness targets to ensure that the outcomes of the analytics task (e.g., clusters, top-results etc.) embody a fair distribution of groups defined on protected attributes such as gender, ethnicity, language, religion, nationality or others. As alluded to earlier, the most common instantiation of group fairness has been through the computational notion of statistical parity, initially introduced within the context of classification [12] . Group fair unsupervised learning work includes those on fair clustering (e.g., [8] ), retrieval (e.g., [39] ) and representation learning (e.g., [26] ). While there has been no work on fair spatial hot spots yet, there has been some recent work on fairness in outlier detection which we discuss below. Fair Outliers: There has been some recent work on fair outlier detection. We start by outlining the core differences between outlier detection to illustrate why fairness enhancements targeted at outlier detection would not be applicable for spatial hot spots. First, outlier detection involves object-level decision making, whereas hot spots are determined at the level of object groups. Second, they do not make use of any non-spatial hotness attribute (e.g., diseased, poor etc.) to determine outlierness, whereas a key nonspatial attribute is used to characterize hot spots. The second difference makes algorithms for outlier detection contrast highly from those for identifying spatial hot spots. Among recent fair outlier detection papers, [10] develops a human-in-the-loop method for fair outlier detection, whereas [11] focuses on automated group fair outlier detection, developing FairLOF, a technique that extends LOF (discussed above) for fairness. FairLOF adapts LOF to incorporate adjustments based on protected attribute memberships of the object in question and its neighbors, to ensure that protected groups are fairly represented among outliers. It may be noted that the protected attributes are used exclusively to embed fairness, and not to characterize outlierness. There is a third paper, [31] which makes an intriguing proposition of achieving group fairness (on protected attributes) while being unaware of protected attributes at decision time. To our best knowledge, there has been no prior work on fairness in detecting spatial hot spots or anomalous object groups of other kinds. Consider a dataset D = {. . . , , . . .}. Each object is associated with a set of spatial attributes such as ( , ) for a 2D space, or ( , ) for locations of people. Further, each is associated with a non-spatial hotness attribute ∈ {0, 1} such as diseased (for epidemiology) or poor (for human development), which is used to determine spatial hot spots. is also associated with protected attributes (e.g., ethnicity, religion) as we will see momentarily. Consider a method for detecting spatial hot spots, such as spatial scan statistics, that is capable of providing a ranked list of top spatial hotspots, as S = [ 1 , 2 , . . . , ]. Each is associated with a spatial region (circular/spherical in the case of the basic SSS) such that the data objects (from D) that fall within have a significantly different hotness profile than the overall dataset. For example, the population within may have a significant high (or low) incidence rate of poverty as compared to the whole population. Noteworthiness of spatial hot spots, analyzed statistically (as in SSS), is directly related to both the size of the population within the hot spot and the extent of divergence on the hotness attribute. S is the list of spatial hot spots ordered in decreasing statistical noteworthiness; thus is more noteworthy than +1 . When (typically, << ) noteworthy spatial hot spots are to be chosen to action upon without consideration to fairness, the most noteworthy hot spots, i.e., S = [ 1 , . . . , ], would be a natural choice. The task of fair spatial hot spots detection is to ensure that the hot spots chosen for policy action, in addition to noteworthiness considerations as above, together encompass a diverse population when profiled along protected attributes such as ethnicity, religion, nationality etc, as motivated earlier. In other words, each demographic group is to be accorded a fair share within the collective population across the chosen hot spots. As mentioned earlier, this notion of statistical parity has been widely used as the natural measure of fairness in unsupervised learning [2, 8, 11] . When the protected attributes are chosen as those that an individual has no opportunity to actively decide for herself (observe that this is the case with ethnicity, gender as well as religion and nationality to lesser extents), statistical parity aligns particularly well with the philosophy of luck egalitarianism [19] , as noted in Section 1.1. We will use S to denote a set of hot spots (from S) that are selected in a fairness-conscious way. It is desired that S fares well on both the following measures: The first, (.), relates with noteworthiness and is simply the sum of the ranks (ranks within S) of the chosen spatial hot spots. Lower values of (.) are desirable, and S scores best on (.), due to comprising the top-(so, ). The second, (.), is a fairness measure, which requires that the population covered across the hot spots within S be minimally divergent to the overall population, when measured on a pre-specified set of protected attributes P (e.g., ethnicity, gender); (., .) measures divergence on attribute ∈ P. The divergence may be computed using a suitable distance measure; we will use Wasserstein distance [34] . As in the case for (.), lower values of (.) are desirable too. Though lower, and not higher, values of (.) and (.) indicate deeper noteworthiness and fairness, we refer to these measures as noteworthiness and fairness to avoid introducing new terminology. The noteworthiness and fairness considerations are expected to be in tension (an instance of the fairness-accuracy tension [24] ), since fairness is not expected to come for free (as argued extensively in [17] ). One can envision a range of possibilities for S , each of which choose a different point in the trade-off between (.) and (.). At one end is the S (best (.), likely worst (.)), and the other end is a maximally fair configuration that may include extremely lowranked hot spots from S. These would form the pareto frontier 4 when all the ( sized) subsets of S are visualized as points in the 2D noteworthiness-fairness space, as illustrated in Figure 2 . Each point in the pareto frontier (often called skyline [4] ) is said to be pareto efficient or pareto optimal since there is no realizable point which is strictly better than it on both N and F measures. In other words, S candidates that are not part of the pareto frontier can be safely excluded from consideration, since there would be a pareto frontier candidate that is strictly better than it on both noteworthiness and fairness. Each policy domain may choose a different point in the trade-off offered across candidates in the pareto frontier, after due consideration of several available trade-off points. For example, policing may require a high-degree of fairness, whereas epidemiology interventions may be able to justify policy actions on less diverse populations based on the extent of supporting medical evidence. The pareto frontier may be large (could contain hundreds of candidates, theoretically bounded above only by O ( )) for a human user to fully peruse. Thus, an obvious recourse would be to identify diverse pareto efficient candidates (henceforth, -dpe), where is a pre-specified parameter, so the human user may be able to choose appropriately from a varied set of possibilities. A natural and simple but incredibly inefficient solution would be to (i) enumerate the entire pareto frontier, (ii) trace the sequence of pareto efficient points from the top-left to the bottom-right (i.e., the dotted line), (iii) split the sequence into − 1 equally sized segments, and (iv) take the segment end points as the result. To summarize, the diverse candidate selection task outlined as -dpe requires a diverse set of pareto efficient candidates in the N-F space, each candidate representing a sized subset of S. It may be observed that it is infeasible to enumerate the subsets (e.g., 40 10 = 8.5 +8) in the N-F space just due to the profusion of possibilities, making exact -dpe identification (as outlined in the four-step process in the previous section) infeasible for practical scenarios. This makes the task of identifying a close approximation of -dpe results efficiently a natural alternative for a policy expert to examine the trade-off points and arrive at a reasonable choice of S to subject to policy action. This brings us to the approximate -dpe task, which is that of efficiently identifying a close approximation of the exact -dpe result. The set of circled Notation What it stands for S the ordered list of spatial hotspots used as starting point for -dpe task S the subset of most noteworthy hotspots from S S -sized subset of S, a candidate for fair selection of hot spots (S ) sum of ranks of the spatial hot spots within S ; lower denotes better noteworthiness (S ) deviation of S 's population from dataset on protected attributes; lower denotes better fairness cardinality of S # hotspots from S desired in each output candidate Number of candidates desired in output beam width parameter used by FiSH (Sec 4) Figure 2 illustrates a possible solution to the approximate -dpe task. All pertinent notations are outlined in Table 1 for easy reference. Our method, FiSH, that addresses the approximate -dpe task, is detailed below. FiSH is an efficient heuristic-driven technique addressing the approximate -dpe task outlined above. We first describe a systematic organization of the search space, followed by a heuristic method that traverses the space prioritizing the search using three considerations: pareto efficiency, diversity and efficient search. Recall that we start with a noteworthiness-ordered list of spatial hot spots S = [ 1 , . . . , ]. Our full search space comprises the distinct -sized subsets of S. We use the lexical ordering in S to organize these candidates as leaves of a tree structure, as shown in Figure 3 . Each node in the tree is labelled with an element from S, and no node in the FiSH search tree has a child that is lexically prior to itself. Such a hierarchical organization is popular for string matching tasks, where they are called prefix trees [37] . In devising FiSH, we draw inspiration from using prefix structures for skyline search over databases [27] . Each internal node at level (root level = 0) represents a -sized subset of S comprising the nodes indicated in the path from root to itself. The lexical ordering ensures that each subset of S has a unique position in the tree, one arrived at by following branches corresponding to nodes in the subset according to the lexical ordering. The candidates would be the nodes at level . It is infeasible to enumerate them fully, as observed earlier. Thus, FiSH adopts a heuristic search strategy to traverse the tree selectively to follow paths leading to a good solution (i.e., set of nodes at level ) for the approximate -dpe task. The exact -dpe result set is characterized by pareto efficiency and diversity, when applied over the candidates. The FiSH search strategy uses precisely these criteria as heuristics to traverse the search tree efficiently from the root downward. The core idea behind this search strategy is our conjecture that pareto efficiency and diversity at a given level in the FiSH search tree would be predictive of pareto efficiency and diversity at the next level. We operationalize this heuristic strategy using beam search, a classical memory-optimized search meta-heuristic [32] that has received much recent attention [36] . FiSH starts its search from the root node, expanding to the firstlevel child nodes, each of which represent singleton-sets denoting the choice of a particular spatial hot spot from S. This forms the candidate set at level 1 of the FiSH tree, C 1 = {{ 1 }, { 2 }, . . .}. These 1-sized subsets of S are then arranged in an N-F space as in Fig 2. Note that the N-F space of 1-sized subsets is distinct and different from the N-F space of -sized subsets (Fig. 2) . The paretoefficient subset of C 1 is then identified as (C 1 ). The candidates in (C 1 ) are then arranged in a linear sequence tracing the pareto frontier from the top-left to the bottom-right point (similar to the illustration of pareto frontier in Fig 2) . This linear sequence is split into − 1 equally spaced segments, and the points at the segment end-points are chosen as ( (C 1 )), a -sized subset of C 1 . The (C ) = pareto frontier of C in the N-F space 7 R = equally spaced points from (C ); 8 Return R candidate set at the next level of the tree search process, i.e., C 2 , is simply the set of all children of nodes in ( (C 1 )) (actually, the subsets of S that they stand for). It may be noted that C 2 is a small subset of the set of all 2-sized subsets of S, since only children of the nodes selected from the previous level are selected for inclusion in C 2 . Next, C 2 is subject to the same processing as C 1 comprising: (1) identifying pareto efficient candidates (C 2 ), (2) identifying a diverse sized subset ( (C 2 )), and (3) following the children pointers, to arrive at the candidate set for the next level. This process continues up until C whereby the pareto frontier (C ) is identified followed by the choice of diverse candidates which will eventually form FiSH's result set for the approximate -dpe task. This search strategy is illustrated formally in Algorithm 1. The one-to-one correspondence between nodes in the search tree and subsets of S allows us to use them interchangeably in the pseudocode. FiSH's search strategy makes use of pareto efficiency and diversity directly towards identifying a small set of nodes to visit at each level of the tree. Restricting the search to only nodes at each level before moving to the next enables efficiency. Smaller values of enable more efficient traversal, but at the cost of risking missing out on nodes that could lead to more worthwhile members of the eventual result set. In other words, a high value of allows a closer approximation of the -dpe result, but at a slower response time. It may be suggested that be set to ≥ , since the algorithm can likely afford to visit more options than a human may be able to peruse eventually in the result set. The candidate set size at any point, and thus the memory requirement, is in O ( ). The computational complexity is in O ( 2 2 ), and is dominated by the pareto frontier identification (which is in O ( 2 2 )) at each level. While is a controllable hyperparameter (likely in the range of [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] , can be constrained by limiting FiSH to work with the top-result set (as S) from the upstream spatial hot spot technique. Given that (approximate) -dpe is a new task we proposed, we now describe novel evaluation metrics to assess the quality of FiSH's results. Recall that, given the N-F space comprising all -sized subsets of S, the choice of equally spaced skyline candidates forms the result set for the exact -dpe task that we propose in this paper. This result set, which we call Exact, is computationally infeasible for moderate datasets, but forms our natural baseline for measuring FiSH's effectiveness. Approximate -dpe results from FiSH may be evaluated either directly based on how well they approximate the expected results of the exact -dpe task, or based on how well they adhere to the spirit of the -dpe task of identifying a diverse group of pareto efficient subsets of S. We now devise evaluation measures along the lines above. In what follows, we use P to denote the -sized subsets of S. Let the result of the exact -dpe task be E = [ 1 , . . . , ], and FiSH's result be F = [ 1 , . . . , ]. We would like the average distance between corresponding elements to be as low as possible. where (., .) is the euclidean distance in the N-F space. Notice that when E = F , (., .) evaluates to 0.0. Given that (.) and (.) would be in different ranges, we will compute the distance after normalizing both of these to [0, 1] across the dataset. As may be obvious, smaller values, i.e., as close to 0.0 as possible, of (., .) are desirable. A diverse and pareto efficient set may be expected to collectively dominate most objects in the -space. Accordingly, we devise a measure, called coverage, that measures the fraction of candidates in P that are pareto dominated by at least one candidate in F . where ≻ is true when pareto dominates . A point pareto dominates another if the latter is no better than the former on both attributes, excluding the case where both are identical in terms of their N-F co-ordinates. A candidate being dominated by another indicates that the latter characterizes an absolutely better trade-off point than the former (on both (.) and (.) ). Thus, we would like the result set to be in a way that most, if not all, candidates are dominated by one or more candidates in the result set. (.) is measured as a fraction of the candidates dominated, hence it is in the range [0, 1]. Full coverage (i.e., (.) = 1.0) may not be attainable given that only candidates can be chosen in the result; instead, if we were to choose the entire skyline, we would get = 1.0 by design. Thus, the extent to which (F ) (FiSH's coverage) approaches (E) (coverage attained by the exact result) is a measure of FiSH's quality. Coverage, being modelled using pareto domination, may be seen as modelling pareto-ness of FiSH's result. Given that our formulation of the approximate -dpe task hinges on the idea that the candidates should be diverse (so that they may embody a variety of different trade-off points), diversity is a key aspect to measure the adherence of the solution to the spirit of the approximate -dpe task. We model diversity as the minimum among pairwise distances between candidates in F . Unlike the average of pairwise distances that allows nearby pairs to be compensated by the existence of far away ones, this is a stricter measure of diversity. On the other hand, this is quite brittle, in the sense just one pair of results being proximal would cause (.) to go down significantly; in such cases, the (.) would not be that representative of the overall diversity in F . Hence, all the evaluation measures must be seen in cognisance of the others. Coming to desirable values of (.), we would like (F ), which measures the lower bound of distances among elements in F , to be as high as possible, and approach the diversity of E, i.e., (E). As obvious from the construction, lower values of , and higher values on both and indicate the quality of FiSH's approach. It is also to be seen that and should be judged together, since it is easy to maximize coverage without being diverse and vice versa. and requires all subsets of S to be enumerated, whereas requires additionally that the exact -dpe results be computed. This makes these evaluations feasible only in cases where such enumeration can be done, i.e., reasonably low values of . In addition to the above quality measures, a key performance metric that FiSH seeks to optimize for, is the response time. We now describe our empirical study evaluating FiSH. In this section, we describe the dataset used, the experimental setup, our evaluation measures and our experimental results. 6.1.1 Dataset. We used the Indian Human Development Survey (IHDS) 5 dataset, a large-scale survey of India's population conducted in 2011-12. In particular, we used a random sample of 10000 individuals from the data with distinct locations. The location (lat, long) was determined through querying Google Maps based on the district and other location information available in the data. The binary hotness attribute was chosen as either (i) (annual) income < 100k 6 , or (ii) education < 2 yrs. For each setting, we use caste and religion as sensitive attributes and low income/education as hot spot criterion. In other words, we would like to identify a set of spatial hot spots such that the population across them fare poorly on income (education) but religion and caste groups are fairly represented. These choices of attributes for hotness and fairness are abundantly informed by social realities in contemporary India; for example, caste discrimination remains rampant across India, including in urban settlements 7 . We used SaTScan Bernoulli model to discover hot spots. We implemented FiSH as well as the Exact -dpe computation (i.e., enumerate all subsets, find pareto efficient frontier, and identify diverse subsets) on Python 3 on an Intel 64 bit i5-8265 at 1.6 GHz with 8 GB RAM. Unless otherwise mentioned, we use the following parameter settings: = 20, = 5 and = = 5. We performed extensive empirical analyses over varying settings. We present representative results and analyses herein. Table 2 illustrates a representative sample of the overall trends in the comparison between FiSH and Exact. The low values of indicate that FiSH's results are quite close to those of Exact, which is further illustrated by the trends on the Cov measure where FiSH follows Exact closely. For MD, we observe a 20% deterioration in the case of Income, and a 50% deterioration in the case of Education. We looked at the case of Education and found that the low value of MD for FiSH was due to one pair being quite similar (distance of 0.041), possibly a chance occurrence that coincided with this setting; the second least distance was more than three times higher, at 0.1349. On an average, the pairwise distances for FiSH was only 20% less than that for Exact. Across varying parameter settings, a 15-20% deterioration of MD was observed for FiSH vis-a-vis Exact. For the record, we note that the choice of first hot spots from S as the result yielded ≈ 0.8 and Cov 3 to 10 percentage points lower; this confirms that -dpe task formulation is significantly different from top-k not just analytically, but empirically too. Apart from being able to approximate the Exact results well, FiSH is also seen to be able to generate results exceptionally faster, a key point to note given that bringing the -dpe task into the realm of computational feasibility was our main motivation in devising FiSH. In particular, FiSH's sub-minute response times compare extremely favourably against those of Exact which is seen to take more than an hour; we will illustrate later that Exact scales poorly and rapidly becomes infeasible for usage within most practical real-life scenarios. The FiSH vs. Exact trends, reported in Table 2 is representative of results across variations in parameter settings. FiSH was consistently seen to record 0-10% deteriorations in Cov, around 15-25% deterioration in MD, and multiple orders of magnitude improvements in response time. The trends on the effectiveness measures as well as the response time underline the effectiveness of the design of the FiSH method. With FiSH being designed for efficient computation of a reasonable approximation of -dpe results, it is critical to ensure that FiSH scales with larger ; recall that = |S|, the size of the initial list of hotspots chosen to work upon. Table 3 illustrates the FiSH and Exact response times with varying . While Exact failed to complete in reasonable time (we set a timeout to 12 hours) for > 25, FiSH was seen to scale well with , producing results many orders of magnitude faster than Exact. In particular, it was seen to finish its computation in a few minutes even for ≈ 100, which is highly promising in terms of applicability for practical scenarios. Similar trends were obtained with scalability with higher values of and ; Exact quickly becomes infeasible, whereas FiSH's response time grows gradually. We now analyze the performance of FiSH in varying settings. This analysis helps us evaluate the sensitivity of FiSH to specific parameter values; for example, smooth movements along small variations in parameter values will help build confidence in the utility of FiSH in varying scenarios. With Exact being unable to complete running within reasonable amounts of time for higher search spaces (e.g., > 25, = 7, > 5 etc.), we restrict our attention to FiSH trends over Cov and MD; this is so since results from Exact are necessary to compute the DC measure. Among Cov and MD, our expectation is that the brittleness of the MD measure, as noted in Section 5.3, could lead to more fluctuations in MD when compared to Cov, even when FiSH results change only gradually. We now study the trends with varying parameter settings, changing parameters one at a time, keeping all parameters at their reference settings from Section 6.1.2, except the one being varied. 6.4.1 Varying . We now analyze the effectiveness of FiSH when operating over a larger set of SaTScan results, i.e., with larger values of (recall = |S|). With the number of points in the N-F space being , increases in lead rapidly to much denser N-F spaces, and correspondingly larger search spaces. We vary from 15 to 30 in steps of 5; the Cov and MD trends appear in Figure 4 and Figure 7 respectively. As expected, Cov consistently remains at values higher than 0.985, whereas there is higher volatility in the case of . The trends indicate that FiSH is not highly sensitive to and the quality of its results varies gradually with varying values of . The number of trade-off points that is provided to the user, or , is another important parameter in the -dpe task. The beam size in FiSH, as observed earlier in Section 5.4, is intimately related to , and may be expected to be set such that ≥ . Higher values of yield better results at the cost of slower responses; we consistently set = in our result quality analysis. Higher values of enable choosing more points from the N-F space in the output, and this provides an opportunity to improve on Cov. However, choosing more points obviously would lead to deterioration in the MD measure that measures the minimum of pairwise distances. We vary (and thus ) from 3 to 7, and plot the Cov and MD trends in Figures 5 and 8 respectively, which show gentle and consistent variations. As expected, Cov is seen to improve and saturate close to the upper bound of 1.0. MD on the other hand, is seen to deteriorate but stabilizes soon; the patterns are consistent except for the case of = 5 for Education, likely a chance occurrence as analyzed in Section 6.2. Varying . The third parameter of importance for the -dpe task is , which denotes the number of hotspots to be chosen within each trade-off point in the result. Increasing values of (up to /2) lead to larger number of points in the N-F space. With the number of trade-off points to be output pegged at , achieving the same coverage would become harder with increasing . This is in contrast with where there is no expectation of a consistent deterioration or improvement. From the Cov and MD plots in Figures 6 and 9 , the Cov is quite stable with a deterioration kicking in at = 7 (even there, remains at 0.90+), whereas MD remains consistent. 6.4.4 Setting . The beam width, in FiSH, offers a mechanism to trade-off effectiveness for efficiency. We experimented with varying values of and found that the gains on effectiveness measures (i.e., DC, Cov and MD) taper off beyond > 2 × . The response times were seen to increase with ; there are two ways in which affects the complexity, one is by providing more candidates at each level (which increases linearly with ), and another by increasing the cost of pareto frontier identification (which is in O ( 2 )). From the trends which indicated a linear trend between response time and , it may be reasonably suspected that the former factor dominates. Having analyzed FiSH quantitatively, we now consider a qualitative evaluation of FiSH vis-a-vis Exact. Fig 10 illustrates the N-F space for our reference setting (Section 6.1.2) for Income, with results from FiSH (green points) juxtaposed against Exact results (mustard yellow) and other points in red. This result is representative of FiSH's strengths and weaknesses. While three of five FiSH results are seen to be on the pareto frontier, the others are only slightly inward. As in the case of any heuristic-driven method, FiSH may miss some good results; here, FiSH's sampling misses out on the top-left region of the pareto frontier, which explains the slight deterioration in Cov for FiSH when compared with Exact. In this paper, for the first time to our best knowledge, we considered the task of fair detection of spatial hot spots. In this web era where spatially-anchored digital data is collected extensively, spatial hot spot detection is used extensively to inform substantive policy interventions across a variety of domains, making fairness an important consideration within them. We characterized fairness using the popular notion of statistical parity when computed collectively over chosen hot spots, and outlined the task of identifying a diverse set of solution candidates along the fairness-noteworthiness pareto frontier. Observing the computational infeasibility of identifying exact solutions, we developed a method, FiSH, that performs a highly efficient heuristic-driven search to identify good quality approximate solutions for the task. We then formulated a suite of evaluation metrics for the novel task of fair hot spots. We perform an extensive empirical evaluation over a real-world dataset from the human development domain where fairness may be considered indispensable, and illustrated that FiSH delivers high-quality results, and offers good scalability, consistently returning results orders of magnitude faster than what is required to compute exact results. This illustrates the effectiveness of FiSH in achieving fairness in detection of spatial hot spots, and that it offers fast response times, making it appropriate for real-world scenarios. While we have considered enhancing fairness by working upon a ranked list of spatial hot spots, FiSH extends easily to work over techniques that are capable of providing scores (in addition to ranks, which is basically an ordering over the scores) for each hot spot as well; we are considering evaluating FiSH's effectiveness in working over such scored lists. Our formulation of diverse candidates assumes that the user may be interested equally in all parts of the noteworthiness-fairness trade-off space. However, in several cases, users may have a preference to exclude some parts of the space. For example, the maximum relaxation of noteworthiness may be bounded above in some scenarios. We are considering how user's trade-off preferences can be factored into the FiSH search process to deliver diverse results within the sub-spectrum of interest. Fairness in Clustering with Multiple Sensitive Attributes Fair Algorithms for Clustering On the apparent conflict between individual and group fairness The skyline operator LOF: identifying density-based local outliers SLOM: a new measure for local spatial outliers Outlier detection with autoencoder ensembles Fair clustering through fairlets A Snapshot of the Frontiers of Fairness in Machine Learning A Framework for Determining the Fairness of Outlier Detection Fair Outlier Detection Fairness through Awareness Unsupervised anomaly intrusion detection via localized bayesian feature selection Bump hunting in highdimensional data The rise of right-wing populism in Europe and the United States. A Comparative Perspective, Friedrich Ebert Foundation Beyond postsuburbia? Multifunctional service agglomeration in Vienna's urban fringe The ethical algorithm: The science of socially aware algorithm design Luck egalitarianism: Equality, responsibility, and justice Luck egalitarianism LoOP: local outlier probabilities A spatial scan statistic Robust Subspace Recovery Layer for Unsupervised Anomaly Detection Race and place: The ecology of racial profiling African American motorists The cost of fairness in binary classification A penalized likelihood method for balancing accuracy and fairness in predictive policing Convex formulations for fair principal component analysis Efficient Skyline Retrieval with Arbitrary Similarity Measures Fairness in Unsupervised Learning Upper level set scan statistic for detecting arbitrarily shaped hotspots Spatial clustering of measles cases during endemic (1998-2002) and epidemic (2010) periods in Lusaka, Zambia FAIROD: Fairnessaware Outlier Detection Improvements in beam search Detecting localized homogeneous anomalies over spatio-temporal data Calculation of the Wasserstein distance between probability distributions on the line Unsupervised model-based clustering for typological classification of Middle Bronze Age flanged axes Sequence-to-sequence learning as beam-search optimization Prefix trees: new efficient data structures for matching strings of different lengths Findout: Finding outliers in very large datasets Mohamed Megahed, and Ricardo Baeza-Yates. 2017. Fa* ir: A fair top-k ranking algorithm