key: cord-0432340-lcyy1yi6 authors: Marcílio-Jr, Wilson E.; Eler, Danilo M.; Paulovich, Fernando V.; Martins, Rafael M. title: HUMAP: Hierarchical Uniform Manifold Approximation and Projection date: 2021-06-14 journal: nan DOI: nan sha: 69cffbf273fa42c0ea97d645ecc61006f7e02486 doc_id: 432340 cord_uid: lcyy1yi6

Dimensionality reduction (DR) techniques help analysts to understand patterns in high-dimensional spaces. These techniques, often represented by scatter plots, are employed in diverse science domains and facilitate similarity analysis among clusters and data samples. For datasets containing many granularities or when analysis follows the information visualization mantra, hierarchical DR techniques are the most suitable approach since they present major structures beforehand and details on demand. However, current hierarchical DR techniques are not fully capable of addressing literature problems because they do not preserve the projection mental map across hierarchical levels or are not suitable for most data types. This work presents HUMAP, a novel hierarchical dimensionality reduction technique designed to be flexible in preserving local and global structures and to preserve the mental map throughout hierarchical exploration. We provide empirical evidence of our technique's superiority compared with current hierarchical approaches and show two case studies to demonstrate its strengths.

High-dimensional data analysis has become an essential task in most knowledge domains. However, together with the increasing availability of high-dimensional data come the challenges of novel applications. Thus, tools for navigation and analysis are increasingly important. For example, dimensionality reduction (DR) techniques aim to reduce a dataset's dimensions while preserving the structure present in the high-dimensional space. Encoded as scatter plots, DR results (or embeddings) support the analysis of document collections [5] and of gene expression data [11, 16, 37, 42, 44], as well as of the features learned by deep learning models [2, 34]. For the latter two applications, practitioners usually look for DR techniques that can uncover complex structures (e.g., manifolds), retain clusters, and provide meaningful representations of the global relationship among clusters. Traditionally, DR techniques operate at one level of detail, producing similarity-based layouts that highlight general patterns in the overall distribution of points in the high-dimensional space. However, these layouts may hide important characteristics of the datasets, such as dissimilarities between samples inside a cluster, whose analysis would benefit from an overview of the structures followed by a focus on the most important information first. This limitation can be circumvented with hierarchical DR (HDR) techniques, which support the exploration of DR results using the visual information-seeking mantra [40]: overview first, zoom and filter, then details-on-demand. In HDR, the dominant structures seen in traditional representations are summarized. Users first visualize them with details that would not be visible in a projection of the whole dataset; for instance, see how the purple cluster highlighted in Fig. 1 is presented by the DR and HDR techniques. Then, users proceed to expand clusters and investigate lower (and more detailed) hierarchy levels. Although HDR has been an active area of research in the past few years [10, 17, 33], current techniques present a few limitations.
HiPP [31] uses nodes (encoded as circles with varying areas) to create an overview of the dataset and guide users in further analysis. It is based on Least Square Projection (LSP) [32], which does not scale well to thousands of data points. HSNE [33] can successfully depict manifolds and deal with thousands of data points/dimensions using a GPU implementation. However, as the analyst drills down the hierarchy, HSNE does not preserve the projection mental map. That is, the embedding generated for each hierarchical level becomes more and more dissimilar from the embeddings of the previous level; Fig. 1 (B, bottom) shows this characteristic of the HSNE embeddings. Recently, a hierarchical technique called Multiscale PHATE [17] was proposed to work in the context of mass cytometry data. Although capable of handling massive datasets that present continuous phenomena, the technique does not seem to work well with much higher dimensions or when the dataset does not present such a continuous characteristic. Furthermore, Multiscale PHATE is very rigid in interaction since users can only choose among the clusters created during hierarchy definition when drilling down into the hierarchy. These HDR techniques cannot fully address problems related to two main issues: a good trade-off between the preservation of global and local structures in the low-dimensional representation of the dataset, and the preservation of the mental map across hierarchy levels. First, real-world datasets require HDR techniques to unfold complex structures; HiPP does not preserve the context of the neighborhood graph used for manifold approximation, and its visual metaphor does not encode the existing data structures well. Second, HDR techniques must deal with higher-dimensional datasets and be flexible in terms of hierarchy manipulation and dataset type (which is not supported by Multiscale PHATE). Finally, there is a current need for techniques that uncover complex structures while preserving the global and local relations between data samples. The balance between the preservation of global and local relations is crucial for analyzing deep learning models [24, 34] and gene expression data [20], for example. HSNE meets these requirements for higher hierarchy levels. However, the most critical issue with the technique is that it does not preserve the mental map throughout hierarchy levels, making it hard for users to move back and forth between overview and details. To address these challenges, we present Hierarchical Uniform Manifold Approximation and Projection (HUMAP). Based on UMAP [25], our HDR technique induces a hierarchy on the dataset using a two-step similarity definition that encodes global and local similarity information between data points. The technique maintains the mental map as the user drills down the hierarchy by using projected data points from higher levels to influence the low-dimensional representation of lower hierarchy levels, while successfully uncovering the complex structures present in the many granularities of the dataset. We provide experimental evidence that HUMAP is superior to the techniques discussed above through visual and quantitative assessment. We also provide two qualitative case studies exploring the features learned by a deep learning model and the hierarchical explanation of DR results. In summary, the contribution of this paper is a novel HDR technique (available at https://github.com/wilsonjr/humap) that improves the representation of both global and local relations while preserving the mental map across hierarchy levels.
The remainder of this paper is organized as follows: Section 2 presents the related work; we describe our technique in Section 3; we present two case studies in Section 4; we quantitatively evaluate the technique in Section 5; discussions about our work are presented in Section 6; we conclude our work in Section 7. DR techniques support the analysis of high-dimensional datasets by reducing the number of dimensions while maintaining structural and similarity relations in the low-dimensional representation [30]. When using a DR technique, users usually look for patterns in the data, such as clusters, shapes, and outliers. Understanding the data at hand and the available DR techniques is essential to ensure an effective data exploration process. For example, linear DR techniques [14, 15, 32] are fast approaches that are known to uncover variability and other "simple" structures very well. On the other hand, non-linear DR techniques [21, 25, 27, 33] can uncover more complex structures of the high-dimensional space in the low-dimensional space while being generally more complex and time-consuming. DR results are often represented using the scatter plot metaphor and are used in various knowledge domains. Despite their popularity for exploratory data analysis [26], the overlap among markers makes scatter plots less effective [38, 39]. Together with this visualization problem, traditional DR techniques might hide important details within and among clusters when exploring big datasets as a whole; some datasets, for example, present intrinsically multi-level structures [20] that require more flexibility from a DR technique. In this sense, strategies that provide easier exploration and guide users through the exploratory process facilitate knowledge discovery. Among various such strategies, HDR techniques provide exploration mechanisms based on the overview first & details-on-demand [40] mantra, focusing on important information according to user demand. There are numerous traditional DR approaches (refer to [6, 30] for useful surveys) but only a few hierarchical DR (HDR) techniques [10, 30]. These HDR techniques have the characteristic of first creating the hierarchical structure and then allowing for multilevel exploration. The challenge is how to convey the different levels while preserving context and neighborhood structures [30]. Thus, current studies focus on defining the hierarchical structure while reusing the projection engine of previous traditional DR approaches. MDSteer [46] progressively computes a multidimensional scaling layout according to the user's demand, addressing the computational complexity problem of that technique. Glimmer [12] also addresses MDS complexity, using a multi-level scheme on the GPU to perform the projection. In simple terms, the authors create a hierarchy through interpolation of high hierarchy levels. Nevertheless, MDS-based approaches cannot represent well the complex structures (e.g., manifolds) present in most practical and real-world datasets (e.g., deep learning features, image collections, or biological data). The absence of strategies to convey non-linear structures also impairs hierarchical variants of PCA [1, 13, 45]. On top of that, these techniques were not designed for exploratory data analysis. HiPP [31] uses hierarchical clustering and landmarks to organize the data. The landmarks of coarser levels represent and influence the position of data points in finer levels.
HiPP considers the landmarks of LSP [32] for context preservation and for conveying hierarchy levels, and uses a force-based algorithm to resolve overlaps when positioning points heuristically. HSNE [33] is one of the most robust approaches. It uses random walks on a transition matrix to create a hierarchy that retains good global and local relationships at high hierarchy levels. However, HSNE's embeddings lose the structural relationships presented at higher levels during hierarchical exploration, hindering the analysis when mental map preservation is essential. Finally, HSNE requires a GPU to provide interactive analysis in a reasonable amount of time. A recent technique, called Multiscale PHATE [17] (M. PHATE), uses diffusion condensation [3] to compute a manifold-intrinsic diffusion space on the input data and condenses data points towards centroids to generate groups of multiple granularities. M. PHATE handles massive datasets but only with a few dimensions. Moreover, it does not seem to work well when the input dataset does not present continuous phenomena, a problem related to its embedding engine (PHATE [28]). These methods have one characteristic in common: the hierarchy levels consist of representative samples or landmarks. Interestingly, any single-level-of-detail dimensionality reduction technique that uses landmarks for projecting data points can be turned into a hierarchical version [30]. Finally, during hierarchical exploration, users essentially deal with embeddings of different data subsets with increasing size. Thus, the mental map between consecutive projections must be preserved to guarantee that geometrical transformations do not deceive users during exploratory data analysis. The preservation of the mental map in DR layouts has been receiving attention when exploring deep learning features [4, 35], and is usually called Projection Alignment. Dynamic t-SNE [35] produces projections of a time-varying dataset while minimizing temporal variability that is irrelevant to the final layout. This is achieved by adding a term to t-SNE's cost equation, controlling the trade-off between alignment and the standard t-SNE optimization. VFF [9] also uses a trade-off parameter to control the alignment of projections in feature fusion tasks. However, VFF is not suitable for the preservation of neighborhoods or local structures [4]. Finally, Cantareira and Paulovich [4] proposed a generic model for projection alignment that takes into account the original cost function of the DR technique and a penalty term applied for alignment. In contrast with these approaches, our strategy to preserve the mental map consists of controlling the movement of the data samples already projected at high hierarchy levels, which influences the positioning of new data samples. We present HUMAP to address the problems of previous techniques.
Our approach is very flexible for preserving both local and global relations present in the dataset and provides a tunable parameter to preserve the projection mental map among hierarchy levels and during drill-down operations. These characteristics are apparent in Fig. 1. Unlike HSNE (the current state-of-the-art in the hierarchical exploration of DR layouts), HUMAP maintains the structures of selected subsets in drill-down operations, and the highest hierarchy level (top level) resembles the lowest hierarchy level (B). Compared to single-level techniques (A), HDR approaches summarize the information while providing extra details about the relationships within and among clusters. Our proposed HDR technique consists of two main components: hierarchy construction and projection (Fig. 2). First, in the hierarchy construction component, we induce a tree-like structure on the high-dimensional dataset using a similarity measure among landmarks. Then, in the projection component, we embed the hierarchy levels according to the user's demand for more detailed information. In the following, we provide an overview of our approach; we then detail each of the components in the subsequent sections. The first step for constructing a hierarchy from bottom to top consists of using a kernel function to find the connection strengths (local affinities) of a k-nearest neighbor graph built from the data points in the high-dimensional space R^m (A). Then, we use a Finite Markov Chain to find the most visited nodes (B), which constitute the landmarks for the level above (C). Using the local and global neighborhood information of each landmark (D), the similarity between each pair of landmarks is defined as the intersection of these two neighborhoods (E). Finally, a new neighborhood graph is created (F) using the computed similarity to define a new hierarchy level (G). Then, for projecting hierarchy levels, the neighborhood graph is first symmetrized (H) so that the strength of each edge helps in the process of finding a suitable position in the low-dimensional representation. Except for the top hierarchy level, the projection of lower levels is influenced by the low-dimensional positions (I) of the data points in higher levels for mental map preservation. To build our approach, we borrow a few strategies from the UMAP [25] technique, such as the kernel function to compute the affinities/connection strengths among data samples and the embedding strategy. In the following sections, we focus on the components for hierarchical dimensionality reduction with HUMAP. DR techniques usually rely on k-nearest neighbor structures to approximate manifolds in high-dimensional spaces. A k-nearest neighbor graph approximates global distances through the sum of various local distance relations [27]; that is, the distance between two arbitrary points in the high-dimensional space is approximated by the path formed by the vertices in the k-nearest neighbor graph. Thus, finding a k-nearest neighbor graph is commonly the first step of many DR techniques that can uncover manifolds in data [21, 25, 27]. After finding the k-nearest neighbor graph, a distance computation on the graph using a kernel function defines the strength (or probability) of every edge of the graph.
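For illustration only, the sketch below builds such a k-nearest neighbor graph with scikit-learn. The function name and the exact (brute-force) search are illustrative choices rather than HUMAP's actual implementation, which typically relies on approximate nearest-neighbor search for large datasets.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_graph(X, k=15):
    """For each point, return the indices and distances of its k nearest
    neighbors (excluding the point itself); this is the raw material for
    the kernel/affinity computation described next."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(X)
    dists, idx = nn.kneighbors(X)
    return idx[:, 1:], dists[:, 1:]  # column 0 is each point itself
```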
Here, we adopted UMAP's [25] kernel function for two data points x_i, x_j ∈ X (the high-dimensional dataset):

p_{i|j} = exp( -(d(x_i, x_j) - ρ_i) / σ_i ),    (1)

where d(x_i, x_j) is the Euclidean distance between the data points x_i and x_j, ρ_i is the Euclidean distance of x_i to its closest neighbor, and σ_i is a kernel value that depends on the number of neighbors k, according to the following equation [25]:

Σ_{j=1}^{k} p_{i|j} = log_2(k).    (2)

Notice that σ_i is unknown; however, by fixing a value for k, we can use Equation 2 to compute it using a binary search [25]. That is, for each data point x_i (matrix row), Equation 1 is plugged into Equation 2 and a binary search is used to find the best σ_i value given the number of nearest neighbors (k). Thus, the value of σ_i that best satisfies Equation 2 for the chosen k is selected. This procedure ensures a different σ_i for each data point and neighborhood; that is, for a fixed k, σ_i receives greater (or lower) values for denser (or sparser) neighborhoods. In other words, σ_i encodes how close together the data points in the neighborhood of x_i are. Lastly, the parameter ρ_i gives a locally adaptive kernel for each data point and ensures a good topological representation of the high-dimensional data [25]. Computing p_{i|j} for the data points of the dataset means that we have the connection strength between every pair of neighbors according to the neighborhood density (see Equation 1). With these connection strengths, we select good representative data points (or landmarks) to summarize the dataset on upper hierarchy levels. For this task, we follow Pezzotti et al.'s [33] work and use random walks on a Finite Markov Chain (FMC). A random walk, specified by a length µ, consists of the path from an initial state to the final state after µ steps (or hops) on the neighborhood graph. Thus, we generate the transition probabilities for the FMC from p_{i|j} by computing the following equation:

T_{i,j} = p_{i|j} / Σ_{x_k ∈ NH(i)} p_{i|k},

where the data points x_j and x_k are in the neighborhood of x_i (NH(i)). Then, the landmarks are sampled from T_{i,j} by a simple Markov Chain Monte Carlo technique [8]. That is, for each data point i, we start n random walks of length µ. We empirically found that n = µ = 10 results in embeddings that faithfully represent the underlying data structure. Unlike previous approaches, our technique has the advantage that users control the number of data points in each hierarchy level. Users set the number of data points in a hierarchy level by specifying the number of landmarks on level i (|H_i|). During the Monte Carlo execution, we store the random walk endpoints and select as landmarks the |H_i| most visited data points.
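As an illustration of this landmark selection, the sketch below runs the random walks over a precomputed k-nearest neighbor graph, using the transition probabilities T_{i,j} defined above, and keeps the most visited endpoints. It is a simplified, self-contained version under illustrative naming; the actual HUMAP implementation performs this step in compiled code.

```python
import numpy as np

def select_landmarks(neighbors, probs, n_landmarks, n_walks=10, walk_length=10, seed=None):
    """Pick landmarks as the most visited random-walk endpoints.

    neighbors[i] -- indices of the k nearest neighbors of point i
    probs[i]     -- transition probabilities T_{i,j} over those neighbors
                    (each row sums to 1)
    """
    rng = np.random.default_rng(seed)
    n = len(neighbors)
    visits = np.zeros(n, dtype=np.int64)
    for start in range(n):
        for _ in range(n_walks):
            current = start
            for _ in range(walk_length):
                step = rng.choice(len(neighbors[current]), p=probs[current])
                current = neighbors[current][step]
            visits[current] += 1  # only the endpoint of each walk is counted
    return np.argsort(-visits)[:n_landmarks]
```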
During hierarchy construction, each hierarchical level H_i encodes information from the level below it (H_{i-1}). This definition ensures that the information at the top level accurately represents the information of the whole dataset; that is, the sum of local distances approximates the distance between landmarks. For the first hierarchical level, the similarity is computed using Equation 1 based on the Euclidean distance d(x_i, x_j). For higher levels, the similarities among the landmarks in H_i follow the data organization of H_{i-1}. Such similarities are computed based on the k-nearest neighbor graph of H_{i-1}. For two landmarks l_u^i and l_v^i on level i, we compute their similarity using two components: the intersection of the global and local neighborhoods. Fig. 2 D-F illustrates these two components for two landmarks in green. The local neighborhood (in pink) is computed for every hierarchy level (A and F), and the global neighborhood is found through random walks (in blue). Two landmarks that share many local neighbors are very close in the high-dimensional space. Landmarks sharing global neighbors have some relationship since there is a path between these landmarks in the k-nearest neighbor graph. We use random walks to compute the global neighborhood in a similar fashion to the HSNE [33] approach. In this case, we start ω random walks of length υ from each non-landmark (x_k^{i-1}) on the k-nearest neighbor graph of H_{i-1}. Then, when a random walk reaches a landmark (l_u^i), we add the non-landmark (x_k^{i-1}) to the "representation neighborhood" of landmark l_u^i. Notice that such a representation neighborhood (RNH) is created only for similarity computation. The local neighborhood component adapts our approach to preserve more of the local structures. To this end, we use the immediate neighborhood (NH) of each landmark l_u^i to augment its representation neighborhood with the β × |NH(l_u^i)| first nearest neighbors of l_u^i, where β is a threshold between 0 and 1. Thus, the greater β is, the more important the immediate neighborhood, and the more local relations the similarity measure will encode. The process of finding these two components results in the representation neighborhood (RNH) for each landmark l_u^i in level H_i. This neighborhood is stored as a sparse matrix RNH in which the entry RNH_u[k] is 1 if the data point x_k^{i-1} belongs to the representation neighborhood of landmark l_u^i, or 0 otherwise. The similarity among landmarks is computed based on the intersection of their representation neighborhoods, that is, the shaded area of Fig. 2 E-F. We compute the intersection with the RNH matrices as follows:

D_RNH(u, v) = ((RNH_u)^T RNH_v) / M,

where M is a normalization factor that ensures a similarity between 0 and 1; notice that, since RNH is a sparse matrix, the operation (RNH_u)^T RNH_v is performed really fast in practice. The resulting measure using the intersection of neighborhoods is a similarity measure, meaning that a value equal to 1 corresponds to landmarks sharing the same representation neighborhood. Thus, we transform it into a dissimilarity function by subtracting the similarity value from 1. In addition, since we use the UMAP technique and it computes a probability graph using Equation 1, readers may ask why we did not define a probability matrix to cut corners just like the HSNE technique [33]. However, if we created the probability matrix, we would lose the kernel function's properties, such as the ρ_i parameter. Moreover, our algorithm for computing the similarity matrix is fast. It results in a very sparse matrix with dissimilarities, making the computation of the k-nearest neighbors for hierarchy levels just a matter of a sorting procedure for each vertex.
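A minimal sketch of this similarity computation with SciPy sparse matrices is shown below. The function name and the specific normalization factor M (here, the size of the smaller of the two representation neighborhoods) are illustrative assumptions; the text only states that M keeps the similarity within [0, 1].

```python
import numpy as np
from scipy.sparse import csr_matrix

def landmark_similarity(rnh_members, n_points):
    """Similarity between landmarks as the normalized intersection of
    their representation neighborhoods (RNH).

    rnh_members[u] -- indices (in the level below) of the points that belong
                      to the representation neighborhood of landmark u
    Returns a landmark-by-landmark similarity matrix with values in [0, 1];
    the dissimilarity used afterwards is 1 minus this matrix.
    """
    n_landmarks = len(rnh_members)
    rows = [u for u, members in enumerate(rnh_members) for _ in members]
    cols = [k for members in rnh_members for k in members]
    RNH = csr_matrix((np.ones(len(cols)), (rows, cols)),
                     shape=(n_landmarks, n_points))
    overlap = (RNH @ RNH.T).toarray()        # |RNH_u ∩ RNH_v| for all pairs
    sizes = np.asarray(RNH.sum(axis=1)).ravel()
    M = np.minimum.outer(sizes, sizes)       # assumed normalization factor
    M[M == 0] = 1.0
    return overlap / M
```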
Having the dissimilarity values (1 − D_RNH), we are able to use them in Equation 1 to generate the matrix p_{i|j} for each hierarchical level. Then, UMAP [25] applies the following matrix symmetrization to produce an undirected weighted graph:

p_{ij} = p_{i|j} + p_{j|i} − p_{i|j} p_{j|i}.

The matrix p_{ij} is used to produce an embedding that slowly converges to positions in a low-dimensional space. While we refer to McInnes et al. [25] for a detailed description of the UMAP algorithm, what is important to describe here is that an initial low-dimensional representation of the dataset is created using Spectral Embedding [29], and p_{ij} is used to reposition the data points using a force-directed graph layout whose convergence is achieved by decreasing attractive and repulsive forces (given by p_{ij}) using Stochastic Gradient Descent [36]. During the definition of hierarchical levels, each landmark represents a set of data points in the level below it. Keeping track of this representation is essential to support interaction with hierarchical levels and requests for more detail (more data points) about a data subset. We assign each non-landmark data point on level H_{i-1} to a landmark on level H_i in two steps. The first step consists of iterating through the landmarks and assigning their neighbors to be influenced by them. Eventually, a landmark will try to "represent" an already represented data point. In that case, we assign the data point to be part of the neighborhood of the closest landmark. This procedure assigns most of the data points to a landmark; however, it is necessary to look for cases where data points are not in any landmark's neighborhood. For this particular case, we iterate over each data point on H_{i-1} to check if it is not associated with any landmark, in which case we look for one in its neighborhood. Algorithm 1 details this process. For each non-landmark data point (Lines 2 and 3), we check its neighborhood (Line 5). In case a neighbor v is a landmark (Line 6), we set the current data point to be represented by v (Lines 7 and 8). If v is not a landmark but is associated with one (Line 9), the landmark representing v will also represent the current data point (Lines 10 and 11). Finally, if neither of these cases is true for all the neighbors, we start a depth-first search in the k-nearest neighbor graph and associate the first landmark found with the current data point. In Algorithm 1, A[u] returns the landmark that represents the data point x_u. Finally, when projecting lower hierarchy levels, we want them to be as similar as possible to higher hierarchy levels in order to preserve the mental map during hierarchical exploration. To achieve this, we use the coordinates of the points already projected in the immediately higher hierarchy level. That is, when initializing the low-dimensional representation fed to Stochastic Gradient Descent, we set the coordinates of the known points to be the same as in the higher-level projection. These coordinates move only a fraction θ during the optimization. This fraction can be tuned according to one's needs, although we empirically found θ = 0.01 to yield embeddings that best preserve the mental map throughout the hierarchy.
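The sketch below illustrates this initialization and the damped movement of the inherited points. It is a simplified rendering under illustrative names and structure, not the actual HUMAP optimizer (which follows UMAP's stochastic gradient descent and a spectral initialization for new points).

```python
import numpy as np

def init_embedding(n_points, fixed_idx, fixed_coords, seed=None, scale=10.0):
    """Initialize a 2-D layout for a lower hierarchy level: points already
    projected at the higher level keep their coordinates; the remaining
    points get placeholder positions (random here; HUMAP/UMAP would use a
    spectral or neighborhood-based initialization)."""
    rng = np.random.default_rng(seed)
    Y = rng.uniform(-scale, scale, size=(n_points, 2))
    Y[fixed_idx] = fixed_coords
    return Y

def apply_updates(Y, displacement, fixed_idx, theta=0.01):
    """One simplified optimization step: points inherited from the higher
    level move only a fraction theta of their computed displacement, which
    anchors the layout and preserves the mental map."""
    step = displacement.copy()
    step[fixed_idx] *= theta
    return Y + step
```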
In this section, we provide two case studies. In the first one, we leverage a dimensionality reduction explanation approach to show how HUMAP benefits the analysis in terms of computational complexity and amount of information retrieved. Then, we demonstrate HUMAP's mental map preservation when dealing with a big dataset. For both case studies, we set the number of random walks ω = 20 and the length of each random walk υ = 30. The explanation of dimensionality reduction results has been receiving a lot of attention in recent years [7, 22, 23, 41, 43]. The ability to explain DR results enhances exploratory analysis and can generate previously unknown facts [23]. In this use case, we demonstrate how HUMAP can be beneficial and offer more detailed explanations than traditional DR techniques when applied together with a DR explanation approach called ClusterShapley [23]. ClusterShapley leverages a model-agnostic approach, called SHAP [19], to explain DR results. Thus, it has the limitation of long runtimes when the number of clusters, data points, and dimensions of the input dataset increases. Since HUMAP provides a faithful representation of the dataset in the top-level representation of the hierarchy, it helps to reduce the computational burden of computing explanations for thousands of data points. Besides providing a way to compute explanations more rapidly, HUMAP also leads users to focus on the important information (the dominant structures) that dictates how the embedding is organized. Finally, the drill-down operation provided by HUMAP also helps users to focus on the information that is important at a given moment, since users can cut corners and generate explanations for only a subset of samples. Fig. 3 illustrates the differences between computing explanations for the whole dataset (A) and for the top-level representation of a hierarchy with three levels (B). ClusterShapley explanations are related to clusters and show the contribution of each feature to cluster formation; the clusters shown in this use case were manually defined. In this case, negative SHAP values (x-axis) encode contributions to cluster cohesion, while positive SHAP values encode contributions to spreading the cluster. In the visual representation, the authors encode feature values using a green-to-red color scale and the contributions as positions on the x-axis. When investigating the explanations for a cluster, we can highlight its data points in the visualization by drawing poly-lines connecting their feature contributions, as one can see for the red cluster in (A). The ClusterShapley technique shows the relationship between feature values and contributions to cluster formation. Let us analyze the explanations for the whole dataset. Focusing on the red cluster, we can see that the combination of lower MedRentPctHousInc (median gross rent as a percentage of household income) and higher values for the features PctSameCity85 (percentage of people living in the same city as in 1985), PctSameHouse (percentage of people living in the same house as in 1985), and PctBornSameState (percentage of people born in the same state as currently living) defines the formation of this cluster. Given that, the red cluster could be defined by communities that do not move to other places, which could be explained by their lower rent as part of their income. Then, the blue cluster is defined by communities with a concentration of young people, given the high values for the features agePct12t29, agePct16t24, agePct12t21, and MalePctNevMarr (percentage of males that have never married), which induce negative SHAP values. Moving to the orange cluster, we can easily see that it corresponds to communities of immigrants, characterized by lower values of PctSpeakEnglOnly and higher values of racePctHisp, the two most important features. Finally, the green cluster is characterized by communities of older people. This is explained by the fact that negative SHAP values are related to lower values of PctHousOccup and higher values of the features agePct65up, pctWSocSec, and pctWRetire.
While explaining the projection at once is beneficial for understanding the clusters, it only tells a single story about the data. Besides that, computing the explanations for the whole projection at once is too costly. Using HUMAP, we can understand various facets of the data and can also focus only on what is important for a given analysis. In addition, it reduces the computing time required to generate explanations, helping analysts in their exploratory tasks. For instance, based on the clustering result of HUMAP's top level in Fig. 3, we can identify that the green cluster corresponds to communities with higher purchasing power, due to negative SHAP values for lower values of PctPopUnderPov and PctHousNoPhone and higher values of PctEmploy and PctPersOwnOccuP; the orange and blue clusters represent communities with a higher number of immigrants, while the red cluster represents communities with a higher number of older people, given by negative SHAP values for higher values of pctWSocSec and lower values of PctImmigRecent, PctRecentImmig, and PctRecImmig5. To explore more of the dataset, Fig. 4 shows how HUMAP helps to understand various characteristics hidden in the data. From (A) to (B), we see the selection of a subset of data points together with the projection of the lower hierarchy level. It is important to emphasize that we manually defined the clusters for each hierarchy level, which explains the change of colors between the projections in (A) and (B); after projecting, we define the clusters again using clustering. After defining the clusters and computing the explanations (C), there is a division between communities composed of a higher number of black people (orange cluster) and white people (blue cluster); see the higher values of racepctblack for the orange cluster and of racePctWhite for the blue cluster associated with negative SHAP values. Finally, the red cluster is associated with a higher number of crimes (rapes and burglaries) and also with higher values of HousVacant, the number of vacant households. We can further explore another data subset and project the data onto the first level, as illustrated in the process from (D) to (E). Then, we select a subset containing communities with different characteristics and compute explanations for a new cluster assignment. The blue cluster consists of communities of people living in urban areas (higher pctUrban) in smaller apartments or houses (lower MedNumBR). Besides that, this cluster is also characterized by a lower percentage of households with farm or self-employment income (lower pctWFarmSelf). Interestingly, the orange cluster presents characteristics opposite to the blue cluster, in which there are more households with farm or self-employment income (higher pctWFarmSelf), a lower percentage of people living in urban areas (lower pctUrban), and bigger houses (higher MedNumBR). Finally, there is an interesting aspect of the red cluster in which lower values of racepctblack contribute to the cluster cohesion (negative SHAP values) and higher values contribute to the confusion of this cluster with the other two (positive SHAP values). Then, a higher percentage of houses with no phone (higher PctHousNoPhone), a lower percentage of working mothers (lower PctWorkMom), and a lower percentage of households with farm or self-employment income are three features that contribute to the cluster formation, besides being a common aspect for the white and black communities in this cluster.
As demonstrated throughout this case study, there are more characteristics and insights that we can discover from the data using HUMAP combined with explanation approaches. In addition to this benefit, the running time for computing the explanations is reduced, as shown in Table 1. For this second case study, we explore the learned features of a convolutional neural network (CNN) after training on an image collection. Here, we aim to emphasize the preservation of the mental map across hierarchy levels, which is beneficial when exploring many data points. We used the Fashion MNIST dataset [47], composed of 70000 28 × 28 grayscale images of fashion items (clothing, footwear, and bags). Our architecture consists of 32 3 × 3 convolutional filters, 64 3 × 3 convolutional filters, a dense layer of 128 neurons, and a softmax layer of ten neurons. We used the dense layer as features for each image after training on the dataset, resulting in a 70000 × 128 matrix.
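For reference, a sketch of this architecture and of extracting the 128-dimensional features is shown below. TensorFlow/Keras is our choice of framework for the sketch, and details not stated in the text (pooling layers, activation functions, optimizer) are assumptions.

```python
import tensorflow as tf

def build_model():
    """CNN used in the case study: two convolutional blocks, a 128-neuron
    dense layer (used as the feature space), and a 10-class softmax output.
    Pooling layers, activations, and the optimizer are assumptions."""
    inputs = tf.keras.Input(shape=(28, 28, 1))
    x = tf.keras.layers.Conv2D(32, (3, 3), activation="relu")(inputs)
    x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.Conv2D(64, (3, 3), activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.Flatten()(x)
    features = tf.keras.layers.Dense(128, activation="relu", name="features")(x)
    outputs = tf.keras.layers.Dense(10, activation="softmax")(features)
    return tf.keras.Model(inputs, outputs)

model = build_model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# After model.fit(...) on Fashion MNIST, the 70000 x 128 feature matrix
# comes from the intermediate "features" layer:
feature_extractor = tf.keras.Model(model.input, model.get_layer("features").output)
```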
Fig. 5(A) shows the top level of HUMAP created using three hierarchy levels. Each data point is color-coded using its ground-truth label, and data points with a gray border correspond to misclassifications by the CNN. This top level shows four major structures, with misclassifications positioned at the boundaries of classes. We start by analyzing the separation of clusters in the projection. The subsequent level of the hierarchy (see Fig. 5(B)) and the images representing a few data samples show that the features learned by the CNN can differentiate purses with and without a visible handle. The explanation computed using SHAP [19] shows how each part of the image contributes to increasing (in blue) or decreasing (in red) the prediction probability for each class. Based on the analysis using SHAP, having a visible shoulder strap is decisive for assigning a data sample to the bag group. If a bag does not have this item, its rectangular format contributes to the correct classification. Proceeding to the cluster with three classes sharing boundaries, highlighted in purple in Fig. 5(A), we investigate the misclassification of data samples by projecting another hierarchy level (C). The selection of data samples in Fig. 5(C) shows that the CNN confusion was due to image intensity. For the instance explained using SHAP, the CNN assigned too much importance to the middle part of the sneakers, increasing the probability of the competing class. That is, the sneaker has darker colors, which may be a common characteristic that the CNN learned to recognize high-heels. Then, we project another hierarchy level (Fig. 5(D)) to investigate the group of boots. Inspecting the explanations for two data samples, one sample was apparently misclassified because it does not present protection (or a darker color in the image) in the ankle area; see this aspect in the SHAP values for the two confused classes. The other sample was misclassified for this very same reason. Finally, we drill down to the second level using the data samples from the cluster highlighted in green in Fig. 5(A), resulting in Fig. 5(E). To make the analysis easier, we proceed to the lowest level for the two major clusters highlighted in green in Fig. 5(E); notice that these clusters share certain data samples. The cluster appears to correspond to dresses, t-shirts, and a smaller number of samples from coats. Focusing on the CNN's misclassifications, we see that several samples that do not have sleeves (tank tops and blouses) make the CNN misclassify them as dresses. The SHAP explanations show that the highlighted tank top was misclassified as a dress since the CNN focused on the absence of sleeves. In this section, we numerically evaluate HUMAP against existing hierarchical DR techniques. We compare our approach with Hierarchical Stochastic Neighbor Embedding (HSNE) [33], Multiscale PHATE [17], and UMAP [25]. We cannot control the number of data points in each hierarchy level in HSNE and Multiscale PHATE. Thus, we generate an HSNE hierarchy with three levels and generate a HUMAP hierarchy with three levels based on the number of data points in each HSNE level. Multiscale PHATE does not accept a parameter to specify the number of levels. After fitting the Multiscale PHATE hierarchy, we search for a level containing a number of data points similar to the top level produced by the HUMAP and HSNE techniques. Finally, the reason we evaluate UMAP is to demonstrate its differences from HUMAP regarding mental map and structure preservation. Thus, to simulate hierarchy levels using UMAP, we project the same data points generated in the HUMAP hierarchy. We use the following datasets for evaluation. MNIST [18] is a dataset of 70000 28 × 28 pixel grayscale images of handwritten digits classified among ten digits (from 0 to 9), where each flattened image results in a 784-dimensional vector. Fashion MNIST [47] (FMNIST) is a dataset of 70000 28 × 28 pixel grayscale images of fashion items (clothing, footwear, and bags) divided into ten classes; each flattened image results in a 784-dimensional vector. Mammals is a synthetic dataset designed to have four well-separated classes, and it consists of 20000 data points described by 72 dimensions. Embryoid Body (EB) is a single-cell RNA sequencing dataset of embryoid body data generated over a 27-day time course [28] and divided into five periods of time. For this dataset, we aim to visualize the development of the 31000 cells. The experiments were performed on a computer with the following configuration: Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz, 32 GB RAM, Ubuntu 64 bits, NVIDIA GeForce GTX 1660 Ti 22 GB. Fig. 6 shows the embeddings for the hierarchy levels. HSNE does not maintain the structure presented at the top level for all datasets; that is, while it presents a good relationship among clusters in the top-level embedding, it embeds clusters too close to each other in the projection of the whole dataset, for example, the clusters of the mammals dataset on level 0. For the EB dataset, it does not present the continuous nature of the biological data. Finally, one major issue is that it cannot preserve the mental map throughout hierarchy levels. Multiscale PHATE could not generate embeddings for the FMNIST and MNIST datasets due to their high dimensionality. Although it uncovered the EB dataset's continuous nature, it produced embeddings for the mammals dataset that hinder analysis due to the closeness of the clusters. On the other hand, HUMAP preserves the mental map across hierarchy levels, encoding the information present in lower levels at top hierarchy levels. For example, HUMAP shows the continuous nature of the EB dataset even for the top-level embedding. Finally, UMAP shows on level 2 the correct overview only for the easier datasets (e.g., mammals and MNIST). For the EB and FMNIST datasets, only HUMAP uncovers the structures shown at the lowest level. HUMAP projects higher levels based on the relationships at lower levels, which results in the ability to successfully encode the datasets with less data. Running time.
Since these techniques have stochastic components, we ran the experiments twenty times for each evaluation metric, following the procedure discussed at the beginning of this section. Fig. 7 shows boxen-plots, in log scale, of the run-time (in seconds) to fit the hierarchy and to embed levels 2 and 0 (whole dataset). Except for level 0, HUMAP presents a run-time similar to HSNE on GPU, making it a promising approach for users with limited resources when reasonable run-times are needed. Although HUMAP presents a slightly greater run-time than HSNE on CPU for fitting the hierarchy, it is far superior when embedding the hierarchy levels. Multiscale PHATE seems unsuitable for interactive applications when embedding many data points (for example, level 0 of the mammals and EB datasets). Finally, the difference between HUMAP and UMAP decreases as we approach the lowest level. Neighborhood preservation. Fig. 6 shows the neighborhood preservation (NP) [32] for a varying number of neighbors k (k ∈ [1, 30]). This metric computes the ratio of neighbors preserved after DR for each data point; the mean ratio (from 0 to 1) then gives the NP for the projection. HUMAP shows lower neighborhood preservation for datasets that are meant for other analyses, i.e., the EB dataset is meant for continuity analysis and the mammals dataset was designed to have well-defined classes. Thus, although HUMAP was not superior in NP for these datasets, it could project the structures as needed. HUMAP balances the preservation of small and greater neighborhoods in higher levels, which explains the slightly lower results for the MNIST and FMNIST datasets when compared to HSNE. Finally, the greater result of UMAP for the highest level is explained by the fact that it only considers the data points from level 2 to build the neighborhood graph, that is, it was not an iterative process from level 0 to level 2. HUMAP supports parameter tuning to perform better on neighborhood preservation. The similarity between two landmarks consists of the intersection between their global and local neighborhoods. The local neighborhood of a landmark is the set of its nearest neighbors, and one can increase it by adjusting the parameter β ∈ [0, 1], which adds β|NH(i)| neighbors to the local neighborhood; by default, β is set to 0. We evaluated HUMAP with different values of β and found that there is a clear relationship between the local neighborhood and neighborhood preservation. Besides that, adding only 3% of the current neighborhood (100 neighbors) makes HUMAP surpass HSNE by a great margin for the MNIST dataset on the top hierarchy levels. Visually, the β parameter makes HUMAP cluster similar data points together; interested readers can see the analyses and projections for this scenario in the Supplementary File (Figs. 3 and 4). DEMaP. To analyze how well the techniques convey manifolds, clusters, and other high-dimensional space structures, we use the DEMaP metric [27], which computes the Spearman correlation between geodesic distances in the high-dimensional space and Euclidean distances in the low-dimensional space.
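A minimal sketch of this metric is given below; approximating geodesic distances with shortest paths over a k-nearest neighbor graph is an assumption about implementation details that the text does not spell out.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr
from sklearn.neighbors import kneighbors_graph

def demap(X_high, Y_low, k=15):
    """DEMaP: Spearman correlation between geodesic distances in the
    high-dimensional space (approximated by shortest paths on a kNN graph)
    and Euclidean distances in the low-dimensional embedding."""
    graph = kneighbors_graph(X_high, n_neighbors=k, mode="distance")
    geodesic = shortest_path(graph, directed=False)
    euclid_low = squareform(pdist(Y_low))
    iu = np.triu_indices_from(geodesic, k=1)
    keep = np.isfinite(geodesic[iu])          # ignore disconnected pairs
    rho, _ = spearmanr(geodesic[iu][keep], euclid_low[iu][keep])
    return rho
```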
Fig. 9 shows that HUMAP was superior to HSNE and UMAP on level 2, and specifically to HSNE on GPU on level 0, for the MNIST and FMNIST datasets. From the images, the distributions of (HUMAP, UMAP) for the EB and FMNIST datasets on level 2 and of (HUMAP, UMAP) for the EB dataset on level 0 are not statistically different after a t-test; the others are different with p-values < 0.00001 (see Supplementary File, Table 2, for details). For the mammals dataset, HUMAP shows higher values when embedding the whole dataset, providing evidence of our technique's stability across hierarchical levels. While HSNE and Multiscale PHATE show better results for the first level, they lose important information when drilling down the hierarchy. On the HSNE side, the relationship between clusters is lost. On the Multiscale PHATE side, the technique does not encode the intra-cluster relationships well. Finally, HUMAP conveys the continuous structures present in the EB dataset even for the top-level embedding, reflected in its DEMaP values. At level 0, however, HSNE on CPU and Multiscale PHATE present higher DEMaP values. Another important characteristic is that, unlike HSNE and UMAP, HUMAP is able to convey the overview of the data relationships at higher hierarchy levels. Although HUMAP presents lower DEMaP values for mammals at the top level and for EB at the first level, Fig. 6 shows its stability across hierarchy levels (mental map preservation). Fig. 6: Visual analysis of the embeddings generated for the top and lowest hierarchical levels using a hierarchy of three levels; for each dataset, the top-level embedding appears on the left, and the lowest level (whole dataset) appears on the right. The preservation of the mental map is important to not mislead users and to avoid adding cognitive overload during exploration; this characteristic also holds for subsets of data points, which we address in the following section. Finally, the fixing term for preserving the mental map reduces the quality of the whole projection (level 0) according to the DEMaP metric. When setting the embedding of lower hierarchy levels to follow the pattern of higher hierarchy levels, the freedom the optimization algorithm has to position the points is limited. In the Supplementary File (Section 5), we provide the same analysis letting the optimization algorithm run freely (θ = 1), which results in higher DEMaP values. Mental map preservation. To quantify the mental map preservation, we use Procrustes Analysis on subsequent hierarchy levels. That is, for the hierarchy levels i and i + 1, we retrieve the points present in both i and i + 1 and use them to compute the disparity between the two sets of points. Fig. 10 shows that HUMAP is far superior to the other techniques at preserving the mental map (the lower the disparity, the better). This result confirms the visual representations of Fig. 6, where HUMAP clearly reduces the amount of rotation and variation between hierarchy levels.
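A minimal sketch of this measurement with SciPy is shown below; the bookkeeping for matching the shared points between levels is an illustrative choice.

```python
import numpy as np
from scipy.spatial import procrustes

def level_disparity(coords_upper, coords_lower, shared_upper_idx, shared_lower_idx):
    """Procrustes disparity between two consecutive hierarchy levels,
    computed only over the points present in both embeddings.

    coords_upper / coords_lower -- 2-D coordinates of levels i and i+1
    shared_*_idx -- row indices of the shared points in each embedding,
                    listed in the same (matching) order
    A lower disparity means the mental map is better preserved."""
    A = np.asarray(coords_upper)[shared_upper_idx]
    B = np.asarray(coords_lower)[shared_lower_idx]
    _, _, disparity = procrustes(A, B)
    return disparity
```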
Continuity. Fig. 11 shows the continuity values for a varying number of neighbors; this metric measures the extent to which the original clusters are preserved. Except for Multiscale PHATE on mammals for level 2, all of the techniques performed really well, with values greater than 0.9. Here, we can see the expected effect that, for the EB dataset, HUMAP preserves the global structure rather than focusing only on cluster formation; Fig. 6 illustrates this effect. Trustworthiness. The results from neighborhood preservation are comparable with Trustworthiness, since it measures to what extent the local structure is retained, as shown in Fig. 12. As well as for continuity, all of the techniques performed well, with values greater than 0.9, except for Multiscale PHATE, which shines when dealing with continuous datasets and conveying global information. Fig. 9: Evaluation of the techniques' ability to represent complex structures such as clusters, manifolds, and other relationships. Finally, as commented earlier, HUMAP balances the local and global structures and provides results comparable with techniques that do not worry about this trade-off. This section focuses on comparing the techniques on their ability to project subsets of data points. We do not analyze Multiscale PHATE since there is no way to drill down its hierarchy based on classes of data points; besides that, Multiscale PHATE does not handle MNIST's and FMNIST's dimensionality. Fig. 13a shows the same pattern as Fig. 6, in which HUMAP conveys the structures at the top level and preserves the mental map across hierarchical levels. In this case, we project a hierarchy with four levels, adding more complexity to the analysis. In Fig. 13a, the subset on the first level was lasso-selected, and the subsequent subsets were selected based on specific classes. DEMaP. Fig. 13b shows that HUMAP is superior to HSNE for all of the scenarios, and it only loses to UMAP on the fourth level for the MNIST dataset; notice that UMAP alone does not consider the relationship with higher levels for projection. Mental map preservation. As was done for the full hierarchy levels, Fig. 14 shows the disparity between subsequent hierarchy levels. Unlike UMAP and HSNE, HUMAP successfully preserves the mental map even for smaller portions of data, which confirms the visual representation of Fig. 13a. The results presented here show that HUMAP is visually superior to HSNE and UMAP at maintaining the relationships among clusters and data points across hierarchy levels, supported by the quantitative metrics. This characteristic is essential to maintain a mental map and not mislead analysts during interaction. For this analysis, except for UMAP, the techniques take nearly the same time to embed data points on the four levels, and HSNE has an advantage when embedding the subset for level 0 due to its GPU implementation (please see the results for all metrics in the Supplementary File, Section 4). The experiments show that HUMAP outperforms current approaches in the ability to represent structures present in high-dimensional spaces while being competitive with techniques implemented on GPU regarding running time. Another important aspect of our approach is that it preserves the relationships among data clusters and other structures through the different hierarchy levels. This critical characteristic makes HUMAP suitable for progressive analysis since it preserves the overall structures of higher hierarchy levels in lower hierarchy levels. In the following, we discuss some aspects of the analysis and our technique. Embedding initialization. For the past two years, there was a consensus that UMAP was superior to t-SNE in the preservation of global information (i.e., the relationship among clusters). Recent analysis shows that the initialization of the low-dimensional representation plays a major role in the quality of these two algorithms, where PCA [15] is preferable to random initialization for t-SNE. HSNE uses t-SNE with random initialization. This fact may explain the lower results of HSNE for the lowest level (whole dataset) based on the DEMaP metric.
However, for higher hierarchy levels, HSNE (and HUMAP) encodes the global neighborhood well, so we believe that such an initialization step would not compromise the results. Currently, the HSNE API does not offer an easy way to start the embedding with PCA initialization. Finally, PCA initialization does not solve the non-preservation of the projection mental map across hierarchy levels. Landmark selection. To define the different hierarchy levels, we must select representative data points (or landmarks) that will constitute the subset of the dataset for a given hierarchy level. We empirically evaluated several strategies for this task. Selecting the landmarks from denser neighborhoods (e.g., by random sampling) only works for evenly distributed datasets; otherwise, the hierarchy levels only encode part of the data (the dense regions). Using the density information to create a vector of probabilities (e.g., a soft-max of σ_i) returns too many data points in non-dense regions. Finally, clustering approaches do not work well for our problem due to their computational complexity and their difficulty in capturing the data organization in the presence of outliers and other data structures, such as clusters of different shapes. DR techniques are great tools for high-dimensional data analysis. However, traditional techniques cannot uncover substructures while providing an overview of the dataset. Thus, hierarchical DR approaches offer analysis following the visualization mantra, in which analysts focus on important information and retrieve details according to demand. Nevertheless, the hierarchical approaches in the literature do not preserve the mental map throughout hierarchy levels or are not suitable for the majority of datasets. In this paper, we present HUMAP, a novel hierarchical DR technique. Our approach addresses problems that are not well covered by current techniques, making it a good alternative for high-dimensional data analysis. Besides preserving the mental map and the global and local relationships, HUMAP offers tunable parameters that make it easy to focus on global or local neighborhood preservation. We plan to extend our technique to investigate different distance measures among data points and landmarks in future work. We also plan to implement GPU versions to let users explore HUMAP's capabilities even further. Finally, we also plan to investigate novel strategies to assess hierarchical DR techniques and to perform user experiments on the importance of mental map preservation for hierarchical approaches.
[1] Efficient hierarchical-PCA dimension reduction for hyperspectral imagery
[2] Semi-supervised learning with interactive label propagation guided by feature space projections
[3] Coarse graining of data via inhomogeneous diffusion condensation
[4] A generic model for projection alignment applied to neural network visualization
[5] Topic-based coordination for visual analysis of evolving document collections
[6] Towards a quantitative survey of dimension reduction techniques
[7] Supporting analysis of dimensionality reduction results with contrastive learning
[8] Introduction to Markov chain Monte Carlo
[9] Visual feature fusion and its application to support unsupervised clustering tasks
[10] Focus+context exploration of hierarchical embeddings
[11] CyteGuide: Visual guidance for hierarchical single-cell analysis
[12] Glimmer: Multilevel MDS on the GPU
[13] Hierarchical principal component analysis (PCA) and projection to latent structure (PLS) technique on spectroscopic data as a data pretreatment for calibration
[14] Local affine multidimensional projection
[15] Principal component analysis
[16] Facetto: Combining unsupervised and supervised learning for hierarchical phenotype analysis in multi-channel image data
[17] Multiscale PHATE exploration of SARS-CoV-2 data reveals multimodal signatures of disease. bioRxiv
[18] MNIST handwritten digit database
[19] A unified approach to interpreting model predictions
[20] Eleven grand challenges in single-cell data science
[21] Visualizing data using t-SNE
[22] Contrastive analysis for scatter plot-based representations of dimensionality reduction. arXiv e-prints
[23] Explaining dimensionality reduction results using Shapley values
[24] A hybrid visualization approach to perform analysis of feature spaces
[25] UMAP: Uniform manifold approximation and projection for dimension reduction
[26] Towards perceptual optimization of the visual design of scatterplots
[27] Visualizing structure and transitions in high-dimensional biological data
[28] Embryoid body data for PHATE
[29] On spectral clustering: Analysis and an algorithm
[30] Multidimensional projection for visual analytics: Linking techniques with distortions, tasks, and layout enrichment
[31] HiPP: A novel hierarchical point placement strategy and its application to the exploration of document collections
[32] Least square projection: A fast high-precision multidimensional projection technique and its application to document mapping
[33] Hierarchical stochastic neighbor embedding
[34] Visualizing the hidden activity of artificial neural networks
[35] Visualizing time-dependent data using dynamic t-SNE. EuroVis '16
[36] An overview of gradient descent optimization algorithms
[37] Network visualization and analysis of spatially aware gene expression data with InsituNet
[38] Scatterplots: Tasks, data, and designs
[39] Empirical guidance on scatterplot and dimension reduction technique choices
[40] The eyes have it: A task by data type taxonomy for information visualizations
[41] Attribute-based explanations of non-linear embeddings of high-dimensional data
[42] ImaCytE: Visual exploration of cellular microenvironments for imaging mass cytometry data
[43] Using multiple attribute-based explanations of multidimensional projections to explore high-dimensional data
[44] Visual analysis of mass cytometry data by hierarchical stochastic neighbor embedding reveals rare cell types
[45] Analysis of multiblock and hierarchical PCA and PLS models
[46] Steerable, progressive multidimensional scaling
[47] Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms