Emblaze: Illuminating Machine Learning Representations through Interactive Comparison of Embedding Spaces

Venkatesh Sivaraman, Yiwei Wu, and Adam Perer. February 5, 2022.

Abstract. Modern machine learning techniques commonly rely on complex, high-dimensional embedding representations to capture underlying structure in the data and improve performance. In order to characterize model flaws and choose a desirable representation, model builders often need to compare across multiple embedding spaces, a challenging analytical task supported by few existing tools. We first interviewed nine embedding experts in a variety of fields to characterize the diverse challenges they face and techniques they use when analyzing embedding spaces. Informed by these perspectives, we developed a novel system called Emblaze that integrates embedding space comparison within a computational notebook environment. Emblaze uses an animated, interactive scatter plot with a novel Star Trail augmentation to enable visual comparison. It also employs novel neighborhood analysis and clustering procedures to dynamically suggest groups of points with interesting changes between spaces. Through a series of case studies with ML experts, we demonstrate how interactive comparison with Emblaze can help gain new insights into embedding space structure.

Most state-of-the-art machine learning (ML) techniques work by learning an expressive representation of their input data in a high-dimensional vector space. This representation, also known as an embedding, reflects both the structure of the dataset and the task used to train the model. Embedding representations have increasingly been used to improve performance on a plethora of tasks thanks to deep neural networks such as transformers, which leverage vast quantities of often-unlabeled data to learn highly nuanced representations. However, embedding spaces can acquire unpredictable and undesirable structural features during training, learning shortcuts or biases in the data to perform better on the learning task [5, 42]. Model builders and data scientists need effective tools to probe the structure of their embeddings and to help them choose the best representation for their task.

Although a variety of tools have been developed to visualize and probe embeddings, the problem of extending these techniques to compare across multiple embedding spaces remains an open challenge. For instance, many popular embedding analysis tools use dimensionality reduction (DR) techniques such as tSNE and UMAP to generate 2D scatter plots of the embedding space [22, 33, 40]. These visualizations are well-suited to give a high-level overview of an individual embedding space, particularly when combined with augmentations that highlight distortions due to DR [32]. However, visualizations that juxtapose two DR visualizations side-by-side (e.g. [1, 4, 10]) may become visually taxing and confusing when there are thousands of points displayed and different degrees of distortion in each plot. This juxtaposition approach is also typically limited to two embedding spaces to avoid overwhelming the user, but comparing more than two spaces may be necessary for some model comparison tasks (such as diachronic word embeddings [15]).
To mitigate the complexity of comparing embedding spaces at a global scale using DR, comparison tools such as the Embedding Comparator [4] and embComp [18] have incorporated more focused visualizations that enable comparisons on small neighborhoods of points. These features do address the need for users to analyze local differences between embedding spaces, but the task of finding relevant neighborhoods to compare remains challenging. In the absence of pre-formed hypotheses about where in the embedding space to look, a tool that can adaptively guide analysts to interesting comparisons may be necessary.

In this work, we first present findings from interviews with nine embedding space experts from domains such as language modeling, computational biology, and dimensionality reduction. These interviews showed that model builders tend to rely on ad hoc workflows to analyze embedding spaces, partially because many are skeptical of DR's ability to support their analytical needs. In particular, experts desire tools to build a much richer understanding of their embedding spaces, moving beyond the fixed distance metrics often assumed in DR towards much more complex and nuanced notions of similarity and uncertainty. These observations, combined with the areas of opportunity we identified from prior embedding comparison systems, led us to develop a visual embedding analysis tool that unifies common needs for comparison throughout the processes of model building and dimensionality reduction (see Table 1).

[Table 1. Use cases for embedding space comparison that arise during ML model development and visualization, generated from formative interviews, case studies, and practical experience using Emblaze. In each of these comparison tasks, model builders seek to understand the effect of the comparison target on the overall structure and relationships encoded in their embedding spaces. Emblaze supports comparison for all of these tasks through a unified set of visualization techniques. Recoverable rows include: DR parameters ("Which value of the tSNE perplexity parameter creates the best visual clustering for my data?") and random effects from DR initialization ("How consistent are the neighborhoods and clusters created by this DR algorithm?").]

The resulting system, which we call Emblaze, is a Python framework and web-based visual interface that can be run within a computational notebook environment, making it easy to import and visualize heterogeneous data in tandem with ad hoc workflows. Emblaze is centered around an animated, interactive DR scatter plot, which can be filtered and aligned in place to support navigation of different regions of the space. It incorporates novel visual augmentations that summarize changes for both pairs and larger sets of embedding spaces, and it dynamically suggests clusters of points that exhibit interesting changes. Through case studies with ML experts, we demonstrate the utility of these capabilities not just for comparing models, but also for understanding individual model behavior and reasoning about the effects of 2D projection.

Concretely, the contributions of this work are the following: (1) A series of semi-structured interviews with nine embedding experts in a variety of fields, including natural language processing, computational social science, computer vision, and computational biology.
The resulting qualitative analysis expands upon prior need-finding studies by probing experts' viewpoints on specific embedding analysis and comparison techniques (such as clustering, dimensionality reduction, and embedding space alignment), and by identifying practitioners' perspectives and challenges as they pertain to their unique fields. (2) A comprehensive system, called Emblaze, to compare embedding spaces within a computational notebook environment. The tool improves on previous embedding comparison systems by supporting comparison across several spaces and at many stages of the model-building pipeline. It also introduces new techniques to surface points and clusters with interesting changes, and facilitates rapid iteration through its lightweight notebook-based interface. Emblaze is open source and publicly available.

In this paper, we specifically focus on the problem of embedding space comparison, which can be defined as the comparison of multiple high-dimensional representations of the same set of objects. When designing for comparison, it is important to note that comparative tasks often demand fundamentally different approaches than more general exploratory or analytical ones [14]. However, we hypothesize that given the complexity and opaqueness of embedding spaces, comparison may be beneficial or even necessary to help users understand individual embedding spaces. Therefore, in the following sections we briefly review strategies for visualizing dimensionally-reduced data and embeddings alongside the relevant efforts to extend those methods to comparative tasks.

All dimensionality reduction techniques attempt to project an $n \times d$ matrix of observations in $d$ dimensions to a lower-dimensional $n \times d'$ representation, while attempting to preserve relationships between observations [44]. A variety of augmentations have been proposed to convey the distortions this process introduces, including techniques for visualizing distortions and recovering topology [2, 39], lines whose lengths indicate the degree of projection error [41], and animations between projection axes [12]. In addition, a few systems have incorporated multiple projections, either for interactive parameter selection [13, 34] or to compare proximity relationships in different variants [10]. In this work, we attempt to extend some of these techniques to the problems specific to embeddings in ML, including dataset scale and the need to compare more than two spaces.

Visual analytics work on learned embedding spaces (using DR or otherwise) has most frequently focused on textual embeddings [17, 24], particularly in light of the hidden biases often found in word embeddings [5]. Some comparison tools have been developed to analyze and improve word embedding methods [7, 37], but the more common use case for word embedding comparison is in holding the modeling procedure constant and analyzing semantic differences between different corpora [15, 21, 43]. For example, an early prototype of Emblaze was used as part of a visual interface to compare semantic changes in clinical concepts related to COVID-19 [30]. Note that prior qualitative comparison techniques have typically only been able to visualize small, curated subsets of the embedding spaces, and do not support comparison of groups of points well.

In the life-sciences domain, embeddings are frequently used to represent gene expression levels in cells [3] and human genetic profiles [23], among other applications. Visual comparison tools have been developed for the common task of clustering and interpreting feature values in these embeddings [20, 27].
However, the widespread use of static DR-based visualizations in computational biology has prompted criticism, including new proposed DR techniques [11, 29] and calls to avoid DR-based analysis entirely [6]. (In this paper, we use embeddings to refer to high-dimensional vector representations of objects, and projections for dimensionally-reduced representations of those embeddings.) It is worth noting that, with the notable exception of Sleepwalk, an interactive R-based tool for global-scale embedding exploration [33], the potential for interactive visualizations that enable comparison of multiple DR projections appears to be under-explored in computational biology.

Large-scale embedding spaces for images, such as those learned by convolutional neural networks, are an important application area for which few systems have specifically been developed. The difficulty of auditing such large spaces, as well as the presence of class labels, may explain why embedding visualization is typically eschewed in favor of other model inspection strategies [19]; tools that do incorporate DR have primarily focused on probing image models during training [8, 35]. However, image models can still be analyzed using general-purpose embedding visualization systems, such as the Embedding Projector [40] and Latent Space Cartography [25].

This work builds upon a small number of research systems that have been developed to support general-purpose embedding comparison. Systems by Li et al. [22] and Heimerl et al. [18], for example, utilize summary visualizations of embedding metrics, neighborhood views, and DR plots to facilitate comparison. Meanwhile, Parallel Embeddings [1] utilizes a novel clustering-based visualization to highlight correspondences between embeddings, and the Embedding Comparator [4] features PCA plots of the neighborhoods around selected points that change the most. The latter two approaches offer greater simplicity, but limit analysis to two embedding spaces at the cluster or individual-point level, respectively. Notably, all of these systems require the user to spend time browsing the visualizations in order to find meaningful comparisons, which may be difficult for large datasets. Moreover, when these systems do surface candidate points for comparison, the techniques used are limited to individual points and cannot easily be generalized to clusters [4]. As we discuss in our expert interviews, direct paths to interesting comparisons at varying granularities may be a key factor in gaining better insight into large unlabeled datasets.

To characterize how embedding spaces are currently analyzed and compared, we conducted semi-structured interviews with nine embedding experts across a variety of domains. As listed in Table 2, the majority of participants worked with language-based models, while others were experts in computational biology, computer vision, multimodal machine learning, and signal processing. All participants except one (P7, a DR expert) were situated within machine learning practice rather than DR, consistent with our goal of examining how experts understand embeddings beyond visually projecting them. Interviews were conducted on Zoom and lasted 53 minutes on average.
Transcripts were coded and analyzed to answer three primary research questions.

Most of the embedding experts we interviewed expressed a largely unmet need for tools to help them understand their models: "the tools are very crude, and you're kind of just getting small clues from those little [ad hoc] techniques" (P8). However, consistent with the diversity in their backgrounds, participants varied considerably in the purposes and associated levels of granularity that understanding embedding spaces entailed. In some cases, embedding analysis plays a distant secondary role to task-specific performance metrics; understanding the embedding space is often only useful to debug an underperforming model (P4).

Other participants expressed that embedding analysis was an important, even essential aspect of their work (P1, P2, P6, P7). These experts advocated for a fairly rigorous approach to embedding analysis, in which they would define hypotheses about the structure of the embedding space and test them using a combination of existing tools and ad hoc algorithms. However, these participants were also concerned that the methods they were currently using were too "hand-wavy" (P1) and that their observations may not reflect real patterns in the embedding space structure (P1, P2). Furthermore, their analytical techniques depend on the presence of previously-known points of interest, which they may not have when exploring new datasets (P7). We spent considerable time probing participants' general approaches to embedding analysis because, as discussed in the next section, their experience with embedding comparison was much more limited. The techniques that these experts employ for embedding analysis therefore present useful starting points for our proposed comparison techniques.

Dimensionality reduction. Consistent with the skepticism noted earlier, several participants doubted that DR plots could faithfully support their analyses. Other participants echoed a similar sentiment relating to datasets with large numbers of unlabeled points, where the lack of known structure leads to a "blob of points" in the visualization (P2, P7). Participants' doubts about DR may be surprising, particularly given that several techniques have been developed to visualize distortions and errors in DR [32, 39, 41]. This consensus may have arisen because participants were not experts in DR, although a few participants had occasionally used visual DR tools such as the Embedding Projector [40] and Sleepwalk [33] (P6, P7, P8). Overall, participants perceived limitations in the fixed, error-prone distance transformations induced by DR, which led them to rely on their own handmade code snippets to probe embeddings.

Nearest neighbors. One of the most common methods that participants used to analyze embeddings, especially in the natural-language domain, was to identify points of interest and examine their nearest neighbors in the high-dimensional space (P1, P2, P3, P6, P8). When asked about the importance they placed on understanding embeddings at an individual-point level, participants gave a wide variety of responses depending on their roles and intended use cases. Those who were explicitly performing qualitative analyses on embeddings (P1, P2) used nearest neighbors extensively. For some participants who were building and validating models for downstream use, nearest neighbors were seen as an essential tool for exploration and debugging, even "the most important tool we have" (P5, P6). However, others who work with more heavyweight models and larger datasets (especially in industry) saw it as too time-consuming except to produce concrete demonstrations of results (P4, P9). In other words, while nearest neighbors serve as a very useful indicator of quality, it may currently be too difficult without a priori points of interest to find samples that can be subjectively incorporated into a larger analysis.
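To make this workflow concrete, the kind of handmade nearest-neighbor probe that participants described can be written in a few lines. The following is a minimal sketch of our own (not any participant's code), assuming embeddings in a NumPy array and cosine distance:

```python
# Minimal sketch of an ad hoc nearest-neighbor probe (hypothetical
# example): look up the closest points to a hand-picked point of
# interest in the high-dimensional space.
from sklearn.neighbors import NearestNeighbors

def top_neighbors(X, labels, query_idx, k=10, metric="cosine"):
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(X)
    _, idx = nn.kneighbors(X[query_idx : query_idx + 1])
    # Drop the query point itself and return the k nearest labels
    return [labels[i] for i in idx[0] if i != query_idx][:k]
```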
Clustering and feature analysis. In addition to looking at individual points, some participants also considered groups of points as units of analysis, often naming standard techniques such as $k$-means and hierarchical clustering (P5, P8). For example, P9 described characterizing an image embedding model's weaknesses by probing the defining features of its clusters: "We were seeing that things were clustering based on surface-level features, and we wanted it to be... clustering based on more semantic features. So we were looking at clusters as a way to understand what notion of similarity the feature representation is capturing, and whether that's the notion of similarity that we actually want to capture." (P9) In addition to understanding why points were grouped into a single cluster, participants were also interested in explaining why multiple clusters were close to each other; for example, in computational biology, cell types that cluster close together in an embedding space could reveal an underlying biological similarity (P5, P7). Overall, participants were interested in using clustering techniques, but tended to rely on simple programmatic tools to do so.

Axes and topology. Two participants mentioned using axis-based analysis to understand embedding model behavior, i.e. studying the characteristics of points embedded along an axis between two words in the space, such as "man" versus "woman" (P2, P6). Similarly, P6 described probing generative image models by interpolating along an axis in the feature space between two images. Participants also described attempting or wanting to computationally assess the embedding topology, for example by characterizing the smoothness, continuity, or density of the space (P1, P6, P9). For simplicity and to adopt well-established techniques, we focus on the three main techniques described above as our design focus.

Unlike the techniques that participants use to understand individual embedding spaces, approaches for comparing more than one embedding space at a time were "very rudimentary" (P8) and few in number. Below, we discuss some common needs that experts expressed for embedding comparison, and the often-makeshift strategies they used to address them.

Alignment. Because different embedding spaces usually have very different feature axes even if trained on the same data, a few participants mentioned that it would be helpful to be able to align embedding spaces, particularly for multimodal embeddings (e.g. joint embeddings of images and text, or texts from different languages). Embedding space alignment is an active area of research; nevertheless, participants did use some simple alignment techniques, such as aligning with respect to a single center point or performing a Procrustes alignment, which minimizes the root-mean-square distance between corresponding points (P3, P5).
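As a concrete illustration of the latter technique (our sketch, not a participant's code), a basic Procrustes alignment of two embedding matrices with corresponding rows can be written with SciPy:

```python
# Minimal sketch of a Procrustes alignment between two embeddings of
# the same points: center both matrices, then rotate B onto A.
# Scaling is omitted for brevity; scipy.spatial.procrustes handles it.
from scipy.linalg import orthogonal_procrustes

def procrustes_align(A, B):
    A_c = A - A.mean(axis=0)
    B_c = B - B.mean(axis=0)
    # Orthogonal matrix R minimizing the Frobenius norm of (B_c @ R - A_c)
    R, _ = orthogonal_procrustes(B_c, A_c)
    return A_c, B_c @ R
```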
Comparing model variants. Several participants expressed a desire to understand how model architectures and parameters were affecting the results at a more granular level than macroscopic accuracy metrics. For example, P7 was studying a novel dimensionality reduction technique, and wondered whether there was a better way to help users choose parameters than just looking at several visualizations side-by-side. P6 noted that when choosing a model architecture among many disparate options, they would generally compare the results manually before running a grid search over hyperparameters for the best architecture. Most participants' approaches to comparing model variants were limited to top-level accuracy numbers, which became insufficient in light of more subjective or instance-level requirements on the learned representations.

Based on the techniques and needs that participants described to us, we generated four overarching themes and associated design goals to guide the development of our system.

Goal 2. Support exploration of large datasets by guiding the user to points and clusters that change meaningfully between embedding spaces. Some participants enter the embedding analysis process equipped with specific points to analyze, but many do not extensively inspect the embeddings despite believing that it would be beneficial to do so. This discrepancy may arise because for many large datasets, the lack of labels to differentiate clusters makes it difficult to visually or programmatically pinpoint areas of interest. For example, after developing a new dimensionality reduction technique, P7 described the challenge of interpreting the results on a new dataset: "We run this embedding, [and] we get this cloud of all blue points because we don't know how to color them. What do we do next?" We propose to mitigate this complexity by developing recommendation features that guide the user to meaningful changes, which may also provide a form of clustering.

Goal 3. Support exploration of high-dimensional neighborhoods so users can avoid being misled by distortions due to DR projections. As discussed above, participants were largely skeptical of the ability of DR to accurately capture the structure of an embedding space. In fact, many tended to avoid DR-based tools entirely in order to avoid drawing misinformed conclusions. However, participants did find DR helpful to get an initial impression of the embedding space, and to communicate their results. We hypothesize that when complemented by appropriate comparison tools, DR plots can serve as an effective "map" of the data that enables intuitive navigation and exploration.

Goal 4. Support integration into custom embedding analysis workflows. The diversity of techniques we observed in the interviews indicates that embedding analysis and comparison often require custom, task-specific routines. For example, experts tend to have predefined hypotheses about what characteristics of points to search for, and they utilize specific downstream analyses to assess the embeddings of those points. An effective analysis tool will need to not only provide prebuilt methods for exploration and analysis, but also allow users to move between the system's routines and their own.

We now introduce Emblaze, a system we developed based on the above design goals that seeks to help model builders compare notions of similarity and reliability in embedding spaces. Although Emblaze can be run as a standalone application, it is primarily designed as a widget that can be displayed in an interactive notebook environment. The tool integrates most of the major techniques described in the interviews, including nearest-neighbor analysis, clustering, and embedding space alignment.
In fact, nearest neighbors form the backbone of most of the algorithms embedded in Emblaze, echoing previous embedding comparison efforts [18] and reflecting the importance of nearest neighbors to interviewees across domains. Below, we provide an overview of Emblaze's interface followed by the features we developed to support each of the four design goals.

As depicted in Fig. 1, Emblaze centers on a dimensionality-reduction plot of the dataset that facilitates navigation and selection of points of interest. DR maps can be generated using common techniques (PCA, tSNE, and UMAP) along with a variety of distance metrics (cosine distance is the most common). To control for visual differences due to DR, Emblaze allows users to generate projections using AlignedUMAP, a variant of UMAP that adds a similarity constraint to the objective function [28]. Additionally, the projections are optimally scaled and rotated using Procrustes alignment to minimize coordinate differences [31]. To the left of the main scatterplot is a panel listing the embedding spaces being compared, which we term "frames"; clicking once on a thumbnail opens the comparison interface, and clicking again animates the points in the scatterplot smoothly to their locations in the new embedding space. Meanwhile, the right-hand sidebar contains a variety of tools to manage and analyze selections in the interface, including the nearest neighbors of the current selection, a browser for saved and recent selections, and Suggested Selections.

Neighbor-based metrics are the primary way to compare embedding vectors in Emblaze, because they can be computed in the high-dimensional space and are compatible with any quantitative distance metric. For a point $x$, we define $N^A(x)$ as the set of $k$ nearest neighbors to $x$ in the embedding space $A$ ($k$ is a constant that can be configured by the user, but is set to 100 by default). The rank of a neighbor $y$ in the neighbor set of $x$ is denoted $\mathrm{rank}(y; x)$ and ranges from 0 (first neighbor) to $k - 1$. We frequently employ the Jaccard distance, denoted $J(\cdot, \cdot)$, to compare neighbor sets.

Star Trails. When designing the animation between frames, we drew inspiration from the well-known Gapminder tool and other animated visualizations [16], which tap into the perceptual system's ability to track objects and identify motion outliers. However, some studies have found that animation introduces perceptual inaccuracy compared to small multiples and overlays [36], and that motion outliers may be difficult to reliably perceive [46]. These concerns may be exacerbated in DR plots, which often contain tens of thousands of points (orders of magnitude more than the scatter plots tested in the aforementioned studies). Therefore, we introduce an augmentation that we call a Star Trail, which visually marks the points whose neighborhoods change the most between the frames being compared.

[Figure caption fragment: the Star Trail example compares word embeddings of Twitter and journalistic text, highlighting the addition of "weed" and "acid" as colloquial drug-related words in Twitter, and the removal of drug category names more characteristic of journalistic writing ("narcotic", "amphetamines").]

Neighbor Changes. The sidebar, which normally displays neighborhood information for the current selection, also shows comparison-specific information while the user is comparing two frames $A$ and $B$. First, a simple neighbor differences view shows the nearest neighbors of the selection in each frame (Fig. 2b); points are highlighted in magenta if they are present in $A$ but not $B$, and green if present in $B$ but not $A$. When multiple points are selected, an additional table of Common Changes lists the neighbors that are most commonly added to or removed from the nearest-neighbor sets of the selection (Fig. 2c). To compute Common Changes, we define a function that measures the inverse rank of the neighbors of $x$ present in one embedding but not the other, and aggregate it over the points in the selection.
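The following is a minimal sketch of these neighbor metrics and a Common Changes aggregation. The inverse-rank weighting of $1/(\mathrm{rank}+1)$ is our own assumption for illustration; the system's exact formula is not reproduced here:

```python
# Sketch of Emblaze-style neighbor metrics (our formulation; the exact
# weighting used by the system is an assumption).
from collections import defaultdict
from sklearn.neighbors import NearestNeighbors

def neighbor_sets(X, k=100, metric="cosine"):
    """Ordered k-nearest-neighbor lists for every point in one embedding."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(X)
    _, idx = nn.kneighbors(X)
    return [list(row[1:]) for row in idx]  # drop each point itself

def jaccard_distance(a, b):
    a, b = set(a), set(b)
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

def common_changes(nA, nB, selection, top=10):
    """Neighbors most commonly gained from frame A to frame B across a
    selection, weighted by inverse rank in the new frame."""
    gained = defaultdict(float)
    for x in selection:
        old = set(nA[x])
        for rank, y in enumerate(nB[x]):
            if y not in old:
                gained[y] += 1.0 / (rank + 1)
    return sorted(gained, key=gained.get, reverse=True)[:top]
```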
While the features described above help manage complexity and facilitate comparison given a previously-defined selection, we also sought to help the user find selections that yield interesting comparisons. For example, the Star Trail visualization preferentially highlights points with large nearest-neighbor differences that are also in the vicinity of the current selection, leading the user to potentially interesting similar selections. We also developed a visual summary of change across frames, and an adaptive technique for suggesting selections, which we discuss below.

Color Stripes. We hypothesized that having a fast visual way to assess the amount of change in a group of points across all frames would accelerate the discovery of selections worth focusing on. Therefore, we developed a Color Stripe visualization that uses perceptual color similarity to encode similarity between frames for a given selection $S$. Color Stripes are determined by clustering the frames using a distance metric that captures both how much $S$ changes as a group with respect to its external neighborhood, and how much the neighborhoods within $S$ change. More formally, the distance $d_{\mathrm{frames}}(A, B)$ between frames $A$ and $B$ (Eqn. 2) is computed from two terms, $\Delta_{\mathrm{inner}}$ (Eqn. 4) and $\Delta_{\mathrm{outer}}$, which reflect the change in the "inner" and "outer" neighbor sets of $S$ between the two frames.

To visualize the clustering of frames generated by this distance metric, we assign each frame to a color along a ring in the CIELAB color space (a system in which Euclidean distance approximates perceptual distance). While the relative distances between frames around the ring correspond to differences in hue, the saturation of the colors is determined by the maximum distance between any of the frames. This results in highly consistent selections being represented as indistinguishable grayish hues, while highly varying selections feature bright colors. Examples of the Color Stripes can be seen next to the frame thumbnails to the left of the scatter plot (see Fig. 1), as well as in the Suggested Selections pane (Fig. 3, right).

Suggested Selections. Finding groups of points that exhibit meaningful, consistent neighborhood changes in large embedding spaces is challenging, particularly for groups of points that are not tightly clustered or labeled in the DR projection. Tightly clustered groups pose an additional problem: if a group shifts drastically between two spaces while remaining closely interconnected, the nearest neighbors of each point individually may be largely similar even though the neighborhood around the cluster has changed. To help users identify and navigate to such groups quickly, we created the Suggested Selections feature, which can be accessed through one of the sidebar tabs. As shown in Fig. 3, the suggestion algorithm proceeds in two steps: one to precompute clusters, and one to rank and filter those clusters according to an interest function. In the precomputation step, a clustering of points is generated for each pair of frames $A$ and $B$ using a distance metric (Eqn. 5) that measures the changes in the neighbors gained and lost between the two frames. The effect of this formulation is that pairs of points which gain (or lose) a similar set of neighbors from frame $A$ to frame $B$ will have a small distance.

[Fig. 3 caption fragment: examples use the dataset in Fig. 1 as well as a comparison of face recognition models on the CelebA celebrity faces dataset [26].]
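The sketch below illustrates one way such a change-based distance could be computed, reusing the jaccard_distance helper from the earlier sketch. The equal weighting of gained and lost neighbors is our assumption, not necessarily the exact form of Eqn. 5:

```python
# Sketch of a change-based distance between points i and j (our
# approximation of the idea behind Eqn. 5): points that gain and lose
# similar sets of neighbors from frame A to frame B are "close".
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def point_change_distance(nA, nB, i, j):
    gain_i, gain_j = set(nB[i]) - set(nA[i]), set(nB[j]) - set(nA[j])
    loss_i, loss_j = set(nA[i]) - set(nB[i]), set(nA[j]) - set(nB[j])
    return 0.5 * (jaccard_distance(gain_i, gain_j)
                  + jaccard_distance(loss_i, loss_j))

def suggest_clusters(nA, nB, cutoff=0.5):
    """Group points by how similarly they change (quadratic in n; for
    illustration on small datasets only)."""
    n = len(nA)
    D = np.array([[point_change_distance(nA, nB, i, j) for j in range(n)]
                  for i in range(n)])
    condensed = squareform(D, checks=False)
    return fcluster(linkage(condensed, method="average"),
                    t=cutoff, criterion="distance")
```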
The points are clustered using hierarchical clustering, with a variety of distance cutoffs to produce suggestion results of varying sizes. It is important to note that this clustering is quite different from a clustering performed within an embedding space, as is typically used [1, 27]. Instead, the distance metric in Eqn. 5 explicitly clusters points based on how they change from one frame to another, thereby directly codifying the types of change that we consider most "interesting."

In the suggestion phase, the clusters are ranked both by measures of a priori interest and by relevance to the current visualization state, resembling a classic degree-of-interest function [45]. The a priori interest function is the sum of three metrics: consistency of neighbor gains and losses, changes in the cluster's inner neighbor structure (see Eqn. 4), and amount of neighbor overlap within the cluster. The clusters are then further filtered and ranked based on which frame(s) are being viewed, the currently-selected points and their neighbors, as well as the current bounds of the viewport. This enables the user to pan and zoom around the scatterplot and see Suggested Selections for each area they visit.

Note that we have now proposed two distinct distance functions for seemingly similar purposes, namely $d_{\mathrm{frames}}$ (Eqn. 2) and $d_{\mathrm{points}}$ (Eqn. 5). While the distance metric for Suggested Selections, $d_{\mathrm{points}}$, helps to group together points within a fixed pair of frames, the metric for Color Stripes, $d_{\mathrm{frames}}$, helps to group together frames with respect to a fixed set of points. This relationship is mirrored in how each clustering is used in the interface: users can select clusters surfaced by the Suggested Selections for a particular pair of frames, then use the Color Stripes to get a sense of their variations across all frames. By splitting the task of finding interesting comparisons into two complementary interactions, Emblaze extends the notions of interest established in prior work [4, 18] to support both groups of points and more than two embedding spaces.

Since Emblaze is subject to the caveats of dimensionality reduction expressed in our formative interviews, we took care to distinguish the projection from the original high-dimensional space through visual augmentations and selection operations. These affordances are intended to encourage the use of the DR projection as a navigation tool by which users can find subspaces of interest, as described in Goal 3. When a point is hovered over or selected, lines radiate outward from the point to its nearest neighbors in the high-dimensional space (by the distance metric pre-configured by the user). By assessing how far the lines extend while panning over the plot, the user can quickly see the fidelity of the 2D projection and find points that are far from their nearest neighbors.

The sidebar's neighbor list view, which has been established in prior embedding analysis systems [4, 33, 40], also supports showing neighborhoods for multiple-point selections. A selection $S$'s nearest neighbors in frame $A$ are the top 10 points, among the full set of neighbors of the selected points (i.e. $y \in \bigcup_{x \in S} N^A(x)$), that attain the best total inverse neighbor ranks with respect to the points in $S$. Note that simply listing the neighbors can be misleading if points have very disparate neighbor sets. Therefore, we also display a bar next to each neighbor indicating how many points in $S$ have that point as a neighbor; high values indicate a consistent neighborhood.
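A minimal sketch of this selection-level neighbor list, again assuming a $1/(\mathrm{rank}+1)$ inverse-rank weighting (our assumption, for illustration):

```python
# Sketch of the aggregate neighbor list for a multi-point selection:
# score candidate neighbors by summed inverse rank, and record how many
# selected points share each neighbor (shown as a bar in the sidebar).
from collections import Counter, defaultdict

def selection_neighbors(nF, selection, top=10):
    score = defaultdict(float)
    support = Counter()
    sel = set(selection)
    for x in selection:
        for rank, y in enumerate(nF[x]):
            if y in sel:
                continue  # only consider neighbors outside the selection
            score[y] += 1.0 / (rank + 1)
            support[y] += 1
    best = sorted(score, key=score.get, reverse=True)[:top]
    return [(y, support[y]) for y in best]
```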
Emblaze provides a lasso selection tool to select points that are close together in 2D, which works well when a neighborhood is tightly clustered in the DR projection. For neighborhoods that are not well preserved by the projection, however, the user can use the Radius Select feature to find points within a configurable distance of a center point in the high-dimensional space.

Towards the fourth goal, integrating embedding comparison into custom data science workflows, we implemented Emblaze as a widget that can be run in a Jupyter environment (as shown in Fig. 1). This leads to unique benefits for users as compared to standalone applications. First, importing embedding data into Emblaze is made extremely simple, requiring only a set of coordinate arrays and images and/or text to describe each point. Second, once the viewer widget is instantiated within a Jupyter cell, the user can interact with the state of the system by manipulating either the interface or the underlying Python objects. For example, the emblaze.Viewer object exposes bidirectional properties for the current visible frame, comparison frame, selection, filter, and display settings. This enables interactions in which the user can visually identify a selection of interest, computationally analyze that selection using custom functions, identify new points of interest, then instantly navigate to the new selection in the interface.

Emblaze is open-source and available on PyPI and GitHub (https://github.com/cmudig/emblaze). The system consists of a Python backend, which performs most of the computationally-intensive analyses, and a frontend built with Svelte (https://svelte.dev), which enables reactivity. The scatter plot is implemented using PIXI.js (https://pixijs.com), a popular WebGL-based graphics framework. By implementing most plot rendering in custom shaders, Emblaze is able to display and animate tens of thousands of points smoothly on typical hardware.
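For illustration, a typical notebook session might look like the following sketch, adapted loosely from the project's public documentation; names such as Embedding, EmbeddingSet, TextThumbnails, Field, ProjectionTechnique, and selectedIDs should be verified against the installed release:

```python
# Hypothetical Emblaze notebook session (API names based on the public
# documentation; verify against the current version).
import emblaze
from emblaze.utils import Field, ProjectionTechnique

# X: (n, d) array of high-dimensional vectors; texts: n descriptions
emb = emblaze.Embedding({Field.POSITION: X, Field.COLOR: labels})
emb.compute_neighbors(metric="cosine")

# Project the same embedding several times to compare DR variation
frames = emblaze.EmbeddingSet(
    [emb.project(method=ProjectionTechnique.UMAP) for _ in range(5)]
)
w = emblaze.Viewer(embeddings=frames,
                   thumbnails=emblaze.TextThumbnails(texts))
w  # display the widget in the notebook cell

# Later, read or set the current selection bidirectionally, e.g.:
# w.selectedIDs
```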
As embedding comparison is a relatively nascent task in the literature on ML model analysis, we conducted case studies with ML experts to gain a preliminary understanding of how they might use Emblaze on real-world datasets. We recruited three ML expert researchers who were experienced in, and currently working with, embedding models or dimensionality reduction. The three users (whom we denote U1-U3) prepared datasets from their work, installed Emblaze in their own programming environment, then engaged in a loosely scaffolded think-aloud analysis of their dataset. Participants spent 2-2.5 hours working with the investigators, and were compensated 20 USD per hour. The resulting audio transcripts and usage logs were used to build sequences of actions, which revealed how participants were using each Emblaze feature as part of their analyses.

Note that because Emblaze simply requires a set of embedding matrices and object descriptions, it is not limited to visualizing the final outputs of an embedding model. As depicted in Table 1, Emblaze also supports other tasks such as comparing across different DR techniques, layers of a neural network, training data subsets, or corpora (e.g. for distributional semantic analysis [30]). The three experts' use cases and workflows discussed below represent just a few examples of how Emblaze can be utilized in practice.

U1's research centers around developing improved dimensionality reduction techniques, which necessitates comparing new techniques against existing ones on well-studied sample datasets. Here, they analyzed projections with different DR settings on the UCI Wine dataset, which contains physicochemical information and quality ratings for around 4,900 wines [9]. U1 chose to compare four projections of this dataset by manipulating two variables: the projection technique (standard versus a custom implementation of UMAP) and the num_neighbors parameter (15 and 50). Using the Star Trail visualization to get an overview of changes between two projections of the custom UMAP implementation (shown in Fig. 4a), U1 quickly observed that a large cluster of wines moved from one side of the plot to the other; they described this as an effect of UMAP's random initialization that they would like to overcome in their improved technique. They then used the manual animation slider to slowly interpolate back and forth between the two frames, and observed that the higher num_neighbors parameter had resulted in a more compact, but less distinctly-clustered projection. To support this hypothesis, U1 performed a lasso selection and Isolated to a group of points while in the comparison view, causing the Star Trails to change to highlight changes relevant to that selection. Using the updated Star Trails and animation, they were able to watch the group move from being well-separated to integrated with the central mass of points.

Because the Wine dataset was mostly projected as one large mass of points with few well-separated clusters, U1 wanted to use the Suggested Selections feature to find differences within the central cluster. By scanning the Color Stripes in each suggestion, they identified a selection that differed considerably between the two num_neighbors values. Looking at the sidebar, they perused the Common Changes for the selected group of points and noted that the most commonly added points were high-quality wines. Finally, they manually animated between the two frames while Aligned and Isolated to the selection. This revealed that the group of points was actually two clusters that were positioned next to each other when num_neighbors = 15, but moved away from each other and acquired new neighbors when num_neighbors = 50. This unexpected finding, which required looking at both Common Changes and geometric differences, highlighted tradeoffs between the two parameter choices that were not immediately visible before. U1 subsequently performed a similar analysis on an unlabeled dataset of tweets, and reflected positively on this style of interactive comparison between DR variants.

U2 is working on building machine learning models to distinguish breast cancer lesions from normal tissue in mammograms. At the time of the study, they were interested in analyzing the embedding space their model had learned in order to identify subtypes among the patches labeled as lesions. Therefore, they prepared a dataset consisting of 1,500 mammogram patches, each represented by a 2048-dimensional embedding vector. They then used UMAP to project the embeddings into 2D five times with random initializations; this would allow them to account for the effects of DR variation while exploring the embedding space. U2's first step was to take an overview of the space by hovering over points to show their associated images, enabling them to characterize which regions corresponded to lesions and normal tissues.
Then, using the Star Trails in the comparison view for two of the five DR variants, their attention was quickly drawn to a very long trail between the two main clusters, corresponding to a point labeled as a lesion that was projected with the normal patches in all but one frame. By looking at the points closest to the outlying point in each variant frame, they concluded that the point was likely a mislabeled normal patch that was in fact correctly embedded by the model. Using any one of the projections in isolation, this error case would likely have been missed.

U2 identified clusters of interest by selecting parts of the projection using the lasso-select tool, since they had recently performed a $k$-means clustering of the embedding space and found several contiguous regions that appeared to be meaningful. In one case, they selected a group of points that assumed two different geometries across the five frames: three frames were colored blue-green in the Color Stripes visualization, and the other two orange (shown in Fig. 4b). U2 then opened the comparison view between one of the blue frames and one of the orange frames, and animated between the two to examine how their vicinities changed. By visually inspecting the points that the cluster moved towards, they hypothesized that the orange variants were less accurately isolating the selected neighborhood (although they noted that a medical expert would be needed to confirm which variants were more accurate).

Looking at the variations between DR projections helped U2 gauge the reliability of projections, as well as the possibility of labeling errors: "If some points move a lot, I would want to check them out, see if there's a problem with my data." Conversely, U2 was also excited that Emblaze allowed them to identify groups of points that were consistently projected across different variants, indicating that those relationships were likely stable in the high-dimensional space. For example, they found a Suggested Selection whose patches all depicted marginal areas of the breast, and for which the Color Stripes were all gray (minimal variation between frames). Despite the fact that these points were not all mutual nearest neighbors in the projection, the constancy in their arrangement across multiple initializations provided a strong signal that the model considered them similar. Supporting Design Goal 3, U2 expressed that this assessment of consistency was "definitely, definitely helpful, because there's no way for me to tell" which parts of a projection are reliable otherwise.

U3 is a natural language processing expert working on building embedding representations of knowledge graphs (networks in which nodes represent entities and edges encode facts relating those entities). Starting from a pre-trained BERT model that simply encoded the text of each node, they had developed two versions that were fine-tuned to the facts in the knowledge graph (supervised), as well as a version that transformed the embedding space using a normalizing flow. They were aware that both models performed better than the base BERT model on a downstream task, but they lacked specific examples of how the embedding space structure had changed to yield the improved metrics. Therefore, they loaded a dataset of 5,000 sampled entities from a common-sense knowledge graph, embedded according to the four models. (Since the time of the session, Emblaze has been optimized to visualize many more points, mitigating the need for downsampling.) The four models were jointly projected into 2D using AlignedUMAP, which computes a UMAP with an additional loss term penalizing deviations between frames.
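A sketch of this kind of joint projection with the umap-learn package is shown below; the parameter values are illustrative, not those used in the session:

```python
# Sketch: jointly project the same n entities under four models with
# AlignedUMAP. Identity relations tie corresponding rows together
# across consecutive frames; parameter values are illustrative.
import umap

# frames: list of four (n, d_i) arrays, one per model
relations = [{i: i for i in range(len(frames[0]))}
             for _ in range(len(frames) - 1)]
mapper = umap.AlignedUMAP(
    n_neighbors=15,
    alignment_regularisation=0.01,  # strength of the between-frame penalty
).fit(frames, relations=relations)
projections = mapper.embeddings_  # list of (n, 2) aligned layouts
```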
Initially, U3 focused on the comparison between the base and the two supervised models. They first selected a fairly well-separated cluster in the base model which consisted of color words. Then, opening the comparison view to look at the cluster's Common Changes between the default and supervised spaces, they found that the default model was including several phrases that matched the words in the cluster but not their semantic roles (e.g. "blue umbrella," "yellow ribbon"). Similarly, U3 lasso-selected and Aligned to a cluster of phrases in the base model that contained the word "friend," then animated to the supervised model to see those points migrate apart from each other in the space. These examples confirmed U3's prior hypothesis that the base BERT model was overly reliant on lexical similarity compared to the fine-tuned version.

U3 was eager to use the Suggested Selections feature to find clusters of interest, in particular because the dataset had no labels that could serve as a color encoding for the scatter plot. For example, one suggestion they loaded was a set of 17 points comprising musical instruments (shown in Fig. 5a). In the comparison view between the base and supervised representations, they noticed that the Star Trail visualization was highlighting a few points moving into the cluster. They froze the transition between the two frames and navigated to the origin of each trail, revealing that the cluster was being augmented by less common instruments such as "bagpipe" and "piccolo" (Fig. 5b). They then selected these points individually to scan their neighbor differences views, and concluded that "in the default space, it's just kind of garbage. But for the new space, it's a bunch of instruments. So that's actually very straightforward." Towards Design Goal 4, U3 voiced the importance of verifying these facts in the knowledge graph and did so directly in the notebook by extracting and looking up the selected point IDs.

Comparing the base model against the normalizing flow representation posed a more challenging task, because they expected the flow model to redistribute the space in a "very noisy, not interpretable" way. First, they animated between the two frames several times, noting that many of the clusters in the base model were less obvious in the flow variant. To make sure that these patterns were not just due to the projection, they browsed the Suggested Selections to find a cluster of interpretable points and eventually arrived at a cluster of concepts related to outer space. As above, they browsed the neighbor differences and Common Changes views to find that other space-related terms were commonly being removed in the flow model, while the neighbors most commonly added in the flow model were less sensible. Contemplating the differences between the supervised and flow models with respect to the base, U3 noted that "the benefits [of flow] are not coming from improved alignment [of clusters], it's actually coming from the structure of the space... whereas [the supervised models] do seem to be helping for different reasons." The finding that these two variants both improved quantitative performance, but in very different ways, prompted U3 to think about new model architectures that could leverage the complementarity between the two methods.
All three users thought the tool would be helpful in their work as (1) an interactive interface for DR projections, (2) a way to sanity-check their observations, and (3) a source of concrete examples to complement quantitative model performance metrics. Echoing the deep skepticism of DR expressed in the formative interviews, U3 noted that "there are some inherent caveats with some of the reduction techniques," but that Emblaze, "if anything, highlights and brings more attention to those" through the high-dimensional neighbor comparison views. In support of Design Goal 2, all three users agreed that the Suggested Selections feature was very useful, particularly when there were no supervised labels to visually separate clusters. Overall, participants found that Emblaze made possible exploratory analyses they had never been able to do: "I think there is a lot of functionality that I would never interface with if the tool didn't exist" (U3).

Participants also provided useful feedback on the novel visual augmentations used in Emblaze. All three users were initially confused by the Color Stripe visualization, although they agreed that getting a sense of variation across all frames was important: "Yeah, we definitely need that information, it's very helpful. I wish it was more straightforward" (U2). Participants also wanted the Star Trail visualization to communicate more information about points' relationships to a cluster, such as by highlighting trails differently depending on whether they were entering or leaving a neighborhood (U1). Finally, participants thought notebook integration was very helpful for studying models without leaving their work environment, and suggested that the tool could integrate directly with models to dynamically compute and visualize embeddings for new groups of instances.

By building upon designs from prior work as well as experts' current approaches to analysis and comparison, Emblaze enables a series of comparative workflows on embedding spaces that would have been highly challenging with existing tools. Our think-aloud sessions with ML experts suggest that the tool makes substantial progress towards the four goals identified in our formative study.

Goal 2. Support exploration of large datasets by guiding the user to points and clusters that change meaningfully between embedding spaces. All three users made extensive use of the Suggested Selections feature, particularly when clusters were not well separated by the projection, and found that it worked very well for their datasets. Participants had difficulty reading the Color Stripes visualization at first, a challenge that could be mitigated by simplifying the color encoding and giving it a dedicated space in the UI. However, they all agreed that Emblaze's ability to guide them to interesting and meaningful regions was a powerful addition to their workflow.

Goal 3. Support exploration of high-dimensional neighborhoods so users can avoid being misled by distortions due to DR projections. Users agreed that animating between different DR projections and looking at the neighbor lists was a useful way to disambiguate between artifacts of the projection and true high-dimensional neighborhoods. It may be possible to assist the user's interpretation of these features to make them even more accessible to non-experts. For example, the interface could prompt the user to check the accuracy of a cluster when it is more disparate in the high-dimensional space than it appears in the projection.
Goal 4. Support integration into custom embedding analysis workflows. Participants strongly favored Emblaze's notebook implementation over a standalone application, primarily because of its ease of installation and compatibility with data that participants had previously stored. They also suggested new visualization possibilities if Emblaze were even more tightly integrated with ML frameworks in the future.

The case studies presented here cannot be interpreted as a comprehensive evaluation of Emblaze's features, particularly since our users had not used similar tools before and had no baseline for comparison. Rather, our observations point to novel workflows that model builders can utilize through Emblaze and that can be built upon in future work. Echoing the needs expressed by our interview participants, many of these workflows led to a greater understanding of the notions of similarity that embedding spaces were capturing. For instance, U3's use of Suggested Selections enabled them to quickly find several clusters that diverged from one model to another in similar ways. By identifying common patterns of change across these clusters, they were able to construct a narrative for how the architecture choices underlying each model had resulted in the differences they observed. This process was made much more efficient by U3's back-and-forth interaction between the visualization and code, not only to corroborate findings for large groups of points, but also to quickly load up multiple subsets of the data for a more robust analysis.

Emblaze also afforded our expert users new workflows that helped them reason about the reliability of their embedding space analyses, and what conclusions they could sensibly draw from them. For example, even though U2 was only examining one model space, they were more confident in identifying reliable clusters because they could assess their consistency across DR projections. Comparison also allowed users to easily assess the quality of embedding neighborhoods, which would ordinarily require an intuition built up over many past experiences analyzing embedding spaces. After finding a cluster with substantial variation between two models, for instance, U3 could easily conclude that the cluster was poorly embedded in one model because its neighbors made little sense relative to the more reasonable neighbors in the other model. With enough experience, it would likely be possible to draw similar conclusions based on a single embedding space; Emblaze has the potential to help users build these intuitions more quickly.

Some features that would be important for particular use cases were omitted from this first version of Emblaze for simplicity. Most notably, a few interview participants described wanting to know what feature axes drive the separation of a cluster, e.g. which genes are highly expressed in a particular cluster of cells from a computational biology experiment. Although Emblaze's support for text and image data types covers many typical ML representations, incorporating visualizations to help users interpret points and clusters in tabular data (such as those proposed in prior embedding analysis tools [41]) could expand the tool's applicability even further. In addition, Emblaze has only one scatter plot view that animates between projections, a promising alternative to prior work that juxtaposes multiple projections next to each other [1, 4, 10].
In the future, though, the two approaches could be combined by allowing users to toggle between a single animated scatter plot for large-scale browsing and side-by-side visualizations to compare smaller subsets of the data.

In this work, we have synthesized experts' viewpoints across different domains to construct a tool that enables visualization and exploration across several embedding spaces, previously an extremely difficult task. Our limited evaluation suggests that the system considerably lowers the barrier to embedding analysis and comparison. However, further engagement with model builders as well as non-expert users (such as ML students) is needed to determine how visualization tools can support these tasks even more effectively. Given the increasing societal impact of ML models trained on vast unlabeled datasets, qualitative comparison may help track our progress towards more valid, unbiased, and ethical representations. By making Emblaze open source and publicly available, we hope to spark experimentation and discussion in the ML and visualization communities on how embedding space comparison can help produce more accurate and responsible models.

Acknowledgments. This work was supported by the Center for Machine Learning and Health at Carnegie Mellon University. We thank Carolyn Rosé, Jill Lehman, Denis Newman-Griffis, and Dominik Moritz for their valuable feedback throughout the development of Emblaze. Thanks also to Ángel (Alex) Cabrera for laying the groundwork for reactive Jupyter widgets, which made the notebook implementation of Emblaze possible. Finally, we are grateful to all our study participants for sharing their time and insights.

References
[1] Parallel embeddings: A visualization technique for contrasting learned representations. Proceedings of the International Conference on Intelligent User Interfaces (IUI).
[2] Visualizing distortions and recovering topology in continuous projection techniques.
[3] Dimensionality reduction for visualizing single-cell data using UMAP.
[4] Embedding Comparator: Visualizing differences in global structure and local neighborhoods via small multiples. arXiv.
[5] Man is to Computer Programmer as Woman is to Homemaker?
[7] Visual exploration and comparison of word embeddings.
[8] ReVACNN: Steering Convolutional Neural Network via Real-Time Visual Analytics. KDD Workshop on Interactive Data Exploration and Analytics.
[9] Modeling wine preferences by data mining from physicochemical properties.
[10] Comparing and Exploring High-Dimensional Data with Dimensionality Reduction Algorithms and Matrix Visualizations.
[11] Interpretable dimensionality reduction of single cell transcriptome data with deep generative models.
[12] Rolling the dice: Multidimensional visual exploration using scatterplot matrix navigation.
[13] Interactive Dimensionality Reduction for Comparative Analysis.
[14] Visual comparison for information visualization.
[15] Diachronic word embeddings reveal statistical laws of semantic change.
[16] Animated transitions in statistical data graphics.
[17] Interactive Analysis of Word Vector Embeddings.
[18] embComp: Visual Interactive Comparison of Vector Embeddings (2020).
[19] Visual Analytics in Deep Learning: An Interrogative Survey for the Next Frontiers.
[20] Challenges in unsupervised clustering of single-cell RNA-seq data.
[21] Statistically significant detection of linguistic change.
[22] EmbeddingVis: A Visual Analytics Approach to Comparative Network Embedding Inspection.
[23] Application of t-SNE to human genetic data.
[24] Visual Exploration of Semantic Relationships in Neural Word Embeddings.
[25] Latent space cartography: Visual analysis of vector space embeddings.
[26] Deep Learning Face Attributes in the Wild.
[27] XCluSim: A visual analytics tool for interactively comparing multiple clustering results of bioinformatics data.
[28] How to use AlignedUMAP.
[29] Assessing single-cell transcriptomic variability through density-preserving data visualization.
[30] TextEssence: A Tool for Interactive Analysis of Semantic Shifts Between Corpora.
[31] Ten quick tips for effective dimensionality reduction.
[32] Multidimensional Projection for Visual Analytics: Linking Techniques with Distortions, Tasks, and Layout Enrichment.
[33] Exploring dimension-reduced embeddings with Sleepwalk.
[34] Projection inspector: Assessment and synthesis of multidimensional projections.
[35] DeepEyes: Progressive Visual Analytics for Designing Deep Neural Networks.
[36] Effectiveness of animation in trend visualization.
[37] LAMVI-2: A Visual Tool for Comparing and Tuning Word Embedding Models.
[38] Visual Interaction with Dimensionality Reduction: A Structured Literature Analysis.
[39] Stress Maps: Analysing Local Phenomena in Dimensionality Reduction Based Visualizations.
[40] Embedding Projector: Interactive Visualization and Interpretation of Embeddings. NIPS.
[41] Probing Projections: Interaction Techniques for Interpreting Arrangements and Errors of Dimensionality Reductions.
[42] Image representations learned with unsupervised pre-training contain human-like biases.
[43] Comparative Analysis of Word Embeddings for Capturing Word Similarities. International Conference on Natural Language Processing.
[44] Dimensionality Reduction: A Comparative Review.
[45] "Search, show context, expand on demand": Supporting large graph exploration with degree-of-interest.
[46] Saliency Deficit and Motion Outlier Detection in Animated Scatterplots.