Submitted 4 December 2017
Accepted 10 January 2018
Published 29 January 2018

Corresponding author
Daniel Alcaide,
daniel.alcaide@kuleuven.be,
daniel.alcaide@esat.kuleuven.be

Academic editor
Klara Kedem

Additional Information and
Declarations can be found on
page 16

DOI 10.7717/peerj-cs.145

Copyright
2018 Alcaide and Aerts

Distributed under
Creative Commons CC-BY 4.0

OPEN ACCESS

MCLEAN: Multilevel Clustering
Exploration As Network
Daniel Alcaide and Jan Aerts
Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and
Data Analytics, KU Leuven, Leuven, Belgium
imec, KU Leuven, Leuven, Belgium

ABSTRACT
Finding useful patterns in datasets has attracted considerable interest in the field of
visual analytics. One of the most common tasks is the identification and representation
of clusters. However, this is non-trivial in heterogeneous datasets since the data needs
to be analyzed from different perspectives. Indeed, highly variable patterns may mask
underlying trends in the dataset. Dendrograms are graphical representations resulting
from agglomerative hierarchical clustering and provide a framework for viewing the
clustering at different levels of detail. However, dendrograms become cluttered when
the dataset gets large, and the single cut of the dendrogram to demarcate different
clusters can be insufficient in heterogeneous datasets. In this work, we propose a visual
analytics methodology called MCLEAN that offers a general approach for guiding
the user through the exploration and detection of clusters. Powered by a graph-
based transformation of the relational data, it supports a scalable environment for
representation of heterogeneous datasets by changing the spatialization. We thereby
combine multilevel representations of the clustered dataset with community finding
algorithms. Our approach entails displaying the results of the heuristics to users,
providing a setting from which to start the exploration and data analysis. To evaluate our
proposed approach, we conduct a qualitative user study, where participants are asked to
explore a heterogeneous dataset, comparing the results obtained by MCLEAN with the
dendrogram. These qualitative results reveal that MCLEAN is an effective way of aiding
users in the detection of clusters in heterogeneous datasets. The proposed methodology
is implemented in an R package available at https://bitbucket.org/vda-lab/mclean.

Subjects Data Science, Visual Analytics
Keywords Exploratory data analysis, Graph and network visualization, Hierarchical clustering,
Visual analytics

INTRODUCTION
Determining the number of clusters in a dataset is a frequent problem in data clustering,
and is a distinct matter from the algorithm of actually solving the clustering problem.
The correct choice of the number of groups is often ambiguous depending on the shape
and scale of the points in a dataset and the desired clustering resolution by the user. The
optimal choice of clusters depends on the intended use, but in general, it strikes a balance
between the maximum compression using a single cluster and the highest resolution of the
data by assigning each data point to its own cluster.

How to cite this article Alcaide and Aerts (2018), MCLEAN: Multilevel Clustering Exploration As Network. PeerJ Comput. Sci. 4:e145;
DOI 10.7717/peerj-cs.145

https://peerj.com
mailto:daniel.alcaide@kuleuven.be
mailto:daniel.alcaide@esat.kuleuven.be
https://peerj.com/academic-boards/editors/
https://peerj.com/academic-boards/editors/
http://dx.doi.org/10.7717/peerj-cs.145
http://creativecommons.org/licenses/by/4.0/
http://creativecommons.org/licenses/by/4.0/
https://bitbucket.org/vda-lab/mclean
http://dx.doi.org/10.7717/peerj-cs.145


Several clustering algorithms have been proposed for partitioning datasets (Jain, Murty
& Flynn, 1999). Most of these rely on parameter settings, such as the number of clusters
in k-means, the reference value (ε) in DBSCAN or the cutoff distance in a hierarchical
clustering. These parameters differ from the algorithm, but either directly or indirectly
specify the number of clusters. Setting these parameters demands either detailed pre-
existing knowledge of the data or time-consuming trial and error. Moreover, a singular
cutoff can hide interesting underlying structures. In the real world, there might not be an
single sensible cutoff, and it is common that automatic clustering methodologies ignore
particular characteristics of clusters, as some of these might be for example particularly
dense or sparse.

As Boudjeloud-Assala et al. (2016) state, ‘‘the clustering process is not complete until it
is evaluated, validated, and accepted by the user. As such, visual validation and exploration
can improve understanding of clustering structure, and can be very effective in revealing
trends, highlighting outliers, and showing clusters’’. Visualizing clustering results can help
to quickly assimilate the information and provide insights that support and complement
textual outputs or statistical summaries. Typical questions to be answered regarding
clustering results include how well defined the clusters are, how far away they are from
each other, what their size is, and if the observations belong strongly to the cluster or
only marginally. Therefore the exploration of the different cluster scenarios and the
identification of similar record groups (i.e., patterns) in the dataset is a challenge for the
user (Vogogias et al., 2016).

Hierarchical clustering is a widely used and effective algorithm to answer these questions,
as it provides a framework for viewing the clustering at different levels of detail by imposing
a hierarchy on it using a tree (Friedman, Hastie & Tibshirani, 2001). During the cutoff
selection process of the tree, the analyst can instantly obtain insights from the graphical
representation that suggest the adequacy of the solution but hierarchical clustering does
have some drawbacks: (1) the dendrogram representation becomes cluttered when datasets
get large; (2) a single cut of the dendrogram is sufficient when the dataset is homogeneous.
However, when the dataset is heterogeneous, multiple cuts at different levels might be
required. (3) If patterns are present at different levels, choosing a cutoff will hide all but
one of these.

Clustering methods often are a fixed process: loading a dataset, setting parameters,
running the algorithm, and plotting the results. In other words: clustering is used generally
to analyze the data, not to explore it (Boudjeloud-Assala et al., 2016). The integration of
visualization and algorithm into the same model is a possible solution to make the clustering
process dynamic. The framework to perform interactive visual clustering (IVC) presented
by Boudjeloud-Assala et al. (2016) demonstrated a significant advantage in data mining
since it allows users to participate in the clustering process by leveraging their visual
perception and domain knowledge. As recommended by Keim, Mansmann & Thomas
(2010), we believe that if we adapt the visualization environment and combine it with the
clustering approach, this combined approach can be used to provide a very natural way for
users to explore datasets.

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 2/19

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.145


We suggest a novel and generic clustering and exploration approach called MCLEAN
(Multilevel Clustering Exploration As Network) for grouping and visualizing multiple
granularities of the data that enables: (1) exploration of the dataset using a overview-
plus-detail representation, (2) simplification of the dataset using aggregation based on
the similarity of data elements, (3) detection of substructures by means of community
detection algorithms, and (4) inclusion of the human in the process of selection the
number of clusters. Our methodology follows a synergistic approach that combines the
strengths of connectivity-based algorithms, community detections techniques and the
ability of humans to visually detect patterns, to explore moderately large datasets. It is a
visual exploratory and clustering method that permits the user to interact with the algorithm
results. The method combines hierarchical clustering algorithms with interactive tools to
find optimal clusters and visualize them in a simplified network representation. Network
visualizations are an effective means to understand the patterns of interaction between
entities, to discover entities with interesting roles, and to identify inherent groups or clusters
of entities (Liu, Navathe & Stasko, 2014). The MCLEAN methodology is implemented in
an R package available at https://bitbucket.org/vda-lab/mclean.

The remaining part of this paper is organized as follows. In the section ‘Background’ we
give an overview of related work in multilevel clustering and graph visualization techniques
as an exploratory tool. The section ‘Methods’ describes the proposed visualization technique
for clustering exploration in detail, followed by the section ‘Evaluation’, in which we present
an evaluation of our approach. Finally, the section ‘Conclusions and Future Work’ presents
conclusions and possible directions for future work.

BACKGROUND
The proposed framework allows the user to employ tacit knowledge in the clustering
process in order to detect substructures. This process provides a multilevel environment
through overview-plus-detail offering both a general outlook of the data grouping and the
precise union of a subset of elements using graphs. To set our work in context, we present
a set of examples of visual multilevel clustering and the network transformation of data to
identify patterns.

Visual multilevel clustering
There are several methods to perform clustering analysis, but only a few of them support
visual analysis. Even fewer provide interactive exploration capabilities of the clusters in
different levels of detail. However, the importance of visual interaction for performing
clustering analysis is increasingly recognized (Nielsen et al., 2012), as the expert users are
capable of steering the analysis to produce more meaningful results. The tacit knowledge
often motivates the decisions of the users that algorithms are not able to process or
incorporate by themselves. Therefore, including a human in the loop for taking decisions
and for guiding the analysis is essential (Vogogias et al., 2016).

Hierarchical clustering has been long used in many different fields including biology,
social sciences, and computer vision due to the ease of interpreting the output by the user.
The selection of the clusters is based on a single similarity threshold, where the tree is cut at

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 3/19

https://peerj.com
https://bitbucket.org/vda-lab/mclean
http://dx.doi.org/10.7717/peerj-cs.145


a uniform height. Unfortunately, large and heterogeneous datasets usually require a more
flexible approach allowing the user to explore different clustering scenarios. Some methods
have been proposed to cut the tree at different levels. Langfelder, Zhang & Horvath (2007)
suggested an automatic approach that cuts the branches of the dendrogram in different
levels based on their shape. Obulkasim, Meijer & Van de Wiel (2015) proposed a procedure
to detect clusters from the dendrogram, called guided piecewise snipping. The method
overcomes the drawbacks of the fixed height cut approach by allowing the piecewise rather
than the fixed-height cut and incorporating external data to decide upon the optimal cut.
In the same line of research, MLCut (Vogogias et al., 2016) is a tool that provides visual
support for exploring dendrograms of heterogeneous data sets at different levels of detail.

Partition-based clustering techniques such as k-means and CLARANS (Ng & Han,
2002) attempt to break a data set into k clusters optimizing a given criterion. Boudjeloud-
Assala et al. (2016) presented a semi-interactive system for visual data exploration of
multidimensional datasets using iterative clustering. Their framework connects the user
and the data mining process, which allows the user to play an active role in the clustering
tasks. Looney’s approach (Looney, 2002) implements a process of removing small clusters
in an iterative way, reassigning them into more dense regions. In doing this, consistency
in the clustering results is improved. Similary, Bruneau & Otjacques (2013) proposed an
approach to integrate user preferences into the clustering algorithm in an interactively
way through 2D projection of the dataset. Rinzivillo et al. (2008) proposed an exploratory
methodology for exploring a large number of trajectories using clustering techniques. The
grouping of the trajectories is progressively applied by the users refining the parameters of
the clustering algorithm.

Graph representation
The dendrogram visual representation is not scalable to larger datasets. A technique
presented by Chen, MacEachren & Peuquet (2009) uses a uniform threshold to provide
improved visibility by simplifying the dendrogram representation. This is a useful technique
for summarising the dendrogram in a selected level of detail and making it fit in smaller
displays. However, it does not provide support for multilevel cuts or data exploration at
multiple levels.

Given a matrix whose entries represent the similarity between data items, many methods
can be used to find a graph representation. In fact, modeling data items as a graph is a
common conceptualization used in hierarchical clustering algorithms. In a more general
approach, Ploceus (Liu, Navathe & Stasko, 2014) offers an approach for performing
multidimensional and multilevel network-based visual analysis on tabular data. Users
can flexibly create and transform networks from data tables through a direct manipulation
interface. Ploceus integrates dynamic network manipulation with visual exploration for a
seamless analytic experience.

The WhatsOnWeb system (Di Giacomo et al., 2007) takes advantage of graph-based
visualization created by the results of a Web search engine. In their system, a search query
produces a graph that represents sets of Web pages as nodes, which are connected if
documents are sufficiently semantically related. The strength of the relationship is encoded

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 4/19

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.145


with an edge weight and a topological clustering algorithm is recursively applied to the
graph, forming a graph hierarchy and showing different levels of information.

Systems presented for clustering and exploration in Duman, Healing & Ghanea-Hercock
(2009), Desjardins, MacGlashan & Ferraioli (2007), Beale (2007) and Lee et al. (2012)
transform the data into a spring-embedded graph layout, encoding the distance between
the elements as forces in the force-directed layout. The objective in these systems is the
projection of the distances in a reduced dimension allowing clustering assignment using
partitioning-based methods. Links are usually omitted in the representation facilitating
the readability of the spatialization of the nodes. They present an alternative to standard
dimension reduction methods such as projection pursuit or multi-dimensional scaling.

The network exploration of MCLEAN can be considered close to the solutions proposed
for the navigation of the clustering results for large-scale graph visualization systems, such
as Eades & Feng (1996) and Eades & Huang (2000). They allow the user to navigate a graph
by iteratively expanding or collapsing the aggregated nodes (meta-nodes). However, users
often lose context when navigating clustered graphs with deeper hierarchies (Abello, Van
Ham & Krishnan, 2006).

METHODS
The MCLEAN method takes a similarity matrix of all data records as input, and produces
a simplified graph representation showing a higher abstraction of the clustering process.
MCLEAN combines two visual representations. First, an overview plot (barcode-tree),
related to a dendrogram and topological barcode plot, shows how the general cluster
structure changes for different values of a parameter ε, indicating how close points need
to be in the multi-dimensional space to be considered belonging to the same cluster.
Second, a node-link plot represents the clustering results at a given ε. For this ε, clustering
information in the node-link diagram is dual-layered. First, graph connected components
correspond to data clusters at this threshold ε. Second, different colours within a connected
component indicate that this subnetwork would be split when using a more stringent ε; in
other words, it indicates substructures in this cluster.

A connected component is a subgraph in which all the vertices are directly or indirectly
connected. We use connected components to define the clusters in the dataset. In addition,
MCLEAN employs community detection algorithm to find subclusters inside connected
components. As a result, user knowledge (tacit or other) can inform on whether a cluster
is distinct or is a part of a larger cluster. This ambiguity is common in heterogeneous
data sets.

As most of the clustering techniques, the agglomerative algorithm that we use depends on
one single parameter. This parameter is a threshold (�) that defines the distance of union
between two data elements. We find similarities between the MCLEAN approach and
topological data analysis (TDA). Analyzing the multidimensional spaces from a topological
structure perspective, interpreting the persistent homology by calculating the number
of connected components (b0 from betti numbers) and using the persistence concept to
define the optimal threshold of network representation prove that although the aims are
distinct they share a same philosophy of analysis (Topaz, Ziegelmeier & Halverson, 2015).

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 5/19

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.145


Figure 1 Workflow diagram of MCLEAN algorithm, consisting of four steps: (1) graph transforma-
tion, (2) node aggregation, (3) community detection and, (4) barcode-tree creation.

Full-size DOI: 10.7717/peerjcs.145/fig-1

The MCLEAN method consists of four parts as illustrated in Fig. 1: (1) transformation
of the distance matrix into a node-link representation based on the threshold defined; (2)
simplification of the network creating aggregated nodes; (3) detection of substructures
employing community detection algorithms; and (4) exploration of the resulting networks
for different threshold values.

The methodology in this section is illustrated using a dataset taken from the UCI
repository website (see Fig. 2). This dataset contains 600 examples of control charts
synthetically generated as described by Alcock & Manolopoulos (1999). We used Dynamic
Time Warping (DTW) for measuring similarity between the temporal sequences. Figure 2
illustrates both representations of the raw data (Figs. 2A and 2B) and classical visualizations
of the distance matrix such as the dendrogram (Fig. 2C) and a scatterplot of the two first
dimensions of multidimensional scaling (Fig. 2D).

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 6/19

https://peerj.com
https://doi.org/10.7717/peerjcs.145/fig-1
http://dx.doi.org/10.7717/peerj-cs.145


Figure 2 Representation of a synthetic dataset that contains 600 examples of control charts syntheti-
cally generated by the process in Alcock & Manolopoulos (1999). (A) Time series are treated as a unique
group. (B) Underlying (hidden) time series are split by their label. (C) Dendrogram using single linkage.
(D) Two first dimensions of classical multidimensional scaling. Color represents the label in the dataset.

Full-size DOI: 10.7717/peerjcs.145/fig-2

Graph transformation
Multidimensional Scaling (MDS) projects the data elements in reduced dimension ordina-
tion space. Two or three dimensions are often used, which is based on ease of visualization
rather than on the dimensionality of substructures in the data. Unfortunately, in some
cases, these projections blur patterns due to the heterogeneity of the distances and the
limitations of the space visualized. Therefore, a change to the spatialization (such as network
visualization) can help to overcome the limitations of complex datasets. An example of
these weaknesses can be seen in the MDS applied to the Synthetic dataset in Fig. 2D.

Although the distance matrix does not contain explicit network semantics, MCLEAN
uses this approach to transform the encoding of distances by the use of links in the network.
Moreover, the algorithm employed in the final drawing of the network (i.e., force-directed
graph) is optimized to avoid overlapping between the nodes.

The graph transformation step of MCLEAN is similar to the DBSCAN method (Ester et
al., 1996), in that it relies on a parameter � which defines the radius that designates points to
be lying in each other’s neighbourhood. In DBSCAN, a second parameter numPts is used to
define the minimal number of points that can constitute a cluster. In MCLEAN, however,
all datapoints are considered network nodes, and datapoints that are within a distance �
from each other are linked. The result of this step produces a graph where there exists a path
between two nodes if and only if they belong to the same connected component. At this stage
of the methodology clusters are represented as connected components in topological space.

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 7/19

https://peerj.com
https://doi.org/10.7717/peerjcs.145/fig-2
http://dx.doi.org/10.7717/peerj-cs.145


Figure 3 Node-link network transformation using force-directed layout from the distance matrix us-
ing a distance threshold of 150 in part (A), 190 in part (B), 220 in part (C) and, 290 in part (D). All data
elements are represented as a node in the network. Edges are defined based on the threshold.

Full-size DOI: 10.7717/peerjcs.145/fig-3

Figure 4 Network representation of the clustered dataset using a parameter ε of 150 in part (A), 190 in
part (B), 220 in part (C) and, 290 in part (D). This representation preserves the same structure shown in
Fig. 3.

Full-size DOI: 10.7717/peerjcs.145/fig-4

Figure 3 shows the graph transformation process for four snapshots of different parameters
� applied to the same dataset. As � increases, the number of links grows between the nodes.

Node aggregation
In case of large datasets, the node-link representation can become visually overwhelming
for the user without a proper level of aggregation. The challenge is to extract understandable
information buried in the structure of multiple nodes and links. In addition, a layout of
the entire graph is costly to compute. MCLEAN simplifies and highlights the structure of
the raw network. This process of simplification is founded on the use of aggregating nodes
(meta-nodes) that represent a subgraph at a higher level of abstraction.

Node aggregation is based on degree centrality, where the degree of a node is defined
as the number of connections that the node has within a network. This value is computed
for all nodes, and the highest one is the first candidate to be the center of an aggregated
node (meta-node). All nodes connected directly with the candidate are converted into an
aggregated node. All connections with other data elements are inherited in the meta-node
keeping the structure of the connected component. The result of node aggregation for the
graphs created in Fig. 3 is shown in Fig. 4.

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 8/19

https://peerj.com
https://doi.org/10.7717/peerjcs.145/fig-3
https://doi.org/10.7717/peerjcs.145/fig-4
http://dx.doi.org/10.7717/peerj-cs.145


Figure 5 Illustration of node aggregation process for a set of fifteen elements. (A) Network without
aggregation. Node 8 in the network is the best candidate to build the first meta-node. All nodes directly
connected to this node become part of the meta-node. (B) Network after the creation of the first meta-
node. The aggregation continues until all nodes are aggregated. The best candidates are node 4 and node
14 at this step. (C) Network at last stage of the aggregation process. Node 1 and 6 become individual meta-
nodes. (D) Resulting aggregated network.

Full-size DOI: 10.7717/peerjcs.145/fig-5

Our simplification graph approach was designed to preserve the structure of the input
graph. According to Archambault, Munzner & Auber (2008), a topologically preserving
graph must respect the following two properties:
1. Edge Conservation: an edge exists between two meta-nodes m1 and m2 if and only if

there exists an edge between two leaves in the input graph l1 and l2 such that l1 is a
descendant of m1 and l2 is a descendant of m2.

2. Connectivity Conservation: any subgraph contained inside a meta-node must be
connected.
By respecting these two properties, we ensure that the resulting graph preserves the

topological features of the initial graph: edge conservation guarantees that any edge in the
simplified graph is present in the initial graph, while connectivity conservation ensures that
any path can continue through any meta-node (Archambault, Munzner & Auber, 2009).

In MCLEAN, meta-nodes are created through the densest nodes (highest degree) in a
connected component. The node with the highest degree is the best candidate to be the
center of the meta-node. Figure 5A shows an illustration of a connected component where
node eight is the best candidate. All nodes connected directly to the best candidate become
part of the meta-node as shown in Fig. 5B. A meta-node inherits the edges with the external
nodes or meta-nodes that do not belong to it. The aggregation is an iterative process until
all nodes become part of a meta-node. Aggregated nodes are excluded from the process
preventing them to be included into another meta-node. For example node ten is part of
the meta-node of node eight. Therefore, it cannot be included in or be a candidate for
a new meta-node although it has the same degree as four and fourteen in Fig. 5B. The
number of connected components or clusters does not change after the simplification
process. Figure 5B show the result of the node aggregation process.

Community detection
The simplified network representation (Fig. 4) preserves structural data in a compressed
way, which together with community detection allows revealing substructures inside
connected components. A community refers to a group of nodes that are internally

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 9/19

https://peerj.com
https://doi.org/10.7717/peerjcs.145/fig-5
http://dx.doi.org/10.7717/peerj-cs.145


Figure 6 Network representation of the clustered dataset using the distance threshold of 150 in part
(A), 190 in part (B), 220 in part (C) and 290 in part (D). Communities are detected through Infomap. The
coloring of nodes illustrates the communities detected by the algorithm.

Full-size DOI: 10.7717/peerjcs.145/fig-6

highly connected. Community detection in networks is not a trivial problem, and many
algorithms have been proposed. MCLEAN relies on the Infomap algorithm (Information-
theoretic method) (Rosvall & Bergstrom, 2008), which provides multilevel solutions for
analyzing undirected, directed, unweighted, and weighted networks. In MCLEAN, the
number of data elements in each simplified node is used as vertex weight in the Infomap
algorithm to reduce the effect of aggregation. Different communities in a single connected
component are shown in different colors. Figure 6 shows the networks created after graph
transformation (Fig. 3) and node aggregation (Fig. 4) applying the results of community
detection. Prevalence of communities increases with network size, as shown in Fig. 6 where
part A does not reveal any substructure but part D shows three in a single connected
component.

Barcode-tree
As indicated above, the generated connected components depend on the value of parameter
�, as can be seen in the four subplots in Fig. 3. In general, the network consists of isolated
vertices for small values of the threshold. At the largest value, the entire dataset is a single
connected component. The selection of a representative threshold without prior knowledge
of the underlying space is however difficult for any dataset. In addition, heterogeneous
datasets may need multiple levels of partitions and therefore will require the exploration
of multiple thresholds.

In order to provide guidance in the parameter choice and a contextual overview of the
relation between � and clustering results, MCLEAN generates these graphs across a range of
� values. These are subsequently combined in the tree representation called barcode-tree,
which is inspired by both a clustering dendrogram and barcode representation (Topaz,
Ziegelmeier & Halverson, 2015) as used in topological data analysis.

The barcode-tree (Fig. 7) is a visual representation of cluster arrangement. The horizontal
axis corresponds to threshold � and refers to the distance measure of union between
the data elements that define the network (see ‘Graph transformation’ section). The
individual components are arranged along the vertical axis of the plot. At any given
threshold, the number of connected components is the number of lines that intersect the
vertical line through the threshold. Meta-nodes are formed in the join points that are

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 10/19

https://peerj.com
https://doi.org/10.7717/peerjcs.145/fig-6
http://dx.doi.org/10.7717/peerj-cs.145


Figure 7 Barcode-tree for a sequence of thresholds from 0 to 300 by steps of 5 using gradient color to
represent the number of communities for each connected component.

Full-size DOI: 10.7717/peerjcs.145/fig-7

aggregations of individual data elements or existing meta-nodes at a smaller threshold
(see ‘Node aggregation’ section). This tree overcomes the limitations of binary structure
of a dendrogram, allowing for a more clear representation of branches. Moreover, the
barcode-tree implements a leaf ordering method motivated by the MOLO algorithm
presented by Sakai et al. (2014). The branches are evaluated backwards recursively (from
the single cluster until the singleton) to be the center of the subtree at each threshold
avoiding the crossing of the branches.

Meta-nodes for a small threshold are aggregated into new ones created by the larger
threshold: if �1 ≤�2 ≤�3 ≤···≤�N−1 ≤�N then M1 ⊆M2 ⊆M3 ⊆···⊆MN−1 ⊆MN
with Mi being the meta-nodes in network i. If the user is interested in understanding the
structure of the input data, then topological hierarchies are useful tools to explain the
origin of all edges viewed in a cut. Both the objective for the barcode-tree view in MCLEAN
and the barcode in TDA is to find the persistent topological structures across a range of
thresholds. Those structures which persist over an extensive range are considered signals of
the underlying topology. As the threshold changes, the topological structures of network
change accordingly.

In Fig. 7, we see the representation of the connected component for the range of �
from 0 to 300. For �=100, we see 600 connected components because there are no
connections amongst the individual elements in the dataset. For �=220, we see a big
connected component and a significant subset of individual elements, reflecting the fact
that some vertices have joined into a larger connected component. For �=290, we see a
single connected component that indicates the joining of all data elements.

The resolution of the plot depends on the number of evaluations, showing a general
overview with only a few thresholds (Fig. 8E) or allowing detailed understanding of
connected component composition with a more dense covering of thresholds (Fig. 8A).

Although the exploration of the connected components path around a threshold of
interest can give an intuition of the resulting network, the analysis of the connected

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 11/19

https://peerj.com
https://doi.org/10.7717/peerjcs.145/fig-7
http://dx.doi.org/10.7717/peerj-cs.145


Figure 8 Set of five curves of connected components vs. threshold distance according to different
granularities. (A) barcode-tree for the range of ε (0 to 300) by steps of 1, (B) by steps of 2, (C) by steps of
5, (D) by steps of 10 and (E) by steps of 15 and (F) by steps of 20.

Full-size DOI: 10.7717/peerjcs.145/fig-8

components using the network representation (Fig. 6) is a necessary step to identify hidden
substructures using only the tree representation. For example, at threshold 290 in Fig. 7, we
identify a single connected component. However, we identify more details in the structure
of the network at Fig. 6D.

EVALUATION
To understand the implications of the proposed methodology and the interaction between
the visuals, we performed a qualitative evaluation regarding learnability and usability
of MCLEAN. We recruited six participants, including four doctoral and two post-
doctoral researchers in the area of data science with knowledge of clustering techniques
(e.g., hierarchical or k-means clustering) and dimensionality reduction techniques (e.g.,
multidimensional scaling and PCA). None had seen or used MCLEAN before the evaluation
test. The goal of the evaluation was to identify qualitative insights about how well MCLEAN
supports the identification of patterns according to the simplification of the dataset.

Tasks and procedures
We gave a brief introduction to MCLEAN, explaining the fundamentals of the methodology
and demonstrating the main functionalities of the interface developed to interact with the
network and barcode-tree including the bidirectional selection of elements between the
two visuals (see Fig. 2). A training exercise was performed to familiarize the participants
with the MCLEAN workflow using the Fisher Iris dataset (Fisher & Marshall, 1936). We
asked the participants to explore the general patterns in the barcode-tree and specific
topological structures utilizing the network representation and community finding results.

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 12/19

https://peerj.com
https://doi.org/10.7717/peerjcs.145/fig-8
http://dx.doi.org/10.7717/peerj-cs.145


Figure 9 Detection of patterns between the barcode-tree and dendrogram. Part (A) highlights the four
patterns detect by five out of six participants in the barcode-tree. Part (B) shows the four patterns detected
in the dendrogram. Pattern B1* was identified as an additional pattern by two out of five participants.

Full-size DOI: 10.7717/peerjcs.145/fig-9

We repeated the exercise for the actual evaluation over the control charts dataset, explaining
only that it concerned time-series data. We asked the participants to think aloud, observed
their interaction with the interface, recorded their patterns selection as hand-written notes,
and sought their impressions and comments on the methodology after they completed the
tasks. To conclude we asked them to complete a questionnaire to evaluate the efficacy and
their satisfaction of MCLEAN compared to dendrogram. We also sought to know how
difficult the methodology was to learn and use, if there were any problematic design issues,
and how we might be able to address the difficulties experienced by the participants.

Results and analysis
The evaluation exercise was split into three parts: detection of patterns using the barcode-
tree, selection of thresholds comparing the dendrogram and barcode-tree, and detection
of patterns combining the network representation and barcode-tree. After the exercise, the
questionnaire was provided to obtain user satisfaction and additional feedback.

Detection of patterns using the barcode-tree
In a first evaluation, we sought to identify to what extent participants are able to identify
the different underlying patterns as shown in Fig. 2B. Five participants (participants A–E)
identified four different patterns in the temporal dataset using the barcode-tree exclusively
for the range of � 0 to 300 by steps of 5 (Fig. 7). Participant F identified two patterns using
the same representation, grouping pattern A1, A2 and A3 as a single pattern and pattern
A4 independently from the rest (see Fig. 9). Identical results were found in the dendrogram
exploration. In addition, pattern A4 was classified as a group of outliers by participants
D and E.

When identifying four patterns, users A-E aggregated Type 3 and Type 5 (see Fig. 2B)
signals into a single pattern (pattern A2; see Fig. 9A), and Type 4 and Type 6 in pattern
A3. In both cases, the pairs behave similarly, but in opposing directions: a continually
increasing or decreasing trend, or a shift in the middle of the time series. In each pair, the

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 13/19

https://peerj.com
https://doi.org/10.7717/peerjcs.145/fig-9
http://dx.doi.org/10.7717/peerj-cs.145


global distance between the different types is small compared to the rest of patterns due to
the sequence alignment using DTW.

When using the dendrograms, (Fig. 9B), participants A–E identified between three and
five patterns. Two or three groups were detected in the middle, and the heterogeneous data
elements (pattern B1) were recognized as an additional pattern and misunderstood as two
independent clusters by users B and E due to the location of the branches. Participant
C identified a single cluster containing signals of Type 1, Type 3 and Type 5, and
another containing Type 4 and Type 6. This result shows a slight loss of perception in
the dendrogram compared to barcode-tree and a possible potential misinterpretation of
the dendrogram due to to the position of the branches.

Changes in tree resolution did not present a change in the interpretation of participants
when resolution was increased, i.e., steps of 1 and 2 (Figs. 8A and 8B) but it did when
the resolution decreased. Three participants (B, C and E) detected six patterns when we
evaluated the number of connected component in steps of 20 as shown in Fig. 8F. This fact
reveals that different resolutions lead to different possible interpretations of the data.

Selection of cutoffs in dendrogram and barcode-tree
Using dendrogram exploration, only participant F experienced difficulties in cutoff
selection, whereas participants A-E selected a single cutoff between thresholds 180 and
195, describing two or three notable clusters and ungrouped data-elements. Using the
barcode-tree, participants A-E selected a similar threshold. Participant D investigated an
additional threshold at 220. Participant F picked the threshold 285 with the intention of
exploring the network representation. The number of cutoffs was not limited in any of the
representations allowing the user to explore different partitioning perspectives. Overall,
users were more confident in choosing thresholds using the barcode-tree than when using
the dendrogram. A persistent segment starts at the � threshold of 185 until the join of three
clusters at threshold 202. Discussion with the participants indicated that this persistence in
the barcode-tree makes for better readibility and therefore higher confidence in threshold
selection. In contrast, the binary union of the branches and non-optimisation of the leave
ordering in the dendrogram can lead to misinterpretation of the cutoff selection leaving
some elements outside of a potential cluster.

Detection of patterns combining the network representation and
barcode-tree
In this part, we aimed to evaluate the detection of the structures through community
detection and interaction between the visualizations. We identified three relevant thresholds
for the network representation at the start of the three most persistent topological structures
(see Fig. 9A), more specifically at threshold 187, 202, and 283, and invited the participants
to describe the patterns seen using both the network and the tree representations (Fig. 10).
We encouraged them to use the interaction between these visual encoding to clarify their
relationship. Three participants (C, E and, F) identified six patterns at threshold 187,
while the others recognized four. The number of patterns recognized was three for all
participants at thresholds 202 and 283. Both the network representation and the color-
encoding to represent the communities detected by Infomap were clear for all participants.

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 14/19

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.145


Figure 10 Network representations of the synthetic time-series dataset (Fig. 2) at the beginning of
the three most persistent structures detected in the barcode-tree. (A) Network at threshold 187. The
grey node corresponds to Type 1 signal. The two connected components correspond to signals of Type
3/5 (blue and red) and Type 4/6 (orange and green), respectively. In both cases, ascending—respectively
descending—signals are combined in a single connected component but still distinguish between gradual
or stepwise change based on colour. (B) Network at threshold 202. The most significant connected com-
ponent in the network integrates all signals in the dataset, excluding Type 2 represented as individual grey
elements. Two communities represented as different colors distinguish the ascending (blue) and descend-
ing (orange) patterns of the network. (C) Network at threshold 283. The single connected component net-
work still allows the detection of the ascending and descending patterns and the high variability of signal
Type 2 (green) due to the community detection algorithm.

Full-size DOI: 10.7717/peerjcs.145/fig-10

The difference in number of perceived patterns shows a critical sense of the community
detection results, demonstrating the added value of the human in the pattern selection.

User satisfaction and comments
All the participants indicated that they liked the MCLEAN methodology, especially the
obtained interpretation due to the change of the layout in the network creation. Although
some participants considered the selection of thresholds and interpretation of community
detections nontrivial, they still agreed that the methodology was consistent and the learning
curve was not too high. Participants strongly favored the use of MCLEAN over dendrogram
in terms exploration and clustering technique due to the better readability of the tree and
the power of combining the two visualizations interactively. This indicates the benefits
of this methodology as an interactive visual clustering facilitating the integration and
evaluation of the results by the user.

CONCLUSION AND FUTURE WORK
In this paper, we described a method for interactive multi-resolution exploration of
clustering results in complex datasets. Evaluation experiments indicated that combining
visualizations and analytical techniques can increase the understanding of the information
for the user by providing more transparency and confidence to the process. Although the
number of clusters and their quality is strongly related to user behavior, we believe that
this is actually a strength of the system (one that was specifically aimed for), and that
these approaches used in conjunction are crucial to allowing a user-centric approach to
information discovery, to exploit heterogeneous data sources better.

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 15/19

https://peerj.com
https://doi.org/10.7717/peerjcs.145/fig-10
http://dx.doi.org/10.7717/peerj-cs.145


Although the presented network and barcode-tree representations help the user in
gaining insight in their data, there are some clear points for future work. For example,
the current approach relies on single linkage clustering whereas average and/or complete
linkage clustering might be more useful for particular datasets (especially where the distance
matrix does not exhibit gaps). In addition, the current visual encoding of the barcode-tree
shows visual artifacts (parallel lines merging with a cluster) depending on the granularity
level used. Finally, it will be useful to investigate further methods for directly comparing
how data elements are integrated across thresholds.

In conclusion, incorporating the domain user in the clustering process itself allows
for retaining the richness of multilevel patterns in cluster results. MCLEAN facilitates
integrating tacit or other user knowledge in clustering result interpretation and exploration,
while simplifying the representation of groups especially in the presence of noise or outliers.
We argue that the MCLEAN approach provides new opportunities beyond existing
techniques for cluster visualization and exploration.

ACKNOWLEDGEMENTS
The authors wish to thank Houda Lamqaddam for guidance in the design of the evaluation,
and the evaluation participants for valuable feedback.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
This research was supported by imec strategic funding 2017, IWT SBO Accumulate 150056,
and KU Leuven CoE PFV/10/016 SymBioSys. The funders had no role in study design, data
collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
imec strategic funding 2017.
IWT SBO Accumulate: 150056.
KU Leuven CoE PFV/10/016 SymBioSys.

Competing Interests
The authors declare there are no competing interests.

Author Contributions
• Daniel Alcaide conceived and designed the experiments, performed the experiments,
analyzed the data, wrote the paper, prepared figures and/or tables, performed the
computation work, reviewed drafts of the paper.
• Jan Aerts conceived and designed the experiments, analyzed the data, wrote the paper,
reviewed drafts of the paper.

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 16/19

https://peerj.com
http://dx.doi.org/10.7717/peerj-cs.145


Data Availability
The following information was supplied regarding data availability:

Bitbucket: https://bitbucket.org/vda-lab/mclean.

REFERENCES
Abello J, Van Ham F, Krishnan N. 2006. Ask-graphview: a large scale graph visualization

system. IEEE Transactions on Visualization and Computer Graphics 12(5):669–676
DOI 10.1109/TVCG.2006.120.

Alcock RJ, Manolopoulos Y. 1999. Time-series similarity queries employing a feature-
based approach. In: 7th Hellenic conference on informatics. River Edge: World
Scientific Publishing, 27–29.

Archambault D, Munzner T, Auber D. 2008. GrouseFlocks: steerable exploration of
graph hierarchy space. IEEE Transactions on Visualization and Computer Graphics
14(4):900–913 DOI 10.1109/TVCG.2008.34.

Archambault D, Munzner T, Auber D. 2009. TugGraph: path-preserving hierar-
chies for browsing proximity and paths in graphs. In: Visualization symposium,
2009. PacificVis’ 09. IEEE Pacific. Beijing, China: Piscataway: IEEE, 113–120
DOI 10.1109/PACIFICVIS.2009.4906845.

Beale R. 2007. Supporting serendipity: using ambient intelligence to augment user
exploration for data mining and web browsing. International Journal of Human-
Computer Studies 65(5):421–433 DOI 10.1016/j.ijhcs.2006.11.012.

Boudjeloud-Assala L, Pinheiro P, Blansché A, Tamisier T, Otjacques B. 2016. In-
teractive and iterative visual clustering. Information Visualization 15(3):181–197
DOI 10.1177/1473871615571951.

Bruneau P, Otjacques B. 2013. An interactive, example-based, visual clustering system.
In: Information visualisation (IV), 2013 17th international conference. London:
Piscataway: IEEE, 168–173 DOI 10.1109/IV.2013.21.

Chen J, MacEachren AM, Peuquet DJ. 2009. Constructing overview+ detail dendrogram-
matrix views. IEEE Transactions on Visualization and Computer Graphics
15(6):889–896 DOI 10.1109/TVCG.2009.130.

Desjardins M, MacGlashan J, Ferraioli J. 2007. Interactive visual clustering. In: Pro-
ceedings of the 12th international conference on intelligent user interfaces. Honolulu,
Hawaii: New York: ACM, 361–364 DOI 10.1145/1216295.1216367.

Di Giacomo E, Didimo W, Grilli L, Liotta G. 2007. Graph visualization techniques for
web clustering engines. IEEE Transactions on Visualization and Computer Graphics
13(2):294–304 DOI 10.1109/TVCG.2007.40.

Duman H, Healing A, Ghanea-Hercock R. 2009. An intelligent agent approach for visual
information structure generation. In: Intelligent agents, 2009. IA’09. IEEE symposium
on. Nashville: Pisacataway: IEEE, 55–62 DOI 10.1109/IA.2009.4927500.

Eades P, Feng Q-W. 1996. Multilevel visualization of clustered graphs. In: International
symposium on graph drawing. Springer, 101–112.

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 17/19

https://peerj.com
https://bitbucket.org/vda-lab/mclean
http://dx.doi.org/10.1109/TVCG.2006.120
http://dx.doi.org/10.1109/TVCG.2008.34
http://dx.doi.org/10.1109/PACIFICVIS.2009.4906845
http://dx.doi.org/10.1016/j.ijhcs.2006.11.012
http://dx.doi.org/10.1177/1473871615571951
http://dx.doi.org/10.1109/IV.2013.21
http://dx.doi.org/10.1109/TVCG.2009.130
http://dx.doi.org/10.1145/1216295.1216367
http://dx.doi.org/10.1109/TVCG.2007.40
http://dx.doi.org/10.1109/IA.2009.4927500
http://dx.doi.org/10.7717/peerj-cs.145


Eades P, Huang ML. 2000. Navigating clustered graphs using force-directed methods.
Journal of Graph Algorithms and Applications 4(3):157–181 DOI 10.7155/jgaa.00029.

Ester M, Kriegel H-P, Sander J, Xu X. 1996. A density-based algorithm for discovering
clusters in large spatial databases with noise. In: KDD’96 proceedings of the second
international conference on knowledge discovery and data mining. Portland, OR: Palo
Alto: AAAI Press, 226–231.

Fisher R, Marshall M. 1936. Iris data set. UC Irvine Machine Learning Repository.
Available at http://archive.ics.uci.edu/ml/datasets/Iris.

Friedman J, Hastie T, Tibshirani R. 2001. The elements of statistical learning. Vol. 1. New
York: Springer Series in Statistics New York.

Jain AK, Murty MN, Flynn PJ. 1999. Data clustering: a review. ACM Computing Surveys
(CSUR) 31(3):264–323 DOI 10.1145/331499.331504.

Keim DA, Mansmann F, Thomas J. 2010. Visual analytics: how much visualization
and how much analytics? ACM SIGKDD Explorations Newsletter 11(2):5–8
DOI 10.1145/1809400.1809403.

Langfelder P, Zhang B, Horvath S. 2007. Defining clusters from a hierarchical clus-
ter tree: the Dynamic Tree Cut package for R. Bioinformatics 24(5):719–720
DOI 10.1093/bioinformatics/btm563.

Lee H, Kihm J, Choo J, Stasko J, Park H. 2012. Ivisclustering: an interactive visual
document clustering via topic modeling. In: Computer graphics forum. Vol. 31. New
Jersey: Wiley Online Library, 1155–1164.

Liu Z, Navathe SB, Stasko JT. 2014. Ploceus: modeling, visualizing, and analyzing tabular
data as networks. Information Visualization 13(1):59–89
DOI 10.1177/1473871613488591.

Looney CG. 2002. Interactive clustering and merging with a new fuzzy expected value.
Pattern Recognition 35(11):2413–2423 DOI 10.1016/S0031-3203(01)00213-8.

Ng RT, Han J. 2002. CLARANS: a method for clustering objects for spatial data min-
ing. IEEE Transactions on Knowledge and Data Engineering 14(5):1003–1016
DOI 10.1109/TKDE.2002.1033770.

Nielsen CB, Younesy H, O’Geen H, Xu X, Jackson AR, Milosavljevic A, Wang
T, Costello JF, Hirst M, Farnham PJ, Jones SJ. 2012. Spark: a navigational
paradigm for genomic data exploration. Genome Research 22(11):2262–2269
DOI 10.1101/gr.140665.112.

Obulkasim A, Meijer GA, Van de Wiel MA. 2015. Semi-supervised adaptive-height
snipping of the Hierarchical Clustering tree. BMC Bioinformatics 16(1):15
DOI 10.1186/s12859-014-0448-1.

Rinzivillo S, Pedreschi D, Nanni M, Giannotti F, Andrienko N, Andrienko G. 2008.
Visually driven analysis of movement data by progressive clustering. Information
Visualization 7(3–4):225–239 DOI 10.1057/PALGRAVE.IVS.9500183.

Rosvall M, Bergstrom CT. 2008. Maps of random walks on complex networks reveal
community structure. Proceedings of the National Academy of Sciences of the United
States of America 105(4):1118–1123 DOI 10.1073/pnas.0706851105.

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 18/19

https://peerj.com
http://dx.doi.org/10.7155/jgaa.00029
http://archive.ics.uci.edu/ml/datasets/Iris
http://dx.doi.org/10.1145/331499.331504
http://dx.doi.org/10.1145/1809400.1809403
http://dx.doi.org/10.1093/bioinformatics/btm563
http://dx.doi.org/10.1177/1473871613488591
http://dx.doi.org/10.1016/S0031-3203(01)00213-8
http://dx.doi.org/10.1109/TKDE.2002.1033770
http://dx.doi.org/10.1101/gr.140665.112
http://dx.doi.org/10.1186/s12859-014-0448-1
http://dx.doi.org/10.1057/PALGRAVE.IVS.9500183
http://dx.doi.org/10.1073/pnas.0706851105
http://dx.doi.org/10.7717/peerj-cs.145


Sakai R, Winand R, Verbeiren T, Moere AV, Aerts J. 2014. dendsort: modular leaf
ordering methods for dendrogram representations in R. F1000Research 3:177.

Topaz CM, Ziegelmeier L, Halverson T. 2015. Topological data analysis of biological
aggregation models. PLOS ONE 10(5):e0126383 DOI 10.1371/journal.pone.0126383.

Vogogias A, Kennedy J, Archaumbault D, Smith VA, Currant H. 2016. MLCut:
exploring multi-level cuts in dendrograms for biological data. In: Cagatay T, Tao
RW, eds. Computer graphics and visual computing conference (CGVC) 2016. Geneva:
Eurographics Association DOI 10.2312/cgvc.20161288.

Alcaide and Aerts (2018), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.145 19/19

https://peerj.com
http://dx.doi.org/10.1371/journal.pone.0126383
http://dx.doi.org/10.2312/cgvc.20161288
http://dx.doi.org/10.7717/peerj-cs.145