Contextualization of topics

Browsing through the universe of bibliographic information

Rob Koopman · Shenghui Wang · Andrea Scharnhorst

Abstract This paper describes how semantic indexing can help to generate a con-
textual overview of topics and visually compare clusters of articles. The method
was originally developed for an innovative information exploration tool, called Ari-
adne, which operates on bibliographic databases with tens of millions of records [18].
In this paper, the method behind Ariadne is further developed and applied to the
research question of the special issue “Same data, different results” – the better
understanding of topic (re-)construction by different bibliometric approaches. For
the case of the Astro dataset of 111,616 articles in astronomy and astrophysics,
a new instantiation of the interactive exploring tool, LittleAriadne, has been cre-
ated. This paper contributes to the overall challenge to delineate and define topics
in two different ways. First, we produce two clustering solutions based on vector
representations of articles in a lexical space. These vectors are built on semantic
indexing of entities associated with those articles. Second, we discuss how Lit-
tleAriadne can be used to browse through the network of topical terms, authors,
journals, citations and various cluster solutions of the Astro dataset. More specif-
ically, we treat the assignment of an article to the different clustering solutions as
an additional element of its bibliographic record. Keeping the principle of semantic
indexing on the level of such an extended list of entities of the bibliographic record,
LittleAriadne in turn provides a visualization of the context of a specific clustering
solution. It also conveys the similarity of article clusters produced by different
algorithms, hence representing a complementary approach to other possible means
of comparison.

Keywords Random projection · clustering · visualization · topical modelling ·
interactive search interface · semantic map · knowledge map

R. Koopman
OCLC Research, Schipholweg 99, Leiden, The Netherlands
Tel.: +31 71 524 6500
E-mail: rob.koopman@oclc.org

S. Wang
OCLC Research, Schipholweg 99, Leiden, The Netherlands
Tel.: +31 71 524 6500
E-mail: shenghui.wang@oclc.org

A. Scharnhorst
DANS-KNAW, Anna van Saksenlaan 51, The Hague, The Netherlands
Tel.: +31 70 349 4450
E-mail: andrea.scharnhorst@dans.knaw.nl

arXiv:1702.08210v1 [cs.DL] 27 Feb 2017

1 Introduction

What is the essence, or the boundary of a scientific field? How can a topic be
defined? Those questions are at the heart of bibliometrics. They are equally relevant
for indexing, cataloguing and, consequently, information retrieval [24]. Rigour
and stability in the bibliometric definition of field boundaries are important for
research evaluation and, consequently, the distribution of funding. For information
retrieval, however, serendipity, broad coverage and associations to other fields
are as important as accuracy. If researchers seek information about a
certain topic outside of their areas of expertise, their information needs can be
quite different from those in a bibliometric context. Among the many possible hits
for a search query, they may want to know which are core works (articles, books)
and which are rather peripheral. They may want to use different rankings [25],
get some additional context information about authors or journals, or see other
closely related vocabulary or works associated with a search term. On the whole,
they would have less need to define a topic and a field in a bijective, univocal way.
Such a possibility to contextualize is not only important for term-based queries.
It also holds for groups of query terms, or for the exploration of sets of docu-
ments, produced by different clustering algorithms. Contextualisation is the main
motivation behind this paper.

When we speak of contextualisation, we remain in the realm of bibliographic
information. That is, we rely on information about authors, journals, words and
references contained in the full set of bibliographic records. Decades of bibliometric
research have produced many different approaches to clustering documents,
or more specifically, articles. They often focus on one entity of the bibliographic
record. To give one example, articles and terms within those articles (in title,
abstract and/or full text) form a bipartite network. From this network we can
either build a network of related terms (co-word analysis) or a network of related
articles (based on shared words). The first method, sometimes called lexical [20],
has been applied in scientometrics to produce so-called topical or semantic maps.
The same exercise can be applied to authors and articles, authors and words [21],
and in effect to each element of the bibliographic record for an article [13]. If we
extend the bibliographic record of an article with the list of references contained
by this article, we enter the area of citation analysis. Here, the following methods
are widely used: direct citations, bibliographic coupling and co-citation maps.
Hybrid methods combine citation and lexical analysis (e.g., [14, 37]). In an earlier
comparison of citation-based and word-based mapping approaches, Zitt et al. [38]
underline that the two signals represent different aspects of scientific practice.
We come back to this in the next paragraph. Formally speaking, the majority of
studies apply one method and often display unipartite networks. Analysis and
visualization of multi-partite networks are occasionally found [32].


Each network representation of articles captures some aspect of connectivity
and structure which can be found in published work. Co-authorship networks shed
light on the social dimension of knowledge production, the so-called Invisible Col-
lege [9, 23]. Citation relations are interpreted as traces of flows of knowledge [28,
30]. By using different bibliographic elements, we obtain different models for, or
representations of, a field or topic; i.e. as a conceptual, cognitive unit; as a com-
munity of practice; or as institutionalized in journals. One could also say that
choosing what to measure affects the representation of a field or topic. Another
source of variety beyond differences arising from choice of representations is how to
analyze those representations. Fortunately, network analysis provides several clas-
sical methods to choose from, including clustering and clique analysis. However,
clusters can be defined in different ways, and some clustering algorithms can be
computationally expensive when used on large or complex networks. Consequently,
we find different solutions for the same algorithm (if parameters in the algorithm
are changed) and different solutions for different algorithms. One could call this an
effect of the choice of instrument for the measurement or how to measure. Using
an ideal-typical workflow, these points of choice have been further detailed and
discussed in another paper of this special issue [33]. The variability in each of
the stages of the workflow results in ambiguity, and, if not articulated, makes it
even harder to reproduce results. Overall, moments of choice add an uncertainty
margin to the results [19, 27]. Last but not least, we can ask ourselves whether
clear delineations exist between topics in practice. In the sciences, very different
topics are often still related to each other. There are fuzzy boundaries and long,
almost invisible threads in the fabric of science [5], which might prevent us from
finding a contradiction-free solution in the form of a unique set of disjoint clusters.
There is a seeming paradox between the fact that experts can often identify rather
clearly what belongs to their field or a certain topic, and the fact that it is so hard
to represent this quantitatively with bibliometric methods. However, a closer look
into the history of science and into science and technology studies reveals that even
among experts, opinions regarding subject matter classification or topic identification
may vary. What belongs to a field and what does not is as much an epistemic question
as an object of social negotiation. Moreover, the boundaries of a field change
over time, and even a defined canon or body of knowledge determining the essence
of a field or a topic can still be controversial or subject to change [8].

Defining a topic requires a trade-off between accepting the natural ambiguity
of what a topic is and the necessity to define a topic for purposes of education,
knowledge acquisition, and evaluation. Since different perspectives serve different
purposes, there is also a need to preserve the diversity and ambiguity described
earlier. Having said this, for the sake of scientific reasoning it is equally necessary
to be able to further specify the validity and appropriateness of different methods
for defining topics and fields [11].

This paper contributes to this sorting-out process in several ways. All are
driven by the motivation to provide a better understanding of the topic reconstruction
results by providing context: context of the topics themselves, by using a lexical
approach and all elements of the bibliographic record to delineate topics; and
context for different solutions in the (re-)construction of topics. We first introduce
the method of semantic indexing, by which each bibliographic record is decomposed
and a vector representation for each of its entities in a lexical space is built,
resulting in a so-called semantic matrix. This approach is conceptually closer to
classical information retrieval techniques based on Salton’s vector space model [29]
than to the usual bibliometric mapping techniques. In particular, it is similar to
Latent Semantic Indexing or Latent Semantic Analysis. In the specific case of
the Astro dataset, we extend the bibliographic record with information on cluster
assignments provided by different clustering solutions. For the purpose of delineating
topics based on the clustering of articles, we reconstruct a semantic matrix
for articles from the semantic indexing of their individual entities. Second,
based on this second matrix, we produce our own clustering solutions (detailed in
[36]) by applying two different clustering algorithms. Third, we present an interactive
visual interface called LittleAriadne that displays the context around those
extracted entities. The interface responds to a search query with a network
visualization of the most related terms, authors, journals, citations and cluster IDs.
The query can consist of words or author names, but also of clusters from the
clustering solutions. The displayed nodes or entities around a query term represent,
to a certain extent, the context of the query in a lexical, semantic space.

In what follows, we address the following research questions:

Q1 How does the Ariadne algorithm, originally developed for a large corpus
containing tens of millions of articles, work on a much smaller, field-specific
dataset? How can we relate the produced contexts to domain knowledge
retrieved from other information services?

Q2 Can we use LittleAriadne to compare different cluster assignments of papers,
by treating those cluster assignments as additional entities? What can we learn
about the topical nature of these clusters when exploring them visually?

Concerning the last question, we restrict this paper to a description of the approach
LittleAriadne offers, and we provide some illustrations. A more detailed discussion
of the results of this comparison has been taken up as part of the comparison paper
of this special issue [33], which on the whole addresses different analytic methods
and visual means to compare different clustering solutions.

2 Data

The Astro dataset used in this paper contains documents published in the period
2003–2010 in 59 astrophysical journals.1 Originally, these documents were
downloaded from the Web of Science in the context of a German-funded research
project called “Measuring Diversity of Research,” conducted at the
Humboldt-University Berlin from 2009 to 2012. Based on institutional access to the
Web of Science, we worked on the same dataset. Of the 120,007 records in total, the
111,616 records of the document types Article, Letter and Proceedings Paper
were treated with different clustering methods (see the other contributions to this
special issue).

Different clustering solutions were shared, and eventually a selection of
solutions for the comparison was defined. In our paper we used clustering
solutions from CWTS-C5 (c) [31], UMSI0 (u) [34], HU-DC (hd) [12], STS-RG
(sr) [6], ECOOM-BC13 (eb), ECOOM-NLP11 (en) (both [10]) and two of our
own: OCLC-31 (ok) and OCLC-Louvain (ol) [36]. The CWTS-C5 and UMSI0
solutions are generated by two different methods, Infomap and the
Smart Local Moving Algorithm (SLMA) respectively, applied to the same direct
citation network of articles. The two ECOOM clustering solutions are generated
by applying the Louvain method to find communities among bibliographically coupled
articles, where ECOOM-NLP11 also incorporates keyword information. The
STS-RG clusters are generated by first projecting the relatively small Astro dataset
onto the full Scopus database. After the full set of Scopus articles is clustered using
SLMA on the direct citation network, the cluster assignments of the Astro articles
are collected. The HU-DC clusters are the only overlapping clusters; they are
generated by a memetic type algorithm designed for the extraction of overlapping,
poly-hierarchical topics in the scientific literature. Each article is assigned to a HU-DC
cluster with a confidence value. We only took those assignments with a confidence
value higher than 0.5. More detailed accounts of these clustering solutions can
be found in [33]. Table 1 shows their labels, which are later used in the interface,
and how many clusters each solution produced. All the clustering solutions are based
on the full dataset. However, not every article is guaranteed to have a
cluster assignment in every clustering solution (see the papers about the clustering
solutions for further details). The last column in Table 1 shows what share of the
articles of the original dataset is covered by each solution.

1 For details of the data collection and cleaning process leading to the commonly used
Astro dataset, see [33].

Table 1 Statistics of clustering solutions generated by different methods

Cluster label   Solution        #Clusters   Coverage
c               CWTS-C5                22        91%
u               UMSI0                  22        91%
ok              OCLC-31                31       100%
ol              OCLC-Louvain           32       100%
sr              STS-RG                556        96%
eb              ECOOM-BC13             13        97%
en              ECOOM-NLP11            11        98%
hd              HU-DC                 113        91%

3 Method

3.1 Building semantic representations for entities

The Ariadne algorithm was originally developed on top of ArticleFirst, the article
database of OCLC [18]. The interface, accessible at http://thoth.pica.nl/relate,
allows users to visually and interactively browse through 35 thousand
journals, 3 million authors, and 1 million topical terms associated with 65 million
articles. The Ariadne pipeline consists of two steps: an offline procedure for
semantic indexing and an online interactive visualization of the context of search
queries. We applied the same method to the Astro dataset and built an instantiation,
named LittleAriadne, accessible at http://thoth.pica.nl/astro/relate.

To describe our method we give an example of an article from the Astro dataset
in Table 2. We list all the fields of this bibliographic record that we used for
LittleAriadne. We include the following types of entities for semantic indexing:
authors, journals (ISSN), subjects, citations, topical terms, MAI-UAT thesaurus
terms and cluster IDs (see Table 1). For the Astro dataset, we extended the original
Ariadne algorithm [17] by adding citations as additional entities. In the short
paper about the OCLC clustering solutions [36] we applied clustering to different
variants of the vector representation of articles, including variants with and
without citations. We reported there on the effect that adding citations to the vector
representations of articles has on clustering.

Table 2 An article from the Astro dataset

Article ID     ISI:000276828000006
Title          On the Mass Transfer Rate in SS Cyg
Abstract       The mass transfer rate in SS Cyg at quiescence, estimated from the
               observed luminosity of the hot spot, is log M-tr = 16.8 +/- 0.3. This is
               safely below the critical mass transfer rates of log M-crit = 18.1
               (corresponding to log T-crit(0) = 3.88) or log M-crit = 17.2
               (corresponding to the “revised” value of log T-crit(0) = 3.65). The
               mass transfer rate during outbursts is strongly enhanced
Author         [author:smak j]
ISSN           [issn:0001-5237]
Subject        [subject:accretion, accretion disks] [subject:cataclysmic variables]
               [subject:disc instability model] [subject:dwarf novae]
               [subject:novae, cataclysmic variables] [subject:outbursts]
               [subject:parameters] [subject:stars] [subject:stars dwarf novae]
               [subject:stars individual ss cyg] [subject:state]
               [subject: superoutbursts]
Citation       [citation:bitner ma, 2007, astrophys j 1, v662, p564]
               [citation:bruch a, 1994, astron astrophys sup, v104, p79]
               [citation:buatmenard v, 2001, astron astrophys, v369, p925]
               [citation:hameury jm, 1998, mon not r astron soc, v298, p1048]
               [citation:harrison te, 1999, astrophys j 2, v515, l93]
               [citation:kjuikchieva d, 1998, a as, v262, p53]
               [citation:kraft rp, 1969, apj, v158, p589]
               [citation:kurucz rl, 1993, cd rom]
               [citation:lasota jp, 2001, new astron rev, v45, p449]
               [citation:paczynski b, 1980, acta astron, v30, p127]
               [citation:schreiber mr, 2002, astron astrophys, v382, p124]
               [citation:schreiber mr, 2007, astron astrophys, v473, p897, doi 10.1051/0004-6361:20078146]
               [citation:smak j, 1996, acta astronom, v46, p377]
               [citation:smak j, 2002, acta astronom, v52, p429]
               [citation:smak j, 2004, acta astronom, v54, p221]
               [citation:smak j, 2008, acta astronom, v58, p55]
               [citation:smak ji, 2001, acta astronom, v51, p279]
               [citation:tutukov av, 1985, pisma astron zh, v11, p123]
               [citation:tutukov av, 1985, sov astron lett+, v11, p52]
               [citation:voloshina ib, 2000, astron rep+, v44, p89]
               [citation:voloshina ib, 2000, astron zh, v77, p109]
Topical terms  mass transfer; transfer rate; ss; cyg; quiescence; estimated; observed;
               luminosity; hot spot; log; tr; safely; critical; crit; corresponding;
               revised; value; outbursts; strongly; enhanced
UAT terms      [uat:stellar phenomena]; [uat:mass transfer]; [uat:optical bursts]
Cluster ID     [cluster:c 19] [cluster:u 16] [cluster:ok 18] [cluster:ol 23]
               [cluster:sr 17] [cluster:eb 1] [cluster:en 1] [cluster:hd 1]
               [cluster:hd 18] [cluster:hd 48]

In Table 2 we display the author name (and other entities) in a syntax (indicated
by square brackets) that can immediately be used in the search field of the
interface. Each author name is treated as a separate entity. The next type of entity
is the journal, identified by its ISSN. One can search for a single journal
using its ISSN. In the visual interface, the ISSN is replaced
by the journal name, which is used as the label of a journal node. The next type
of entity is the so-called subject. Subjects originate from the fields “Author
Keywords” and “Keywords Plus” of the original Web of Science records. Citations,
i.e., the references in the article, are considered a type of entity too. Here, we use
the standardized abbreviated citations in the Web of Science database. We remark
that we do not apply any form of disambiguation, neither for the author names
nor for the citations. Topical terms, such as “mass transfer” and “quiescence” in
our example, are single words or two-word phrases extracted from the titles and
abstracts of all documents in the dataset. A multi-lingual stop-word list was used
to remove unimportant words, and mutual information was used to generate two-word
phrases. Only words and phrases that occur more often than a certain threshold
value were kept.

Table 3 Entities in LittleAriadne

Journals              59
Authors           55,607
Topical terms     60,501
Subjects          41,945
Citations        386,217
UAT terms          1,534
Cluster IDs          610
Total            546,473

The next type of entity is a set of Unified Astronomy Thesaurus (UAT)2 terms,
which were assigned by Data Harmony’s Machine Aided Indexer (M.A.I.).3
Please refer to [7] for more details about the thesaurus and the indexing procedure.
The last type of entity we add to each of the articles (specific to LittleAriadne)
is the collection of cluster IDs corresponding to the clusters to which the article
was assigned by the various clustering algorithms. For example, the article in
Table 2 has been assigned to clusters “c 19” (produced by CWTS-C5) and “u 16”
(produced by UMSI0), and so on. In other words, we treat the cluster assignments
of articles as if they were classification numbers or additional subject headings.
Table 3 lists the number of entities of each type found in the Astro
dataset.

To summarize, we deconstruct each bibliographic record, extract a number of
entities, and add some more (the cluster IDs and the topical terms). Next, we
construct for each of these entities a vector in a word space built from topical
terms and subject terms. We assume that the context of all entities is captured
by their vectors in this space. Figure 1 gives a schematic representation of these
vectors which form the matrix C. All types of entities – topical term, subject,
author, citation, cluster ID and journal – form the rows of the matrix, and their
components (all topical terms and subjects) the columns. The values of the vector
components are the frequencies of the co-occurrence of an entity and a specific
word in the whole dataset. That is, we count how many articles contain both an
entity and a certain topical term or subject.
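
The counting step described above can be illustrated with a minimal sketch. It is not the Ariadne implementation: the record layout, the entity type labels ("author", "topical", "subject") and the function name are assumptions made for this example only. Each row of C is stored sparsely as a counter over column terms.

```python
from collections import Counter, defaultdict

def build_cooccurrence(records, column_types=("topical", "subject")):
    """Count, for every entity, how many articles it shares with each column term.

    `records` is a list of articles; each article is a set of (type, value)
    tuples. The type labels are hypothetical and only serve this sketch.
    """
    C = defaultdict(Counter)  # sparse rows: entity -> {column term -> count}
    for entities in records:
        # Only topical terms and subjects span the columns of C.
        columns = [e for e in entities if e[0] in column_types]
        for entity in entities:
            for term in columns:
                C[entity][term] += 1  # one count per article containing both
    return C

records = [
    {("author", "smak j"), ("topical", "mass transfer"), ("subject", "dwarf novae")},
    {("author", "smak j"), ("topical", "mass transfer"), ("topical", "outbursts")},
    {("author", "lasota jp"), ("topical", "outbursts"), ("subject", "dwarf novae")},
]
C = build_cooccurrence(records)
```

In this toy example the row for the author "smak j" has the value 2 in the column "mass transfer", because two articles contain both. Note that in this sketch a term's co-occurrence with itself equals its document frequency.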

Matrix C expresses the semantics of all entities in terms of their context. This
context is then used to compute their similarity or relatedness. Each vector
can be seen as the lexical profile of a particular entity. A high cosine similarity
value between two entities indicates a large overlap of the contexts of these two
entities, in other words, a high similarity between them. This is different from
measuring their direct co-occurrence.

2 http://astrothesaurus.org/
3 http://www.dataharmony.com/services-view/mai/

Fig. 1 Dimension reduction using Random Projection

For LittleAriadne, the matrix C has roughly 546K rows and 102K columns; it is
sparse and expensive to compute with. To make the algorithm scale and to produce
a responsive online visual interface, we applied the method of Random Projection
[1, 15] to reduce the dimensionality of the matrix. As shown in Figure 1,
we multiply C by a 102K × 600 matrix whose entries are randomly set to -1 or 1
with equal probability.4 This way, the original 546K × 102K matrix C is reduced
to a Semantic Matrix C′ of size 546K × 600. Each row vector still represents
the semantics of an entity. It has been discussed elsewhere [2] that with
Random Projection, as with other dimension reduction methods, essential
properties of the original vector space are preserved, and thus entities with
a similar profile in the high-dimensional space still have a similar profile in the
reduced space. A big advantage of Random Projection is that its computation is
significantly less expensive than that of other methods, e.g., Principal Component
Analysis [2]. In fact, Random Projection is often suggested as a way of speeding up
Latent Semantic Indexing (LSI) [26], and Ariadne is similar to LSI in some ways.
LSI starts from a weighted term-document matrix, where each row represents the
lexical profile of a document in a word space. In Ariadne, however, the unit of
analysis is not the document. Instead, each entity of the bibliographic record is
given a lexical profile. We explain in the next section that, by aggregating
over all entities belonging to one article, one can construct a vector representation
for the article that represents its semantics and is suitable for further clustering
processes (for more details please consult [36]).

4 More efficient random projections are available. This version is more conservative and also
computationally easier.
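
As an illustration of the projection step, the following sketch multiplies sparse rows of C by a random matrix of -1/+1 entries. The function names, the seed and the toy sizes are assumptions for this example; the paper uses roughly 102K columns and 600 dimensions, and its actual implementation is not shown here.

```python
import random

def random_sign_matrix(n_cols, n_dims, seed=0):
    """n_cols x n_dims matrix of -1/+1 entries, each drawn with probability 1/2."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n_dims)] for _ in range(n_cols)]

def project(sparse_row, R, n_dims):
    """Multiply one sparse row of C (column index -> count) by R."""
    out = [0] * n_dims
    for col, count in sparse_row.items():
        r = R[col]
        for d in range(n_dims):
            out[d] += count * r[d]
    return out

# Toy sizes only; they are far smaller than the dimensions used in the paper.
N_COLS, N_DIMS = 1000, 64
R = random_sign_matrix(N_COLS, N_DIMS)
row_a = {3: 2, 17: 1, 256: 4}   # hypothetical sparse rows of C
row_b = {3: 1, 999: 5}
proj_a = project(row_a, R, N_DIMS)
proj_b = project(row_b, R, N_DIMS)
```

Because only additions and sign flips are involved, the projection of a row costs time proportional to its number of non-zero entries times the target dimension, which is why a sparse C is cheap to project.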


With the matrix C′, the interactive visual interface dynamically computes the
entities most related to a search query (i.e., ranked by cosine similarity). After
irrelevant entities have been filtered out by removing entities with a high
Mahalanobis distance [22] to the query, the remaining entities and the query node are
positioned in 2D so that the distances between nodes preserve the corresponding
distances in the high-dimensional space as much as possible. We use a spring-like
force-directed graph drawing algorithm for the positioning of the nodes. Because
the tool is designed to be experimental and explorative, no other optimisation of
the network layout is applied. In the online interface, it is possible to zoom into
the visualization, to change the size of the labels (font slider) and to change the
number of entities displayed (show slider). For the figures in the paper we used
snapshots, in which node labels might overlap. Therefore, we provide links to the
corresponding interactive display for each of the figures. In the end, the context of
a query term can be effectively presented through its most related entities [18].
For LittleAriadne we extended the usual Ariadne interface with lists of the most
related entities, organized by type. These lists are given below the network
visualization.
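
The ranking step can be sketched as follows. This is a brute-force illustration under stated assumptions: the vectors are made-up three-dimensional toys rather than rows of C′, the function names are hypothetical, and the Mahalanobis filtering and the 2D layout are left out.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_related(query, vectors, k=40):
    """Return up to k (entity, similarity) pairs, most similar first."""
    scored = [(name, cosine(vectors[query], vec))
              for name, vec in vectors.items() if name != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy vectors standing in for rows of the semantic matrix.
vectors = {
    "gamma ray": [1.0, 0.0, 1.0],
    "afterglows": [1.0, 0.1, 0.9],
    "dwarf novae": [0.0, 1.0, 0.0],
    "fireball": [0.5, 0.0, 1.0],
}
top = most_related("gamma ray", vectors, k=2)
```

With these toy vectors, "afterglows" and "fireball" are returned ahead of "dwarf novae", whose vector is orthogonal to the query.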

3.2 From a semantic matrix of entities to a semantic matrix for articles

The Ariadne interface provides context around entities, but does not produce
article clusters directly. In other words, articles contribute to the context of the
entities associated with them, but their own semantics need to be reconstructed
before we can apply clustering methods to identify article clusters. We describe
the OCLC clustering workflow elsewhere [36]; here we explain
the preparatory work for it.

The first step is to create a vector representation of each article. For each article,
we look up all entities associated with this article in the Semantic Matrix C′. We
purposefully leave out the cluster IDs, because we want to construct our own
clustering later independently, i.e., without already including information about
clustering solutions of other teams. For each article we obtain a set of vectors.
For our article example in Table 2 we have 55 entities. The set of vectors for this
article entails one vector representing the single author of this article, 12 vectors
for the subjects, one vector for the journal, 21 vectors for the citations and 20
vectors for topical terms. Each article is represented by a unique set of vectors.
The size of the set can vary, but each of the vectors inside of a set has the same
length, namely 600.

For each article we compute the weighted average of its constituent vectors
as its semantic representation. Each entity is weighted by its inverse document
frequency to the third power; frequent entities are therefore heavily penalized and
contribute little to the resulting representation of the article. In the end,
each article is represented by a vector of 600 dimensions, which becomes a row in
a new matrix M of size 111,616 × 600. Note that since articles are represented
as vectors in the same space as the other entities, it is
now possible to compute the relatedness between entities and articles. Therefore,
in the online interface, we can present the articles most related to a query.
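
A minimal sketch of this aggregation is given below. The text does not state which idf formula is used, so the common definition log(N / df) is assumed here; the function name, the entity names and the document frequencies are likewise hypothetical.

```python
import math

def article_vector(entities, entity_vectors, doc_freq, n_articles):
    """Weighted average of entity vectors, with weight idf ** 3.

    ASSUMPTION: idf = log(n_articles / doc_freq[entity]). The exact idf
    formula used by Ariadne is not given in the text.
    """
    dims = len(next(iter(entity_vectors.values())))
    total = [0.0] * dims
    weight_sum = 0.0
    for entity in entities:
        weight = math.log(n_articles / doc_freq[entity]) ** 3
        weight_sum += weight
        for d, value in enumerate(entity_vectors[entity]):
            total[d] += weight * value
    # If every weight is zero, return the zero vector instead of dividing by 0.
    return [x / weight_sum for x in total] if weight_sum else total

# Toy two-dimensional vectors; real rows of the semantic matrix have length 600.
entity_vectors = {"rare": [1.0, 0.0], "common": [0.0, 1.0]}
doc_freq = {"rare": 10, "common": 50000}
vec = article_vector(["rare", "common"], entity_vectors, doc_freq, n_articles=111616)
```

Under this assumed idf, the entity occurring in 10 articles dominates the one occurring in 50,000 articles almost completely, which illustrates how strongly the third power penalizes frequent entities.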

To group these 111,616 articles into meaningful clusters, we apply standard
clustering methods to M. Our first choice is the K-Means clustering algorithm.
As detailed in [36], with k = 31, the resulting 31 clusters perform
best according to a pseudo-ground-truth built from the consensus of CWTS-C5,
UMSI0, STS-RG and ECOOM-BC13. With this clustering solution the whole
dataset is partitioned fairly evenly: the average cluster size is 3600 ± 1371, the
largest cluster contains 6292 articles and the smallest 1627 articles.
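
For readers unfamiliar with K-Means, the following is a plain sketch of Lloyd's algorithm on toy two-dimensional points. It is not the implementation used for OCLC-31, which is described in [36]; the function names, the seed and the random initialization are assumptions of this example.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = None
    for _ in range(iterations):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two well-separated toy groups; the paper clusters 600-dimensional rows of M.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = kmeans(points, k=2)
```

On this toy input the two groups end up in different clusters. For data of the size of M, a library implementation would normally be preferred over this sketch.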

We also apply a network-based clustering method: the Louvain community
detection algorithm. To avoid high computational cost, we first determine for each
article its top 40 most related articles, i.e., those with the highest cosine similarity.
This results in a new adjacency matrix M′ between articles, representing an article
similarity network in which the nodes are articles and a link indicates that the
connected articles are very similar. We set the threshold for the cosine similarity
at 0.6 to remove links with low similarity values. A standard Louvain community
detection algorithm [3] is applied to this network, producing a partition into 32
clusters. Compared to the 31 K-Means clusters, these 32 Louvain clusters vary more
in size: the largest cluster contains 9464 articles and
the smallest only 86. The Normalized Mutual Information [35] between
these two solutions is 0.68, indicating that they are highly similar to each other
yet different enough to be studied further. More details can be found in [36].
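
The construction of the article similarity network can be sketched as follows. Several details are assumptions of this example: links are treated as undirected, the threshold is applied as "greater than or equal to 0.6", and the brute-force pairwise comparison is only workable for toy input. The Louvain step itself is not shown; it would be run on the resulting edge list with a community detection routine from a graph library.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_network(vectors, top_n=40, threshold=0.6):
    """Link each article to its top_n most similar articles with cosine >= threshold.

    Brute force, O(n^2) comparisons: acceptable for a toy example only.
    Returns undirected edges as {frozenset({a, b}): similarity}.
    """
    edges = {}
    ids = list(vectors)
    for a in ids:
        scored = sorted(((cosine(vectors[a], vectors[b]), b) for b in ids if b != a),
                        reverse=True)
        for sim, b in scored[:top_n]:
            if sim >= threshold:
                edges[frozenset((a, b))] = sim
    return edges

# Hypothetical toy article vectors; top_n=1 keeps the example readable.
vectors = {"a1": [1.0, 0.0], "a2": [0.9, 0.1], "a3": [0.0, 1.0], "a4": [0.1, 0.9]}
edges = similarity_network(vectors, top_n=1)
```

Restricting each article to a fixed number of neighbours keeps the network sparse, with at most 40 × 111,616 links, which is what makes community detection on the full dataset affordable.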

4 Experiments and results

To answer the two research questions listed in the introduction, we conducted the
following experiments:

Experiment 1. We implemented LittleAriadne as an information retrieval tool.
We searched with query terms, inspected and navigated through the resulting
network visualization.

Experiment 2. We visually observed and compared different clustering solutions.

4.1 Experiment 1 – Navigate through networked information

We implemented LittleAriadne, which allows users to browse the context of the
546K entities associated with the 111K articles in the dataset. If the search query
refers to an entity that exists in the semantic matrix, LittleAriadne returns,
by default, the top 40 most related entities, which can be topical terms, authors,
subjects, citations or clusters. If there are multiple known entities in the search
query, a weighted average of the vectors of the individual entities is used to calculate
similarities (in the same way as an article vector is constructed). If the search query
does not contain any known entity, a blank page is returned, as there is no information
about this query.

Figure 2 gives a contextual view of “gamma ray.”5 The search query refers
to a known topical term, “gamma ray,” and it is therefore displayed as a red
node in the network visualization. The top 40 most related entities are shown as
nodes, with the top 5 connected to the query node by red links. The different colours
reflect their types, e.g., topical terms, subjects, authors, or clusters. Each of these
40 entities is further connected to its top 5 most related entities among the other
entities in the visualization, on the condition that the cosine similarity is not below 0.6.

5 Available at http://thoth.pica.nl/astro/relate?input=gamma+ray

Fig. 2 The contextual view of the query term “gamma ray”

A thicker link means that the two linked entities are mutually related, i.e., each
is in the other’s top 5 list. The colour of a link is that of the node from which
the link originates. If the link is mutual and the two linked entities are of
different types, one of the two entity colours is chosen.
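
The linking rule can be sketched as follows. This is an illustrative reading of the
description above, not the code behind LittleAriadne; the function name, the nested
dictionary of similarities and the labels "mutual" and "one-way" are assumptions.

```python
def build_links(sim, top_n=5, min_sim=0.6):
    """Derive links between displayed nodes.

    sim maps each node to a dict of {other node: cosine similarity}.
    A node points to its top_n most similar nodes, provided the similarity
    is not below min_sim. A link is "mutual" when both nodes point to each
    other (drawn thicker in the interface), otherwise "one-way".
    """
    directed = set()
    for node, others in sim.items():
        ranked = sorted(others.items(), key=lambda kv: kv[1], reverse=True)
        for other, similarity in ranked[:top_n]:
            if similarity >= min_sim:
                directed.add((node, other))

    links = {}
    for a, b in directed:
        key = tuple(sorted((a, b)))  # one entry per unordered pair
        links[key] = "mutual" if (b, a) in directed else "one-way"
    return links


# Hypothetical similarities between three displayed nodes.
toy_sim = {
    "a": {"b": 0.9, "c": 0.5},
    "b": {"a": 0.9, "c": 0.7},
    "c": {"a": 0.5, "b": 0.7},
}
print(build_links(toy_sim, top_n=1))
```

With `top_n=1`, nodes a and b choose each other and are linked mutually, while c
points to b without being chosen in return.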

The displayed entities often form groups automatically, depending on their
relatedness to each other, whereby more closely related entities are positioned
closer together. Each group potentially represents a different aspect related to the
query term. The size of a node is proportional to the logarithm of its frequency
of occurrence in the whole dataset. The absolute number of occurrences appears
when hovering the mouse cursor over the node. Because various statistical methods
are at the core of the Ariadne algorithm, this number gives an
indication of the reliability of the suggested position and links.

In Figure 2, there are four clusters, from OCLC-31, ECOOM-BC13, ECOOM-NLP11
and CWTS. The ECOOM-BC13 cluster eb 8 and the ECOOM-NLP11 cluster
en 4 are directly linked to “gamma ray,” suggesting that these two clusters are
probably about gamma rays. It is not surprising that they are very close to each
other: they contain 7560 and 5720 articles respectively, and share 3603
articles. In the lower part, the OCLC-31 cluster ok 21 and the CWTS cluster
c 15 are also fairly close to our search term. They contain 1849 and 3182 articles
respectively and share 1721 articles, which places them close to each
other in the visualization. By looking at the topical terms and subjects around



Fig. 3 The contextual view of cluster ok 21

these clusters, we can get a rough idea of their differences. Although they are all
about “gamma ray,” clusters eb 8 and en 4 are probably more about “radiation
mechanisms,” “very high energy,” and “observations,” while clusters ok 21 and
c 15 seem to focus more on “afterglows,” “prompt emission,” and “fireball.” Such
observations invite users to explore these clusters or subjects further.
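
To put the article counts quoted above into proportion, the following sketch computes
two standard set-overlap measures from the cluster sizes and shared-article counts
given in the text. These measures are not part of the Ariadne algorithm, which
positions clusters by the similarity of their vectors; the choice of the Jaccard index
and the overlap coefficient is an assumption made purely for illustration.

```python
def jaccard(size_a, size_b, shared):
    """Shared articles divided by the number of articles in either cluster."""
    return shared / (size_a + size_b - shared)


def overlap_coefficient(size_a, size_b, shared):
    """Shared articles divided by the size of the smaller cluster."""
    return shared / min(size_a, size_b)


# Cluster sizes and shared-article counts as reported in the text.
pairs = {
    ("eb 8", "en 4"): (7560, 5720, 3603),
    ("ok 21", "c 15"): (1849, 3182, 1721),
}

for (a, b), (size_a, size_b, shared) in pairs.items():
    print(
        f"{a} / {b}: Jaccard = {jaccard(size_a, size_b, shared):.3f}, "
        f"overlap coefficient = {overlap_coefficient(size_a, size_b, shared):.3f}"
    )
```

Under these measures, most of the smaller cluster ok 21 is contained in c 15, whereas
eb 8 and en 4 share a smaller proportion of their articles.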

Each node is clickable; clicking a node leads to another visualization that shows
the context of the selected node. If one is interested in cluster ok 21, for
instance, then after clicking the node a contextual view of cluster ok 21 is
presented,⁶ as shown in Figure 3. This context view provides a good indication of
the content of the articles grouped together in this cluster. In the context view
of cluster ok 21 we again see cluster c 15, which was already near ok 21 in the
context view of “gamma ray.” But the two ECOOM clusters, eb 8 and en 4, which are
also in the context of “gamma ray,” are no longer visible. Instead, we find two more
similar clusters, u 11 and ol 9. That means that, even though the clusters ok 21 and
eb 8 are both among the top 40 entities related to “gamma ray,” they still differ in
terms of their content. This can be confirmed by looking at their labels in Table 4.⁷

As mentioned before, in the interface one can also further refine the display.
For instance, one can choose the number of nodes to be shown or decide to limit
the display to only authors, journals, topical terms, subjects, citations or clusters.

6 Available at http://thoth.pica.nl/astro/relate?input=[cluster:ok%2021].
7 More details about cluster labelling can be found in [16].




Table 4 Labels of clusters similar to ok 21 and to “gamma ray”

Cluster ID  Size  Cluster labels
ok 21       1849  grb, ray burst, gamma ray, afterglow, bursts grbs, swift, prompt emission, prompt, fireball, batse
c 15        3182  grb, ray bursts, gamma ray, afterglow, bursts grbs, sn, explosion, swift, type ia, supernova sn
ol 9        2895  grb, ray bursts, gamma ray, afterglow, bursts grbs, sn, type ia, swift, explosion, ia supernovae
u 11        2051  grb, ray bursts, gamma ray, afterglow, bursts grbs, sn, explosion, type ia, swift, supernova
eb 8        7560  gamma ray, pulsar, ray bursts, grb, bursts grbs, high energy, jet, radio, psr, synchrotron
en 4        5720  gamma ray, grb, ray bursts, cosmic ray, high energy, bursts grbs, afterglow, swift, tev, tev gamma

The former can be done with the “show” slider or by editing the URL string directly.
For the latter, tick boxes are provided. An additional slider, “font,” allows one to
experiment with the font size of the labels.
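
As an illustration of editing the URL string, the sketch below rebuilds several of
the URLs given in the footnotes of this section. The parameter names `input`, `type`
and `show` are taken from those URLs; their interpretation (`show` as the number of
nodes, `type=2` as the author-only view of Figure 4, `type=S` as the scan option) is
inferred from the accompanying text and should be read as an assumption, as should
the helper name `relate_url`. Whether the service is still reachable at this address
is not something the sketch can guarantee.

```python
from urllib.parse import urlencode

# Base address as printed in the footnotes.
BASE_URL = "http://thoth.pica.nl/astro/relate"


def relate_url(entities, entity_type=None, show=None):
    """Build a LittleAriadne-style URL.

    entities is either a plain query string (e.g. "gamma ray") or a list of
    typed entity names (e.g. ["cluster:c", "cluster:ok"]), each of which is
    wrapped in square brackets as in the footnote URLs.
    """
    if isinstance(entities, str):
        query = entities
    else:
        query = "".join(f"[{name}]" for name in entities)
    params = {"input": query}
    if entity_type is not None:
        params["type"] = entity_type
    if show is not None:
        params["show"] = show
    return f"{BASE_URL}?{urlencode(params)}"


print(relate_url("gamma ray"))
print(relate_url(["subject:hubble diagram"], entity_type="2"))
print(relate_url(["cluster:c", "cluster:ok"], entity_type="S", show=500))
```

The three calls reproduce the URLs of footnotes 5, 8 and 12. The URL in footnote 6
encodes the space and brackets differently, so it is not reproduced character for
character by this helper.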

A display with only one type of entity lets us see the context filtered along
one perspective (lexical, journals, authors, subjects), and is often useful. For
example, Figure 4⁸ shows at least three separate groups of authors who are most
related to “subject:hubble diagram.”

At any point of the exploration, one can see the most related entities, grouped by
type and listed at the bottom of the interface. The first category shown is
the related titles, i.e., the titles of the articles most relevant to the search
query. Due to license restrictions, we cannot make the whole bibliography available.
However, clicking on a title shows the context of that article. Not only titles but
all entities in the lower part are clickable, and clicking one leads to another
contextual view of the selected entity.

At the top of the interface, under the search box, we find further hyperlinks
behind the labels “exact search” and “context search.” Clicking on these hyperlinks
automatically sends queries to other information spaces such as Google, Google
Scholar, Wikipedia, and WorldCat. For exact search, the same query text is used. For
context search, the system generates a selection among all topical terms related to
the original query term and sends this selection as a string of terms (combined with
the Boolean AND operator) to the information spaces behind the hyperlinks. This
option offers users a potential way to retrieve related literature or web resources
from a broader perspective. In turn, it also enables the user to better understand
the entity-based context view provided by Ariadne.
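
A context-search string of this kind could be assembled as in the sketch below. How
Ariadne selects the terms is not specified here, so the sketch simply takes the first
few related terms; the function name, the `max_terms` parameter and the quoting of
multi-word terms are assumptions for illustration.

```python
def context_query(related_terms, max_terms=5):
    """Join a selection of related topical terms with the Boolean AND operator.

    related_terms is assumed to be ordered from most to least related.
    Multi-word terms are quoted so that they are treated as phrases.
    """
    selected = related_terms[:max_terms]
    return " AND ".join(f'"{t}"' if " " in t else t for t in selected)


# Terms taken from the context of "gamma ray" discussed above.
print(context_query(["gamma ray", "afterglow", "prompt emission", "fireball"], max_terms=3))
```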

Let us now come back to our first research question: how does the Ariadne algo-
rithm work on a much smaller, field-specific dataset? The interface shows that the
original Ariadne algorithm works well on the small Astro dataset. Not surprisingly,
compared with our exploration in the much bigger and more general ArticleFirst
dataset, we find more consistent representations; that is, specific vocabulary is
displayed, which can be cross-checked in Wikipedia, Google or Google Scholar.
On the other hand, different corpora introduce different contexts for entities. For

8 Available at http://thoth.pica.nl/astro/relate?input=%5Bsubject%3Ahubble+diagram%5D&type=2




Fig. 4 The authors who are the most related to “subject:hubble diagram”

example, “young” in ArticleFirst⁹ is associated with adults and 30 years old, while
in LittleAriadne it is immediately related to young stars, which are merely 5 or
10 million years old.¹⁰ Also, the larger number of topical terms in the larger
database leads to a situation where almost every query term produces a response.
In LittleAriadne, searches for, e.g., a writer such as Jane Austen retrieve nothing.
Not surprisingly, for domain-specific entities, LittleAriadne tends to provide a more
accurate context. A more thorough evaluation needs to be based, as for any other
topical mapping, on a discussion with domain experts.

4.2 Experiment 2 – Comparing clustering solutions

In LittleAriadne we extended the interface with the goal of observing and comparing
clustering solutions visually. As discussed in Section 3.1, cluster assignments
are treated in the same way as other entities associated with articles, such as
topical terms, authors, etc. Each cluster ID is therefore represented in the same
space and visualized in the same way. In the interface, when we use a search term, for

9 Available at http://thoth.pica.nl/relate?input=young
10 Available at http://thoth.pica.nl/astro/relate?input=young




example “[cluster:c],” and tick the “scan” option, the interface scans all the
entities in the semantic matrix that start with, in this case, “cluster:c,” and then
selects and visualizes all CWTS-C5 clusters.¹¹ In this way, we can easily
see the distribution of a single clustering solution. Note that in this scanning
visualization, any cluster that contains fewer than 100 articles is not shown.
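
The scan option amounts to a prefix match over entity names followed by a size
filter. The sketch below illustrates this under stated assumptions: the function
name, the dictionary of entity names and article counts, and the entries
“cluster:c 99” and “subject:quantum gravity” with their counts are hypothetical; only
the sizes of c 15 and ok 21 come from the text.

```python
def scan(entity_sizes, prefixes, min_articles=100):
    """Return, in sorted order, the entities whose names start with a prefix.

    entity_sizes maps an entity name to its number of articles.
    Clusters with fewer than min_articles articles are left out; the size
    filter is assumed to apply to cluster entities only.
    """
    prefixes = tuple(prefixes)
    return sorted(
        name
        for name, size in entity_sizes.items()
        if name.startswith(prefixes)
        and (not name.startswith("cluster:") or size >= min_articles)
    )


toy_entities = {
    "cluster:c 15": 3182,            # size reported in the text
    "cluster:c 99": 42,              # hypothetical small cluster, filtered out
    "cluster:ok 21": 1849,           # size reported in the text
    "subject:quantum gravity": 310,  # hypothetical subject entity
}
print(scan(toy_entities, ["cluster:c"]))
print(scan(toy_entities, ["cluster:c", "cluster:ok"]))
print(scan(toy_entities, ["subject:quantum"]))
```

Passing several prefixes corresponds to scanning several clustering solutions at
once, so that they appear in the same visualization.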

Figure 5 shows the individual distribution of clusters from all eight clustering
solutions. When two clusters have a relatively high mutual similarity, there is a link
between them. It is not surprising to see that the HU-DC clusters are highly
connected, as they overlap and form a poly-hierarchy. Compared to CWTS-C5, UMSI
and the two ECOOM solutions, the STS-RG and the two OCLC solutions have more
cluster-cluster links. This suggests that these clusters overlap more in terms of
their direct vocabularies and the indirect vocabularies associated with their authors,
journals and citations.

If we scan two or more cluster entities, such as “[cluster:c][cluster:ok],” we put
two clustering solutions in the same visualization so that they can be compared
visually. In Figure 6 (a) we see the high similarity between the clusters from
CWTS-C5 and those from OCLC-31.¹² CWTS-C5 has 22 clusters while OCLC-31 has 31
clusters. Each CWTS-C5 cluster is accompanied by one or more OCLC clusters.
This indicates that the two solutions probably differ in granularity
rather than in any fundamental respect. Figure 6 (b) shows two other sets of clusters
that partially agree with each other but clearly differ in their capacity to identify
distinct clusters.¹³

Figure 7 (a) shows all the cluster entities from all eight clustering solutions.¹⁴
The STS and HU solutions have hundreds of clusters, which makes the visualization
rather cluttered. Figure 7 (b) shows only the solutions from CWTS, UMSI, OCLC and
ECOOM, whose numbers of clusters are comparable.¹⁵

Concerning our second research question, whether we can use LittleAriadne to compare
clustering solutions visually, we can give a positive answer. However, it is not easy
to see from LittleAriadne why some clusters are similar and others are not. The
visualization functions as a macroscope [4] and provides a general overview of all the
clustering solutions, which helps to guide further investigation. It is not conclusive,
but it is a useful heuristic device. For example, from Figure 7, especially 7 (b), it is
clear that there are “clusters of clusters.” That is, some clusters are detected by
all of these different methods. In the future we may investigate these clusters of
clusters more closely and perhaps discover that different solutions identify some
of the same topics. We continue the discussion of the use of visual analytics to
compare clustering solutions in the paper by Velden et al. [33].

11 The scan option is applicable to any other type of entity; for example, to see all
subjects that start with “quantum,” use “subject:quantum” as the search term and tick
the scan option.
12 Available at http://thoth.pica.nl/astro/relate?input=%5Bcluster%3Ac%5D%5Bcluster%3Aok%5D&type=S&show=500
13 Available at http://thoth.pica.nl/astro/relate?input=%5Bcluster%3Au%5D%5Bcluster%3Asr%5D&type=S&show=500
14 Available at http://thoth.pica.nl/astro/relate?input=%5Bcluster%3Ac%5D%5Bcluster%3Au%5D%5Bcluster%3Aok%5D%5Bcluster%3Aol%5D%5Bcluster%3Aeb%5D%5Bcluster%3Aen%5D%5Bcluster%3Asr%5D%5Bcluster%3Ahd%5D&type=S&show=500
15 Available at http://thoth.pica.nl/astro/relate?input=%5Bcluster%3Ac%5D%5Bcluster%3Au%5D%5Bcluster%3Aok%5D%5Bcluster%3Aol%5D%5Bcluster%3Aeb%5D%5Bcluster%3Aen%5D&type=S&show=500




Fig. 5 The distribution of clusters. Panels: (a) CWTS-C5 clusters, (b) UMSI0 clusters,
(c) OCLC-31 clusters, (d) OCLC-Louvain clusters, (e) ECOOM-BC13 clusters,
(f) ECOOM-NLP11 clusters, (g) STS-RG clusters, (h) HU-DC clusters



Fig. 6 Visual comparison of clustering solutions. Panels: (a) Highly similar clustering
solutions, (b) Clustering solutions with different focuses



Fig. 7 Visual comparison of clustering solutions. Panels: (a) All clustering solutions,
(b) Clusters from CWTS, UMSI, OCLC and ECOOM



5 Conclusion

We present a method implemented in an interface that allows browsing through
the context of entities, such as topical terms, authors, journals, subjects and cita-
tions associated with a set of articles. With the LittleAriadne interface, one can
navigate visually and interactively through the context of entities in the dataset by
seamlessly travelling between authors, journals, topical terms, subjects, citations
and cluster IDs as well as consult external open information spaces for further
contextualization.

In this paper we particularly explored the usefulness of the method to the
problem of topic delineation addressed in this special issue. LittleAriadne treats
cluster assignments from different solutions as additional special entities. This way
we provide the contextual view of clusters as well. This is beneficial for users who
are interested in travelling seamlessly between different types of entities and their
related cluster assignments generated by different solutions.

We also contributed two clustering solutions built on the vector representation
of articles, which differ from the solutions provided by other methods. We start by
including references and treating them as entities with a certain lexical or semantic
profile. In essence, we start from a multipartite network of papers, cited sources,
terms, authors, subjects, etc. and focus on similarity in a high-dimensional space.
Our clusters are comparable to other solutions yet have their own characteristics.
Please see [33, 36] for more details.

We demonstrated that we can use LittleAriadne to compare different clustering
solutions visually and generate a wider overview. This has the potential to
complement any other method of cluster comparison. We hope that this interactive
tool supports discussion about different clustering algorithms and helps
to find the right meaning of clusters.

We have plans to further develop the Ariadne algorithm. The Ariadne algo-
rithm is general enough to incorporate additional types of entities into the semantic
matrix. Which entities we can add very much depends on the information in the
original dataset or database. In the future, we plan to add publishers, conferences,
etc. with the aim to provide a richer contextualization of entities typically found in
a scholarly publication. We also plan to elaborate links to articles that contribute
to the contextual visualization, thus strengthening the usefulness of Ariadne not
only for the associative exploration of contexts similar to scrolling through a sys-
tematic catalogue, but also as a direct tool for document retrieval.

In this context we plan to further compare LittleAriadne and Ariadne. As
mentioned before, the corpora matter when talking about the context of entities. The
advantage of LittleAriadne is the confinement of the dataset to one scientific
discipline or field and the topics within it. By continuing such experiments, we also
hope to learn more about the relationship between genericity and specificity of
contexts, and how that can best be addressed in information retrieval.

Acknowledgement

Part of this work has been funded by the COST Action TD1210 Knowescape,
and the FP7 Project ImpactEV. We would like to thank the internal reviewers
Frank Havemann, Bart Thijs as well as the anonymous external referees for their



valuable comments and suggestions. We would also like to thank Jochen Gläser,
William Harvey and Jean Godby for comments on the text.

References

1. Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with binary
coins. Journal of Computer and System Sciences 66(4), 671–687 (2003). DOI http://dx.
doi.org/10.1016/S0022-0000(03)00025-4. URL http://www.sciencedirect.com/science/
article/pii/S0022000003000254

2. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: Applications
to image and text data. In: Proceedings of the Seventh ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’01, pp. 245–250. ACM, New
York, NY, USA (2001). DOI 10.1145/502512.502546. URL http://doi.acm.org/10.1145/
502512.502546

3. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communi-
ties in large networks. Journal of Statistical Mechanics: Theory and Experiment (2008).
P10008 (12pp)

4. Börner, K.: Plug-and-play macroscopes. Communications of the ACM 54(3), 60–69 (2011)

5. Boyack, K., Klavans, R.: Weaving the fabric of science. In: K. Börner, E.F. Hardy (eds.)
6th Iteration (2009): Science Maps for Scholars. Places & Spaces: Mapping Science (2010)

6. Boyack, K.W.: Investigating the Effect of Global Data on Topic Detection. In: J. Gläser,
A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative
approach to the identification of thematic structures in science, Special Issue of
Scientometrics (2017). DOI 10.1007/s11192-017-2297-y

7. Boyack, K.W.: Thesaurus-based methods for mapping contents of publication sets. In:
J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a
comparative approach to the identification of thematic structures in science, Special Issue
of Scientometrics (2017). DOI 10.1007/s11192-017-2304-3

8. Galison, P.: Image and logic: A material culture of microphysics. University of Chicago
Press (1997)

9. Glänzel, W., Schubert, A.: Analysing scientific networks through co-authorship. In: H.F.
Moed, W. Glänzel, U. Schmoch (eds.) Handbook of quantitative science and technology
research, pp. 257–276. Springer (2004). DOI 10.1007/1-4020-2755-9_12

10. Glänzel, W., Thijs, B.: Using hybrid methods and ‘core documents’ for the representation
of clusters and topics. the astronomy dataset. In: J. Gläser, A. Scharnhorst, W. Glänzel
(eds.) Same data – different results? Towards a comparative approach to the identifi-
cation of thematic structures in science, Special Issue of Scientometrics (2017). DOI
Usinghybridmethods

11. Gläser, J., Glänzel, W., Scharnhorst, A.: Introduction to the special issue “same data,
different results?”. In: J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different
results? Towards a comparative approach to the identification of thematic structures in
science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2296-z

12. Havemann, F., Gläser, J., Heinz, M.: Memetic search for overlapping topics. In: J. Gläser,
A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative
approach to the identification of thematic structures in science, Special Issue of Sciento-
metrics (2017). DOI 10.1007/s11192-017-2302-5

13. Havemann, F., Scharnhorst, A.: Bibliometric networks. CoRR abs/1212.5211 (2012).
URL http://arxiv.org/abs/1212.5211

14. Janssens, F., Zhang, L., Moor, B.D., Glänzel, W.: Hybrid clustering for validation and
improvement of subject-classification schemes. Information Processing & Management
45(6), 683 – 702 (2009). DOI http://dx.doi.org/10.1016/j.ipm.2009.06.003. URL http:
//www.sciencedirect.com/science/article/pii/S0306457309000673

15. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space.
Contemporary Math. 26, 189–206 (1984)

16. Koopman, R., Wang, S.: Mutual information based labelling and comparing clusters. In:
J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a
comparative approach to the identification of thematic structures in science, Special Issue
of Scientometrics (2017). DOI 10.1007/s11192-017-2305-2




17. Koopman, R., Wang, S., Scharnhorst, A.: Contextualization of topics - browsing through
terms, authors, journals and cluster allocations. In: A.A. Salah, Y. Tonta, A.A.A. Salah,
C.R. Sugimoto, U. Al (eds.) Proceedings of ISSI 2015 Istanbul: 15th International Soci-
ety of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29 June to 3 July,
2015. Bogaziçi University Printhouse (2015). URL http://www.issi2015.org/files/
downloads/all-papers/1042.pdf

18. Koopman, R., Wang, S., Scharnhorst, A., Englebienne, G.: Ariadne’s thread: Interactive
navigation in a world of networked information. In: B. Begole, J. Kim, K. Inkpen, W. Woo
(eds.) Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human
Factors in Computing Systems, Seoul, CHI 2015 Extended Abstracts, Republic of Korea,
April 18 - 23, 2015, pp. 1833–1838. ACM (2015). DOI 10.1145/2702613.2732781. URL
http://doi.acm.org/10.1145/2702613.2732781

19. Kouw, M., Heuvel, C.V.d., Scharnhorst, A.: Exploring uncertainty in knowledge
representations: Classifications, simulations, and models of the world. In: P. Wouters,
A. Beaulieu, A. Scharnhorst, S. Wyatt (eds.) Virtual Knowledge. Experimenting in the
Humanities and the Social Sciences, pp. 89–126. Cambridge, Mass.: MIT Press (2013)

20. Leydesdorff, L., Welbers, K.: The semantic mapping of words and co-words in contexts.
Journal of Informetrics 5(3), 469–475 (2011). DOI 10.1016/j.joi.2011.01.008

21. Lu, K., Wolfram, D.: Measuring author research relatedness: A comparison of word-based,
topic-based, and author cocitation approaches. Journal of the American Society for Infor-
mation Science and Technology 63(10), 1973–1986 (2012). DOI 10.1002/asi.22628

22. Mahalanobis, P.C.: On the generalised distance in statistics. Proceedings National Insti-
tute of Science, India 2(1), 49–55 (1936)

23. Mali, F., Kronegger, L., Doreian, P., Ferligoj, A.: Dynamic scientific co-authorship
networks. In: A. Scharnhorst, K. Börner, P. van den Besselaar (eds.) Models of Science
Dynamics, Understanding Complex Systems, pp. 195–232. Springer Berlin Heidelberg (2012).
DOI 10.1007/978-3-642-23068-4_6. URL http://dx.doi.org/10.1007/978-3-642-23068-4_6

24. Mayr, P., Scharnhorst, A.: Scientometrics and information retrieval: weak-links revitalized.
Scientometrics 102(3), 2193–2199 (2015). DOI 10.1007/s11192-014-1484-3. URL http:
//dx.doi.org/10.1007/s11192-014-1484-3

25. Mutschke, P., Mayr, P.: Science models for search: a study on combining scholarly in-
formation retrieval and scientometrics. Scientometrics pp. 1–23 (2014). DOI 10.1007/
s11192-014-1485-2. URL http://dx.doi.org/10.1007/s11192-014-1485-2

26. Papadimitriou, C.H., Raghavan, P., Tamaki, H., Vempala, S.: Latent semantic indexing:
A probabilistic analysis. Journal of Computer and System Sciences 61(2), 217 – 235
(2000). DOI http://dx.doi.org/10.1006/jcss.2000.1711. URL http://www.sciencedirect.
com/science/article/pii/S0022000000917112

27. Petersen, A.: Simulating nature: A philosophical study of computer-simulation uncertain-
ties and their role in climate science and policy advice. Het Spinhuis: Apeldoorn (2006)

28. Radicchi, F., Fortunato, S., Vespignani, A.: Citation Networks. In: A. Scharnhorst,
K. Börner, P. Besselaar (eds.) Models of Science Dynamics, Understanding Complex
Systems, vol. 69, chap. 7, pp. 233–257. Springer Berlin / Heidelberg, Berlin, Heidelberg
(2012). DOI 10.1007/978-3-642-23068-4_7. URL http://dx.doi.org/10.1007/978-3-642-23068-4_7

29. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill,
Inc., New York, NY, USA (1986)

30. de Solla Price, D.J.: Networks of scientific papers. Science 149(3683), 510–515
(1965). DOI 10.1126/science.149.3683.510. URL http://www.sciencemag.org/content/
149/3683/510.short

31. Van Eck, N.J., Waltman, L.: Citation-based clustering of publications. In: J. Gläser,
A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative
approach to the identification of thematic structures in science, Special Issue of Sciento-
metrics (2017). DOI 10.1007/s11192-017-2300-7

32. Van Heur, B., Leydesdorff, L., Wyatt, S.: Turning to ontology in STS? turning to
STS through “ontology”. Social Studies of Science 43(3), 341–362 (2013). DOI
10.1177/030631271245814

33. Velden, T., Boyack, K., van Eck, N., Glänzel, W., Gläser, J., Havemann, F., Heinz, M.,
Koopman, R., Scharnhorst, A., Thijs, B., Wang, S.: Comparison of topic extraction ap-
proaches and their results. In: J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data –
different results? Towards a comparative approach to the identification of thematic struc-
tures in science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2306-1




34. Velden, T., Yan, S., Lagoze, C.: Mapping the Cognitive Structure of Astrophysics by
Infomap. In: J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different results?
Towards a comparative approach to the identification of thematic structures in science,
Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2299-9

35. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison:
Variants, properties, normalization and correction for chance. Journal of Machine Learning
Research 11, 2837–2854 (2010)

36. Wang, S., Koopman, R.: Clustering articles based on semantic similarity. In: J. Gläser,
A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative
approach to the identification of thematic structures in science, Special Issue of Sciento-
metrics (2017). DOI 10.1007/s11192-017-2298-x

37. Zitt, M., Bassecoulard, E.: Delineating complex scientific fields by an hybrid lexical-
citation method: An application to nanosciences. Information Processing & Manage-
ment 42(6), 1513 – 1531 (2006). DOI http://dx.doi.org/10.1016/j.ipm.2006.03.016. URL
http://www.sciencedirect.com/science/article/pii/S0306457306000379. Special Is-
sue on Informetrics

38. Zitt, M., Lelu, A., Bassecoulard, E.: Hybrid citation-word representations in science
mapping: Portolan charts of research fields? Journal of the American Society for
Information Science and Technology 62, 19–39 (2011)

