Contextualization of topics
Browsing through the universe of bibliographic information

Rob Koopman · Shenghui Wang · Andrea Scharnhorst

arXiv:1702.08210v1 [cs.DL] 27 Feb 2017

R. Koopman
OCLC Research, Schipholweg 99, Leiden, The Netherlands
Tel.: +31 71 524 6500
E-mail: rob.koopman@oclc.org

S. Wang
OCLC Research, Schipholweg 99, Leiden, The Netherlands
Tel.: +31 71 524 6500
E-mail: shenghui.wang@oclc.org

A. Scharnhorst
DANS-KNAW, Anna van Saksenlaan 51, The Hague, The Netherlands
Tel.: +31 70 349 4450
E-mail: andrea.scharnhorst@dans.knaw.nl

Abstract This paper describes how semantic indexing can help to generate a contextual overview of topics and visually compare clusters of articles. The method was originally developed for an innovative information exploration tool, called Ariadne, which operates on bibliographic databases with tens of millions of records [18]. In this paper, the method behind Ariadne is further developed and applied to the research question of the special issue “Same data, different results” – the better understanding of topic (re-)construction by different bibliometric approaches. For the case of the Astro dataset of 111,616 articles in astronomy and astrophysics, a new instantiation of the interactive exploring tool, LittleAriadne, has been created. This paper contributes to the overall challenge to delineate and define topics in two different ways. First, we produce two clustering solutions based on vector representations of articles in a lexical space. These vectors are built on semantic indexing of entities associated with those articles. Second, we discuss how LittleAriadne can be used to browse through the network of topical terms, authors, journals, citations and various cluster solutions of the Astro dataset. More specifically, we treat the assignment of an article to the different clustering solutions as an additional element of its bibliographic record. Keeping the principle of semantic indexing on the level of such an extended list of entities of the bibliographic record, LittleAriadne in turn provides a visualization of the context of a specific clustering solution. It also conveys the similarity of article clusters produced by different algorithms, hence representing a complementary approach to other possible means of comparison.

Keywords Random projection · clustering · visualization · topical modelling · interactive search interface · semantic map · knowledge map

1 Introduction

What is the essence, or the boundary of a scientific field? How can a topic be defined? Those questions are at the heart of bibliometrics. They are equally relevant for indexing, cataloguing and consequently information retrieval [24]. Rigour and stability in bibliometrically defining boundaries of a field are important for research evaluation and consequently the distribution of funding. But, for information retrieval – next to accuracy – serendipity, broad coverage and associations to other fields are of equal importance. If researchers seek information about a certain topic outside of their areas of expertise, their information needs can be quite different from those in a bibliometric context. Among the many possible hits for a search query, they may want to know which are core works (articles, books) and which are rather peripheral. They may want to use different rankings [25], get some additional context information about authors or journals, or see other closely related vocabulary or works associated with a search term. On the whole, they would have less need to define a topic and a field in a bijective, univocal way.

Such a possibility to contextualize is not only important for term-based queries.
It also holds for groups of query terms, or for the exploration of sets of documents produced by different clustering algorithms. Contextualisation is the main motivation behind this paper.

If we talk of contextualisation we still stay in the realm of bibliographic information. That is, we rely on information about authors, journals, words and references as hidden in the entirety of the set of all bibliographic records. Decades of bibliometrics research have produced many different approaches to cluster documents, or more specifically, articles. They often focus on one entity of the bibliographic record. To give one example, articles and terms within those articles (in title, abstract and/or full text) form a bipartite network. From this network we can either build a network of related terms (co-word analysis) or a network of related articles (based on shared words). The first method, sometimes called lexical [20], has been applied in scientometrics to produce so-called topical or semantic maps. The same exercise can be applied to authors and articles, authors and words [21], and in effect to each element of the bibliographic record for an article [13]. If we extend the bibliographic record of an article with the list of references contained in this article, we enter the area of citation analysis. Here, the following methods are widely used: direct citations, bibliographic coupling and co-citation maps. Hybrid methods combine citation and lexical analysis (e.g., [14, 37]). We would like to note here that, in an earlier comparison of citation- and word-based mapping approaches, Zitt et al. [38] underline the differences between the two signals in terms of what aspect of scientific practice they represent. We come back to this in the next paragraph. Formally speaking, the majority of studies apply one method and often display unipartite networks. Sometimes analysis and visualization of multi-partite networks can be found [32].
Each network representation of articles captures some aspect of connectivity and structure which can be found in published work. Co-authorship networks shed light on the social dimension of knowledge production, the so-called Invisible College [9, 23]. Citation relations are interpreted as traces of flows of knowledge [28, 30]. By using different bibliographic elements, we obtain different models for, or representations of, a field or topic; i.e. as a conceptual, cognitive unit; as a community of practice; or as institutionalized in journals. One could also say that choosing what to measure affects the representation of a field or topic. Another source of variety beyond differences arising from choice of representations is how to analyze those representations. Fortunately, network analysis provides several classical methods to choose from, including clustering and clique analysis. However, clusters can be defined in different ways, and some clustering algorithms can be computationally expensive when used on large or complex networks. Consequently, we find different solutions for the same algorithm (if parameters in the algorithm are changed) and different solutions for different algorithms. One could call this an effect of the choice of instrument for the measurement or how to measure. Using an ideal-typical workflow, these points of choice have been further detailed and discussed in another paper of this special issue [33]. The variability in each of the stages of the workflow results in ambiguity, and, if not articulated, makes it even harder to reproduce results. Overall, moments of choice add an uncertainty margin to the results [19, 27]. Last but not least, we can ask ourselves whether clear delineations exist between topics in practice. Often in the sciences very different topics are still related to each other.
There exist unsharp boundaries and almost invisible long threads in the fabric of science [5], which might inhibit the finding of a contradiction-free solution in the form of a unique set of disjoint clusters.

There is a seeming paradox between the fact that experts often can rather clearly identify what belongs to their field or a certain topic, and that it is so hard to quantitatively represent this with bibliometric methods. However, a closer look into science history and science and technology studies reveals that even among experts opinions regarding subject matter classification or topic identification might vary. What belongs to a field and what does not is as much an epistemic question as an object of social negotiations. Moreover, the boundaries of a field change over time, and even a defined canon or body of knowledge determining the essence of a field or a topic can still be controversial or subject to change [8].

Defining a topic requires a trade-off between accepting the natural ambiguity of what a topic is and the necessity to define a topic for purposes of education, knowledge acquisition, and evaluation. Since different perspectives serve different purposes, there is also a need to preserve the diversity and ambiguity described earlier. Having said this, for the sake of scientific reasoning it is equally necessary to be able to further specify the validity and appropriateness of different methods for defining topics and fields [11].

This paper contributes to this sorting-out process in several ways. All are driven by the motivation to provide a better understanding of the topic re-construction results by providing context: context of the topics themselves by using a lexical approach and all elements of the bibliographic record to delineate topics; and context for different solutions in the (re-)construction of topics.
We first introduce the method of semantic indexing, by which each bibliographic record is decomposed and a vector representation for each of its entities in a lexical space is built, resulting in a so-called semantic matrix. This approach is conceptually closer to classical information retrieval techniques based on Salton’s vector space model [29] than to the usual bibliometric mapping techniques. In particular, it is similar to Latent Semantic Indexing or Latent Semantic Analysis. In the specific case of the Astro dataset, we extend the bibliographic record with information on cluster assignments provided by different clustering solutions. For the purpose of a delineation of topics based on clustering of articles, we reconstruct a semantic matrix for articles based on the semantic indexing of their individual entities. Secondly, based on this second matrix, we produce our own clustering solutions (detailed in [36]) by applying two different clustering algorithms. Third, we present an interactive visual interface called LittleAriadne that displays the context around those extracted entities. The interface responds to a search query with a network visualization of the most related terms, authors, journals, citations and cluster IDs. The query can consist of words or author names, but also clustering solutions. The displayed nodes or entities around a query term represent, to a certain extent, the context of the query in a lexical, semantic space.

In what follows, we address the following research questions:

Q1 How does the Ariadne algorithm, originally developed for a large corpus containing tens of millions of articles, work on a much smaller, field-specific dataset? How can we relate the produced contexts to domain knowledge retrieved from other information services?

Q2 Can we use LittleAriadne to compare different cluster assignments of papers, by treating those cluster assignments as additional entities?
What can we learn about the topical nature of these clusters when exploring them visually?

Concerning the last question, we restrict this paper to a description of the approach LittleAriadne offers, and we provide some illustrations. A more detailed discussion of the results of this comparison has been taken up as part of the comparison paper of this special issue [33], which on the whole addresses different analytic methods and visual means to compare different clustering solutions.

2 Data

The Astro dataset used in this paper contains documents published in the period 2003–2010 in 59 astrophysical journals.1 Originally, these documents had been downloaded from the Web of Science in the context of a German-funded research project called “Measuring Diversity of Research,” conducted at the Humboldt-University Berlin from 2009 to 2012. Based on institutional access to the Web of Science, we worked on the same dataset. Starting with 120,007 records in total, 111,616 records of the document types Article, Letter and Proceedings Paper have been treated with different clustering methods (see the other contributions to this special issue). Different clustering solutions have been shared, and eventually a selection of solutions for the comparison has been defined.

1 For details of the data collection and cleaning process leading to the commonly used Astro dataset see [33].

In our paper we used clustering solutions from CWTS-C5 (c) [31], UMSI0 (u) [34], HU-DC (hd) [12], STS-RG (sr) [6], ECOOM-BC13 (eb), ECOOM-NLP11 (en) (both [10]) and two of our own: OCLC-31 (ok) and OCLC-Louvain (ol) [36]. The CWTS-C5 and UMSI0 are the clustering solutions generated by two different methods, Infomap and the Smart Local Moving Algorithm (SLMA) respectively, applied on the same direct citation network of articles.
The two ECOOM clustering solutions are generated by applying the Louvain method to find communities among bibliographically coupled articles, where ECOOM-NLP11 also incorporates the keyword information. The STS-RG clusters are generated by first projecting the relatively small Astro dataset to the full Scopus database. After the full Scopus articles are clustered using SLMA on the direct citation network, the cluster assignments of Astro articles are collected. The HU-DC clusters are the only overlapping clusters, generated by a memetic type algorithm designed for the extraction of overlapping, poly-hierarchical topics in the scientific literature. Each article is assigned to a HU-DC cluster with a confidence value. We only took those assignments with a confidence value higher than 0.5. More detailed accounts of these clustering solutions can be found in [33]. Table 1 shows their labels later used in the interface, and how many clusters each solution produced. All the clustering solutions are based on the full dataset. However, not every article is guaranteed to have a cluster assignment in every clustering solution (see the papers about the clustering solutions for further details). The last column in Table 1 shows how many articles of the original dataset are covered by the different solutions.

Table 1 Statistics of clustering solutions generated by different methods

Cluster label   Solution       #Clusters   Coverage
c               CWTS-C5        22          91%
u               UMSI0          22          91%
ok              OCLC-31        31          100%
ol              OCLC-Louvain   32          100%
sr              STS-RG         556         96%
eb              ECOOM-BC13     13          97%
en              ECOOM-NLP11    11          98%
hd              HU-DC          113         91%

3 Method

3.1 Building semantic representations for entities

The Ariadne algorithm was originally developed on top of the article database ArticleFirst of OCLC [18]. The interface, accessible at http://thoth.pica.nl/relate, allows users to visually and interactively browse through 35 thousand journals, 3 million authors, and 1 million topical terms associated with 65 million articles.
The Ariadne pipeline consists of two steps: an offline procedure for semantic indexing and an online interactive visualization of the context of search queries. We applied the same method to the Astro dataset and built an instantiation, named LittleAriadne, accessible at http://thoth.pica.nl/astro/relate.

To describe our method we give an example of an article from the Astro dataset in Table 2. We list all the fields of this bibliographic record that we used for LittleAriadne. We include the following types of entities for semantic indexing: authors, journals (ISSN), subjects, citations, topical terms, MAI-UAT thesaurus terms and cluster IDs (see Table 1). For the Astro dataset, we extended the original Ariadne algorithm [17] by adding citations as additional entities. In the short paper about the OCLC clustering solutions [36] we applied clustering to different variants of the vector representation of articles, including variants with and without citations. We reported there about the effect of adding citations to vector representations of articles on clustering.

Table 2 An article from the Astro dataset

Article ID: ISI:000276828000006
Title: On the Mass Transfer Rate in SS Cyg
Abstract: The mass transfer rate in SS Cyg at quiescence, estimated from the observed luminosity of the hot spot, is log M-tr = 16.8 +/- 0.3. This is safely below the critical mass transfer rates of log M-crit = 18.1 (corresponding to log T-crit(0) = 3.88) or log M-crit = 17.2 (corresponding to the “revised” value of log T-crit(0) = 3.65). The mass transfer rate during outbursts is strongly enhanced
Author: [author:smak j]
ISSN: [issn:0001-5237]
Subject: [subject:accretion, accretion disks] [subject:cataclysmic variables] [subject:disc instability model] [subject:dwarf novae] [subject:novae, cataclysmic variables] [subject:outbursts] [subject:parameters] [subject:stars] [subject:stars dwarf novae] [subject:stars individual ss cyg] [subject:state] [subject:superoutbursts]
Citation: [citation:bitner ma, 2007, astrophys j 1, v662, p564] [citation:bruch a, 1994, astron astrophys sup, v104, p79] [citation:buatmenard v, 2001, astron astrophys, v369, p925] [citation:hameury jm, 1998, mon not r astron soc, v298, p1048] [citation:harrison te, 1999, astrophys j 2, v515, l93] [citation:kjuikchieva d, 1998, a as, v262, p53] [citation:kraft rp, 1969, apj, v158, p589] [citation:kurucz rl, 1993, cd rom] [citation:lasota jp, 2001, new astron rev, v45, p449] [citation:paczynski b, 1980, acta astron, v30, p127] [citation:schreiber mr, 2002, astron astrophys, v382, p124] [citation:schreiber mr, 2007, astron astrophys, v473, p897, doi 10.1051/0004-6361:20078146] [citation:smak j, 1996, acta astronom, v46, p377] [citation:smak j, 2002, acta astronom, v52, p429] [citation:smak j, 2004, acta astronom, v54, p221] [citation:smak j, 2008, acta astronom, v58, p55] [citation:smak ji, 2001, acta astronom, v51, p279] [citation:tutukov av, 1985, pisma astron zh, v11, p123] [citation:tutukov av, 1985, sov astron lett+, v11, p52] [citation:voloshina ib, 2000, astron rep+, v44, p89] [citation:voloshina ib, 2000, astron zh, v77, p109]
Topical terms: mass transfer; transfer rate; ss; cyg; quiescence; estimated; observed; luminosity; hot spot; log; tr; safely; critical; crit; corresponding; revised; value; outbursts; strongly; enhanced
UAT terms: [uat:stellar phenomena]; [uat:mass transfer]; [uat:optical bursts]
Cluster ID: [cluster:c 19] [cluster:u 16] [cluster:ok 18] [cluster:ol 23] [cluster:sr 17] [cluster:eb 1] [cluster:en 1] [cluster:hd 1] [cluster:hd 18] [cluster:hd 48]

In Table 2 we display the author name (and other entities) in a syntax (indicated by square brackets) that can immediately be used in the search field of the interface. Each author name is treated as a separate entity. The next type of entity is the journal, identified by its ISSN number. One can search for a single journal using its ISSN number. In the visual interface, the ISSN numbers are replaced by the journal name, which is used as label for a journal node. The next type of entities are so-called subjects. Those subjects originate from the fields “Author Keywords” and “Keywords Plus” of the original Web of Science records. Citations, references in the article, are considered as a type of entity too. Here, we use the standardized abbreviated citations in the Web of Science database. We remark that we do not apply any form of disambiguation, neither for the author names nor for the citations. Topical terms, such as “mass transfer” and “quiescence” in our example, are single words or two-word phrases extracted from titles and abstracts of all documents in the dataset. A multi-lingual stop-word list was used to remove unimportant words, and mutual information was used to generate two-word phrases.

Table 3 Entities in LittleAriadne

Journals        59
Authors         55,607
Topical terms   60,501
Subjects        41,945
Citations       386,217
UAT terms       1534
Cluster IDs     610
Total           546,473
Only words and phrases which occur more than a certain threshold value were kept. The next type of entity is a set of Unified Astronomy Thesaurus (UAT)2 terms which were assigned by Data Harmony’s Machine Aided Indexer (M.A.I.).3 Please refer to [7] for more details about the thesaurus and the indexing procedure. The last type of entity we add to each of the articles (specific for LittleAriadne) is the collection of cluster IDs corresponding to the clusters to which the article was assigned by the various clustering algorithms. For example, the article in Table 2 has been assigned to clusters “c 19” (produced by CWTS-C5) and “u 16” (produced by UMSI0), and so on. In other words, we treat the cluster assignments of articles as if they were classification numbers or additional subject headings. Table 3 lists the total number of different types of entities found in the Astro dataset.

To summarize, we deconstruct each bibliographic record, extract a number of entities, and add some more (the cluster IDs and the topical terms). Next, we construct for each of these entities a vector in a word space built from topical terms and subject terms. We assume that the context of all entities is captured by their vectors in this space. Figure 1 gives a schematic representation of these vectors, which form the matrix C. All types of entities (topical term, subject, author, citation, cluster ID and journal) form the rows of the matrix, and their components (all topical terms and subjects) the columns. The values of the vector components are the frequencies of the co-occurrence of an entity and a specific word in the whole dataset. That is, we count how many articles contain both an entity and a certain topical term or subject.

Matrix C expresses the semantics of all entities in terms of their context. Such context is then used in a computation of their similarity or relatedness. Each vector can be seen as the lexical profile of a particular entity.
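The counting step that produces C can be illustrated with a minimal sketch. It is not the Ariadne implementation: the toy records, the function name `build_cooccurrence`, and the nested-dictionary layout are assumptions made for this example, and whether a context word is counted as co-occurring with itself is also an assumption.

```python
from collections import defaultdict


def build_cooccurrence(records, context_types=("term", "subject")):
    """Count in how many articles each entity co-occurs with each context word.

    records: list of articles, each a list of (entity_type, value) pairs.
    Only topical terms and subjects act as context words (the columns of C);
    every entity type gets a row.
    Returns counts[entity][context_word] -> number of articles containing both.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for entities in records:
        unique = set(entities)  # an article contributes at most 1 per pair
        context = [e for e in unique if e[0] in context_types]
        for entity in unique:
            for word in context:
                # Assumption: a context word also co-occurs with itself.
                counts[entity][word] += 1
    return counts


# Hypothetical toy records, loosely modelled on the entity syntax of Table 2.
records = [
    [("author", "smak j"), ("term", "mass transfer"), ("subject", "dwarf novae")],
    [("author", "smak j"), ("term", "mass transfer"), ("term", "outbursts")],
    [("author", "doe j"), ("term", "outbursts")],
]
C = build_cooccurrence(records)
```

In this toy example the row for `("author", "smak j")` has the value 2 in the column `("term", "mass transfer")`, because both entities appear together in two articles. A real dataset would call for a sparse matrix rather than nested dictionaries.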
A high cosine similarity value between two entities indicates a large overlap of the contexts of these two entities – in other words, a high similarity between them. This is different from measuring their direct co-occurrence.

2 http://astrothesaurus.org/
3 http://www.dataharmony.com/services-view/mai/

Fig. 1 Dimension reduction using Random Projection

For LittleAriadne, the matrix C has roughly 546K × 102K elements, and is sparse and expensive for computation. To make the algorithm scale and to produce a responsive online visual interface, we applied the method of Random Projection [1, 15] to reduce the dimensionality of the matrix. As shown in Figure 1, we multiply C with a 102K × 600 matrix whose entries are randomly set to –1 or 1 with equal probability.4 This way, the original 546K × 102K matrix C is reduced to a Semantic Matrix C′ of the size of 546K × 600. Still, each row vector represents the semantics of an entity. It has been discussed elsewhere [2] that with the method of Random Projection, similar to other dimension reduction methods, essential properties of the original vector space are preserved, and thus entities with a similar profile in the high-dimensional space still have a similar profile in the reduced space. A big advantage of Random Projection is that the computation is significantly less expensive than other methods, e.g., Principal Component Analysis [2]. Actually, Random Projection is often suggested as a way of speeding up Latent Semantic Indexing (LSI) [26], and Ariadne is similar to LSI in some ways. LSI starts from a weighted term-document matrix, where each row represents the lexical profile of a document in a word space. In Ariadne, however, the unit of analysis is not the document. Instead, each entity of the bibliographic record is subject to a lexical profile.
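A small numerical sketch may help to see why a random ±1 projection roughly preserves cosine similarity. It uses a dense toy matrix with 2000 columns instead of the sparse 546K × 102K matrix; the helper names `random_projection` and `cosine`, the seeds, and the toy data are assumptions for illustration only.

```python
import numpy as np


def random_projection(C, k, seed=0):
    """Reduce the rows of C (n x d) to k dimensions.

    The projection matrix has entries -1 or +1, each with probability 1/2.
    """
    rng = np.random.default_rng(seed)
    R = rng.choice([-1.0, 1.0], size=(C.shape[1], k))
    return C @ R


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy profiles: rows 0 and 1 are near-duplicates, row 2 is unrelated.
rng = np.random.default_rng(1)
d = 2000
base = rng.random(d)
C = np.vstack([base, base + 0.1 * rng.random(d), rng.standard_normal(d)])

C_reduced = random_projection(C, k=600)
sim_close = cosine(C_reduced[0], C_reduced[1])  # stays close to 1
sim_far = cosine(C_reduced[0], C_reduced[2])    # stays close to 0
```

The two similar rows remain highly similar after the reduction and the unrelated row remains nearly orthogonal. The distortion shrinks as the number of target dimensions grows.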
We explain in the next section that, by aggregating over all entities belonging to one article, one can construct a vector representation for the article that represents its semantics and is suitable for further clustering processes (for more details please consult [36]).

4 More efficient random projections are available. This version is more conservative and also computationally easier.

With the Matrix C′, the interactive visual interface dynamically computes the most related entities (i.e., ranked by cosine similarity) to a search query. After irrelevant entities have been filtered out by removing entities with a high Mahalanobis distance [22] to the query, the remaining entities and the query node are positioned in 2D so that the distance between nodes preserves the corresponding distance in the high-dimensional space as much as possible. We use a spring-like force-directed graph drawing algorithm for the positioning of the nodes. Because LittleAriadne is designed as an experimental, explorative tool, no other optimisation of the network layout is applied. In the online interface, it is possible to zoom into the visualization, to change the size of the labels (font slider) as well as the number of entities displayed (show slider). For the figures in the paper, we used snapshots, in which node labels might overlap. Therefore, we provide links to the corresponding interactive display for each of the figures. In the end, with its most related entities, the context of a query term can be effectively presented [18]. For LittleAriadne we extended the usual Ariadne interface with different lists of the most related entities, organized by type. This information is given below the network visualization.

3.2 From a semantic matrix of entities to a semantic matrix for articles

The Ariadne interface provides context around entities, but does not produce article clusters directly.
In other words, articles contribute to the context of the entities associated with them, but their own semantics need to be reconstructed before we can apply clustering methods to identify article clusters. We describe the OCLC clustering workflow elsewhere [36], but here we would like to explain the preparatory work for it.

The first step is to create a vector representation of each article. For each article, we look up all entities associated with this article in the Semantic Matrix C′. We purposefully leave out the cluster IDs, because we want to construct our own clustering later independently, i.e., without already including information about clustering solutions of other teams.

For each article we obtain a set of vectors. For our article example in Table 2 we have 55 entities. The set of vectors for this article entails one vector representing the single author of this article, 12 vectors for the subjects, one vector for the journal, 21 vectors for the citations and 20 vectors for topical terms. Each article is represented by a unique set of vectors. The size of the set can vary, but each of the vectors inside a set has the same length, namely 600. For each article we compute the weighted average of its constituent vectors as its semantic representation. Each entity is weighted by its inverse document frequency to the third power; therefore, frequent entities are heavily penalized and contribute little to the resulting representation of the article. In the end, each article is represented by a vector of 600 dimensions, which becomes a row in a new matrix M with the size of 111,616 × 600. Note that since articles are represented as vectors in the same space where other entities are also represented, it is now possible to compute the relatedness between entities and articles. Therefore, in the online interface, we can present the articles most related to a query.
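The aggregation step can be sketched as follows. The text only states that each entity is weighted by its inverse document frequency to the third power; the particular definition idf = log(N / df) used here, the function name `article_vector`, and the toy numbers are assumptions made for this illustration.

```python
import numpy as np


def article_vector(entity_vectors, doc_freqs, n_articles):
    """Weighted average of the vectors of an article's entities.

    entity_vectors: array of shape (n_entities, dim), rows taken from C'.
    doc_freqs: number of articles in which each entity occurs.
    Each entity is weighted by idf ** 3, with idf = log(N / df) (assumed form).
    Note: an entity occurring in every article gets weight 0.
    """
    V = np.asarray(entity_vectors, dtype=float)
    df = np.asarray(doc_freqs, dtype=float)
    idf = np.log(n_articles / df)
    weights = idf ** 3
    return (weights[:, None] * V).sum(axis=0) / weights.sum()


# Hypothetical two-entity article in a 2-dimensional toy space:
# a rare entity (10 articles) and a frequent one (10,000 articles).
vec = article_vector(
    entity_vectors=[[1.0, 0.0], [0.0, 1.0]],
    doc_freqs=[10, 10_000],
    n_articles=100_000,
)
```

With these toy numbers the idf values differ by a factor of 4, so the cubed weights differ by a factor of 64 and the rare entity determines about 98% of the article vector. This shows how strongly the third power suppresses frequent entities.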
To group these 111,616 articles into meaningful clusters, we apply standard clustering methods to M. As a first choice, we use the K-Means clustering algorithm, which results in 31 clusters. As detailed in [36], with k = 31, the resulting 31 clusters perform the best according to a pseudo-ground-truth built from the consensus of CWTS-C5, UMSI0, STS-RG and ECOOM-BC13. With this clustering solution the whole dataset is partitioned fairly evenly: the average size is 3600 ± 1371, the largest cluster contains 6292 articles and the smallest 1627 articles.

We also apply a network-based clustering method: the Louvain community detection algorithm. To avoid high computational cost, we first calculate for each article the top 40 most related articles, i.e., those with the highest cosine similarity. This results in a new adjacency matrix M′ between articles, representing an article similarity network where the nodes are articles and the links indicate that the connected articles are very similar. We set the threshold for the cosine similarity at 0.6 to remove links with low similarity values. A standard Louvain community detection algorithm [3] is applied to this network, producing a partition into 32 clusters. Compared to the 31 K-Means clusters, these 32 Louvain clusters vary more in terms of cluster size, with the largest cluster containing 9464 articles and the smallest only 86 articles. The Normalized Mutual Information [35] between these two solutions is 0.68, indicating that they are highly similar to each other yet different enough to be studied further. More details can be found in [36].

4 Experiments and results

To answer the two research questions listed in the introduction, we conducted the following experiments:

Experiment 1. We implemented LittleAriadne as an information retrieval tool. We searched with query terms, and inspected and navigated through the resulting network visualization.

Experiment 2.
We visually observed and compared different clustering solutions.

4.1 Experiment 1 – Navigate through networked information

We implemented LittleAriadne, which allows users to browse the context of the 546K entities associated with 111K articles in the dataset. If the search query refers to an entity that exists in the semantic matrix, LittleAriadne will return, by default, the top 40 most related entities, which could be topical terms, authors, subjects, citations or clusters. If there are multiple known entities in the search query, a weighted average of the vectors of individual entities is used to calculate similarities (the same way an article vector is constructed). If the search query does not contain any known entities, a blank page is returned, as there is no information about this query.

Figure 2 gives a contextual view of “gamma ray.”5 The search query refers to a known topical term “gamma ray,” and it is therefore displayed as a red node in the network visualization. The top 40 most related entities are shown as nodes, with the top 5 connected by the red links. The different colours reflect their types, e.g., topical terms, subjects, authors, or clusters. Each of these 40 entities is further connected to its top 5 most related entities among the rest of the entities in the visualization, with the condition that the cosine similarity is not below 0.6.

5 Available at http://thoth.pica.nl/astro/relate?input=gamma+ray

Fig. 2 The contextual view of the query term “gamma ray”

A thicker link means the two linked entities are mutually related, i.e., they are in each other’s top 5 list. The colour of the link takes that of the node from which the link originates. If the link is mutual and the two linked entities are of different types, one of the entity colours is chosen.
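The selection of nodes and links described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the LittleAriadne code: the function name `related_graph`, its return format, and the toy matrix are hypothetical, and the Mahalanobis filtering and the 2D layout are left out.

```python
import numpy as np


def related_graph(S, query, top_n=40, top_k=5, min_sim=0.6):
    """Pick the entities to display for a query and the links between them.

    S: semantic matrix, one row per entity; query: row index of the query.
    Returns (nodes, links, mutual):
      nodes  - the top_n entities most similar to the query, best first
               (the first top_k of them would be linked to the query itself)
      links  - directed pairs (i, j): j is among i's top_k displayed
               neighbours and their cosine similarity is at least min_sim
      mutual - undirected pairs that are in each other's top_k list
    """
    U = S / np.linalg.norm(S, axis=1, keepdims=True)  # unit-length rows
    order = np.argsort(-(U @ U[query]))
    nodes = [int(i) for i in order if i != query][:top_n]

    sub = U[nodes] @ U[nodes].T        # cosine similarities among the nodes
    np.fill_diagonal(sub, -np.inf)     # an entity is not its own neighbour
    links = set()
    for a, row in enumerate(sub):
        for b in np.argsort(-row)[:top_k]:
            if row[b] >= min_sim:
                links.add((nodes[a], nodes[int(b)]))
    mutual = {(i, j) for (i, j) in links if i < j and (j, i) in links}
    return nodes, links, mutual


# Hypothetical 2-dimensional toy matrix; entity 0 is the query.
S = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, 0.2], [0.0, 1.0], [0.1, 1.0]])
nodes, links, mutual = related_graph(S, query=0, top_n=4, top_k=1)
```

In the toy example the entities fall into two groups: entities 1 and 2 link to each other, as do entities 3 and 4, and both pairs are mutual. In the interface, such mutual pairs would be drawn with thicker links.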
The displayed entities often automatically form groups depending on their relatedness to each other, whereby more related entities are positioned closer to each other. Each group potentially represents a different aspect related to the query term. The size of a node is proportional to the logarithm of its frequency of occurrence in the whole dataset. The absolute number of occurrences appears when hovering the mouse cursor over the node. Because statistical methods are at the core of the Ariadne algorithm, this number gives an indication of the reliability of the suggested position and links.

In Figure 2, there are four clusters, from OCLC-31, ECOOM-BC13, ECOOM-NLP11 and CWTS-C5. The ECOOM-BC13 cluster eb 8 and the ECOOM-NLP11 cluster en 4 are directly linked to “gamma ray,” suggesting that these two clusters are probably about gamma rays. It is not surprising that they are very close to each other, because they contain 7560 and 5720 articles respectively but share 3603 articles. In the lower part, the OCLC-31 cluster ok 21 and the CWTS-C5 cluster c 15 are also quite close to our search term. They contain 1849 and 3182 articles respectively and share 1721 articles, which makes them close to each other in the visualization. By looking at the topical terms and subjects around these clusters, we can get a rough idea of their differences. Although they are all about “gamma ray,” clusters eb 8 and en 4 are probably more about “radiation mechanisms,” “very high energy,” and “observations,” while clusters ok 21 and c 15 seem to focus more on “afterglows,” “prompt emission,” and “fireball.” Such observations invite users to explore these clusters or subjects further. Each node is clickable, and a click leads to another visualization showing the context of the selected node.

Fig. 3 The contextual view of cluster ok 21
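To make the article overlaps quoted above easier to compare, the following Python sketch computes two simple set-overlap measures from the reported cluster sizes and shared article counts. The measures (Jaccard index and the shared fraction of the smaller cluster) are chosen here for illustration only; LittleAriadne itself positions clusters by cosine similarity in the semantic space, not by these quantities.

```python
def jaccard(size_a, size_b, shared):
    """Jaccard index |A ∩ B| / |A ∪ B| from set sizes and the overlap size."""
    return shared / (size_a + size_b - shared)


def shared_fraction_of_smaller(size_a, size_b, shared):
    """Fraction of the smaller cluster's articles that also appear in the larger one."""
    return shared / min(size_a, size_b)


# (size of first cluster, size of second cluster, shared articles), as reported in the text
pairs = {
    "eb 8 / en 4": (7560, 5720, 3603),
    "ok 21 / c 15": (1849, 3182, 1721),
}

for name, (a, b, s) in pairs.items():
    print(f"{name}: Jaccard = {jaccard(a, b, s):.3f}, "
          f"shared fraction of smaller cluster = {shared_fraction_of_smaller(a, b, s):.3f}")
```

On these numbers, ok 21 is almost entirely contained in c 15, whereas eb 8 and en 4 share a smaller part of their articles.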
If one is interested in cluster ok 21, for instance, clicking the node presents a contextual view of cluster ok 21,⁶ as shown in Figure 3. This context view provides a good indication of the content of the articles grouped together in this cluster. In the context view of cluster ok 21 we again see the cluster c 15, which was already near ok 21 in the context view of “gamma ray.” But the two ECOOM clusters eb 8 and en 4, which are also in the context of “gamma ray,” are no longer visible. Instead, we find two more similar clusters, u 11 and ol 9. This means that, even though the clusters ok 21 and eb 8 are both among the top 40 entities related to “gamma ray,” they still differ in terms of their content. This can be confirmed by looking at their labels in Table 4.⁷

⁶ Available at http://thoth.pica.nl/astro/relate?input=[cluster:ok%2021]
⁷ More details about cluster labelling can be found in [16].

Table 4 Labels of clusters similar to ok 21 and to “gamma ray”

Cluster ID   Size   Cluster labels
ok 21        1849   grb, ray burst, gamma ray, afterglow, bursts grbs, swift, prompt emission, prompt, fireball, batse
c 15         3182   grb, ray bursts, gamma ray, afterglow, bursts grbs, sn, explosion, swift, type ia, supernova sn
ol 9         2895   grb, ray bursts, gamma ray, afterglow, bursts grbs, sn, type ia, swift, explosion, ia supernovae
u 11         2051   grb, ray bursts, gamma ray, afterglow, bursts grbs, sn, explosion, type ia, swift, supernova
eb 8         7560   gamma ray, pulsar, ray bursts, grb, bursts grbs, high energy, jet, radio, psr, synchrotron
en 4         5720   gamma ray, grb, ray bursts, cosmic ray, high energy, bursts grbs, afterglow, swift, tev, tev gamma

As mentioned before, in the interface one can also further refine the display. For instance, one can choose the number of nodes to be shown or decide to limit the display to only authors, journals, topical terms, subjects, citations or clusters. The former can be done with the slider labelled show or by editing the URL string directly. For the latter, tick boxes are provided. An additional slider, labelled font, lets the user adjust the font size of the labels.

A display with only one type of entity enables us to see the context filtered along one perspective (lexical, journals, authors, subjects), and is often useful. For example, Figure 4⁸ shows at least three separate groups of authors who are most related to “subject:hubble diagram.”

At any point of the exploration, one can see the most related entities, grouped by their types and listed at the bottom of the interface. The first category shown is the related titles, i.e., the titles of the articles most relevant to a search query. Due to licence restrictions, we cannot make the whole bibliography available. But when clicking on a title, one actually sees the context of that article. Not only the titles can be clicked through: all entities in the lower part are clickable, and such an action leads to another contextual view of the selected entity.
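The URLs given in the footnotes of this section follow one pattern, which suggests how the URL string can be edited or generated. The sketch below rebuilds three of them with Python's standard library. The parameter names `input`, `type` and `show` are taken from those URLs; their exact semantics (for instance, that `type=2` restricts the view to authors and `type=S` triggers a scan) are inferred from the figures they accompany and should be treated as assumptions, as should the continued availability of the service. The function name `relate_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address as it appears in the footnote URLs of this paper.
BASE_URL = "http://thoth.pica.nl/astro/relate"


def relate_url(query, **params):
    """Build a LittleAriadne-style URL for a query string plus optional parameters."""
    return BASE_URL + "?" + urlencode({"input": query, **params})


# The context of a topical term (footnote 5).
print(relate_url("gamma ray"))

# Authors related to a subject (footnote 8); type="2" is copied from that URL.
print(relate_url("[subject:hubble diagram]", type="2"))

# Two clustering solutions in one view, up to 500 nodes (footnote 12).
print(relate_url("[cluster:c][cluster:ok]", type="S", show=500))
```

`urlencode` percent-encodes the brackets and colons and turns spaces into `+`, which reproduces the encoded form used in most of the footnotes.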
At the top of the interface, under the search box, we find further hyperlinks behind the labels exact search and context search. Clicking on these hyperlinks automatically sends queries to other information spaces such as Google, Google Scholar, Wikipedia, and WorldCat. For exact search, the same query text is used. For context search, the system generates a selection among all topical terms related to the original query term and sends this selection as a string of terms (combined with the Boolean AND operation) to the information spaces behind the hyperlinks. This option offers users a potential way to retrieve related literature or web resources from a broader perspective. In turn, it also enables the user to better understand the entity-based context view provided by Ariadne.

Let us now come back to our first research question: how does the Ariadne algorithm work on a much smaller, field-specific dataset? The interface shows that the original Ariadne algorithm works well on the small Astro dataset. Not surprisingly, compared with our exploration of the much bigger and more general ArticleFirst dataset, we find more consistent representations; that is, specific vocabulary is displayed, which can be cross-checked in Wikipedia, Google or Google Scholar. On the other hand, different corpora introduce different contexts for entities. For example, “young” in ArticleFirst⁹ is associated with adults and 30 years old, while in LittleAriadne it is immediately related to young stars, which are merely 5 or 10 million years old.¹⁰ Also, the larger number of topical terms in the larger database leads to a situation where almost every query term produces a response. In LittleAriadne, by contrast, a search for, e.g., a writer such as Jane Austen retrieves nothing. Not surprisingly, for domain-specific entities, LittleAriadne tends to provide more accurate context. A more thorough evaluation needs to be based, as for any other topical mapping, on a discussion with domain experts.

Fig. 4 The authors who are the most related to “subject:hubble diagram”

⁸ Available at http://thoth.pica.nl/astro/relate?input=%5Bsubject%3Ahubble+diagram%5D&type=2
⁹ Available at http://thoth.pica.nl/relate?input=young
¹⁰ Available at http://thoth.pica.nl/astro/relate?input=young

4.2 Experiment 2 – Comparing clustering solutions

In LittleAriadne we extended the interface with the goal of observing and comparing clustering solutions visually. As discussed in Section 3.1, cluster assignments are treated in the same way as other entities associated with articles, such as topical terms, authors, etc. Each cluster ID is therefore represented in the same space and visualized in the same way. In the interface, when we use a search term, for example “[cluster:c]”, and tick the “scan” option, the interface scans all the entities in the semantic matrix that start with, in this case, “cluster:c,” and thus effectively selects and visualizes all CWTS-C5 clusters.¹¹ This way, we can easily see the distribution of a single clustering solution. Note that in this scanning visualization, any cluster that contains fewer than 100 articles is not shown.

Figure 5 shows the individual distribution of clusters from all eight clustering solutions. When two clusters have a relatively high mutual similarity, there is a link between them.
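The scan option and the size filter described above amount to a prefix match over entity names. A minimal Python sketch follows, assuming entity names of the form `cluster:c 15` (modelled on the `[cluster:ok 21]` query shown earlier) and a hypothetical dictionary that maps each entity to its number of articles. The entries `cluster:c 99` and the count for `subject:hubble diagram` are made up for illustration; whether the 100-article cut-off also applies to non-cluster entities is not stated in the text.

```python
def scan(prefix, entity_sizes, min_articles=100):
    """Return all entity names starting with `prefix` that have at least `min_articles` articles."""
    return sorted(name for name, n_articles in entity_sizes.items()
                  if name.startswith(prefix) and n_articles >= min_articles)


# Sizes of c 15 and ok 21 are taken from Table 4; the other two entries are hypothetical.
entity_sizes = {
    "cluster:c 15": 3182,
    "cluster:c 99": 40,              # hypothetical small cluster, hidden by the size filter
    "cluster:ok 21": 1849,
    "subject:hubble diagram": 120,   # hypothetical count
}

print(scan("cluster:c", entity_sizes))
print(scan("cluster:ok", entity_sizes))
```

Because the other solutions use prefixes such as `ok`, `ol`, `u`, `eb`, `en`, `sr` and `hd`, the prefix `cluster:c` selects only the CWTS-C5 clusters.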
It is not surprising that the HU-DC clusters are highly connected, as they overlap and form a poly-hierarchy. Compared to the CWTS-C5, UMSI0 and the two ECOOM solutions, the STS-RG and the two OCLC solutions have more cluster-cluster links. This suggests that these clusters overlap more in terms of their direct vocabularies and of the indirect vocabularies associated with their authors, journals and citations.

If we scan two or more cluster entities, such as “[cluster:c][cluster:ok],” we put two clustering solutions in the same visualization so that they can be compared visually. In Figure 6 (a) we see the high similarity between clusters from CWTS-C5 and those from OCLC-31.¹² CWTS-C5 has 22 clusters while OCLC-31 has 31 clusters. Each CWTS-C5 cluster is accompanied by one or more OCLC clusters. This indicates that the two solutions differ mainly in granularity rather than in any fundamental way. Figure 6 (b) shows two other sets of clusters that partially agree with each other but clearly have different capacities for identifying different clusters.¹³

Figure 7 (a) shows all the cluster entities from all eight clustering solutions.¹⁴ The STS and HU solutions have hundreds of clusters, which makes the visualization rather cluttered. Figure 7 (b) shows only the solutions from CWTS, UMSI, OCLC and ECOOM, whose numbers of clusters are comparable.¹⁵

Concerning our second research question, whether we can use LittleAriadne to compare clustering solutions visually, we can give a positive answer. However, it is not easy to see from LittleAriadne why some clusters are similar and others are not. The visualization functions as a macroscope [4] and provides a general overview of all the clustering solutions, which helps to guide further investigation. It is not conclusive, but it is a useful heuristic device. For example, from Figure 7, especially 7 (b), it is clear that there are “clusters of clusters.” That is, some clusters are detected by all of these different methods.
In the future we may investigate these clusters of clusters more closely and perhaps discover that different solutions identify some of the same topics. We continue the discussion of the use of visual analytics to compare clustering solutions in the paper by Velden et al. [33].

¹¹ This scan option is applicable to any other type of entity; for example, to see all subjects that start with “quantum”, use “subject:quantum” as the search term and tick the scan option.
¹² Available at http://thoth.pica.nl/astro/relate?input=%5Bcluster%3Ac%5D%5Bcluster%3Aok%5D&type=S&show=500
¹³ Available at http://thoth.pica.nl/astro/relate?input=%5Bcluster%3Au%5D%5Bcluster%3Asr%5D&type=S&show=500
¹⁴ Available at http://thoth.pica.nl/astro/relate?input=%5Bcluster%3Ac%5D%5Bcluster%3Au%5D%5Bcluster%3Aok%5D%5Bcluster%3Aol%5D%5Bcluster%3Aeb%5D%5Bcluster%3Aen%5D%5Bcluster%3Asr%5D%5Bcluster%3Ahd%5D&type=S&show=500
¹⁵ Available at http://thoth.pica.nl/astro/relate?input=%5Bcluster%3Ac%5D%5Bcluster%3Au%5D%5Bcluster%3Aok%5D%5Bcluster%3Aol%5D%5Bcluster%3Aeb%5D%5Bcluster%3Aen%5D&type=S&show=500

Fig. 5 The distribution of clusters: (a) CWTS-C5 clusters, (b) UMSI0 clusters, (c) OCLC-31 clusters, (d) OCLC-Louvain clusters, (e) ECOOM-BC13 clusters, (f) ECOOM-NLP11 clusters, (g) STS-RG clusters, (h) HU-DC clusters

Fig. 6 Visual comparison of clustering solutions: (a) Highly similar clustering solutions, (b) Clustering solutions with different focuses

Fig. 7 Visual comparison of clustering solutions: (a) All clustering solutions, (b) Clusters from CWTS, UMSI, OCLC and ECOOM

5 Conclusion

We present a method, implemented in an interface, that allows browsing through the context of entities, such as topical terms, authors, journals, subjects and citations associated with a set of articles. With the LittleAriadne interface, one can navigate visually and interactively through the context of entities in the dataset by seamlessly travelling between authors, journals, topical terms, subjects, citations and cluster IDs, as well as consult external open information spaces for further contextualization.

In this paper we particularly explored the usefulness of the method for the problem of topic delineation addressed in this special issue.
LittleAriadne treats cluster assignments from different solutions as additional special entities. This way we provide the contextual view of clusters as well. This is beneficial for users who are interested in travelling seamlessly between different types of entities and their related cluster assignments generated by different solutions.

We also contributed two clustering solutions built on the vector representation of articles, which differs from the solutions provided by other methods. We start by including references and treating them as entities with a certain lexical or semantic profile. In essence, we start from a multipartite network of papers, cited sources, terms, authors, subjects, etc. and focus on similarity in a high-dimensional space. Our clusters are comparable to other solutions yet have their own characteristics. Please see [33, 36] for more details.

We demonstrated that we can use LittleAriadne to compare different clustering solutions visually and generate a wider overview. This has the potential to be complementary to any other method of cluster comparison. We hope that this interactive tool supports discussion about different clustering algorithms and helps to find the right meaning of clusters.

We have plans to further develop the Ariadne algorithm. The Ariadne algorithm is general enough to incorporate additional types of entities into the semantic matrix. Which entities we can add very much depends on the information in the original dataset or database. In the future, we plan to add publishers, conferences, etc. with the aim of providing a richer contextualization of the entities typically found in a scholarly publication. We also plan to elaborate links to the articles that contribute to the contextual visualization, thus strengthening the usefulness of Ariadne not only for the associative exploration of contexts, similar to scrolling through a systematic catalogue, but also as a direct tool for document retrieval.
In this context we plan to further compare LittleAriadne and Ariadne. As mentioned before, the corpora matter when talking about the context of entities. The advantage of LittleAriadne is the confinement of the dataset to one scientific discipline or field and the topics within it. We hope that by continuing such experiments we will also learn more about the relationship between genericity and specificity of contexts, and how that can best be addressed in information retrieval.

Acknowledgement

Part of this work has been funded by the COST Action TD1210 Knowescape, and the FP7 Project ImpactEV. We would like to thank the internal reviewers Frank Havemann and Bart Thijs as well as the anonymous external referees for their valuable comments and suggestions. We would also like to thank Jochen Gläser, William Harvey and Jean Godby for comments on the text.

References

1. Achlioptas, D.: Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66(4), 671–687 (2003). DOI http://dx.doi.org/10.1016/S0022-0000(03)00025-4. URL http://www.sciencedirect.com/science/article/pii/S0022000003000254
2. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: Applications to image and text data. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’01, pp. 245–250. ACM, New York, NY, USA (2001). DOI 10.1145/502512.502546. URL http://doi.acm.org/10.1145/502512.502546
3. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment (2008). P10008 (12pp)
4. Börner, K.: Plug-and-play macroscopes. Communications of the ACM 54(3), 60–69 (2011)
5. Boyack, K., Klavans, R.: Weaving the fabric of science. In: K. Börner, E.F. Hardy (eds.) 6th Iteration (2009): Science Maps for Scholars. Places & Spaces: Mapping Science (2010)
6. Boyack, K.W.: Investigating the Effect of Global Data on Topic Detection. In: J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2297-y
7. Boyack, K.W.: Thesaurus-based methods for mapping contents of publication sets. In: J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2304-3
8. Galison, P.: Image and logic: A material culture of microphysics. University of Chicago Press (1997)
9. Glänzel, W., Schubert, A.: Analysing scientific networks through co-authorship. In: H.F. Moed, W. Glänzel, U. Schmoch (eds.) Handbook of quantitative science and technology research, pp. 257–276. Springer (2004). DOI 10.1007/1-4020-2755-9_12
10. Glänzel, W., Thijs, B.: Using hybrid methods and ‘core documents’ for the representation of clusters and topics. The astronomy dataset. In: J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics (2017)
11. Gläser, J., Glänzel, W., Scharnhorst, A.: Introduction to the special issue “same data, different results?”. In: J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2296-z
12. Havemann, F., Gläser, J., Heinz, M.: Memetic search for overlapping topics. In: J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2302-5
13. Havemann, F., Scharnhorst, A.: Bibliometric networks. CoRR abs/1212.5211 (2012). URL http://arxiv.org/abs/1212.5211
14. Janssens, F., Zhang, L., Moor, B.D., Glänzel, W.: Hybrid clustering for validation and improvement of subject-classification schemes. Information Processing & Management 45(6), 683–702 (2009). DOI http://dx.doi.org/10.1016/j.ipm.2009.06.003. URL http://www.sciencedirect.com/science/article/pii/S0306457309000673
15. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemporary Math. 26, 189–206 (1984)
16. Koopman, R., Wang, S.: Mutual information based labelling and comparing clusters. In: J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2305-2
17. Koopman, R., Wang, S., Scharnhorst, A.: Contextualization of topics - browsing through terms, authors, journals and cluster allocations. In: A.A. Salah, Y. Tonta, A.A.A. Salah, C.R. Sugimoto, U. Al (eds.) Proceedings of ISSI 2015 Istanbul: 15th International Society of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29 June to 3 July, 2015. Bogaziçi University Printhouse (2015). URL http://www.issi2015.org/files/downloads/all-papers/1042.pdf
18. Koopman, R., Wang, S., Scharnhorst, A., Englebienne, G.: Ariadne’s thread: Interactive navigation in a world of networked information. In: B. Begole, J. Kim, K. Inkpen, W. Woo (eds.) Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems, CHI 2015 Extended Abstracts, Seoul, Republic of Korea, April 18–23, 2015, pp. 1833–1838. ACM (2015). DOI 10.1145/2702613.2732781. URL http://doi.acm.org/10.1145/2702613.2732781
19. Kouw, M., Heuvel, C.V.d., Scharnhorst, A.: Exploring uncertainty in knowledge representations: Classifications, simulations, and models of the world. In: P. Wouters, A. Beaulieu, A. Scharnhorst, S. Wyatt (eds.) Virtual Knowledge. Experimenting in the Humanities and the Social Sciences, pp. 89–126. Cambridge, Mass.: MIT Press (2013)
20. Leydesdorff, L., Welbers, K.: The semantic mapping of words and co-words in contexts. Journal of Informetrics 5(3), 469–475 (2011). DOI 10.1016/j.joi.2011.01.008
21. Lu, K., Wolfram, D.: Measuring author research relatedness: A comparison of word-based, topic-based, and author cocitation approaches. Journal of the American Society for Information Science and Technology 63(10), 1973–1986 (2012). DOI 10.1002/asi.22628
22. Mahalanobis, P.C.: On the generalised distance in statistics. Proceedings National Institute of Science, India 2(1), 49–55 (1936)
23. Mali, F., Kronegger, L., Doreian, P., Ferligoj, A.: Dynamic scientific co-authorship networks. In: A. Scharnhorst, K. Börner, P. van den Besselaar (eds.) Models of Science Dynamics, Understanding Complex Systems, pp. 195–232. Springer Berlin Heidelberg (2012). DOI 10.1007/978-3-642-23068-4_6. URL http://dx.doi.org/10.1007/978-3-642-23068-4_6
24. Mayr, P., Scharnhorst, A.: Scientometrics and information retrieval: weak-links revitalized. Scientometrics 102(3), 2193–2199 (2015). DOI 10.1007/s11192-014-1484-3. URL http://dx.doi.org/10.1007/s11192-014-1484-3
25. Mutschke, P., Mayr, P.: Science models for search: a study on combining scholarly information retrieval and scientometrics. Scientometrics pp. 1–23 (2014). DOI 10.1007/s11192-014-1485-2. URL http://dx.doi.org/10.1007/s11192-014-1485-2
26. Papadimitriou, C.H., Raghavan, P., Tamaki, H., Vempala, S.: Latent semantic indexing: A probabilistic analysis. Journal of Computer and System Sciences 61(2), 217–235 (2000). DOI http://dx.doi.org/10.1006/jcss.2000.1711. URL http://www.sciencedirect.com/science/article/pii/S0022000000917112
27. Petersen, A.: Simulating nature: A philosophical study of computer-simulation uncertainties and their role in climate science and policy advice. Het Spinhuis: Apeldoorn (2006)
28. Radicchi, F., Fortunato, S., Vespignani, A.: Citation Networks. In: A. Scharnhorst, K. Börner, P. Besselaar (eds.) Models of Science Dynamics, Understanding Complex Systems, vol. 69, chap. 7, pp. 233–257. Springer Berlin / Heidelberg, Berlin, Heidelberg (2012). DOI 10.1007/978-3-642-23068-4_7. URL http://dx.doi.org/10.1007/978-3-642-23068-4_7
29. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA (1986)
30. de Solla Price, D.J.: Networks of scientific papers. Science 149(3683), 510–515 (1965). DOI 10.1126/science.149.3683.510. URL http://www.sciencemag.org/content/149/3683/510.short
31. Van Eck, N.J., Waltman, L.: Citation-based clustering of publications. In: J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2300-7
32. Van Heur, B., Leydesdorff, L., Wyatt, S.: Turning to ontology in STS? Turning to STS through “ontology”. Social Studies of Science 43(3), 341–362 (2013). DOI 10.1177/030631271245814
33. Velden, T., Boyack, K., van Eck, N., Glänzel, W., Gläser, J., Havemann, F., Heinz, M., Koopman, R., Scharnhorst, A., Thijs, B., Wang, S.: Comparison of topic extraction approaches and their results. In: J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2306-1
34. Velden, T., Yan, S., Lagoze, C.: Mapping the Cognitive Structure of Astrophysics by Infomap. In: J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2299-9
35. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research 11, 2837–2854 (2010)
36. Wang, S., Koopman, R.: Clustering articles based on semantic similarity. In: J. Gläser, A. Scharnhorst, W. Glänzel (eds.) Same data – different results? Towards a comparative approach to the identification of thematic structures in science, Special Issue of Scientometrics (2017). DOI 10.1007/s11192-017-2298-x
37. Zitt, M., Bassecoulard, E.: Delineating complex scientific fields by an hybrid lexical-citation method: An application to nanosciences. Information Processing & Management 42(6), 1513–1531 (2006). DOI http://dx.doi.org/10.1016/j.ipm.2006.03.016. URL http://www.sciencedirect.com/science/article/pii/S0306457306000379. Special Issue on Informetrics
38. Zitt, M., Lelu, A., Bassecoulard, E.: Hybrid citation-word representations in science mapping: Portolan charts of research fields? Journal of the American Society for Information Science and Technology 62, 19–39 (2011)