key: cord-0449546-wp3vnqak authors: Sjogaarde, Peter title: Improving overlay maps of science: combining overview and detail date: 2021-10-15 journal: nan DOI: nan sha: 36da7de8997efaf77a2c4291cf14cb2e2c107cb6 doc_id: 449546 cord_uid: wp3vnqak Overlay maps of science are global base maps over which subsets of publications can be projected. Such maps can be used to monitor, explore, and study research through its publication output. Most maps of science, including overlay maps, are flat in the sense that they visualize research fields at one single level. Such maps generally fail to provide both overview and detail about the research being analyzed. The aim of this study is to improve overlay maps of science to provide both features in a single visualization. I created a map based on a hierarchical classification of publications, including broad disciplines for overview and more granular levels to incorporate detailed information. The classification was obtained by clustering articles in a citation network of about 17 million publication records in PubMed from 1995 onwards. The map emphasizes the hierarchical structure of the classification by visualizing both disciplines and the underlying specialties. To show how the visualization methodology can help getting both overview of research and detailed information about its topical structure, I projected two overlay maps onto the base map: (1) open access publishing and (2) coronavirus/Covid-19 research. To be able to support and manage research activities there is a need to monitor and study research, for example to coordinate research, follow-up investment or strengthen collaboration in targeted areas. It is relatively easy to keep track of the research activities of small research units. In contrast, large research units, for example whole universities, may have hundreds or thousands of employees, producing thousands of research publications each year within a broad spectrum of topics. Keeping track of the competences, research areas and collaborations is a challenging task for such organizations. Research publications are one important output of research activity. Publications can be studied to monitor research activity and gain insight into aspects such as collaboration patterns, specialization, strong research areas, trends, and development. Overlay maps of science have been proposed to "offer an intuitive way of visualizing the position of organizations or topics in a fixed map" (Rafols et al., 2010) . Overlay maps are base maps over which subsets of publications or filters can be projected, for example to study the position of organizations or topics, or to highlight properties such as citation impact, open access publishing or clinical research (Kay et al., 2014) . To this point, most overlay maps have been flat in the sense that they visualize research fields at one single level, commonly the levels of research disciplines or specialties. Such maps generally fail to provide both overview and detail about the research being studied. The aim of this study is to improve overlay maps of science to provide these two features in one single, interactive map, having a focus on the biomedical sciences. The maps created enable users to explore multiple levels of a hierarchical classification in a single interactive visualization. Visualizations of science have been around for a long time (for overviews, see Börner et al., 2005; Petrovich, 2020; van Eck & Waltman, 2014; Zitt et al., 2019) . Early work focused on maps restricted to one or a few research domains. Different aspects of domains have been visualized and studied using a variety of entities and relations, for example: co-publishing between researchers or organizations, co-occurrence of keywords, and citation relations between publications or journals. Beginning in the end of the 1990s, maps that cover large parts of the science system (at least within the natural science and biomedicine) have been created. Initial maps of science were based on journals and made it possible to get unpreceded overviews of the science system (for examples, see Bassecoulard & Zitt, 1999; Boyack et al., 2005; Leydesdorff, 2004 Leydesdorff, , 2006 Moya-Anegón et al., 2004) . Rafols et al. (2010) showed how comprehensive maps of science can be used as base maps, over which overlays can be projected. The idea of such a map is to fixate the positions of the nodes in the map, representing for example research fields, so that an overlay projected onto the map can be easily compared with the base map, as well as with other projections. For instance, consider the publication outputs, A and B, of two universities. An overlay map is created by projecting A onto a base map. The size of the nodes is scaled in relation to the distribution of A over research fields. Another map is created based on B, using the same procedure. We can now compare the subject orientation of the two universities by exploring the two maps and compare node sizes. If we color the nodes of the maps based on some variable, we can analyze different aspects of A and B, for example the amount of open access publishing, citation impact or degree of international collaboration in different research fields. Compared to maps restricted to particular domains, overlay maps provide context and points of reference, for example by offering the possibility to spot areas in which A and B do not have any research. Since overlay maps were introduced in scientometrics, they have been used in many applications; Kay et al. (2014) used overlay maps to visualize patents by companies, Tank and Shapira (2011) analyzed the growth of US and China co-publications in nanotechnology, Klaine et al. (2017) positioned environmental, health, and safety of nanomaterials in relation to general nanotechnology, Leydesdorff et al. (2015) created an overlay map based on journal relations using Scopus data and exemplified how the map can be used to explore the publication output of authors, organizational units or other publication sets, and Rotolo et al. (2017) used three case studies to demonstrate the use of overlay maps for strategic intelligence. Most of the applications have used maps at one single level and one single entity (e.g., journals or keywords). The study by Rotolo et al. (2017) includes what they refer to as "cognitive" base maps of research publications at different granularity levels. These maps are based on the Web of Science categories at the broadest level, journals at the meso-level and medical subject headings (MeSH) at the most granular level. However, the different granularity levels are presented in different maps and hierarchical relations between levels are not shown. To provide the possibility to navigate from broad to narrow levels in one single map, I base the map presented in this paper on a hierarchical classification obtained by clustering articles in a citation network. Publication-level classification at a global level (covering a complete multidisciplinary data source) obtained by clustering articles in a citation network was first implemented by Waltman and van Eck (2012) . Compared to classification at the journal-level, publication-level classifications can be made more granular. A hierarchical structure can be obtained by merging clusters at lower levels into broader clusters. Publication-level classifications have been used 6 to create overlay maps (RoRI Institute, 2019) . However, the applications are few and lack hierarchical structure, other than node coloring by major research areas. The validity of the clustering solutions created by clustering articles in citation networks has been contested (Held et al., 2021) . There is no ground truth classification and different methodological choices results in different, sometimes equally valid, representations of research delineation (Glänzel & Schubert, 2003; Gläser et al., 2017; Klavans & Boyack, 2017; Mai, 2011; Sjögårde & Ahlgren, 2018; Smiraglia & van den Heuvel, 2013; Velden et al., 2017; Waltman & van Eck, 2012) . Nevertheless, the results have been compared to a wide range of baselines and many different applications have been evaluated and compared Boyack, 2017; Boyack et al., 2011; Boyack & Klavans, 2010 Donner, 2021; Haunschild et al., 2018; Sjögårde & Ahlgren, 2018 , 2020 Šubelj et al., 2016; Waltman et al., 2020) . Overall, the findings of this research literature suggest that publication-level classifications based on citation relations perform at least reasonably well when evaluated quantitatively, but there is still a lack of qualitative evaluations. Including citations that are external to the analyzed data set improves accuracy of a clustering solution Boyack, 2017; Donner, 2021; . This is an advantage of a global approach, compared to a local one in which a clustering is based on a restricted set of publications. Nonetheless, a local approach may be preferable in some applications because it can emphasize the local, within field, context of research. However, local maps are difficult to compare to other maps. The rationale of overlay maps is to make comparisons between maps. A global approach is therefore most often a superior approach for overlay maps. To create a visualization of biomedical research literature that incorporates both overview and detail I based the visualization on a hierarchical publication-level classification. This classification was obtained by clustering articles in a (direct) citation network of PubMed records. Currently, PubMed indexes over 1 million publications yearly, covering a wide range of biomedical research disciplines. I therefore refer to the classification created as "global", even though it does not have a comprehensive coverage of all fields of science. A similar classification was recently created by Boyack et al. (2020) . This classification differs mainly by the choice of similarity measure between publications. Boyack et. al based their classification on a combination of direct citations and textual similarity. By complementing direct citations with textual similarity, they were able to include publications that would otherwise have no relations. Thus, the relation between publications in their approach is a mixture of different, fundamentally different, similarity measures and makes the interpretation of the classification more difficult. Since the model of Boyack et al. was published, more citations have been made openly available and it is now possible to create citation-based classifications with a more comprehensive coverage. I therefore base my classification strictly on direct citations. Figure 1 illustrates the process used to obtain the classification from the citation network and to create a base map from the classification. The base map incorporates features to emphasize the hierarchical structure of the classification. In the next section (3.1), I describe the data and process to obtain the classification, and in the following section (3.2), I describe the process to create the base map from the classification. Collection) and restricted the data to the publication types "article" and "review", a total of about 17 million publications with about 380 million direct citation relations. In the remainder of this paper, I use the term "publication" to refer to both articles and reviews. Except for some modifications, I obtained the classification using the methodology put forward in Waltman and van Eck (2012) . In accordance with this methodology, direct citation relations were used to create a network. Citation relations were normalized in relation to each publication's total number of citation relations. The Leiden algorithm (Traag et al., 2019) was used to obtain a partitioning of the publications. 1 To get clusters of substantial size, I restricted the cluster size to a minimum of 50 publications by reclassifying publications in clusters below this minimum size using the method provided in the software. The resolution parameter was calibrated to obtain clusters of about the same size as in Sjögårde & Ahlgren (2018) for the corresponding publication years, resulting in 60,650 clusters. Thereby, clusters approximately correspond to research topics, and I refer to clusters at this level as such. Topics were clustered into larger groups based on their relatedness using the methodology in Sjögårde and Ahlgren (2020) and calibrating the resolution parameter to obtain clusters of about the same size for corresponding publication years as obtained in the referred article. Thereby, clusters approximately correspond to research specialties. A minimum threshold of at least 500 publications was used at this level, which resulted in 2,139 specialties. 2 The next level approximates the level of research disciplines. My intention was to create clusters of approximately the size of other broad classifications, such as Web of Science journal categories (about 250 clusters), Science Metrix journal classification (180 clusters at the 1 The process took about 8h 25min to run 100 iterations. 256 GB RAM was allocated for the process. The value of the quality function (CPM) was 0.427. The resolution parameter was set to 0.00010. Version 1.1.0 of the software was downloaded from https://github.com/CWTSLeiden/networkanalysis [2020-11-20] . 2 At aggregated level reclassification was performed by merging clusters below the threshold with the cluster above the threshold having the strongest relational strength. "subfields" level) and Scopus Subject areas (about 330 clusters). Because the classification only includes biomedicine I aimed for a smaller number of clusters. I tested several different values of the resolution parameter. Too low values merged specialties with seemingly weak relatedness into coarse clusters, while too high values resulted in many specialties being unmerged. I finally chose a solution with 169 disciplines, after restricting cluster sizes to at least 100,000 publications. Disciplines were grouped into 23 broader research areas. It is particularly difficult to obtain good labels at this level, because terms extracted from bibliographic fields tend to be too narrow. For this reason, the research areas are not displayed in the visualization. The research areas are used in the visualization only to keep sibling disciplines (the group of disciplines with the same parent) in proximity. To create labels, I used the procedure proposed in Sjögårde et al. (2021) . Noun phrases were extracted from article titles, medical subject headings (MeSH), journal titles and author addresses. A noun phrase was operationalized as a sequence of adjectives and nouns, ending with a noun (van . A Java-program was written for this purpose and the Stanford Core NLP software was used for data mining (Manning et al., 2014) , in particular the lemmatizer and the Part-Of-Speech tagger (Toutanova et al., 2003; the topic level (α=0.33 was used for the TFS calculation). I used article titles, MeSH and journal titles at the specialty level (α=0.5) and journal titles and author addresses at the discipline level (α=0.67). As a basis for the map, I created a network of disciplines. In the following step I created a network of underlying specialties for each discipline. This network was positioned to the discipline node. A list of topics was created for each specialty. This list is displayed when clicking a node. In the following I describe the steps in detail: A. Discipline level A1. I created a list of disciplines with the attributes shown in Table 1 . For the purpose of illustration, the table presents values for an example discipline. The size attribute was calculated as the square root of the number of publications. This makes the area of each node proportional to the number of publications. If a cluster contained no more than 500 publications, a hyperlink was created to the underlying publications ("Get list in PubMed"). If a cluster contained 501-5,000 publications a separate hyperlink was created for each batch of 500 publications (1-500, 501-1,000, etc). If more than 5,000 publications no hyperlinks were provided and instead the text "Too many publ." was shown. This was done because of restrictions in hyperlink length. The underlying specialties were listed in the column "Children". Labels and numbers of publications for the specialties were concatenated into a list. The interactive map contains hyperlinks to the underlying publications for each of the specialties (with the same restriction to 5,000 publications). Discipline nodes were colored according to their cluster at the top level (research areas). (Jacomy et al., 2014) was used to create a layout based on the average normalized direct citation value between the nodes (the same relatedness value used for clustering). 4 The algorithm resembles a physical system in which nodes repulse each other and edges attract the nodes. The magnitude of the attraction is relative to the weight of the edges. To emphasize the hierarchical structure, relations between sibling disciplines were scaled by a factor set to 3. Thereby, siblings were positioned closer to each other, which facilitated the interpretation of the network. Finally, x-and ycoordinates were supplied to the list of disciplines. B1. For each discipline I created a network of its children, with each child representing one of the underlying specialties of the disciplines. Edges between specialties were assigned a weight based on the average normalized direct citation value between specialties. The ForceAtlas algorithm was used to create a layout for each set of sibling specialties. 5 B2. Each network of sibling specialties was positioned in such a way that its midpoint was . The same attributes were calculated for specialties as for disciplines. Underlying topics are listed in the attribute "Children" in the case of specialties. B4. Sizes of the specialty nodes were rescaled by dividing with 2. This was done for better readability of the visualization. I visualized the networks using the sigma.js package created with the "SigmaExporter" plugin for the visualization software Gephi. 6 R was used to create networks and other files necessary for the visualization (a json network file, 7 a json configuration file 8 and an html-file). The file size of the full base map is large due to the high amount of data in hyperlinks. To In this section I demonstrate how the base map provides both overview and detail by visualizing the hierarchical structure of the classification which it has been built upon. I then present two cases that show how the map can be used to enrich the study of research activities: The base map shows distinctly colored groups of sibling disciplines. Figure 2 is a screen shot of the interactive map that is available online. 9 The largest groups revealed by the map are: cell and molecular biology, pharmacology and drug treatment, physiology, psychiatry, neuroscience and neurology, cardiovascular diseases, health professions, surgery, cancer diagnosis and treatment, infectious diseases, ecosystems and environmental medicine, biophysics and biochemistry, and immunology. The zooming feature is displayed in Figure 3 , showing specialties in "dermatology; melanoma; skin". The figure reveals major specialties addressing skin cancer, psoriasis, allergy, skin treatment and hair loss. By looking at the node sizes we can estimate the relative size of these fields. Skin cancer is the largest node, but almost the same size as psoriasis. Some nodes are about half the size of these large nodes, for example hair loss ("alopecia; alopecia areata; hair") and laser therapy ("laser therapy; laser; melasma"), while others are very small, for example nail conditions ("ingrown nails; yellow nail syndrome; ingrown toenail") and skin diseases related to HIV ("hiv infection; eosinophilia; eosinophilic pustular folliculitis"). Clicking on a specialty gives the user further information. This feature is exemplified in Individual specialties had even higher rates, for example "chromatin immunoprecipitation; dna sequence analysis; bioinformatics" (77.1%) and "cell polarity; protein; insect protein" (77.3%). Other areas show low open access rates, for example in biochemistry where "nanoparticle; pharmacy; technology" had a rate of 23.4%, with some specialties ranging from 10% to 20%. In more applied areas the open access rates generally ranged from about 30% to 40%, for example in "cardiology; atrial fibrillation; heart failure" (37.5%) and "nursing; education; public health" (30.0%). Zooming into the 2018-2020 map reveals substantial differences between specialties. Differences can also be explored at the topic level. For instance, in immunology ("immunology; microbiology; dendritic cell") the area labeled "allele; tissue antigen; hla", concerning the role of the human leukocyte antigen (HLA) complex in the immune system, had an open access rate of only 31.8%. When we click on this node, we find a major topic engaged in gene sequencing in the HLA complex with an open access rate of only 0.6%. Following the link to PubMed and scrutinizing the list of publications reveals a focus to one journal, named "HLA", which is not an open access journal and has a low level of hybrid open access publications. This is an example of how zooming in on a particular research field can provide a more detailed understanding of the data. To create a map of research related to the coronavirus pandemic that started in late 2019, I used the search query in Table 2 . The query has been designed by the library at Karolinska institutet to get publications both about the disease (Covid-19) and the virus causing the disease (SARS-CoV-2). suicide"), nursing and other health profession related research ("nursing; education; public health"), remote health care ("telecare; remote consultation; american telemedicine") and immunology ("immunology; microbiology; dendritic cell"). Zooming into the coronavirus node ( Figure 10 ) reveals some major topics addressing; mathematical models of the growth and spread of the disease ("basic reproduction number; serial interval; mathematical model"), effects on the cardiovascular system ("myocardial injury; troponin i; cardiac injury"), psychological impact of the pandemic and the measures taken to control the spread ("anxiety; depression; psychological impact"), testing ("serologic test; immunoassay; igm"), and neurological symptoms and effects ("olfaction disorders; nervous system disease; smell"). on the angiotensin receptor used by the virus to enter the lungs ("angiotensin; angiotensin receptor; angiotensin i" in "hypertension; vascular endothelium; blood pressure"), (4) immune response prediction models (the topic "sars cov 2; immunoinformatic approach; designing" in the specialty "peptide; bacteriophage m13; b lymphocyte epitope"), and (5) prediction of how molecules bind to receptors (the topic "function; molecular docking simulation; protein ligand docking" in the specialty "chemical information; quantitative structure activity relationship; modeling"). Gläser et al., 2017; Waltman et al., 2020) . My choice of citation relation, clustering algorithm, parameter values and labelling approach have been guided by both empirical support and practical considerations. My approach based on direct citations has the advantages to be efficient (having fewer relations than bibliographic coupling and cocitations) and to include relations to/from a large proportion of the publications in the data source, given that a comprehensive data source and a large time span are used. The direct citation approach has performed well in quantitative evaluations using large corpuses (Boyack & Klavans, 2020; Klavans & Boyack, 2017) . Also the resolution parameter values have been guided by previous research (Sjögårde & Ahlgren, 2018 , 2020 . Nonetheless, other representations may be equally valid and express other aspects of the research landscape. For example, co-citations may be a better choice if one wants to examine the historical development of a research field, and bibliographic coupling might be preferable to display related publications in an information retrieval setting. Labeling of the obtained clusters is a challenging task. Even though the used methodology works reasonably well, the subject orientation of clusters is sometimes hard to interpret using the cluster labels. Occasionally, other information needs to be considered by a user to understand the subject orientation of a cluster and to distinguish it from other clusters, for example additional key terms, sibling cluster labels, child cluster labels and consulting publication records in PubMed. Providing interactivity facilitates such interpretation. However, it remains unclear to what extent interpretation is a problem for users. Further work evaluating the interpretability of classifications from a user perspective is therefore needed. The visualizations that I have presented include clusters visualized as nodes at two levels. It is possible to visualize more levels, but with the risk to make the visualization more cluttered and harder to interpret. Functionality hiding nodes at granular levels when zooming out and showing nodes when zooming in might be an option to be able to include more levels in the visualization. However, including nodes at additional levels does not necessarily help users to read and interpret the visualization. Users might for example prefer reading lists at more granular levels. Therefore, user studies are needed to develop user friendly features and to make interactive overlay maps of science easier to interpret. The intention of this study has not been to evaluate normalization methods or layout algorithms. There might be better options to create layouts, in particular at the discipline level, both regarding normalization of citation relations and layout algorithm. I have emphasized the clusters by multiplying relations within disciplines by a constant factor (set to 3). This procedure puts disciplines in the same cluster in proximity and improves readability. However, it may distort relations outside the cluster. The choice of the ForceAtlas layout algorithm was guided by my experience visualizing a wide range of bibliometric networks (e.g., co-author networks, MeSH-networks, co-authoring organization networks and article citation networks) in bibliometric practice at Swedish universities. ForceAtlas is implemented in the visualization software Gephi, 16 which makes it possible to try out different parameter values to facilitate the readability of a particular network and more generally to learn how to use parameter values for different kinds of networks. In my experience, the ForceAtlas algorithm creates visualizations that are interpretable and make sense, and I have received positive feedback from users on visualizations created using this layout. Alternatives to ForceAtlas are for example the OpenOrd layout algorithm used by Boyack et al. (2020) , the VOS layout algorithm which is implemented in the VOSviewer software , the Fruchterman-Reingold layout algorithm (Fruchterman & Reingold, 1991) and the Kamada-Kawai layout algorithm (Kamada & Kawai, 1989) . The maps created in this study have been restricted to the biomedical sciences. During the last years the amount and proportion of available bibliographic metadata has increased substantially. Future work may be extended to other research fields. There are several technical issues related to the used visualization tool. For example, the current version of the used visualization package does not support smartphones and tablets, identification of areas of interest to a user could be facilitated by filters and improved search features, loading the visualization files is rather slow, and hyperlinks are provided in batches if a cluster includes more than 500 publications. The intention has not been to provide a perfect visualization tool, but rather to show how interactive visualizations of hierarchical classifications can provide users with enriched possibilities to explore the scientific literature. I have demonstrated that it is possible to provide maps of science that can give the user an overview of millions of publications and details down to individual publications. Such maps may constitute a valuable tool for researchers studying science, improve transparency to clusterbased citation normalization, support research management and policy making and constitute a tool for researchers to explore research of relevance to them. All maps and underlying data and configuration files are available online: Classification and labels Enhancing direct citations: A comparison of relatedness measures for community detection in a large set of PubMed publications Indicators in a research institute: A multi-level classification of scientific journals Visualizing knowledge domains Investigating the effect of global data on topic detection Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately Accurately identifying topics using text: Mapping PubMed. STI 2018 Conference Proceedings Mapping the backbone of science Clustering more than two million biomedical publications: Comparing the accuracies of nine text-based similarity approaches A detailed open access model of the PubMed literature Validation of the Astro dataset clustering solutions with external data Software survey: VOSviewer, a computer program for bibliometric mapping A comparison of two techniques for bibliometric mapping: Multidimensional scaling and VOS Graph drawing by force-directed placement. Software: Practice and Experience A new classification scheme of science fields and subfields designed for scientometric evaluation purposes Same data-different results? Towards a comparative approach to the identification of thematic structures in science Algorithmically generated subject categories based on citation relations: An empirical micro study using papers on overall water splitting Challenges to the validity of topic reconstruction The NIH Open Citation Collection: A public access, broad coverage resource ICite Database Snapshots (NIH Open Citation Collection). The NIH Figshare Archive ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software An algorithm for drawing general undirected graphs Patent overlay mapping: Visualizing technological distance Paradigms to assess the environmental impact of manufactured nanomaterials Using global mapping to create more accurate document-level maps of research fields Which type of citation analysis generates the most accurate taxonomy of scientific and technical knowledge Clusters and maps of science journals based on bi-connected graphs in the journal citation reports Can scientific journals be classified in terms of aggregated journal-journal citation relations using the Journal maps, interactive overlays, and the measurement of interdisciplinarity on the basis of Scopus data The modernity of classification The Stanford CoreNLP natural language processing toolkit A new technique for building maps of large scientific domains based on the cocitation of classes and categories Science mapping Science overlay maps: A new tool for research policy and library management Supporting priority setting in science using research funding landscapes (Report No. 1; RoRI Working Paper) Strategic intelligence on emerging technologies: Scientometric overlay mapping Granularity of algorithmically constructed publication-level classifications of research publications: Identification of topics Granularity of algorithmically constructed publication-level classifications of research publications: Identification of specialties Algorithmic labeling in hierarchical classifications of publications: Evaluation of bibliographic fields and term weighting approaches Classifications and concepts: Towards an elementary theory of knowledge interaction Clustering scientific publications based on citation relations: A systematic comparison of different methods China-US scientific collaboration in nanotechnology: Patterns and dynamics Feature-rich part-of-speech tagging with a cyclic dependency network Enriching the knowledge sources used in a maximum entropy part-of-speech tagger Visualizing Bibliometric Networks Automatic term identification for bibliometric mapping Comparison of topic extraction approaches and their results A principled methodology for comparing relatedness measures for clustering publications A new methodology for constructing a publication-level classification system of science Bibliometric delineation of scientific fields Springer Handbook of Science and Technology Indicators The author declares no competing interests.