key: cord-0479507-ffziazkb
authors: Heidari, Golsa; Ramadan, Ahmad; Stocker, Markus; Auer, Soren
title: Leveraging a Federation of Knowledge Graphs to Improve Faceted Search in Digital Libraries
date: 2021-07-05
journal: nan
DOI: nan
sha: 3ced93b92c468620a97263725ff3c3e8677cbcd4
doc_id: 479507
cord_uid: ffziazkb

Scientists look for the most accurate and relevant answers to their queries in the literature. Traditional scholarly digital libraries list documents in search results and are therefore unable to provide precise answers to search queries. In other words, search in digital libraries is metadata search and, if available, full-text search. We present a methodology for improving a faceted search system on structured content by leveraging a federation of scholarly knowledge graphs. We implemented the methodology on top of a scholarly knowledge graph. This search system can leverage content from third-party knowledge graphs to improve the exploration of scholarly content. A novelty of our approach is that we use dynamic facets on diverse data types, meaning that facets can change according to the user query. The user can also adjust the granularity of dynamic facets. An additional novelty is that we leverage third-party knowledge graphs to improve the exploration of scholarly knowledge.

A knowledge graph (KG) is a knowledge base that uses a graph-structured data model or topology to integrate data [3]. Knowledge graphs are often used to store interlinked information about entities with free-form semantics. In recent years, knowledge graphs have been presented and made publicly available in the scholarly field, in particular for bibliographic metadata, including information about entities such as publications, authors, and venues [4].

Scholarly knowledge graphs are knowledge bases for representing scholarly knowledge [11]. If scholarly knowledge graphs represent the key content published in papers, i.e., the addressed research problem, employed materials, methods, and obtained results, then accurate information can be retrieved from such graphs to satisfy user queries and questions. Given the rise of knowledge graph usage among scientists, it is likely that the way researchers search and explore data will move in that direction over the next few decades [9, 10].

One of the essential applications of scholarly knowledge is data retrieval. Various search systems have been implemented to help scientists explore accurate data. One example is faceted search, a high-efficiency search method with various applications. Faceted search augments traditional search systems with a faceted exploration system, allowing users to narrow down search results by applying multiple filters based on the classification of the properties [7]. A faceted classification system lists each knowledge component along various dimensions, called facets, so that the classifications can be reached and managed in multiple ways. Faceted search is widely implemented on bibliographic metadata. However, it cannot readily be implemented on data, i.e., the actual content of a paper, because this data is not structured properly.

Facets fall into two categories: static facets and dynamic facets [14]. Static facets take their values from a list of predefined values. They are useful for categories such as resource type that have a limited number of possible values [21]. In contrast, dynamic facets are flexible: the values for each facet category are derived from the values stored in the knowledge graph [6]. Once the system determines which values to display for each category, it shows the matching items accordingly. This means that facets are not fixed and are defined during the search [2].
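The distinction between the two categories can be illustrated with a minimal sketch. It is not the system's implementation; the result sets, property names, and the `dynamic_facets` helper are hypothetical and only show how facet values can be derived from the data returned for a query rather than from a fixed list.

```python
from collections import Counter, defaultdict

# Static facet: values come from a predefined list (hypothetical values).
STATIC_RESOURCE_TYPES = ["article", "book", "dataset"]


def dynamic_facets(results):
    """Derive facet names, values, and counts from the current result set."""
    facets = defaultdict(Counter)
    for item in results:
        for prop, value in item.items():
            facets[prop][value] += 1
    return {prop: dict(counts) for prop, counts in facets.items()}


# Hypothetical result sets for two different queries.
covid_results = [
    {"location": "Italy", "R0": 2.4},
    {"location": "China", "R0": 3.1},
    {"location": "Italy", "R0": 2.7},
]
math_results = [{"method": "proof by induction"}]

covid_facets = dynamic_facets(covid_results)
math_facets = dynamic_facets(math_results)
```

In this toy data, the facet names differ between the two queries because they are read from the results themselves, whereas `STATIC_RESOURCE_TYPES` stays the same for every query.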

The rest of the paper is organized as follows: Section 2 describes the background and related work; Section 3 presents our methodology, the proposed conceptual model, and the workflow for improving dynamic faceted search to explore data in federated knowledge graphs; Section 4 describes our implementation of the conceptual model in ORKG; Section 5 discusses our work and the challenges that we faced; Section 6 proposes some directions for future work; finally, Section 7 concludes the paper with a glance at future work.

Search Systems. Nowadays, many databases provide scholarly knowledge such as papers. Although faceted search is exceptionally beneficial for knowledge retrieval, search engines for the scholarly literature have used it almost exclusively at the level of metadata. In some disciplines, content in articles has also been described in a structured manner and search systems have been built on top of it, but this work is limited to one research field.

Google Scholar is a well-known example: it returns a huge number of results quickly, but most results do not precisely match the user's information need. Although it has a vast database, static facets are defined only on the publishing date, so support for refining queries is limited. Furthermore, it does not search the content of a paper; it offers solely a full-text search on the abstract part of a paper when the full text is available.

Publishers such as IEEE and Springer show better results via their search systems. Their search results are more accurate and, using facets, they can reduce a huge number of unwanted papers to a more relevant set. But their search systems still have limitations. The most prominent is that their databases are limited to their own publications, so a large number of results are missed. Moreover, while they offer faceted search, their facets are static and identical for all queries. The TIB portal is a meta catalogue, so it provides more relevant and more accurate answers to a search. However, the problem of static facets exists there as well.

Research on Knowledge Graphs and Search Systems. Most scientific discoveries depend on searching and re-using the results of former researchers. Although the metadata of publications has always been easily available, the content of a paper has remained inaccessible to exploration. Researchers have explored how developments in web technology might support such exploration by adding semantic enhancements to journal articles.

S. Fathalla et al. claim that research contributions must be transparent and comparable. They semantically described surveys for research fields and introduced a knowledge graph that defines the specific research problems, approaches, implementations and evaluations in a structured and comparable way. They offered an ontology to capture the content of survey papers [5]. D. Poole et al. worked on semantic science. They focused on having machine-accessible scientific theories that can be used in making data comparable [17].

Some researchers extend the current concept of nanopublications (small items of scientific results described in RDF) to expand their application range. Nanopublications have been introduced to make scientific results more findable [12, 15].

Y. Tzitzikas et al. introduced features and standards for surveying the products in the area of browsing and exploring RDF/S data sets. They introduced information requirements and structures. They provided a generalization of the main faceted exploration/browsing approaches using a small model including states and transitions between states [20].

Some researchers provide theoretical foundations for faceted search in the context of RDF-based knowledge graphs enhanced with OWL ontologies [1]. Others, in addition to implementing faceted search, proposed a ranking system to order facets and constrained the answer size, based on statistical properties of their data set, to avoid an excessive number of answers [13].

Shotton et al. published downloadable spreadsheets containing data from within tables and figures and enriched them with information from other articles. They published machine-readable RDF metadata both about the article and about the references it cites [19].

LINDASearch presents a middleware structure to provide information about some of the Linked Open Data projects such as DBpedia, GeoNames, LinkedGeoData, FOAF profiles, Global Health Observatory, Linked Movie Database (LinkedMDB) and World Bank Linked Data [18].

The next section briefly describes how a faceted search over scholarly knowledge supports granular refinement of search queries and can leverage federated knowledge graphs.

The main idea is to work on different data types to leverage faceted search systems on knowledge graphs. The scholarly knowledge graph that serves as the infrastructure of the faceted search system should contain not only the metadata of the publications, but also semantic, machine-readable descriptions of scholarly knowledge [16]. The knowledge graph therefore represents some of the content of a publication in a structured manner using inter-linked properties such as study date, study location, method, approach, and research problem. Figure 1 shows how some of the information contained in a scholarly article would be defined in a scholarly knowledge graph.

Our search system not only explores the exact data indicated in a paper but also processes some data to narrow down the search results by defining innovative facets. We treat each data type differently. For string data (i.e., properties that have strings for values), a user can select one or more values among all. This is also supported by an auto-complete feature to suggest candidate options. For properties such as method and approach, all methods used in the papers and all approaches related to them are proposed to the user and can be filtered. For numerical data, users may not only want to filter data by a distinct value but also by a range. Hence, different operators can be selected for the filtering process, specifically greater or smaller than a specific amount. Furthermore, a user can exclude values or even filter data for an interval. Similarly, operators can be applied for values of type date. In addition to including or excluding a date, a duration of a study can be selected as a valid filtering criterion. A date picker is activated on date properties so a user can easily select the date on a calendar.
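The type-specific filters described above can be sketched as follows. This is an illustration under assumptions, not the system's code: the row layout, the property names, and all function names are hypothetical.

```python
import operator
from datetime import date

# Comparison operators a user could pick for numeric or date facets.
OPERATORS = {
    "eq": operator.eq,
    "ne": operator.ne,  # exclude a value
    "gt": operator.gt,
    "lt": operator.lt,
}


def filter_by_operator(rows, prop, op, value):
    """Keep rows whose property compares to `value` under the chosen operator."""
    compare = OPERATORS[op]
    return [r for r in rows if prop in r and compare(r[prop], value)]


def filter_by_interval(rows, prop, low, high):
    """Keep rows whose property lies in the closed interval [low, high]."""
    return [r for r in rows if prop in r and low <= r[prop] <= high]


def filter_by_values(rows, prop, selected):
    """String facet: keep rows whose value is among the selected values."""
    return [r for r in rows if r.get(prop) in selected]


def autocomplete(rows, prop, prefix):
    """Suggest candidate string values that start with the typed prefix."""
    values = {r[prop] for r in rows if prop in r}
    return sorted(v for v in values if v.lower().startswith(prefix.lower()))


# Hypothetical comparison rows.
rows = [
    {"paper": "A", "method": "SEIR model", "R0": 2.4, "study_date": date(2020, 2, 1)},
    {"paper": "B", "method": "SIR model", "R0": 3.1, "study_date": date(2020, 3, 15)},
    {"paper": "C", "method": "Regression", "R0": 1.9, "study_date": date(2020, 5, 30)},
]

high_r0 = filter_by_operator(rows, "R0", "gt", 2.0)
spring = filter_by_interval(rows, "study_date", date(2020, 3, 1), date(2020, 6, 1))
suggestions = autocomplete(rows, "method", "s")
```

Because Python's comparison operators work on both numbers and `date` objects, the same two filter functions cover numeric and date facets in this sketch; a date picker would only change how the bounds are entered.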

To obtain smarter facets that better filter the search results for some data types, we need data from other knowledge graphs. This is the point at which exploration flows from one knowledge graph to another. For taxonomic data such as location, we search for the hierarchy in a related knowledge graph. Using an API, a third-party knowledge graph can be explored to find the hierarchy of that location. Once the hierarchy is obtained, exploration can be done at various levels of the taxonomy. In other words, facets can be defined at different levels.

Facets are defined not only on the metadata of a paper but also on the data, which is essential for each publication. Since facets are defined according to the semantic contribution descriptions for each paper, they are not static and differ for each query. They are defined dynamically according to the query, and their granularity level can be chosen by the user while querying. For instance, when looking for papers about Covid-19, one would find R0 (basic reproduction number) values as a facet. Such a facet would not appear when searching mathematics research contributions. As our focus is on achieving high-quality search on taxonomic data, these facets are defined at various granularity levels. For instance, Location can be explored at the continent level, region level, country level, city level, or even a compound level.

Our system supports such dynamic facets, which are inferred automatically from the respective data types and values. Facets can be different for each query, in contrast to other search systems, which use only a predefined set of static facets.

The Open Research Knowledge Graph (ORKG) is an online resource that semantically represents research contributions (from papers) in the form of an interconnected knowledge graph [16]. It provides machine-actionable access to scholarly literature that is usually written in prose [5], and enables the generation of tabular representations of contributions as comparisons. Given described papers and their research contributions, it is possible to compare the contributions addressing a specific problem across the scholarly literature. Figure 2 shows a comparison in ORKG. We implemented our faceted search system for ORKG comparisons.

Some research contribution descriptions in the ORKG are specified by predefined templates. These templates support the dynamic and automated construction of facets for ORKG comparisons. Facets are defined on the different properties in a comparison.

To illustrate how we can leverage other knowledge graphs, we use Geonames for the Location property. Each instance of the Location class in ORKG has a link to the corresponding resource in the Geonames knowledge graph. Querying Geonames is done via this link. According to its schema, the Geonames knowledge graph offers a variety of relations for the described resources. We are interested in the parent feature, which denotes the parent entity of any given entity (i.e., it exposes the hierarchy of locations in Geonames). We propose to implement the solution using an API request to find the hierarchy of the location. Figure 3 shows a subset of RDF triples from the Geonames representation of the City of Bonn entity, including the parent feature. By querying the Geonames graph, the hierarchy of locations can be discovered. After obtaining this hierarchy, the information can be leveraged in a faceted search system to support searching on broader locations and thus support a form of qualitative spatial reasoning. Figure 4 demonstrates the workflow between the ORKG and Geonames knowledge graphs.
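Following parent feature links upward can be sketched as below. The sketch makes no network request: the `PARENT_FEATURE` dictionary is a simplified, hypothetical stand-in for the parent links that a Geonames lookup would return (the real hierarchy contains additional administrative levels), and `location_hierarchy` is an illustrative name.

```python
# Toy stand-in for parent feature links obtained from a third-party
# knowledge graph. The entries are simplified and hypothetical.
PARENT_FEATURE = {
    "Bonn": "North Rhine-Westphalia",
    "North Rhine-Westphalia": "Germany",
    "Germany": "Europe",
    "Wuhan": "Hubei",
    "Hubei": "China",
    "China": "Asia",
}


def location_hierarchy(location, parent_of=PARENT_FEATURE):
    """Return the chain from a location up to its topmost known ancestor."""
    chain = [location]
    seen = {location}
    while chain[-1] in parent_of:
        parent = parent_of[chain[-1]]
        if parent in seen:  # guard against cyclic data
            break
        chain.append(parent)
        seen.add(parent)
    return chain


bonn_chain = location_hierarchy("Bonn")
```

Each element of the returned chain corresponds to one possible granularity level for a location facet, from the most specific place to the broadest one.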

For instance, if a user filters a contribution comparison for studies conducted in Europe (e.g., studies involving a European population or an ecosystem in Europe), our system checks the (RDF) description of each paper's study location in Geonames. If the hierarchy shows that the location has Europe among its parent features, the location is shown as a facet. If the user then chooses this facet, the matching contribution descriptions are displayed in the results. Therefore, a query for paper contribution descriptions that refer to a particular research method and have specific values within a specific duration in a particular region can easily be answered.

Figure 2 depicts an example of the faceted search performed on a COVID-19 contribution comparison, which consists of 31 papers. When a filter icon is selected, a dialogue box containing the relevant values for the property appears, enabling the user to choose some of the candidate values. When a filter is applied, the colour of the filter icon changes so that it is recognizable, and a tool-tip listing the selected values is displayed when hovering over the filter icon. Additionally, all applied filters are indicated clearly on top of the table. The results are directly reflected on the screen. Furthermore, the system provides the opportunity to save these configurations and the subset of retrieved data as a new comparison in the database, with a permanent URL that can be shared with other researchers and users. We provide a link to the system to enable independent testing and investigation. The code of the system is publicly available and documented on GitLab.
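The Europe example can be sketched as a filter over contribution descriptions. The data, the `PARENTS` table, and the function names are hypothetical and stand in for the descriptions stored in the scholarly knowledge graph and the hierarchy retrieved from the third-party graph.

```python
# Hypothetical parent links, as they might be retrieved from a third-party graph.
PARENTS = {
    "Bonn": "Germany",
    "Germany": "Europe",
    "Milan": "Italy",
    "Italy": "Europe",
    "Wuhan": "China",
    "China": "Asia",
}


def ancestors(location):
    """List all broader locations of `location`, nearest first."""
    result = []
    while location in PARENTS:
        location = PARENTS[location]
        result.append(location)
    return result


def filter_by_region(contributions, region):
    """Keep contributions whose study location is the region or lies within it."""
    return [
        c
        for c in contributions
        if c["location"] == region or region in ancestors(c["location"])
    ]


# Hypothetical contribution descriptions from a comparison.
contributions = [
    {"paper": "P1", "location": "Bonn"},
    {"paper": "P2", "location": "Milan"},
    {"paper": "P3", "location": "Wuhan"},
]

european = filter_by_region(contributions, "Europe")
```

The same function answers queries at any level of the hierarchy, for example `filter_by_region(contributions, "Italy")`, which is what makes the granularity of the location facet adjustable.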

Faceted search, as a search system, became popular with e-commerce services. During recent years, this search and exploration paradigm was increasingly used for developing scholarly knowledge databases, since it could better filter the search results and support the retrieval of more relevant data. It also improves data findability and reduces null-result searches. However, these benefits are not enough for a researcher who is looking for knowledge. We discuss next the key factors in evaluating a search system.

Precision matters. The problem with the search systems and knowledge graphs mentioned in the related work section is that, despite having a huge database, the data reported in a paper is not searchable. Therefore, scientists mostly do not obtain an accurate and relevant answer to their scientific queries. The key point is that search on structured content, rather than full text, is likely to result in higher precision. However, it also makes formulating queries more complicated.

Recall is essential. The few knowledge graphs with structured content have limited databases and struggle to satisfy recall (e.g., limited to a particular research field and missing potentially relevant work outside the particular field). Hence, relevant answers to a query may not appear in the results.

Moreover, facets are normally defined on the metadata of a publication. Only a few knowledge graphs, each with a limited database, define facets on the content of a paper. Even then, the facets are fixed and static and do not adapt to the user's query.

Because a scholarly knowledge graph describes papers in a structured manner, the content of each paper can be explored to discover the accurate data related to a search. As the number of contributions described in a knowledge graph increases, so does recall. Our faceted search system leverages a federation of knowledge graphs and defines facets dynamically according to the user's query, so the results of a query can be narrowed down to a precise set of answers.

Challenges. What made the problem of faceted search challenging for us are the following points:

- Knowledge graphs are heterogeneous by nature. Different knowledge graphs have different structures and are therefore not compatible with a rigid search system. Differing schemas and APIs make the exploration of federated systems even harder.

- Completeness matters. The more complete the database is, the more data can be discovered. Unfortunately, some well-structured systems suffer from an incomplete data source [8].

- Each paper can be related to one or more research fields. Therefore, finding the appropriate facet according to the user's search expression is challenging.

- Facets that are defined from data obtained from other knowledge graphs, e.g., location facets, can be resolved at two different points in time. The first option is during the search process: we run an API request when a user searches for a location. The advantage of this approach is that the data is current and there is no need to prepare data beforehand. The disadvantage is the increase in response time and the fragility with regard to network connectivity and service availability. The second option is to cache data from the second knowledge graph to allow for faster processing. An important advantage of this approach is better performance. We propose to implement the first approach in order to avoid caching unnecessary data.
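The trade-off in the last point can be made concrete with a small sketch that counts how often the remote service would be contacted under each option. The `fetch_parent_live` function is a hypothetical stand-in for a remote request and does not call any real service; the in-process cache via `functools.lru_cache` is one possible caching strategy, not the one evaluated in this work.

```python
from functools import lru_cache

# Counter for simulated remote requests.
CALLS = {"count": 0}


def fetch_parent_live(location):
    """Stand-in for a remote lookup; every call counts as one request."""
    CALLS["count"] += 1
    return {"Bonn": "Germany", "Germany": "Europe"}.get(location)


@lru_cache(maxsize=1024)
def fetch_parent_cached(location):
    """Same lookup, but repeated questions are answered from a local cache."""
    return fetch_parent_live(location)


# Option 2: cached lookups, three identical questions.
for _ in range(3):
    fetch_parent_cached("Bonn")
cached_calls = CALLS["count"]

# Option 1: live lookups, three identical questions.
CALLS["count"] = 0
for _ in range(3):
    fetch_parent_live("Bonn")
live_calls = CALLS["count"]
```

The live option issues one request per question and always sees current data, while the cached option issues fewer requests at the cost of possibly stale entries and additional stored data.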

6 Future Work.

For future work, we plan to evaluate the proposed approach in a user study with respect to precision and recall and, in particular, user friendliness. We also plan to leverage more knowledge graphs for even smarter faceted search. We suggest that smart faceting may be defined for numerous data types, e.g., taxonomies, units, space and time, and numeric ranges, which we briefly discuss next. Similarly to the approach described here with Geonames locations, for taxonomic data more generally we can leverage corresponding knowledge graphs to obtain hierarchies, e.g., about species, materials, chemicals, ecosystems, or languages. We also plan to integrate smart unit conversion. For example, if the user is looking for data in metres and the data in the knowledge graph is given in kilometres, an automatic conversion would be applied before processing and displaying the results.
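A minimal sketch of such a conversion step is given below. It covers only a few length units, and the conversion table, row layout, and function names are assumptions made for illustration.

```python
# Conversion factors to metres for a few length units (illustrative subset).
TO_METRES = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}


def convert_length(value, from_unit, to_unit):
    """Convert a length between two units listed in TO_METRES."""
    return value * TO_METRES[from_unit] / TO_METRES[to_unit]


def filter_greater_than(rows, threshold, threshold_unit):
    """Compare stored values with the user's threshold in the user's unit."""
    matches = []
    for row in rows:
        value = convert_length(row["value"], row["unit"], threshold_unit)
        if value > threshold:
            matches.append(row["paper"])
    return matches


# Hypothetical stored values with mixed units.
rows = [
    {"paper": "A", "value": 1.2, "unit": "km"},
    {"paper": "B", "value": 800, "unit": "m"},
    {"paper": "C", "value": 0.5, "unit": "km"},
]

over_700_m = filter_greater_than(rows, 700, "m")
```

Converting every stored value into the unit of the query before comparing keeps the filter logic itself unit-agnostic.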

Our focus here was on demonstrating how knowledge graphs can be leveraged to improve faceted search for the special case of qualitative spatial data. In future work, we will extend the approach to quantitative spatial data in order to enable users to filter by regions on a map and to support quantitative spatial reasoning in faceted search.

Also of interest is smart faceting on numeric ranges, such as confidence intervals (CI), and on types with well-defined boundaries, such as time intervals, pH, or temperature in kelvin. Smart faceting is aware of such constraints and prompts users accordingly with additional functionality (e.g., filtering by duration) or warnings (e.g., if a given value is invalid, such as -300 kelvin).
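Boundary-aware faceting could be sketched as a validation step that runs before a filter is applied. The `BOUNDS` table and the `validate` function are hypothetical, and the 0 to 14 range for pH is the conventional scale assumed here for illustration.

```python
# Valid ranges per value type; None means unbounded on that side.
# The pH range is the conventional 0-14 scale, assumed for illustration.
BOUNDS = {
    "kelvin": (0.0, None),
    "pH": (0.0, 14.0),
}


def validate(kind, value):
    """Return a warning message if `value` is out of range, else None."""
    low, high = BOUNDS[kind]
    if low is not None and value < low:
        return f"{value} is below the minimum {low} for {kind}"
    if high is not None and value > high:
        return f"{value} is above the maximum {high} for {kind}"
    return None


warning = validate("kelvin", -300)
```

A user interface could show the returned message as a warning next to the facet instead of running a filter that cannot match anything.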

Finally, we will explore applying ontologies for resolving the synonyms of the queries and defining facets according to them. For instance, if somebody is looking for the word covid, data using synonymous terms such as corona, covid-19, sars-cov-2, etc. should appear in results.
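The intended effect of synonym resolution can be sketched with a hard-coded synonym group standing in for an ontology lookup. The group contents follow the example above; the labels, `expand`, and `search` are hypothetical.

```python
# Hard-coded synonym groups standing in for an ontology lookup.
SYNONYMS = [
    {"covid", "covid-19", "corona", "sars-cov-2"},
]


def expand(term):
    """Return the query term together with its known synonyms."""
    term = term.lower()
    for group in SYNONYMS:
        if term in group:
            return set(group)
    return {term}


def search(labels, term):
    """Match labels that contain the term or any of its synonyms."""
    terms = expand(term)
    return [label for label in labels if any(t in label.lower() for t in terms)]


# Hypothetical labels of contribution descriptions.
labels = [
    "Estimating R0 of SARS-CoV-2",
    "Corona outbreak in Italy",
    "Graph colouring bounds",
]

hits = search(labels, "covid")
```

Neither matching label contains the literal query string; both are found only because the query is expanded before matching.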

Nowadays, knowledge graphs are central to the successful exploitation of knowledge available as a steadily growing amount of digital data on the web. Such technologies are essential to lift traditional search systems from keyword search to smart knowledge retrieval, which is crucial for obtaining the most relevant answers to a user query, especially in digital libraries. Despite improvements in scholarly search engines, traditional full-text search remains ineffective in many use cases. In this paper, we demonstrated a methodology for developing a faceted search system leveraging a federation of scholarly knowledge graphs. This search system can dynamically integrate content from further remote knowledge graphs to achieve a higher order of exploration usability on scholarly content, which can be matched and filtered to better satisfy user information needs. In future work, we will implement better support for various taxonomies and data types. In addition, we will work on integrating query expansion features for discovering abbreviations and synonyms of terms in a query to further improve dynamic faceted search.

[1] Faceted search over RDF-based knowledge graphs

[2] Minimum-effort driven dynamic faceted search in structured databases

[3] Towards a definition of knowledge graphs

[4] Enhancing the Microsoft Academic Knowledge Graph via author name disambiguation, publication classification, and embeddings

[5] Towards a knowledge graph representing research findings by semantifying survey articles

[6] Automatic facet generation and selection over knowledge graphs

[7] Semantic relatedness as an interfacet metric for facet selection over knowledge graphs

[8] Knowledge graphs on the web - an overview

[9] Smart papers: Dynamic publications on the blockchain

[10] Open research knowledge graph: next generation infrastructure for semantic scholarly knowledge

[11] Open research knowledge graph: A system walkthrough

[12] Broadening the scope of nanopublications

[13] Faceted search with object ranking and answer size constraints

[14] Dynamic faceted search for technical support exploiting induced knowledge

[15] Nano-publication in the e-science era

[16] Creating a scholarly knowledge graph from survey article tables

[17] Semantic science: Ontologies, data and probabilistic theories. In: Uncertainty Reasoning for the Semantic Web I

[18] LINDASearch: a faceted search system for linked open datasets

[19] Adventures in semantic publishing: exemplar semantic enhancements of a research article

[20] Faceted exploration of RDF/S datasets: a survey

[21] A survey of faceted search

Acknowledgements. This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and the TIB Leibniz Information Centre for Science and Technology. The authors would like to thank Mohamad Yaser Jaradeh for helpful comments.