key: cord-0057769-vg1zrtjv
authors: Cao, Qianqian; Zhao, Bo
title: Application of Open-Source Software in Knowledge Graph Construction
date: 2020-10-30
journal: e-Learning, e-Education, and Online Training
DOI: 10.1007/978-3-030-63955-6_9
sha: 734d18e40a254afe0224743b5b8e0a378f1261cd
doc_id: 57769
cord_uid: vg1zrtjv

Knowledge graph (KG), as a new type of knowledge representation, has gained much attention in knowledge engineering. However, it is difficult for researchers to construct a high-quality KG. Open-source software (OSS) has so far seen only limited use in knowledge graph construction (KGC), yet it provides an easier way for researchers to develop KGs quickly. In this work, we first briefly discuss the process of KGC and the techniques involved. This review also summarizes several OSSs available on the web, together with their main functions and features. We hope this work can provide a useful reference for knowledge graph construction.

Any specific definition and representation of knowledge based on a graph could be considered a knowledge graph. In this paper, a KG is a semantic graph consisting of vertices (or nodes) and edges, where the vertices represent concepts or entities and the edges represent relationships between the entities [4]. It is able to organize information in an easy-to-maintain, easy-to-understand and easy-to-use way, and it has become prevalent in education in recent years.

The process of knowledge graph construction includes three aspects: knowledge acquisition (KA), knowledge fusion (KF) and knowledge processing (KP), which are discussed below. KA is used to mine entities, entity attributes and relationships between entities from structured, semi-structured and even unstructured data sources. When knowledge is acquired, there are three corresponding processing methods for these three types of data.
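To make the vertex-and-edge view concrete, a KG can be sketched as a set of (head, relation, tail) triples. The following is a generic toy illustration in Python; the class name and example entities are hypothetical and not tied to any tool discussed in this review:

```python
# Minimal knowledge-graph sketch: facts stored as (head, relation, tail) triples.
from collections import defaultdict

class TinyKG:
    def __init__(self):
        self.triples = set()
        self.out_edges = defaultdict(set)   # head -> {(relation, tail)}

    def add(self, head, relation, tail):
        self.triples.add((head, relation, tail))
        self.out_edges[head].add((relation, tail))

    def neighbors(self, entity):
        """Entities directly connected to `entity` by an outgoing edge."""
        return {tail for _, tail in self.out_edges[entity]}

kg = TinyKG()
kg.add("Protege", "developed_by", "Stanford University")
kg.add("Protege", "used_for", "schema construction")

print(kg.neighbors("Protege"))
# e.g. {'Stanford University', 'schema construction'} (set order may vary)
```

Real systems add schema constraints, provenance and indexing on top of this triple structure, but the underlying data model is essentially the same.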
For structured data, D2R [5] (an XML-based mapping language) is used to map it to an RDF [6] schema [7] (a type of hierarchy used to define types and possible relations). RDF (Resource Description Framework) is generally used to formalize structured information and present it graphically. Semi-structured data (e.g. encyclopedic data, web data) can be acquired automatically using an unsupervised clustering algorithm. For unstructured data, information extraction techniques are mainly used to extract entities, entity attributes and relationships between entities. Open information extraction (OIE) [8] is a promising direction in information extraction. OIE is based on linguistic models and machine learning algorithms and is used to extract information from open domains; however, it may suffer from lower precision and lower recall. Through KA, entities, attributes and relations between entities are obtained from structured, semi-structured and unstructured data sources.

The results of knowledge acquisition may contain a great deal of redundant and erroneous information. KF focuses on re-cleaning and integrating the extracted results, for example by improving the connection density (the distance in the KG) of entities and relations and making entity relations present definite logic and levels. It mainly includes tasks such as entity linking and knowledge merging. Entity linking covers entity disambiguation and entity resolution. Entity disambiguation is used to disambiguate a polysemous entity mention or to infer that two different mentions refer to the same entity, while entity resolution is used to identify and link different manifestations of the same real-world entity across various records. Entities mentioned in an input text are then mapped to the corresponding entities in a target knowledge base. Thus, entity linking is used to link the entities extracted from semi-structured and unstructured data sources to the knowledge graph.
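As an illustration of the entity-resolution step described above, a minimal sketch can compare two mentions by normalized string similarity. This is a generic toy heuristic, not the algorithm of any tool reviewed here; the function name and threshold are assumptions:

```python
# Toy entity resolution: treat two mentions as the same real-world entity
# when their normalized string similarity exceeds a threshold.
from difflib import SequenceMatcher

def same_entity(mention_a, mention_b, threshold=0.85):
    a = mention_a.lower().strip()
    b = mention_b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(same_entity("Knowledge Graph", "knowledge  graph"))  # True
print(same_entity("Knowledge Graph", "Factor Graph"))      # False
```

Production entity linkers combine such surface similarity with context features and knowledge-base priors, since string overlap alone cannot disambiguate polysemous mentions.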
Knowledge merging is used to link the entities extracted from structured data sources (e.g. external knowledge bases or relational databases) to the knowledge graph. Through knowledge fusion, the ambiguity of concepts is resolved and redundant and erroneous concepts are removed.

The result produced by knowledge fusion is not yet a knowledge graph; knowledge processing is used to transform it into a more structured and networked knowledge architecture. KP mainly includes two aspects: schema construction and knowledge reasoning. Schema construction refers to designing the classes/concepts, relations, functions, axioms and instances of the knowledge graph. There are three construction methods: manual, automated and semi-automated. Knowledge reasoning aims at generating new knowledge from the existing knowledge in the knowledge graph through computational reasoning. There are three common ways to reason over a knowledge graph: logic-based reasoning, graph-based reasoning and deep learning-based reasoning. Through knowledge reasoning, the knowledge graph is further enriched and expanded.

The term "open-source" is derived from the computer software industry and can literally be interpreted as open source code. In this paper, we analyze and discuss the functions and characteristics of open-source software (OSS) in detail. The OSSs commonly used in knowledge graph construction are shown in Table 1.

Protégé [9] is an open-source tool developed by the Stanford Center for Biomedical Informatics Research, which is used for schema construction and knowledge reasoning (knowledge processing). It supports pluggable components (plug-ins) that can visualize knowledge and perform reasoning. It has a friendly graphical interface with a uniform style, strong interactivity, adaptability and operability.
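Logic-based reasoning, the first of the three reasoning styles above, can be illustrated with a minimal forward-chaining sketch that applies a transitivity rule until no new triples appear. This is a generic illustration under assumed entity names, not the mechanism of any tool discussed in this review:

```python
# Toy logic-based reasoning: derive new triples with a transitive rule,
# e.g. (A subclass_of B) and (B subclass_of C) => (A subclass_of C).
def infer_transitive(triples, relation="subclass_of"):
    facts = set(triples)
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for (a, r1, b) in list(facts):
            if r1 != relation:
                continue
            for (b2, r2, c) in list(facts):
                if r2 == relation and b2 == b and (a, relation, c) not in facts:
                    facts.add((a, relation, c))
                    changed = True
    return facts

kb = {("cat", "subclass_of", "mammal"), ("mammal", "subclass_of", "animal")}
closed = infer_transitive(kb)
print(("cat", "subclass_of", "animal") in closed)  # True
```

Computing such a closure is how reasoning "further enriches and expands" a KG: the derived triple was never asserted, only inferred from existing knowledge.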
Schema construction is the main function of Protégé, covering the construction of the concept classes, relationships, attributes and instances of the schema. It is designed to hide the underlying details (the schema description language) from users. It can also perform knowledge-based reasoning via the "Reasoner" button on the menu bar and provide users with detailed reasoning results and explanations.

KAON [10] is an open-source toolkit developed by the University of Karlsruhe and the Research Center for Information Technologies in Karlsruhe, often used for schema construction (knowledge processing). KAON implements schema construction through the KAON API, which provides mechanisms for storing and editing schemas and enables applications to access and process them. KAON's schema construction, storage, querying and retrieval operations are all performed on graphs, and this graph-based operation makes the construction process more intuitive and convenient than in Protégé.

DeepDive [11] is an open-source system developed by Stanford University's InfoLab, frequently used for information extraction (knowledge acquisition) and knowledge reasoning (knowledge processing). For knowledge acquisition, it uses a weakly supervised learning algorithm [12] to extract structured relations from unstructured data and then determines whether a specified relationship exists between entities. For knowledge processing, DeepDive solves knowledge reasoning problems with factor graphs, a type of probabilistic graphical model composed of variables and factors (which define the relationships between the variables in the graph). DeepDive uses probabilistic inference to estimate the probability that a piece of extracted knowledge is true, and then decides whether it should be retained.
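The idea of probabilistically scoring candidate facts can be caricatured with a drastically simplified sketch: combine independent evidence sources with a "noisy-or" and retain facts above a threshold. This is a hypothetical illustration of the general idea only, not DeepDive's actual factor-graph inference:

```python
# Toy "noisy-or" evidence combination for a candidate fact: each evidence
# source i asserts the fact with reliability p_i, and the combined
# probability is 1 - prod(1 - p_i). Retain the fact above a threshold.
def fact_probability(evidence_reliabilities):
    prob_all_wrong = 1.0
    for p in evidence_reliabilities:
        prob_all_wrong *= (1.0 - p)
    return 1.0 - prob_all_wrong

def retain(evidence_reliabilities, threshold=0.9):
    return fact_probability(evidence_reliabilities) >= threshold

print(round(fact_probability([0.6, 0.7]), 2))  # 0.88
print(retain([0.6, 0.7, 0.5]))                 # True: 1 - 0.4*0.3*0.5 = 0.94
```

A real factor graph also models dependencies *between* candidate facts, which is what distinguishes joint inference from the independent-evidence assumption made here.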
As a complete knowledge extraction framework, DeepDive provides users with application code and an inference engine; users only need to write the application code and the inference rules used during reasoning for their specific tasks [13].

KnowItAll [14] is another open-source system, developed by the Turing Center at the University of Washington, which is widely used in knowledge acquisition. It is trained with unsupervised learning [15] and domain-independent operations. It outperforms previous systems on tasks such as pattern learning and subclass extraction by using a search engine to extract from massive Web data. ReVerb and OLLIE are the first- and second-generation information extraction components of KnowItAll, respectively. ReVerb extracts binary relationships directly from English sentences without requiring the relations to be pre-specified. OLLIE extracts information based on the syntactic dependency tree; compared with the text-sequence-based ReVerb, it yields better extraction results and handles long-range dependencies better. The distinctive feature of KnowItAll is its use of a bootstrapping method that does not require any manually tagged training sentences [16].

GATE [17] (General Architecture for Text Engineering) is an open-source knowledge extraction system developed by the University of Sheffield, UK, which is capable of addressing almost all problems encountered when processing text. It can therefore be used for information extraction (knowledge acquisition) and entity disambiguation (knowledge fusion). For knowledge acquisition, GATE offers functions such as data processing, rule definition and entity disambiguation through its natural language processing components, which lay the foundation for information extraction. Each component has an open interface, so the components can be called freely by other systems.
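The binary-relation extraction performed by systems like ReVerb can be caricatured with a crude pattern-based sketch. The regex and relation phrases below are hypothetical toy choices; the real system uses syntactic and lexical constraints rather than a fixed pattern list:

```python
# Naive open-IE sketch: match "<NP> <relation phrase> <NP>" with a regex.
import re

# Toy pattern: a capitalized noun phrase, one of a few relation phrases,
# then an object phrase running up to the sentence-final period.
PATTERN = re.compile(
    r"([A-Z][\w ]*?)\s+(was founded in|is located in)\s+([\w ]+)\."
)

def extract_triples(sentence):
    return [(m.group(1), m.group(2), m.group(3))
            for m in PATTERN.finditer(sentence)]

print(extract_triples("The Turing Center is located in Seattle."))
# [('The Turing Center', 'is located in', 'Seattle')]
```

The contrast with this sketch makes ReVerb's contribution clear: it identifies relation phrases from the sentence itself instead of requiring them to be pre-specified, which is exactly what "without requiring the relations to be pre-specified" means above.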
Limes [18] is an open-source entity linking framework developed by the DICE (Data Science) group at Paderborn University. More specifically, it is a link discovery framework for metric spaces (a set together with a metric on that set), used in knowledge fusion. It uses statistics, prefix/suffix filtering and position filtering to calculate the similarity between entities, and mismatched entity pairs are filtered out. A specific characteristic of Limes is that it allows users to configure the entity resolution rules flexibly.

Dedupe [19] is an open-source Python library developed by Forest Gregg and Derek Eder, used for knowledge fusion. It uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on structured data. Specifically, users only need to label a small amount of data selected during the computation. All labeled data are clustered and grouped by a compound clustering method, and duplicate records are then removed based on similarity features [20] and machine learning models [21]. A salient feature of Dedupe is that it allows users to label training data with user-defined data types.

SOFIE [22] is an open-source knowledge processing system developed by the Max Planck Institute, which can also be used for information extraction (knowledge acquisition). For entity fusion, it can parse natural language documents, extract knowledge from them, link that knowledge into an existing schema and perform disambiguation based on logical reasoning. For knowledge acquisition, it processes a document by splitting it into short strings. SOFIE's main algorithm is completely source-independent, and its performance can be further boosted by customizing its rules for specific types of input corpora. With appropriate rules, SOFIE could potentially accommodate other IE paradigms within its unified framework.
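The blocking-plus-pairwise-comparison pattern underlying such deduplication tools can be sketched as follows. This is a hypothetical, far simpler stand-in for Dedupe's learned blocking and clustering, with an assumed record format and a crude first-letter blocking key:

```python
# Toy record deduplication: block records by the first letter of the name,
# then flag pairs within a block whose name similarity is high.
from difflib import SequenceMatcher
from itertools import combinations
from collections import defaultdict

def similarity(a, b):
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def find_duplicates(records, threshold=0.9):
    """records: list of (record_id, name) pairs."""
    blocks = defaultdict(list)
    for rid, name in records:
        blocks[name[0].lower()].append((rid, name))  # cheap blocking key
    pairs = []
    for block in blocks.values():
        for (id1, n1), (id2, n2) in combinations(block, 2):
            if similarity(n1, n2) >= threshold:
                pairs.append((id1, id2))
    return pairs

records = [(1, "Stanford University"), (2, "stanford university "), (3, "Sheffield")]
print(find_duplicates(records))  # [(1, 2)]
```

Blocking is the key scalability trick: comparing all record pairs is quadratic, so both Limes' filtering and Dedupe's learned blocking exist to avoid most comparisons before any expensive similarity computation.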
In this article, we have reviewed how knowledge graphs are constructed, and we have discussed several of the main OSSs available and their usage in knowledge graph construction. As previously discussed, there are many open-source tools for constructing knowledge graphs, and each has its own strengths: Dedupe is better suited to processing structured data, DeepDive performs better when extracting sophisticated relationships between entities, and SOFIE was the first to integrate information extraction with logical reasoning. To sum up, KGs have emerged as one of the most important tools for knowledge representation, storage and organization. We hope that more open-source software will be developed to provide researchers with more convenient ways to build, update, retrieve and maintain knowledge graphs.

References
The Semantic Knowledge Graph: a compact, auto-generated model for real-time traversal and ranking of any relationship within a domain
Variational reasoning for question answering with knowledge graph
Introducing the Knowledge Graph: things
A retrospective of knowledge graphs
D2R MAP: a database to RDF mapping language
Knowledge graph refinement: a survey of approaches and evaluation methods
Open information extraction for the web
The Protégé project: a look back and a look forward
KAON: towards a large scale semantic web
DeepDive: a data management system for automatic knowledge base construction
Weakly-supervised relation classification for information extraction
Large-scale extraction of gene interactions from full-text literature using DeepDive
Unsupervised named-entity extraction from the Web: an experimental study
Unsupervised learning of tree alignment models for information extraction
Web-scale information extraction in KnowItAll: (preliminary results)
Unsupervised feature selection using feature similarity
Wikipedia contributors: Machine learning
SOFIE: a self-organizing framework for information extraction

Acknowledgments. This research is supported by a National Natural Science Foundation of China project (No. 61967015) and a specific project of teacher education of the Yunnan Province education science planning (Union of higher education teachers) (GJZ171802).