A Challenge for Long-Term Knowledge Base Maintenance

CHRISTAN EARL GRANT and DAISY ZHE WANG, University of Florida

Categories and Subject Descriptors: H.2.4 [Systems]: Query Processing

General Terms: Design, Algorithms, Performance

Additional Key Words and Phrases: Knowledge base, probabilistic knowledge base, inference, entity resolution, databases

ACM Reference Format:
Christan Earl Grant and Daisy Zhe Wang. 2015. A challenge for long-term knowledge base maintenance. ACM J. Data Inf. Qual. 6, 2–3, Article 7 (June 2015), 3 pages. DOI: http://dx.doi.org/10.1145/2738044

1. INTRODUCTION

Knowledge bases (KBs) are repositories of interconnected facts paired with an inference engine. Companies are increasingly populating KBs with facts from disparate sources to create a central repository of information that provides users with a richer, more integrated experience [Herman and Delurey 2013]. Additionally, inference over the constructed KB can produce new facts not explicitly stated in the KB. Google now employs KBs to surface additional information for user search [Dong et al. 2014a]. Manually constructed KBs, such as YAGO [Hoffart et al. 2013] and DBpedia [Auer et al. 2007], are increasingly used as the gold standard and ground truth for newer KBs [Dong et al. 2014b]. However, the growing number of KBs inside an organization requires a sufficiently high level of quality, and each KB must be meticulously maintained. Both YAGO and DBpedia were constructed from Wikipedia data. Within Wikipedia, the median lag between the occurrence of a notable event and the addition of that event to Wikipedia was measured at 356 days [Frank et al. 2012]. This finding spurred many efforts to develop methods that automatically build, extend, and clean KBs [Frank et al. 2012; Ellis et al. 2012; Ji et al. 2014; Surdeanu and Ji 2014].
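The kind of inference described above, producing facts that are never explicitly stated, can be illustrated with a minimal forward-chaining sketch. The facts, relation names, and composition rules below are our own toy examples, not drawn from any cited system:

```python
# Toy KB of (subject, relation, object) triples; entities and relations
# are illustrative placeholders, not from any real system.
kb = {
    ("Gainesville", "locatedIn", "Florida"),
    ("Florida", "locatedIn", "USA"),
    ("UF", "headquarteredIn", "Gainesville"),
}

def infer(facts):
    """Forward-chain to a fixpoint with two hand-written rules:
    locatedIn is transitive, and headquarteredIn composes with
    locatedIn to place an organization in an enclosing region."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        snapshot = list(derived)
        for (a, r1, b) in snapshot:
            for (b2, r2, c) in snapshot:
                if b != b2 or r2 != "locatedIn":
                    continue
                if r1 == "locatedIn":
                    new = (a, "locatedIn", c)
                elif r1 == "headquarteredIn":
                    new = (a, "headquarteredIn", c)
                else:
                    continue
                if new not in derived:
                    derived.add(new)
                    changed = True
    return derived

facts = infer(kb)
# Derived facts such as ("UF", "headquarteredIn", "USA") were never
# stated explicitly -- and, as the article notes, a single wrong base
# fact would propagate through every rule application.
```

Even this two-rule sketch shows why inferred growth complicates maintenance: retracting one stale base fact invalidates a cascade of derived facts, which motivates tracking provenance.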
In these contests, teams build systems to explore the creation of Web-scale KBs; by and large, however, these contests stop short of designing systems for deployment in production. We believe that two main questions remain wholly understudied across research communities: in KBs, over time, (1) what stale information needs to be cleaned, and (2) when should this information be updated?

This work was partially supported by DARPA under FA8750-12-2-0348-2 (DEFT/CUBISM) and NSF Graduate Research Fellowship grant DGE-0802270.
Authors' address: C. E. Grant, E457-8 Computer and Information Science and Engineering Department, University of Florida, Gainesville, FL 32611; email: cgrant@cise.ufl.edu; D. Z. Wang, E456 Computer and Information Science and Engineering Department, University of Florida, Gainesville, FL 32611; email: daisyw@cise.ufl.edu.
© 2015 ACM 1936-1955/2015/06-ART7 $15.00. DOI: http://dx.doi.org/10.1145/2738044
ACM Journal of Data and Information Quality, Vol. 6, No. 2–3, Article 7, Publication date: June 2015.

In this article, we present a challenge to the information quality community: develop techniques for the long-term support and maintenance of critical, rapidly growing KBs. We follow this challenge with two notable papers that make strides in this direction, and we close this group of papers with a discussion of three research questions in response to the challenge.

2. RELATED WORK

Yahoo! recently released a description of WOO [Bellare et al. 2013], its internal system for managing entity resolution over the growing number of entities across the Web. As new information is ingested into WOO, the system uses a custom search engine to find candidate entities and enqueues them for possible updates. The WOO paper focuses on the synthesis of KBs from existing sources; it does not fully explore inference as a means of growing the KB. Growing KBs through inference over existing facts, although helpful, can introduce difficult errors and is mostly avoided by WOO. This type of KB expansion exacerbates the need for innovative quality control methods.

The Never-Ending Language Learner (NELL) continuously builds and expands knowledge bases through information extraction and inference [Carlson et al. 2010]. Part of NELL's growth has been the development of innovative techniques to continuously review and validate existing information. The challenge we pose is to investigate NELL-style systems inside enterprise KBs, where system management is critical.

3. CHALLENGE AND RESEARCH DIRECTION

We present a challenge for the information quality community: integrate the information quality pipeline into large-scale KBs. Solutions to the research challenges presented next will help KBs continue to grow rapidly while ensuring that they remain suited to an organization's business usage.
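As a deliberately simplified illustration of what integrating a quality pipeline into the KB update path might look like, the sketch below routes every new fact through a confidence gate and exposes a staleness check for audit scheduling. The class name, thresholds, and metadata fields are our own illustrative choices, not a system from the literature:

```python
import time

class QualityGatedKB:
    """Toy KB wrapper: every update passes a quality gate before commit.
    Thresholds and field names are illustrative assumptions."""

    def __init__(self, min_confidence=0.8, max_age_seconds=365 * 24 * 3600):
        self.facts = {}           # (subj, rel, obj) -> metadata
        self.review_queue = []    # low-confidence facts held for validation
        self.min_confidence = min_confidence
        self.max_age = max_age_seconds

    def upsert(self, fact, confidence, source, now=None):
        """Commit high-confidence facts; queue the rest for (human) review."""
        now = time.time() if now is None else now
        meta = {"confidence": confidence, "source": source, "ingested": now}
        if confidence >= self.min_confidence:
            self.facts[fact] = meta
        else:
            self.review_queue.append((fact, meta))

    def stale_facts(self, now=None):
        """Facts older than the freshness budget: candidates for an audit."""
        now = time.time() if now is None else now
        return [f for f, m in self.facts.items()
                if now - m["ingested"] > self.max_age]
```

The point of the sketch is the placement of the gate, not its policy: any of the research directions below (probabilistic thresholds, audit scheduling, incremental maintenance) could be slotted in behind the same update interface.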
There are three general areas in which we believe the information quality community can deliver immediate returns:

—Probabilistic KBs: Many organizations prefer that their active data stores contain only pristinely maintained information. Expanding KBs using inference can introduce many types of errors, so organizations are naturally averse to the more aggressive yet noisy growth strategies. One promising approach to this problem is to maintain the provenance and confidence of each fact and extraction [Wang et al. 2012]. Information cleaning over these probability-aware KBs can yield a powerful KB that lets users supply thresholds on the facts they trust.

—Scheduling quality audits: Aging KBs will inevitably contain information that expires, becomes invalid, or is simply proven inaccurate. Maintaining KB quality requires monitoring KBs for entities, relationships, and facts of questionable quality. Over time, such bad data becomes extremely elusive and hard to detect. Crowdsourcing (adding a human in the loop) to validate facts is a leading approach to ensuring clean facts in KBs. Although accurate, crowdsourcing is more expensive and slower than automated methods. An interesting direction is deciding how to schedule quality audits of an organization's KB under varying time, confidence, and probability budgets. An initial approach is to prioritize updates using the popularity of existing facts in addition to their computed uncertainty.

—Incremental KB maintenance: Speedy extraction of facts from a data source necessitates rapid, incremental methods for updating KBs. Incremental or streaming techniques require focused computation to meet demanding rates of change. However, for long streams of updates, simply storing results in memory is too expensive.
An interesting research direction is to investigate the trade-offs among online, batch, and query-driven techniques for computing KB updates, inferences, and validation.

REFERENCES

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a Web of open data. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC'07/ASWC'07). 722–735.

Kedar Bellare, Carlo Curino, Ashwin Machanavajihala, Peter Mika, Mandar Rahurkar, and Aamod Sane. 2013. WOO: A scalable and multi-tenant platform for continuous knowledge base synthesis. Proceedings of the VLDB Endowment 6, 11, 1114–1125.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In Proceedings of the 24th AAAI Conference on Artificial Intelligence.

Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014a. Knowledge vault: A Web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'14). ACM, New York, NY, 601–610.

Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Kevin Murphy, Shaohua Sun, and Wei Zhang. 2014b. From data to knowledge fusion. Proceedings of the VLDB Endowment 7, 10, 881–892.

Joe Ellis, Xuansong Li, Kira Griffitt, Stephanie M. Strassel, and Jonathan Wright. 2012. Linguistic resources for 2012 knowledge base population evaluations. In Proceedings of the Text Analysis Conference (TAC'12).

John R. Frank, Max Kleiman-Weiner, Daniel A. Roberts, Feng Niu, Ce Zhang, Christopher Ré, and Ian Soboroff. 2012. Building an entity-centric stream filtering test collection for TREC 2012. In Proceedings of the 21st Text Retrieval Conference (TREC'12).

Mark Herman and Michael Delurey. 2013. The Data Lake: Taking Big Data beyond the Cloud. Retrieved May 11, 2015, from http://www.boozallen.com/media/file/TA_DataLake.pdf.

Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum. 2013. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence 194, 28–61.

Heng Ji, Hoa Trang Dang, Joel Nothman, and Ben Hachey. 2014. Overview of TAC-KBP2014 entity discovery and linking tasks. In Proceedings of the Text Analysis Conference (TAC'14).

Mihai Surdeanu and Heng Ji. 2014. Overview of the English slot filling track at the TAC2014 knowledge base population evaluation. In Proceedings of the Text Analysis Conference (TAC'14).

Daisy Zhe Wang, Yang Chen, Sean Goldberg, Christan Grant, and Kun Li. 2012. Automatic knowledge base construction using probabilistic extraction, deductive reasoning, and human feedback. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-Scale Knowledge Extraction. 106–110.

Received November 2014; revised February 2015; accepted February 2015