title: A Study of the Quality of Wikidata
authors: Shenoy, Kartik; Ilievski, Filip; Garijo, Daniel; Schwabe, Daniel; Szekely, Pedro
date: 2021-07-01

Wikidata has been increasingly adopted by many communities for a wide variety of applications, which demand high-quality knowledge to deliver successful results. In this paper, we develop a framework to detect and analyze low-quality statements in Wikidata by shedding light on the current practices exercised by the community. We explore three indicators of data quality in Wikidata, based on: 1) community consensus on the currently recorded knowledge, assuming that statements that have been removed and not added back are implicitly agreed to be of low quality; 2) statements that have been deprecated; and 3) constraint violations in the data. We combine these indicators to detect low-quality statements, revealing challenges with duplicate entities, missing triples, violated type rules, and taxonomic distinctions. Our findings complement ongoing efforts by the Wikidata community to improve data quality, aiming to make it easier for users and editors to find and correct mistakes.

Historically, Wikipedia is the best-known knowledge base relying on the "wisdom of the crowd" (Surowiecki, 2004) to ensure its quality, setting an example for other popular websites such as Quora and Stack Exchange. Wikidata (Vrandečić and Krötzsch, 2014) has been created in a similar manner: editing it is fairly straightforward. Consequently, Wikidata today is a joint creation of tens of thousands of human and bot contributors (Piscopo and Simperl, 2018). The result is a rich set of factual statements that describe claims about entities and events in the real world. New information is entered every day, resulting in very high growth rates and immediate description of popular world events. Wikidata aims to allow a "plurality of facts" (Möller, Lehmann and Usbeck), and hence it is important that these facts are described with high-quality statements.

We have little understanding of the quality of the knowledge contained in Wikidata. Relatively simple validators can spot syntactic errors, allowing for automatic detection ("flagging" or editing) of syntactically anomalous statements (Beek, Rietveld, Bazoobandi, Wielemaker and Schlobach, 2014). Yet, capturing and correcting semantic errors is more challenging. While existing work has proposed an extensive set of quality notions (Piscopo and Simperl, 2019), and started to apply statement validation to Wikidata (Thornton, Solbrig, Stupp, Gayo, Mietchen, Prud'Hommeaux and Waagmeester, 2019; Piscopo and Simperl, 2018), to our knowledge, no past work has comprehensively applied indicators to measure the quality of statements in Wikidata as a whole, and provided a vision for improving its quality in the future.

In this paper, we develop a framework to detect and analyze low-quality statements in Wikidata by shedding light on the current practices exercised by the community. In addition, we propose to enhance the quality of Wikidata by automatically flagging potentially problematic statements for editors. Our work makes the following contributions:
1. We define three indicators that measure well-understood notions of quality of Wikidata statements, based on: 1) the statement revision history of Wikidata; 2) deprecation of statements; and 3) violations of property constraints defined by the community.
2. We develop an efficient framework that flags potential errors by integrating these three indicators of quality. Namely, the community-based indicators find low-quality statements which have been deleted or deprecated throughout the history of Wikidata (since its first available dump in 2014), while the constraint-based indicator reveals outliers with high constraint violation ratios.
3. We apply our framework to analyze the quality of the entire Wikidata. We report findings on key aspects of quality that affect users and editors, such as low-quality type statements, taxonomical modeling errors, duplicated nodes, and missing statements.
4. We propose recommended actions to interactively support high-quality contributions in the future, as well as to retroactively fix existing issues.

By doing so, we complement ongoing efforts by the Wikidata community to improve data quality based on games and suggestions, aiming to make it easier to prevent, find, and correct mistakes. Our quality indicators evaluate the degree of community consensus on what is acceptable, thus connecting to existing metrics of Wikidata quality, like accuracy, consistency, and veracity (Piscopo and Simperl, 2019). By analyzing statements which have been removed, we reflect on the accuracy of the data. By formulating and analyzing semantic rules (constraints) that statements must satisfy, we provide insights into the well-formedness and consistency of the data. The analysis of the deprecated statements addresses the veracity of claims, by indicating that there was once consensus about their veracity, but this is no longer the case. We make our code (Shenoy, Ilievski and Garijo, 2021b) and materials available to facilitate further work on analyzing the quality of Wikidata statements.

The rest of the paper is structured as follows. Section 2 introduces the three indicators, their formalization, and their combination into a joint framework. All of our findings with their supporting analyses are described in Section 3. Recommended actions can be found in Section 4. We relate to prior work on Wikidata quality in Section 5. The paper concludes in Section 6.

We seek to measure semantic quality aspects of Wikidata. We devise a framework for detecting low-quality statements in Wikidata, which combines three indicators of quality, based on: 1) community updates; 2) deprecated statements; and 3) property constraints. In this section, we describe each of the quality indicators and we provide details on their formalization into an integrated framework that can analyze the quality of Wikidata. We associate each statement with a quality value q, and formalize low quality as q = 0. Throughout this section, we use S to refer to a set of statements. A statement (s, p, o, Q) ∈ S consists of an edge subject s, predicate p, object o, and qualifier set Q. Qualifier sets contain property-value pairs (qp, qv) ∈ Q that further describe the tuple (s, p, o), e.g., with the date or the source of the assertion. Such statements are common building blocks of modern hyper-relational Knowledge Graphs (KGs), like Wikidata or YAGO (Tanon, Weikum and Suchanek, 2020).
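As a minimal illustration of this data model (the class and field names below are ours, not Wikidata's, and the identifiers are only illustrative), a hyper-relational statement could be represented as:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Statement:
    """A hyper-relational statement (s, p, o, Q): a subject-predicate-object
    edge plus a set of qualifier property-value pairs."""
    subject: str                                  # e.g., a QNode such as "Q42"
    predicate: str                                # e.g., a property such as "P31"
    obj: str                                      # a QNode, literal, or date
    qualifiers: Tuple[Tuple[str, str], ...] = ()  # (qp, qv) pairs describing (s, p, o)

# An illustrative statement with a point-in-time (P585) qualifier.
example = Statement("Q42", "P31", "Q5",
                    qualifiers=(("P585", "+2021-01-04T00:00:00Z"),))
```

Sets of such statements (S in the notation above) are what the three indicators operate on.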
Community-based indicator: We define a community-based indicator of KG quality by considering that the KG statements that have been permanently deleted by the community (i.e., statements deleted at a time point t_i and not restored at any later time point t_j, with t_j > t_i) are of low quality. Following the idea of the "wisdom of the crowd" (Surowiecki, 2004), we assume that community-based KGs, like Wikidata, are self-correcting over time, i.e., their contributors detect low-quality statements and either delete or replace them. However, the set of removed tuples by itself is neither necessary nor sufficient to indicate incorrect statements. A statement might be simply updated with a semantically equivalent one. Object values may be reassigned from one property or class to another, which might be considered more appropriate to express the relationship between the subject and the object. Literals may be updated with a new value that may or may not be semantically different from the original one. The latter case often corresponds to the adoption of new naming conventions, e.g., replacing the name "Pamela C Rasmussen" with "Pamela C. Rasmussen". To address these issues, we consider the low-quality (q = 0) statements of a dump at time t to be the union of: 1) the removed statements which were not updated (R_nu(t)), and 2) the removed statements which were updated with a significantly different value (R_su(t)). Formally, S(q = 0, t) = R_nu(t) ∪ R_su(t).

Deprecation-based indicator: Wikidata has a 'soft' alternative to deletions: deprecating statements to indicate consensus about the end of their validity. A statement is marked as deprecated in two cases: 1) if it has been superseded by another statement, or 2) if it is now known to be wrong, but was once thought correct. For example, the community agreed that Pluto ceased to be a planet as of 13 September 2006, and hence the claim stating that fact has been deprecated. Deprecated statements (Dep(t)) are valuable for studying the evolution of Wikidata and the agreement about its statements. However, they are undesired when using Wikidata in applications that require up-to-date information, like entity linking and question answering. Thus, we consider all deprecated statements of a dump at time t to be indicators of low quality, formally: S(q = 0, t) = Dep(t).

Constraint-based indicator: The Wikidata community has defined property constraints, i.e., rules that specify how properties should be used. Each property in Wikidata specifies the constraint types that apply to it. Statements expressed with that property can then either conform to the constraint or violate it. We denote the set of all violations in a Wikidata dump at time t with V(t). Constraints are split into three groups: mandatory, suggested, and normal (i.e., constraints which are neither mandatory nor suggested). Each constraint type is further specified per property, by stating additional elements: property-dependent classes, exceptions, and property paths. At present, Wikidata defines 30 types of property constraints. Constraints vary in nature, and range from format validation (e.g., correct dates or naming conventions) to ensuring a consistent usage of a property (e.g., making sure that symmetric properties are used in both directions). We provide examples for three key constraint types in Figure 1: the type constraint, the value type constraint, and the item-requires-statement constraint.
The Wikidata type and value type constraints restrict the domain (or range, respectively) of a property to one of the listed classes, further specified with exceptions and property paths. The item-requires-statement constraint dictates that a Wikidata item with one property should also specify another one.

Figure 1: Examples of the type, value type, and item-requires-statement constraints for the occupation property. All depicted constraints have a normal status.

Constraints may also specify exceptions. In Figure 1, the type constraint indicates that subjects that have an occupation have to be instances of one of the eight allowed classes, unless the subject is prescriber ("person legally empowered to write medical prescriptions", https://www.wikidata.org/wiki/Q99393050), whereas the value type constraint dictates that objects of occupation statements have to be either instances or subclasses of one of the six possible classes shown. The item-requires-statement constraint specifies that items which have an occupation value must also have an instance-of statement. All constraints presented in this figure have a normal status.

The constraint-based indicator considers the statements corresponding to property constraint violations to be low-quality statements. We denote the set of statements that violate a constraint with V. The set of low-quality (q = 0) statements according to this indicator is: S(q = 0, t) = V(t).

While the three indicators of quality have different foci, each of them identifies a set of low-quality statements, based on the removed statements (R), the deprecated statements (Dep), and the constraint violations (V) described in the previous section. In the rest of the paper, we analyze the low-quality statements identified by each indicator. We inspect deprecated and permanently deleted statements in Wikidata, we assess which constraints are violated, and we compare the violations with the deletions. In our experiments, we employ the Knowledge Graph ToolKit (KGTK) (Ilievski, Garijo, Chalupsky, Divvala, Yao, Rogers, Li, Liu, Singh, Schwabe and Szekely, 2020), which supports flexible and scalable imports of Wikidata and efficient manipulation of large hyper-relational KGs, which is essential for the analysis carried out by our quality framework.

Community-based indicator: We collected a dataset of Wikidata statements that have been permanently removed (i.e., removed and not added again) since the first available dump of Wikidata in October 2014. The dumps of Wikidata are released weekly. We generated this dataset by downloading all available weekly JSON Wikidata dumps from the Internet Archive (https://archive.org/search.php?query=wikidata), resulting in 311 dumps (approximately two years of dumps were missing from the Internet Archive, but we were able to retrieve them with the help of contributors from the Wikidata community); converting them to the KGTK format; and extracting statements that had been removed between each pair of successive dumps (D_{t_i}, D_{t_j}), where t_i < t_j. We also checked whether statements that had been removed before were present in the more recent of the two dumps, D_{t_j}. Formally:

A(t_i, t_j) = D_{t_j} ⧵ D_{t_i}, with t_i < t_j
Del(t_i, t_j) = D_{t_i} ⧵ D_{t_j}, with t_i < t_j
R(t_j) = (R(t_i) ⧵ A(t_i, t_j)) ∪ Del(t_i, t_j), with t_i < t_j

Here, A(t_i, t_j) and Del(t_i, t_j) represent the statements added and deleted between D_{t_i} and D_{t_j}, respectively. The operator ⧵ represents the difference between two sets, ∪ is the set union, and the total removed set for the first dump is R(t_0) = ∅.
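A minimal sketch of this recurrence, assuming each dump has been loaded as an in-memory Python set of statement tuples (our actual implementation operates on KGTK files rather than in-memory sets):

```python
from typing import List, Set, Tuple

StatementTuple = Tuple[str, str, str]  # (subject, predicate, object); qualifiers omitted

def total_removed(dumps: List[Set[StatementTuple]]) -> Set[StatementTuple]:
    """Compute R(t_n): statements that were deleted at some point and never re-added.

    `dumps` is the chronologically ordered list of dumps D_{t_0}, ..., D_{t_n}.
    """
    removed: Set[StatementTuple] = set()           # R(t_0) = empty set
    for earlier, later in zip(dumps, dumps[1:]):
        added = later - earlier                    # A(t_i, t_j) = D_{t_j} \ D_{t_i}
        deleted = earlier - later                  # Del(t_i, t_j) = D_{t_i} \ D_{t_j}
        removed = (removed - added) | deleted      # R(t_j) = (R(t_i) \ A) ∪ Del
    return removed
```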
After obtaining the full set of removed statements, we analyzed how many of the nodes had been redirected to new nodes (i.e., duplicate removal), and computed the distribution of classes and properties being removed. For literals, we investigated whether a value had been entirely removed or updated, by computing the similarity between the removed value and the new one. We analyzed the similarity for each literal type separately. For strings, we measured the Levenshtein distance between the removed and the updated text. For dates, we measured the time distance between the removed and the updated date. For quantities, we computed the difference in magnitude between the removed and the new quantity. We consider deleted statements with no update, as well as deleted statements with a notable update, to be of low quality (cf. Section 2.1).

Deprecation-based indicator: We consider all deprecated statements to be of low quality. Wikidata indicates deprecation through the rank qualifier of a statement. We retrieved all statements with a deprecated rank value in the early January 2021 version of Wikidata (the last dump we collected), and we explored their distribution in terms of entities and properties.

Constraint-based indicator: We consider statements that violate constraints to be of low quality, i.e., q = 0. We prioritized constraints that are common in Semantic Web research and cover a sufficient number of properties (e.g., type and value type). Wikidata has pages with constraint violation reports, which are calculated with an ad-hoc extension of Wikibase. However, it is unclear whether these reports are updated regularly. Given the size of Wikidata, validating its constraints with the Shapes Constraint Language (SHACL) or the Shape Expressions language (ShEx) is computationally prohibitive (Boneva, Dusart, Alvarez and Gayo, 2019). Moreover, it is unclear whether these languages can encode exceptions and allowed values in property constraints, and, to the best of our knowledge, there is no available implementation of SHACL/ShEx constraint validators for Wikidata. For this reason, we encoded each constraint type as a KGTK query template. Each template is instantiated once per property, allowing efficient validation of the properties in parallel. Constraint violations for a property are computed in two steps: we first obtain the set of statements that satisfy the constraint for the property, and then subtract this set from the overall set of statements for that property. We omit constraints defined on external identifier properties, as our aim is to capture semantic and modeling errors in Wikidata. An example query template is shown in Figure 2.

Figure 2: Example KGTK query template for the type constraint. The set of allowed parent classes for the subject is defined in expected_parents, whereas exceptions is the set of subjects for which the constraint is not required. The constraint is satisfied for a statement if its subject is an instance of a class in expected_parents or any of its subclasses, or if the subject belongs to the set of exceptions. Notably, for some properties, the instance of relation is replaced with a subclass of relation.
The query inspects whether the subjects for a property are instances of a class that is allowed by the constraint, or of any of its subclasses. If this is the case, or the subject is listed as an exception to the constraint, then the constraint is satisfied for this statement. Notably, for some properties, the instance of relation is replaced with a subclass of relation. A full example query for one property can be seen in Annex A.

Combination of indicators: Each quality indicator produces a set of statements. We compute the overlap between the deleted statements and the constraint violations as follows. We added all deleted statements to the Wikidata version where we computed the violations, and calculated the number of violations without and with the total removed statements, denoted |V| and |V_R|, respectively. The difference between the two yields the number of violations that were fixed by the removal of the statements (V_fixed). Formally: V_fixed = |V_R| − |V|.

Our framework indicators result in: 1) a dataset of 76.5M removed statements, describing 26.2M distinct subjects (Garijo and Szekely, 2021); 2) a dataset of 10M deprecated statements (Shenoy, Ilievski and Szekely, 2021c); and 3) a set of correct statements and constraint violations (Shenoy, Ilievski and Garijo, 2021a), according to the constraint types specified in Table 1. This table shows that most of the property constraints have a normal status, and that the median time to validate a property constraint over Wikidata ranges between 55 and 175 seconds for the five constraint types. This demonstrates the feasibility of our approach to validate Wikidata constraints at scale.

In this section, we highlight the main findings of our analysis by shedding light on complex issues related to KG quality, such as node redundancy, naming conventions, taxonomic distinctions, completeness, accuracy of constraints, and type consistency. We also explore whether constraint violations are getting corrected over time, thereby improving the overall quality of Wikidata. Specifically, we study the following eight research questions:

1. Are entities being deduplicated?
2. Can the community distinguish classes from instances?
3. Are naming conventions needed?
4. Are property types and value types respected?
5. Can we detect missing triples?
6. Are constraints correct and complete?
7. What statements get deprecated?
8. Are constraint violations getting fixed?

For each of these questions: 1) we motivate its relevance and impact on Wikidata; 2) we present our findings about its current state; and 3) we provide an in-depth analysis and representative examples. Based on these findings, we provide recommendations for improving the quality of Wikidata in Section 4.

Entity linking and deduplication are complex open research challenges in many KGs. Redirects are a common mechanism to deduplicate nodes, and are applied when a user recognizes that two nodes describe the same subject, e.g., Category:1911 in Morocco redirects from Q18511155 to Q9404406 (see https://www.wikidata.org/wiki/Help:Redirects). Our analysis reveals over 2 million redirected nodes, which affect over 20 million statements (26% of all removed statements). The relatively high number of redirects reflects Wikidata's dynamic nature and the community's pursuit of a high-quality, well-integrated graph. It is not known how many duplicate entities currently remain in Wikidata. 21.3 million statements (27.8% of the removed statements) have either a redirected subject or a redirected object.
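A minimal sketch of how such counts can be broken down by property, assuming the removed statements are available as (subject, property, object) tuples and the redirected QNodes as a mapping to their redirect targets (the input format here is only illustrative):

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def redirected_statement_counts(removed: Iterable[Tuple[str, str, str]],
                                redirects: Dict[str, str]) -> Counter:
    """Count, per property, the removed statements whose subject or object
    is a node that has been redirected to another node."""
    counts: Counter = Counter()
    for subject, prop, obj in removed:
        if subject in redirects or obj in redirects:
            counts[prop] += 1
    return counts

# redirected_statement_counts(removed, redirects).most_common(1) then surfaces
# the property with the most redirected items; in our data this is instance of (P31).
```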
We inspected the property containing the largest number of redirected items, instance of (P31), to understand what types of nodes have been redirected. Table 2 (top) shows the five classes with the highest number of redirected instances, which include well-populated classes in Wikidata like human, scholarly article, and gene. In addition, a portion of the instance of (P31) redirects are due to classes that have themselves been redirected. Table 2 (bottom) shows the five redirected classes with the highest number of member instances, which include encyclopedic article, village of Poland, and rotating variable star.

Table 2: Distribution of classes in redirected P31 statements. We show the 5 classes with the highest number of redirected instances, and the 5 classes that have been redirected themselves. The counts and percentages represent numbers of affected statements; the percentages are relative to the total redirected statements, not the total statements.

When adding new instances to Wikidata, contributors must specify descriptive values for the taxonomy relations instance of (P31) and subclass of (P279). Wikidata's fairly wide ontology (containing millions of classes), together with prior evidence on the difficulty of distinguishing between taxonomic relations in Wikidata (Piscopo and Simperl, 2018), raises the question: can the community distinguish classes from instances? Our analysis of removed statements with object properties reveals nearly half a million cases where one of the taxonomic relations has been changed to the other, which points to the fact that the community struggles to decide whether to use instance-of (P31) or subclass-of (P279) to model inheritance in Wikidata (see https://www.wikidata.org/wiki/Wikidata:WikiProject_Ontology/Problems). Drilling down, we see that in 44 thousand cases, the instance of statement was replaced with a subclass of statement. In the case of former P279 edges, the number of taxonomic switches is notably larger: nearly half (444k out of 935k) of the P279 edges were replaced by a P31 edge only. Illustrative examples in Table 3 indicate that these switches often happen in cases where it is not trivial to distinguish between the two taxonomic relations. For example, the community struggles to specify the membership of laboratory centrifuge as laboratory equipment: a former instance of relation has been replaced with a subclass of one. Conversely, the Chemical Markup Language used to be specified as a subclass of markup language, but this has been corrected into an instance of relation. In both cases, the updated relation seems more intuitive, which, in line with the "wisdom of the crowd" assumption, would indicate that switches between the two relations largely reflect fixes of prior modeling errors.

Table 3: Community updates of instance-of (P31) and subclass-of (P279).

To our knowledge, Wikidata does not prescribe how to encode strings, though there are guidelines for dates (https://www.wikidata.org/wiki/Help:Dates). We performed an analysis to investigate the proportion of updates for both strings and dates, in order to study current practices and possible oscillations between different semantically equivalent values. Our analysis reveals that the community has already performed millions of updates between semantically (nearly) equivalent forms of literals. In particular, we observe that in the majority of cases (61.5% of all removed dates), the date was replaced with a semantically equivalent date with a different surface form. An example is the year 1964, modified from "000000001964-00-00T00:00:00Z/9" to "1964-00-00T00:00:00Z/9". When it comes to removed string statements, we observe that 46% of them (14 million) have been replaced with new values. The distribution of the Levenshtein distances between the old and the new string values is shown in Figure 3.
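A minimal, self-contained sketch of the string comparison (a standard dynamic-programming edit distance; any Levenshtein implementation could be substituted):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

# Example from the removed statements discussed above:
assert levenshtein("Pamela C Rasmussen", "Pamela C. Rasmussen") == 1
```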
We observe that strings with low Levenshtein distances are typically stylistic updates, e.g., from "Pamela C Rasmussen" to "Pamela C. Rasmussen". Among the strings with a medium Levenshtein distance (of 10), we see updates which are meant as specifications and can also be interpreted as mere stylistic adaptations, such as the update of "Hiroshima EAST BLD" to "Hiroshima East Building". The strings with a large distance (of 20) are generally different from the original strings, such as the update of "Meredith Boyle Metzger" to "Susan Michaelis".

Type and value type constraints are similar to the domain and range constraints in Semantic Web languages like OWL, and are covered in resources like YAGO (Tanon et al., 2020) and VerbNet (Schuler, 2005). Many properties in Wikidata have associated type and value type constraints, as shown in Table 1. Have these constraints been respected by the data? We observe that only a small portion of the statements violate the mandatory constraints, while a much larger portion violate the suggested constraints. While the violations are largely concentrated around a small set of properties and could in theory be fixed, it is unclear whether this is desired, as the suggested status implies that they might not need to be strictly enforced.

Table 4: Correct (constraint-satisfying) and incorrect (constraint-violating) statements for the five constraint types analyzed in this paper: type (Q21503250), value type (Q21510865), item requires statement (Q21503247), inverse (Q21510855), and symmetric (Q21510862). The violation ratio (VR) is the percentage of incorrect statements in the total set of statements in a given category. We separate the statistics among (M)andatory, (N)ormal, and (S)uggested constraints.

As shown by the violation ratios in Table 4 (rows 1 and 2), only a small portion of the mandatory type and value type constraints are violated (0.08% and 0.03%, respectively). The proportion of violations is larger for normal constraints, which represent the majority (0.76% and 0.65%, respectively). The violation ratio is the highest for the suggested constraints, where as many as 20% of the statements were found to violate type constraints. This might be expected, as the suggested status implies less strict semantics than the mandatory one. This analysis entails that fixing the current type and value type violations would require nearly 44 thousand edits for the mandatory constraints, and 4.7 million edits for the normal and suggested constraints. Figure 4 shows a Zipfian distribution of the violation ratios for the properties that have type and value type constraints, i.e., most violations are concentrated around a few properties.
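To make the type-constraint check and the per-property violation ratio concrete, a minimal in-memory sketch (the function names and inputs are ours; our actual implementation runs the KGTK query templates over the full graph):

```python
from typing import Dict, Iterable, Set, Tuple

def subclass_closure(classes: Set[str], subclasses: Dict[str, Set[str]]) -> Set[str]:
    """The given classes plus all of their (transitive) subclasses.
    `subclasses` maps a class to its direct subclasses (from P279 edges)."""
    closure, frontier = set(classes), list(classes)
    while frontier:
        for child in subclasses.get(frontier.pop(), set()):
            if child not in closure:
                closure.add(child)
                frontier.append(child)
    return closure

def type_violation_ratio(statements: Iterable[Tuple[str, str, str]],
                         prop: str,
                         allowed: Set[str],
                         exceptions: Set[str],
                         instance_of: Dict[str, Set[str]],
                         subclasses: Dict[str, Set[str]]) -> float:
    """Fraction of statements with property `prop` whose subject is neither an
    instance of an allowed class (or of one of its subclasses) nor an exception."""
    allowed_closure = subclass_closure(allowed, subclasses)
    total = violations = 0
    for subject, p, _obj in statements:
        if p != prop:
            continue
        total += 1
        if subject in exceptions:
            continue                                   # explicitly exempted subject
        if instance_of.get(subject, set()) & allowed_closure:
            continue                                   # constraint satisfied
        violations += 1
    return violations / total if total else 0.0
```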
It is well known that broad-coverage KGs are inherently incomplete (Dong, Gabrilovich, Heitz, Horn, Lao, Murphy, Strohmann, Sun and Zhang, 2014). This incompleteness can be partially addressed through the property constraints item-requires-statement (IRS), inverse, and symmetric. These constraints point to a missing triple about the same entity, a missing triple with an inverse property, and a missing triple with a symmetric property, respectively. For example, IRS dictates that entities that have an occupation property must also have a statement with the instance of property. We investigate to which extent these constraints have been followed by the statements in Wikidata. As shown in Table 4, the mandatory constraints for these constraint types reveal nearly a thousand violations, which may indicate missing triples. The situation worsens for normal and suggested constraints, whose enforcement would lead to millions of potentially missing triples. While fixing symmetric and inverse constraints is programmatically trivial, it is unclear whether this is always desired, as the constraint violations may be caused by an incorrect original statement rather than a missing one. For example, if a spouse link exists from entity A to entity B but not from B to A, the violation could be fixed either by adding the missing symmetric statement or by removing the original, possibly incorrect, one.

Table 4 (rows 3-5) illustrates how mandatory IRS and inverse constraints are largely followed (with only 0.02% and 1.9% violations, respectively). As expected, the violation ratios are larger for normal constraints, and largest for suggested constraints, peaking at 8% for the suggested IRS constraints. Table 5 shows examples of properties with the highest violation ratios. For instance, the property votes received (P1111) requires other properties like office contested (P541) to be present, which is violated in all 46k cases where it appears. The inverse statement for the properties has natural reservoir (P1605) and stepparent (P3448) is missing in nearly all cases, resulting in five thousand violations. The most commonly violated symmetric properties include Sandbox-Lexeme (P5188), together with (P1706), and scheduled service destination (P521), resulting in around 1,500 violations in total.

If the constraints are to be used as a driving force to improve the quality of Wikidata, it is important that they are correct and complete. As shown in Table 4, the majority of the constraints fit the data, which can be seen as an indicator that the constraints are of good quality. Yet, we note that across all constraint types, a small portion of the constraints yields a large portion of the violations. The head of the distribution in Figure 4 reveals properties whose constraint definitions are outdated. Table 5 lists those property constraints with large (nearly 100%) violation ratios, which may point to discrepancies between the constraints and the underlying data. For example, towards (P5051) expects subjects to be instances of transport stop (Q548662), which is violated for all of its 64 instances. 28 of these instances have the type vein (Q9609) (e.g., external jugular vein (Q2512768)), and use the towards property to indicate the direction of blood flow of a vein in the human body (e.g., the subclavian vein is oriented towards the brachiocephalic vein). In this case, rather than fixing each statement with a constraint violation manually, one could generalize the constraint, i.e., enhance the type constraint for the towards property to allow for instances of vein.

We investigate whether deprecated statements, as a soft alternative to deletions, reveal different behavior compared to removed statements. Among the 10 million statements with deprecated rank in Wikidata, we observe that many belong to the domain of Astronomy. This indicates that the decision between removing and deprecating a statement largely depends on the community and the domain. Specifically, we found 10,040,256 deprecated statements.
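A minimal sketch of how deprecated statements can be extracted, assuming the standard Wikidata JSON dump layout (one JSON entity per line inside an array; each claim carries a rank field):

```python
import json
from typing import Iterator, Tuple

def deprecated_statements(dump_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (entity id, property) pairs for every claim with rank 'deprecated'."""
    with open(dump_path, encoding="utf-8") as dump:
        for line in dump:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue                                   # skip the array brackets
            entity = json.loads(line)
            for prop, claims in entity.get("claims", {}).items():
                for claim in claims:
                    if claim.get("rank") == "deprecated":
                        yield entity["id"], prop

# Counting the yielded properties with collections.Counter then gives the
# per-property distribution of deprecations (cf. Table 6).
```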
The top-5 properties in deprecated statements are shown in Table 6. We observe that all frequently deprecated properties (e.g., proper motion) belong to the domain of Astronomy, and that a large portion of the overall deprecations (around 90%) is expressed with these five properties. In addition, we observe that the deprecated instance of statements describe the membership of celestial objects, like infrared source, star, and galaxy.

Our analysis reveals that Wikidata has millions of deleted statements and constraint violations. Do these two sets overlap? We observe that many of the removed statements violated a constraint, i.e., many of the removals coincide with former violations, thereby improving the quality of Wikidata over time. Specifically, out of the 2.31 million removed statements for which a mandatory type constraint is defined, a third violated that constraint (Table 7). Most of the former violations correspond to normal and suggested constraints. Overall, we observe that the removed statements fixed millions of constraint violations, including 6 million type violations and 7.5 million symmetric violations. Notably, constraint violations could also have been fixed or introduced through the addition (rather than the removal) of statements; we do not consider additions in this work, which is a limitation of our current analysis.

The knowledge in Wikidata is relatively reliable in comparison to other general-domain KGs (Färber, Bartscherer, Menne and Rettinger, 2018). Yet, our analysis reveals a variety of quality aspects of Wikidata that can be improved going forward. Based on our findings, we propose several recommended actions to include in the interactive contributing environment of Wikidata. These recommendations are intended to prevent low-quality statements from being added, as fixing them later might take a large number of edits. The recommendations can complement ongoing efforts by the Wikidata community to improve data quality based on games and suggestions, aiming to make it easier for users and editors to find and correct mistakes.

Integrate entity linking: To prevent introducing duplicate nodes, it would be beneficial to provide suggestions for similar entities when these exist. For instance, if the user is introducing a basketball player named "Michael Jordan" who played for the Chicago Bulls, the environment should inform the user that a similar item is already present in Wikidata (with id Q41421).

Prevent type and value type violations: When an editor introduces a new entity, its type should be coherent with the type and value type constraints of its properties. When this is not the case, the editor should be warned about a possible violation. Instead of adapting each new statement, the editor may opt to suggest adapting the constraints themselves.

Introduce format guidelines for strings: Our analysis showed that a large portion of the literal updates transform the literal between two semantically equivalent forms. We propose having more precise formatting guidelines for strings, aiming to adopt consistent naming conventions. For instance, a guideline for initials of human names may dictate including a letter and a dot ("Pamela C. Rasmussen" rather than "Pamela C Rasmussen").

Complement missing data: Wikidata's interactive editing environment should propose that the editor makes complete edits, i.e., edits that satisfy the constraints of the affected properties.
One way to achieve this would be to suggest that the edits satisfy the constraints of the types item-requires-statement, symmetric, and inverse, by either adding the full set of statements that satisfy the constraint, or removing the one violating it. A complementary idea is to include a link prediction method, like HINGE (Rosso, Yang and Cudré-Mauroux, 2020) or StarE (Galkin, Trivedi, Maheshwari, Usbeck and Lehmann, 2020), in order to suggest missing statements based on probabilistic graph patterns.

Fix statements retroactively: Given the large number of existing constraint violations, it is important to help the Wikidata community fix them. One possibility is to leverage Wikidata's Distributed games approach (https://wikidata-game.toolforge.org/distributed/#) and create games to help editors efficiently validate and fix the constraints. A good starting point for this are the property constraints with large violation ratios, which were detected through our analysis in Table 5 and Figure 4. An alternative approach, based on our finding in Section 3.6, is to fix violations automatically, with the expectation that there will be fewer violations after the automatic fixes, and that it would be more efficient to fix the errors introduced by the automatic fixes than the original ones. Another option is to employ methods that automatically detect errors in KGs (Yao and Barbosa, 2021).

The quality of Knowledge Graphs has been studied in the existing literature. Chen, Cao, Chen and Ding (2019) proposed a framework for evaluating the quality of KGs, consisting of dimensions that quantify their fitness for downstream applications. Similarly, quality metrics from 28 prior papers are surveyed by Piscopo and Simperl (2019), and grouped into three dimensions: intrinsic (i.e., accuracy, trustworthiness, and consistency of entities), contextual (i.e., completeness and timeliness of resources), and representation (i.e., understanding and interoperability of entities). Our quality indicators are orthogonal to these metrics, as we consider the consensus of the community for them. In addition, our methods go further by proposing an approach to efficiently evaluate some of the metrics proposed by Piscopo and Simperl (2019). Many of the metrics proposed by Piscopo and Simperl (2019) are covered by Färber et al. (2018), who compare the quality of modern KGs: Wikidata, YAGO, DBpedia, Freebase, and OpenCyc. Piscopo and Simperl (2018) evaluated the quality of Wikidata from an ontological perspective, using indicators related to quantitative measures of classes and instances (e.g., number of instances and number of properties) and of the richness of classes, relations, and properties (e.g., inheritance richness and class hierarchy depth). Prior work has also investigated whether the quality of a knowledge statement in Wikidata depends on the engagement of its editor (leader or contributor) (Piscopo, Phethean and Simperl, 2017b; Piscopo and Simperl, 2018), or on the knowledge provenance indicated through the references of a statement (Piscopo, Kaffee, Phethean and Simperl, 2017a). Instead, our work performs a systematic analysis of constraint violations, and assesses whether the removal of statements by the community reduces violations.

Wikidata includes several tools that monitor, analyze, and enforce aspects of quality. The primary sources tool (PST) facilitates a curation workflow for uploading data into Wikidata.
The Objective Revision Evaluation Service (ORES) scores revisions automatically, aiming to detect edits that represent vandalism. Recoin ("Relative Completeness Indicator") (Balaraman, Razniewski and Nutt, 2018) extends Wikidata entity pages with information about the relative completeness of the information. Relative completeness is computed by comparing the available information for an entity against other, similar entities. Property constraint pages define the existing property constraints and report the number of violations for a single dump. Our analysis complements the constraint violations reported by Wikidata's pages, by providing in-depth insights about these violations, and abstracting them into findings and recommendations.

Recently, Wikidata has started moving beyond individual property constraints, representing a higher-level notion of quality in the form of shapes that are meant to provide norms of well-formedness for sub-graphs describing concepts of interest (Thornton et al., 2019), e.g., human. These shapes are collected as Schemas. Each schema defines the desired sub-graph topology describing a given concept, using ShEx shape expressions (Thornton et al., 2019). Schemas are defined through consensus among specific communities (e.g., molecular biology, software engineering, etc.) interested in standardizing concepts relevant to them. We have not addressed the analysis of Wikidata at this level of abstraction, but the approach described in this work can be naturally extended in this direction. A similar observation can be made about prior work that encodes Wikidata constraints based on multi-attributed relational structures (MARS) (Patel-Schneider and Martin, 2020), a formal data model for generalized property graphs devised by Marx, Krötzsch and Thost (2017).

Recognizing the complexity of the class and type hierarchy in Wikidata, the authors of YAGO4 hand-crafted a schema for it, expressing its constraints in SHACL (https://www.w3.org/TR/shacl/) rather than in OWL (McGuinness and van Harmelen, 2004), and ran scripts to synthesize YAGO by ingesting the data from Wikidata and processing the SHACL expressions. YAGO4 defines constraints on domain and range, disjointness, functionality, and cardinality. The authors report that enforcing these constraints leads to a removal of 132M statements from Wikidata, i.e., 28% of all facts. The constraints defined by YAGO4 overlap partially with the constraints in Wikidata studied in this paper. Subsequent work should compare the findings from validating constraints in YAGO4 and Wikidata, and it should generalize the in-depth analysis done in this paper to other KGs like YAGO4.

Rashid, Torchiano, Rizzo, Mihindukulasooriya and Corcho (2019) investigated the evolution of 10 classes from DBpedia over 11 of its releases, measuring aspects of persistence, consistency, and completeness. This effort resembles our community-based indicator, but it reports an analysis over a small data subset, a smaller knowledge graph, and fewer dumps. The goal of Rashid et al. (2019) is to identify potential problems in the data processing pipeline, which is orthogonal to our goal of detecting low-quality statements in the knowledge graph itself.

Other work has focused on data validation in KGs. The LOD Laundromat (Beek et al., 2014) is a large-scale infrastructure that can validate and clean syntactic errors that do not fit the formal specification of RDF, such as bad encoding, undefined URI prefixes, and premature end-of-file markers.
Beek, Ilievski, Debattista, Schlobach and Wielemaker (2018) devise a toolchain for analyzing the quality of literals in LOD Laundromat's data collection, proposing to automatically improve their value canonization and language tagging. Our work focuses on errors that cannot be detected by methods that check the syntactic validity of typed literals, like illegal dates, and is thus orthogonal to such prior work. Recent work has assessed quality for specific domains. For instance, Turki, Jemielniak, Taieb, Gayo, Aouicha, Banat, Shafee, Prud'Hommeaux, Lubiana, Das and Mietchen (2020) report an analysis using ShEx expressions to assess the quality of COVID-19 knowledge in Wikidata. This analysis is more comprehensive than the one reported in our paper, but it has a much more limited scope and is less generalizable, reflecting the consensus of a specialized community. Finally, our work relates to efforts that assess the quality of voluntary contributions to large knowledge bases, like Wikipedia (Wilkinson and Huberman, 2007; Raman, Sauerberg, Fisher and Narayan, 2020) and OpenStreetMap (Mooney and Corcoran, 2012; Fonte, Antoniou, Bastin, Estima, Arsanjani, Bayas, See and Vatseva, 2017). The quality indicators and findings in these works may inspire future research into the quality of large "wisdom of the crowd"-based KGs like Wikidata.

This paper studies the quality of Wikidata by proposing three quality indicators based on statements that have been: 1) permanently removed; 2) deprecated; or 3) found to violate constraints defined by the community. Our analysis reveals that, while Wikidata is becoming a KG of increasing quality (removing duplicate entities, fixing modeling errors, and removing constraint violations), there is still room for improvement in preventing entity duplication and constraint violations, adopting consistent guidelines for literals, and completing missing data. Our findings may complement ongoing efforts by the Wikidata community to improve data quality based on games and suggestions, aiming to make it easier for users to find and correct mistakes. In fact, we are initiating a discussion on how to integrate our methods, findings, and recommendations into Wikidata's infrastructure. Future work will expand our constraint analysis to additional constraint types and properties; investigate the quality of Wikidata over time and its relation to contributor profiles (Piscopo and Simperl, 2018); and expand our findings by considering additional qualifiers and references (Piscopo et al., 2017a).

References
Recoin: relative completeness in Wikidata
Literally better: Analyzing and improving the quality of literals
LOD Laundromat: a uniform way of publishing other people's dirty data
Shape Designer for ShEx and SHACL constraints
A practical framework for evaluating the quality of knowledge graph
Knowledge Vault: A web-scale approach to probabilistic knowledge fusion
Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO
Assessing VGI data quality. Mapping and the Citizen Sensor
Message passing for hyper-relational knowledge graphs
Wikidata removed statements from
KGTK: a toolkit for large knowledge graph manipulation and analysis
Logic on MARS: Ontologies for generalised property graphs
OWL Web Ontology Language overview. W3C recommendation 10
Survey on English entity linking on Wikidata
Characteristics of heavily edited objects in OpenStreetMap
Wikidata on MARS
Provenance information in a collaborative knowledge graph: an evaluation of Wikidata external references
What makes a good collaborative knowledge graph: Group composition and quality in Wikidata
Who models the world? Collaborative ontology creation and user roles in Wikidata
What we talk about when we talk about Wikidata quality: a literature survey
Classifying Wikipedia article quality with revision history networks
A quality assessment approach for evolving knowledge bases
Beyond triplets: hyper-relational knowledge graph embedding for link prediction
VerbNet: A broad-coverage, comprehensive verb lexicon
Constraint violation summaries (Dump: Dec 7th
usc-isi-i2/wd-quality: First notebook release
Wikidata deprecated statements by
The wisdom of crowds: why the many are smarter than the few and how collective wisdom shapes business, economies, societies, and nations
YAGO 4: A reason-able knowledge base
Using shape expressions (ShEx) to share RDF data models and to guide curation with rigorous validation
Using logical constraints to validate information in collaborative knowledge graphs: a study of COVID-19 on Wikidata
Wikidata: a free collaborative knowledgebase
Cooperation and quality in Wikipedia
Typing errors in factual knowledge graphs: Severity and possible ways out

Acknowledgments
This material is based on research sponsored by the Air Force Research Laboratory under agreement number FA8750-20-2-10002. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government. Daniel Schwabe was partially supported by grant 309808/2017-0 from CNPq, Brazil.

Annex A
The snippet below represents the KGTK queries that encode the item-requires-statement (IRS) constraints for property P1321 (place of origin (Switzerland)) in Wikidata. The property has two IRS constraints: 1) each item with the property should be an instance of (P31) human (Q5), and 2) its country of citizenship (P27) should be Switzerland (Q39). There is a single exception to this rule, the person Hans von Flachslanden (Q1583384). The code of the query below is generated automatically with our framework. Comments have been added (with "#") to explain the different parts of the query.