Beyond VIAF: Wikidata as a Complementary Tool for Authority Control in Libraries

ARTICLE

Carlo Bianchini, Stefano Bargioni, and Camillo Carlo Pellizzari di San Girolamo

INFORMATION TECHNOLOGY AND LIBRARIES | JUNE 2021
https://doi.org/10.6017/ital.v40i2.12959

ABSTRACT

This paper aims to investigate the reciprocal relationship between VIAF® and Wikidata and their possible roles in the semantic web environment. It deals with their data, their approach, their domain, and their stakeholders, with particular attention to identification as a fundamental goal of Universal Bibliographic Control. After examining interrelationships among VIAF, Wikidata, libraries and other GLAM institutions, a double approach is used to compare VIAF and Wikidata: first, a quantitative analysis of VIAF and Wikidata data on personal entities, presented in eight tables; and second, a qualitative comparison of several general characteristics, such as purpose, scope, organizational and theoretical approach, data harvesting and management (shown in table 9). Quantitative data and qualitative comparison show that VIAF and Wikidata are quite different in their purpose, scope, organizational and theoretical approach, data harvesting, and management. The study highlights the reciprocal role of VIAF and Wikidata and its helpfulness in the worldwide bibliographical context and in the semantic web environment and outlines new perspectives for research and cooperation.

INTRODUCTION

In 2011, the Library Linked Data Incubator Group, a W3C working group with the aim “to help increase global interoperability of library data on the Web,” published its final report. Two interrelated issues were tackled in that milestone report: what libraries can do for the semantic web and what the semantic web can do for libraries.
Linked data is an important asset for libraries as the “use of identifiers allows diverse descriptions to refer to the same thing. Through rich linkages with complementary data from trusted sources, libraries can increase the value of their own data beyond the sum of their sources taken individually.”1 Linked data thus contributes greatly to library cataloguing work, not just for the description of resources but also for their proper identification. On the other hand, libraries have always created and curated a significant amount of valuable information assets, and library authority data for names and subjects help reduce “redundancy of bibliographic descriptions on the Web by clearly identifying key entities that are shared across Linked Data. This will also aid in the reduction of redundancy of metadata representing library holdings.”2

The report opened a new way of thinking about Universal Bibliographic Control (UBC), a “world-wide system for control and exchange of bibliographic information” (https://archive.ifla.org/ubcim/ubcim-archive.htm), the purpose of which is “to make universally and promptly available, in a form which is internationally acceptable, basic bibliographic data on all publications in all countries.”3 Exchanging information and data requires standards, at both the national and international level, for description, identification, and data format.

Carlo Bianchini (carlo.bianchini@unipv.it) is Associate Professor, Department of Musicology and Cultural Heritage, University of Pavia. Stefano Bargioni (bargioni@pusc.it) is Deputy Director, Library of the Pontifical University Santa Croce (Rome). Camillo Carlo Pellizzari di San Girolamo (camillo.pellizzaridisangirolamo@sns.it) is graduate student, Department of Classics, University of Pisa and Scuola Normale Superiore. © 2021.

Nowadays, a pillar of UBC is VIAF® (the Virtual International Authority File), a worldwide project designed by a few national libraries and run by OCLC, which combines multiple name authority files with the goal “to lower the cost and increase the utility of library authority files by matching and linking widely-used authority files and making that information available on the Web [https://www.viaf.org/].” It “clusters together the various forms of names for an entity” and has become “a major source for authority control and is becoming the collective reference source at the international level.”4

VIAF is a fundamental tool for the identification of entities (people, locations, works, and expressions) relevant for the bibliographic universe. Yet, as it is based on the harvesting of data from authoritative national libraries spread all over the world, it has a top-down approach: libraries and services that are not VIAF sources can only refer to VIAF, but not actively cooperate with it, and, by its nature, VIAF cannot admit user cooperation. Therefore, on a global scale, a very large number of local libraries are excluded, and so are their data, collections, and specificities.
Furthermore, since the design and development of VIAF at the beginning of the 21st century, the semantic web environment has hugely evolved, and libraries are more and more required to act in new directions and to explore new forms of cooperation.5 Illien and Bourdon maintain not only that libraries “must now be careful to keep up their own interoperability,” but also that they “would be well-advised to keep up or enter into dialogue with the most influential communities in the Web of data—smoothing out their own disputes in the meantime.”6 Moreover, they believe that “building collaborative authority registries linked to standardized identifiers is one of the fundamental cornerstones of the new Universal Bibliographic Control.”7 Also, Dunsire and Willer suggest that a “smart UBC should strive to support all those who wish to think globally and act locally, with a better mix of bottom-up and top-down methodologies” as far as the “attempts to implement UBC as a worldwide system for the control and exchange of bibliographic information using top-down methodologies have only partially succeeded at global scale.”8

As a result, a better integration of libraries into the semantic web seems to require the involvement of a larger group of stakeholders (such as non-national agencies, museums, archives, and users) and the adoption of a complementary bottom-up approach.

A new global actor of the semantic web has both a bottom-up and a very inclusive approach: Wikidata. Wikidata is a freely available hosted platform that anyone, including libraries, can use to create, publish, and use Linked Open Data (LOD). Since 2012, many users have been involved in a bottom-up approach to identity management in Wikidata.
Furthermore, interest in and experience with the use of Wikidata to publish LOD among GLAM (galleries, libraries, archives, and museums) institutions is constantly increasing.9 Wikidata’s role as an important tool for the identification of entities of any kind, not just those of traditional importance to GLAM, has likewise been increasingly recognized in recent years.10

So, two worldwide identification tools, two different backgrounds, two opposite approaches. Are they mutually exclusive, or integrable? Is one of them sufficient for libraries’ needs, or do libraries need both? Which stakeholders are best served by VIAF? Which are best served by Wikidata?

This paper investigates the reciprocal relationship between VIAF and Wikidata and their possible specific roles in the semantic web environment with respect to their approach, their domain, and their stakeholders, with particular attention to identification as a fundamental goal of UBC.

Relationship between VIAF and Libraries

VIAF gathers a huge quantity of authority data from more than 50 sources, listed on the home page of the project (https://viaf.org). Millions of records coming from national libraries and other institutions are continuously processed using algorithms based on the matching of data and bibliographic relationships with the goal of creating clusters of names (figure 1).11

Figure 1. VIAF cluster for Wolfgang Amadeus Mozart (https://viaf.org/viaf/32197206/)

Clusters are usable in many services “to identify names, locations, works, and expressions while preserving regional preferences for language, spelling, and script” (https://www.oclc.org/en/viaf.html). Clusters may contain one or more IDs from VIAF sources.
Furthermore, unique identifiers of clusters (a VIAF ID, e.g., https://viaf.org/viaf/7524651/) are freely reusable and reused by other institutions to add useful information to their catalogues, open up new paths of information for the end user, contribute local data to the linked data cloud, and much more.12

Data sources are selected and approved by the VIAF Council (see https://www.oclc.org/en/viaf/contributing.html), and may belong to two categories: VIAF Contributors, usually national LAM (libraries, archives, museums) agencies, admitted following very selective criteria; and Other Data Providers, i.e., “other selected sources (e.g., Wikipedia [sic]) that are not VIAF Contributor agencies.”13 Other Data Providers include ISNI and Wikidata (even if Wikidata is often confused with Wikipedia, as in the quotation above).14 While Contributors are eligible to appoint a representative to the VIAF Council, Other Data Providers are not.

So, VIAF is based on a rigid three-level hierarchical approach: VIAF, VIAF Contributors, and Other Data Providers. All the other national and local institutions, i.e., relevant national data producers that are not national agencies, cannot provide data to VIAF; instead, they are expected to benefit from the use of VIAF IDs after performing a reconciliation process of their own data with VIAF IDs. However, the benefits may not be completely satisfactory in terms of data quality: while VIAF deals with “widely-used authority files,” it can be supposed that the libraries of non-national agencies need authority data that are more relevant on a local or specialized basis.
Lastly, while VIAF guidelines state that VIAF participants should periodically send updated data to VIAF, it is not clear when and how VIAF retrieves and collects data from Other Data Providers (https://www.oclc.org/content/dam/oclc/viaf/VIAF%20Guidelines.pdf).

Relationships between Wikidata and Academic, Research, and Public Libraries

Wikidata was launched in 2012 by the Wikimedia Foundation as the central storage of the structured data from all Wikimedia Foundation projects; it is “a freely available hosted platform that anyone—including libraries—can use to create, publish, and use LOD.”15 Wikidata stores stable and common information about entities, i.e., items and properties, and interlinks between different Wikimedia projects, in a form compliant with the RDF model (see https://www.mediawiki.org/wiki/Wikibase/DataModel/Primer). Additionally, Wikidata uses triples and enriches them with qualifiers and references.16 Qualifiers allow adding specifications about the validity of a statement (start/end date, precision, obsolescence, series ordinal, etc.); references are fundamental to justify the data, i.e., to document the authority data creator’s reason for choosing the name or form of name on which a controlled access point is based.17

Wikidata uses the software Wikibase (https://wikiba.se/), which is “an open-source software suite for creating collaborative knowledge bases” whose “data model prioritizes language independence and knowledge diversity.” The Wikibase open-source software, which is currently used by more than thirty institutions, supports federated SPARQL queries.18 Wikibase’s approach and characteristics are particularly interesting for the library world.
Gemeinsame Normdatei (GND) created a working group with Wikimedia Deutschland in order to “debate whether Wikibase is suitable for the needs of existing authority files coming from libraries” (https://wiki.dnb.de/display/GND/Authority+Control+meets+Wikibase); in March 2020 it was stated that the cooperation “has proven successful” and the current aim is to “develop a Wikibase-based GND and put it into use” (https://wiki.dnb.de/pages/viewpage.action?pageId=167019461). Similarly, the Bibliothèque nationale de France (BnF) and the Agence bibliographique de l'enseignement supérieur (Abes) launched the joint French National Entities File (FNE), which in 2019 carried out “a Proof of Concept to investigate the feasibility of using the software infrastructure of Wikibase to support the FNE.”19 A synthesis of the proof of concept, published in July 2020, mentioned, among the decisions taken, the choice to build FNE on Wikibase (https://www.transition-bibliographique.fr/wp-content/uploads/2020/07/synthese-preuve-concept-fne.pdf). FNE is scheduled to be launched in the next few years (https://f.hypotheses.org/wp-content/blogs.dir/2167/files/2020/02/20200128_8_VersUnFichierNationalDEntites.pdf).
Even more interestingly, between 2017 and 2018, OCLC explored a linked data Wikibase prototype; the final report shows, among other results, that “the building blocks of Wikibase can be used to create structured data with a precision that exceeds current library standards” and that “to populate knowledge graphs with library metadata, tools that facilitate the import and enhancement of data created elsewhere are recommended [. . . and . . .] the pilot underscored the need for interoperability between data sources, both for ingest and export.”20

In late 2019, the IFLA Wikidata Working Group was formed “to explore and advocate for the use of and contribution to Wikidata by library and information professionals, the integration of Wikidata and Wikibase with library systems, and alignment of the Wikidata ontology with library metadata formats such as BIBFRAME, RDA, and MARC” (https://www.ifla.org/node/92837).

On the Wikimedia side, in 2019 the LD4-Wikidata Affinity Group (LD4 stands for “linked data for”) was created by Hilary Thorsen, Wikimedian in Residence at Stanford University, to understand “how the library can contribute to and leverage Wikidata as a platform for publishing, linking, and enriching library linked data” (https://wiki.lyrasis.org/display/LD4P2/LD4-Wikidata+Affinity+Group).

Libraries’ interest in Wikidata is usually focused on LOD and semantic discovery, not on authority control: “Libraries may each use different, unique, or select identifiers and authority control methods for disambiguation. Increasingly, Wikidata is becoming an important tool for synchronizing across identifiers like Virtual International Authority File (VIAF) and ORCID identifiers.
Integrating awareness of Wikidata and its uses for enhancing metadata and linked open data will help advance a more interconnected research web.”21 Identification is a key issue both in bibliographic control and in the semantic web environment, as John Riemer noted: “Recent examination of the efforts involved in what we have historically called authority control in the PCC community has led us to the conclusion that the primary emphasis should be on identity management.”22 As a matter of fact, Wikibase and Wikidata’s approach to authority control and bibliographic description is quite new: not only does the traditional distinction between authority and bibliographic data disappear in a Wikibase description, but Wikidata is to be considered firstly as an identity management tool for any kind of entity.23

Relationship between VIAF and Wikidata

The first attempt at cooperation between VIAF and Wikidata goes back to 2012, when Maximilian Klein and Alex Kyrios, Wikipedians in Residence at OCLC and the British Library, respectively, developed a project to integrate authority data from VIAF with English Wikipedia biographical articles. The project successfully “added authority data to hundreds of thousands of articles on the English Wikipedia,” but above all showed that “linking of data represents an opportunity for libraries to present their traditionally siloed data, such as catalogue and authority records, in more openly accessible web platforms.”24 At the time, Wikidata was taking its first steps, but later authority data were successfully transferred from English Wikipedia to Wikidata.
At present, the connection between Wikidata and VIAF is very strong. Both VIAF and Wikidata are founded on a strict authority control that is built on a few cataloguing principles. In particular, both apply the principle that the authorized access point “for the name of an entity should be recorded as authority data along with identifiers for the entity and variant forms of name.”25 In addition, Wikidata is a data provider in VIAF, while VIAF IDs are constantly recorded and updated in Wikidata items. At present, Wikidata has 8,304,947 personal items, out of which 2,061,046 items have a VIAF ID. Moreover, each month a Wikidata bot (https://www.wikidata.org/wiki/User:KrBot) updates links in Wikidata items to redirected VIAF clusters and removes links to abandoned VIAF clusters.
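The monthly maintenance pass described above can be illustrated with a minimal Python sketch. It is not KrBot's actual code, which this paper does not describe; the function name `sync_viaf_ids`, the redirect map, the abandoned set, and the sample IDs are all hypothetical and serve only to show the logic of following redirects and dropping links to abandoned clusters.

```python
from typing import Dict, Iterable, List, Set


def sync_viaf_ids(
    current_ids: Iterable[str],
    redirects: Dict[str, str],
    abandoned: Set[str],
) -> List[str]:
    """Return the VIAF IDs an item should hold after one maintenance pass.

    `redirects` maps an old cluster ID to the cluster it now points to;
    `abandoned` holds cluster IDs that no longer exist. Both are assumed
    inputs for this sketch, not a real VIAF or Wikidata data structure.
    """
    updated: List[str] = []
    for viaf_id in current_ids:
        # Follow redirect chains (A -> B -> C), guarding against cycles.
        seen = {viaf_id}
        while viaf_id in redirects and redirects[viaf_id] not in seen:
            viaf_id = redirects[viaf_id]
            seen.add(viaf_id)
        # Drop links to clusters that were abandoned.
        if viaf_id in abandoned:
            continue
        # Two old IDs may redirect to the same cluster: keep it once.
        if viaf_id not in updated:
            updated.append(viaf_id)
    return updated


# Hypothetical data: "1001" and "1002" were merged into "2000"; "1003" is gone.
example = sync_viaf_ids(
    ["1001", "1002", "1003"],
    redirects={"1001": "2000", "1002": "2000"},
    abandoned={"1003"},
)
print(example)
```

Any institution that stores VIAF IDs locally would need equivalent redirect and abandonment information to run a comparable pass on its own records.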
The relevance of VIAF to the Wikidata information ecosystem is evident in the visualization of external identifiers in the items: VIAF IDs, represented on Wikidata by property P214 (https://www.wikidata.org/wiki/Property:P214), are automatically sorted as the first external identifier, preceded by the group of ISO standards and followed by the group of VIAF sources.26 Using specific gadgets, i.e., enhancements of the edit interface, Wikidata registered users can add to a specific item the IDs of single VIAF sources extracting them from the VIAF ID(s) present in the item.27

Unfortunately, there is no automatic reciprocity between VIAF and Wikidata: when a Wikidata item gets a link to a VIAF cluster, VIAF does not have an automated way to add a reciprocal link to the Wikidata item. Likewise, when a VIAF cluster gets a link to a Wikidata item, Wikidata has no automatic way to add a reciprocal link to the VIAF cluster.

Another very important aspect of the VIAF-Wikidata relationship is that Wikidata uploads data from VIAF only by voluntary work of Wikidata users; and this approach applies to national library data, and to any other data, too. When available, VIAF IDs are typically one of the most important elements used by users to decide the identity of a Wikidata item.

Wikidata Controls on VIAF

In Wikidata, the use of constraints, i.e., rules that check the appropriate use of a property (https://www.wikidata.org/wiki/Help:Property_constraints_portal), enables easy discovery of possible inconsistencies in statements, both in data and in external identifiers. Weekly, a Wikidata bot (https://www.wikidata.org/wiki/User:KrBot2) updates the database reports containing the constraint violations for each property, so that users can check the issues and try to fix them. Users can also check constraint violations in real time using the appropriate queries linked in the talk page of each property.
As for VIAF IDs, two types of constraint violations are particularly relevant both for data entry and for the present paper:

• “Single value” violations, i.e., one item has two or more VIAF IDs. This means either that one or more VIAF IDs do not belong to the item, so that the non-pertinent VIAF IDs should be removed from the Wikidata item, or that several VIAF IDs exist for the same real entity, so that all the existing VIAF IDs must be kept in the Wikidata item until VIAF merges them. An example of a merge performed by VIAF, maybe on the basis of the corresponding Wikidata item, can be found in Iulius Rufinianus (https://www.wikidata.org/wiki/Q28131664), where the eight distinct VIAF IDs contained in the Wikidata item on September 24, 2019, have now been merged (https://www.wikidata.org/w/index.php?title=Q28131664&oldid=1001570078); in April 2021, the Wikidata item for Alaricus I (https://www.wikidata.org/wiki/Q102371) contains four VIAF IDs (but there were ten on June 29, 2020; https://www.wikidata.org/w/index.php?title=Q102371&oldid=1220309663).

• “Unique value” violations, i.e., two or more Wikidata items have the same VIAF ID. This violation may indicate not only an error on the Wikidata side but also an error in VIAF. In the former case, either one or more Wikidata items have a non-pertinent VIAF ID, to be removed, or the same entity is described by two or more Wikidata items, to be merged. In the latter case, the VIAF ID conflates two or more distinct entities in one cluster.
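The two violation types listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the logic of the Wikidata constraint reports: the function name `find_violations` and the item `Q_EXAMPLE` with its two IDs are hypothetical, while the two Abrams items and the cluster 57898554 are the conflation case cited in this paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_violations(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Detect both violation types from (item, VIAF ID) pairs.

    Returns (single_value, unique_value):
    - single_value: items holding two or more VIAF IDs;
    - unique_value: VIAF IDs held by two or more items.
    """
    ids_by_item = defaultdict(set)
    items_by_id = defaultdict(set)
    for item, viaf_id in pairs:
        ids_by_item[item].add(viaf_id)
        items_by_id[viaf_id].add(item)

    single_value = {
        item: sorted(ids) for item, ids in ids_by_item.items() if len(ids) > 1
    }
    unique_value = {
        viaf_id: sorted(items)
        for viaf_id, items in items_by_id.items()
        if len(items) > 1
    }
    return single_value, unique_value


pairs = [
    ("Q4117019", "57898554"),   # Herbert E. Abrams, painter
    ("Q23665535", "57898554"),  # Herbert L. Abrams, physician
    ("Q_EXAMPLE", "111"),       # hypothetical item with two VIAF IDs
    ("Q_EXAMPLE", "222"),
]
single_value, unique_value = find_violations(pairs)
print(single_value)
print(unique_value)
```

The check itself cannot tell whether a flagged case is an error in Wikidata or in VIAF; as the list above explains, that decision requires human inspection of the entities involved.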
An example of conflation is the cluster at https://viaf.org/viaf/57898554/, where the painter Herbert E. Abrams (1920–2003; https://www.wikidata.org/wiki/Q4117019) and the physician Herbert L. Abrams (1920–2016; https://www.wikidata.org/wiki/Q23665535) are conflated. In that case, Wikidata users can report the VIAF conflation error in the proper Wikidata error-report pages.28 In most cases just a few weeks are required for VIAF to merge clusters regarding the same entity when Wikidata includes them in the same item, but cases of conflation are fixed more slowly.

While updates to VIAF clusters and IDs are obviously necessary and welcome, they are somewhat risky for VIAF Contributors, providers, and users that base the consistency of their data on VIAF. So, national libraries could import incorrect data into their IDs and Wikidata could import wrong national library IDs referring to different entities into the same Wikidata item. There is no evidence that the error-report pages created and updated by Wikidata users are being systematically taken into consideration by VIAF to solve its conflations.

Recently, other issues in the use of VIAF as a source were raised when VIAF removed very important information about its cluster merging process, information that is no longer available to worldwide libraries and users. The VIAF data dump page (http://viaf.org/viaf/data) is refreshed monthly and, until April 2020, it included a persist file. For example, the February 2020 dump, viaf-20200203-persist-rdf.xml.gz, contained data about redirected clusters and, potentially, abandoned clusters as well. This information is essential to the prompt and safe synchronization of local data with VIAF clusters. In this dump, redirected clusters were described, for instance, as follows:
while any abandoned cluster (14,692,237 out of 24,030,176!) was erroneously described as follows:
This XML empty statement omits the specific information about the abandoned cluster.
To obtain this invaluable information again, we filed a bug by email.29 The decision taken was drastic: starting in May 2020, VIAF stopped including this information in its monthly dump, as stated at the bottom of the page itself.30 As a result, the only recourse available to VIAF Contributors or any other institution that wishes to synchronize its authority records with VIAF identifiers is to rely on an external identification tool such as Wikidata!

MATERIALS AND METHODS

Any comparison between VIAF and Wikidata must consider their different content. VIAF contains personal name clusters, corporate name clusters, geographic name clusters, and work clusters, whereas Wikidata allows items to describe any kind of entity relevant in the universe of discourse of the users’ data and irrespective of their bibliographic nature. Even if all kinds of VIAF clusters are relevant for bibliographic control, this study is limited to the analysis of personal name clusters in VIAF and of items having “instance of: human” (P31:Q5) in Wikidata, because they are by far the most represented in VIAF and they can be directly compared.31 Some entities, such as mythological persons, legendary persons, etc., that are personal clusters in VIAF, are not treated as humans in Wikidata and belong to other instances (e.g., https://www.wikidata.org/wiki/Q95074).

A double approach was used to compare VIAF and Wikidata: First, data analyses of VIAF and Wikidata were performed, to compare VIAF clusters and Wikidata items and to investigate their reciprocal relationships (see the Data Analysis section).
Second, a comparison of several general characteristics, such as scope, objectives, philosophy, authority control, and identification, was made based on respective websites and available literature to find and highlight differences and similarities. Full VIAF dumps are available in native XML, RDF, MARC-21 XML, or ISO-2709 MARC-21 (http://viaf.org/viaf/data/). VIAF clusters were analyzed using an XML dump published on September 6, 2020 (http://viaf.org/viaf/data/viaf-20200906-clusters.xml.gz). Full Wikidata dumps are available in XML, JSON, or RDF.32 However, given the size of the entire dataset, it is much more convenient to create customized RDF dumps using the tool WDumper (https://wdumps.toolforge.org/). All the information (settings, dimension, and date of base dump) about dumps created using WDumper remains traced (https://wdumps.toolforge.org/dumps). Wikidata items were analyzed using a customized RDF dump updated to September 14, 2020 (https://wdumps.toolforge.org/dump/732). The customized dump contains all statements with non-deprecated values33 present in items having both “instance of: human” (P31:Q5) in best rank and at least one value of “VIAF ID” (P214) in best rank. Both dumps were parsed using three Perl scripts. Dumps and scripts were uploaded on Zenodo and are all available for analysis and reuse.34 Perl scripts generate JSON data that are published on the HTML page http://catalogo.pusc.it/beyond_viaf/, where they are interpreted by JavaScript scripts in order to populate eight tables: three dedicated to VIAF (tables 1–3) and five to Wikidata (tables 4–8). 
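The selection criteria of the customized dump can also be expressed as a SPARQL query, which may help readers who want to inspect a few matching items. The dump analyzed here was built with WDumper, not with this query; the sketch below is only an analogous selection, and the helper names `build_query` and `build_request_url` are hypothetical. It relies on the `wdt:` prefix of the Wikidata Query Service, which returns only best-rank ("truthy") statements.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(limit: int = 10) -> str:
    """Select items that are humans (P31:Q5) and have a VIAF ID (P214).

    The wdt: prefix restricts matches to best-rank statements, so
    deprecated values are not returned.
    """
    return (
        "SELECT ?item ?viaf WHERE {\n"
        "  ?item wdt:P31 wd:Q5 ;\n"
        "        wdt:P214 ?viaf .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(limit: int = 10) -> str:
    """Build a GET URL asking the endpoint for JSON results."""
    params = urlencode({"query": build_query(limit), "format": "json"})
    return f"{WDQS_ENDPOINT}?{params}"


print(build_query(5))
print(build_request_url(5))
```

A small `LIMIT` keeps the request light; retrieving all matching items this way would likely hit the query service timeout, which is one practical reason to prefer a customized dump for a full analysis.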
In order to select the statements to be analyzed in Wikidata items, three sets of relevant properties were found through three distinct SPARQL queries at the end of September 2020: VIAF members (table 5), authority controls related to libraries that are not VIAF members (table 6), and biographical dictionaries (table 7).35 At the beginning of October 2020, another SPARQL query was performed to find all the personal items containing the authority controls related to libraries that are not VIAF members (table 6, column 4), without restricting the search to personal items having at least one value of “VIAF ID” (P214).36

DATA ANALYSIS: VIAF CLUSTERS AND WIKIDATA ITEMS

For this paper, two different versions of the data tables were produced: the first version, available at http://catalogo.pusc.it/beyond_viaf/, is a full, commented, and dynamic version of all the tables. Within that version, links to the acronyms (such as LC, DNB, SUDOC, etc.) of all the VIAF Contributors and Other Data Providers are available too. Static versions of these tables are included in this paper with commentary.

VIAF

VIAF has 22,099,715 personal clusters, half of which (50.90%; table 1, col. 2) are isolated clusters (i.e., they contain only one ID).
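The notion of an isolated cluster lends itself to a short illustration. The following Python sketch assumes a simplified in-memory mapping from cluster ID to the list of source IDs it contains; the function name `isolated_share`, the cluster IDs, and the source IDs are hypothetical, and the sketch is not the Perl code used for this study.

```python
from typing import Dict, List


def isolated_share(clusters: Dict[str, List[str]]) -> float:
    """Percentage of clusters that contain exactly one source ID."""
    if not clusters:
        return 0.0
    isolated = sum(1 for source_ids in clusters.values() if len(source_ids) == 1)
    return round(100 * isolated / len(clusters), 2)


# Hypothetical clusters: two of the four hold a single source ID.
sample = {
    "c1": ["LC|a"],
    "c2": ["LC|b", "DNB|b", "WKP|Qb"],
    "c3": ["ISNI|c"],
    "c4": ["SUDOC|d", "NTA|d"],
}
print(isolated_share(sample))  # 50.0
```

Restricting the source lists to VIAF Contributors before calling the function would give the second kind of figure discussed in this section, where Other Data Providers are left out of the count.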
The presence of isolated clusters is interesting because it means that those clusters were created from data coming from just one source. What is more, the percentage of isolated clusters is much higher (71.19%; table 1, col. 12) if only VIAF Contributors are taken into account (i.e., excluding isolated clusters due to data from Other Data Providers, such as ISNI). It is worth noting that Other Data Providers can form isolated clusters, with the relevant exception of Wikidata (for which VIAF uses the acronym WKP), which never appears in isolated clusters (table 1, cols. 7 and 8).

Table 1. VIAF personal clusters by number of sources [adapted from http://catalogo.pusc.it/beyond_viaf/#tb1]

The total number of IDs present in VIAF clusters is 51,327,847 (table 2), distributed across 22,099,715 clusters; the most relevant Contributors include LC (7,266,628 IDs), DNB (5,677,731 IDs), SUDOC (3,278,189 IDs), and NTA (2,754,036 IDs), while the most relevant Other Data Providers are ISNI (8,455,814 IDs) and WKP (2,148,680 IDs) (table 2). Apart from LC and DNB, data about isolated clusters (table 2, col. 5) show that the number of isolated clusters tends to decrease slowly over time and that clustering has improved: recently added sources tend to have a higher share of isolated IDs. Another relevant finding is that sources in non-Latin alphabets usually have higher shares of isolated IDs.37 So, a high number of isolated clusters may reveal a source whose IDs still partly need to be merged into existing clusters.

Table 2. VIAF personal clusters by source [adapted from http://catalogo.pusc.it/beyond_viaf/#tb2]

The histories of VIAF clusters, as contained in the XML dumps, appear odd and inconsistent. For example, many VIAF Contributors in their first year of appearance seem to have no additions and many removals (e.g., the BAV row; for complete information see table 3 on the website at http://catalogo.pusc.it/beyond_viaf/#tb3). The inconsistency is due to the absence of redirected and abandoned clusters from the data. Nevertheless, the histories make it possible to reconstruct the year of first contribution of each source (information otherwise unavailable) and to detect major changes in the data provided to VIAF by each source.38

Table 3. VIAF history of personal clusters by source [adapted from http://catalogo.pusc.it/beyond_viaf/#tb3]

Wikidata

Wikidata has 8,304,947 personal items, and 2,061,046 of them contain a VIAF ID. Usually one or more VIAF source IDs are extracted from the VIAF ID(s), so that 1,905,470 personal items containing a VIAF ID have at least one VIAF source ID (table 4, col. 1). Wikidata records IDs from a wide range of other resources, such as non-VIAF bibliographic agencies and biographical dictionaries (investigated in these tables), but also encyclopedias and various online databases. Considering the 2,061,046 items containing a VIAF ID, 684,367 items contain only one VIAF source ID (table 4, col. 1), but only 353,710 items contain only one ID among VIAF source IDs, non-VIAF source IDs, and biographical dictionary IDs (table 4, col. 15); so, more than 300,000 items containing only one VIAF source ID have at least one non-VIAF source ID and/or one biographical dictionary ID.

Table 4. Wikidata personal items (pers. it.) by number of IDs [adapted from http://catalogo.pusc.it/beyond_viaf/#tb4]

VIAF and Wikidata: A Data Comparison

From a quantitative perspective, Wikidata personal items (8,304,947) are 37.58% of VIAF personal clusters (22,099,715), while Wikidata personal items having a VIAF ID (2,061,046) are 9.33%. IDs from VIAF sources present in Wikidata personal items containing a VIAF ID (6,292,778; table 5, col. 3) are 12.91% of IDs present in VIAF personal clusters (48,740,933; table 5, col. 4). In the authors’ opinion, the quantitative comparison between VIAF and Wikidata must be interpreted carefully. It could be argued that this is a noticeable disadvantage of Wikidata with respect to VIAF, but that would be true only from a bibliographic control perspective, and the other side of the coin must be examined too. As Wikidata represents any kind of entity relevant for its users (libraries, archives, museums, and many other stakeholders), VIAF contains just over a third of Wikidata items (37%). Furthermore, a very large part of the personal entities represented in Wikidata (at present, more than 6,200,000, i.e., about 75%) cannot rely on VIAF for identification purposes (for example, because Wikidata personal items can also represent singers, lawyers, pilots, and so on). It can be concluded that VIAF can be considered just one specialized source, in the domain of the semantic web and with respect to the objectives of Wikidata.

Considering single VIAF sources, Wikidata surpasses VIAF by number of IDs in only two cases, PERSEUS (135.18%) and SIMACOB (102.17%) (table 5, col. 5). This is possible because Wikidata and VIAF gather different sets of data from both sources; the former uses sets of data obtained by its users, while the latter uses only data sent by the contributor. All the other sources, because of the absence of systematic imports, are much rarer in Wikidata than in VIAF.
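The proportions discussed in this comparison follow directly from the counts reported in the text. The short Python sketch below recomputes them from those counts; the constant and function names are illustrative only and are not taken from the authors' own tooling.

```python
# Counts as reported in the text (September/October 2020 snapshots).
VIAF_PERSONAL_CLUSTERS = 22_099_715
WIKIDATA_PERSONAL_ITEMS = 8_304_947
WIKIDATA_ITEMS_WITH_VIAF_ID = 2_061_046
VIAF_SOURCE_IDS_IN_WIKIDATA = 6_292_778  # table 5, col. 3
VIAF_SOURCE_IDS_IN_VIAF = 48_740_933     # table 5, col. 4


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Personal items that carry no VIAF ID at all.
items_without_viaf_id = WIKIDATA_PERSONAL_ITEMS - WIKIDATA_ITEMS_WITH_VIAF_ID

ratios = {
    "Wikidata personal items / VIAF personal clusters":
        pct(WIKIDATA_PERSONAL_ITEMS, VIAF_PERSONAL_CLUSTERS),
    "Wikidata personal items with VIAF ID / VIAF personal clusters":
        pct(WIKIDATA_ITEMS_WITH_VIAF_ID, VIAF_PERSONAL_CLUSTERS),
    "VIAF source IDs in Wikidata / VIAF source IDs in VIAF":
        pct(VIAF_SOURCE_IDS_IN_WIKIDATA, VIAF_SOURCE_IDS_IN_VIAF),
    "Wikidata personal items without VIAF ID / all personal items":
        pct(items_without_viaf_id, WIKIDATA_PERSONAL_ITEMS),
}

for label, value in ratios.items():
    print(f"{label}: {value}%")
```

Keeping the raw counts as named constants makes it easy to rerun the same ratios against a later dump.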
Table 5. Wikidata personal items (pers. it.) by VIAF source [adapted from http://catalogo.pusc.it/beyond_viaf/#tb5]

Table 6 and table 7 show authority control in Wikidata existing alongside VIAF. Wikidata contains some non-VIAF sources (usually non-national libraries or groups of libraries that could not become VIAF Contributors); their IDs in personal items having a VIAF ID (894,161) are 93.32% of their IDs in all personal items (958,206; table 6, col. 4), meaning that Wikidata provides a clusterization for more than 64,000 IDs (about 6.7%) that probably correspond to non-existent VIAF clusters (table 6, totals).

Table 6. Wikidata personal items (pers. it.) by non-VIAF sources [adapted from http://catalogo.pusc.it/beyond_viaf/#tb6]

Table 7. Wikidata personal items (pers. it.) by biographical dictionary [adapted from http://catalogo.pusc.it/beyond_viaf/#tb7]

In general, the presence of IDs of biographical dictionaries (796,609 IDs in total) in 725,755 personal items having a VIAF ID helps significantly in the definition of authoritative dates of birth and death (table 7, total of column 2, and table 4, total of column 12).
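Counts of this kind can in principle be requested from the Wikidata Query Service. The following Python sketch only builds the query text and the request URL; it is not the authors' actual query, which is not reproduced in this section. `P31`/`Q5` (instance of: human) and `P214` (VIAF ID) are existing Wikidata identifiers, while the function names are hypothetical and the property passed in is chosen by the caller. On the public endpoint, counts over all personal items may exceed the query timeout, so a dump-based count may be needed instead.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_count_query(source_property: str, require_viaf: bool) -> str:
    """Build a query counting the IDs of one external-ID property on humans.

    If require_viaf is True, only items that also have a VIAF ID (P214)
    are considered.
    """
    viaf_filter = (
        "  FILTER EXISTS { ?item wdt:P214 ?viaf . }\n" if require_viaf else ""
    )
    return (
        "SELECT (COUNT(?id) AS ?n) WHERE {\n"
        "  ?item wdt:P31 wd:Q5 .\n"
        f"  ?item wdt:{source_property} ?id .\n"
        f"{viaf_filter}"
        "}"
    )


def build_request_url(query: str) -> str:
    """Return a GET URL asking the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


# P244 (Library of Congress authority ID) is used here only as a well-known
# example of an external-ID property; for a non-VIAF source, its own
# property ID would be substituted.
query_all = build_count_query("P244", require_viaf=False)
query_with_viaf = build_count_query("P244", require_viaf=True)
print(query_with_viaf)
print(build_request_url(query_all))
```

Running the two queries for the same property and comparing the results gives the share of that source's IDs that sit in items already linked to VIAF.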
A comparison between table 1, column 7, and table 2, row WKP (the acronym for Wikidata wrongly used by VIAF) shows that 2,147,319 clusters contain 2,148,680 WKP IDs; this means that, from a VIAF point of view, there are only 1,361 Wikidata duplicates. Furthermore, a comparison between the total and row 0 in table 8, col. 1, shows that 2,061,046 items contain at least one VIAF ID and that 2,037,638 items contain exactly one VIAF ID; so, 23,408 items contain one or more VIAF duplicates. As a result, it can be concluded that the percentage of duplicates in Wikidata is less than 0.1% (1,361 out of 2,147,319) and in VIAF is about 1% (23,408 out of 2,061,046), so Wikidata is as trustworthy as VIAF. VIAF and Wikidata are able not only to discover each other’s duplicates but also to discover duplicates in VIAF sources, as shown by a comparison between table 8, col. 3 (the total number of cases in which a VIAF source has at least one duplicate) and table 8, col. 5 (the total number of cases in which VIAF sources are duplicated). However, while duplicates recorded by VIAF can be found only by querying the monthly dumps with in-house programs, duplicates discovered by Wikidata are easily found through SPARQL queries detecting single-value constraint violations.

Table 8. Wikidata personal items (pers. it.) by repeated VIAF sources and VIAF source IDs [adapted from http://catalogo.pusc.it/beyond_viaf/#tb8]

DISCUSSION

VIAF and Wikidata are quite different in their purpose, scope, organizational and theoretical approach, data harvesting and management.
A major difference between VIAF and Wikidata lies in their purpose. On the one hand, VIAF aims to identify bibliographic entities and to connect authority data provided by selected Contributors (national libraries, cultural agencies, and other major institutions) and extracted from Other Data Providers (such as ISNI, RISM or DE663, Wikidata, etc.) through the creation of clusters by means of software. On the other hand, like ISNI, Wikidata focuses on both identification and description of entities and has the purpose of collaboratively building a database concerning the sum of all relevant knowledge, accepting every item that complies with its notability criteria and using a crowdsourced approach (https://www.wikidata.org/wiki/Wikidata:Notability).

Another relevant difference between VIAF and Wikidata is their scope: while VIAF aims to identify a few selected types of entities already described within the bibliographic universe by national agencies, Wikidata aims to identify and describe any kind of entity of interest for the Wikidata community. Wikidata items may exist for any kind of entity and may contain a very broad range of data and of external identifiers.
So, Wikidata can represent bibliographic data and entities (e.g., at present Wikidata records data for 54% of all the bibliographic sources cited in Wikipedia entries), any other kind of entity provided for in VIAF (i.e., agents, works, expressions, and places), any other entity defined by the FRBR-IFLA LRM model (e.g., manifestations, items, timespans, nomens, res, etc.), and entities defined by other models relevant for the GLAM universe (such as FRBRoo and CIDOC).39 But it is open to any data model because it can also include any kind of entity outside the bibliographic or cultural heritage universe, as it is a knowledge base capable of containing any kind of statement on any entity users want to describe. In addition, for any kind of entity there is no minimum or maximum number of statements that must or can be added; as soon as an entity is clearly identified, it can be added to Wikidata. Moreover, when missing, new identifiers (and properties for description) can be proposed by anyone through property proposals and, if well defined, they are usually approved within two weeks (https://www.wikidata.org/wiki/Wikidata:Property_proposal). A broader scope is supposed to be much more convenient for users who wish to discover previously unknown links and information in the semantic web.

Organizational Model

Due to the VIAF top-down approach, data is completely managed by OCLC, with no chance for common users, medium and small libraries, or other institutions to directly improve VIAF clusters (e.g., by adding other data coming from their collections or from encyclopedias or online databases, merging duplicates, solving conflations, etc.). As the Wikidata approach is “to crowd-source data acquisition, allowing a global community to edit the data,” data is curated directly by users interested in their creation and use.40 So, in Wikidata, data is produced by volunteers, by means of semiautomatic or manual data harvesting from any desired and available source.
Moreover, users’ statistics show that authoritative data from national bibliographic agencies and other libraries, archives, and museums are normally uploaded by common users, not by librarians (or any other kind of institutional data curator).41

Identification Function

The theoretical approach differs too, both as to the form of the names and as to the identification function. In VIAF, preferred and variant forms of names for persons are based on national cataloguing codes. Because national codes are different, VIAF is needed and works as a neutral hub of all the national preferred forms. Cataloguing rules can ensure uniformity and univocity in the forms of the names of the entities within a national catalogue but are quite complicated for users to understand and use. In Ranganathan’s words, “the cataloguing conventions are on the surface quite contrary to what Mr. Everybody is familiar with.”42 In contrast, preferred forms in Wikidata are based on the international principles of the convenience of the user and common usage.43 A clear example is the use of the direct form of name (Jane Doe) instead of the inverted form of name (Doe, Jane).

A different usage in the forms of names could be an issue for the integration of library metadata in Wikidata. In practice, however, it is not. First, there is no conflict between the Wikidata form and any other form from a theoretical point of view, as the Wikidata form is already treated in VIAF as the preferred form within its specific context.44 In addition, Wikidata accepts any library identifier, so that any library-controlled form can be linked to a Wikidata item and vice versa. Furthermore, a Wikidata bot could be programmed to extract authorized and variant access points from national authority files and add them to the item labels and aliases.45 Lastly, it could be argued that national cataloguing codes are compliant with the ICP principles and with the convenience of the user and common usage. But a remarkable difference is that, while in national codes principles are applied by cataloguers for users, in Wikidata they are expressed directly by the users themselves.

As the identification function is a major feature of the semantic web, the different approaches of VIAF and Wikidata to this issue must be underlined. As noted, “VIAF remains neutral towards differences in the cataloguing policy of its data contributors” and, for this reason, VIAF accepts all IDs provided by its sources, even when they are not clearly identifiable entities but are just labels (see for example https://viaf.org/viaf/307171748 or https://viaf.org/viaf/305052259).46 On the contrary, Wikidata explicitly requires each item to refer to “a clearly identifiable conceptual or material entity” (second notability criterion; https://www.wikidata.org/wiki/Wikidata:Notability). As a consequence, many isolated clusters formed by VIAF on the basis of single Contributors’ IDs related to not-clearly-identifiable entities are not acceptable in Wikidata and remain unlinked. Moreover, data on cluster duplication shows that identification in Wikidata is performed with the same quality level as in VIAF.

Clusters for identification purposes are created in both VIAF and Wikidata but, unlike in VIAF, in Wikidata external identifiers (like all the other data) are not provided in a structured way by national libraries or other institutions (with very few exceptions); instead, identifiers are usually found and added by common users through web scrapers and after data cleaning. What is more, matches are not performed automatically, but semiautomatically (through tools such as OpenRefine, https://openrefine.org/, or Mix’n’match, https://mix-n-match.toolforge.org/) or manually.
An enhanced feature of Wikidata in clusterization is that it records a wider variety of sources and related IDs: due to its openness, Wikidata refers to VIAF and its sources, but also to any other library or cultural institution and to a large number of reference sources such as encyclopedias and biographical dictionaries (table 7). A wider variety of identification sources and manual work ensure a higher level of identification.

Data Quantity

Data harvesting affects both the quantity and the quality of data. In VIAF, data are collected from periodic contributions of VIAF participants, with very large sets of data. Therefore, from a quantitative point of view, VIAF has a far larger number of people (22,099,715 personal clusters) in comparison with Wikidata (8,304,947 personal items). Although Wikidata was created only in 2012, the number of personal items in Wikidata is currently already over a third (37%) of all VIAF personal clusters.

Although quantities are not directly comparable due to the different universes to be described, in the last few years initiatives to enhance organized cooperation between libraries and Wikidata and to promote data production in Wikidata have been increasing. A very high-quality initiative is supported by Cornell University, Harvard University, Stanford University, and the University of Iowa’s School of Library and Information Science, in collaboration with the Library of Congress and the Program for Cooperative Cataloging (PCC).
Their Linked Data for Production (LD4P) Wikidata project is “an in-depth exploration of how Wikidata could serve as a platform for publishing, linking, and enriching library linked data” (https://www.wikidata.org/wiki/Wikidata:WikiProject_Linked_Data_for_Production). An additional example is the IFLA Wikidata Working Group, which was formed “to explore and advocate for the use of and contribution to Wikidata by library and information professionals, the integration of Wikidata and Wikibase with library systems, and alignment of the Wikidata ontology with library metadata formats such as BIBFRAME, RDA, and MARC” (https://www.ifla.org/node/92837).

Even so, Wikidata is still very far from having a structured workflow to ingest data from national or local libraries, museums, and archives. While the projects mentioned above are mainly dedicated to explaining to librarians and institutions why Wikidata is important and how to contribute to it, there are still very few projects mainly dedicated to the concrete, large-scale synchronisation of library and bibliographic data with Wikidata. Such projects also require considerable effort in the manual cleaning of discrepancies and oddities emerging from the synchronisation. Relevant exceptions are the National Library of Wales47 and the Biblioteca europea di informazione e cultura, where significant work has been done to synchronise their respective databases of authors (and of other types of entities) with Wikidata.48

Data Quality

Data quality also needs to be analyzed in detail.
Even if data from national libraries are authoritative and of high quality, as a virtual file VIAF neither has nor produces its own data. Consequently, VIAF data does not always remain authoritative because errors can be both inherited and added, and clusters can be duplicated. The issue is well known to ISNI, which “whenever necessary [. . .] splits and merges data coming from VIAF, and even applies protection to data that has been fixed manually.”49 As shown in table 2 and table 8, VIAF clusters are subject to isolation and duplication when they are created and to many changes and updates when they are maintained. So, even if VIAF collects a huge amount of authoritative data and creates clusters of IDs, VIAF users cannot always safely and continuously rely on them. Data flows in just one direction (from national libraries to VIAF), VIAF deletes and rebuilds clusters without giving priority to the stability of one cluster over another, and, after April 2020, VIAF no longer makes available to users a record of its changes.50

By contrast, Wikidata data can always be closely checked by any user, as its structure is designed to trace even the smallest change to its data. Every single addition or deletion is documented, not just to recover easily from possible vandalism, but also to support any decision with clear evidence. Any stakeholder can know exactly if, how, when, and why data changed, at any moment.

What is more, from a qualitative point of view, Wikidata seems to offer a better solution for the recording of authority data than VIAF. First, it can store a wider variety of data about a person in a more semantic way. Not only is it possible in Wikidata to express preferred and variant forms of the name, related names, works, co-authors, publication statistics, and other data about the person, as in VIAF, but all these data are expressed in a semantic way.
For example, whereas in VIAF “Bach, Anna Magdalena” is just a related name of Johann Sebastian Bach, in Wikidata she is recorded and qualified as the person who married the musician. Thanks to that different approach, Wikidata can represent and show Bach’s full family tree (https://magnus-toolserver.toolforge.org/ts2/geneawiki/?q=Q1339). As Adamich noted, “building graphs from bibliographic entities is really about making the data machine readable and understandable. It is about making the data web enabled. In terms of translation, linked data opens up a whole new world over our MARC entrapment.”51

Quality is enhanced by matching methods too; whereas VIAF matches identities by an algorithm based on explicit identifiers or string matching (such as the forms of the name, dates, and bibliographic relationships),52 Wikidata matches are usually decided by a human, the user, or (in the case of semiautomatic imports) at least checked a posteriori by a human after some time. The higher precision of manual over automatic matching is recognized also in the VIAF Guidelines.53 Furthermore, as seen above, notability requires that, when clear identification is impossible, no item may be created in Wikidata.

Data Maintenance and Usability

Data quality relies also on maintenance. Comparison between Wikidata items and VIAF clusters shows a very small but constant presence of errors to be fixed in both (about 1% or less), even if it is impossible to determine with certainty whether VIAF uses Wikidata error pages.
Problems in having VIAF errors fixed directly by VIAF Contributors have already been noted: “While clustering anomalies can be handled by VIAF itself, reporting errors found in source data of VIAF partners raise problems related to the efficiency of the notification workflows. At this point, involvement of VIAF partners themselves in the process is needed.”54 On the other hand, in Wikidata anyone can edit items, add new data or delete mistakes, merge items, fix various issues, and so on, on the fly. Due to its openness, Wikidata may also suffer from vandalism, but it has its own solutions.55 Along with this, the accuracy and reliability of data receive special attention because data are uploaded and maintained by users who are direct stakeholders. For this reason, in Wikidata, references to bibliographical or biographical sources and to Other Data Provider IDs, such as any national and international identification system, are suggested, promoted, and carefully examined. Moreover, there is a commitment to monitor the consistency of VIAF clusters. The ability of Wikidata to identify inconsistent VIAF clusters and the fact that VIAF isolated clusters can be reduced by at least 30%56 by referring to identifiers from Wikidata and Other Data Providers are the best demonstration of the quality of its data and of the importance of the Other Data Providers in VIAF clusterization.

As to the usability of data, the internal search of VIAF offers only basic functions: the only available filter limits results to clusters having one specific source. Filtering searches for clusters having and/or not having a specific group of sources, or for clusters having more or fewer sources, would be very useful, especially in order to find duplicates.
In contrast, Wikidata has a SPARQL query service that returns results based on the current status of the database, and its internal search can integrate some of the functions of the query service, making it possible to look for items having and/or not having specific statements (https://www.wikidata.org/wiki/Special:Search).57 Considering cases in which VIAF and Wikidata discover potential duplicates in their sources, VIAF has no page dedicated to listing cases of (supposedly) duplicate IDs from its sources, while Wikidata makes it easy to find cases in which single sources have (supposedly) duplicate IDs through constraint violations58 and appropriate SPARQL queries.

A Comparison Table

A comparison table was built to compare scope, role, system, and functions between VIAF and Wikidata, inspired by and adapted from a VIAF vs ISNI comparison.59

Table 9. Comparison between and complementarity of VIAF and Wikidata features

Feature: Scope
VIAF:
● Persons
● Organizations
● Works
● Expressions
● Locations
Wikidata:
● Any kind of VIAF entity
● Any “res” of IFLA LRM
● Any entity of CIDOC
● Any other non-GLAM entity
● Any entity in the universe of discourse

Feature: Software
VIAF:
● Unknown
Wikidata:
● Wikibase60

Feature: Data. Person entity properties
VIAF:
● Preferred form of name, based on national cataloguing rules
● Very rich variant forms of name, identified by national agencies
● Sources
Wikidata:
● Preferred form of name (label) based on convenience of the user and common usage61
● Variant forms of name (aliases), organized by languages and scripts62
● Sources (as statements and references and with qualifiers)

Feature: Data. Quantity (persons)
VIAF:
● Number of clusters: 33,656,281 (Sept. 2020)
● Number of personal clusters: 22,099,715 (Sept. 2020)
Wikidata:
● Number of entities: 90,260,081 (Oct. 2020)
● Number of personal items: 8,304,947 (Oct. 2020)
● Number of personal items with VIAF ID: 2,061,046 (Sept. 2020)

Feature: Data. Harvesting
VIAF:
● Data are provided by authoritative national bibliographic agencies
Wikidata:
● Data are added through massive semiautomatic imports and/or manually by any interested user

Feature: Data. Quality
VIAF:
● Data are guaranteed by authoritative national bibliographic agencies
Wikidata:
● Data are controlled by any directly interested user, based on data from VIAF, available bibliographic agencies, and other authoritative bibliographic sources

Feature: Data. Other entities properties
VIAF:
● ISBN, titles, dates included in the cluster
● Dates, genre, bibliographic references from sources, xlinks, etc.
● Properties are unchangeable
Wikidata:
● Any kind of property applicable to an entity can be used (multimedia included)63
● All statements admit references, which are strongly recommended in some cases
● Unavailable properties can be freely added through a process of property proposal64

Feature: Data. Dates
VIAF:
● Dates are extracted from authority and bibliographic records using a parsing technique; calendars and precision are not available65
Wikidata:
● Dates are imported semiautomatically from various sources or filled in manually; different calendars are available and further statements can be made through qualifiers66

Feature: Data. Vandalism
VIAF:
● No vandalism: data are editable only by OCLC
Wikidata:
● Everyone can edit, but items which are frequently vandalized can be temporarily or permanently protected from the edits of unregistered users67

Feature: Data. Fixing errors, deduplicating, or unmerging clusters/items
VIAF:
● Suggestions and requests via email
● Asynchronous
● Presumably, automated processes and human interventions
● VIAF rebuilds clusters and does not give priority to the stability of one cluster over another68
Wikidata:
● Everyone can edit69
● Instantaneous
● Probable errors (constraint violations) are detected in an automated way (by bots and through queries)
● Pages with lists of probable errors (constraint violations) are freely available and constantly updated in an automated way (by bots)70

Feature: Data. License
VIAF:
● All public data (license: http://opendatacommons.org/licenses/by/1.0/)
Wikidata:
● All public data (license: https://creativecommons.org/publicdomain/zero/1.0/deed.it)

Feature: Role
VIAF:
● Create clusters
● Ingest authority records from VIAF Contributors and Other Data Providers (including WKD and ISNI)
● Publish and diffuse VIAF IDs and data
Wikidata:
● Create items with a worldwide recognized and standard identifier
● Interlink items with any available external identifier
● Ingest data from VIAF, from VIAF Contributors, and Other Data Providers (e.g., ISNI)
● Allow free tools (e.g., Mix’n’match) to be created and maintained on Toolforge to ingest external identifiers71
● Manage library, bibliographic, and non-library and non-bibliographic linked data
● Publish and diffuse Wikidata IDs and data

Feature: Organizational model
VIAF:
● OCLC service, guided by VIAF Council of participating institutions
● Hierarchical, top-down
● Membership on request and subordinated to approval
● Largely limited to national bibliographic agencies
Wikidata:
● Wikimedia project
● Distributed, bottom-up
● Everyone can take part in the project72
● Open to any bibliographic or non-bibliographic institution (national, large, medium, and small)

Feature: System. Website
VIAF:
● Interface only in English language
Wikidata:
● Interface in nearly any language and script; new ones can be added
● Online facilities (end user input; edit online facilities for end user)
● Login enhances users’ experience (by gadgets and scripts)

Feature: System. Updating
VIAF:
● Periodical (asynchronous) ingestions
Wikidata:
● Continuous, instantaneous, free updates

Feature: System. Versioning
VIAF:
● History is included in each present cluster and for abandoned clusters
● History is inaccessible in redirected clusters
Wikidata:
● Page history available in each item and for redirected items
● For deleted items, history is accessible only to administrators

Feature: Long-term preservation policy
VIAF:
● OCLC maintains the hosting, software, and data for VIAF73
Wikidata:
● Wikimedia Foundation maintains the hosting, software, and data for Wikidata74

Feature: Notifications to stakeholders
VIAF:
● Notifications to be sent to data providers
Wikidata:
● Notifications are sent to end users and contributors

Feature: Display, search, and download
VIAF:
● In multiple formats: xml and json, including justlinks.json
● Basic search interface
● Clusters are listed without clear ranking rule
● Integrating monthly dumps
● API endpoint75
● Before April 2020, by monthly dump with persist links; after, monthly dumps without persist links
Wikidata:
● In multiple formats: json, php, n3, ttl, nt, rdf, jsonld, html76
● Search interface77
● API endpoint78
● SPARQL query endpoint79
● Dumps80, also customizable81
● See https://www.wikidata.org/wiki/Help:About_data

Feature: Linked data and SRU
VIAF:
● Linked data
● SRU82 (search and browse indexes, using CQL syntax; output formats are XML or HTML)
Wikidata:
● Linked data

Feature: Interoperability. Local
VIAF:
● Local institutions can only reconcile VIAF IDs to their own data
● As changes are made by VIAF, synchronization must be periodically performed by sources and local institutions
Wikidata:
● Full reconciliation, upload, and synchronization of local IDs on Wikidata and vice versa
● Dedicated tools: Mix’n’match
● Other tools: OpenRefine
● Bots
● Manually

CONCLUSION

The main features of VIAF and Wikidata and their data on personal entities were analyzed and compared in this study to focus on analogies and differences, and to highlight their reciprocal role and helpfulness in the worldwide bibliographical context and in the semantic web environment.

VIAF is a major international initiative to address the challenge of reliably identifying bibliographic agents on the web, by means of authoritative data based on national cataloguing codes and coming from the national libraries involved in the UBC program. Moreover, VIAF is a pillar of the identification process that users enact within Wikidata. Still, the comparison emphasized a few relevant issues in VIAF’s approach, designed more than twenty years ago: a very selective policy for the inclusion of its sources (Contributors and Other Data Providers) and for their participation in governance, which prevents the project from opening worldwide to non-national libraries and cultural institutions; an obvious neutrality toward data coming from its Contributors, even when data are not compliant with the identification requirements of the semantic web; troubles in correct clustering of IDs (duplicate clusters to be merged and conflated clusters to be split), and a one-way flow of data due to its top-down approach that prevents a quick and cooperative
workflow to identify and fix errors; the ability to identify only a narrow range of entities (i.e., mainly bibliographic entities, but not even all those defined in IFLA LRM).

On the other side, the semantic web has offered important new tools and opportunities to libraries, archives, museums, and other cultural institutions, and their data are recognized as a relevant asset for building the backbone of the semantic web with regard to the control of entities of bibliographic and cultural interest. After eight years of existence, Wikidata is also playing a relevant role in the publication, aggregation, and control of bibliographic and non-bibliographic information in the semantic web. It is increasingly indicated as a hub for identifiers in the semantic web.83

Wikidata depends on VIAF for a large part of the identification work on its items, and VIAF’s preeminent role in Wikidata is acknowledged by its primary position in the identifiers section of the data of each item. For this reason, the Wikidata community constantly monitors the consistency of VIAF clusters and continuously updates lists of errors present in them. On the other hand, if VIAF is undoubtedly very useful to the Wikidata community, Wikidata can support the consistency of VIAF clusters. The Wikidata informational ecosystem is much larger and wider and can be built by any interested institution or person, and its identification function can also count on the authority work of national and non-national libraries excluded from the VIAF environment, and on authoritative non-bibliographical reference sources.

This study opens some research perspectives. Analysis was limited to data about personal entities, as this kind of entity was the only one directly comparable; further research is needed to extend the analysis to other kinds of entities.
Moreover, more research should be devoted to the treatment of special categories of persons and their names, such as mythological and legendary characters, ancient Greek and Latin authors, kings, queens, popes, saints, and so on, as the VIAF Guidelines84 themselves list the clustering of such names among VIAF’s typical problems (and they often get five or more VIAF IDs in Wikidata). A further line of research should consider the relevance of the clustering of encyclopedias and other reference sources in the identification process within Wikidata. Lastly, isolated clusters need more consideration; in this study they were used as a clue of relatively recent uploads in VIAF, but LC and DNB show a high rate of isolated clusters too (maybe due to the richness of their collections and metadata). More research on isolated clusters could help to describe with more precision the possible role of non-national libraries and institutions and of their locally rich collections in identifying lesser-known agents (not just persons) in a worldwide perspective.

From the analyzed data and direct comparison, it can be concluded that VIAF and Wikidata can be constantly improved through reciprocal comparison, which allows discovery of errors in both. VIAF and Wikidata are two relevant tools for authority control in the semantic web, and they each have a specific role to play and different stakeholders. Unfortunately, unlike the relationship between VIAF and ISNI, at present no aspect of VIAF-Wikidata interoperability is discussed between the managing structures of the two systems, on either a regular or an irregular basis.
While Wikidata appears to be more reliable with regard to the identification process, its most significant weakness lies in its unorganized and unplanned crowdsourced data acquisition, even if it relies at present on about 11,500 active editors.85 Furthermore, the Wikidata community still lacks the constant support and cooperation of institutional data curators such as librarians, archivists, and museum curators. Many current projects are mainly dedicated to explaining to potential institutional stakeholders the importance and usefulness of Wikidata for their institutional missions, but there are still too few projects devoted to massive synchronization of data from institutional silos to Wikidata. But, as soon as these initiatives reach a critical mass, Wikidata will become the real global hub of the web of data.

ACKNOWLEDGEMENTS

All the authors have cooperated in the drafting and revision of the article. Nevertheless, each author has mainly authored specific sections and subsections of the article:
• Stefano Bargioni: Data Analysis; VIAF; Wikidata; VIAF and Wikidata: A Data Comparison.
• Carlo Bianchini: Introduction; Discussion; Organizational Model; Identification Function; Data Quantity; Data Quality; Data Maintenance and Usability.
• Camillo Carlo Pellizzari di San Girolamo: Relationship between VIAF and Libraries; Relationship between Wikidata and Academic, Research, and Public Libraries; Relationship between VIAF and Wikidata; Wikidata Controls on VIAF; Materials and Methods; Conclusion.
All authors contributed to A Comparison Table.

The authors wish to thank the anonymous reviewer whose suggestions helped to improve and enrich the paper, and the editor for his helpful edits.
ENDNOTES

1 Thomas Baker et al., Library Linked Data Incubator Group Final Report, sec. 2 (W3C Incubator Group, October 25, 2011), http://www.w3.org/2005/Incubator/lld/XGR-lld-20111025/.
2 Baker et al., Library Linked Data.
3 Dorothy Anderson, Universal Bibliographic Control. A Long Term Policy - A Plan for Action (München: Verlag Dokumentation, 1974), 11.
4 Anila Angjeli, Andrew Mac Ewan, and Vincent Boulet, “ISNI and VIAF: Transforming Ways of Trustfully Consolidating Identities,” in IFLA WLIC 2014 (IFLA 2014 Lyon, IFLA, 2014), 2, http://library.ifla.org/985/1/086-angjeli-en.pdf.
5 Rick Bennett et al., “VIAF (Virtual International Authority File): Linking the Deutsche Nationalbibliothek and Library of Congress Name Authority Files,” International Cataloguing and Bibliographic Control 36, no. 1 (2007): 12–18; Barbara B. Tillett, The Bibliographic Universe and the New IFLA Cataloging Principles: Lectio Magistralis in Library Science = L’universo bibliografico e i nuovi principi di catalogazione dell’IFLA: Lectio Magistralis di biblioteconomia (Fiesole (Firenze): Casalini libri, 2008), 14–15, http://digital.casalini.it/9788885297814; “VIAF. Connect Authority Data across Cultures and Languages to Facilitate Research,” OCLC, 2020, https://www.oclc.org/en/viaf.html.
6 Gildas Illien and Françoise Bourdon, “A la recherche du temps perdu, retour vers le futur: CBU 2.0” (paper, IFLA WLIC 2014, Lyon, France, 2014), 13–14, http://library.ifla.org/956/.
7 Illien and Bourdon, “A la recherche,” 15.
8 Gordon Dunsire and Mirna Willer, “The Local in the Global: Universal Bibliographic Control from the Bottom Up” (paper, IFLA WLIC 2014, Lyon, France, 2014), 11, http://library.ifla.org/817/.
9 Luca Martinelli, “Wikidata: La Soluzione Wikimediana Ai Linked Open Data,” AIB Studi 56, no. 1 (March 2016): 75–85, https://doi.org/10.2426/aibstudi-11434; Jesús Tramullas, “Objetos culturales y metadatos: hacia la liberación de datos en Wikidata,” Anuario ThinkEPI 11 (2017): 319–21, https://doi.org/10/ghbj63; Xavier Agenjo-Bullón and Francisca Hernández-Carrascal, “Wikipedia, Wikidata y Mix’n’match,” Anuario ThinkEPI 14 (2020), https://doi.org/10/ghbj6t; Claudio Forziati and Valeria Lo Castro, “The Connection between Library Data and Community Participation: The Project SHARE Catalogue-Wikidata,” JLIS.it 9, no. 3 (2018): 109–20, https://doi.org/10/ggxj9n; Adrian Pohl, “Was Ist Wikidata Und Wie Kann Es Die Bibliothekarische Arbeit Unterstützen?,” ABI Technik 38, no. 2 (2018): 208, https://doi.org/10/ghbj6w; ARL White Paper on Wikidata: Opportunities and Recommendations (The Association of Research Libraries, 2019), https://www.arl.org/wp-content/uploads/2019/04/2019.04.18-ARL-white-paper-on-Wikidata.pdf; Regine Heberlein, “On the Flipside: Wikidata for Cultural Heritage Metadata through the Example of Numismatic Description” (paper, IFLA WLIC 2019, Libraries: Dialogue for Change, session 206: Art Libraries with Subject Analysis and Access, Athens, Greece, August 28, 2019), http://library.ifla.org/2492/1/206-heberlein-en.pdf.
10 ARL White Paper on Wikidata, 27–30; Theo van Veen, “Wikidata: From ‘an’ Identifier to ‘the’ Identifier,” Information Technology and Libraries 38, no. 2 (2019): 72–81, https://doi.org/10/ghbj62; Hilary Thorsen, “LD4P: Linked Data for Production: Wikidata as a Hub for Identifiers” (slideshow presentation, June 11, 2020), https://docs.google.com/presentation/d/1jWz3_nCf5rdd-7ejETGlfv99UV2PnD1v/edit?usp=embed_facebook.
11 Tillett, The Bibliographic Universe, 15.
12 Open Data Commons Attribution License (ODC-By) v1.0 (as stated in http://viaf.org/viaf/data/).
13 “VIAF Admission Criteria,” OCLC, 2020, https://www.oclc.org/content/dam/oclc/viaf/VIAF%20Admission%20Criteria.pdf.
14 The description of Wikidata source in http://viaf.org/viaf/partnerpages/WKP.html seems to refer to Wikipedia before the existence of Wikidata. The same acronym WKP reflects this anachronism, whereas ISNI correctly uses WKD. Anyway, this description, as well as many others, requires an update.
15 Stacy Allison-Cassin and Dan Scott, “Wikidata: A Platform for Your Library’s Linked Open Data,” Code4Lib Journal 40 (May 4, 2018), https://journal.code4lib.org/articles/13424.
16 Carlo Bianchini and Pasquale Spinelli, “Wikidata at Fondazione Levi (Venice, Italy): A Case Study for the Publication of Data about Fondo Gambara, a Collection of 202 Musicians’ Portraits,” JLIS.it 11, no. 3 (September 15, 2020): 24.
17 IFLA Working Group on Functional Requirements and Numbering of Authority Records (FRANAR), Functional Requirements for Authority Data: A Conceptual Model (München: K. G. Saur, 2009), 46, https://www.ifla.org/files/assets/cataloguing/frad/frad_2013.pdf. For qualifiers, see https://www.wikidata.org/wiki/Help:Qualifiers; for references see https://www.wikidata.org/wiki/Help:Sources.
18 Partial lists are linked from https://wikibase-registry.wmflabs.org/wiki/Main_Page.
19 See https://www.transition-bibliographique.fr/fne/french-national-entities-file/; the Proof of Concept is available at https://github.com/abes-esr/poc-fne.
20 Jean Godby et al., Creating Library Linked Data with Wikibase: Lessons Learned from Project Passage (Dublin OH: OCLC Research, 2019): 8, https://doi.org/10.25333/faq3-ax08.
21 IFLA, “Opportunities for Academic and Research Libraries and Wikipedia” (discussion paper, 2016), 10, https://www.ifla.org/files/assets/hq/topics/info-society/iflawikipediaopportunitiesforacademicandresearchlibraries.pdf.
22 John Riemer, “The Program for Cooperative Cataloging & a Wikidata Pilot” (slideshow presentation, June 16, 2020), slide 5, https://docs.google.com/presentation/d/1NpkAQdGGft1Wi2vX0zgMtIxwXWjPq96NtXx4MmyXFFI/edit#slide=id.p.
23 Godby et al., “Creating Library Linked Data,” 8.
24 Maximilian Klein and Alex Kyrios, “VIAFbot and the Integration of Library Data on Wikipedia,” Code4Lib Journal 22 (October 14, 2013), https://journal.code4lib.org/articles/8964.
25 IFLA Cataloguing Section and IFLA Meeting of Experts on an International Cataloguing Code, Statement of International Cataloguing Principles (ICP) (Den Haag: IFLA, 2016), para. 5.3.
26 https://www.wikidata.org/wiki/MediaWiki:Wikibase-SortedProperties#IDs_with_datatype_%22external-id%22; ISNI (P213, https://www.wikidata.org/wiki/Property:P213) is presently sorted after VIAF instead of in the ISO section because it is considered primarily as a VIAF source.
27 Epìdosis, Viaf e Wikidata.mpg, 2020, https://commons.wikimedia.org/wiki/File:VIAF_e_Wikidata.mpg; a list of gadgets is available at https://www.wikidata.org/wiki/Wikidata:VIAF/cluster#Gadgets.
28 The main error-report page is https://www.wikidata.org/wiki/Wikidata:VIAF/cluster/conflating_entities; its subpage https://www.wikidata.org/wiki/Wikidata:VIAF/cluster/conflating_specific_entries is designed for collecting “easy” cases of conflation, when only a few members of a cluster should be moved elsewhere, while the cluster is substantially sane.
29 Moreno Hayley, email to author, March 23, 2020. Asked whether data about abandoned clusters would be maintained, VIAF answered, “We recognize that the data in the file was not usable. VIAF is in a period of transition and it was decided that we could not at this time fix the file so it has been removed from the list of available downloads.”
30 The statement read: “The persist-rdf.xml file has been removed and will no longer be available,” accessed October 23, 2020.
31 Angjeli, Mac Ewan, and Boulet, “ISNI and VIAF,” 3.
32 https://dumps.wikimedia.org/wikidatawiki/; instructions and a list of kinds of data dumps are available at https://www.wikidata.org/wiki/Wikidata:Database_download.
33 A general explanation of ranks is available at https://www.wikidata.org/wiki/Help:Ranking. In summary: values of statements can be ranked in three ways, “preferred,” “normal” (default), and “deprecated”; the expression “values with non-deprecated rank” includes all values with preferred rank or normal rank; the expression “values with best rank” includes only values with preferred rank or normal rank, under this condition: if the same statement has two or more values and at least one of them has preferred rank, values with normal rank are not counted; if there are no values with preferred rank, all values with normal rank are counted.
34 VIAF and Wikidata dumps, together with the scripts, were published on Zenodo at https://doi.org/10.5281/zenodo.4457114.
35 The queries can be performed using the following links: VIAF members: https://w.wiki/i5J; authority controls related to libraries but not being VIAF members: https://w.wiki/i5K; biographical dictionaries: https://w.wiki/i5N.
36 The query can be performed using the following link: https://w.wiki/i5p.
37 It could be because they are probably more difficult to cluster, but in some cases also because they represent infrequently described entities.
38 As suggested by the reviewer, more removals than additions may be a clue of a cleanup project.
39 Pat Riva, Patrick Le Boeuf, and Maja Zumer, IFLA Library Reference Model, draft (Den Haag: IFLA, 2017), https://www.ifla.org/files/assets/cataloguing/frbr-lrm/ifla_lrm_2017-03.pdf; Nick Crofts et al., “Definition of the CIDOC Conceptual Reference Model,” version 5.0.4, ICOM/CIDOC CRM Special Interest Group, 2011, http://www.cidoc-crm.org/html/5.0.4/cidoc-crm.html; Chryssoula Bekiari et al., eds., FRBR Object-Oriented Definition and Mapping from FRBRER, FRAD and FRSAD, version 2.0 (International Working Group on FRBR and CIDOC CRM Harmonisation, 2013), http://old.cidoc-crm.org/docs/frbr_oo/frbr_docs/FRBRoo_V2.0_draft_2013May.pdf; Lydia Pintscher, Lea Lacroix, and Mattia Capozzi, “What’s New on the Wikidata Features This Year,” YouTube video, October 26, 2020, truocolo, https://www.youtube.com/watch?v=EbXdZK54GrU.
40 Denny Vrandečić and Markus Krötzsch, “Wikidata: A Free Collaborative Knowledgebase,” Communications of the ACM 57, no. 10 (September 23, 2014): 80, https://doi.org/10/gftnsk.
41 For a general statistic see http://wikidata.wikiscan.org/users; for a statistic about the VIAF property see https://bambots.brucemyers.com/NavelGazer.php?property=P214; changing the id of the property at the end of the URL allows exploring other property statistics.
42 Shiyali Ramamrita Ranganathan, Reference Service, 2nd ed., Ranganathan Series in Library Science 8 (Bombay: Asia Publishing House, 1961), 74.
43 IFLA Cataloguing Section and IFLA Meeting of Experts on an International Cataloguing Code, Statement of International Cataloguing Principles (ICP), 5, https://www.ifla.org/publications/node/11015.
44 Wikidata does have a guideline for a preferred label, and its choice is based on users’ convenience (https://www.wikidata.org/wiki/Help:Label, par. 1.2) as required by International Cataloguing Principles (2016). As to the choice of the Wikidata label in a specific language, VIAF does not show any clear principle, while the authors believe that it would be preferable to use the English (“en”) label, whenever available. See IFLA Cataloguing Section and IFLA Meeting of Experts on an International Cataloguing Code, Statement of International Cataloguing Principles (ICP).
45 For example, in September it was done for NKC using OpenRefine (sample edit: https://www.wikidata.org/w/index.php?title=Q520487&diff=1269046867&oldid=1266870464).
46 Angjeli, Mac Ewan, and Boulet, “ISNI and VIAF,” 9.
47 Simon Cobb (https://www.wikidata.org/wiki/User:Sic19) became Wikidata Visiting Scholar in 2017 (https://en.wikipedia.org/wiki/User:Jason.nlw/Wikidata_Visiting_Scholar).
48 Federico Leva and Marco Chemello, “The Effectiveness of a Wikimedian in Permanent Residence: The BEIC Case Study,” JLIS.it 9, no. 3 (September 2018): 141–47, https://doi.org/10.4403/jlis.it-12481.
49 Angjeli, Mac Ewan, and Boulet, “ISNI and VIAF,” 11.
50 Andrew Mac Ewan, “ISNI, VIAF and NACO and Their Relationship to ORCID” (discussion paper for PCC Policy Committee, November 4, 2013), 2, http://www.loc.gov/aba/pcc/documents/ISNI%20PoCo%20discussion%20paper%202013.docx.
51 Tom Adamich, “Library Cataloging Workflows and Library Linked Data: The Paradigm Shift,” Technicalities 39, no. 3 (May/June 2019): 14.
52 OCLC, VIAF Guidelines, rev. July 16, 2019, 2, https://www.oclc.org/content/dam/oclc/viaf/VIAF%20Guidelines.pdf.
53 OCLC, VIAF Guidelines, 5. “When VIAF is unable to algorithmically match some of the source authority records with each other, they can be manually pulled together into a single cluster using an internal table.”
54 Angjeli, Mac Ewan, and Boulet, “ISNI and VIAF,” 16.
55 Stefan Heindorf et al., “Vandalism Detection in Wikidata,” in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM ’16 (New York, NY: Association for Computing Machinery, 2016), 327–36, https://doi.org/10/gg2nmm; Amir Sarabadani, Aaron Halfaker, and Dario Taraborelli, “Building Automated Vandalism Detection Tools for Wikidata,” in Proceedings of the 26th International Conference on World Wide Web Companion, WWW ’17 Companion (Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee, 2017), 1647–54, https://doi.org/10/ghhtzf.
56 See table 1, col. 1 vs col. 9; it should be noted that col. 9 considers only non-VIAF sources and biographical dictionaries, but Wikidata also links to encyclopedias and other online databases.
57 For example, people not having VIAF id but having ICCU id (https://tinyurl.com/y6hbtjuo); instructions about the internal search are available at https://www.mediawiki.org/wiki/Help:Extension:WikibaseCirrusSearch.
58 https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations.
59 Angjeli, Mac Ewan, and Boulet, “ISNI and VIAF,” 16.
60 https://www.mediawiki.org/wiki/Wikibase/DataModel.
61 “The label is the most common name that the item would be known by” (https://www.wikidata.org/wiki/Help:Label). See also IFLA Cataloguing Section and IFLA Meeting of Experts on an International Cataloguing Code, Statement of International Cataloguing Principles (ICP), 5, https://www.ifla.org/publications/node/11015.
62 Bots exist to create more and more variant forms based on matching properties, such as date of birth (P569) and date of death (P570), and to import variant forms of names from national authority files. See, for example, https://www.wikidata.org/w/index.php?title=Q5669&diff=611600491&oldid=608231160.
63 https://www.wikidata.org/wiki/Help:Data_type.
64 https://www.wikidata.org/wiki/Wikidata:Property_proposal.
65 Jenny A. Toves and Thomas B. Hickey, “Parsing and Matching Dates in VIAF,” Code4Lib Journal 26 (October 21, 2014), https://journal.code4lib.org/articles/9607; Stefano Bargioni, “From Authority Enrichment to AuthorityBox: Applying RDA in a Koha Environment,” JLIS.it 11, no. 1 (2020): 175–89, https://doi.org/10/gg66rq.
66 https://www.wikidata.org/wiki/Help:Dates.
67 See Heindorf et al., “Vandalism Detection in Wikidata.”
68 See Mac Ewan, “ISNI, VIAF and NACO.”
69 See https://www.wikidata.org/wiki/Help:Merge, https://www.wikidata.org/wiki/Help:Split_an_item, and https://www.wikidata.org/wiki/Help:Conflation_of_two_people.
70 Complete list at https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations (e.g., https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P214).
71 https://admin.toolforge.org/; see also Xavier Agenjo-Bullón and Francisca Hernández-Carrascal, “Registros de autoridades, enriquecimiento semántico y Wikidata,” Anuario ThinkEPI 12 (2018): 361–72, https://doi.org/10/ghbj6z.
72 https://www.wikidata.org/wiki/Wikidata:Property_proposal.
73 https://www.oclc.org/en/viaf.html.
74 https://www.wikidata.org/wiki/Wikidata:Introduction.
75 https://platform.worldcat.org/api-explorer/apis/VIAF.
76 https://www.wikidata.org/wiki/Special:EntityData; see also https://www.wikidata.org/wiki/Wikidata:Database_download.
77 https://www.wikidata.org/wiki/Special:Search.
78 https://www.wikidata.org/w/api.php.
79 https://query.wikidata.org/.
80 https://dumps.wikimedia.org/wikidatawiki/.
81 https://wdumps.toolforge.org/.
82 https://www.oclc.org/developer/develop/web-services/viaf/authority-source.en.html.
83 van Veen, “Wikidata.”
84 See “Typical problems” in VIAF Guidelines: https://www.oclc.org/content/dam/oclc/viaf/VIAF%20Guidelines.pdf.
85 Pintscher, Lacroix, and Capozzi, “What’s New.”
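The rank rules summarized in endnote 33 ("values with non-deprecated rank" versus "values with best rank") can be made concrete with a short sketch. The Python below is illustrative only: the (value, rank) tuple layout, the function names, and the sample identifiers are assumptions made for this example. They are not the Wikidata JSON data model and not the authors' scripts, which are published on Zenodo (endnote 34).

```python
from typing import List, Tuple

# A statement value paired with its rank. The three rank names come from
# endnote 33; the tuple layout itself is a simplification assumed here.
RankedValue = Tuple[str, str]


def non_deprecated_values(values: List[RankedValue]) -> List[str]:
    """Return every value whose rank is preferred or normal."""
    return [value for value, rank in values if rank in ("preferred", "normal")]


def best_rank_values(values: List[RankedValue]) -> List[str]:
    """Return preferred values if any exist; otherwise all normal values.

    Deprecated values are never returned.
    """
    preferred = [value for value, rank in values if rank == "preferred"]
    if preferred:
        return preferred
    return [value for value, rank in values if rank == "normal"]


if __name__ == "__main__":
    # Hypothetical identifiers for a single item, not real VIAF IDs.
    sample = [
        ("id-A", "normal"),
        ("id-B", "preferred"),
        ("id-C", "deprecated"),
    ]
    print(non_deprecated_values(sample))
    print(best_rank_values(sample))
```

Under these assumptions, an item carrying one preferred and one normal VIAF ID would count twice in a "non-deprecated" tally but only once in a "best rank" tally, which is why the two expressions can yield different totals for the same property.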