Data donation to Wikidata, part 2: country/subject dossiers of the 20th Century Press Archives

The world's largest public newspaper clippings archive comprises a wealth of material of great interest, particularly for authors and readers in the Wikiverse. ZBW has digitized the material from the first half of the last century and has put all available metadata under a CC0 license. Moreover, we are donating that data to Wikidata, by adding or enhancing items and by providing easy access from there to the dossiers (called "folders") and clippings.

Challenges of modelling a complex faceted classification in Wikidata

This had already been done for the persons' archive in 2019 - see our prior blog post. For persons, we could simply link from existing (or a few newly created) person items to the biographical folders of the archive. The countries/subjects archives posed a different challenge: their folders are organized by country (or continent, city in a few cases, or other geopolitical category), and within each country by an extended subject category system (also available as SKOS). Put differently, each folder is defined by a geographical and a subject facet - a method widely used in general-purpose press archives, because it allowed a comprehensible and, supported by a signature system, unambiguous sequential shelf order, indispensable for quick access to the printed material.

Folders specifically about one significant topic (like the Treaty of Sèvres) are rare in the press archives, whereas country/subject combinations are rare among Wikidata items - so direct linking between existing items and PM20 folders was hardly achievable. The folders themselves had to be represented as Wikidata items, just like other sources used there. Here, however, we did not have works or scientific articles, but thematic mini-collections of press clippings, often not notable in themselves and normally without further formal bibliographic data. So a class PM20 country/subject folder was created (as a subclass of dossier, a collection of documents). Aiming at an item for each folder - linked via PM20 folder ID (P4293) to the actual press archive folder - was yet only part of the solution.

In order to represent the faceted structure of the archive, we needed anchor points for both facets. That was easy for the geographical categories: the vast majority already existed as items in Wikidata, and a few historical ones, such as Russian peripheral countries, had to be created. For the subject categories, the situation was quite different. Categories such as "The country and its people, politics and economy, general" or "Postal services, telegraphy and telephony" were constructed as baskets for collecting articles on certain broader topics. They have no equivalent in Wikidata, which tries to describe real-world entities or clear-cut concepts. We therefore decided to represent the categories of the subject category system with their own items of type PM20 subject category. Each of the roughly 1,400 categories is connected to its parent category via a "part of" (P361) statement, thus forming a five-level hierarchy.

More implementation subtleties

For both facets, corresponding Wikidata properties were created: "PM20 geo code" (P8483) and "PM20 subject code" (P8484). As external identifiers, they link directly to lists of subjects (e.g., for Japan) or of geographical entities (e.g., for "The country ..., general").
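As a small illustration of this structure, here is a sketch of a SPARQL query, run from Python against the public Wikidata Query Service, that lists PM20 subject category items with their parent categories. The property IDs are those mentioned above; everything else (limits, language preferences, user agent) is illustrative.

```python
# Sketch: list PM20 subject category items and their parent categories
# via the public Wikidata Query Service (WDQS).
import requests

WDQS = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?category ?categoryLabel ?code ?parentLabel WHERE {
  ?category wdt:P8484 ?code .                # PM20 subject code (external identifier)
  OPTIONAL { ?category wdt:P361 ?parent . }  # "part of" link to the upper category
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de". }
}
ORDER BY ?code
LIMIT 50
"""

response = requests.get(WDQS, params={"query": QUERY, "format": "json"},
                        headers={"User-Agent": "pm20-hierarchy-example/0.1"})
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    parent = row.get("parentLabel", {}).get("value", "-")
    print(row["code"]["value"], row["categoryLabel"]["value"], "->", parent)
```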
For all countries where the press archive material has been processed - this includes the tedious task of clarifying the intellectual property rights status of each article - the Wikidata item for the country now includes a link to a list of all press archive dossiers about this country, covering the first half of the 20th century. The folders represented in Wikidata (e.g., Japan : The country ..., general) use "facet of" (P1269) and "main subject" (P921) properties to connect to the items for the country and the subject category. Thus, each of the 9,200 accessible folders of the PM20 country/subject archive can be reached via Wikidata. Moreover, since the structural metadata of PM20 is available too, it can be queried along its various dimensions - see for example the list of top-level subject categories with the number of folders and documents, or a list of folders per country, ordered by signature (with subtleties covered by a "series ordinal" (P1545) qualifier). The interactive map of subject folders, as shown above, is also created by a SPARQL query and gives a first impression of the geographical areas covered in depth - or still only sparsely - in the online archive.

Core areas: worldwide economy, worldwide colonialism

The online data reveals core areas of attention during 40 years of press clippings collection until 1949. Economy, of course, was the focus of the former HWWA (Hamburg Archive for the International Economy) - for Germany and particularly Hamburg, as well as for every other country. More than half of all subject categories belong to the Economy section (signature n) of the category system and provide, in 4,500 folders, very detailed access to the field. About 100,000 of the almost 270,000 online documents of the archive are part of this section, followed by history and general politics, foreign policy, and public finance, down to more peripheral topics like settlement and migration, minorities, justice or literature. Owing to the history of the institution (which was founded as "Zentralstelle des Hamburgischen Kolonialinstituts", the central office of the Hamburg colonial institute), colonial activities all over the world were monitored closely. We gave priority to publishing the material about the former German colonies, listed in the Archivführer Deutsche Kolonialgeschichte (Archive Guide to German Colonial History, also interconnected with Wikidata). Originally collected to support the aggressive and inhuman policy of the German Empire, this material is now available as research material for critical analysis in the emerging field of colonial and postcolonial studies.

Enabling future community efforts

While all material about the German colonies (and some about the Italian ones) is online and now accessible via Wikidata, this is not true for the former British, French, Dutch and Belgian colonies. While Japan and Argentina are accessible completely, China, India and the US are missing, as are most of the European countries. And while 800+ folders about Hamburg cover its contemporary history quite well, the vast majority of the material about Germany as a whole is only accessible "on premises" within ZBW's locations. It is, however, available as digital images and can be accessed through finding aids (in German), which in the reading rooms link directly to a document viewer. The metadata for this material is now open data and can be changed and enhanced in Wikidata. A very selective example of how that could work concerns a topic in German-Danish history: the 1920 Schleswig plebiscites.
The PM20 folder about these events was not part of the published material, but it attracted some interest on the occasion of last year's centenary. The PM20 metadata on Wikidata made it possible to create a corresponding folder entirely in Wikidata, Nordslesvig : Historical events, with a (provisional) link to a stretch of images on a digitized film. While checking and activating these images for the public was a one-time effort in the context of an open science event, the creation of a new PM20 folder on Wikidata may demonstrate how open metadata can be used by a dedicated community of knowledge to enable access to not-yet-open knowledge. Current intellectual property law in the EU forbids open access to digitized clippings from newspapers published in 1960 until 2031, and to all clippings where the death date of a named author is not known until after 2100. Of course, we hope for a change in that obstructive legislation in the not-so-distant future. We are confident that the metadata about the material, now in Wikidata, will help bridge the gap until it finally becomes possible to use all digitized press archive contents as open scientific and educational resources, within and outside of the Wikimedia projects. More information can be found at WikiProject 20th Century Press Archives, which also links to the code used for this data donation.

Pressemappe 20. Jahrhundert   Wikidata

Building the SWIB20 participants map

Here we describe the process of building the interactive SWIB20 participants map, created by a query to Wikidata. The map was intended to help participants of SWIB20 make contacts in the virtual conference space. However, in compliance with the GDPR, we wanted to avoid publishing personal details, so we chose to publish a map of the institutions to which the participants are affiliated. (Obvious downside: the 9 unaffiliated participants could not be represented on the map.) We suppose that the method can be applied to other conferences and other use cases - e.g., the downloaders of scientific software or the institutions subscribed to an academic journal - and therefore describe the process in some detail.

1. We started with a list of institution names (with country code and city, but without person IDs), extracted and transformed from our ConfTool registration system, and saved it in CSV format. Country names were normalized; cities were not (and were only used for context information).

2. We created an OpenRefine project and reconciled the institution name column with Wikidata items of type Q43229 (organization, and all its subtypes). We included the country column (-> P17, country) as a relevant other detail and let OpenRefine "Auto-match candidates with high confidence". Of our original set of 335 country/institution entries, 193 were automatically matched via the Wikidata reconciliation service. At the end of the conference, 400 institutions had been identified and put on the map (data set).
3. We went through all un-matched entries and either a) selected one of the suggested items, or b) looked up and tweaked the name string in Wikidata or in Google until we found a matching Wikipedia page, opened the linked Wikidata object from there, and inserted the QID in OpenRefine, or c) created a new Wikidata item (if the institution seemed notable), or d) attached "not yet determined" (Q59496158) where no Wikidata item exists (yet), or e) attached "undefined value" (Q7883029) where no institution had been given.

4. The results were exported from OpenRefine into a .tsv file (settings).

5. Again via a script, we loaded the ConfTool participants data, built a lookup table from all available OpenRefine results (country/name string -> WD item QID), aggregated participant counts per QID, and loaded that data into a custom SPARQL endpoint, which is accessible from the Wikidata Query Service. As in step 1, a .csv file was produced for all (new) institution name strings which were not yet mapped to Wikidata. (An additional remark: if no approved custom SPARQL endpoint is available, it is feasible to generate a static query with all data in its "values" clause.)

6. During the preparation of the conference, more and more participants registered, which required multiple loops: use the csv file of step 5 and re-iterate, starting at step 2. (Since I found no straightforward way to update an existing OpenRefine project with extended data, I created a new project with new input and output files for every iteration.)

7. Finally, to display the map, we could run a federated query on WDQS. It fetches the institution items from the custom endpoint and enriches them from Wikidata with the name, logo and image of the institution (if present), as well as with geographic coordinates, obtained directly or indirectly as follows: a) the item has a "coordinate location" (P625) itself, or b) the item has a "headquarters location" item with coordinates (P159/P625), or c) the item has a "located in administrative entity" item with coordinates (P131/P625), or d) the item has a "country" item with coordinates (P17/P625). A sketch of this fallback is shown after this section. Applying this method, only one institution item could not be located on the map.

Data improvements

The way to improve the map was to improve the data about the items in Wikidata - which also helps all future Wikidata users.

New items

For a few institutions, new items were created:
- Burundi Association of Librarians, Archivists and Documentalists
- FAO representation in Kenya
- Aurora Information Technology
- Istituto di Informatica Giuridica e Sistemi Giudiziari

For another 14 institutions, mostly private companies, no items were created due to notability concerns. Everything else already had an item in Wikidata!

Improvement of existing items

In order to improve the display on the map, we enhanced selected items in Wikidata in various ways:
- Add English label
- Add type (instance of)
- Add headquarters location
- Add image and/or logo

And we hope that participants of the conference also took the opportunity to make their institution "look better", for example by adding an image of it to the Wikidata knowledge base. Putting Wikidata to use for a completely custom purpose thus created incentives for improving "the sum of all human knowledge", step by tiny step.
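To illustrate the coordinate lookup described in step 7, here is a sketch of the fallback chain as a plain WDQS query run from Python. The item list in the VALUES clause is a placeholder; the actual federated query additionally pulls the institution items and participant counts from the custom endpoint.

```python
# Sketch: resolve coordinates for institution items with the fallback chain
# P625 (own coordinates) -> P159/P625 (headquarters) -> P131/P625 (admin entity) -> P17/P625 (country).
import requests

WDQS = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?inst ?instLabel ?coord WHERE {
  VALUES ?inst { wd:Q42 wd:Q64 }   # placeholder QIDs; use the reconciled institution items here
  OPTIONAL { ?inst wdt:P625 ?own . }
  OPTIONAL { ?inst wdt:P159/wdt:P625 ?hq . }
  OPTIONAL { ?inst wdt:P131/wdt:P625 ?admin . }
  OPTIONAL { ?inst wdt:P17/wdt:P625 ?country . }
  BIND(COALESCE(?own, ?hq, ?admin, ?country) AS ?coord)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

rows = requests.get(WDQS, params={"query": QUERY, "format": "json"},
                    headers={"User-Agent": "swib20-map-example/0.1"}).json()
for binding in rows["results"]["bindings"]:
    print(binding["instLabel"]["value"],
          binding.get("coord", {}).get("value", "no coordinates found"))
```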
Wikidata for Authorities   Linked data

Journal Map: developing an open environment for accessing and analyzing performance indicators from journals in economics

by Franz Osorio, Timo Borst

Introduction

Bibliometrics, scientometrics, informetrics and webometrics have been both research topics and practical guidelines for publishing, reading, citing, measuring and acquiring published research for quite a while (Hood 2001). Citation databases and measures were introduced in the 1960s, becoming benchmarks both for the publishing industry and for academic libraries managing their holdings and journal acquisitions, which tend to become more selective given a growing number of journals on the one side and budget cuts on the other. Due to the Open Access movement triggering a transformation of traditional publishing models (Schimmer 2015), and in the light of global and distributed information infrastructures for publishing and communicating on the web that have yielded more diverse practices and communities, this situation has changed dramatically: while bibliometrics of research output in its core understanding is still highly relevant to stakeholders and the scientific community, the visibility, influence and impact of scientific results have shifted to locations on the World Wide Web that are commonly shared and quickly accessible not only by peers, but by the general public (Thelwall 2013). This has several implications for the different stakeholders who refer to metrics when dealing with scientific results:

- With the rise of social networks and platforms, and their use also by academics and research communities, the term 'metrics' itself has gained a broader meaning: while traditional citation indexes only track citations of literature published in (other) journals, 'mentions', 'reads' and 'tweets', albeit less formal, have become indicators and measures of (scientific) impact.
- Altmetrics has influenced research performance, evaluation and measurement, which formerly had been exclusively associated with traditional bibliometrics.
- Scientists are becoming aware of alternative publishing channels and of both the option and the need of 'self-advertising' their output.
- Academic libraries in particular are forced to manage their journal subscriptions and holdings in the light of increasing scientific output on the one hand and stagnating budgets on the other. While editorial products from the publishing industry are exposed to a globally competing market requiring a 'brand' strategy, altmetrics may serve as additional, if scattered, indicators of scientific awareness and value.

Against this background, we took the opportunity to collect, process and display some impact or signal data for literature in economics from different sources, such as 'traditional' citation databases, journal rankings and community platforms or altmetrics indicators:

- CitEc. The long-standing citation service maintained by the RePEc community provided a dump of both working papers (as part of series) and journal articles, the latter with significant information on classic impact measures such as the impact factor (2 and 5 years) and the h-index.
- Rankings of journals in economics, including the Scimago Journal Rank (SJR) and two German journal rankings that are regularly released and updated (VHB Jourqual, Handelsblatt Ranking).
- Usage data from Altmetric.com, collected for those articles that could be identified via their Digital Object Identifier.
- Usage data from the scientific community platform and reference manager Mendeley.com, in particular the number of saves or bookmarks of an individual paper.

Requirements

A major consideration for this project was finding an open environment in which to implement it. Using an open platform served several purposes. As a member of the Leibniz Association, ZBW has a commitment to Open Science, which in part means making use of open technologies to as great an extent as possible (The ZBW - Open Scienc...). This open system should allow direct access to the underlying data, so that users can use it for their own investigations and purposes. Additionally, if possible, users should be able to manipulate the data within the system.

The first instance of the project was created in Tableau, which offers a variety of means to express data and to create interfaces for users to filter and manipulate data. It also provides a way to work with the data and create visualizations without programming skills or knowledge. Tableau is one of the most popular tools for creating and delivering data visualizations, in particular within academic libraries (Murphy 2013). However, the software is proprietary and requires a monthly fee to use and maintain; it also closes off the data and makes only the final visualization available to users. It provided a starting point for how we wanted the data to appear to the user, but it is in no way open.

Challenges

The first technical challenge was to consolidate the data from the different sources, which had varying formats and organization. Broadly speaking, the bibliometric data (CitEc and journal rankings) existed as a spreadsheet with multiple pages, while the altmetrics and Mendeley data came from database dumps with multiple tables, delivered as several CSV files. In addition to these different formats, the data needed to be cleaned and gaps filled in. The sources also had very different scopes: the altmetrics and Mendeley data covered only 30 journals, while the bibliometric data covered more than 1,000 journals.

Transitioning from Tableau to an open platform was a big challenge. While there are many ways to create data visualizations and present them to users, the decision was made to use R to work with the data and Shiny to present it. R is widely used to work with data and to present it (Kläre 2017), and the language has a lot of support for these kinds of tasks across many libraries. The primary libraries used were Plotly and Shiny. Plotly is a popular library for creating interactive visualizations; without too much work it can provide features such as information popups while hovering over a chart and on-the-fly filtering. Shiny provides a framework for creating a web application that presents the data without requiring a lot of work to create HTML and CSS. The transition required time spent getting to know R and its libraries in order to learn how to create the kinds of charts and filters that would be useful for users. While Shiny alleviates the need to write HTML and CSS, it does have a specific set of requirements and structures in order to function.

The final challenge was making this project accessible to users in such a way that they would be able to see what we had done, have access to the data, and have an environment in which they could explore the data without needing anything other than what we were providing. In order to achieve this, we used Binder as the platform.
At its most basic, Binder makes it possible to share a Jupyter Notebook stored in a GitHub repository via a URL, by running the notebook remotely and providing access through the browser, with no requirements placed on the user. Additionally, Binder is able to run a web application using R and Shiny. To move from a locally running instance of R Shiny to one that can run in Binder, instructions for the runtime environment need to be created and added to the repository. These include information on which version of the language to use, which packages and libraries to install, and any additional requirements there might be to run everything.

Solutions

Given the disparate sources and formats of the data, some work was needed to prepare it for visualization. The largest dataset, the bibliometric data, had several identifiers for each journal but no journal names. Having the journal names is important because, in general, the names are how users will know the journals. Adding the names to the data allows users to filter on specific journals or to pull up two journals for a comparison. Providing the names of the journals also benefits anyone who may repurpose the data and saves them from having to look the names up. In order to fill this gap, we used metadata available through Research Papers in Economics (RePEc). RePEc is an organization that seeks to "enhance the dissemination of research in Economics and related sciences"; it contains metadata for more than 3 million papers, available in different formats. The bibliometric data contained RePEc handles, which we used to look up the journal information as XML and then parse the XML to find the title of the journal. After writing a small Python script to go through the RePEc data and find the missing names, only 6 journals' names were still missing.

For the data that originated in a MySQL database, the major work was correcting the formatting. The data was provided as CSV files, but it was not formatted such that it could be used right away: some of the fields contained double quotation marks, and when the CSV files were created those quotes were wrapped in further quotation marks, resulting in doubled quotation marks that made machine parsing difficult without intervention directly on the files. The work was to go through the files and remove the doubled quotation marks. In addition, it was useful for some visualizations to provide a condensed version of the data. The data from the database was at the article level, which is useful for some things but can be time-consuming for other operations. For example, the altmetrics data covered only 30 journals but had almost 14,000 rows. We could use the Python library pandas to go through all those rows and condense the data so that there are only 30 rows, with the value of each column being the sum over all of a journal's rows (a small pandas sketch follows below). In this way, there is a dataset that can be used to easily and quickly generate summaries at the journal level.

Shiny applications require a specific structure and specific files in order to do the work of creating HTML without the need to write the full HTML and CSS. At its most basic, a Shiny application has two main parts. The first defines the user interface (UI) of the page: it says what goes where, what kinds of elements to include, and how things are labeled. This section defines what the user interacts with, by creating inputs and also defining the layout of the output.
The second part acts as a server that handles the computations and processing of the data that will be passed on to the UI for display. The two pieces work in tandem, passing information back and forth to create a visualization based on user input. Using Shiny allowed almost all of the time spent on the project to be concentrated on processing the data and creating the visualizations; the only difficulty in creating the frontend was making sure all the pieces of the UI and server were connected correctly.

Binder provided a solution for hosting the application, making the data available to users, and making everything shareable, all in an open environment. Notebooks and applications hosted with Binder are shareable in part because the source is usually a repository like GitHub. By passing a GitHub repository to Binder, say one that contains a Jupyter Notebook, Binder will build a Docker image to run the notebook and then serve the result to the user without them needing to do anything. Out of the box, the Docker image contains only the most basic functions, so if a notebook requires a library that isn't standard, it won't be possible to run all of the code in the notebook. To address this, Binder allows a repository to include certain files that define what extra elements should be included when building the Docker image. This can be very specific, such as which version of the language to use, and can list the various libraries that should be included to ensure that the notebook runs smoothly. Binder also supports more advanced functionality in the Docker images, such as creating a Postgres database and loading it with data; these kinds of activities require different hooks that Binder looks for during the creation of the Docker image in order to run scripts.

Results and evaluation

The final product has three main sections that divide the data categorically into altmetrics, bibliometrics, and data from Mendeley. In addition, there are some sections that serve as areas where something new can be tried out and refined without potentially causing issues in the three previously mentioned areas. Each section has visualizations based on the data available. Considering the requirements for the project, the result goes a long way towards meeting them. The most apparent area in which the Journal Map succeeds is its goal of presenting the data we have collected. The application serves as a dashboard for the data that can be explored by changing filters and journal selections. By presenting the data as a dashboard, the barrier to entry for users exploring the data is low. Beyond that, there is a way to access the data directly and perform new calculations or create new visualizations: the application's access to an RStudio environment. Access to RStudio provides two major features. First, it gives direct access to all the underlying code that creates the dashboard and to the data used by it. Second, it provides an R terminal, so that users can work with the data directly. In RStudio, users can also modify the existing files and then run them to see the results. Using Binder and R as the backend of the application allows us to provide users with different ways to access and work with the data, without any extra requirements on the part of the user. However, anything changed in RStudio won't affect the dashboard view and won't persist between sessions; changes exist only in the current session.
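Looking back at the data preparation described in the Solutions section, here is a minimal sketch of the kind of pandas aggregation used to condense the article-level altmetrics table to one row per journal. File and column names are hypothetical; the real dataset uses its own schema.

```python
# Sketch: condense an article-level altmetrics table to one row per journal.
# File and column names are placeholders, not those of the actual dataset.
import pandas as pd

articles = pd.read_csv("altmetrics_articles.csv")   # one row per article

# Sum all numeric indicator columns (e.g. tweets, mentions, reads) per journal.
journals = (articles
            .groupby("journal_id", as_index=False)
            .sum(numeric_only=True))

journals.to_csv("altmetrics_journals.csv", index=False)
print(f"{len(articles)} article rows condensed to {len(journals)} journal rows")
```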
All the major pieces of this project could be done using open technologies: Binder to serve the application, R to write the code, and GitHub to host all the code. Using these technologies and leveraging their capabilities allows the project to support the Open Science paradigm that was part of the impetus for the project. The biggest drawback of the current implementation is that Binder is a third-party host, so certain things are out of our control. For example, Binder can be slow to load: on average it takes more than a minute for the Docker image to load, and there is not much, if anything, we can do to speed that up. The other issue is that if an update to the Binder source code breaks something, the application will be inaccessible until the issue is resolved.

Outlook and future work

The application, in its current state, has parts that are not finalized. As we receive feedback, we will make changes to the application, adding or changing visualizations. As mentioned previously, a few sections were created to test different visualizations independently of the more complete sections; those can be finalized. In the future, it may be possible to move from BinderHub to a locally created and administered instance of Binder; there is support and documentation for creating local, self-hosted instances. Going in that direction would give us more control and might make it possible to get the Docker image to load more quickly. While the application runs stand-alone, the data that is visualized may also be integrated in other contexts. One option we are already prototyping is integrating the data into our subject portal EconBiz, so that users would be able to judge the scientific impact of an article in terms of both bibliometric and altmetric indicators.

References

William W. Hood, Concepcion S. Wilson: The Literature of Bibliometrics, Scientometrics, and Informetrics. Scientometrics 52, 291-314, Springer Science and Business Media LLC, 2001. Link
R. Schimmer: Disrupting the subscription journals' business model for the necessary large-scale transformation to open access. 2015. Link
Mike Thelwall, Stefanie Haustein, Vincent Larivière, Cassidy R. Sugimoto: Do Altmetrics Work? Twitter and Ten Other Social Web Services. PLoS ONE 8, e64841, Public Library of Science (PLoS), 2013. Link
The ZBW - Open Science Future. Link
Sarah Anne Murphy: Data Visualization and Rapid Analytics: Applying Tableau Desktop to Support Library Decision-Making. Journal of Web Librarianship 7, 465-476, Informa UK Limited, 2013. Link
Christina Kläre, Timo Borst: Statistic packages and their use in research in Economics. EDaWaX - Blog of the project 'European Data Watch Extended', 2017. Link

Journal Map - Binder application for displaying and analyzing metrics data about scientific journals

Integrating altmetrics into a subject repository - EconStor as a use case

Back in 2015, the ZBW Leibniz Information Center for Economics (ZBW) teamed up with the Göttingen State and University Library (SUB), the Service Center of the Göttingen Library Federation (VZG) and GESIS Leibniz Institute for the Social Sciences in the *metrics project, funded by the German Research Foundation (DFG). The aim of the project was "… to develop a deeper understanding of *metrics, especially in terms of their general significance and their perception amongst stakeholders." (*metrics project about)
In the practical part of the project, the following DSpace-based repositories of the project partners participated as data sources for online publications and - in the case of EconStor - also as the implementer of the presentation of the social media signals:

- EconStor - a subject repository for economics and business studies run by the ZBW, currently (Aug. 2019) containing around 180,000 downloadable files,
- GoeScholar - the publication server of the Georg-August-Universität Göttingen, run by the SUB Göttingen, offering approximately 11,000 publicly browsable items so far,
- SSOAR - the "Social Science Open Access Repository" maintained by GESIS, currently containing about 53,000 publicly available items.

In the project's work package "Technology analysis for the collection and provision of *metrics", an analysis of currently available *metrics technologies and services was performed. As stated by [Wilsdon 2017], current suppliers of altmetrics "remain too narrow (mainly considering research products with DOIs)", which makes it difficult to acquire *metrics data for repositories like EconStor, whose main content consists of working papers. Up to now it has been unusual - at least in the social sciences and economics - to create DOIs for this kind of document; only the resulting final article published in a journal will receive a DOI.

Based on the findings of this work package, a test implementation of the *metrics crawler was built. The crawler was actively deployed from early 2018 to spring 2019 at the VZG. For the aggregation of the *metrics data, the crawler was fed with persistent identifiers and metadata from the aforementioned repositories. At this stage of the project, the project partners still expected that the persistent identifiers (e.g. handles, URNs, …), or their local URL counterparts as used by the repositories, could be harnessed to easily identify social media mentions of their documents, e.g. for EconStor:

- handle: "hdl:10419/…"
- handle.net resolver URL: "http(s)://hdl.handle.net/10419/…"
- EconStor landing page URL with handle: "http(s)://www.econstor.eu/handle/10419/…"
- EconStor bitstream (PDF) URL with handle: "http(s)://www.econstor.eu/bitstream/10419/…"

This resulted in two datasets: one for publications identified by DOIs (doi:10.xxxx/yyyyy) or the respective metadata from Crossref, and one for documents identified by the repository URLs (https://www.econstor.eu/handle/10419/xxxx) or the item metadata stored in the repository.

During the first part of the project, several social media platforms had been identified as possible data sources for the implementation phase. This was done through interviews and online surveys; for the resulting ranking, see the Social Media Registry. Additional research examined which social media platforms are relevant to researchers at different stages of their careers, and if and how they use them (see [Lemke 2018], [Lemke 2019] and [Mehrazar 2018]). This list of possible sources for social media citations or mentions was then further reduced to the following six social media platforms, which offer free and openly available online APIs: Facebook, Mendeley, Reddit, Twitter, Wikipedia, YouTube. Of particular interest to the EconStor team were Mendeley and Twitter, as those had been found to be among the "Top 3 most used altmetric sources …" for Economic and Business Studies (EBS) journals "… - with Mendeley being the most complete platform for EBS journals" [Nuredini 2016].
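A small sketch of the identifier expansion implied above: given an EconStor handle, derive the URL variants under which a document might be mentioned on the web. The helper function is illustrative only and is not the project's actual crawler code.

```python
# Sketch: derive the identifier/URL variants listed above for an EconStor handle.
# Illustrative only; the *metrics crawler has its own implementation.

def econstor_identifier_variants(handle: str) -> list[str]:
    """Return the handle plus the URL forms under which a document may be mentioned."""
    return [
        f"hdl:{handle}",
        f"https://hdl.handle.net/{handle}",
        f"https://www.econstor.eu/handle/{handle}",
        f"https://www.econstor.eu/bitstream/{handle}",
    ]

# Example with the handle that also appears in the JSON snippet further below.
for variant in econstor_identifier_variants("10419/144535"):
    print(variant)
```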
*metrics integration in EconStor

In early 2019, the EconStor team finally received a MySQL dump of the data compiled by the *metrics crawler. In consultations between the project partners, and based on the aforementioned research, it became clear that only the collected data from Mendeley, Twitter and Wikipedia was suitable for embedding into EconStor. The VZG also made clear that it had been nearly impossible to use handles or the respective local URLs to extract social media mentions from the free-of-charge APIs of the different social media services. Instead, in the case of Wikipedia, ISBNs were used, and for Mendeley the title and author(s) as provided in the repository's metadata; only for the search via the Twitter API were the handle URLs used.

The datasets used by the *metrics crawler to identify works from EconStor comprised a set of 15,703 DOIs (~10% of the EconStor content back then) - sometimes representing other manifestations of the documents stored in EconStor (e.g. pre- or postprint versions of an article) - together with their respective metadata from the Crossref DOI registry, and a set of 153,807 EconStor documents identified by handle/URL and the metadata stored in the repository itself. This second dataset also included the documents related to the publications identified by the DOI set.

The following table (Table 1) shows the results of the *metrics crawler for items in EconStor. It displays one row for each service and identifier set; each row also shows the time period during which the crawler harvested the service and how many unique items per identifier set were found during that period.

| social media service (set) | harvested from | harvested until | unique EconStor items mentioned |
| Mendeley (DOI) | 2018-08-06 | 2019-01-11 | 7,800 |
| Mendeley (URL) | 2019-01-10 | 2019-01-11 | 24,953 |
| Twitter (DOI) | 2018-02-13 (first captured tweet 2018-02-03) | 2019-01-11 (last captured tweet 2019-01-10) | 418 |
| Twitter (URL) | 2018-12-14 (first captured tweet 2018-12-05) | 2019-01-11 (last captured tweet 2019-01-09) | 32 |
| Wikipedia (DOI) | 2018-10-05 | 2019-01-11 | 93 |
| Wikipedia (URL) | 2019-01-11 | 2019-01-11 | 100 |

Table 1: Unique EconStor items found per identifier set and social media service

The following table (Table 2) shows how many of the EconStor items were found with identifiers from both sets. As can be seen, only for Mendeley do the two sets overlap significantly, which shows that it is desirable for a service such as EconStor to expand the captured coverage of its items in social media by using identifiers other than just DOIs.

| social media site | unique items identified by both DOI and URL |
| Mendeley | 4,323 |
| Twitter | 0 |
| Wikipedia | 2 |

Table 2: Overlap in found identifiers

As a result of the project, the landing pages of EconStor items which were mentioned on Mendeley, Twitter or Wikipedia during the time of data gathering now show, for the time being, a listing of "Social Media Mentions". This is in addition to the already existing citation counts, based on the RePEc CitEc service, and the download statistics, which are displayed on separate pages.

Image 1: "EconStor item landing page"

The back end on the EconStor server is realized as a small RESTful web service programmed in Java that returns JSON-formatted data (see Figure 1).
Given a list of identifiers (DOIs/handles), it returns the sum of mentions in the database for Mendeley, Twitter and Wikipedia per specified EconStor item, as well as the links to the counted tweets and Wikipedia articles. In the case of Wikipedia, the result is also grouped by the language of the Wikipedia in which the mention was found.

{
  "_metrics": {
    "sum_mendeley": 0,
    "sum_twitter": 3,
    "sum_wikipedia": 0
  },
  "identifier": "10419/144535",
  "identifiertype": "HANDLE",
  "repository": "EconStor",
  "tweetData": {
    "1075481976793116673": {
      "created_at": "Wed Dec 19 20:04:19 +0000 2018",
      "description": "Economist Wettbewerb Regulierung Monopole Economics @DICEHHU @HHU_de VWL Antitrust Düsseldorf Quakenbrück Berlin FC St. Pauli",
      "id_str": "1075481976793116673",
      "name": "Justus Haucap",
      "screen_name": "haucap"
    },
    "1075484066949025793": {
      "created_at": "Wed Dec 19 20:12:37 +0000 2018",
      "description": "Twitterkanal des Wirtschaftsdienst - Zeitschrift für Wirtschaftspolitik, hrsg. von @ZBW_news; RT ≠ Zustimmung; Impressum: https://t.co/X0gevZb9lR",
      "id_str": "1075484066949025793",
      "name": "Wirtschaftsdienst",
      "screen_name": "Zeitschrift_WD"
    },
    "1075486159772504065": {
      "created_at": "Wed Dec 19 20:20:56 +0000 2018",
      "description": "Professor for International Economics at HTW Berlin - University of Applied Sciences; Senior Policy Fellow at the European Council on Foreign Relations",
      "id_str": "1075486159772504065",
      "name": "Sebastian Dullien",
      "screen_name": "SDullien"
    }
  },
  "twitterids": [
    "1075486159772504065",
    "1075484066949025793",
    "1075481976793116673"
  ],
  "wikipediaQuerys": {}
}

Figure 1: "Example JSON returned by the web service - Twitter mentions"

Image 2: "Mendeley and Twitter mentions"

During the creation of the landing page of an EconStor item (see Image 1), a Java servlet queries the web service and, if social media mentions are detected, renders the result into the web page. For each of the three social media platforms, the sum of the mentions is displayed; for Twitter and Wikipedia, backlinks to the mentioning tweets/articles are additionally provided as a drop-down list below the number of mentions (see Image 2). In the case of Wikipedia, this is also grouped by the languages of the Wikipedia articles in which the ISBN of the corresponding work was found.

Conclusion

While they are an interesting addition to the existing download statistics and the RePEc/CitEc citations already integrated into EconStor, the gathered "social media mentions" currently offer only limited additional value on the EconStor landing pages. One reason might be that only a fraction of all EconStor documents is covered. Another reason might be, according to [Lemke 2019], that there is currently a great reluctance to use social media services among economists and social scientists, as such use is perceived as "unsuitable for academic discourse; … to cost much time; … separating personal from professional matters is bothersome; … increases the efforts necessary to handle information overload."

Theoretically, the prospect of a tool for measuring scientific uptake with a quicker response time than classical bibliometrics could be very rewarding, especially for a repository like EconStor with its many preprints (e.g. working papers) provided in open access. As [Thelwall 2013] has stated: "In response, some publishers have turned to altmetrics, which are counts of citations or mentions in specific social web services because they can appear more rapidly than citations.
For example, it would be reasonable to expect a typical article to be most tweeted on its publication day and most blogged within a month of publication." and "Social media mentions, being available immediately after publication—and even before publication in the case of preprints…".

But especially these preprints, which come without a DOI, are still a challenge to identify correctly and therefore to count as social media mentions. This is something the *metrics crawler has not changed, since it uses title and author metadata to search in Mendeley - which does not give a 100% reliable identification - and ISBNs to search in Wikipedia. A quick check revealed, however, that at the time of writing this article (Aug. 2019) at least Wikipedia offers a handle search: a search for EconStor handles in the English Wikipedia now returns a list of 184 pages with mentions of "hdl:10419/", and the German Wikipedia returns 13 - still very small numbers (as of Aug. 22nd, 2019, 179,557 full texts are available in EconStor).

https://en.wikipedia.org/w/api.php?action=query&list=search&srlimit=500&srsearch=%22hdl:10419%2F%22&srwhat=text&srprop&srinfo=totalhits&srenablerewrites=0&format=json (search via the API in the English Wikipedia)

Another problem is that, at the time of this writing, the *metrics crawler is no longer continuously operated; our analysis is therefore based on a data dump of social media mentions from spring 2018 to early 2019. Since one of the major benefits of altmetrics is that they can be obtained much faster and are more recent than classical citation-based metrics, the continued integration of this static and steadily ageing dataset into the EconStor landing pages loses value over time. Hence, we are looking for more recent and regular updates of social media data that could serve as a 'real-time' basis for monitoring social media usage in economics. As a consequence, we are currently looking for a) an institution that commits itself to running the *metrics crawler, and b) a more active social media usage in the sciences of Economics and Business Studies.

References

[Lemke 2018] Lemke, Steffen; Mehrazar, Maryam; Mazarakis, Athanasios; Peters, Isabella (2018): Are There Different Types of Online Research Impact? In: Building & Sustaining an Ethical Future with Emerging Technology. Proceedings of the 81st Annual Meeting, Vancouver, Canada, 10-14 November 2018, ISBN 978-0-578-41425-6, Association for Information Science and Technology (ASIS&T), Silver Spring, pp. 282-289. http://hdl.handle.net/11108/394

[Lemke 2019] Lemke, Steffen; Mehrazar, Maryam; Mazarakis, Athanasios; Peters, Isabella (2019): "When You Use Social Media You Are Not Working": Barriers for the Use of Metrics in Social Sciences. Frontiers in Research Metrics and Analytics, ISSN 2504-0537, Vol. 3, Art. 39, pp. 1-18. http://dx.doi.org/10.3389/frma.2018.00039

[Mehrazar 2018] Mehrazar, Maryam; Kling, Christoph Carl; Lemke, Steffen; Mazarakis, Athanasios; Peters, Isabella (2018): Can We Count on Social Media Metrics? First Insights into the Active Scholarly Use of Social Media. WebSci '18: 10th ACM Conference on Web Science, May 27-30, 2018, Amsterdam, Netherlands.
ACM, New York, NY, USA, Article 4, 5 pages. https://doi.org/10.1145/3201064.3201101

[Metrics 2019] Einbindung von *metrics in EconStor [Integration of *metrics into EconStor]. https://metrics-project.net/downloads/2019-03-28-EconStor-metrics-Abschluss-WS-SUB-G%C3%B6.pptx

[Nuredini 2016] Nuredini, Kaltrina; Peters, Isabella (2016): Enriching the knowledge of altmetrics studies by exploring social media metrics for Economic and Business Studies journals. Proceedings of the 21st International Conference on Science and Technology Indicators (STI Conference 2016), València (Spain), September 14-16, 2016. http://hdl.handle.net/11108/261

[OR2019] Relevance and Challenges of Altmetrics for Repositories - answers from the *metrics project. https://www.conftool.net/or2019/index.php/Paper-P7A-424Orth%2CWeiland_b.pdf?page=downloadPaper&filename=Paper-P7A-424Orth%2CWeiland_b.pdf&form_id=424&form_index=2&form_version=final

[Social Media Registry] Social Media Registry - Current Status of Social Media Platforms and *metrics. https://docs.google.com/spreadsheets/d/10OALs5kxtmML4Naf1ShXh0cTmONE8q9EFhTzmgPINv4/edit?usp=sharing

[Thelwall 2013] Thelwall, M.; Haustein, S.; Larivière, V.; Sugimoto, C.R. (2013): Do Altmetrics Work? Twitter and Ten Other Social Web Services. PLoS ONE 8(5): e64841. http://dx.doi.org/10.1371/journal.pone.0064841

[Wilsdon 2017] Wilsdon, James et al. (2017): Next-generation metrics: Responsible metrics and evaluation for open science. Report of the European Commission Expert Group on Altmetrics, ISBN 978-92-79-66130-3. http://dx.doi.org/10.2777/337729

Integrating altmetrics data into EconStor

20th Century Press Archives: Data donation to Wikidata

ZBW is donating a large open dataset from the 20th Century Press Archives to Wikidata, in order to make it more accessible to various scientific disciplines - such as contemporary, economic and business history, media and information science - as well as to journalists, teachers, students, and the general public. The 20th Century Press Archives (PM20) is a large public newspaper clippings archive, extracted from more than 1,500 different sources published in Germany and all over the world, covering roughly a full century (1908-2005). The clippings are organized in thematic folders about persons, companies and institutions, general subjects, and wares. During a project originally funded by the German Research Foundation (DFG), the material up to 1960 has been digitized; 25,000 folders with more than two million pages up to 1949 are freely accessible online. The fine-grained thematic access and the public nature of the archives make it, to the best of our knowledge, unique across the world (more information on Wikipedia) and an essential trove of research data for some of the disciplines mentioned above. The data donation does not only mean that ZBW has assigned a CC0 license to all PM20 metadata, which makes it compatible with Wikidata. (Due to intellectual property rights, only the metadata can be licensed by ZBW - all legal rights to the press articles themselves remain with their original creators.) The donation also includes investing a substantial amount of working time (planned for two years) in the integration of this data into Wikidata. Here we want to share our experiences regarding the integration of the persons archive metadata.
Folders from the persons archive, in 2015 (Credit: Max-Michael Wannags)

Linking our folders to Wikidata

The essential bit for linking the digitized folders was in place before the project even started: an external identifier property (PM20 folder ID, P4293), proposed by an administrator of the German Wikipedia in order to link to PM20 person and company folders. We participated in the property proposal discussion and made sure that the links did not have to reference our legacy ColdFusion application. Instead, we created a "partial redirect" on the purl.org service for persistent URLs (formerly maintained by OCLC, now by the Internet Archive), which may redirect to another application on another server in the future. Secondly, the identifier and URL format was extended to include subject and ware folders, which are defined by a combination of two keys, one for the country and another for the topic. The format of the links in Wikidata is controlled by a regular expression which covers all four archives mentioned above. That works pretty well - very few format errors have occurred so far - and it relieved us from creating four different archive-specific properties.

Shortly after the property creation, Magnus Manske, the author of the original MediaWiki software and of lots of related tools, scraped our web site and created a Mix'n'match catalog from it. During the following two years, more than 60 Wikidata users contributed to matching Wikidata items for humans to PM20 folder IDs.

For a start, deriving links from GND

Many of the PM20 person and company folders were already identified by an identifier from the German Integrated Authority File (GND). So our first step was to create PM20 links for all Wikidata items which had matching GND IDs. For all these items and folders, disambiguation had already taken place, and we could safely add all these links automatically.

Infrastructure: PM20 endpoint, federated queries and QuickStatements

To make this work, we relied heavily on Linked Data technologies. A PM20 SPARQL endpoint had already been set up for our contribution to Coding da Vinci (a "Kultur-Hackathon" in Germany). Almost all automated changes we made to Wikidata are based on federated queries: on our own endpoint, reaching out to the Wikidata endpoint, or vice versa, from Wikidata to PM20. In the latter case, the external endpoint has to be registered at Wikidata; Wikidata maintains a help page for this type of query. For our purposes, federated queries allow us to extract current data from both endpoints. In the case of the above-mentioned missing_pm20_id_via_gnd.rq query, this lets us skip all items where a link to PM20 already exists. Within the query itself, we create a statement string which we can feed into the QuickStatements tool. That includes, for every single statement, a reference to PM20 with a link to the actual folder, so that the provenance of these statements is always clear and traceable. Via script, a statement file is extracted and saved with a timestamp. Data imports via QuickStatements are executed in batch mode, and an activity log keeps track of all data imports and other activities related to PM20.

Creating missing items

After about 93% of the person folders with free documents had been matched in Mix'n'match, and after some efforts to discover more pre-existing Wikidata items, we decided to create the 346 missing person items, again via QuickStatements input.
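As a rough illustration of this step (not the actual script, which is linked from the activity log), generating QuickStatements V1 commands for a new person item from PM20 metadata might look like the sketch below. The file name, column names, and concrete values are placeholders, and the exact command syntax should be checked against the QuickStatements documentation.

```python
# Sketch: generate QuickStatements (V1) commands for creating person items
# from PM20 metadata. All names and values below are placeholders.
import csv

def quickstatements_for_person(label_en: str, description_en: str,
                               pm20_id: str, folder_url: str) -> list[str]:
    """Build tab-separated QuickStatements V1 commands for one new item."""
    return [
        "CREATE",
        f'LAST\tLen\t"{label_en}"',
        f'LAST\tDen\t"{description_en}"',                    # description from PM20 "occupation" field
        "LAST\tP31\tQ5",                                      # instance of: human
        f'LAST\tP4293\t"{pm20_id}"\tS854\t"{folder_url}"',    # PM20 folder ID, referenced by the folder URL
    ]

with open("missing_persons.csv", newline="", encoding="utf-8") as infile, \
     open("quickstatements.txt", "w", encoding="utf-8") as outfile:
    # expected columns (hypothetical): label, occupation, pm20_id, folder_url
    for row in csv.DictReader(infile):
        commands = quickstatements_for_person(
            row["label"], row["occupation"], row["pm20_id"], row["folder_url"])
        outfile.write("\n".join(commands) + "\n")
```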
We used the description field in Wikidata by importing the content of the free-text "occupation" field in PM20, for better disambiguation of the newly created items. (Here is a rather minimal example of such an item created from PM20 metadata.) Thus, all PM20 person folders with digitized content were linked to Wikidata in June 2019.

Supplementing Wikidata with PM20 metadata

A second part of the integration of PM20 metadata into Wikidata was the import of missing property values to the corresponding items. This comprised simple facts like "date of birth/death" and occupations such as "economist", "business economist", "social scientist" or "earth scientist", which we could derive from the "field of activity" in PM20, up to relations between existing items, e.g. a family member to the respective family, or a board member to the respective company. A few other source properties have been postponed, because alternative solutions exist and the best one may depend on the intended use in future applications. The steps of this enrichment process and links to the code used - including the automatic generation of references - are online, too.

Complex statement added to the Wikidata item for Friedrich Krupp AG

Again, we used federated queries. Often the target of a Wikidata property is an item itself. Sometimes we could get this directly via the target item's PM20 folder ID (families, companies); sometimes we had to create lookup tables. For the latter, we used "values" clauses in the query (in the case of "occupation"), or (in the case of "country of citizenship") we had to match countries from our internal classification in advance - a process for which we use OpenRefine. Unlike PM20 folder IDs, which we avoided adding when folders contain no digitized content, we added the metadata to all items which were linked to PM20, and we intend to repeat this process periodically when more items (e.g., companies) are identified by PM20 folder IDs. As a housekeeping activity, we also periodically add the numbers of documents (online and total) and the exact folder names as qualifiers to newly emerging PM20 links in items.

Results of the data donation so far

With all 5,266 person folders with digitized documents linked to Wikidata, the data donation of the person folder metadata is complete. Besides the folder links, which have already been used heavily to create links in Wikipedia articles, we have got:
- more than 6,000 statements which are sourced in PM20 (from "date of birth" to the track gauge of a Brazilian railway line)
- more than 1,000 items for which the PM20 ID is the only external identifier

The data donation will be presented at WikidataCon in Berlin (24-26 October 2019) as a "birthday present" on the occasion of Wikidata's seventh birthday. ZBW will keep the digital content available, amended with a static landing page for every folder, which will also serve as the source link for the metadata we have integrated into Wikidata. But in future, Wikidata will be the primary access path to our data, providing further metadata in multiple languages and links to a plethora of other external sources. And best of all: unlike with our current application, everybody will be able to enhance this open data through the interactive tools and data interfaces provided by Wikidata.

Participate in WikiProject 20th Century Press Archives

For the topics, wares and companies archives, there is still a long way to go.
The best structure for representing these archives and their folders - often defined by the combination of a country within a geographical hierarchy with a subject heading in a deeply nested topic classification - still has to be figured out. Existing items have to be matched, and lots of other work needs to be done. Therefore, we have created the WikiProject 20th Century Press Archives in Wikidata to keep track of discussions and decisions, and to create a focal point for participation. Everybody on Wikidata is invited to participate - or just kibitz. It could be challenging, particularly for information scientists and for people interested in historic systems for organizing knowledge about the whole world, to take part in mapping one of these systems to the emerging Wikidata knowledge graph.

Linked data   Open data

ZBW's contribution to "Coding da Vinci": Dossiers about persons and companies from 20th Century Press Archives

On the 27th and 28th of October, the kick-off for the "Kultur-Hackathon" Coding da Vinci is held in Mainz, Germany, organized this time by GLAM institutions from the Rhein-Main area: "For five weeks, devoted fans of culture and hacking alike will prototype, code and design to make open cultural data come alive." New software applications are enabled by free and open data. For the first time, ZBW is among the data providers. It contributes the person and company dossiers of the 20th Century Press Archives. For about a hundred years, the predecessor organizations of ZBW in Kiel and Hamburg collected press clippings, business reports and other material about a wide range of political, economic and social topics - about persons, organizations, wares, events and general subjects. During a project funded by the German Research Foundation (DFG), the documents published up to 1948 (about 5.7 million pages) were digitized and made publicly accessible with their metadata, until recently solely in the "Pressemappe 20. Jahrhundert" (PM20) web application. Additionally, the dossiers - for example about Mahatma Gandhi or the Hamburg-Bremer Afrika Linie - can be loaded into a web viewer.

As a first step to open up this unique source of data for various communities, ZBW has decided to put the complete PM20 metadata* under a CC0 license, which allows free reuse in all contexts. For our Coding da Vinci contribution, we have prepared all person and company dossiers which already contain documents. The dossiers are interlinked among each other. Controlled vocabularies (for, e.g., "country" or "field of activity") provide multi-dimensional access to the data. Most of the persons and a good share of the organizations are linked to GND identifiers.

As a starter, we had mapped dossiers to Wikidata according to existing GND IDs. That allows queries for PM20 dossiers to run completely on Wikidata, making use of all the good stuff there. An example query shows the birth places of PM20 economists on a map, enriched with images from Wikimedia Commons. The initial mapping was much extended by fantastic semi-automatic and manual mapping efforts of the Wikidata community, so that currently more than 80% of the dossiers about - often rather prominent - PM20 persons are not only linked to Wikidata but also connected to Wikipedia pages. That offers great opportunities for mash-ups with further data sources, and we are looking forward to what the "Coding da Vinci" crowd may make out of these opportunities.
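For a flavour of the example query mentioned above, here is a sketch in the same spirit, run from Python against WDQS; the published query may differ in its details.

```python
# Sketch: birth places of PM20 economists, in the spirit of the example query
# mentioned above (the published query may differ).
import requests

WDQS = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?person ?personLabel ?birthPlaceLabel ?coord ?image WHERE {
  ?person wdt:P4293 ?pm20Id ;          # has a PM20 folder ID
          wdt:P106 wd:Q188094 ;        # occupation: economist
          wdt:P19 ?birthPlace .        # place of birth
  ?birthPlace wdt:P625 ?coord .        # coordinates of the birth place
  OPTIONAL { ?person wdt:P18 ?image . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de". }
}
LIMIT 100
"""

data = requests.get(WDQS, params={"query": QUERY, "format": "json"},
                    headers={"User-Agent": "pm20-cdv-example/0.1"}).json()
for row in data["results"]["bindings"]:
    print(row["personLabel"]["value"], "-", row["birthPlaceLabel"]["value"])
```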
Technically, the data has been converted from an internal intermediate format to still quite experimental RDF and loaded into a SPARQL endpoint. There it was enriched with data from Wikidata and extracted with a construct query. We have decided to transform it to JSON-LD for publication (following practices recommended by our hbz colleagues). So developers can use the data as "plain old JSON", with the plethora of web tools available for this, while linked data enthusiasts can utilize sophisticated Semantic Web tools by applying the provided JSON-LD context.
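A minimal sketch of such an enrichment construct query might look as follows. The Wikidata sitelink pattern (schema:about / schema:isPartOf) and the GND ID property (P227) are real; the way the dossier carries its GND ID here is an assumption, not the actual PM20 data model.

  PREFIX wdt:    <http://www.wikidata.org/prop/direct/>
  PREFIX schema: <http://schema.org/>
  PREFIX gndo:   <https://d-nb.info/standards/elementset/gnd#>

  # For every dossier with a GND ID, pull the matching Wikidata item and its English
  # Wikipedia page, and emit both as additional triples.
  CONSTRUCT {
    ?dossier schema:sameAs    ?wdItem ;
             schema:subjectOf ?wikipediaPage .
  }
  WHERE {
    ?dossier gndo:gndIdentifier ?gndId .     # assumed: GND ID attached to the dossier
    SERVICE <https://query.wikidata.org/sparql> {
      ?wdItem wdt:P227 ?gndId .              # Wikidata item with the same GND ID
      OPTIONAL {
        ?wikipediaPage schema:about ?wdItem ;
                       schema:isPartOf <https://en.wikipedia.org/> .
      }
    }
  }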
In order to make the dataset discoverable and reusable for future research, we have published it persistently at zenodo.org, together with examples and data documentation. A GitHub repository provides additional code examples and a place to raise issues and suggestions.
* For the scanned documents, the legal regulations apply - ZBW cannot assign licenses here.
Pressemappe 20. Jahrhundert   Linked data
Wikidata as authority linking hub: Connecting RePEc and GND researcher identifiers
In the EconBiz portal for publications in economics, we have data from different sources. In some of these sources, most notably ZBW's "ECONIS" bibliographical database, authors are disambiguated by identifiers of the Integrated Authority File (GND) - in total more than 470,000. Data stemming from "Research Papers in Economics" (RePEc) contains another identifier: RePEc authors can register themselves in the RePEc Author Service (RAS) and claim their papers. This data is used for various rankings of authors and, indirectly, of institutions in economics, which provides a big incentive for authors - about 50,000 have registered with RAS - to keep both their article claims and personal data up to date. While the GND is well known and linked to many other authorities, RAS had no links to any other researcher identifier system. Thus, until recently, the author identifiers were disconnected, which precluded displaying all publications of an author on a single portal page. To overcome that limitation, colleagues at ZBW have matched a good 3,000 authors with RAS and GND IDs by their publications (see details here). Making that pre-existing mapping maintainable and extensible, however, would have meant setting up a custom editing interface, would have required storage and operating resources, and could not easily have been made publicly accessible. In a previous article, we described the opportunities offered by Wikidata. Now we have made use of them.
Initial situation in Wikidata
Economists were, at the start of this small project in April 2017, already well represented among the 3.4 million persons in Wikidata - though the precise extent is difficult to estimate.
Furthermore, properties for linking GND and RePEc author identifiers to Wikidata items were already in place:
- P227 "GND ID", in ~375,000 items
- P2428 "RePEc Short-ID" (further on: RAS ID), in ~2,200 items
- both properties in ~760 items
For both properties, "single value" and "distinct values" constraints are defined, so that (with rare exceptions) a 1:1 relation between the authority entry and the Wikidata item should exist. That, in turn, means that a 1:1 relation between both authority entries can be assumed. The relative amounts of IDs in EconBiz and Wikidata are illustrated by the following image.
Person identifiers in Wikidata and EconBiz, with unknown overlap at the beginning of the project (the number of 1.1 million persons in EconBiz is a very rough estimate, because most names - outside GND and RAS - are not disambiguated)
Since many economists have Wikipedia pages, from which Wikidata items have been created routinely, the first task was finding these items and adding GND and/or RAS identifiers to them. The second task was adding items for persons which did not already exist in Wikidata.
Adding mapping-derived identifiers to Wikidata items
For items already identified by either GND or RAS, the reciprocal identifiers were added automatically: a federated SPARQL query on the mapping and the public Wikidata endpoint retrieved the items and the missing IDs. A script transformed the results into input for Wikidata's QuickStatements2 tool, which allows adding statements (as well as new items) to Wikidata. The tool takes CSV-formatted input via a web form and applies it in batch to the live dataset.
Import statements for QuickStatements2. The first input line adds the RAS ID "pan31" to the item for the economist James Andreoni. The rest of the input line creates a reference to ZBW's mapping for this statement and so allows tracking its provenance in Wikidata.
That step resulted in 384 GND IDs added to items identified by RAS ID and, in the reverse direction, 77 RAS IDs added to items identified by GND ID. For the future, it is expected that tools like wdmapper will facilitate such operations.
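Such a federated query might look roughly like the sketch below. The mapping endpoint vocabulary (map:) is a placeholder; the Wikidata side uses the real properties P227 and P2428.

  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  PREFIX map: <http://example.org/zbw-gnd-ras-mapping/>   # placeholder vocabulary

  # For each GND/RAS pair in the publication-based mapping, find Wikidata items that
  # already carry the GND ID (P227) but still lack the RePEc Short-ID (P2428).
  SELECT ?item ?rasId WHERE {
    ?pair map:gndId ?gndId ;
          map:rasId ?rasId .
    SERVICE <https://query.wikidata.org/sparql> {
      ?item wdt:P227 ?gndId .
      FILTER NOT EXISTS { ?item wdt:P2428 ?anyRas }
    }
  }

Each result row then becomes one QuickStatements2 input line adding P2428 plus the reference to the mapping, as described in the caption above.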
Identifying more Wikidata items
Obviously, the previous step left out the economists already existing in Wikidata which up to then had neither a GND nor a RAS ID. Therefore, these items had to be identified by adding one of the identifiers. A semi-automatic approach was applied to that end, starting with the "most important" persons from the RePEc and EconBiz datasets. It was extended in an automatic step, taking advantage of existing VIAF identifiers (a step which could also have been the first one). For RePEc, the "Top economists" ranking page (~4,600 authors) was scraped and cross-linked to a custom-created basic RDF dataset of the RePEc authors. The result was transformed into an input file for Wikidata's Mix'n'match tool, which had been developed for the alignment of external catalogs with Wikidata. The tool takes a simple CSV file, consisting of a name, a description and an identifier, and tries to match automatically against Wikidata labels. In a subsequent interactive step, it allows confirming or removing every match. If confirmed, the identifier is automatically added as the value of the corresponding property of the matched Wikidata item. For GND, all authors with more than 30 publications in EconBiz were selected via a custom SPARQL endpoint. Just like the "RePEc Top" matchset, a "GND economists (de)" matchset with ~18,000 GND IDs, names and descriptions was loaded into Mix'n'match and aligned to Wikidata.
As we became more familiar with the Wikidata-related tools, policies and procedures, we exploited existing VIAF property values as another opportunity for seeding GND IDs in Wikidata. In a federated SPARQL query on a custom VIAF endpoint and the public Wikidata endpoint, about 12,000 missing GND IDs were determined and added to Wikidata items which had been identified by VIAF ID. After each of these steps, the first task - adding mapping-derived GND or RAS identifiers - was repeated. That resulted in 1,908 Wikidata items carrying both IDs. Since ZBW's author mapping was based on at least 10 matching publications, the alignment of high-frequency and highly ranked GND and RePEc authors made it highly probable that authors already present in Wikidata had been identified in the previous steps. That reduced the danger of creating duplicates in the following task.
Creating new Wikidata items from the mapped authorities
For the rest of the authors in the mapping, 2,179 new Wikidata items were created. This task was again carried out with the QuickStatements2 tool, for which the input statements were created by a script, based on a SPARQL query on the afore-mentioned endpoints for RePEc authors and GND entries. The input statements were derived from both authorities in the following fashion:
- the label (name of the person) was taken from GND
- the occupation "economist" was derived from RePEc (in particular from the occurrence in its "Top Economists" list)
- gender and date of birth/death were taken from GND (if available)
- the English description was a concatenated string "economist" plus the affiliations from RePEc
- the German description was a concatenated string "Wirtschaftswissenschaftler/in" plus the affiliations from GND
The use of Wikidata's description field for affiliations was a makeshift: in the absence of an existing mapping of RePEc (and mostly also GND) organizations to Wikidata, it allows for better identification of the individual researchers. In a later step, when the corresponding organization/institute items exist in Wikidata and mappings are in place, the items for authors can be supplemented step by step with formal "affiliation" (P1416) statements. In accordance with Wikidata's policy, an extensive reference to the source was added for each statement in the synthesized new Wikidata items. The creation of items in an automated fashion involves the danger of duplicates. However, such duplicates turned up only in very few cases, and they were resolved by merging items, which technically is very easy in Wikidata. Interestingly, a number of "fake duplicates" indeed revealed multifarious quality issues, in Wikidata and in both of the authority files, which were subsequently resolved as well.
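The item-synthesis query might look roughly like this sketch. The rep: vocabulary for the custom RePEc author dataset is a placeholder, and the gndo: properties are taken from the public GND ontology, which may differ from the dataset actually used; the filtering of pairs already represented in Wikidata is omitted here.

  PREFIX gndo: <https://d-nb.info/standards/elementset/gnd#>
  PREFIX rep:  <http://example.org/repec-authors/>   # placeholder vocabulary

  # Assemble label, gender, life dates and affiliation strings for each GND/RAS pair
  SELECT ?label ?gender ?birth ?death ?affiliation WHERE {
    ?pair rep:gndId ?gndId ;
          rep:rasId ?rasId .
    ?gndEntry gndo:gndIdentifier ?gndId ;
              gndo:preferredNameForThePerson ?label .
    OPTIONAL { ?gndEntry gndo:gender ?gender }
    OPTIONAL { ?gndEntry gndo:dateOfBirth ?birth }
    OPTIONAL { ?gndEntry gndo:dateOfDeath ?death }
    OPTIONAL { ?author rep:shortId ?rasId ; rep:affiliation ?affiliation }
  }

A script then turns each row into QuickStatements2 input that creates the item, sets labels and descriptions, and attaches the reference statements described above.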
... and even more new items for economists ...
The good experiences so far made us bolder, and we considered creating Wikidata items for the still missing "Top Economists" (according to RePEc). For item creation, one aspect we had to consider was compliance with Wikidata's notability policy. This policy is much more relaxed than the policies of the large Wikipedias. It states as one criterion sufficient for item creation that the item "refers to an instance of a clearly identifiable conceptual or material entity. The entity must be notable, in the sense that it can be described using serious and publicly available references." There seems to be some consensus in the community that authority files such as GND or RePEc authors count as "serious and publicly available references." This, of course, should hold even more for a bibliometrically ranked subset of these external identifiers. We thus inserted another 1,839 Wikidata items for the rest of the RePEc Top 10 % list. Additionally - to mitigate the inherent gender bias such selections often bear - we imported all missing researchers from RePEc's "Top 10 % Female Economists" list. Again, we added reference statements to RePEc which allow Wikidata users to keep track of the source of the information.
Results
The immediate results of the project were:
- all of the 3,081 pairs of identifiers from the initial mapping by ZBW are now incorporated in Wikidata items
- 1,217 further Wikidata items also carry both identifiers (created by individual Wikidata editors, or by the efforts described above)
(All numbers in this section as of 2017-11-13.)
While that is still only a beginning, given the total number of authors represented in EconBiz, it is a significant share of the "most important" ones.
Top 10 % RAS and frequent GND in EconBiz (> 30 publications). "Wikidata economists" is a rough estimate of the number of persons in the field of economics (twice the number of those with the explicit occupation "economist")
While the top RePEc economists are now completely covered by Wikidata, for GND the overlap has improved significantly during the last year. This occurred partly as a side effect of the efforts described above, and partly through the genuine growth of Wikidata, both in the number of items and in the increasing density of external identifiers. Here are the current percentages, compared to those of one year earlier, which were presented in our previous article:
Large improvements in the coverage of the most frequent authors by Wikidata (query, result)
While the improvements in absolute numbers are impressive, too - the number of GND IDs for all EconBiz persons (with at least one publication) has increased from 39,778 to 59,074 - the image demonstrates that particularly the coverage of our most frequent authors has risen markedly. The addition of all RePEc top economists has created further opportunities for matching these items from the afore-mentioned GND Mix'n'match set, which will again add to the mapping. Once all matching and duplicate checking is done, we may reconsider the option of adding the remaining frequent GND persons (> 30 publications in EconBiz) automatically to Wikidata.
The mapping data can be retrieved by everyone, via SPARQL queries, by specialized tools such as wdmapper, or as part of the Wikidata dumps. What is more, it can be extended by everybody - either as a by-product of individual edits adding identifiers to persons in Wikidata, or by a directed approach. For directed extensions, any subset can be used as a starting point: a new version of the above-mentioned ranking, other rankings also published by RePEc (covering, for example, female economists or economists from Latin America), or all identifiers from a particular institution, derived from either GND or RAS. The results of all such efforts are available at once and add up continuously.
Yet the benefits of using Wikidata cannot be reduced to the publication and maintenance of the mapping itself. In many cases it offers much more than just a linking point for two identifiers:
- links to Wikipedia pages about the authors, possibly in multiple languages
- rich data about the authors in defined formats, sometimes with explicit provenance information
- access to pictures etc. from Wikimedia Commons, or quotations from Wikiquote
- links to multiple other authorities
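As mentioned above, the full mapping can be pulled from the public Wikidata endpoint at any time; a query along these lines is all it takes:

  PREFIX wdt: <http://www.wikidata.org/prop/direct/>

  # All Wikidata items carrying both a GND ID and a RePEc Short-ID
  SELECT ?item ?gndId ?rasId WHERE {
    ?item wdt:P227  ?gndId ;
          wdt:P2428 ?rasId .
  }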
As an example of the links to other authorities: the 6,825 RAS identifiers currently in Wikidata are already mapped to 2,389 VIAF and 1,742 LoC authority IDs (while ORCID, with 69 IDs, is still remarkably low). At the same time, these RePEc-connected items were linked to 1,502 English, 690 German and 272 Spanish Wikipedia pages, which provide rich human-readable information. In turn, when we take the GND persons in EconBiz as a starting point, roughly 60,000 are already represented in Wikidata. Besides large amounts of other identifiers, the corresponding Wikidata items offer more than 33,000 links to German and more than 24,000 links to English Wikipedia pages (query). For ZBW, "releasing" the dataset into Wikidata as a trustworthy and sustainable public database not only saves the "technical" costs of data ownership (programming, storage and operation, for access and for maintenance); the responsibility for - and fun of - extending, amending and keeping the dataset current can also be shared with many other interested parties and individuals.
Wikidata for Authorities   Authority control   Wikidata
New version of multi-lingual JEL classification published in LOD
The Journal of Economic Literature Classification Scheme (JEL) was created and is maintained by the American Economic Association. The AEA provides this widely used resource freely for scholarly purposes. Thanks to André Davids (KU Leuven), who has translated the originally English-only labels of the classification into French, Spanish and German, we provide a multi-lingual version of JEL. Its latest version (as of 2017-01) is published as RDFa and as RDF download files. These formats and translations are provided "as is" and are not authorized by the AEA. In order to make changes in JEL more easily traceable, we have created lists of inserted and removed JEL classes in the context of the skos-history project.
JEL Klassifikation für Linked Open Data   skos-history   Linked data
Economists in Wikidata: Opportunities of Authority Linking
Wikidata is a large database which connects all of the roughly 300 Wikipedia projects. Besides interlinking all Wikipedia pages in different languages about a specific item - e.g., a person -, it also connects to more than 1,000 different sources of authority information. The linking is achieved by an "authority control" class of Wikidata properties. The values of these properties are identifiers which unambiguously identify the Wikidata item in external, web-accessible databases. The property definitions include a URI pattern (called "formatter URL"). When the identifier value is inserted into the URI pattern, the resulting URI can be used to look up the authority entry. The resulting URI may point to a Linked Data resource - as is the case with the GND ID property. This, on the one hand, provides a light-weight and robust mechanism to create links in the web of data. On the other hand, these links can be exploited by every application which is driven by one of the authorities to provide additional data: links to Wikipedia pages in multiple languages, images, life dates, nationality and affiliations of the persons concerned, and much more.
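The formatter URL mechanism can itself be queried. A small sketch (P1630 is the "formatter URL" property; whether the stored pattern uses http or https may vary):

  PREFIX wd:  <http://www.wikidata.org/entity/>
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>

  # Resolve GND ID values to lookup URIs via the formatter URL of the GND ID property (P227)
  SELECT ?item ?gndId ?authorityUri WHERE {
    wd:P227 wdt:P1630 ?formatter .                      # e.g. "https://d-nb.info/gnd/$1"
    ?item wdt:P227 ?gndId .
    BIND(IRI(REPLACE(?formatter, "\\$1", ?gndId)) AS ?authorityUri)
  }
  LIMIT 10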
Wikidata item for the Indian economist Bina Agarwal, visualized via the SQID browser
In 2014, a group of students under the guidance of Jakob Voß published a handbook on "Normdaten in Wikidata" (in German), describing the structures and the practical editing capabilities of the standard Wikidata user interface. The experiment described here focuses on persons from the subject domain of economics. It uses the authority identifiers of the roughly 450,000 economists referenced by their GND IDs as creators, contributors or subjects of books, articles and working papers in ZBW's economics search portal EconBiz. These GND IDs were obtained from a prototype of the upcoming EconBiz Research Dataset (EBDS). For 40,000 of these persons, or 8.7 %, a person item in Wikidata is connected via the GND ID. If we consider the frequent (more than 30 publications) and the very frequent (more than 150 publications) authors in EconBiz, the coverage increases significantly:
Economics-related persons in EconBiz (datasets: EBDS as of 2016-11-18; Wikidata as of 2016-11-07; query, result)
Number of publications | total | in Wikidata | percentage
> 0 | 457,244 | 39,778 | 8.7 %
> 30 | 18,008 | 3,232 | 17.9 %
> 150 | 1,225 | 547 | 44.7 %
These are numbers "out of the box" - ready-made opportunities to link out from existing metadata in EconBiz and to enrich user interfaces with biographical data from Wikidata/Wikipedia, without any additional effort to improve the coverage on either the EconBiz or the Wikidata side. However, we can safely assume that many of the EconBiz authors, particularly the high-frequency authors, and even more so the persons who are subjects of publications, are "notable" according to the Wikidata notability guidelines. Their items probably exist and are just missing the GND ID property. To check this assumption, we take a closer look at the Wikidata persons which have the occupation "economist" (most Wikidata properties accept other Wikidata items - instead of arbitrary strings - as values, which allows for exact queries and is indispensable in a multilingual environment). Of these approximately 20,000 persons, less than 30 % have a GND ID property! Even if we restrict that to the 4,800 "internationally recognized economists" (which we define here as having Wikipedia pages in three or more different languages), almost half of them lack a GND ID property. When we compare that with the coverage by VIAF IDs, more than 50 % of all and 80 % of the internationally recognized Wikidata economists are linked to VIAF (SPARQL Lab live query). Therefore, for a whole lot of the persons we have looked at here, we can take it for granted that the person exists in Wikidata as well as in the GND, and the only reason for the lack of a GND ID is that nobody has added it to Wikidata yet. As an aside: the information about the occupation of persons is to be taken as a very rough approximation. Some Wikidata persons were economists by education or at some point of their career, but are now famous for other reasons (examples include Vladimir Putin or the president of Liberia, Ellen Johnson Sirleaf). On the other hand, EconBiz authors known to Wikidata are often qualified not as economists, but as university teachers, politicians, historians or sociologists. Nevertheless, their work was deemed relevant for the broad field of economics, and the conclusions drawn for the "economists" in Wikidata and GND will hold for them, too: there are lots of opportunities for linking already well defined items.
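The share of Wikidata "economists" carrying a GND ID, as discussed above, can be checked with a query roughly like this (a sketch, not the published SPARQL Lab query):

  PREFIX wd:  <http://www.wikidata.org/entity/>
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>

  # Persons with occupation "economist" (Q188094) and the share of them carrying a GND ID (P227)
  SELECT (COUNT(?person) AS ?economists)
         (COUNT(?gndId)  AS ?withGndId)
         (ROUND(100 * COUNT(?gndId) / COUNT(?person)) AS ?percentage)
  WHERE {
    ?person wdt:P106 wd:Q188094 .
    OPTIONAL { ?person wdt:P227 ?gndId }
  }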
What can we gain?
The screenshot above demonstrates that not only data about the person herself, her affiliations, awards received and possibly many other details can be obtained. The "Identifiers" box on the bottom right shows authority entries. Besides the GND ID, which served as an entry point for us, there are links to VIAF and other national libraries' authorities, but also to non-library identifier systems like ISNI and ORCID. In total, Wikidata comprises more than 14 million authority links, more than 5 million of these for persons. When we take a closer look at the 40,000 EconBiz persons which we can look up by their GND ID in Wikidata, an astonishing variety of authorities is addressed from there: 343 different authorities are linked from this subset, ranging from "almost complete" (VIAF, Library of Congress Name Authority File) to - in the given context - quite exotic authorities of, e.g., Members of the Belgian Senate, chess players or Swedish Olympic Committee athletes. Some of these entries link to carefully crafted biographies, sometimes behind a paywall (Notable Names Database, Oxford Dictionary of National Biography, Munzinger Archiv, Sächsische Biographie, Dizionario Biografico degli Italiani), or to free text resources (Project Gutenberg authors). Links to the world of museums and archives are also provided, from the Getty Union List of Artist Names to specific links into the British Museum or Musée d'Orsay collections. A particular use can be made of properties which express the prominence of the persons concerned: Nobel Prize IDs, for example, should definitely be linked to GND IDs (and indeed, they are). But also TED speakers, or persons with an entry in the Munzinger Archive (a famous and long-established German biographical service), can be assumed to have GND IDs. That opens a road to a very focused improvement of data quality: a list of persons with those properties, restricted to the subject field (e.g., occupation "economist"), can easily be generated with Wikidata's SPARQL Query Service. In Wikidata, it is very easy to add the missing ID entries discovered during such cross-checks interactively. And if it turns out that a "very important" person from the field is missing from the GND altogether, that is an all-the-more valuable opportunity to improve the data quality at the source.
How can we start improving?
As a proof of concept, and as a practical starting point, we have developed a micro-application for adding missing authority property values. It consists of two SPARQL Lab scripts: missing_property creates a list of Wikidata persons which have a certain authority property (by default: TED speaker ID) and lack another one (by default: GND ID). For each entry in the list, a link to an application is created which looks up the name in the corresponding authority file (by default: search_person, for a broad yet ranked full-text search of person names in GND). If we can identify the person in the GND result list, we can copy the GND ID, return to the first list, click on the link to the Wikidata item of the person and add the property value manually through Wikidata's standard edit interface. (Wikidata is open and welcomes such contributions!) It takes effect within a few seconds - when we reload the missing_property list, the improved item should no longer show up.
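The core of such a missing_property check might look roughly like this sketch (P2611 is assumed here to be the TED speaker ID property; the pair of identifiers can be swapped for any other combination):

  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  PREFIX wikibase: <http://wikiba.se/ontology#>
  PREFIX bd:  <http://www.bigdata.com/rdf#>

  # Persons with a TED speaker ID but (as yet) no GND ID
  SELECT ?person ?personLabel ?tedId WHERE {
    ?person wdt:P2611 ?tedId .                      # TED speaker ID (assumed property number)
    FILTER NOT EXISTS { ?person wdt:P227 ?gndId }   # no GND ID yet
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de". }
  }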
Instead of identifying the most prominent economics-related persons in Wikidata, the other way round works, too: while most of the GND-identified persons are related to only one or two works, as the according statistics show, a few are related to a disproportionately large number of publications. Of the 1,200 persons related to more than 150 publications, fewer than 700 are missing links to Wikidata via their GND ID. By adding this property (for the vast majority of these persons, a Wikidata item should already exist), we could enrich, at a rough estimate, more than 100,000 person links in EconBiz publications. Another micro-application demonstrates how the work could be organized: the list of EconBiz persons by descending publication count provides "SEARCH in Wikidata" links (functional on a custom endpoint). Each link triggers a query which looks up all name variants in GND and executes a search for these names in a full-text indexed Wikidata set, bringing up a ranked list of suggestions (example with the GND ID of John H. Dunning). Again, the GND ID can be added - manually but straightforwardly - to an identified Wikidata item. While we cannot expect to reduce the quantitative gap between the 450,000 persons in EconBiz and the 40,000 of them linked to Wikidata significantly by such manual efforts, we surely can improve coverage step by step for the most prominent persons. This empowers applications to show biographical background links to Wikipedia where our users most probably expect them. Other tools for creating authority links, and more automated approaches, will be covered in further blog posts. And the great thing about Wikidata is that all efforts add up: while we are making modest improvements in our field of interest, many others do the same, so Wikidata already features an impressive overall amount of authority links.
PS: All queries used in this analysis are published on GitHub. The public Wikidata endpoint cannot be used for research involving large datasets due to its limitations (in particular the 30 second timeout, the preclusion of the "service" clause for federated queries, and the lack of full-text search). Therefore, we have loaded the Wikidata dataset (along with others) into custom Apache Fuseki endpoints on a performant machine. Even there, a "power query" like the one on the number of all authority links in Wikidata takes about 7 minutes. Therefore, we publish the corresponding result files in the GitHub repository alongside the queries.
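Such a "power query" essentially iterates over all external-identifier properties; a sketch of that kind of query, using the Wikibase ontology terms, might be:

  PREFIX wikibase: <http://wikiba.se/ontology#>

  # Count every statement whose property is an external identifier ("authority link")
  SELECT (COUNT(*) AS ?authorityLinks) WHERE {
    ?property wikibase:propertyType wikibase:ExternalId ;
              wikibase:directClaim  ?directProp .
    ?item ?directProp ?identifierValue .
  }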
Wikidata for Authorities   Wikidata   Authority control   Linked data
Integrating a Research Data Repository with established research practices
Authors: Timo Borst, Konstantin Ott
In recent years, repositories for managing research data have emerged which are supposed to help researchers to upload, describe, distribute and share their data. To promote and foster the distribution of research data in the light of paradigms like Open Science and Open Access, these repositories are normally implemented and hosted as stand-alone applications, meaning that they offer a web interface for manually uploading the data and a presentation interface for browsing, searching and accessing the data. Sometimes the first component (the interface for uploading the data) is substituted or complemented by a submission interface from another application; e.g., in Dataverse or CKAN, data is submitted from remote third-party applications by means of data deposit APIs [1].
However the upload of data is organized and eventually embedded into a publishing framework (data either as a supplement to a journal article, or as a stand-alone research output subject to review and release as part of a 'data journal'), it definitely means that this data is supposed to be made publicly available, which is often reflected by policies and guidelines for data deposit. In clear contrast to this publishing model, however, the vast majority of current research data is not supposed to be published, at least not in the sense of scientific publications. Several studies and surveys on research data management indicate that, at least in the social sciences, there is a strong tendency and practice to process and share data amongst peers in a local and protected environment (often with several local copies on different personal devices), before eventually uploading and disseminating derivatives of this data to a publicly accessible repository. According to a survey among Austrian researchers, for example, the share of researchers agreeing to share their data on request or among colleagues is 57 % and 53 % respectively, while agreement to sharing via a disciplinary repository is only 28 % [2]. And in another survey among researchers from a local university and a cooperation partner, almost 70 % preferred an institutional local archive, while only 10 % agreed on a national or international archive. Even if data is planned to be published via a publicly accessible repository, it will first be stored and processed in a protected environment, carefully shared with peers (project members, institutional colleagues, sponsors) and often subject to access restrictions - in other words, it is used before being published.
With this situation in mind, we designed and developed a central research data repository as part of a funded project called 'SowiDataNet' (SDN - Network of data from Social Sciences and Economics) [3]. The overall goal of the project is to develop and establish a national web infrastructure for archiving and managing research data in the social sciences, particularly quantitative (statistical) data from surveys. It aims at smaller institutional research groups or teams, which often lack institutional support or an infrastructure for managing their research data. As a front-end application, the repository, based on DSpace software, provides a typical web interface for browsing, searching and accessing the content. As a back-end application, it provides typical forms for capturing metadata and bitstreams, with some enhancements regarding the integration of authority control by means of external web services. From the point of view of the participating research institutions, a central requirement is the development of a local view ('showcase') on the repository's data, so that this view can be smoothly integrated into the website of the institution. The web interface of the view is generated by means of the Play Framework in combination with the Bootstrap framework for generating the layout, while all of the data is retrieved and requested from the DSpace backend via its Discovery interface and REST API.
Diagram: SowiDataNet (SDN) software components and architecture
The purpose of the showcase application is to provide an institutional subset and view of the central repository's data which can easily be integrated into any institutional website, either as an iFrame to be embedded by the institution (which might be considered an easy rather than a satisfactory technical solution), or as a stand-alone subpage linked from the institution's homepage, optionally using a proxy server to preserve the institutional domain namespace. While these solutions imply the standard way of hosting the showcase software, a third approach is to deploy the showcase software on an institution's own server in order to customize the application. In this case, every institution can modify the layout of its institutional view by customizing its institutional CSS file. Because Bootstrap and LESS are used to compile the CSS, a lightweight option is to modify only a few LESS variables and compile them into an institutional CSS file.
As a result of the requirements analysis conducted with the project partners (two research institutes from the social sciences), and in accordance with the survey results cited above, there is a strong demand for managing not only data which is to be published in the central repository, but also data which is protected and circulates only among the members of the institution. Moreover, this data is described by additional specific metadata containing internal hints on availability restrictions and access conditions. Hence, we had to distinguish between the following two basic use cases to be covered by the showcase:
- to provide a view on the public SDN data ('data published')
- to provide a view on the public SDN data plus the internal institutional data, or rather their corresponding metadata records, the latter visible and accessible only to institutional members ('data in use')
From the perspective of a research institution and data provider, the second use case turned out to be the primary one, since it reflects the institutional practices and workflows better than the publishing model does. As a matter of fact, research data is primarily generated, processed and shared in a protected environment before it may eventually be published and distributed to a wider, potentially abstract and unknown community - and this fact must be acknowledged and reflected by a central research data repository aiming at contributions from researchers who are bound to an institution.
If 'data in use' is to be integrated into the showcase as an internal view on protected data to be shared only within an institution, access to this data has to be restricted on different levels. First, for every community (in the sense of an institution), we introduce a DSpace collection for just those internal data, and protect it by assigning it to a DSpace user role 'internal[COMMUNITY_NAME]'. This role is associated with an IP range, so that only requests from that range will be assigned to the role 'internal' and granted access to the internal collection. In the context of our project, we enter only the IP of the showcase application, so that every user of this application will see the protected items. Depending on where the showcase application or server is located, further steps have to be taken: if it is located in the institution's intranet, the protected items are visible and accessible only from the institution's network.
If the application is externally hosted and accessible via the World Wide Web - which is expected to be the default solution for most of the research institutes - the showcase application needs an authentication procedure, preferably realized by means of the central DSpace SowiDataNet repository, so that every user of the showcase application is granted access by becoming a DSpace user.
In the context of an R&D project in which we are partnering with research institutes, it turned out that the management of research data is twofold: while repository providers are focused on publishing and unrestricted access to research data, researchers are mainly interested in locally archiving and sharing their data. In order to manage this data, the researchers' institutional practices need to be reflected and supported, and for this purpose we developed an additional viewing and access component. When it comes to integration with existing institutional research practices and workflows, the implementation of research data repositories requires concepts and actions which go far beyond the original idea of a central publishing platform. Further research and development is planned in order to better understand and support the sharing of data in both institutional and cross-institutional subgroups, so that the integration with a public central repository is fostered.
Link to prototype
References
[1] Dataverse Deposit API. Retrieved 24 May 2016, from http://guides.dataverse.org/en/3.6.2/dataverse-api-main.html#data-deposit-api
[2] Forschende und ihre Daten. Ergebnisse einer österreichweiten Befragung - Report 2015. Version 1.2. Zenodo (2015). Retrieved 24 May 2016, from https://zenodo.org/record/32043#.VrhmKEa5pmM
[3] Project homepage: https://sowidatanet.de/. Retrieved 24 May 2016.
[4] Research data management survey: report. Nottingham ePrints (2013). Retrieved 24 May 2016, from http://eprints.nottingham.ac.uk/1893/
[5] University of Oxford Research Data Management Survey 2012: The Results. DaMaRO (2012). Retrieved 24 May 2016, from https://blogs.it.ox.ac.uk/damaro/2013/01/03/university-of-oxford-research-data-management-survey-2012-the-results/
Institutional view on research data