Distant Reader Catalog

Abstract

About a year ago I implemented a traditional library catalog for the content of the Distant Reader. I used Koha to do this work, and the process was almost trivial. Moreover, the implementation suits all of my needs. Kudos to the Koha community!

Introduction

About a year ago I got an automated email message from OCLC, and to paraphrase, it said, "Your collection has been successfully updated and added to WorldCat." I asked myself, "What collection?" After a bit of digging around, I discovered a few OAI-PMH data repositories I had submitted to OCLC many years ago, and these repositories contained the content being updated. Through the process of this discovery, I learned I have an OCLC symbol, ININI, and after wrestling with authentication procedures, I was able to edit my profile. Fun!

I then got to thinking, "I am able to programmatically create and edit MARC records. I am able to bring up an online catalog. Koha supports OAI-PMH. I could create MARC records describing the content of the Distant Reader, import them into Koha, and ultimately have them become a part of WorldCat. Hmmm..." So I did.

Implementation

My first step was to create a virtual computer running Ubuntu, because Ubuntu is the preferred flavor of Linux supported by Koha. I spun up a virtual computer at Digital Ocean. It has 2 cores, 4 GB of RAM, and 60 GB of disk space. Tiny, by my standards. This generates an ongoing personal monthly expense of something like $25.

The next step was to install Koha. This took practice; I had to destroy my virtual machine a few times, and I had to re-install Koha a few times, but all-in-all the process worked as advertised. Again, it was not difficult; it just took practice. I was able to get Koha installed in a few days. I could probably do it now in less than eight hours.

The third step was to add records to the catalog. This required me to first use Koha's administrative interface to create authorized terms for both local collections and data types. I then wrote a set of scripts to create MARC records from my cache of content. These scripts were written against curated databases describing: 1) etexts from Project Gutenberg, 2) PDF files from DOAJ journals, 3) articles on the topic of COVID from a data set called CORD-19, and 4) TEI files from a project called Early Print. In each case, I looped through the given database, read the desired metadata, and output MARC records amenable to Koha. At the end of this process, I had created about 0.3 million records. The small sample of linked records exemplifies the data I created. Simple. Rudimentary. Functional.
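
As an illustration only, a script of this kind might look something like the sketch below. It is not the code used for the catalog. The table name, the column names, the item type code (EBOOK), and the choice of MARCXML as the output format are all assumptions made for the sake of a self-contained example; Koha conventionally reads a record-level item type from field 942, subfield c, and the value given there would need to match an item type defined in the administrative interface.

```python
import sqlite3
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
LEADER = "00000nam a2200000 a 4500"  # generic 24-character leader for a book-like record


def add_datafield(record, tag, ind1, ind2, subfields):
    """Append one MARC data field holding the given (code, value) subfields."""
    field = ET.SubElement(record, "datafield", {"tag": tag, "ind1": ind1, "ind2": ind2})
    for code, value in subfields:
        ET.SubElement(field, "subfield", {"code": code}).text = value


def build_collection(rows, item_type="EBOOK"):
    """Turn (identifier, author, title, url) rows into a MARCXML collection.

    The item_type default is a hypothetical code, not one taken from the catalog.
    """
    collection = ET.Element("collection", {"xmlns": MARC_NS})
    for identifier, author, title, url in rows:
        record = ET.SubElement(collection, "record")
        ET.SubElement(record, "leader").text = LEADER
        ET.SubElement(record, "controlfield", {"tag": "001"}).text = identifier
        add_datafield(record, "100", "1", " ", [("a", author)])     # main entry, personal name
        add_datafield(record, "245", "1", "0", [("a", title)])      # title statement
        add_datafield(record, "856", "4", "0", [("u", url)])        # link to the item
        add_datafield(record, "942", " ", " ", [("c", item_type)])  # Koha item type
    return collection


if __name__ == "__main__":
    # A hypothetical, one-row stand-in for a curated database.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE titles (id TEXT, author TEXT, title TEXT, url TEXT)")
    db.execute(
        "INSERT INTO titles VALUES (?, ?, ?, ?)",
        ("example-0001", "Doe, Jane", "An example title", "http://example.org/example-0001.txt"),
    )
    rows = db.execute("SELECT id, author, title, url FROM titles ORDER BY id")
    print(ET.tostring(build_collection(rows), encoding="unicode"))
```

The sketch relies only on Python's standard library. A real script would likely map many more fields, including the authorized terms for collections and data types mentioned above.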

To actually load the records I wrote two tiny shell scripts -- both front-ends to Koha's bulkmarcimport.pl routine. The first front-end simply deletes records. Given a set of MARC records, the second front-end imports them. This importing process is very efficient. Read record. Parse it. Add parsed data to database. After a configured number of records have been added, add them to the index. Repeat for all records. Somebody really knew what they were doing when they wrote bulkmarcimport.pl.
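
The read, store, and index-in-batches pattern described above can be sketched generically. The following is not Koha's code, and bulkmarcimport.pl is written in Perl; the function names and the store and index callbacks are hypothetical placeholders meant only to show why deferring the indexing step to once per batch keeps the loop cheap.

```python
def batched(records, size):
    """Yield successive lists of at most `size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # the final, possibly short, batch
        yield batch


def load(records, store, index, commit_every=1000):
    """Store every record, but call the indexer only once per batch.

    `store` and `index` are hypothetical callbacks standing in for the
    database insert and the search-index update, respectively.
    """
    total = 0
    for batch in batched(records, commit_every):
        for record in batch:
            store(record)
        index(batch)
        total += len(batch)
    return total
```

With five records and a batch size of two, the store callback runs five times while the index callback runs only three times, on batches of two, two, and one.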

Usage

Now that records have been loaded and indexed, the catalog can be queried. For the most part, I use the advanced search interface for this purpose because I'm usually interested in searching within my set of collections. Search results are easily limited by facets. Detailed results point to the original/canonical items as well as the local cache. See the screenshots below:

./search.png
advanced search interface

./results.png
results page

./details.png
details page

What's even better is Koha's support for OAI-PMH. Just use Koha's administrative interface to turn OAI-PMH on, and the entire collection becomes available. My catalog's OAI-PMH data root is located at http://catalog.distantreader.org/cgi-bin/koha/oai.pl. Returning to OCLC, I updated my collection of repositories to include a pointer to the catalog's OAI-PMH root URL, and by the time you read this I believe I will have added my 0.3 million records to WorldCat.
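
For readers unfamiliar with the protocol, here is a minimal sketch of how a harvester might walk an OAI-PMH repository such as this one. It uses only the standard ListIdentifiers verb, the oai_dc metadata prefix that every OAI-PMH repository is required to support, and the protocol's resumptionToken paging. The function names are my own, and nothing here is specific to Koha or to how OCLC actually harvests.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 XML namespace


def request_url(base_url, token=None, prefix="oai_dc"):
    """Build a ListIdentifiers request; a resumption token replaces other arguments."""
    if token:
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_data):
    """Return (identifiers, next_token) from one response; next_token may be None."""
    root = ET.fromstring(xml_data)
    identifiers = [element.text for element in root.iter(OAI + "identifier")]
    token_element = root.find(".//" + OAI + "resumptionToken")
    token = (token_element.text or "").strip() if token_element is not None else ""
    return identifiers, token or None


def harvest(base_url):
    """Yield every record identifier, following resumption tokens until exhausted."""
    token = None
    while True:
        with urllib.request.urlopen(request_url(base_url, token)) as response:
            identifiers, token = parse_page(response.read())
        yield from identifiers
        if not token:
            break


# Example use (makes network requests, so it is left commented out):
# for identifier in harvest("http://catalog.distantreader.org/cgi-bin/koha/oai.pl"):
#     print(identifier)
```

A polite harvester would also pause between requests and handle errors returned by the server; both are omitted here for brevity.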

Summary

The process of creating a traditional library catalog of Distant Reader content was easy: 1) spin up a virtual machine, 2) install Koha, 3) create/edit MARC records, 4) add them to Koha, 5) go to step #3. The process is never done. Finally, you can use the catalog at http://catalog.distantreader.org. It is not fast, but it is functional, very. Again, "Kudos to the Koha community!"


Creator: Eric Lease Morgan <emorgan@nd.edu>
Source: This is the first published version of this posting.
Date created: 2024-06-10
Date updated: 2024-06-10
Subject(s): Distant Reader; Koha;
URL: https://distantreader.org/blog/catalog/