Methods for in-sourcing authority control with MarcEdit, SQL, and regular expressions

Mike Monaco
Coordinator, Cataloging Services
The University of Akron, Akron, Ohio, USA
https://orcid.org/0000-0001-7244-5154

The University of Akron
302 Buchtel Common
Akron, Ohio 44325-1712
Office: 330-972-2446
mmonaco@uakron.edu

This is an Accepted Manuscript of an article published by Taylor & Francis in the Journal of Library Metadata on December 20, 2019, available online: http://www.tandfonline.com/10.1080/19386389.2019.1703497

ABSTRACT

This is a report on a developing method to automate authority control in-house (that is, without an outside vendor), especially for batch-loaded bibliographic records for electronic resources. A SQL query of the Innovative Sierra database retrieves entries from the "Headings used for the first time" report. These entries are then processed with a series of regular expression substitutions to create a list of terms suitable for batch searching in the OCLC Connexion client. Two approaches to this problem are described in detail and their results compared. A similar method using the "Unauthorized headings" report from the SirsiDynix Symphony ILS is also described.

Keywords: authority control, automation, regular expressions, SQL, batch processes, workflows, Sierra ILS

Shorter title: Methods for in-sourcing authority control

Background

At The University of Akron, as at many, perhaps most, university libraries in the United States, the majority of the University Libraries' collection budget has shifted from print and other tangible resources to electronic resources (e-resources), while cataloging staff time has been reduced through attrition and the assignment of additional non-cataloging responsibilities. Shifting from print resources, where manual authority work on individually vetted bibliographic (bib) records is possible, to batches of e-resource bib records loaded en masse without the same individual vetting has meant changing the approach to authority work for most of the new titles added to the collection. Because outsourcing to an authority control vendor is not an option, authority control remains in-house, but by working with the systems unit, the cataloging unit is developing a method to automate much of the authority workload associated with loading large batches of bib records.

Literature review

Authority control is a fundamental and perennial challenge in librarianship, and a great deal has been written about authority control, automation, and the challenges posed by vendor-supplied bib records. This review focuses on recent work on (1) how the quality of vendor-supplied batches of bib records has been assessed, (2) attempts to control the quality of such records and the headings in them, and (3) efforts to automate authority work.
The importance of quality metadata for access and discoverability, and the centrality of authority control to quality control in a bibliographic database, is assumed rather than argued for in the present paper. Snow (2017) provides an in-depth review of the literature on the importance of metadata quality.

No discussion of authority control can ignore the use of vendors to "outsource" the drudgery of managing headings and loading authority records (ARs). Park (1992) presents a relatively early look at moving from manual (in-house) authority files to automated authority files maintained by vendors. Tsui & Hinders (1999) look at outsourcing authority work with vendors, which at the time was the only real alternative to finding ARs manually for each access point. Their cost-benefit analysis compared OCLC charges and credits and the associated staff time to the cost of a vendor contract and associated staff time. Aschmann (2002) gives a detailed plan for outsourcing that includes creating an RFP, working with a vendor, and forming an in-house authority control team. Ten years after Park's hopeful outline, Aschmann found that outsourcing did not necessarily save staff time. Jackson (2003) discusses some of the advantages and perils of automated authority control. While the ability to automatically update headings in the catalog is a great benefit, the downsides include the limits of computer matching (which may produce amusing errors) and the need to manually identify authorities to add to the catalog. Jackson concludes that vendor-supplied authority control carries the same benefits and perils, and hopes that future integrated library systems (ILSs) will provide better options for in-house authority control. Vellucci (2004) surveys the major vendors of authority control and explains the services they offer. It is an interesting snapshot of the state of authority control in 2004: one of the leading vendors mentioned is no longer in business, and the author notes resistance, on the part of vendors, to the idea of an international authority file. It is an accurate and detailed overview of the services available, excepting only services that have developed since 2004, such as updating bib records to use RDA forms of headings and adding URIs to MARC fields. Zhu & Von Seggern (2005) outline various quality control services that can be provided by vendors and try to set realistic expectations for librarians about these services, with examples of what can and cannot be automated. This article serves as an excellent overview of how automated authority control works, explaining normalization, matching, and which elements of authority and bib records are typically utilized. The authors also provide a checklist of common options and questions that libraries should ask of vendors. Williams (2010) gives a case study of database cleanup with Marcive, noting some of the limits of automation identified by Zhu & Von Seggern (2005) as well as some local issues, such as the strain that loading batches of bib records put on their ILS, and how they dealt with those challenges.

Some recent work has taken up the idea of managing large-scale quality control and authority work projects internally. Kreyche, Lisius, & Park (2010) describe a process at Kent State University Libraries for updating name ARs with death dates added since the NACO policy change which made adding death dates more routine.
It is an important example of a large-scale in-house project, although the vendor Backstage Library Works eventually began offering the same service. Ziso, LeVan, & Morgan (2010) describe a method that, rather than using ARs within the database, queries OCLC's WorldCat Identities file to direct users to authorized access points and related works for searches. It is certainly an outside-the-box approach, but most library catalogs still rely on internal authority files, and the method does not help update headings in bib records, so bib record quality would need to be addressed by other workflows. Mak (2013) gives a detailed look at a process at Michigan State University Libraries to cope with the mass re-issue of name authority records (NARs) by the Library of Congress, when many NARs were revised to meet RDA standards. Mak describes a process in which ARs in the local catalog are exported and converted to a format allowing the extraction of control numbers for batch searching; the retrieved ARs are then compared to the exported ARs to identify and select updated records to load. An AutoIT script automates most of this process and even updates bib records within the ILS. Cook (2014) provides a roundup of useful tools for manipulating metadata, including programs, development environments, and programming languages that can be used to manipulate MARC records. Some of these tools were utilized in the present project. Carrasco, Serrano, & Castillo-Buergo (2016) describe a tool for matching headings in the context of a large database with possible duplicates. Their work is notable for relying on bib records to disambiguate names; this was accomplished by analyzing time periods and dates in the bib records associated with names rather than using ARs. Dong, Glerum, & Fenichel (2017) describe a process for resolving a problem that was more or less unique to their shared database: duplicate series data. This is useful to other libraries because the authors detail their planning process and practical lessons learned. The article also includes a good literature review on database quality and describes some approaches and projects in large-scale authority control undertaken elsewhere. Wolf (2019) describes processes that use existing lists of changed or updated ARs (the Library of Congress "Weekly Lists" and OCLC's "Closed Dates in Authority Records") to extract record identifier numbers, and then queries these numbers in the local Sierra ILS (hereafter, Sierra) to determine which ARs need to be re-loaded into Sierra. Wolf's process involves using regular expressions to extract the relevant data, JSON queries of the Sierra database, and batch searches in the OCLC Connexion client (Connexion), making her work somewhat similar to the present project. Indeed it is a complementary project: whereas the present project begins with internal notifications (headings reports), Wolf's begins with external notifications (the aforementioned Library of Congress and OCLC lists).

A natural catalyst for seeking large-scale authority control solutions has been the increasing practice of batch loading bib records, especially records for intangible e-resources. The batch loading of e-resource bib records creates several challenges for authority control, as vendor-supplied bib records may be of varying quality and are loaded in volumes that make evaluating the records individually impractical.
Sanchez, Fatout, Howser, & Vance (2006) is one of the first publications to address the use of "non-traditional, non-ILS supplied editing utilities to correct MARC records prior to loading" (p. 54). Their paper describes their use of MarcEdit, Word, and Excel to correct errors in bib records provided by NetLibrary. These corrections were carried out in batches, but authority work was carried out manually by catalogers. While manual authority work on small batches of e-resource bib records was feasible in 2006, the growth of e-resources in library collections and the reduction of library staff render such an approach less practical today. Heinrich (2008) details quality enhancements made to electronic book (e-book) bib records both pre- and post-load. The pre-load work included vetting different collections and requesting customizations based on local practices. The post-load work included deduplication of titles, transferring local information to the batch-loaded records, and establishing overlay protection for the local data fields. However, no attempt was made at authority control for electronic serials (e-serials) because the "records are unstable" (p. 15) -- that is, e-serial bib records are frequently updated and redistributed, making local changes to records less permanent than they are for e-books. Moreover, the batch-loaded bib records were also excluded from the headings reports out of concern that the large number of records would "overwhelm the capacity of the headings reports" (p. 15). Finn (2009) describes pre-load authority work on batches of bib records at Virginia Tech. Their procedures are a mix of outsourced and internal work. First, Library Technologies, Inc. (LTI) edits the batch files to correct certain common errors and creates a report on "unlinked" (that is, uncontrolled) headings. Then library staff use MarcEdit to make changes to the access points in the bib records based on the LTI reports. LTI also supplies ARs for the library to load. Global updates and headings reports in the ILS are used after loading the bib records to cover additional corrections. Martin & Mundle (2010) offer a typology of authority problems (broadly: access issues, load issues, and record quality issues), explain their procedures for dealing with them, and emphasize the usefulness of talking to vendors as a tactic for maintaining quality control. They focus in particular on Springer e-books, a collection that also proved vexing to other consortia. Wu & Mitchell (2010) describe some of the e-book record quality issues the University of Houston Libraries has found and how they are addressed with MarcEdit batch processes, as well as the difficulties posed by changing cataloging standards, particularly the preference for provider-neutral records. Panchyshyn (2013) introduces a procedure for quality control via a checklist. Authority control is managed at Kent State by isolating batches of e-resource bib records from other records (p. 27-28). Like Heinrich (2008), Panchyshyn warns that the costs associated with authority control may make it inadvisable for certain kinds of resources -- in this case, e-resources that will not be held "in perpetuity" (p. 34). Beisler & Kurt (2012) describe a task force charged with resolving issues around e-resources and batch loading workflows, which developed a form similar to Panchyshyn's checklist for managing workflow. However, they say little about quality control and automated authority processing.
David & Thomas (2015) look at the quality of bib records for e-resources. They note that the quality of bib records is especially important for e-resources because these resources cannot be found on the shelf: user browsing and selection take place in the catalog, based mainly on the bibliographic metadata displayed there (p. 802). They focus on the types of errors that occur in access points and the time and cost of correcting them. Their study of user searches confirmed that title, author, and subject fields are the most important access points, both because they are most frequently chosen for single-field searches and because their analysis of keyword searches found that title, author, and subject terms were the three most commonly entered kinds of search terms. Of course, all three of these access points are controlled by ARs, further highlighting the importance of authority work for access. Flynn & Kilkenny (2017) describe dealing with the problem of e-resource bib record quality at the consortial level. Their paper describes the evolving policies and procedures that were put in place to improve record quality in OhioLINK. These are focused on changes to bib records -- some manual and some automated. They also include a helpful review of the literature on vendor record quality and discuss how they worked with various vendors to improve record quality at the source. Van Kleeck, Nakano, Langford, Shelton, Lundgren, & O'Dell (2017) examine record sources, again highlighting the importance of record quality for e-resources. They conclude that OCLC bib records distributed via WorldShare Collection Manager (WCM) are generally equal or superior to the vendor-supplied records from the other sources they examined. The record sources are identified in this study (as opposed to David & Thomas (2015) and Flynn & Kilkenny (2017), who anonymize the vendors and publishers), making this an especially helpful article for librarians developing their own workflows. Their emphasis on record quality (especially authorized access points) underscores the importance of authority control for e-resources, which are primarily accessed through OPACs or discovery layers dependent on these access points. Thompson & Traill (2017) describe a method to check record quality with Python scripts that evaluate quality using a rubric that gives credit for the presence of authorized access points, call numbers, and descriptive fields which affect discovery, such as summaries and contents notes. The records' scores according to the rubric are used to separate records that can be batch loaded from those that will need human intervention to assure completeness and correctness. This project has had the added value of helping compare the relative quality of different sources of bib records, confirming Van Kleeck et al.'s observation that WCM provides better records than most vendors. Tingle & Teeter (2018) describe an effort to make e-resources visible in a fairly literal manner: proxies for titles and topics were placed on the shelves among print resources. The project highlights how significant an issue the discoverability of e-resources remains, but does not particularly address record quality within the catalog.

Automation of authority work at The University of Akron

Even libraries with authority control vendors often find it impractical or not cost-effective to outsource authority work for e-resource bib records.
As discussed in the literature review, e-resources pose a particularly vexing problem because the records are often of low quality, because the records are not expected to remain for long or will be updated with new records at regular intervals, and/or because the sheer number of incoming records can be daunting. At The University of Akron (UA), a large public research university which does not use an authority control vendor, a process was developed that leverages free software, simple database queries, and capabilities already present in the ILS and bibliographic utilities to improve and control the access points in bib records with minimal staff effort, and to retrieve supporting ARs in batches. The process is an example of successful collaboration between librarians with different areas of functional expertise and at different institutions, and we hope our initial successes will inspire other librarians to push themselves to develop skills beyond those traditionally employed within their units.

In developing this method to download batches of ARs to support (and update) headings in incoming bib records, the goal is to automate authority control in-house, especially for batch-loaded bib records for e-resources. Before loading, batches of bib records for e-resources have their access points for names and topics compared to the Library of Congress' Linked Data Service (LDS) via the MarcEdit report "Validate Headings." This report changes headings in bib records that match variant access points for authorities in the LDS to the authorized forms. The bib records are then loaded into Sierra. This triggers headings reports in Sierra. The "Headings used for the first time" report lists entries for headings that are new to the catalog and therefore do not match ARs in the catalog. This report can be queried with SQL to retrieve text strings to search against the authority file in OCLC via a batch process in Connexion, and matching ARs can be downloaded in batches. An earlier version of the process is also described, which involved using a text editor to sort access points by type and then run a series of find/replace operations using regular expressions (regexes; singular: regex) to normalize the access points for batch searching. Some pointers for applying the method in the SirsiDynix Symphony ILS follow. Symphony has a different approach to headings reports than Sierra, but Symphony's reports can still yield usable textual search strings if the report output is processed with a series of regexes similar to those used in the Sierra methods. Statistics collected to track the success rate of the headings validation tool in MarcEdit and the batch searching of ARs based on the SQL queries are provided. The conclusion assesses the cost in staff time versus the benefit in improved access, discusses the lessons learned in this collaboration, and suggests possible refinements and improvements of the process and areas for further exploration.

Pre-load authority work with MarcEdit

At UA, several procedures are followed to improve the quality of bib records before loading them into Sierra. There are two categories of procedures: collection-specific tasks and heading validation. Unlike Virginia Tech's procedures as reported in Finn (2009), this work is carried out entirely in-house.
Most e-resource bib record collections have specific sets of edits that are always applied either before loading (in MarcEdit's MarcEditor program) or during the load (with specialized load tables for the collections). These edits may be local customizations (collocation fields to identify the collection, local call numbers and location codes, etc.), or for a few collections they may be more extensive, such as adding form/genre headings to streaming video collections. For a few collections, recurrent errors that have not been adequately addressed by the record suppliers have their own set of tasks in MarcEditor. The most extreme case is a streaming video collection that has recurring errors in access points, such as incorrect forms of names, qualifiers incorrectly added to corporate body headings, and problems with subdivision coding (missing or improperly coded delimiters and subfield codes). In many cases these edits are made in MarcEdit because the applicable ARs do not have matching variant access points that would enable the ILS to automatically "flip" the access points in the bib records, and because Sierra reports but does not automatically flip variant forms of headings when a bib record is loaded (rather, automated processing is triggered when the ARs are loaded). These pre-load edits make improvements to record quality, but the most dramatic and efficient processing is the second category, utilizing MarcEdit's "Validate Headings" report.

Heading validation in MarcEdit compares access points in bib records to the authorities in the Library of Congress' LDS. Variant headings for names are flipped to the authorized form if there is an exact match. The Validate Headings report is a routine part of the workflow at UA for many batches of records. Because the validation report provides a statistical log of the changed headings, the statistics of each set processed are compiled to determine the relative quality of records from different publishers and whether the time required to run the report is justifiable. It was determined that there was little benefit from running the report on the brief bib records supplied for the discovery layer; on the other hand, some collections benefited significantly, especially those that had been harvested at some point from the Library of Congress or OCLC and which therefore had older forms of headings. Two years of data collection (March 2017-March 2019) demonstrated that of a total of 1,230,195 bib records loaded in batches, 32,249 access points were changed from a variant form to the authorized form. Because the Validate Headings report notes headings in 1xx, 6xx, and 7xx fields separately, it was possible to separately track name access points and topical access points. This was helpful as some sets, such as streaming video, tended to have far more name access points than would be typical of e-books or serials. The results for sets of bib records from different vendors were compared to the results for bib records from OCLC (via WorldShare Collection Manager), with the assumption (supported by Flynn & Kilkenny (2017)) that OCLC records were a reasonable benchmark for acceptable record quality. Using this benchmark, UA only continued using the Validate Headings report for sets that had a rate of correction higher than that of the OCLC record sets.
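This benchmark rule is simple to apply programmatically. The following minimal Python sketch illustrates the comparison only; the set names and counts are hypothetical, and real figures would come from the compiled Validate Headings logs.

import re  # not needed here; shown without dependencies

# Hypothetical (headings changed, records loaded) counts per record set;
# real figures would be compiled from the MarcEdit Validate Headings logs.
results = {
    "OCLC (WCM)": (520, 48000),              # the benchmark set
    "Vendor A streaming video": (310, 9000),
    "Vendor B e-books": (75, 52000),
}

changed, loaded = results["OCLC (WCM)"]
benchmark = changed / loaded

for name, (changed, loaded) in results.items():
    rate = changed / loaded
    verdict = "keep running Validate Headings" if rate > benchmark else "skip the report"
    print(f"{name}: {rate:.2%} corrected -> {verdict}")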
Some selected collections are summarized in table 1, Summary of MarcEdit Validate Headings on selected record sets; the OCLC row is bolded for emphasis as it served as the benchmark for deciding whether the time invested in the report was worthwhile.

[place table 1 here]

Pre-load authority work is particularly beneficial to the workflow because Sierra can identify headings that have not been used before in the catalog ("Headings used for the first time"), which perforce do not have corresponding ARs in the catalog: an AR's 1xx field would constitute a previous use of the heading. But because many of the headings in the bib records have been validated or changed to match existing ARs, it is more likely that ARs corresponding to the headings can be retrieved. The reported "new" headings are supplied in a report within Sierra, which made the next steps -- automated retrieval of ARs in-house -- possible. Two versions of the method are detailed below, because the two slightly different approaches have different strengths and weaknesses.

Post-load authority work with Headings reports

The remainder of this paper describes the development and implementation of a method to accomplish authority control by loading ARs matching headings that have been flagged as "new" to the catalog by headings reports in the ILS. For clarity, the two different versions of the method are referred to as "Alpha" and "Beta." A third process for another ILS is dubbed "Gamma." The method has three components, which will be referred to as a query, processing, and batch searching. The query retrieves data; the processing prepares the data to be batch searched. Batch searching uses the batch processing module in Connexion to retrieve ARs. The query and processing vary in each version of the method, and it is hoped that the discussion of how they developed and how the different methods compare in terms of efficiency and success will be helpful to others adapting the method to their own libraries.

The Alpha method: background and query

The initial project began with the somewhat obvious thought that it would be nice to be able to gather the headings in the "Headings used for the first time" report in Sierra and batch search them in Connexion. See figure 1, Sample "Headings used for the first time" report entry. A cataloger will recognize the MARC field listed as "Field" in the report. Corresponding MARC fields also appear in ARs. The challenge would be collecting the MARC data in a form that could be entered into Connexion searches.

[place figure 1 here]

A colleague in Systems (Susan DiRenzo Ashby, Coordinator, Systems, The University of Akron) identified the location of the report's components in Sierra's database, and another colleague (Michael Dowdell, Systems Administrator, The University of Akron) devised a simple SQL query to collect the MARC fields with the triggering headings. pgAdmin, a user interface for accessing databases, executing SQL queries, and managing the results, is used to run these queries and place the results in a comma-separated values (.csv) file. The .csv file, once its contents are processed (normalized to remove MARC and Sierra codes and tags and potential stop words, operators, or commands), can in turn be entered into Connexion's batch searching tool to retrieve matching ARs. These ARs are ultimately loaded in support of the bib records. Over time, through trial and error and with help from Craig Boman (Discovery Systems Librarian, Miami University), the query was refined.
The Alpha method queries the Sierra database for the terms listed under "Field:" in the report. The SQL query was:

SELECT field
FROM sierra_view.catmaint
WHERE condition_code_num=1
ORDER BY field
;

The SQL query asks for a particular column of data (field) in a particular table (sierra_view.catmaint), where another column in the table (condition_code_num) has a particular value (1). This has exactly the desired effect: the query returns the data labeled "Field:" from all entries in the "Headings used for the first time" report. In the case of the entry depicted in figure 1, the data is:

a1001 |aWolfram, Adolph,|earranger of music,|einstrumentalist

Thus, all of the MARC coding (tags, indicators, subfield delimiters) and also the Sierra field group tag (here, the initial "a") are returned by the query. This data would interfere with a batch search in Connexion, since the Connexion search is querying the WorldCat authority file, which contains only authorized access points and variants. Additional data such as the relator terms in the example ("|earranger of music,|einstrumentalist") also interfere with searching. This problem is addressed later in the "processing" component of this procedure. The field group tag is useful as it distinguishes name headings (tagged "a" for "author" or "b" for "other author") from subject headings (tagged "d"). This is important because Sierra requires separately loaded ARs for names when they are used as name access points (Sierra tag "a" or "b" and MARC tags 1xx or 7xx) or as subjects ("d" and 6xx). The Connexion batch searching tool, on the other hand, requires separate searches for topical headings and name headings. Fortunately it is possible to search the index of Library of Congress (LC) names, which includes personal names, corporate bodies, conferences, and uniform titles (including name/title headings). The possible combinations of tags and headings are laid out in table 2, Headings types in Sierra and WorldCat. The shaded area highlights situations where name headings are used as subject access points.

[place table 2 here]

This is why the SQL script includes the command to ORDER the output BY "field". ORDER BY sorts the data alphanumerically. The fields starting with "a" or "b" will be separated from the "d"s. Furthermore, those starting with "d600" through "d630" will all be grouped together, regardless of the order they appeared in the headings report. Sorting the full fields, with the initial Sierra and MARC tags, effectively groups these different uses of headings. That is, the sorted list is ordered into three groups: name headings used as name access points (or "names-as-names"), name headings used as subject access points (or "names-as-subjects," shown shaded in table 2), and subject headings. (A few other field group tags may also appear in the report, depending on the local settings used, but these too would be gathered by tags.) The three types of headings were then manually "cut and pasted" in a text processing application (in this case, EditPad) into three distinct files to be searched and loaded with slightly different criteria: the names-as-names, which are searched as LC names and loaded as name authorities; the names-as-subjects, which are searched as LC names and loaded as subject headings; and the subject headings, which are searched as LC Subject Headings (LCSH) and loaded as subject headings. For those who would rather script this step than work in pgAdmin and a text editor, a sketch of the query and the three-way sort follows.
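This is a minimal sketch, not the procedure actually practiced at UA (which used pgAdmin and manual cut-and-paste in EditPad). It assumes the Sierra PostgreSQL database is reachable and the psycopg2 driver is installed; the host, port, database name, and credentials are placeholders to be replaced with local values.

import csv
import psycopg2

# Placeholder connection details for the Sierra PostgreSQL database.
conn = psycopg2.connect(host="sierra.example.edu", port=5432, dbname="iii",
                        user="reports_user", password="********")

with conn.cursor() as cur:
    cur.execute("SELECT field FROM sierra_view.catmaint "
                "WHERE condition_code_num=1 ORDER BY field")
    fields = [row[0] for row in cur.fetchall()]
conn.close()

# Sort each field by its Sierra field group tag and MARC tag:
# a/b = names-as-names, d600-d630 = names-as-subjects, d65x = subjects.
# Anything else ("junk" headings, local tags) is left out, as in the
# manual sort.
buckets = {"names.csv": [], "names-as-subjects.csv": [], "subjects.csv": []}
for field in fields:
    if field.startswith(("a", "b")):
        buckets["names.csv"].append(field)
    elif field.startswith(("d60", "d61", "d62", "d63")):
        buckets["names-as-subjects.csv"].append(field)
    elif field.startswith("d65"):
        buckets["subjects.csv"].append(field)

for filename, rows in buckets.items():
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow([row])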
The Alpha method: processing

The Connexion batch searching tool can import text lists of terms to search. The problem remained, though: how to search just the data in the MARC subfields that would be usable in these searches. Returning to the example in figure 1, the goal is to search just the words "Wolfram" and "Adolph," and not the words "a1001", "|aWolfram,", "Adolph,|earranger", "of", and "music,|einstrumentalist", which is how Connexion would parse the field as retrieved. The solution arrived at for the Alpha method was to use a series of regexes to find and delete the extraneous data, most of which is readily identified by MARC codes, and also to strip out punctuation and common stop words and operators that would confound the searches. The stop words and operators can appear in both subject and name headings -- especially name/title or uniform title headings. Consider headings such as "Same-sex divorce," "Actors with disabilities," "Cyrus, the Great, King of Persia, -530 B.C. or 529 B.C.," and "Gone with the wind (Motion picture : 1939)". Words in these headings such as "with," "the," "or," and "and" are interpreted by Connexion as potential operators or stop words, and the punctuation is interpreted as syntax for commands, any of which can interfere with keyword searching. The stop words slow down the batch process, as they are not indexed and waste effort. The operators and command syntax can cause errors that stop the affected searches. Occasionally, some name elements are identical to WorldCat index labels and will not be readily searched as keywords, because the batch process interprets them as commands lacking proper punctuation. For example, the family name "Su" will be interpreted as the label "su" (for the LCSH index of WorldCat's authorities) and regarded as an error, as it is missing the ":" or "=" which would tell Connexion whether it is a keyword or browse search of that index. There is little to be done in such cases, as removing these name elements is unlikely to create a search with just one match. However, the stop words and operators can generally be removed with no loss of precision. A somewhat complicated series of "find/replace" operations using regexes was therefore performed on the separated text files of names and subjects. The complete list of expressions used follows:

1. (.*\|a)
2. (\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. |\|dfl\. ca\. |\|dfl\.)
3. (\|e.*|\|4.*|\|0.*|\|j.*)
4. (\|.)
5. (“|;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so | with | to | by |”|’| be | that |\.{3}| near | same )

The first expression simply selects everything up to, and including, "|a", which is how Sierra represents "subfield a" in the MARC field. So, for the example from figure 1, this selects "a1001 |a". This selection is replaced with nothing; that is, it is simply deleted. The matches for the other expressions are all replaced with a blank space, so that the remaining terms do not run together. This is important because the Sierra database does not store spaces that appear before or after subfield delimiters in the MARC record. The second expression selects commonly occurring AACR2 abbreviations that occur in names with uncertain or incomplete dates. These abbreviations are generally selected in the context of a name heading's subfield d (hence the "\|d" preceding some tokens); other likely contexts are signified in the expression, such as "b. ca.," "-ca." and so on. These abbreviations are likely to occur in the older record sets which some vendors distribute. They may also exist in older bib records in the catalog and appear in the report because of some other edit that was made to the record. The example in figure 1 does not have any such abbreviations, however. The third expression selects relationship terms and identifiers, again including the subfield delimiters themselves. In figure 1, "|earranger of music,|einstrumentalist" would be selected. The fourth expression selects any remaining subfield delimiters and codes, such as subfield q (marking a fuller form of name). The last expression selects a variety of punctuation marks and common stop words and operators. The whole sequence of substitutions is sketched below.
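To make the sequence concrete, here is a minimal Python sketch (the actual work at UA was done with find/replace in a text editor, not a script) that applies the five substitutions, in order, to an exported field; the sample value is the field from figure 1.

import re

# The five Alpha substitutions, in order. The first match is deleted
# outright; the rest are replaced with a space so terms do not run together.
SUBSTITUTIONS = [
    (r'(.*\|a)', ''),
    (r'(\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. |\|dfl\. ca\. |\|dfl\.)', ' '),
    (r'(\|e.*|\|4.*|\|0.*|\|j.*)', ' '),
    (r'(\|.)', ' '),
    (r'(“|;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so '
     r'| with | to | by |”|’| be | that |\.{3}| near | same )', ' '),
]

def normalize(field):
    """Reduce one exported Sierra field to bare search terms."""
    for pattern, replacement in SUBSTITUTIONS:
        field = re.sub(pattern, replacement, field)
    return field.strip()

print(normalize('a1001 |aWolfram, Adolph,|earranger of music,|einstrumentalist'))
# prints "Wolfram  Adolph" (note the doubled space left by the commas)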
Running these find and replace substitutions is not especially time-consuming, but they must be run in order and require some attention to detail. Figure 2 shows two screenshots of some actual Sierra fields output by the Alpha query. On the left is the raw output; on the right is the same screen after processing.

[place figure 2 here]

Batch searching

At this point the data, saved as a plain text file, can be imported into Connexion for batch searching. Names (whether used as names or subjects in Sierra) should be searched with the default index "LC Names" (nw:) selected; subjects, with "LCSH" (su:). The batch was run with a limit of one hit per search. This limitation to a single hit avoids situations where human intervention might be required to decide between two or more similar headings that are partial matches to the entry. Such partial matches might be name/titles that matched just the name, modified name headings that matched an entry with no modifier, and so on. As an example, consider the name "Colombo, Maria" from figure 2. A search of the name authority file for "nw:colombo maria" yields twelve hits for names containing those words, but none exactly match the entry. On closer inspection, none can be identified with the Maria Colombo in UA's catalog anyway, but even if one of the multiple hits were a match, there would be no way to automate selection of the correct heading. Moreover, there is a limit to the number of records that can be stored in a Connexion save file (10,000), and including more than one match would potentially fill the save file before all the terms in the batch are searched. This procedure was carried out at UA for four months in 2017, with queries made about once a week. The reports had a mean of 4412 entries, mostly due to bibliographic batch loads and a simultaneous project of re-loading certain e-resource collection bib records. About 52% of the entries were names-as-names, 6% names-as-subjects, and 42% subjects. The greatest success by far was had searching the names-as-names: 59% of the name-as-name entries returned a unique AR, while just 23% of names-as-subjects and 5% of subjects did the same. The lower success rates for name and topical subject headings can be partly explained by the fact that subdivisions were always included in the authority searches, but ARs established with main headings plus subdivisions are relatively rare. Because the local installation of Sierra was not a version that could ignore subdivisions when creating the headings report, only subdivided ARs would match subdivided headings. So, it made sense to try to find ARs that also have the subdivisions. In August of 2017 the project was put on hold as upgrades to Sierra were planned, and by good fortune another librarian at a conference (Craig Boman, Miami University) suggested a tweak that could eliminate (1) the need to separate the data retrieved in the query and (2) most of the processing.

The Beta method: query
Mr. Boman suggested altering the query to SELECT index_entry rather than SELECT field (C. Boman, personal communication, May 14, 2018). The "index_entry" is the data labeled "Indexed as Author" (or "Indexed as Subject", etc.) in the headings report. In figure 1, this is simply "adolph wolfram." These index entries are ready to batch search, for the most part. Because the UA implementation of Sierra does not index the title part of name/title headings in the author index, there is less need to remove stop words and operators from the names. Punctuation is not present in the index entries either. But of course there remains the issue of separating names-as-names, names-as-subjects, and subjects. This was accomplished with another tweak: a condition was added to the query, based on the prefixes in the field. Instead of running one query and then separating and normalizing the output with the Alpha processing, the separation could be accomplished by running three distinct queries. The resulting data needs less processing, because the MARC coding and punctuation are already absent. Names-as-names were selected with the following query, which exploits a regex in the search; the use of a regex is indicated by the tilde (~) and the expression enclosed in single quotes.

SELECT index_entry
FROM sierra_view.catmaint
WHERE condition_code_num=1 and field ~'^a|^b'
;

The "WHERE" conditional now focuses on fields that begin with an "a" or "b" -- that is, on fields with the index group tag for "name" (a) or "other name" (b). As mentioned above, the "index_entry" will not contain subfield t, so articles and other stop words and operators are less common. Even so, conference names, place names, and uniform titles may occur in these as "names" or "other names," so there may still be some terms that will confound Connexion batch searches. For example, the abbreviation for Oregon ("Or.") will appear in the index_entry as "or", which will be interpreted as an operator in Connexion, and since it is likely to be at the end of a string, it will be an operator with bad syntax. More commonly, corporate or conference names may have words like "the" or "and," and personal names might have an "or" in uncertain dates, or AACR2 abbreviations that might not be recorded in the AR's variant (4xx) fields. Thus, some processing is still carried out. Names-as-subjects are handled similarly, with the following query:

SELECT index_entry
FROM sierra_view.catmaint
WHERE condition_code_num=1 and field ~'^d6[0-3]'
;

Here the conditional selects fields beginning with a "d" (subjects) and the MARC tags 600 through 630. This therefore selects personal names (tag 600), corporate and conference names (tags 610 and 611), or uniform titles (630). In principle a MARC tag 620 could also be selected, but in practice this should not happen because 620 is undefined in MARC21. And topicals are selected with a third query:

SELECT index_entry
FROM sierra_view.catmaint
WHERE condition_code_num=1 and field ~'^d65'
;

Here, any subject (d) tagged 65x is selected. UA's implementation of Sierra tags only MARC fields 650 and 651 as subjects; 653 and 655 are placed in indexes with other tag codes. Because the three queries differ only in their WHERE clause, they are also easy to script, as the sketch below shows.
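As with the Alpha sketch above, this is a hypothetical illustration rather than the procedure used at UA: it assumes psycopg2 and placeholder connection details, and writes one ready-to-process text file per heading type.

import psycopg2

# One output file per heading type, keyed to the regex used in the
# WHERE clause of the corresponding Beta query.
PATTERNS = {
    "names.txt": "^a|^b",                 # names-as-names
    "names-as-subjects.txt": "^d6[0-3]",  # names-as-subjects
    "subjects.txt": "^d65",               # topical subjects
}

# Placeholder connection details for the Sierra PostgreSQL database.
conn = psycopg2.connect(host="sierra.example.edu", port=5432, dbname="iii",
                        user="reports_user", password="********")

for filename, pattern in PATTERNS.items():
    with conn.cursor() as cur:
        cur.execute("SELECT index_entry FROM sierra_view.catmaint "
                    "WHERE condition_code_num=1 AND field ~ %s", (pattern,))
        with open(filename, "w") as f:
            for (entry,) in cur.fetchall():
                f.write(entry + "\n")
conn.close()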
The Beta method: processing

For each query, the output .csv files are opened in EditPad and the stop word expression from the Alpha process (the last in the list above) is used to remove stop words and operators. A minor hiccup was introduced to the process when batches of files processed by a vendor began to be loaded, as these included one or more subfield 0 in MARC 6xx fields. Because UA's implementation of Sierra had not been set up to exclude subfield zero from indexes, the content of the subfield was included in the text. For example, a personal name subject access point for Derrida, Jacques--Criticism and interpretation uses the MARC coding:

600 10|aDerrida, Jacques|0http://id.loc.gov/authorities/names/n79092610|xCriticism and interpretation.|0http://id.loc.gov/authorities/subjects/sh99005576

and appeared in the index as:

derrida jacques http id loc gov authorities names n79092610 criticism and interpretation http id loc gov authorities subjects sh99005576

An additional regex was needed to strip out the content of subfield zero: (http id loc gov authorities subjects sh[\d]+)|(http id loc gov authorities names n[a-z]?[\d]+). In the future, when the subfield zero is excluded from the indexes, it will not be necessary to remove these strings of characters. Thus, for this heading, after running the regexes to remove stop words and operators, and the subfield zero identifiers, the remaining data is:

derrida jacques criticism interpretation

The text file is now ready for import into a Connexion batch search. The Beta method removed a few steps from processing, and was also simpler in the sense that there was no need to cut vast selections of data from a single spreadsheet. This made the Beta method a bit less demanding of attention than the Alpha method. The two Beta clean-up steps are sketched below.
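A minimal Python sketch of the two Beta substitutions, applied to the Derrida example above:

import re

# Stop words and operators (the last expression in the Alpha list).
STOPWORDS = (r'(“|;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for '
             r'| on | so | with | to | by |”|’| be | that |\.{3}| near | same )')
# Subfield zero identifier strings left in the index entries.
IDENTIFIERS = (r'(http id loc gov authorities subjects sh[\d]+)'
               r'|(http id loc gov authorities names n[a-z]?[\d]+)')

def clean(entry):
    entry = re.sub(IDENTIFIERS, ' ', entry)
    entry = re.sub(STOPWORDS, ' ', entry)
    return re.sub(r'\s+', ' ', entry).strip()  # collapse leftover spaces

entry = ('derrida jacques http id loc gov authorities names n79092610 '
         'criticism and interpretation http id loc gov authorities '
         'subjects sh99005576')
print(clean(entry))  # prints: derrida jacques criticism interpretation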
Results

The Alpha method was tested on 91,491 entries in the headings report over a four-month period (April 11, 2017-August 11, 2017). This ultimately yielded 31,891 ARs of all types. The Beta method was tested on a slightly smaller number of entries -- 87,077 -- collected over a six-month period (July 12, 2018-January 9, 2019). The results are summarized in table 3, Alpha and Beta results.

[place table 3 here]

The Alpha processing took a noticeably longer time to perform than the Beta, because the query results had to be sorted and saved into different files; the Beta process, involving just a single regex substitution, could be performed rapidly. However, the majority of the time needed for both versions was simply allowing the batch searching to run, exporting the ARs from Connexion, and loading the ARs into the ILS. Therefore the overall time spent on each method was nearly the same for a given headings report. The difference was more qualitative, as the Alpha method involved more attention to detail in selecting, reformatting, cutting and pasting, and saving data from spreadsheets. Notably, the Alpha query results always included some "junk" headings: non-MARC headings from brief bib records and headings in local 970 tags. The non-MARC headings were added to brief records by staff outside of the cataloging unit and were often incomplete; as they were not intended to be authorized access points, it made no sense to search for matching authorities. The 970 tags had been added to provide access points for the tables of contents of monographs and were in an idiosyncratic format which Sierra's automated authority processing (AAP) could not access. These "junk" headings had to be excluded from the batch processing as well.

To compare success rates, the number of successful AR retrievals is divided by the total number of entries searched in the batch to arrive at a success ratio. Comparing the success ratios for the Alpha and Beta processes, the difference is rather small overall -- about a 35% success rate for the Alpha and 33% for the Beta. Differences emerge when comparing the success ratios for specific types of headings, and the total number of headings of each type. The Alpha data shows a 63% success rate for names, versus 49% in the Beta. The rates for names-as-subjects are closer, and based on smaller sample sizes. The rates for subjects are very small, at 5% and 9% respectively. One would expect less success in subject (and name-as-subject) searches because it is not often the case that extended strings of headings and subdivisions will match an identical and unique AR. Some ILSs will ignore subdivisions when verifying subject headings, but UA's installation of Sierra checks the entire string, including subdivisions. Similarly, name/title headings can pose problems, because NACO practice is not to create an AR for every title, but only for those needing qualifiers or cross references. The issue is that the indexed fields lack the subfield delimiters which would allow subdivisions to be removed before searching. While some subject authority records (SARs) are established with subdivisions, these are a minority of all SARs, and the possible combinations of headings and subdivisions in bib records is vast. In principle one might search the batch of names-as-subjects in the subject index (su:). This would double the time and effort spent searching for names used as subjects, but it may be an avenue worth pursuing in the future.

The most glaring difference -- the difference in success rates for name entries -- may be explained by several factors. First, the Beta process does not provide an easy way to remove AACR2 abbreviations from dates used to qualify names, such as "b." (for born) and "d." (for died). Because these would generally occur after a subfield d in the MARC field, the second regex in the Alpha could identify and remove them. But Beta selects indexed entries rather than the full MARC, so "b." and "d." in the names could be AACR2 abbreviations or they could be initials. It may prove helpful to devise a regex that will remove such abbreviations when occurring near numbers as a workaround. Secondly, the Alpha and Beta tests were not undertaken simultaneously. Because the Beta test was run later, it would likely be checking headings that do not have corresponding ARs: the entries in the later "Headings used for the first time" report would be less likely to have corresponding ARs simply because they were already being compared to a more robust authority file in the ILS, due to the ARs already loaded from the Alpha method. Ideally, the two methods should be compared using the same day's headings report. Thirdly, there is the simple fact that the bib records loaded during the two test periods were different. This would be impossible to completely account for in principle, as different staff and faculty were loading different sets of bib records for different purposes in the normal course of the library's operation. The higher success rate for name headings in the Alpha method is a problem requiring more investigation to explain. All in all, there were far too many variables in the MARC ecosystem of a functioning ILS to make a truly controlled comparison. Another complicating factor is that the second set of entries, which were used to test the second version of the process, had relatively fewer subject entries overall. This accounts for the similar "overall" success rates (35 and 33%) despite the Alpha process seeing significantly more successful name searches.
This increases the suspicion that the difference in success rates owes more to the different bib records loaded than to the processes themselves. Thus, it was clear that a direct comparison of the methods was in order.

Alpha and Beta head-to-head

A more direct and meaningful comparison would be to run the two processes against a single headings report and compare the results. This comparison was made by allowing the "Headings used for the first time" report to accumulate for several weeks until there were 21,488 entries. Then both the Alpha and Beta methods were tested, with a stopwatch running to determine the exact amount of time the queries and processing took, beginning with opening the pgAdmin tool and stopping when the three files (names, names-as-subjects, and subjects) were saved. The results confirmed that the Beta method was considerably faster: the Alpha method took fourteen minutes and two seconds, while the Beta method took five minutes and forty-nine seconds. So, the Beta process clearly has the advantage in terms of time and effort. In the single headings report, the Alpha and Beta queries yielded similar but slightly different counts for the total number of entries in each category. These are summarized in table 4, Search strings retrieved by the Alpha and Beta queries.

[place table 4 here]

The totals were reasonably close, but were not exactly the same. This discrepancy could be explained by two factors. First, some non-MARC entries from brief bib records made their way into the Alpha list. These still begin with an index tag of "a", so the Alpha query selects them along with the MARC fields. The non-MARC fields were obvious in the Alpha results and were omitted during the sorting. This would necessarily leave the Alpha lists shorter. But another unavoidable factor was that fourteen minutes had elapsed between the Alpha and Beta queries, so in effect the Beta query was querying a slightly larger report. Checking the report again after these tests revealed that another 40 entries had been added to the "Headings used for the first time" report since running the Alpha query. This small difference in total hits is tolerated as insignificant. Running the two batches of results in Connexion yielded very similar results. The Alpha process had 4345 successful searches, while the Beta had 4348. De-duplicating the results of each batch reduced the hits to 4338 and 4341, respectively. Moreover, comparing the two sets to each other showed that the Alpha batch had 36 ARs not in the Beta batch, and there were 39 ARs in the Beta but not in the Alpha. Examination of the two sets of ARs did reveal some patterns to the discrepancies, which fell into two classes: conference name authorities and name/title authorities.

Conference headings were particularly problematic for both methods. Alpha returned the AR n50062132 (International Wheat Genetics Symposium) but the Beta did not. This can be accounted for by the fact that the MARC field which triggered the entry in the report was:

a1112 |aInternational Wheat Genetics Symposium|0http://id.loc.gov/authorities/names/n50062132|n(12th :|d2013 :|cYokohama-shi, Japan)

Note that there is a subfield zero embedded in the heading. This is an artifact of an authority vendor's processing of the record for the consortium that provides the e-resource bib records. The expression (\|e.*|\|4.*|\|0.*|\|j.*), which was used to trim relator terms and URIs from fields, removed everything following the subfield zero.
Thus the Alpha method batch searched only the portion in subfield a, while the Beta batch searched the entire conference heading, including the specific numbering, year, and place. Because this specific meeting was not established separately, there was no matching authority to return. A case might be made for wanting to retrieve the general conference name AR, even if it does not match a specific index entry, much as one might retain a topical AR for subjects that only appear further subdivided in the indexes. On the other hand, the Beta method was able to retrieve the conference authority n 86042368 (Palestine Arab Congress. Executive Committee), while Alpha did not. In this case, the conference name uses subfield e for the subordinate unit (Executive Committee). In most 1xx and 7xx tags, subfield e is used for relator terms and is therefore removed from the fields in the Alpha process. But because it is used for part of the name in 111 and 711 fields, removing subfield e creates a less specific search, and the batch process, which accepts only single matches, rejected the multiple matches in the authority file. In this case, the unmodified Palestine Arab Congress and a specific meeting in 1921 were also established, making the Alpha version of the search find three matches. Of course, since only single hits are retained, none of these were retrieved. It should be possible to further improve the queries to retrieve conference headings separately, so that they can be processed differently with revised regex substitutions and searched separately. This may be a project for the future.

In the case of name/title entries, there is a difference between how Sierra indexes name/title combinations and how they appear in the MARC fields. The MARC fields may be a single line, as in the case of 7xx fields with names in subfield a (and possibly qualifiers in b, c, d, and q) and titles in subfield t (and possibly qualifiers in l, s, etc.), or they may appear in two fields (1xx + 240). But Sierra only reports on names when a new name/title 7xx is added to the catalog. Therefore, the index entry retrieved by the Beta SQL query might be:

furman james 1937 1989

while the MARC field retrieved by the Alpha SQL query on the same entry is:

b70012|aFurman, James,|d1937-1989.|tHehlehlooyuh.

When the Alpha batch searches for "Furman James 1937 1989 Hehlehlooyuh," it will not find a match, but the Beta batch search for just the name will. It would be possible to adjust the substitution regexes to remove subfield t (and anything following it) for the Alpha processing, for example with an expression such as (\|t.*), but this would be a two-edged sword: it would avoid missing some retrievable names, but it would also be unable to retrieve name/titles. This is ultimately a special case of the recall versus precision problem. Precision was favored in this case, and subfield t retained.

At this point it would seem that the two processes are quite comparable in effectiveness. They have different strengths: Alpha can be a bit more precise; Beta is a bit less time-consuming. Which is more suitable for use at a given library will depend on the resources that can be devoted to implementing and possibly improving them. As a final test, the ARs from each set were loaded in "test" mode, so that counts of overlays (that is, ARs which are already in the catalog) and inserts (ARs new to the catalog) were reported without actually loading the ARs. The Alpha file had 746 overlays and 3592 inserts. The Beta had 749 overlays and 3592 inserts.
Overlays would generally be "harmless" in the sense that, at worst, they duplicated records already in Sierra. They might beneficially update an existing record, if the iteration in the catalog was out of date (pre-RDA forms, open dates, or changes to names). But inserts are generally the goal of this process. One possible refinement would be to compare the relevance of the search results in terms of how many "blind references" the different methods produce. It is obviously likely that some of the successful batch searches will be "false hits" -- matches that are only partial and/or refer to different names or topics than the heading in use. The ARs thus retrieved will become "blind references," authorities that do not support any headings in bib records. Such blind references are normally suppressed or deleted in regular database maintenance. Anticipating this problem, at UA a MARC 910 field is inserted into all the ARs retrieved by the batch searches to identify the records as batch-loaded rather than manually added. This allowed us to select blind references originating from these batches in the "Blind references" headings report for summary deletion.

The Alpha and Beta processes can help Sierra libraries, but is this in-sourcing approach applicable to other ILSs? The answer is yes, provided the ILS has some mechanism for reporting unauthorized or uncontrolled headings.

Gamma method for SirsiDynix Symphony

Another opportunity suggested itself with the SirsiDynix Symphony ILS. Symphony has a report that will export a text file identifying "unauthorized headings". These are, like the headings in the Sierra report already discussed, headings that do not match any ARs in the system. Because the type of headings (names, topical, etc.) and even the MARC tags involved (100, 700, etc.) can be preselected before running the report, no SQL query or sorting is required. Moreover, name ARs can control both name access points and subject access points, so there is no need to search and load names-as-subjects separately from names. Figure 3 below illustrates part of one such report's output. Note that the report was run with the options to "format report" and "view log" unchecked -- leaving these options checked produces a less usable report with line breaks and page breaks that complicate normalization.

[place figure 3 here]

In the above sample, a few differences from the SQL query output will be evident. First, there is some header information in the first few lines. These lines are generated by the system, and can simply be selected and deleted manually in the text editor. Secondly, diacritics in this report are displayed as an additional character, typically a character with a diacritic of its own. For example, Abū Dāʼūd Sulaymān ibn al-Ashʻath al-Sijistānī is displayed here as:

Abåu Dåa®åud Sulaymåan ibn al-Ash°ath al-Sijiståanåi

The additional characters precede the characters that should have diacritics applied. This is a character encoding issue, as the Unicode encoding does not translate correctly into the output. Third, a subfield "?" with the term "UNAUTHORIZED" is appended to each line. These appear in the staff view of bib records in Symphony as well. Finally, each line is preceded by a number indicating the number of occurrences of the heading in the database. Because these counts come before the MARC tags, it is important that the report be run for each type of heading (1xx/600/630/7xx names and 65x topical headings).
As noted, because Symphony allows NARs to authorize both name and subject uses of a name, there is no need to segregate names-as-subjects from names. The other issues require some simple alterations to the regexes used to normalize the data in the Alpha processing. The first expression will remove the counts along with everything else preceding subfield a, and is fine as it is. Changing the third expression to (\|e.*|\|4.*|\|0.*|\|\?.*) will remove the "|?UNAUTHORIZED" along with relator terms, URIs, and the like. Removing the characters standing in for diacritics is a bit more complicated, but doable. After all the other MARC coding has been stripped out, the expression [^a-zA-Z0-9- \x00-\x1F\x7F] will select all the special characters standing in for diacritics. (The expression matches characters that are NOT letters, numbers, a dash, a space, or special characters like line breaks.) These are to be replaced with nothing (i.e., not a blank space). The rest of the processing works as in the Alpha method: because the subfield delimiter in both Sierra and Symphony is a "pipe," the other regexes in the Alpha processing will work the same way. Figure 4 shows the same entries from the sample report after running the following regex find/replace substitutions. Matches for the first and last expressions should be replaced with nothing rather than a blank space.

1. (.*\|a)
2. (\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. |\|dfl\. ca\. |\|dfl\.)
3. (\|e.*|\|4.*|\|0.*|\|j.*|\|\?.*)
4. (\|.|\|$)
5. ("|;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so | with | to | by |”|’| be | that |\.{3}| near | same |\.)
6. [^a-zA-Z0-9- \x00-\x1F\x7F]

It would be reasonable to expect success rates similar to those of the Alpha method for Sierra, though this was not tested.

Future directions & further study

The initial success using the headings reports and batch processes to add ARs to the catalog has been encouraging. As the method has been tested by other librarians, additional tweaks and refinements to the query have been suggested. For example, David Green (Infrastructure Specialist, The State Library of Ohio; personal communication, May 23, 2019) suggested working deduplication, processing, and removal of excess white space into the query by changing the three Beta queries to use the following first line:

SELECT DISTINCT trim(regexp_replace(index_entry, '(“|;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so | with | to | by |”|’| be | that |\.{3}| near | same | \s+ )', ' ', 'g'))

Only unique ("DISTINCT") entries are selected, and the regex substitution (with an additional term to replace multiple consecutive blank spaces) is applied to the output. A similar query, or set of queries, might be devised to speed up the Alpha method. Indeed, this train of thought further suggests moving all processing out of the text editor and into a batch of command-line commands, to further streamline the Gamma method as well -- an exercise perhaps for those with more advanced scripting skills than the present author (a first step in that direction is sketched below). The ongoing development of these methods through collaboration among librarians in different functional areas and at different institutions has been gratifying and promises further refinements. Looking ahead, further manipulations such as removing subdivisions from subject entries should improve success rates.
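As a small step toward that command-line processing, the six Gamma substitutions above can be chained in a short script. The following is a minimal sketch, assuming Python and a hypothetical file name for the saved report (with the header lines already deleted); re.sub plays the role of the text editor's replace-all:

import re

# The six Gamma find/replace substitutions, applied in order.
# Expressions 1 and 6 are replaced with nothing; 2-5 with a blank space.
SUBSTITUTIONS = [
    (r'(.*\|a)', ''),
    (r'(\|db\. ca\. |\|db\. |\|d\. ca\.|\|dd\. |\|dca\. |-ca\. |\|dfl\. ca\. |\|dfl\.)', ' '),
    (r'(\|e.*|\|4.*|\|0.*|\|j.*|\|\?.*)', ' '),
    (r'(\|.|\|$)', ' '),
    (r'("|;|:|\(|\)|\?| and | or |&c\.|&| in | an |,| the | for | on | so | with | to | by |”|’| be | that |\.{3}| near | same |\.)', ' '),
    (r'[^a-zA-Z0-9- \x00-\x1F\x7F]', ''),
]

def normalize(line):
    # Turn one report line into a search string for the Connexion batch.
    for pattern, replacement in SUBSTITUTIONS:
        line = re.sub(pattern, replacement, line)
    return ' '.join(line.split())  # collapse leftover runs of spaces

# Hypothetical file name for the exported "unauthorized headings" report.
with open('unauthorized_headings.txt', encoding='utf-8') as report:
    for line in report:
        term = normalize(line.rstrip('\n'))
        if term:
            print(term)

The final join collapses any runs of spaces the substitutions leave behind, playing the same role as the trim() and the extra \s+ term in Green's suggested query.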
The detailed results log from the batch searches may also be worth examining to identify headings that should be checked manually when time or staffing permits. Moreover, there is likely more that can be accomplished with other headings reports in Sierra (and other ILSs). For example, the Sierra "Near matches" report, which identifies entries that are partial matches to ARs, could be used to identify ARs that may need to be checked against the authority file (either the Library of Congress Name Authority File or OCLC) for updates. It may also be practical to use a SQL query to extract the "Correct heading is:" entries from the "Invalid headings" report, which notes fields in bib records that match variant forms in ARs. Batch searching the "Correct heading is:" entries would be a way to confirm that the ARs in the catalog are current, and re-loading them would trigger Sierra's AAP (at least in those Sierra implementations that have this feature turned on). Further study is also warranted to compare the effectiveness of this method with that achieved by different vendors, in terms of the number of headings correctly matched to ARs.

References

Aschmann, A. (2002). The lowdown on automated vendor supplied authority control. Technical Services Quarterly, 20(3), 33-44. DOI: 10.1300/J124v20n03_03

Beisler, A., & Kurt, L. (2012). E-book workflow from inquiry to access: Facing the challenges to implementing e-book access at the University of Nevada, Reno. Collaborative Librarianship, 4(3), 96-116.

Carrasco, R. C., Serrano, A., & Castillo-Buergo, R. (2016). A parser for authority control of author names in bibliographic records. Information Processing and Management, 52(5), 753-764. DOI: 10.1016/j.ipm.2016.02.002

Cook, D. (2014). Metadata management on a budget. Feliciter, 60(2), 24-29.

David, R. H., & Thomas, D. (2015). Assessing metadata and controlling quality in scholarly ebooks. Cataloging & Classification Quarterly, 53(7), 801-824. DOI: 10.1080/01639374.2015.1018397

Dong, E., Glerum, M. A., & Fenichel, E. (2017). Using automation and batch processing to remediate duplicate series data in a shared bibliographic catalog. Library Resources & Technical Services, 61(3), 143-161. Retrieved from https://journals.ala.org/index.php/lrts/article/view/6395/8442

Finn, M. (2009). Batch-load authority control cleanup using MarcEdit and LTI. Technical Services Quarterly, 26(1), 44-50. DOI: 10.1080/07317130802225605

Flynn, E. A., & Kilkenny, E. (2017). Cataloging from the center: Improving e-book cataloging on a consortial level. Cataloging & Classification Quarterly, 55(7-8), 630-643. DOI: 10.1080/01639374.2017.1358787

Heinrich, H. (2008). Navigating the currents of vendor-supplied cataloging. IFLA Conference Proceedings, 1-18.

Jackson, R. V. (2003). Authority control is alive and...well? OLA Quarterly, 9(1), 9-12. DOI: 10.7710/1093-7374.1636

Kreyche, M., Lisius, P. H., & Park, A. (2010). The DeathFlip project: Automating death date revisions to name headings in bibliographic records. Cataloging & Classification Quarterly, 48(8), 684-695. DOI: 10.1080/01639374.2010.497721

Mak, L. (2013). Coping with the storm: Automating name authority record updates and bibliographic file maintenance. OCLC Systems & Services, 29(4), 235-245. DOI: 10.1108/OCLC-02-2013-0006

Martin, K. E., & Mundle, K. (2010). Cataloging e-books and vendor records: A case study at the University of Illinois at Chicago. Library Resources & Technical Services, 54(4), 227-237. DOI: 10.5860/lrts.54n4.227
Panchyshyn, R. S. (2013). Asking the right questions: An e-resource checklist for documenting cataloging decisions for batch cataloging projects. Technical Services Quarterly, 30(1), 15-37. DOI: 10.1080/07317131.2013.735951

Park, A. L. (1992). Automated authority control: Making the transition. Special Libraries, 83(2), 75-85.

Sanchez, E., Fatout, L., Howser, A., & Vance, C. (2006). Cleanup of NetLibrary cataloging records: A methodical front-end process. Technical Services Quarterly, 23(4), 51-71. DOI: 10.1300/J124v23n04_04

Snow, K. (2017). Defining, assessing, and rethinking quality cataloging. Cataloging & Classification Quarterly, 55(7-8), 438-455. DOI: 10.1080/01639374.2017.1350774

Thompson, K., & Traill, S. (2017). Leveraging Python to improve ebook metadata selection, ingest, and management. Code4Lib, (38), 1-17.

Tingle, N., & Teeter, K. (2018). Browsing the intangible: Does visibility lead to increased use? Technical Services Quarterly, 35(2), 164-174. DOI: 10.1080/07317131.2018.1422884

Tsui, S. L., & Hinders, C. F. (1999). Cost-effectiveness and benefits of outsourcing authority control. Cataloging & Classification Quarterly, 26(4), 43-61. DOI: 10.1300/J104v26n04_04

Van Kleeck, D., Nakano, H., Langford, G., Shelton, T., Lundgren, J., & O'Dell, A. J. (2017). Managing bibliographic data quality for electronic resources. Cataloging & Classification Quarterly, 55(7/8), 560-577. DOI: 10.1080/01639374.2017.1350777

Vellucci, S. L. (2004). Commercial services for providing authority control: Outsourcing the process. Cataloging & Classification Quarterly, 39(1/2), 443-456.

Williams, H. (2010). Cleaning up the catalogue. Library & Information Update, (Jan/Feb), 46-48.

Wolf, S. (2019). Automating the authority control process. Presented at the Ohio Valley Group of Technical Services Librarians Annual Conference 2019. Retrieved from https://uknowledge.uky.edu/ovgtsl2019/conf/schedule/17/

Wu, A., & Mitchell, A. M. (2010). Mass management of e-book catalog records: Approaches, challenges, and solutions. Library Resources & Technical Services, 54(3), 164-174. DOI: 10.5860/lrts.54n3.164

Zhu, L., & von Seggern, M. (2005). Vendor-supplied authority control: Some realistic expectations. Technical Services Quarterly, 23(2), 49-65. DOI: 10.1300/J124v23n02_04

Ziso, Y., LeVan, R., & Morgan, E. L. (2010). Querying OCLC Web Services for name, subject, and ISBN. Code4Lib, (9), 1-8.

Table 1. Summary of MarcEdit Validate Headings on selected record sets

Record source | Bibliographic records loaded | Number of headings corrected | Corrections per title | % 1xx and 7xx corrected | % 6xx corrected
EBSCO* | 158,106 | 1,108 | .007008 | .002414 | .000467
OCLC WCM | 209,777 | 5,468 | .026066 | .008498 | .000903
Kanopy*** | 461,562** | 20,447 | .044300 | .01123 | .01346
Films on Demand*** | 10,064 | 1,273 | .126490 | .056359 | .000162
Alexander Street Press | 5,304 | 394 | .074284 | .012495 | .008153

*EBSCO discovery layer records. These were often brief records with few or no access points, accounting for the relatively small number of corrections.
**Kanopy records were routinely re-loaded as a collection, at the vendor's recommendation, as corrections or changes to records were continuous. UA's set of Kanopy records was less than 20,000 titles, but the set was reloaded in its entirety monthly.
***Kanopy and Films on Demand records were also pre-edited with MarcEdit tasks that addressed certain recurring errors, as mentioned above in the text. This somewhat decreased the overall number of changes made by the Validate Headings report, but the rates of correction are nonetheless still greater than the OCLC benchmarks.

Table 2. Heading types in Sierra and WorldCat

Tagging prefix | Type of authority | Sierra index | WorldCat authority index | Sierra load table
a100 | Personal name | Author | LC Names | Name authority
a110 | Corporate body | Author | LC Names | Name authority
a111 | Conference name | Author | LC Names | Name authority
b700 | Personal name | Other author | LC Names | Name authority
b710 | Corporate body | Other author | LC Names | Name authority
b711 | Conference name | Other author | LC Names | Name authority
d600 | Personal name | Subject | LC Names | Subject authority
d610 | Corporate body | Subject | LC Names | Subject authority
d611 | Conference name | Subject | LC Names | Subject authority
d630 | Uniform title | Subject | LC Names | Subject authority
d650 | Subject | Subject | LCSH | Subject authority
d651 | Geographic name | Subject | LCSH | Subject authority

Table 3. Alpha and Beta results

 | Names | Names-as-subjects | Subjects | Total
Alpha query | 44410 | 5774 | 41307 | 91491
Alpha ARs retrieved | 27935 | 1753 | 2203 | 31891
Alpha success rate (ARs/entries) | .629025 | .303602 | .053332 | .34857
Beta query | 49570 | 5916 | 31591 | 87077
Beta ARs retrieved | 24424 | 1657 | 2893 | 28974
Beta success rate (ARs/entries) | .492717 | .280088 | .091577 | .33274

Table 4. Search strings retrieved by the Alpha and Beta queries

 | Alpha | Beta
Names | 10771 | 10788
Names-as-subjects | 1403 | 1408
Subjects | 9320 | 9339

Figure 1.

Field: b7001 |aAdolph, Wolfram,|earranger of music,|einstrumentalist
Indexed as AUTHOR: adolph wolfram
Preceded by "a": adolph vincent r
Followed by "a": adolphe bruce
From: b6097185x Bach, Johann Sebastian, 1685-1750, composer Rèveries

Figure 2.

a1001 |aChough, Sung Kwun,|d1985-
a1001 |aChua, Hui Tong,|eauthor
a1001 |aChubb, Kit,|d1936-
a1001 |aCohen, Louis H.|q(Louis Harold),|d1906-|eauthor
a1001 |aColombo, Maria,|eauthor
a1001 |aCranburne, Charles,|d-1696,|edefendant
a1001 |aCrozier, C. W.,|d1807?-|eauthor

Chough Sung Kwun 1985-
Chua Hui Tong
Chubb Kit 1936-
Cohen Louis H Louis Harold 1906-
Colombo Maria
Cranburne Charles -1696
Crozier C W 1807 -

Figure 3.

.folddata
.report
.report
.title $(14810)
.end
.subtitle $(14180)Wed May 1 10:02:02 2019
.end
.footing r
.end
1 100: 0 : |a'ãAolåi,|eauthor.|?UNAUTHORIZED
1 100: 0 : |aA mi.|?UNAUTHORIZED
1 100: 0 : |aAQ,|eauthor.|?UNAUTHORIZED
1 100: 0 : |aAbraham bar Hiyya Savasorda,|dapproximately 1065-approximately 1136.|?UNAUTHORIZED
1 100: 0 : |aAbram,|cder Tate,|d1874-1962.|?UNAUTHORIZED
1 100: 0 : |aAbåu Dåa®åud Sulaymåan ibn al-Ash°ath al-Sijiståanåi,|d817 or 818-889.|?UNAUTHORIZED
1 100: 0 : |aAbåu Nuwåas,|dapproximately 756-approximately 810.|?UNAUTHORIZED
2 100: 0 : |aAbåu al-Faraj al-Iòsbahåanåi,|d897 or 898-967.|?UNAUTHORIZED
1 100: 0 : |aAbåu °Ubayd al-Qåasim ibn Sallåam,|dapproximately 773-approximately 837,|eauthor.|?UNAUTHORIZED
2 100: 0 : |aAbåu °Ubayd al-Qåasim ibn Sallåam,|dapproximately 773-approximately 837.|?UNAUTHORIZED
1 100: 0 : |aAce Hood,|d1988-|4prf|?UNAUTHORIZED
1 100: 0 : |aAce Hood.|4prf|?UNAUTHORIZED
1 100: 0 : |aAce Hood.|?UNAUTHORIZED
1 100: 0 : |aAce.|?UNAUTHORIZED
1 100: 0 : |aAchad,|cFrater,|d1886-|?UNAUTHORIZED
1 100: 0 : |aAcharya Shunya,|eauthor.|?UNAUTHORIZED
1 100: 0 : |aAchdâe,|d1961-|?UNAUTHORIZED
1 100: 0 : |aAding,|d1972-|eauthor.|?UNAUTHORIZED

Figure 4.
Aoli
A mi
AQ
Abraham bar Hiyya Savasorda approximately 1065-approximately 1136
Abram der Tate 1874-1962
Abu Daud Sulayman ibn al-Ashth al-Sijistani 817 818-889
Abu Nuwas approximately 756-approximately 810
Abu al-Faraj al-Isbahani 897 898-967
Abu Ubayd al-Qasim ibn Sallam approximately 773-approximately 837
Abu Ubayd al-Qasim ibn Sallam approximately 773-approximately 837
Ace Hood 1988-
Ace Hood
Ace Hood
Ace
Achad Frater 1886-
Acharya Shunya
Achde 1961-
Ading 1972-

List of figure captions

Figure 1. Sample "Headings used for the first time" report entry
Figure 2. Sierra fields before and after Alpha processing
Figure 3. SirsiDynix Symphony "Unauthorized Headings" report
Figure 4. Processed "Unauthorized Headings" report