Qualitative and Quantitative Methods in Libraries (QQML) 6: 355-370, 2017
Received: 21.5.2017; Accepted: 2.9.2017
ISSN 2241-1925 © ISAST

Identity and Access Management for Libraries

Qiang Jin (1) and Deren Kudeki (2)

1 Senior Coordinating Cataloger and Authority Control Team Leader, University Library, University of Illinois at Urbana-Champaign, USA
2 Visiting Research Programmer, School of Information Science, University of Illinois at Urbana-Champaign, USA

Abstract. Linked open data will change libraries dramatically, reshaping how metadata are designed and how they are displayed on the web. To prepare for linked open data, libraries will gradually transition from authority control, which creates text strings, to identity and access management, which uses identifiers to select a single identity. A key step in this move toward linked open data is correcting the thousands of incorrect name and subject access points in our online catalogs. This article describes a case study of cleaning up unauthorized access points for personal and corporate names in the University of Illinois Library online catalog. The authors hope that it will help readers at other libraries prepare for the linked open data environment.

Keywords: Authority maintenance, controlled vocabularies, identity and access management, linked open data

1. Introduction

Linked Data uses the Web to connect data, information, and knowledge on the Semantic Web through URIs and RDF. (1) It lets library users search a vast range of local and remote content through a single point of entry, across a comprehensive index of a library's collections. Authority control is the area of the Linked Data transition that has caused the most concern (2); it is critical when we group works by author using URIs in the linked data environment. Unfortunately, thousands of incorrect personal names, corporate names, and subjects in our online catalogs hinder users from finding library resources.

Authority control is the process of selecting one form of name or title from among available choices and recording it, the alternatives, and the data sources used in the process. It is essential for effective retrieval of resources, and it provides consistency in the form of access points used to identify persons, families, corporate bodies, and works. (3) Authority control is central to the organization of information.

Authority control has gone through many changes over the last hundred years or so. One major new concept emerged in 2010 after IFLA created Functional Requirements for Authority Data (FRAD): A Conceptual Model (4) and Functional Requirements for Subject Authority Data (FRSAD): A Conceptual Model (5). These two models analyze the entities person, family, corporate body, work, expression, manifestation, item, concept, object, event, and place, together with their relationships. They describe the entities of highest significance, the attributes of each entity, and the relationships among entities with regard to user needs. FRAD helps catalogers rethink how catalogs should function and establish standards. In 2010, Resource Description & Access (RDA) (6), a new international cataloging code, was released, adopting FRAD and FRSAD. Two other major new concepts have emerged in libraries in recent years: one is the Bibliographic Framework Initiative (BIBFRAME), and the other is Schema.org.
In 2012, the Library of Congress (LC) released the BIBFRAME model, a linked data alternative to MARC developed by Zepheira, a data management company. BIBFRAME is expressed in the Resource Description Framework (RDF) (7) and serves as a general model for expressing and connecting bibliographic data. It is set to replace the MARC 21 standards, using linked data principles to make library data discoverable on the Web while preserving a robust data exchange that supports resource sharing and resource discovery. In 2016, LC released the BIBFRAME 2.0 model and vocabulary. The other major new concept is Schema.org, an initiative launched in 2011 by Bing, Google, and Yahoo to create and support a common set of schemas for structured data markup on web pages. In 2012, OCLC took the first step toward adding linked data to WorldCat by appending Schema.org descriptive markup to WorldCat.org pages, making rich library bibliographic and authority data available on the Web. (8)

2. Background Information

The University of Illinois at Urbana-Champaign has about 44,087 undergraduate and graduate students and 2,548 faculty. The University of Illinois at Urbana-Champaign (UIUC) Library is one of the largest libraries in North America. Its online catalog holds more than thirteen million volumes and 24 million items and materials in all formats, languages, and subjects, including 9 million microforms, 120,000 serials, 148,000 audio recordings, over 930,000 audiovisual materials, over 280,000 electronic books, 12,000 films, and 650,000 maps. (9)

Although many large research libraries routinely do authority maintenance work, the UIUC Library had done no systematic authority work for several decades. To prepare for the linked data environment, the UIUC Library decided in 2015, given its limited budget, to do authority maintenance work locally and correct an estimated one-million-plus incorrect personal names, corporate names, geographic names, series titles, and Library of Congress subject headings in our online catalog. Because of several years of budget cuts, staff cutbacks, and the expanding need for professional librarians to work on digital collections, resources for maintaining controlled vocabularies have become increasingly constrained. Cleaning up controlled vocabularies is clearly critical to the success of linked data, since they are the basis for the URIs that create linkages. The goal is enhanced discovery of library data, bringing together comprehensive collections of content with indexes for deeper searches across millions of unique descriptive data components for books, images, microforms, etc.

In 2015, a small team was formed at the UIUC Library, consisting of one tenured librarian, one academic hourly, and two graduate assistants, and it started on authority maintenance. The team discussed the options and decided to begin by fixing personal and corporate names. Even though the Library has over 13 million volumes, there are around 8 million bibliographic records in our online catalog. The team ran reports using SQL queries (10) to find "see" references for personal and corporate names in our bibliographic records, because a major part of the problem with our personal and corporate names falls into this category.
We separated the results of the queries into csv files holding up to 5,000 broken names each to run through the script we created locally. We solved various issues with the script before we were able to run all the files successfully. For each "see" personal or corporate name in a bibliographic record, the script looks for an authorized access point in the Library of Congress Authority File and in WorldCat. If a "see" reference matches an authorized access point and WorldCat also lists that access point, we consider it a match. If no good access point is found, or WorldCat does not list the access point, we save that name for future human review. Out of nearly 8 million bibliographic records in our online catalog, we corrected around 300,000 personal and corporate names by machine, but over 100,000 personal and corporate names still need people to go over them one at a time, checking the Library of Congress Authority File and correcting them in our online catalog. This caution is necessary because our catalog is so comprehensive.

3. Literature Review

Many experts in the cataloging field have stated the importance of authority control for decades. Michael Gorman indicates that authority control is central and vital to the activities we call cataloging. (11) Tillett says that authority control is necessary for meeting the catalog's objectives of enabling users to find the works of an author and to collocate all works of a person or corporate body. (12) Hillmann, Marker, and Brady mention that the basic goals of a controlled vocabulary are to "eliminate or reduce ambiguity; control the use of synonyms; establish formal relationships among terms and test and validate terms." (13)

For decades, many libraries have either done authority maintenance in-house or hired commercial vendors to do it. Some libraries have done in-house authority maintenance work because of the small scale of their catalogs or because of their budgets. For example, the Wichita State University Libraries did authority maintenance work locally. They designed their workflow and redesigned their cataloging staff structure for the work, and at the end of the project they felt they needed their library administration's continuing commitment to allocate more staff in order to carry it on. (14) Commercial vendors usually offer several types of authority control services, including authority work after bibliographic records are cataloged, retrospective cleanup after (or as part of) a conversion project, and ongoing authority control. (15) Not only have vendors helped libraries do authority work for their print collections for decades, they have also started to do authority work for library digital collections in non-MARC formats. In 2013, Backstage Library Works did authority work for the University of Utah Library's digital collection in non-MARC formats. Their project demonstrates that it is possible to complete major updates to records, bringing them in line with the authorized terms in commonly used controlled vocabularies, without a large amount of manual work. (16)

4. Initial Resources

The goal of this project is to replace variant name access points in our bibliographic records with their authorized forms.
We are able to detect variant name access points through an SQL query that selects name access points labeled "s" in our database, meaning the access point is a "see" reference with linked bibliographic records. While this provides a large number of access points to fix, it is worth noting that only unauthorized name access points labeled as "see" references are addressed by this process; a name access point that is unauthorized but not so labeled is ignored entirely. We chose "see" references with linked bibliographic records because a large percentage of our incorrect personal names, corporate names, and subject headings belong to this category, which lists name access points from our catalog that match references in authority records on the text string alone. The query we used comes from the Consortium of Academic and Research Libraries in Illinois (CARLI), of which we are a member; the consortium consists of 134 member libraries in Illinois.

This query was used to determine the scope of the project by running it on 8 blocks of 1,000 bibliographic records, a sample of the more than 8 million bibliographic records in our online catalog. We found 1,251 "see" references across the 8,000 bibliographic records examined. At this rate, we estimate roughly 1 million "see" references in our database, all of which need to be examined for this project. There are five different kinds of access points, each of which may have its own intricacies for finding the correct authorized form. The table below shows each group's share of the "see" references.

Group               Share of "see" references
Name-title          3.6%
Subject             45.7%
Title               3.7%
Name (corporate)    22.1%
Name (personal)     25%

Because of the large amount of data involved and the special considerations needed for each group, the team decided to develop our tools and processes for authority maintenance around one specific group, while also building a general structure that can be reused when the other groups are given attention. We chose to focus our efforts on fixing personal names, because the considerations involved in finding the appropriate authorized access point for a problematic personal name access point are relatively simple. Also, since personal names make up about a quarter of the problematic access points we can detect, it is easy to produce a large sample to work on, and fixing personal names fixes a good portion of all the problematic access points. Once we finished development on personal names, we were able to expand to corporate names relatively quickly, thanks to the similarities between personal and corporate names and a code structure built to support the addition of modules for the other groups.

We chose to develop our tools in Python because of the large number of third-party libraries available. Thanks to those libraries, we can easily establish Z39.50 connections and convert MARC-8 records into Unicode, as sketched below. We chose to run our queries in SQL Developer because of its speed.
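To illustrate the first of those capabilities, here is a minimal sketch of a Z39.50 record fetch with PyZ3950's zoom module. The host, port, database name, and use attribute are placeholders, not our production configuration:

```python
from PyZ3950 import zoom

# Placeholder connection details: the real host, port, and database
# name of a Voyager Z39.50 server will differ.
conn = zoom.Connection('z3950.example.edu', 7090)
conn.databaseName = 'VOYAGER'
conn.preferredRecordSyntax = 'USMARC'

# Bib-1 use attribute 12 ("local number") is one way to request a
# record by its BIBID; which attributes a server honors varies.
results = conn.search(zoom.Query('PQF', '@attr 1=12 "1234567"'))
if len(results) > 0:
    raw_marc = results[0].data   # raw MARC bytes, MARC-8 encoded
conn.close()
```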
5. Approach

The general process we developed to update unauthorized name access points splits into three distinct steps. The first step is querying our database for the unauthorized name access points in our bibliographic records. The second step is processing the results of that query. The final step is uploading the changes found in the processing. The specific implementation of this process for personal and corporate names is as follows.

5.1. The Query

We run a query in SQL Developer over a given range of bibliographic records that returns a list of the unauthorized personal and corporate names in that range, that is, names recognized as "see" variants rather than authorized names. For each of these problematic names, the query returns several important pieces of information: the unauthorized name, complete with all associated subfields; the bibliographic ID (BIBID), the unique identifying number of the bibliographic record containing the name in our database; and the OCLC number of that same bibliographic record. (17)

5.2. Processing the Query Results

The query results are run through a Python script that searches for the authorized name that best fits the problematic name from the record. If an authorized name cannot be found, the problematic name is added to a list of unresolved names to be reviewed by humans. If an authorized name can be found, the correction is applied to the bibliographic record, which is added to a master collection of corrected records.

To begin processing, the problematic names are all read into the script and grouped by BIBID. Each problematic record is then retrieved from the database, one at a time, using a Z39.50 request. Each problematic name from the query is matched with a personal or corporate name field in the bibliographic record. This match is made by calculating the Levenshtein distance between each name from the query and each name in the record, and associating the pairs of names with the smallest calculated difference.

Once all the problematic names have been found in the record, each name is processed individually to find the authority record that best matches it. The first step is to call a web application programming interface (API) with the selected problematic name. For personal names, the Virtual International Authority File (VIAF) AutoSuggest API is called, and for corporate names the VIAF SRU Search API is called. The API returns a list of suggested authorities in VIAF, which is then reduced to the authorities that are listed as either personal or corporate names and have a Library of Congress Control Number (LCCN). The list of LCCNs is used to retrieve the Library of Congress (LC) authority record for each name through a series of Z39.50 queries. For each authority record, the Levenshtein distance is calculated between the problematic name and all the versions of the name listed in the record. The smallest Levenshtein distance is found across all suggested authority records, and if that distance is small enough, the authorized name in that record is selected as the solution for the problematic name being examined. If the smallest Levenshtein distance found is not small enough, the problematic name, along with the authorized name with the smallest Levenshtein distance, is placed in a list of names to be assessed by a human being.
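In code, the distance calculation and the thresholded selection might look like the following minimal sketch (a standard dynamic-programming edit distance; the cutoff of two edits anticipates the threshold we settled on, described in the Development section):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance: the minimum number of
    single-character insertions, deletions, and substitutions needed to
    turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def best_match(problem_name, candidates, max_distance=2):
    """Return the candidate closest to problem_name, or None when even
    the best candidate needs more than max_distance edits."""
    if not candidates:
        return None
    distance, winner = min((levenshtein(problem_name, c), c)
                           for c in candidates)
    return winner if distance <= max_distance else None
```

For example, best_match("Stephenson, Andrew G.,", ["Stephenson, Andrew G."]) returns the candidate, since only the stray comma, a single edit, separates the two forms.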
Once the best guess has been selected from VIAF's list of authority records, the selection needs to be independently verified, because some of the suggestions that come out of this process are incorrect yet very similar to the problematic name in question. To do this, the OCLC number from the bibliographic record is used to retrieve OCLC's version of the bibliographic record, and each of the names in the OCLC record is compared to the solution that has been selected. If any of the names in the OCLC record contains an exact match for all of the information in the selected name, that is considered independent confirmation. If the OCLC record fails to confirm the selected name, the problematic name, along with our selection, is placed in a list of names to be assessed by a human being. Otherwise, the name selected via the VIAF API and the Levenshtein distance calculation has been confirmed by the OCLC record and is considered safe to upload to the database as a correction.
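One way this exact-containment check might look as code (a minimal sketch; treating whitespace-separated tokens as the "pieces of information" approximates the subfield-level comparison the script performs):

```python
def confirmed_by_oclc(selected_name, oclc_names):
    """Independent confirmation: accept the selected authorized name
    only if some name field in OCLC's copy of the bibliographic record
    contains every piece of it exactly."""
    pieces = selected_name.split()
    return any(all(piece in oclc_name for piece in pieces)
               for oclc_name in oclc_names)

# The check passes only when OCLC's record agrees with our selection.
print(confirmed_by_oclc("Stravinsky, Igor, 1882-1971",
                        ["Stravinsky, Igor, 1882-1971, comp."]))  # True
print(confirmed_by_oclc("Stravinsky, Igor, 1882-1971",
                        ["Craft, Robert."]))                      # False
```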
Once a name is to be uploaded, the problematic name is removed from the bibliographic record that was retrieved at the beginning of the process and replaced with the authorized name that has been selected. Once all the problematic names in a record have been processed, if any of them have been replaced, the updated record is written to a collection of updated records. When the script has finished running, this collection of records needs to be uploaded to the database to apply the changes.

5.3. Uploading the Changes

We are able to apply the changes to our bibliographic records one at a time by importing the collection of updated records into the Voyager client and manually overwriting each existing record with its revised version. Since around 300,000 bibliographic records need to be updated, we are now waiting to talk with the systems services staff in our consortium and ask them to upload these changes in bulk. Our authority maintenance work will also help the other 133 academic and research libraries in our consortium.

6. Development

The goal in fixing an unauthorized access point is to find the authority record that lists the unauthorized access point as a variant, and to replace the unauthorized access point in the bibliographic record with the authorized access point from the authority record. This would be easy if the unauthorized access point carried some sort of unique ID pointing to the authority record it is meant to be associated with. While this is theoretically doable, it is not generally done, and none of the bibliographic records we looked at during development had any direct pointers to authority records. The authority record we need is therefore not immediately obvious; it has to be discovered using relevant data from the unauthorized access point and from the bibliographic record that contains it.

The tool we use to search for the correct authority record for personal names is VIAF's AutoSuggest API, which takes a string as input and outputs a JSON file listing all the access points that may be relevant to the query string. This tool was chosen because it is easy to send a query and get a response programmatically, which makes it convenient for automation. The AutoSuggest API may return a variety of suggestions based on the input, and we have to sort through those suggestions to see which, if any, is the authority record we are looking for. The main challenge with AutoSuggest is sending queries that will get meaningful results back.

There are simple cases to look out for, specifically when the query string ends with a comma or dash, which ensures no results will be returned. This kind of pattern is easy to detect and easy to fix: simply removing the final character in these cases tends to return relevant suggestions. Less straightforward is how to handle queries with unusual diacritics. In some cases using all the unusual diacritics in the search will turn up nothing, but revising the query to remove those diacritics will yield relevant results; sometimes the case is reversed, and the diacritics are the only way to get good suggestions. Our best solution is to always send the query with all the diacritics present, and then, if no results are returned and there are diacritics from outside the ASCII table, to re-send the query with the non-ASCII characters removed. Examples of personal names with unusual diacritics are "|a Mi-ʾgyur-rdo-rje, |c Yoṅs-dge Gter-ston" and "|a Mourikē, Doula".

Another major issue when querying AutoSuggest is knowing how many subfields to include. Subfields like dates, titles, or numeration can make the search key more specific, but if the key includes information that VIAF does not have, the results come up blank. During development we found that search terms including a single additional subfield can produce a small number of relevant-looking results, but if more than one subfield is included, typically no results are returned. Because of this we send multiple queries to AutoSuggest until results are returned, each query including a different subfield, and one with just the name. All of this adds up to the potential for multiple queries being sent for a single name until we get some result to examine, as sketched below. For example, for the problematic access point "|a Śāhajī, |c King of Tanjore, |d fl. 1684-1712," our first two queries to VIAF contain the name and date information (http://www.viaf.org/viaf/AutoSuggest?query=Śāhajī,+fl+1684-1712 and http://www.viaf.org/viaf/AutoSuggest?query=Sahaji,+fl+1684-1712) and return no results. Our third query, which includes the name and title information (http://www.viaf.org/viaf/AutoSuggest?query=Śāhajī,+King+of+Tanjore,), returns one unique personal name, which is then selected as our solution for this name. In contrast, a query that combines all three subfields (http://www.viaf.org/viaf/AutoSuggest?query=Śāhajī,+fl+1684-1712+King+of+Tanjore,) returns no suggestions.
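A minimal sketch of that query strategy, using only the standard library (the "result" key reflects the JSON shape we observed in AutoSuggest responses; the helper names are ours):

```python
import json
import urllib.parse
import urllib.request

AUTOSUGGEST = 'http://www.viaf.org/viaf/AutoSuggest?query='

def autosuggest(term):
    """Send one AutoSuggest query and return the JSON 'result' list,
    which is None when VIAF has no suggestions for the term."""
    url = AUTOSUGGEST + urllib.parse.quote(term)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp).get('result')

def query_variants(name, subfields):
    """Yield query strings in the order we try them: the name plus each
    single extra subfield, then the bare name.  A variant ending in a
    comma or dash loses its final character, and any variant containing
    non-ASCII characters is followed by an ASCII-stripped retry."""
    for extra in list(subfields) + [None]:
        q = name if extra is None else '%s, %s' % (name, extra)
        if q.endswith((',', '-')):
            q = q[:-1]
        yield q
        ascii_q = q.encode('ascii', 'ignore').decode('ascii')
        if ascii_q != q:
            yield ascii_q

def first_suggestions(name, subfields):
    """Query AutoSuggest with each variant until one returns
    suggestions; None means every variant came up empty."""
    for q in query_variants(name, subfields):
        suggestions = autosuggest(q)
        if suggestions:
            return suggestions
    return None
```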
Once AutoSuggest has given us results, we need to decide whether any of the authorities it suggests is what is meant by the unauthorized access point we are examining. During development we noticed that in some of the authorities AutoSuggest returned, the name from the query was merely listed as an associate; for example, a search for "Robert Craft" would return the Library of Congress Control Number (LCCN) for "Igor Stravinsky," which lists "Craft" as an associate. To avoid these cases and find the best one, we began checking the 100 and 400 fields of the authority record against the unauthorized name. This successfully excluded the obviously wrong cases, but it also excluded some cases that were actually correct and merely had minor differences in punctuation, where the unauthorized name might have an extra period or comma that the 100 or 400 field did not. Examples of such cases are "Stephenson, Andrew G.," which is the name listed in our bibliographic record, while the closest match that could be found is "Stephenson, Andrew G.", and "Goldschmidt, Jenny ELisabeth." from our record versus "Goldschmidt, Jenny Elisabeth" as the closest match. In field terms:

700 10 Stephenson, Andrew G.
400 1  Stephenson, Andrew G.
100 1  Mohammadou, Eldridge
400 1  Mohamadou, Eldridge

This, along with a case where the unauthorized name had a pipe character it should not have (written in MARCXML as "Cockrell, W. D. |q (William D.)"), made it clear that it may be impossible to find an exact match for the unauthorized name, even when we have found the correct authority record. The pipe character showed that we could not simply ignore periods or commas, as the variations may be more unpredictable than that. Instead, our solution was to look through all the 100 and 400 fields from the suggested authority records and determine how similar each of them is to the unauthorized name we are trying to fix. We do this by calculating the Levenshtein distance, also known as the edit distance: the fewest single-character changes it takes to turn the unauthorized name into the field being scored. Few changes mean the names are already quite similar. At this point we need only concern ourselves with the authority record with the smallest calculated Levenshtein distance; if even that distance is too big, we conclude that none of the suggested authorities is similar enough to be considered a good match. This is where we can adjust how conservative or aggressive we want our changes to be. A larger maximum allowed Levenshtein distance will produce more changes but runs the risk of selecting more bad solutions; a smaller maximum will mean fewer changes, but those changes will be more likely to be correct. We wanted to be fairly conservative for this project and found that a Levenshtein distance of 2 was the highest at which we were satisfied with the suggestions.

One of the hurdles in this process was how to retrieve the LC authority records we have been discussing. We get the LCCN for the authority record from AutoSuggest, but it was not immediately clear what the best service was for retrieving the full authority record. The easiest and most accessible option was the LCCN Permalink service, which retrieves the authority as a MARCXML file. LCCN Permalinks are URLs for LC bibliographic and authority records; the service is well documented and consists of simply adding the LCCN to the end of a standard URL and sending an HTTP request for that URL. The problem is that the Library of Congress limits this service to one request every six seconds, which is too slow for a project on this scale. The Library of Congress allows more frequent access through its Z39.50 service, but that service is not as well documented or straightforward. Because speed was so important, we took the time to learn how to send requests over Z39.50. We did this by importing a Python library called PyZ3950, which allows us to make Z39.50 requests, and sending queries formatted as "@attr 1=9 [LCCN]", which gets us the authority record we are looking for.
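In PyZ3950 terms, that retrieval might look like the sketch below. The PQF query is the one quoted above; the host and database name for LC's authority endpoint are assumptions, so check LC's current Z39.50 documentation before relying on them:

```python
from PyZ3950 import zoom

def fetch_lc_authority(lccn):
    """Fetch one LC authority record by LCCN over Z39.50, using Bib-1
    use attribute 9 (LC card/control number)."""
    conn = zoom.Connection('lx2.loc.gov', 210)  # assumed LC endpoint
    conn.databaseName = 'NAF'                   # assumed authority database
    conn.preferredRecordSyntax = 'USMARC'
    try:
        results = conn.search(zoom.Query('PQF', '@attr 1=9 "%s"' % lccn))
        return results[0].data if len(results) > 0 else None
    finally:
        conn.close()
```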
A few character encoding problems arose along the way. First, the results of our SQL query rendered characters outside the standard ASCII character set as question marks. This means that a name with unusual diacritics would have a question mark in the middle of it, or that a name in full Hebrew script would be made up entirely of question marks, with the exception of any date information. For example, the name listed as "|a בלבן, אברהם, |d 1944-" in our records appeared as "a????????, ??????????,d1944" in the query results we were reading from. Eventually it became clear that names in the second case always come from the 880 field, which should itself be labeled as a "see" reference, and that we can simply ignore those names. To get around the first case, when we retrieve the full record from our database we match the unauthorized names against the 100 and 700 fields of the bibliographic record, again by calculating the Levenshtein distance, to find the name in the record.

At first, the full bibliographic records we imported had character encoding issues of their own: diacritics were being applied to the character before the one they belonged to, the diacritics themselves were wrong, and 880 fields were filled with gibberish instead of question marks. For example, the field "|a בין אל לחיה : |b עיון ביצירתו של עמוס עוז / |c אברהם בלבן." would be incorrectly written as "(2aio `l lgid :(B (2rieo aivixze yl rneq ref /(B (2`axdm alao.(B". All this led to the conclusion that the bibliographic records were not encoded in UTF-8. The fact that the diacritics came before the characters they should have been applied to, rather than following them as in UTF-8 MARC records, eventually led to the realization that these records were encoded in MARC-8. To decode the MARC-8 characters, we now retrieve each bibliographic record as an .mrc file and load it into the pymarc library in Python, which has the built-in ability to decode MARC-8. Once these problems were solved, the names from the bibliographic record were in good shape to be used in the queries sent to AutoSuggest.
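A minimal sketch of that decoding step with pymarc (the file name is hypothetical; Leader position 09 is the MARC character-coding byte, blank for MARC-8 and "a" for UTF-8, and to_unicode=True tells pymarc to translate accordingly):

```python
from pymarc import MARCReader

def read_name_fields(mrc_path):
    """Read records from an .mrc file, letting pymarc convert MARC-8 to
    Unicode, and yield the name fields we match the query names against."""
    with open(mrc_path, 'rb') as fh:
        for record in MARCReader(fh, to_unicode=True):
            for field in record.get_fields('100', '700'):
                yield field.format_field()

# Hypothetical file of exported bibliographic records.
for name in read_name_fields('bib_records.mrc'):
    print(name)
```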
Using the methods described above to query VIAF AutoSuggest and pick the best suggestion by Levenshtein distance ended up producing results for 93% of the names we looked at. But when we manually examined the quality of those results, we found that 5% of them were the wrong name. This project is expected to run on around half a million personal and corporate names, and an error rate of 5% would mean tens of thousands of unauthorized access points being overwritten with authorized but incorrect access points. We determined that a good way to filter out these errors would be to compare our solutions with solutions acquired through an unrelated method, and to accept only the solutions on which both methods agree. No method is likely to be completely accurate, but if the two methods are independent, errors are unlikely to occur for the same name in both. The independent method we chose was comparison against OCLC's bibliographic records. This is easy to do, because all of our bibliographic records have an OCLC number listed, so it is simply a matter of retrieving the record and comparing the names. At the same time, we do not want to simply copy all the information in OCLC, because OCLC has errors of its own that we do not want to blindly perpetuate in place of our own potentially correct information. Combining our conclusions with OCLC's records gives us the best of both sources while avoiding replacing our current access points with incorrect ones. The percentage of names we replace drops with this method, to around 75%, while errors drop to a fraction of a percentage point. While this leaves a large group of names unfixed, we output our best guesses for the unfixed names in a spreadsheet, which gives the humans fixing the errors a good place to start.

7. Testing & Results

Because of the large scale of this project, an important part of the development cycle was testing the code on a sample set, seeing what errors occurred, and adjusting the code to account for them. We want to improve our existing database without pushing many errors, which is a danger with automation, so testing and analyzing the results were a constant part of our process. For early development, we tested our code on a sample of around 550 unauthorized access points. When the code could not find a solution for an access point, we looked for ways to find one based on the unauthorized access point and adjusted the code accordingly. When we did find a solution, we examined it to determine whether it looked correct; if it did not, we looked for ways to fix it, or to exclude it if there was no way to fix it. As our solutions for this sample became good, we expanded our testing to another range of around 1,000 unauthorized access points to see how well our code performed and to address any problems that emerged with the new sample.

For a second set of about 1,000 unauthorized access points, we assigned expected solutions to all the access points before running the code, and compared the results afterwards. The solution the algorithm chose was wrong 1.3% of the time, but none of those cases were pushed as final solutions; in every one of them the algorithm detected a problem with the answer and simply listed it as a best guess for a name it could not acceptably resolve. In 2.8% of cases the solutions selected by the human and the program were different but both valid; in other words, multiple authority records existed for the same person, and the human and the computer selected different ones. Here the human tended to choose authority records whose 670 field contained the title of the work in question, or a title that looked to be in the same field as the work in question, while the algorithm did not take the 670 field into consideration. In 1.8% of cases the algorithm chose correctly and the human chose incorrectly. In the remaining 94% of cases the human and the computer agreed on the solution. The main takeaway from these results is that our algorithm is roughly on par with a human, but much faster and able to catch its own mistakes, though we could still improve performance by checking the 670 field.

Overall, when we run the code across all three samples of training data, it looks at 2,412 unauthorized personal names and finds what it considers an acceptable solution for 1,915 of them, or 79.4% of access points. The remaining 20.6% of access points are not changed by the algorithm and require a human to examine and fix them. The margin of error for these numbers is ±2%. At the time of writing, we have found that 2 of the written solutions are wrong, which is 0.08% of the unauthorized names examined.

We compared the performance of our code to MarcEdit's Validate Headings tool, which also tries to automatically fix unauthorized access points.
Both tools were run on a sample of 705 unauthorized personal names, and we compared the quality of the results. In 542 cases, or 77% of the time, our tool and MarcEdit came to the same conclusion: either they chose the same solution, or they were both unable to come up with one. In four cases (1%) the two tools came up with different solutions and both were wrong. In 13 cases (2%) the two tools came up with different solutions, but there was not enough information to determine which was better. In 100 cases (14%) the two tools came up with different solutions and our code produced the correct one. The reverse case, where the two tools disagree and MarcEdit produces the correct solution, happened in 46 cases, or 7% of the time.

Beyond coming up with more good solutions, our code has a major advantage over MarcEdit. Whenever MarcEdit finds a solution, it writes that solution to the collection of records being processed, regardless of the quality of the result. Our code writes a result to the records only if that result can be confirmed in the OCLC version of the bibliographic record being fixed. In practice, every time we found that MarcEdit had made a mistake, that mistake had been written to the same file as its good solutions; but every time we found that our code had selected an incorrect solution, the code itself had also determined that its solution was not good enough, had written it to a spreadsheet for humans to look at, and had kept it out of the collection of records to be updated. So MarcEdit produces fewer correct solutions and generates new incorrect data, while our code produces more correct solutions and avoids pushing errors into the records. For these reasons we feel that our code does a better job of fixing unauthorized access points than MarcEdit's Validate Headings tool.

Because of our methods, there are limits to how many unauthorized access points we can fix. First, because of how the query is structured, we only look at names labeled as "see" references; any access point that is unauthorized but not so labeled will not be collected by the query. Our processing code will work with a correctly formatted spreadsheet of unauthorized names that are not "see" references, but it is largely untested on that sort of data, so it is not clear how good the results would be. Second, so far the tool has been developed and tested only for personal and corporate names. The code is designed so that other access points, such as subjects, series, titles, and name-titles, can be addressed in the future, but the methods have not been tested on anything but personal and corporate names, and nothing has been written to address the specific issues other access point types might have. Third, the double check against the OCLC records imposes an upper bound on the names we can fix: because we ignore results that do not agree with the OCLC record, we cannot fix a name that is not already accurate in OCLC. One way around this limitation would be to find a second independent source to check both our result and OCLC's record against; then, if any two of the three sources agree, that agreement can be selected as the solution, as in the sketch below.
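Expressed as code, that suggested agreement rule might look like this (a hypothetical helper, since the second independent source has not yet been chosen):

```python
def two_of_three(ours, oclc, third):
    """Accept a solution when any two of the three independent sources
    propose the same authorized name; None defers the name to human
    review.  An argument may be None when that source has no proposal."""
    for a, b in ((ours, oclc), (ours, third), (oclc, third)):
        if a is not None and a == b:
            return a
    return None
```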
Finally, the speed of our process is limited by the number of external calls our code makes. Every record requires two external calls: one to access our local record and one to access the OCLC version of the record. Every name that is processed calls the AutoSuggest API between 1 and 16 times, depending on whether any usable suggestions are returned. Every suggestion from AutoSuggest results in a Z39.50 call to retrieve the authority record from the Library of Congress; the number of calls here varies with AutoSuggest's results, but there are typically only a few suggestions. The calls to AutoSuggest have the greatest impact on the speed of the process. AutoSuggest will realistically be called all 16 times for any name that VIAF does not have, or does not have an LCCN for, and both cases make up a small but sizable fraction of our dataset. In addition, we have observed AutoSuggest to have the slowest response time of all the services we call. There is no way to avoid the worst-case scenario for AutoSuggest, because a slightly different query has the potential to return relevant results even when other queries have failed. The best way to speed up processing an entire database with this tool would be to split the data into smaller chunks and run a few of those chunks in parallel on different machines at the same time. This would have to be limited to a few machines at a time to avoid overloading the services the tool uses, but processing the data across a few machines at once would divide the total amount of time needed to process it all.
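The same idea can be approximated on a single machine with a small process pool; in this sketch, process_chunk is a stand-in for the full pipeline and the chunk file names are hypothetical:

```python
import multiprocessing

def process_chunk(chunk_path):
    """Stand-in for running the name-fixing pipeline over one csv
    chunk of up to 5,000 names."""
    ...

if __name__ == '__main__':
    # Hypothetical chunk files produced by splitting the query output.
    chunks = ['names_%03d.csv' % i for i in range(80)]
    # A small pool keeps us from overloading VIAF, LC, and OCLC.
    with multiprocessing.Pool(processes=3) as pool:
        pool.map(process_chunk, chunks)
```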
At the time of writing (April 2017), we have run the query to find all unauthorized personal and corporate names in our online catalog, and there are around 400,500 personal and corporate names that need to be corrected. The breakdown by range of bibliographic record numbers is as follows.

Bib Record No.   Total     Automatically Changed   Need Manual Correction
0-1 million      79,091    63,783                  15,308
1-2 million      77,073    60,869                  16,204
2-3 million      65,512    50,427                  15,085
3-4 million      62,124    46,572                  15,552
4-5 million      51,012    34,000                  17,012
5-6 million      27,994    17,788                  10,206
6-7 million      34,971    23,184                  11,787
7-8 million      12,378    3,696                   8,682
Total            400,500   295,500 (73.8%)         105,000 (26.2%)

After running the script on files of 5,000 names at a time for those 400,000-odd personal and corporate names, we have successfully corrected around 300,000 of them by machine. We still have over 100,000 personal and corporate names to check by hand; we believe many of them are in fact correct, but we have set them aside out of caution.

The automatically changed records are in XML format, so we need to convert the XML to MARC format using MarcEdit and then import the MARC files into Voyager to update the records. Given the large number of auto-corrected records, our team hopes that our consortium will help set up a profile to finish the work. The records that need manual checking and correction are in csv format; catalogers familiar with multiple foreign languages may be hired to finish this work by exploring the LC authority files and OCLC records. In the meantime, we are working on a different script to correct subject access points in our online catalog. In the future, we plan to use the script and workflow for ongoing authority maintenance, since incorrect names and subjects appear in our online catalog every day.

8. Conclusion

A critical part of linked library data lies in the establishment of its backbone: authority data. Our script has processed around 400,000 personal and corporate names in our online catalog, fixing names not only in Western European languages but also names with unusual diacritics and in non-Roman scripts. We hope that our script and our workflow for fixing unauthorized access points in bibliographic records can help other libraries as they prepare to migrate to the linked open data environment.

Notes
1. Linked Data: Connect Distributed Data Across the Web. http://linkeddata.org/
2. MacKenzie Smith, Carl G. Stahmer, Xiaoli Li, and Gloria Gonzalez, BIBFLOW: A Roadmap for Library Linked Data Transition (University Library, University of California, Davis, and Zepheira Inc., 14 March 2017). http://roytennant.com/BIBFLOWRoadmap.pdf
3. Qiang Jin, Demystifying FRAD: Functional Requirements for Authority Data (Santa Barbara: Libraries Unlimited, 2012), 3.
4. IFLA Study Group on Functional Requirements and Numbering of Authority Records (FRANAR), Functional Requirements for Authority Data: A Conceptual Model, Final Report (München: K. G. Saur, 2013). http://www.ifla.org/files/assets/cataloguing/frad/frad_2013.pdf
5. IFLA Working Group on the Functional Requirements for Subject Authority Records (FRSAR), Functional Requirements for Subject Authority Data (FRSAD): A Conceptual Model (München: K. G. Saur, 2010). http://www.ifla.org/files/assets/classification-and-indexing/functional-requirements-for-subject-authority-data/frsad-final-report.pdf
6. Joint Steering Committee for Development of RDA, RDA: Resource Description and Access, 2010. https://access.rdatoolkit.org/
7. Bibliographic Framework Initiative. https://www.loc.gov/bibframe/
8. OCLC WorldCat data strategy. https://www.oclc.org/en/worldcat/data-strategy.html
9. University of Illinois at Urbana-Champaign Library. www.library.uiuc.edu
10. CARLI (Consortium of Academic and Research Libraries in Illinois), I-Share reports. https://www.carli.illinois.edu/products-services/i-share/reports
11. Michael Gorman, "Authority Control in the Context of Bibliographic Control in the Electronic Environment," Cataloging & Classification Quarterly 39 (2004): 13.
12. Barbara B. Tillett, "Authority Control: State of the Art and New Perspectives," Cataloging & Classification Quarterly 39 (2004): 23.
13. D. I. Hillmann, R. Marker, and C. Brady, "Metadata Standards and Applications," The Serials Librarian 54 (2008): 1.
14. Sha Li Zhang, "Planning an Authority Control Project at a Medium-Sized University Library," College & Research Libraries 62, no. 5 (2001): 395.
15. Sherry L. Vellucci, "Commercial Services for Providing Authority Control: Outsourcing the Process," Cataloging & Classification Quarterly 39 (2004): 443.
16. Silvia B. Southwick, Cory K. Lampert, and Richard Southwick, "Preparing Controlled Vocabularies for Linked Data: Benefits and Challenges," Journal of Library Metadata 15 (2015): 177.
17. System Properties Comparison: Microsoft Access vs. Microsoft SQL Server vs. Oracle. https://db-engines.com/en/system/Microsoft+Access%3bMicrosoft+SQL+Server%3bOracle