Cleanup of NetLibrary Cataloging Records: A Methodical Front-End Process

Elaine Sanchez
Leslie Fatout
Aleene Howser
Charles Vance

ABSTRACT. Electronic resources, and ebooks in particular, have become a very important source of information for library patrons. When our library was given access to more than 20,000 ebooks, we were faced with bibliographic records of unknown quality. To provide high-quality records in a timely manner, we identified as many potential problems as we could, worked with reference staff to create the best PAC displays, and created efficient record-editing methods to address these issues prior to loading the records in our database. This article documents that process and describes the MarcEdit, Word, and Excel strategies used to methodically correct and improve these records. It also offers practical solutions and procedures for database maintenance and quality control for NetLibrary or any outsourced cataloging records. The future of ebooks and other related cataloging issues, including authority control, are also discussed as points that remain to be addressed. [Article copies available for a fee from The Haworth Document Delivery Service: 1-800-HAWORTH. © 2006 by The Haworth Press, Inc. All rights reserved.]

Elaine Sanchez is Monographs Cataloging Librarian/Unit Head; Leslie Fatout is Library Systems Coordinator/Circulation Librarian; Aleene Howser is Head Monographs Cataloging Assistant; Charles Vance is Database Management Services Librarian/Unit Head, all at Alkek Library, Texas State University-San Marcos, 601 University Dr., San Marcos, TX 78666-4604.

Technical Services Quarterly, Vol. 23(4) 2006
Available online at http://www.haworthpress.com/web/TSQ
doi:10.1300/J124v23n04_04

This electronic prepublication version may contain typographical errors and may be missing artwork such as charts, photographs, etc.
Pagination in later versions may differ from this copy; citation references to this material may be incorrect when this prepublication edition is replaced at a later date with the finalized version.

KEYWORDS. NetLibrary, cataloging, procedures, cleanup, ebook, database maintenance, front-end, process, quality control, edit, outsourcing, online catalog, metadata, record loading, bibliographic, MarcEdit, Microsoft Word, Microsoft Excel, macro, spreadsheet

INTRODUCTION

NetLibrary (www.netLibrary.com), a division of OCLC, is the major provider of electronic books (ebooks) to the library community,² with more than 60,000 ebooks from more than 400 publishers, covering all subject areas, and serving more than 6,700 libraries.³ Ebooks are published works such as research materials, reference books, and textbooks that have been converted into digital format for electronic distribution. They are an important supplement to print materials as they provide instant access for patrons in remote locations.⁴ Ebooks offer other advantages such as full-text searching; instant linking to related resources; no risk of theft, damage, or loss; potential savings in processing costs; and no physical space requirements.⁵ Furthermore, while the debate rages about the future roles of printed books versus electronic resources, patrons are growing increasingly reliant on electronic resources for their information needs. Given this fact, it is imperative that the library offer access to ebooks, ejournals, and all other remote electronic information.

In 2002, through TexShare (www.texshare.edu), Texas’ statewide resource-sharing program, a NetLibrary collection of more than 20,000 electronic books was made available to member institutions.
The ebooks were a welcome addition at a time when the enrollment at Texas State University-San Marcos (www.txstate.edu) was growing rapidly (from 22,471 in 2000 to 26,375 in 2003) and there was a strong emphasis on using technology to improve and expand access. However, only the most curious and adventuresome library users were aware of the ebook collection because the catalog, beyond which many users never venture, did not disclose it. Therefore, upon learning that TexShare offered NetLibrary MARC records to its member libraries at no cost, our university librarian initiated a high-priority project to load bibliographic records for the ebook collection into our catalog.

In the summer of 2003, reference librarians met with the system librarian and catalogers to plan for the addition of ebook records to the public access catalog. Reference staff wanted to ensure that these cataloging records would clearly distinguish electronic from print resources, and provide simple, direct access to the ebooks.

Cataloging staff were concerned about the quality of the NetLibrary records; this would be the first time that bibliographic records which we had not cataloged would be loaded into our database. Our cataloging standards are high, with full encoding level, AACR2, ISBD, careful authority control, consistent series authority work, and complete classification and subject access. We wondered how NetLibrary records would compare to our own, and how they would affect the integrity of our carefully constructed bibliographic and authority databases.

We found the richest source of information to be Autocat, where many problems were listed and discussed. We also reviewed published literature that covered the cataloging of ebooks and discussed fields and standard values to include, but that literature did not, at the time, address cataloging problems in the set of NetLibrary records we were going to acquire.
At the same time, the system librarian communicated with the Data Research Associates (DRA) Classic user group to inquire about system-specific problems and solutions other sites might offer. Points that were mentioned included dealing with the lack of physical items, hard copy records for the same titles, and variations in cataloging rules and practice. There are a couple of programs that can create or massage pre-existing MARC data; however, several of the users recommended Terry Reese’s MarcEdit program as a gem of a tool which made batch editing records a breeze. Web site citations that describe the other programs are included in the bibliography at the end of this article.

Learning about the problems within the records, we knew we had to find a way to get them corrected. How would we deal with this task? Would there be a simple way to identify the different types of problems? Who would be responsible for finding and correcting the problems?

It made sense to approach the cleanup systematically instead of dealing with records on an “as-found” basis. First, having the records isolated from the entire database allowed the task to proceed more quickly and efficiently. Second, doing most of the cleanup work before the records were loaded made them more accessible and useful to the patrons as soon as they appeared in the catalog. Third, the tools that are available with our current system (DRA Classic) are limited in scope and functionality. It would be more efficient to use programs that had greater search, sort, replace, and edit capabilities: MarcEdit, Microsoft Word, and Microsoft Excel.

We have succeeded in creating a methodical front-end process that raises the quality of the NetLibrary records, as nearly as possible, to our own local standards. A link to detailed procedures appears in the notes at the end of this article.
SURVEY OF THE LITERATURE

A survey of the literature was performed on OCLC, the Internet, and in the Library Literature & Information Science full-text database regarding the quality and cleanup of NetLibrary records. The results were

• some information, articles, and Web sites on MarcEdit and two other MARC-editing software programs
• a small number of functional applications of these utilities in use by libraries, explaining in broad terms how certain features could be used to perform various types of edits, such as globally editing the 049 field
• a few articles that had brief mentions of non-traditional software applications used to perform editing prior to loading MARC records in catalogs.

We found no article or information on the overall process of using non-traditional, non-ILS supplied editing utilities to correct MARC records prior to loading them in the catalog. Neither was there any information on the actual procedures for these workflows, nor any discussion on the quality controls and editing standards required to bring the records up to the quality level of records already existing in the catalog. This article presents the entire process, as experienced at Texas State University-San Marcos, which can be emulated for any outsourced MARC record cleanup project.

PROBLEMS AND RAMIFICATIONS

Reference staff were guardedly positive about adding bibliographic records for NetLibrary titles to the catalog, provided the following issues were addressed:

• The OPAC display must be patron-friendly and unambiguous.
• The link to the online titles must be as simple and direct as possible.
• Users must be able to search by keyword and limit the result to ebooks.
Reference staff were also concerned about the absence of item records that caused the system-generated note: “The library currently has no holdings for this title.”

Cataloging staff, on the other hand, were cautious about loading NetLibrary records, sight unseen, into the database. This was our first experience with accepting someone else’s cataloging en masse and without local cataloging oversight. We wanted to learn everything we could about the quality and potential problems of these records.

After the first set of records was loaded in August 2003, we had the search capabilities to evaluate the records. Records were loaded at night, when staff were not cataloging, so the records appeared in sequential database control number (dbcn) order. This enabled staff to identify NetLibrary records that were retrieved when performing random searches for possible problems and errors. One way to directly retrieve NetLibrary records was by the dbcn, as we knew the range of numbers for these titles. Another simple way was by the subject “Ebook,” as this had been added to all the NetLibrary records. Both of these searches allowed staff to limit searching to only NetLibrary titles.

Our initial searches in the records uncovered various issues, such as the following:

• A title search for initial articles in any language brought up a few records with incorrect filing indicators.
• Searches for authors, corporate bodies, and other access points showed that these were not linking to the assigned authority records in our database because of minor typos, capitalization discrepancies, diacritics, incorrect subfield codes, punctuation, or other problems.
• Limitations in our system caused series with initial articles not to link with their authorized headings.
• Some NetLibrary records had call number formatting problems; we used the 050 and 090 sections of OCLC’s Bibliographic Formats and Standards as examples of correct formatting.
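One of the issues above, incorrect filing indicators on titles and series that begin with an initial article, lends itself to a mechanical check. The following is a minimal Python sketch and is not part of the procedures described in this article. It assumes the records are available as tagged text of the kind MarcEdit produces, where a field line looks like =245  14$aThe title (tag, two spaces, two indicator characters, then subfields). The article list and the names ARTICLES, expected_nonfiling, and check_filing_indicators are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical, deliberately short article list; a real check would need a
# fuller, language-aware list.
ARTICLES = ("the", "a", "an", "la", "le", "les", "el", "los", "las",
            "der", "die", "das")

# Assumed tagged-text layout: =TAG, two spaces, indicator 1, indicator 2, $a...
FIELD = re.compile(r"^=(\d{3})  (.)(.)\$a(.*)$")


def expected_nonfiling(title):
    """Return the nonfiling-character count implied by a leading article."""
    first, sep, _rest = title.partition(" ")
    if sep and first.lower() in ARTICLES:
        return len(first) + 1  # the article plus the space that follows it
    return 0


def check_filing_indicators(lines, tags=("245", "440")):
    """Return (line_number, line) pairs whose second indicator looks wrong."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        match = FIELD.match(line)
        if match is None or match.group(1) not in tags:
            continue
        second_indicator, title = match.group(3), match.group(4)
        if second_indicator not in "0123456789":
            flagged.append((number, line))
        elif int(second_indicator) != expected_nonfiling(title):
            flagged.append((number, line))
    return flagged
```

Lines flagged this way would still need a cataloger's review, since a word such as "A" or "Die" at the start of a title is not always an article.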
We also wanted to learn what other cataloging agencies had found. We searched Autocat archives to see what messages had been posted on the topic of NetLibrary records. There were disturbing problems with records created in 2001:

• Duplicate records
• 7xx fields stripped from records
• Acceptable subject headings stripped from the full OCLC record, sometimes leaving only one subject heading
• MeSH subjects retained, but LCSH deleted
• Errors in 245s; non-English language cataloging description
• Main entry vs. title entry errors, etc.

Cataloging discussion lists detailed NetLibrary record problems in 2002, including the following:

• Treatment of single serial issues and single volumes of multipart monographs that varies from our local practice
• Records without call numbers
• Print format had accompanying CD or software and a note or 300 $e accompanying material text reflecting this, and the NetLibrary record retained this note
• Duplicate call numbers for records in classed-together series
• The possibility of 126 duplicate records in the NetLibrary set released by SOLINET
• Authority headings of all types conflicting with OCLC and LC authority files.

Since 2002, the NetLibrary records have been corrected and redistributed to address the 2001 problems and some of the 2002 problems. Later records, made available in 2003 and 2004, demonstrate continued improvements in quality. Duplicate records are rare, and there are fewer records lacking call numbers.
However, the following problems still remain:

• Formatting errors in call numbers
• Treatment of single serial issues and single volumes of multipart monographs differs from our local treatment of print serial and monographic set equivalents
• Print format had accompanying CD or software and the 300 field $e accompanying material text was retained in the electronic format record
• Duplicate call numbers for records in classed-together series
• Authority conflicts with OCLC, LC, or local authority files.

In our review of NetLibrary records we have identified continuing problems (see Figure 1). We are working on our processes to detect and fix these errors before they get into our database, where they are harder to find and fix. It is important to address these errors because they create barriers to consistent and correct access.

FIGURE 1. NetLibrary Problems and Ramifications

Figure 1 lists only the basic ramifications of the NetLibrary errors we identified. While some of these errors are minor, those that impair patron identification and use of desired materials are critical.

AUTHORITY CONTROL

In addition to the bibliographic description issues, authority control was a particular area of concern. Large batches of records were being loaded into our database with no systematic method to verify headings. Had the records been cataloged locally, we could have felt secure that between the efforts of our staff and the maintenance procedures we have in place, the headings entered would be up to our standards. In this case, however, we were dealing with authority records of unknown quality.

The DRA system provides us with reports of headings (“index dumps”) that have been loaded into the database but not authorized. We use these reports to perform routine authority control, and with some minor adjustments to our normal procedures, we determined that they could be used for the NetLibrary record loads as well.
The index dumps are set to run at the end of the work week to catch the headings entered during the course of normal cataloging. Because the NetLibrary records are loaded when other cataloging work is not being done, we can run a special dump to isolate the NetLibrary records and their corresponding unauthorized headings.

This special index dump created a large set of authority headings which were imported into an Excel spreadsheet. Headings were then divided among cataloging units and are being searched in our catalog and OCLC’s authority file for any necessary conflict correction, export of authority records from OCLC, or creation of local authority records.

SOLUTIONS IDENTIFIED

Our system librarian headed up the NetLibrary record load project. In February 2003, she initiated the record-review process by asking catalogers what kinds of problems they found with these records and how the records should be edited to best serve the patrons and match the quality of the records in our existing database. Working with printouts of some of these records, we were able to determine how they were cataloged, which fields and texts were used, what types of errors we could identify, and how they should be modified. By June, the following solutions and standard texts had been identified (see Figure 2).

FIGURE 2. Solution and Standard Text with Rationale

We also noted that these records were in OCLC-MARC format, not MARC 21, as there were certain variable and fixed fields in these records that were not included in MARC 21. Also, we observed that the 049 field for these records would have the holding code of the cataloging agency IKMN, rather than our own TXIM holding code, which we anticipated would cause no problems.
The pros and cons of having holdings for ebooks were also discussed, and expediency required loading the bibliographic records without including a holding record for each title to contain call number and barcode information. The absence of holdings records causes the text “The library currently has no holdings for this title” to display in our Web catalog; Reference staff were concerned this could confuse patrons. Our system allows for customization of this message, but to date it has not caused problems. We have agreed that we will revisit the item record and holdings issue in the future, as our ejournals have holdings records, but our ebooks do not.

Our serials cataloging unit had been cataloging ejournals since 1998 and already had established cataloging parameters. We reviewed these procedures and determined that their policies did not relate as closely to ebook cataloging needs as we had thought. Their electronic resource cataloging procedures did, however, reinforce the decisions we had already made:

• Add text after the call number to alert patrons that the title is an ebook
• Modify the text “Bibliographic record display” in the 856 $3 field to “Online version.”

Finally, we determined parameters for the bibliographic load process:

• Splitting the large file of records into sets of more manageable size so that pre-editing could be done more easily
• Running the keyword index program on the split files of records rather than all at once, so that keyword indexing proceeded more quickly
• Trying to use the load program to identify duplicate print titles for their NetLibrary counterparts (unfortunately, our bibliographic load program lacked this capability)
• Tracking the database control numbers for each file in case we needed to isolate these records later for global updating.
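Two of the load parameters above, splitting the file into smaller sets and keeping track of which records went into which set, can be sketched in Python. This is an illustration under stated assumptions rather than the load procedure actually used. It assumes the records are in tagged text with a blank line between records, and it uses the 001 field as a stand-in identifier, since database control numbers are presumably not known until the records are loaded. All function names are hypothetical.

```python
def read_records(text):
    """Split tagged text into records (assumed to be blank-line separated)."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            records.append(current)
            current = []
    if current:
        records.append(current)
    return records


def split_into_sets(records, set_size):
    """Return consecutive sets holding at most set_size records each."""
    if set_size < 1:
        raise ValueError("set_size must be at least 1")
    return [records[i:i + set_size] for i in range(0, len(records), set_size)]


def control_number(record):
    """Return the content of the record's 001 field, or None if absent."""
    for line in record:
        if line.startswith("=001"):
            return line[4:].strip()
    return None


def control_number_ranges(sets):
    """Return (first 001, last 001) for each set, as a simple tracking log."""
    return [(control_number(s[0]), control_number(s[-1])) for s in sets]
```

Keeping the first and last identifier of each set is one simple way to find a given set again later; a local workflow might equally well record file names or load dates.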
TOOLS AND STRATEGIES

MarcEdit

MarcEdit (http://oregonstate.edu/~reeset/marcedit/html/) is a free MARC-editing utility developed by Terry Reese, Assistant Librarian at Oregon State University’s Map and Aerial Photography Library. It includes a tool that “breaks” MARC records into an easily readable, tagged text file, and another which restores broken records to MARC format. It also includes a powerful editor which provides the ability to find and replace text; edit fields, subfields, and indicators; and count the fields and subfields in a file of records.

The system librarian downloaded MarcEdit and experimented with a sample of the NetLibrary records to familiarize herself with the program’s capabilities and limitations. She then met with cataloging staff who had been reviewing the records and identifying problems. It was clear that MarcEdit would be effective in fixing several of the problems.

First, MarcBreaker was used to convert the records to display as tagged text (see Figure 3). With the records broken, various MarcEdit editing tools were used to fix several of the problems which had previously been identified (see Figure 4).

Field Count

MarcEdit includes a tool to count the fields and subfields in a file. This proved useful in determining various problems that affected access and accurate description, and certain descriptive cataloging practices that we do not use, including

• call numbers lacking $b
• 050 call numbers with multiple $a’s
• records without call numbers
• 245 fields with $n and $p that need review and revision of duplicate call numbers, along with other problems
• 300 fields with $e’s indicating accompanying material, usually CDs or software
• 440s with initial articles, which our DRA system does not link to the authorized series heading
• 653 and 655, which we do not use in our online catalog
• 6xx fields with indicators other than 0 or 1, as we use only LCSH
• 7xx fields with $e relator terms, which we do not use and which conflict with our authority records.

The field count report was also very useful because it provided a sure method to review the contents of the MARC record tags and subfield codes for any other unidentified problems. For example, we found that we could use the overall number of certain required fields, such as the 049, to learn the exact number of records in the batch. We could then compare this number to other fields that should have the same number, such as 050/090, to determine if we had records lacking a call number. Figure 5 shows selections from a MarcEdit Field Count report and its usefulness in identifying problems in the content of the records. Many of these specific problem records can then be isolated by using the spreadsheet strategies described under Spreadsheet Strategies.

FIGURE 3. Pre-Cleanup MARC Record Converted to Tagged Text (.mrk Format)
FIGURE 4. MarcEdit Tools and Fixing Bibliographic Record Problems
FIGURE 5. MarcEdit Field Count Report Examples

Microsoft Word Macros

A different solution was sought for a second group of problems because we were unable to fix them using MarcEdit (solutions may be available in new versions of the software). Because the records were converted to display as text using MarcBreaker, Microsoft Word macros were developed to handle this group of problems (see Figure 6).

Microsoft Word Find and Replace

The next task presented a challenge that sent the system librarian into Word’s online help. Our local cataloging policy dictates removal of 6XX fields with a second indicator of anything but 0 or 1, excluding 690.
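The policy just stated can be written as a simple rule over tagged-text lines. The sketch below is in Python and only illustrates the rule; it is not the Word-based method described in this article. It assumes each field line has the form =650  \0$a... (tag, two spaces, two indicator characters, with a backslash standing for a blank indicator). The names keep_line and filter_subject_fields are hypothetical.

```python
import re

# Assumed tagged-text layout: =6XX, two spaces, indicator 1, indicator 2.
SUBJECT_FIELD = re.compile(r"^=(6\d\d)  (.)(.)")


def keep_line(line):
    """Return False for a 6XX field that the stated policy would remove."""
    match = SUBJECT_FIELD.match(line)
    if match is None:
        return True  # not a 6XX field, so the policy does not apply
    tag, second_indicator = match.group(1), match.group(3)
    if tag == "690":
        return True  # 690 is excluded from removal by the policy as stated
    return second_indicator in ("0", "1")


def filter_subject_fields(lines):
    """Return the lines with the unwanted 6XX fields dropped."""
    return [line for line in lines if keep_line(line)]
```

Working line by line, rather than matching from one "=" to the next as a word-processor pattern must, assumes that every field occupies exactly one line of the tagged text.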
Word’s find/replace function, with the wildcard option, provided the solution (see Figure 7):

Find what: =6[!9][0-9] ?[!01]*=
Replace with: =

FIGURE 6. Microsoft Word Macros and Fixing Bibliographic Record Problems
FIGURE 7. Explanation of Word Find/Replace Using Wildcards

Spreadsheet Strategies

Importing MARC records into a spreadsheet enabled grouping fields to examine their contents for error identification. We initially arrived at the idea of using a spreadsheet to identify a record lacking a known field, the 050, and then later found it useful for pinpointing several other problems. The steps are outlined below:

1. Load the file of tagged records into an Excel spreadsheet.
2. Use Excel functions to number the lines and the records.
3. Sort the records to bring like tag numbers together.
4. Then: (a) visually inspect for missing or incorrect data; (b) visually inspect for missing sequential record numbers; or (c) select a group of records and perform a search for text within the selected group.

This was an invaluable cleanup method, unavailable in our system’s traditional database maintenance programs. There were many other areas for which we were able to use the spreadsheet. Depending on the type of error we found, we had two cleanup options:

1. Immediately edit the text-file copy of the records with the identified corrections. These were types of problems that were relatively straightforward, and required little or no cataloging judgment (see Figure 8).
2. For errors which were more complex or required additional cataloging tools, we extracted subsets of the spreadsheets containing those records. These we saved and printed for correction after the records were loaded into our catalog (see Figure 9).

FIGURE 8. Spreadsheet Examples: Problems to be Corrected Before Records are Loaded
FIGURE 9. Spreadsheet Examples: Problems to be Corrected After Records are Loaded

TRANSFORMING CLEANUP TASKS INTO A METHODICAL FRONT-END PROCESS

Library staff had already determined the specific edits that were needed and had the tools to make the changes, namely

• MarcEdit to identify field and subfield anomalies and perform global edits
• Word macros and find/replace editing tools to retrieve problem texts and perform global edits
• Excel spreadsheets to perform data sorts which identified and grouped other anomalies and problems.

With these tools, we had the ability to upgrade all NetLibrary cataloging records to our standards before we loaded them into our bibliographic database. This was a breakthrough in our method of bibliographic record cleanup, which had previously been done after the records were in our catalog.

Because these were new bibliographic record cleanup processes that used new editing tools, we had to

• Establish file-naming conventions and report parameters
• Set up workflows and specific tasks for staff performing the work
• Create procedures that detail a step-by-step approach to editing and revision tasks.

Cataloging staff from the monographs cataloging unit and the database management services librarian created new workflows and corresponding documentation. As we created a cleanup process and procedure, we tried it out on a copy of our existing set of NetLibrary records, using the new cleanup tools and honing the procedure as necessary until its steps were correct and in the correct order. The result is a methodical front-end record cleanup process that is efficient, robust, and effective.

After the cleanup procedures were completed and tested and revision steps documented, we began the work of implementing our newly established cleanup methodology on the actual NetLibrary records.
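The identification steps in this process (counting fields, numbering lines and records, grouping like tags, and looking for records that lack an expected field) can also be illustrated outside a spreadsheet. The following Python sketch is an illustration only, not the Excel procedure described in this article. It assumes tagged-text records separated by blank lines, and all function names are hypothetical.

```python
from collections import Counter


def load_rows(text):
    """Return (line_number, record_number, tag, content) rows, one per field."""
    rows, record_number, in_record = [], 0, False
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            in_record = False  # a blank line ends the current record
            continue
        if not in_record:
            record_number += 1
            in_record = True
        if line.startswith("=") and len(line) >= 4:
            rows.append((line_number, record_number, line[1:4], line[4:].strip()))
    return rows


def field_counts(rows):
    """Count how many times each tag occurs in the whole file."""
    return Counter(tag for _, _, tag, _ in rows)


def records_missing(rows, tags=("050", "090")):
    """Return the record numbers that contain none of the given tags."""
    all_records = {record for _, record, _, _ in rows}
    having = {record for _, record, tag, _ in rows if tag in tags}
    return sorted(all_records - having)


def rows_by_tag(rows, tag):
    """Bring like tags together, ordered by record and then by line."""
    return sorted((r for r in rows if r[2] == tag), key=lambda r: (r[1], r[0]))
```

If the count of a required field such as 049 is larger than the combined count of 050 and 090, records_missing points to the records that account for the difference, which mirrors the comparison made with the field count report.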
The system librarian, who introduced us to the new tools, rejoined us at this point to boost confidence and provide insight as we put them to use.

FUTURE CONSIDERATIONS

While we have made every effort to identify and correct as many errors as possible before loading the records, it is likely that we will continue to encounter new and different problems. As we do so, we will look for ways to incorporate new solutions into our pre-load cleanup procedures. There are also future considerations regarding NetLibrary records for which we are unable to provide definitive solutions; these include the following:

General Issues

• Permanence of collection and future viability of NetLibrary itself; NetLibrary has already been in financial trouble but was saved by OCLC.
• Will TexShare funding be continued for NetLibrary titles?
• In the mix of ebook providers (Project Gutenberg, Million Book Project, Internet Archive, etc.), what will NetLibrary’s role be in the future of electronic resource dissemination?
• How will our arrangement with TexShare be affected by Baker & Taylor’s partnership with OCLC to provide NetLibrary titles?

Technical Issues

• Ongoing authority issues: Outsourcing vs. in-house cleanup
• Item records for ebook titles: Yes or no?
• Can this cleanup process be used for other outsourcing projects?
• Quality of future NetLibrary records: Certain types of problems appear to have been corrected in recent batches; will this trend continue?
• Will there be a mechanism put in place to allow error-reporting to the agency who catalogs NetLibrary records?
• Monitor MarcEdit for functionality enhancements, and identify other potentially useful software or strategies
• How to handle the relationship between print and ebook manifestations of the same title
• Will this front-end cleanup process of vendor-supplied bibliographic records become a regular database maintenance function?
• Will metadata description of ebooks assume a larger role in the future, perhaps replacing MARC as a communication format?

With this project we have achieved a high degree of quality control over cataloging records from one specific source of electronic resources. However, with the proliferation of ebook sources that use very basic cataloging or none at all, we will face larger issues of how, or if, we can continue to provide consistent, quality cataloging and authority control for these titles. If some entity does not provide cataloging for the universe of ebooks, will other methods such as basic Internet search engines be sufficient to provide access?

CONCLUSION

Cataloging, reference, and the system librarian worked together as key players in the NetLibrary record load process. Our desire to have front-end quality control over the vendor-supplied records required that we look outside of the traditional database maintenance tools available in our integrated online system. Our literature survey revealed no comprehensive information available on the process that we envisioned. However, our system librarian found the tools and initiated the process. MarcEdit, Word, and Excel were identified as the software applications that would fill this role. They have given us more flexibility and power in our database maintenance work than we had ever imagined possible. We will continue to use them in the future in order to assure the quality of any other vendor-supplied records before we load them into our catalog.

NOTES

1. These procedures are employed in the Alkek Library Cataloging Department of Texas State University-San Marcos, to perform cleanup of NetLibrary records prior to loading them into the database. They encompass a variety of tasks that assure the quality of NetLibrary records and provide a clear and consistent OPAC display. http://www.library.txstate.edu/cat/netlibrary/procedures/index.htm
2. OCLC PICA: NetLibrary Ebooks Available.
http://oclcpica.org/?id=1012&ln=uk
3. DA Information Services–Electronic Media: To Sample Some eBooks, Go to NetLibrary. http://www.dadirect.com/Emedia/emediatitle1.asp?id=3
4. FAQs: NetLibrary: MINITEX eBooks Collection: CPERS: Programs and Services: MINITEX. http://www.minitex.umn.edu/ebook/netlib/faq.asp
5. Hyatt, Shirley, “netLibrary,” Ariadne, Oct. 10, 2002. http://www.ariadne.ac.uk/issue33/netlibrary/

BIBLIOGRAPHY

MarcEdit

Bigwood, David, “MarcEdit,” Catalogablog, Jan. 11, 2005. http://catalogablog.blogspot.com/2005/01/marcedit.html
Kentucky State University Libraries, Department Manual: Using MarcEdit to Edit Large Numbers of Bib Records. http://www.lib.ksu.edu/depts/techserv/manual/general/marcedit.html
MarcEdit Homepage: Your Complete Free MARC Software. http://oregonstate.edu/~reeset/marcedit/html/
Palermo, Natalie, Using the MarcEdit Program, LOUIS Users Conference 2002, Louisiana State University. http://www.nsula.edu/watson_library/acrl/Using%20the%20MarcEdit%20Program.ppt

Other MARC-Editing Tools

MITINET/marc Library Services: MARC Magician. http://www.mitinet.com/Products/p_cleanup.htm
The next link takes you to search results for the MARC subsection of Perl scripts within CPAN.ORG, Comprehensive Perl Archive Network. http://search.cpan.org/search?query=marc&mode=all