American Archivist / Vol. 57 / Spring 1994

Research Article

The Epic Struggle: Subject Retrieval from Large Bibliographic Databases

HELEN R. TIBBO

Abstract: Archivists have talked at length about the virtue of contributing records to a national bibliographic utility to provide enhanced access to collections. There has been little discussion, however, of the difficulties of finding materials in such large database environments. This article discusses a retrieval study that focused on collection-level archival records in the OCLC Online Union Catalog, made accessible through the EPIC search system. Data were also collected from the local OPAC at the University of North Carolina-Chapel Hill (UNC-CH) in which UNC-CH-produced OCLC records are loaded. The chief objective was to explore the retrieval environments in which a random sample of USMARC AMC records produced at UNC-Chapel Hill were found: specifically, to obtain a picture of the density of these databases in regard to each subject heading applied and, more generally, for each record. Key questions were (1) how many records would be retrieved for each subject heading attached to each of the records and (2) what was the nature of these subject headings vis-a-vis the number of hits associated with them. Findings show that large retrieval sets are a potential problem with national bibliographic utilities and that the local and national retrieval environments can vary greatly. The need for specificity in indexing is emphasized.

This article is based on a paper given at the Society of American Archivists' 1992 annual meeting in Montreal. OCLC supported this research. The author wishes to thank Patricia Haberkern, who did much of the searching.

About the author: Helen R. Tibbo is presently an assistant professor in the School of Information and Library Science at the University of North Carolina at Chapel Hill. She earned a B.A. in English from Bridgewater State College, an M.L.S. from Indiana University, an M.A. in American Studies from the University of Maryland, and a Ph.D. in Library and Information Science from Maryland as well. She teaches in the areas of reference, on-line information retrieval, and archival studies. Her primary research interests focus on optimizing information retrieval, particularly for information systems that support humanistic and archival research. She is a member of the Society of American Archivists, serving on its Editorial Board and as Chair of the Archival Educator's Roundtable, 1992-94.

Archivists¹ have talked at length about the virtue of contributing records to a national bibliographic utility such as the Online Computer Library Center (OCLC) or Research Libraries Information Network (RLIN) in order to enhance access to their collections.² There has been little discussion, however, of the difficulties of finding materials in such large database environments.³ Ironically, electronic services such as OCLC and RLIN, which promise vastly improved access to archival collections on a nationwide or even international level over that possible in printed tools such as the National Union Catalog of Manuscript Collections (NUCMC), present enormous retrieval problems themselves.⁴ As Lester Asheim has noted, "increasing the amount of information and speeding up access to it is more likely to result in information overload and entropy than it is to improve the receiver's ability to benefit from the information."⁵

¹ Archives and archivists are being used herein for convenience to indicate both institutional archives and manuscript repositories and archivists and manuscript curators, respectively, unless otherwise noted.
² See for example David Bearman, "Archives and Manuscript Control with Bibliographic Utilities: Challenges and Opportunities," American Archivist 52 (Winter 1989): 26-39; David Bearman, Toward National Information Systems for Archives and Manuscript Repositories: The National Information Systems Task Force (NISTF) Papers, 1981-1984 (Chicago, Ill.: Society of American Archivists, 1987); Elaine D. Engst, "Nationwide Access to Archival Information," Documentation Newsletter 10 (Spring 1984): 4-6; H. Thomas Hickerson, "Archival Information Exchange and the Role of Bibliographic Networks," Library Trends (Winter 1988): 553-71; H. Thomas Hickerson, "Expand Access to Archival Sources," Reference Librarian 13 (Fall 1985-Winter 1986): 195-99. James O'Toole has noted that "archivists fulfill only half their responsibility to make records available if they sit and wait for users to come to them. Instead, archivists must be active in publicizing their holdings. This responsibility implies the necessity of sharing information about what is in each archives," Understanding Archives and Manuscripts (Chicago, Ill.: Society of American Archivists, 1990), 67.
³ Avra Michelson ("Description and Reference in the Age of Automation," American Archivist 50 [Spring 1987]: 192-203) has discussed the lack of consistency in archival descriptive practice, especially the assignment of subject headings for MARC AMC records and the implications for retrieval. Matthew Gilmore has noted that the requirement of most bibliographic information systems to include at least one LCSH term in each MARC AMC record "means that archivists frequently must use a very general heading rather than the specific local thesauri," resulting in those materials "disappearing into a void." "Increasing Access to Archival Records in Library Online Public Access Catalogs," Library Trends 36 (Winter 1988): 610-11.
⁴ Library of Congress, National Union Catalog of Manuscript Collections (Washington, D.C.: Library of Congress, 1962-).
⁵ Lester Asheim, "Ortega Revisited," Library Quarterly 52 (July 1982): 215.

The user's goal is to find all relevant material and nothing more.⁶ As simple as this sounds, it is exceedingly difficult to accomplish, whether the retrieval system is word of mouth, printed format, or an electronic database. As systems grow in size, complexity, and power, they become more inclusive, but barriers to optimal retrieval effectiveness increase as well. This should not be surprising, as information retrieval power is never without its price. The larger and more heterogeneous the database, the more difficult it is to conduct subject or free-text searches effectively. Even known-item searches become slower and potentially more difficult as the search space increases.

Lancaster and his associates observe that the on-line catalog has not improved subject access but may have made the situation worse because it has led to the creation of much larger catalogs that represent the holdings of many libraries.⁷ Merging several catalogs into one, when each component catalog provides inadequate subject access, exacerbates the problem, since the larger the catalog, the more discriminating must be the subject access points provided. In recent years, catalogs have grown much larger without any significant compensatory increase in their discriminating power.

⁶ Although it can be argued that a user might only want a subset of all potentially relevant materials, that subset becomes all the items that are situationally relevant for that particular individual at that time.
⁷ F. W. Lancaster, Tschera H. Connell, Nancy Bishop, and Sherry McCowan, "Identifying Barriers to Effective Subject Access in Library Catalogs," Library Resources and Technical Services 35 (October 1991): 388.

Chandra Prabha of OCLC calls large retrievals "a problem of the 1990s."⁸ She goes on to note that, "the problem of large retrievals is accentuated in an OPAC [on-line public access catalog] environment because a majority of users are occasional or casual users."⁹ With 30 million records, the OCLC Online Union Catalog (OUC) clearly poses a challenging retrieval environment. Representing archival collections so as to optimize subject retrieval from a large bibliographic utility such as OCLC can truly be an "epic" struggle.

Regardless of the type of material represented (be it books, serials, or archival collections), document retrieval in large bibliographic databases depends on well-constructed document representations or surrogates. The semantic condensation required to represent a 350-page book or a 50-box collection in a catalog entry, or an abstract, or even an archival inventory demands that more is left unsaid than recorded in these surrogates. In the process of semantic condensation, information is necessarily lost. This loss may seem unfortunate, but the remaining distillation, when well selected, becomes a more powerful retrieval tool than the full text of the original.
A "good" surrogate eliminates "noisy" information that is found in all full texts and could cause an item to be retrieved when it should not be; a good surrogate also includes information that will facilitate its retrieval in response to appropriate queries.

It is the processor's job to create a surrogate, be it an archival finding aid or a USMARC AMC (Machine Readable Cataloging, Archives and Manuscript Control) record, that captures the most important material in the item represented in as succinct and specific a manner as possible. Of increasing importance in extremely large databases, the surrogate must not merely represent its parent document and/or collection, it must be able to distinguish it from a multitude of other very similar items.

The most subjective elements of MARC AMC records in bibliographic databases, yet certainly some of the most important regarding access, are the subject fields. Many of the other fields, such as collection title, extent, or location, are relatively straightforward.¹⁰ Collection titles can provide some manner of subject access, but for most researchers who want to find collections that contain materials related to a particular topic, a search of the 12 subject fields in a MARC AMC record will be appropriate.¹¹

⁸ Chandra Prabha, "Managing Large Retrievals: A Problem of the 1990s?" in OPACs and Beyond, Proceedings of a Joint Meeting of the British Library, DBMIST, and OCLC, OCLC Online Computer Library Center, Inc., Dublin, Ohio, August 17-18, 1988 (Dublin, Ohio: OCLC, 1989), 33.
⁹ Prabha, "Managing Large Retrievals," 33-34.
¹⁰ Even with these fields there can be serious retrieval problems, as when institutions just use "Papers" as the full title for a collection.
¹¹ For a detailed description of these fields, see Harriet Ostroff, "Subject Access to Archival and Manuscript Materials," American Archivist 53 (Winter 1990): 100-05. See also Online Computer Library Center, Archives and Manuscript Control Format, 2nd ed. with updates (Dublin, Ohio: OCLC, 1986).

This article discusses a retrieval study that focused on collection-level archival records in the OCLC Online Union Catalog, made accessible through the EPIC search system. I also collected retrieval data from the local OPAC at the University of North Carolina-Chapel Hill (UNC-CH) in which OCLC records produced by UNC-CH are loaded. The chief objective was to explore the retrieval environments in which a random sample of MARC AMC records produced at UNC-Chapel Hill were found: specifically, to obtain a picture of the density of these databases in regard to each subject heading applied and, more generally, for each record. Key questions were (1) how many records would be retrieved for each of the subject headings attached to each of the records and (2) what was the nature of these subject headings vis-a-vis the number of hits associated with them. I was particularly interested in seeing if the subject headings used at UNC-CH incurred an overwhelming number of postings in the national database and how this related to the number found in the UNC-CH OPAC. I also wanted to compare the number of postings for topical headings and personal names. This type of information is important in assessing how well a database is serving the research community because catalog persistence studies indicate that researchers, even in university settings, rarely are willing to look through hundreds of items in a catalog.
Summarizing earlier OPAC studies, Ray Larson notes that users of on-line catalogs frequently find too many items or none at all.¹² If subject headings applied to MARC AMC records incur hundreds of hits in OCLC, even if they work well in the contributing institution's local catalog, it is doubtful that researchers will find the records in the larger national bibliographic environment. To optimize the archival community's investment in providing national access to materials, archivists must explore these large retrieval environments and adjust cataloging and retrieval techniques appropriately.

¹² Ray R. Larson, "Managing Information Overload in Online Catalog Subject Searching," Proceedings of the ASIS Annual Meeting, 1989 (Medford, N.J.: Learned Information, 1989), 129.

The EPIC Service

OCLC's EPIC service is a commercially available interactive on-line searching service that provides access to several large databases.¹³ The database with which archivists are most concerned is the OCLC Online Union Catalog. If an archives sends MARC AMC records to OCLC, this is the database in which the records will appear. Currently, this database contains well over 30 million records representing information sources in a wide variety of materials and languages. It is growing at a rate of 2 million records per year, or 40,000 records per week. This is OCLC's original database, which library catalogers and interlibrary loan librarians have used for over 20 years for cooperative cataloging and for locating known items for interlibrary loan. The Library of Congress sends an average of 5,000 records per week to OCLC, with other OCLC member libraries contributing about 34,000.

¹³ For more about EPIC, see Nita Dean, "EPIC: A New Frame of Reference for the OCLC Database," OCLC Newsletter (March-April 1991): 21; "The EPIC Service Is Introduced," OCLC Newsletter (January-February 1990): 10-16; and Laurie Whitcomb, "OCLC's EPIC System Offers a New Way to Search the OCLC Database," Online 14 (January 1990): 45-50.
¹⁴ According to OCLC, "FirstSearch is an interactive searching system for library patrons" that allows them "to search a variety of bibliographic databases. . . . By following on-screen instructions, patrons can search successfully without special training." Online Computer Library Center, Inc., The FirstSearch Catalog (Dublin, Ohio: OCLC, 1992), 1.

Until the advent of the EPIC search service in 1990 and, more recently, FirstSearch,¹⁴ OCLC provided a search interface designed specifically for catalogers. The classic OCLC search protocol relies on the searcher having a book or other material in hand so that the author, title, publisher, and publishing date are known. The searcher enters parts of the title and the author's name so as to locate any existing cataloging records for that particular item. The system then retrieves any records that match the given known-item specifications. While the cataloger may have several variant records to look through, they will all be for the particular title in hand (different editions perhaps), or, if only author information is entered, they will all represent works by that individual. Despite the size of the database, a searcher can very quickly locate items via this system because all searches are based on specific, concrete information such as titles, authors' names, and International Standard Book Numbers.
The OCLC Online Union Catalog has always held subject information in the form of subject headings (usually Library of Congress Subject Headings [LCSH]) for each record, but it was not until the development of the EPIC service that OCLC provided a means by which to do subject searching, thus using these existing access points.

The EPIC search service complements the original OCLC search engine by providing keyword, phrase, and subject searching. A searcher can use Boolean, proximity, and range searching features as well as truncation and index scanning.¹⁵ The EPIC command interface, the search language, is based on the NISO Common Command Language for Interactive Information Retrieval (Z39.58). The EPIC search interface is extremely powerful, but this does not mean that users will easily be able to produce good searches. The more simplistic FirstSearch system designed specifically for end users also presents serious retrieval problems because the main problems lie not with the searching front ends but with the OCLC OUC database itself. While this enormous database works extremely well for cataloging and interlibrary loan, where the searcher has a specific title or author in mind, it is a relatively unexplored morass for subject searching. The most evident problems revolve around the size of the database and the use of broad, precoordinated Library of Congress Subject Headings for postcoordinate retrieval. These problems are not restricted to archival searching and MARC AMC records; indeed, producing manageable and complete subject search results for monographs in such a system is potentially even more difficult.

In an effort to adapt LCSH terms for electronic retrieval, OCLC takes each subject heading assigned to a book, archival collection, or other material and breaks it apart.
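The effect of breaking headings apart can be sketched in a few lines of Python. This is an illustration only, not EPIC's actual indexing: the function names, the choice to treat each subdivision as one normalized phrase, and the "--" separator are all assumptions made for the example. The sketch shows that once the parts of every heading on a record are pooled, a Boolean "and" query cannot tell whether the matching parts came from a single heading string or from several different ones.

```python
import re

def normalize(part):
    """Lowercase a subdivision and replace punctuation with spaces."""
    return " ".join(re.sub(r"[^\w\s]", " ", part).lower().split())

def heading_parts(heading):
    """Split one precoordinated heading on its subdivision dashes
    (assumed here to be '--' or an em dash)."""
    pieces = re.split(r"\s*(?:--|\u2014)\s*", heading)
    return {normalize(p) for p in pieces if p.strip()}

def pooled_parts(headings):
    """Pool the parts of every heading on a record; the original strings are lost."""
    pooled = set()
    for heading in headings:
        pooled |= heading_parts(heading)
    return pooled

def su_and(query_parts, headings):
    """Hypothetical stand-in for a subject search joined by Boolean 'and':
    every query part must appear somewhere among the pooled parts."""
    return {normalize(q) for q in query_parts} <= pooled_parts(headings)

QUERY = ["North Carolina", "History", "Civil War 1861-1865",
         "Personal Narratives Confederate"]

# One record carries the full heading string; the second carries two
# different headings that happen to supply all four parts between them.
EXACT = ["North Carolina--History--Civil War, 1861-1865--Personal Narratives, Confederate"]
MIXED = [
    "United States--History--Civil War, 1861-1865--Personal Narratives, Confederate",
    "North Carolina--Description and Travel and 19th Century",
]
UNRELATED = ["North Carolina--History"]

for label, record in [("exact", EXACT), ("mixed", MIXED), ("unrelated", UNRELATED)]:
    print(label, su_and(QUERY, record))
```

Under these assumptions both the "exact" and the "mixed" record satisfy the query, while the "unrelated" record does not. Restricting a match to parts drawn from a single heading would require keeping each heading's parts together in the index, which is a different design from the pooled one sketched here.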
This is very useful as it eliminates the need for users to construct lengthy LCSH strings in order to do subject searches and allows more flexible searching.¹⁶ To retrieve items assigned the heading "North Carolina—History—Civil War, 1861-1865—Personal Narratives, Confederate," a searcher would enter a statement with the following elements in any order connected by the Boolean and:

find su=(North Carolina and History and Civil War 1861-1865 and Personal Narratives Confederate)

The su= tells EPIC to look only through subject headings but does not limit retrieval to only records with this particular subject heading string. For example, an item with the following combination of subject headings would also be retrieved: "United States—History—Civil War, 1861-1865—Personal Narratives, Confederate" and "North Carolina—Description and Travel and 19th Century." Unfortunately, there is no mechanism by which the searcher can just receive items with a particular subject heading string, nor can the searcher browse complete subject strings in the scan mode and see how many items are posted to each.

¹⁵ Index scanning does not work well with the subject fields, as the subject strings, common to LCSH, are broken into constituent parts and do not appear in any scannable index as complete strings.
¹⁶ Many individual library OPACs require users to enter full LCSH strings with correct syntax in order to retrieve items on a topic.

Information Overload

In general, the two primary purposes of subject control are (1) to allow the user to find material on a subject, and (2) to collocate a repository's materials on a subject at one point in the catalog, thus giving the user a summary of what is contained in that collection on the given topic.
National union catalogs, such as OCLC's OUC, go one step further. Because the OCLC Union Catalog is a national database that employs LCSH, it collocates topical materials from around the country at each subject heading. Richard Smiraglia further notes that "when LCSH is used to supply subject headings for AMC formatted records, the archival materials will collocate with published materials on the same topic in an integrated bibliographic system (network or local), thus giving a user an opportunity to browse bibliographic records for both published works and primary source material under a topical heading."¹⁷

While this is theoretically a wonderful research opportunity that might well bring new researchers into archival repositories because they find archival materials next to books in the catalog, such collocation works best, or perhaps only works at all, with relatively small collections. OCLC's Union Catalog, with over 30 million records, hardly fits into the "relatively small" category. If 15,000 records collocate at a subject heading or Boolean combination of terms (not an unheard-of retrieval in EPIC), the chance that the researcher will view any one of the records is greatly diminished; indeed, it becomes a chance event dependent on when the search is done, when the record was entered into the database, and how users deal with information overload.

Researchers are not without resources to deal with information overload. Joel and Mary Jo Rudd list several ways in which library users turn a potential information overload into a manageable load.¹⁸ They explain that in addition to using Herbert Simon's principle of "satisficing" (acquiring a "satisfactory" subset of available information), researchers faced with cognitive and temporal limitations on information acquisition frequently just "skim off the top," looking only at the first few items they find in a catalog or on the shelves.
Because most bibliographic databases present retrieval sets in last-in, first-out (LIFO) order, any given record collocated at a subject heading may fall victim to the "Andy Warhol" phenomenon, wherein each record is famous for its 15 minutes until it sinks into the morass of the database as newer records pile on top of it. The problem here, of course, is that the most appropriate records, particularly in fields such as history, where information does not go out of date quickly, may be at the bottom of the pile. Indexing consistency becomes important for only the most comprehensive searches and tenacious database searchers, but distinction drawn among items comes to the fore. Ortega y Gasset's 1934 definition of a librarian as "a filter interposed between man and the torrent of books" can now apply to the archivist and the on-line catalog or on-line bibliographic systems.¹⁹

¹⁷ Richard P. Smiraglia, "Subject Access to Archival Materials Using LCSH," in Describing Archival Materials: The Use of the MARC AMC Format, edited by Richard P. Smiraglia (New York: Haworth Press, 1990), 64.
¹⁸ Joel Rudd and Mary Jo Rudd, "Coping with Information Load: User Strategies and Implications for Librarians," College and Research Libraries 47 (May 1986): 315-22.
¹⁹ Jose Ortega y Gasset, The Mission of the Librarian, translated by James Lewis and Ray Carpenter (Boston: G.K. Hall, 1961).

Stephen E. Wiberley, Jr., Robert A. Daugherty, and James A. Danowski conducted a "users' persistence" study in 1987.²⁰ They looked for what David Blair calls the "anticipated futility point."²¹ Blair defines this as the number of documents a researcher will be willing to begin to browse through.
Karen Markey has called this user "perseverance."²² Wiberley and his colleagues adapted Blair's definition to the number of references in an on-line catalog that users were willing to scan in discretionary information-seeking situations. Subject searching fits into this discretionary type of information seeking in that the user never knows the extent of information available and thus feels no compulsion to search out a particular fact or title. Wiberley, Daugherty, and Danowski studied user persistence or perseverance with an academic library OPAC that contained more than 425,000 records. They studied user transaction logs and questionnaires. The median response to the question "How many postings would you consider to be too many?" was fifteen. The transaction log data indicated a sharp drop-off in persistence with more than 30 postings and a great drop-off after sixty. More specifically, they found that while a majority of users "displays all general records for searches that retrieve between eleven and thirty postings, when searches retrieve more than thirty postings, a majority of users displays no records."²³

²⁰ Stephen E. Wiberley, Jr., Robert A. Daugherty, and James A. Danowski, "User Persistence in Scanning Postings of a Computer-Driven Information System: LCS," Library and Information Science Research 12 (October-December 1990): 341-53. See also Stephen E. Wiberley, Jr., and Robert A. Daugherty, "Users' Persistence in Scanning Lists of References," College and Research Libraries 49 (March 1988): 149-56.
²¹ David C. Blair, "Searching Biases in Large Interactive Document Retrieval Systems," Journal of the American Society for Information Science 31 (July 1980): 271.
²² Karen Markey, Subject Searching in Library Catalogs: Before and After the Introduction of Online Catalogs (Dublin, Ohio: OCLC, 1984), 67-71.
²³ Wiberley, Daugherty, and Danowski, "User Persistence," 352.
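The thresholds reported in the persistence study can be turned into a rough yardstick for judging a retrieval set. The sketch below is only an illustration: the three cut-off values (15, 30, and 60) come from the study as summarized above, but the function name, the band labels, and the sample set sizes are hypothetical and are not part of the study itself.

```python
# Cut-offs taken from the figures summarized above; labels are illustrative only.
TOO_MANY_MEDIAN = 15         # median questionnaire answer for "too many" postings
MAJORITY_DISPLAY_LIMIT = 30  # above this, a majority of users displayed no records
GREAT_DROP_OFF = 60          # transaction logs showed a great drop-off after this

def persistence_band(postings):
    """Place a retrieval-set size in a rough persistence band (hypothetical labels)."""
    if postings <= 0:
        return "no hits"
    if postings <= TOO_MANY_MEDIAN:
        return "within stated tolerance"
    if postings <= MAJORITY_DISPLAY_LIMIT:
        return "majority still display records"
    if postings <= GREAT_DROP_OFF:
        return "majority display nothing"
    return "past the great drop-off"

# Hypothetical retrieval-set sizes, chosen to fall in each band.
for size in (0, 12, 25, 45, 200):
    print(size, persistence_band(size))
```

A yardstick of this kind says nothing about relevance; it only flags retrieval sets that, on the study's evidence, most catalog users are unlikely to scan.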
OPAC users, such as those in the Wiberley, Daugherty, and Danowski study, may tolerate fewer citations than on-line-search service clients, who may turn to commercial on-line databases only when they want an exhaustive search. The searching literature and vendors such as DIALOG Information Services generally hold that very few on-line-search clients are willing to look through more than 100 citations, with many people willing to scan only 50 or fewer items. This information holds serious implications for archival researchers using on-line databases such as OCLC's OUC and locally or Internet-available library catalogs. To understand how best to represent documents or collections of materials in these contexts, we need first to explore these retrieval environments.

Methodology

In February 1992 I selected a random sample of 60 MARC AMC records representing collections held in UNC-CH's Southern Historical Collection from the OCLC Online Union Catalog. A graduate assistant searched the subject headings attached to each of these records in OCLC as well as in the university's on-line catalog in March 1992. For example, "Merchants—North Carolina—History—19th century" retrieved 67 items in the UNC-CH on-line catalog and 106 items in the OCLC OUC in March 1992. In August 1992 and June 1993 I again searched all headings in the on-line catalog, the entire OCLC database, and the manuscripts portion of the OCLC database. In comparing the data I discovered that because one record was such an outlier it distorted the picture for the mean number of hits per search term and per record. In this case, one heading, "Sermons," received 54,904 hits in OCLC in August 1992. I eliminated this record from the sample, thus bringing the usable population to fifty-nine.
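The distorting effect of a single heavily posted heading is easy to see with a small calculation. In the sketch below only the "Sermons" count of 54,904 comes from the study; the five smaller posting counts are hypothetical values chosen for illustration, and the variable names are likewise assumptions.

```python
from statistics import mean, median

# Hypothetical posting counts for five ordinary headings (illustrative values only).
typical = [14, 26, 67, 106, 229]

# The one figure taken from the study: OCLC postings for "Sermons", August 1992.
SERMONS = 54_904

with_outlier = typical + [SERMONS]

print("without outlier: mean", mean(typical), "median", median(typical))
print("with outlier:    mean", round(mean(with_outlier), 1),
      "median", median(with_outlier))
```

With these illustrative values the mean rises from under 100 to more than 9,000 once the outlier is added, while the median barely moves. That contrast is the reason a single record of this kind can be dropped from a sample when means are to be reported.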
I also dis- covered that the graduate student had D ow nloaded from http://m eridian.allenpress.com /doi/pdf/10.17723/aarc.57.2.f0650763x258t4p5 by C arnegie M ellon U niversity user on 06 A pril 2021 Subject Retrieval from Large Bibiolographic Databases 317 Table 1. Mean Number of Postings per Term Table 3. Median Number of Postings per Term EPIC—June 1993 EPIC—August 1992 EPIC—March 1992 EPIC/mss—June 1993 EPIC/mss—August 1992 Local Total—June 1993 Local Total—August 1992 Local Specific—June 1993 Local Specific—March 1992 Table 2. Mean Number of per Term per Record EPIC-^June 1993 EPIC—August 1992 EPIC—March 1992 EPIC/mss—June 1993 EPIC/mss—August 1992 Local Total-^June 1993 Local Total—August 1992 Local Specific—June 1993 Local Specific—March 1992 229 207 196 67 59 42 39 29 20 Postings 252 235 220 C D C O C O C O 45 41 29 20 searched the local records in a different manner, so that the local data for March 1992 are not able to be compared to the August 1992 results but are comparable to one set of the June 1993 findings. Findings Table 1 shows the mean number of hits or postings for the 519 subject headings as- sociated with the 59 records. The mean number of subject headings per record was 8.8, with a median of 8.0. The first EPIC search retrieved an average of 196 postings per heading. Only five months later this number rose by 11 points, and nine months after that it went up another 22 points. Keeping Wiberley, Daugherty, and Dan- owski's findings in mind, these results should be alarming. Even when the manuscript records are separated from the other materials in EPIC—June 1993 EPIC—August 1992 EPIC—March 1992 EPIC/mss-^June 1993 EPIC/mss—August 1992 Local Total-^lune 1993 Local Total—August 1992 Local Specific—June 1993 Local Specific—March 1992 Table 4. 
Median Number of Postings per Term per Record

Date of Search                 Per Term (table 3)   Per Term per Record (table 4)
EPIC—June 1993                        101                     128
EPIC—August 1992                       93                     120
EPIC—March 1992                        79                     105
EPIC/mss—June 1993                     46                      44
EPIC/mss—August 1992                   43                      40
Local Total—June 1993                  26                      27
Local Total—August 1992                24                      24
Local Specific—June 1993               21                      21
Local Specific—March 1992              14                      16

OCLC, the average retrieval was 67. News is better for the local catalog, with an average of 42 hits in June 1993 and 39 in August 1992. These numbers represent the total figure given for an entry such as "Virginia—Civil War." The UNC-CH catalog provides this figure for this term and all subdivisions, such as "Correspondence" or "Stores and supplies" before listing any brief titles on the screen. The searcher in March had gone to the second step of looking in the index—the actual list of subject headings used in the catalog—and had recorded the number of items specifically attached to the broader term (e.g., "Virginia—History—Civil War, 1861-1865"), but she did not include figures for any of the subtotals. Thus, the March figure, which took more searching expertise to derive, is as conservative as possible and is still twenty. This number rose to 29 by June 1993. Table 2 provides data on the mean number of postings per term per record. The results are even worse with this method of calculation.

The median number of postings per term (table 3) and per term per record (table 4) may represent a more accurate picture of the data due to a few extremely heavily posted terms that distorted the means. Although it contains lower figures, table 3 shows a 22-point increase in the OCLC figures over the 14-month period.

Enumerations of the number of postings per record and the mean number of postings per term per record show the range in postings (tables 5 through 8).

Table 5. Postings per Record, EPIC, June 1993

     9     16     20     20     46     67     80    110    125    150
   152    179    198    257    299    355    360    406    422    500
   525    536    552    646    651    652    658    668    735    810
   841    866    966    971    974  1,155  1,206  1,306  1,357  1,482
 1,497  1,565  1,808  1,813  1,846  1,867  2,877  3,161  3,728  3,918
 4,589  4,723  4,778  5,749  6,320  6,453  9,454 13,622 16,666

Table 6. Mean Number of Postings per Record, EPIC, June 1993

     2.25     4.00     5.00     5.30     7.67    15.20    15.23    15.71    16.00    16.75
    25.00    27.18    31.25    33.83    41.67    42.83    44.75    46.72    50.18    51.43
    52.50    52.75    67.50    76.57    83.17    96.60   107.67   116.69   123.50   128.33
   129.14   130.40   131.60   144.33   147.00   150.75   156.50   162.75   169.63   177.50
   186.57   194.20   205.10   243.50   259.00   265.44   302.58   314.87   319.67   327.79
   338.91   351.22   489.75   526.67   567.58   668.00   859.46 1,290.60 4,166.50

Table 7. Postings per Record, Local Total, June 1993

     6      9     10     14     17     22     23     30     35     38
    47     78     84     85     97    106    108    114    122    124
   141    148    151    154    161    164    178    178    189    207
   217    224    254    296    299    311    344    346    348    348
   352    406    425    450    465    496    532    699    762    829
   955    961  1,022  1,097  1,122  1,133  1,207  1,233  1,700

Table 8. Mean Number of Postings per Item per Record, Local Total, June 1993

   1.50   1.80   2.50   4.25   4.67   4.70   5.00   5.00   5.75   7.08
   7.60   8.31   8.40   9.64  11.50  12.00  12.13  13.50  14.83  15.40
  16.44  18.50  19.00  19.50  20.33  20.67  22.00  24.86  25.43  27.13
  27.18  30.20  32.00  34.56  35.42  36.29  38.83  40.25  46.50  51.38
  55.11  56.25  57.33  59.63  60.94  63.50  68.13  69.60  70.40  70.50
  76.00  81.20  86.50  86.82  92.11 109.73 120.13 224.20 425.00

Table 9, showing the greatest number of hits per term, indicates how useless a subject heading can become in a database of 30 million records. "United States—History—Revolution—1775-1783" retrieved 16,393 items in the June 1993 complete OCLC search and 1,628 items from the UNC-CH catalog. "North Carolina—History" retrieved 2,213 just from the manuscripts in OCLC.

Table 9. Greatest Number of Postings per Term

Date of Search                 Postings
EPIC—June 1993                 16,393*
EPIC—August 1992               15,641*
EPIC—March 1992                15,001*
EPIC/mss—June 1993              2,213**
EPIC/mss—August 1992            1,797**
Local Total—June 1993           1,628*
Local Total—August 1992         1,438*
Local Specific—June 1993        1,012†
Local Specific—March 1992         962†

*  United States—History—Revolution—1775-1783
** North Carolina—History
†  World War, 1914-1918—France

The number of headings incurring only one hit—the vast majority of these being personal names—indicates that the remaining topical subject headings received more postings on average than shown above (see table 10). The picture becomes bleaker and bleaker for the use of topical subject headings in such a large database, especially when we realize that many of the headings analyzed are used predominantly by archivists, and archivists have been contributing to OCLC only for a few years! Even the UNC-CH catalog now averages over 60 postings for the multiple-hit headings (table 11). These figures would be even higher if subject headings with two and three hits (still mostly names) were added to those with just one.

Table 10. Terms with Only 1 Posting

Date of Search                 Number of Terms   Percent of Terms
EPIC—June 1993                       132               25
EPIC—August 1992                     144               28
EPIC—March 1992                      145               28
EPIC/mss—June 1993                   154               30
EPIC/mss—August 1992                 167               32
Local Total—June 1993                181               35
Local Total—August 1992              190               37
Local Specific—June 1993             189               36
Local Specific—March 1992            241               46

Table 11. Mean Number of Postings per Multiple-Hit Terms

Date of Search                 Mean Postings
EPIC—June 1993                       307
EPIC—August 1992                     286
EPIC—March 1992                      272
EPIC/mss—June 1993                    95
EPIC/mss—August 1992                  87
Local Total—June 1993                 64
Local Total—August 1992               60
Local Specific—June 1993              45
Local Specific—March 1992             37

Table 12 shows the number of hits on individuals' names and the average number of postings on the names as a whole. There were 108 individuals' names included as subject access points in these records. In comparison, entries for 29 families, such as "Rogers," "Smith," and "Erwin," yielded many more postings, particularly in the OCLC OUC (table 13).

Table 12. Total and Average Number of Postings for Individuals' Names

Date of Search                 Number of Postings   Average Number of Postings
EPIC—June 1993                        450                     4.2
EPIC—August 1992                      432                     4.0
EPIC—March 1992                       408                     3.8
EPIC/mss—June 1993                    189                     1.8
EPIC/mss—August 1992                  186                     1.7
Local Total—June 1993                 225                     2.1
Local Total—August 1992               214                     2.0
Local Specific—June 1993              200                     1.9
Local Specific—March 1992             183                     1.7

Discussion

What do these data tell archivists, who both create records for national databases and help researchers locate materials in them? First and foremost, it is important to realize that what may work locally will not necessarily work in a 30-million-record database. This is not to say that the use of such databases for national access is not a good idea. Rather, archivists have to understand the nature of the environments into which they are sending their records and do all they can to help them compete. In database terms this means providing access points that will help the records to be retrieved and read when they are relevant items.
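The earlier remark that a few extremely heavily posted terms distorted the means can be checked against the published figures. The following is a minimal sketch in Python, assuming the 59 values are transcribed correctly from table 5 (postings per record, EPIC, June 1993); the variable and function names are illustrative and are not part of the study.

```python
from statistics import mean, median

# Postings per record, EPIC, June 1993, as transcribed from table 5.
postings_per_record = [
    9, 16, 20, 20, 46, 67, 80, 110, 125, 150,
    152, 179, 198, 257, 299, 355, 360, 406, 422, 500,
    525, 536, 552, 646, 651, 652, 658, 668, 735, 810,
    841, 866, 966, 971, 974, 1155, 1206, 1306, 1357, 1482,
    1497, 1565, 1808, 1813, 1846, 1867, 2877, 3161, 3728, 3918,
    4589, 4723, 4778, 5749, 6320, 6453, 9454, 13622, 16666,
]


def summarize(values):
    """Return (mean, median) for a list of posting counts."""
    return mean(values), median(values)


avg, mid = summarize(postings_per_record)

# Share of all postings contributed by the three most heavily posted records.
top_three_share = sum(sorted(postings_per_record)[-3:]) / sum(postings_per_record)

print(f"mean={avg:.0f} median={mid} top-three share={top_three_share:.0%}")
```

With these values the median is 810 while the mean is roughly 2,000, so a handful of very large retrieval sets pulls the mean well above what a typical record experiences. The same effect is at work in the per-term figures discussed above.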
Both local and national concerns must be balanced. A "good" record is a little bit like the proverbial good child: It should speak only when spoken to—that is, present itself for retrieval when it is relevant to a researcher's needs, but otherwise be silent. As with children, it is often difficult to make bibliographic records behave.

To extend the analogy, most child experts will tell you that environment, as well as genetics or specific parental teachings, plays a role in how children behave. Such is the case with bibliographic records. A bibliographic record that does not use standardized subject access terms may never be found in a national database. Such practice will lead to low-recall searches. At the same time, a seemingly excellent record with standardized subject headings that represents a collection very well may find itself buried in other seemingly excellent records if there is much material on that topic in a large database. In this scenario, document discrimination and search precision become overriding concerns. The record and its creator must adapt to this environment or risk oblivion. The same record may work well "at home," where there are relatively few items on this topic in the on-line catalog. Conversely, the local catalog may require augmented local subject headings that make sense in that environment.

Not only must archivists consider collection and user characteristics in providing subject access, they must also consider the environment into which the records will be sent. This may mean sending one record off to a national utility while placing another record, perhaps with local subject headings and location information, in the home OPAC. There is no reason, other than additional processing costs, why the two records must be identical.
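The trade-off between recall and precision described in this section can be made concrete with the standard definitions: precision is the share of retrieved records that are relevant, and recall is the share of relevant records that are retrieved. The sketch below is illustrative only; the record identifiers and set sizes are invented and do not come from the study.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one search.

    retrieved: set of record IDs the search returned
    relevant:  set of record IDs the researcher actually needs
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical record IDs, invented for illustration only.
relevant = {"rec-001", "rec-002", "rec-003", "rec-004"}

# A broad standardized heading: every relevant record is found,
# but it sits in a retrieval set of 1,000 records.
broad_search = relevant | {f"other-{i}" for i in range(996)}

# A local, nonstandard heading: two records returned, both relevant,
# but two relevant records are missed.
narrow_search = {"rec-001", "rec-002"}

print(precision_recall(broad_search, relevant))   # (0.004, 1.0)
print(precision_recall(narrow_search, relevant))  # (1.0, 0.5)
```

The broad heading achieves full recall at very low precision (the record is "buried"), while the nonstandard heading is precise but misses half of the relevant material.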
Table 13. Total and Average Number of Postings for Family Names

Date of Search                 Number of Postings   Average Number of Postings
EPIC—June 1993                      2,963                   102.0
EPIC—August 1992                    2,684                    92.5
EPIC—March 1992                     2,621                    90.4
EPIC/mss—June 1993                    590                    20.4
EPIC/mss—August 1992                  277                     9.6
Local Total—June 1993                 165                     5.7
Local Total—August 1992               149                     5.1
Local Specific—June 1993              139                     4.8
Local Specific—March 1992             112                     3.9

Representing Archival Collections

Most important in making records behave in any bibliographic environment is the archivist's responsibility for capturing the key concepts of the materials in their finding aids. As David Bearman argues, consistency of topical headings is not so important if we provide very good searching tools such as switching vocabularies and "intelligent" front ends (and that is a big "if").24 Selecting and representing key concepts are highly subjective and difficult tasks, and those selected will not always fit the needs and visions of future users. This work will never be scientific, but it will always be important, just as archival processing has been important in the past. The great service here is to reduce the bulk of information to be searched in a meaningful and rational manner, keeping in mind that it is better to do this work now than to wait for perfection that will never come. Representing materials completely and succinctly, while differentiating them from a multitude of similar documents, lies at the heart of any information storage and retrieval system. As with the MARC AMC format, archivists now need to focus on the types of subjects to be documented.
They need to build a subject access framework to identify what subjects in archival collections should be represented in subject indexing, as Bearman and others have pointed out.25

24. David Bearman, "Authority Control Issues and Prospects," American Archivist 52 (Summer 1989): 288.
25. Bearman, "Authority Control," 286-99; Helena Zinkham, Patricia D. Cloud, and Hope Mayo, "Providing Access by Form of Material, Genre, and Physical Characteristics: Benefits and Techniques," American Archivist 52 (Summer 1989): 300-19. Helen Tibbo extends this notion to a framework for abstracting in Abstracting, Information Retrieval and the Humanities: Providing Access to Historical Literature (Chicago: American Library Association, 1993).

Beyond ensuring that the truly significant material in a collection is represented, appropriate indexing language is central to creating good bibliographic records. Regardless of the item, be it an entire collection or a series, the specificity and exhaustivity of the indexing language are important. If these elements are appropriate to the material being represented, some subject-indexing consistency should follow, with strict authority control being left to more specific forms of information, such as personal, corporate, and geopolitical names, as well as collection forms and functions. Database producers have long recognized the importance of appropriate indexing languages for their materials. Thus, databases such as MEDLINE, ERIC, and Psychological Abstracts all have their own controlled vocabularies and thesauri. Broad, undifferentiated topical headings, common to LCSH, do not appear to work well for retrieval from large electronic databases.
If repositories collecting in similar areas work together on authority lists, appropriate index terms, and user thesauri, these efforts could increase the consistency with which the institutions with key collections represent their materials without sacrificing necessary specificity to a monolithic indexing language. This would also allow archivists to retain much of their "rugged individualism"26 while cooperating with related institutions. Archivists could then coordinate and disseminate such vocabularies nationally.

Avra Michelson sent a common description of an archival collection to several repositories and discovered a total lack of consistency in descriptive practice, especially in the assignment of subject headings.27 While no conclusions regarding indexing consistency can be drawn from the present study, it is clear that archivists from across the United States are applying the same subject headings, even quite lengthy and complicated LCSH strings, to hundreds and thousands of records. In many cases they use terms that librarians also select quite frequently. We do not know if archivists are consistently applying these terms to the same concepts, but we do know that large numbers of postings are accruing at certain topical headings, even when these are delimited by geographic location and date. Because archivists in different institutions never index the same collections, more context-sensitive studies of indexing consistency may be necessary if we are to judge accurately the extent of indexing consistency. What is clear is that better document discrimination, possibly through more specific, appropriate, and exhaustive indexing languages, is necessary as databases continue to grow.

26. Janet Gertz and Leon J. Stout, "The MARC Archival and Manuscripts Control (AMC) Format: A New Direction in Cataloging," Cataloging and Classification Quarterly 9 (1989): 5.
27. Michelson, "Description and Reference."
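A study of indexing consistency of the kind called for here needs some way to score agreement between two indexers working from the same description. One simple option, offered only as an assumption and not as the measure used by Michelson or in the present study, is the share of headings the two indexers have in common out of all headings either one assigned. The headings below are hypothetical, and "--" is used as the subdivision separator purely for convenience.

```python
def indexing_consistency(terms_a, terms_b):
    """Share of headings two indexers have in common (0.0 to 1.0)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty term lists agree trivially
    return len(a & b) / len(a | b)


# Hypothetical headings assigned by two repositories to the same description.
repository_a = [
    "Plantations--North Carolina",
    "Slavery--North Carolina",
    "United States--History--Civil War, 1861-1865",
]
repository_b = [
    "Plantation life--North Carolina",
    "United States--History--Civil War, 1861-1865",
]

print(indexing_consistency(repository_a, repository_b))  # 0.25
```

Exact string matching is deliberately strict in this sketch: two headings that a researcher would regard as near-synonyms count as disagreement, which is how they would behave in a catalog search as well.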
Another representation issue is determining the best archival level on which to base MARC AMC records. It is important to recognize that these records serve to describe collections in only a minimal fashion. The primary function of MARC AMC records is to lead the searcher to a finding aid, which in turn documents and describes in detail the collection and parts thereof.28 As such, MARC AMC records cannot fully describe a collection, nor should they. Having said this, I should add that it is probably best to provide collection-level access in MARC AMC records, as the introductory information in an inventory serves as an umbrella for the series and folder descriptions. Certain situations, however, can make the creation of just collection-level records arbitrary. A particular series, or even an individual item, may outweigh the value of the rest of the collection. If this is the case, and if the general terms that best describe the collection as a whole do not provide optimal specificity for the important part of the collection, a separate MARC AMC record would help to facilitate access. Such a record, however, would have to lead the researcher to the collection-level record or provide enough provenancial context so that the researcher could locate the collection.

28. In Archives, Personal Papers, and Manuscripts, 2nd ed. (Chicago, Ill.: Society of American Archivists, 1989), Steve Hensen notes that "The chief source of information for archival materials is the finding aid prepared for those materials" (p. 9) and not the materials themselves unless there is no finding aid or provenance or accession records. Thus, the cataloging record is derived from the finding aid as the finding aid is based on the collection.

In OCLC, with a limited number of subject headings per record, the archivist frequently cannot assign enough headings to index appropriately both the entire collection and its significant parts. In RLIN,
where any number of headings can be used, excessively long lists and long records can discourage researchers from looking at the items they retrieve. In both cases, separate records that provide access to the collection as a whole and to specific parts would provide better access to this material than does one inadequate or overly long record. If different names, organizations, or institutions are prominent in various series of a collection it may be a good idea to make linked MARC AMC records for each relevant series and to index these with the prominent names. Subject access common to all series in a collection should be kept with the main record so as not to replicate the topical headings for the collection several times within the database. This is not to say that all series, folders, or items need to be represented just because a few are deemed to be important. What might appear to be uneven representation of the collection in terms of a finding aid could provide optimal access to key elements. In this way, cataloging and access become intricately tied to appraisal. It is important to remember that in a database all records, whether they represent important or relatively insignificant materials, become equal in the retrieval game. Responsible appraisal of what should be represented in the database becomes a powerful retrieval tool.

As Bearman notes, a large number of subject headings per record gives that record a better chance to be retrieved.29 When repeated by everyone, the practice of applying more and more subject headings will serve primarily to increase the size of the database and will result in overwhelmingly large retrieval sets and long records.
This is already the case in RLIN, where records may go on for 12 screens and have over 200 subject headings.30 The best policy is to select important material and represent it accurately and precisely. As with appraisal, selection is critical. It is irresponsible to "pollute" a retrieval environment with extraneous or repetitive postings to terms just to increase the likelihood that a given record will be retrieved. We do not want to clog up our databases any more than our shelving or backlog areas.

Retrieving Archival Materials

Reference archivists must become expert searchers of national bibliographic systems and on-line catalogs that are available on the Internet if they are to provide their clients with the highest possible level of service. Since both OCLC and RLIN must be employed for comprehensive searches, and since many archival researchers want high-recall searches,31 archivists must become well versed in both systems. This means becoming familiar with the searching languages and capabilities and, more important, with basic information-retrieval principles and strategies. Today's electronic information-retrieval systems are deceptively easy to use, so much so that even the novice searcher can find something on most topics. At the same time, it is often very difficult to do a good search that optimizes recall and precision. This is particularly true in large databases. Archivists must be prepared to do searches for clients and to assist clients in conducting their own searches. Indeed, there is a large role for user education, particularly with CD-ROM products and Internet-available on-line catalogs. Searching guides and instructional classes will become necessary if clients are to do their own searching.

29. Bearman, "Authority Control," 289.
30. Kathleen Roe discussed the problems related to lengthy RLIN records at the 1992 SAA Annual Meeting in Montreal in a paper titled, "Autonomy vs.
Community: Life in an Archives Database Commune."
31. Mary Jo Pugh, "The Illusion of Omniscience: Subject Access and the Reference Archivist," American Archivist 45 (Winter 1982): 33-44.

Archivists must not only learn how best to apply subject headings, they must also turn this knowledge into searching expertise. Librarians are coming to see the difficulty of using a precoordinate indexing language, such as LCSH, for postcoordinate retrieval, and hopefully there will be significant improvements in future LCSH versions in the age of OPACs. OCLC has recognized the precoordinate problem, and thus breaks each LCSH heading and subheading apart to facilitate more flexible retrieval. Because of this, the searcher does not have to worry about matching the syntax of lengthy LCSH strings in EPIC, although this may still be the case with on-line catalogs available locally and on the Internet. Reference archivists must become skilled in searching all of these tools. They must know how to construct LCSH strings for searching OPACs and realize that the breadth of many LCSH terms, even when combined with other terms and delimiters in the OCLC or RLIN OUC, may prohibit precise retrieval.

When searching individual OPACs via the Internet, it is important to remember Avra Michelson's study. Archivists tend to use different terms (even when restricted to a controlled vocabulary) to describe the same things. Thus, when searching someone else's catalog, we should remember that it is important to use a number of synonymous search terms to ensure high recall (if that is the objective). It is always easier to search our own catalog wherein we know the terms local staff members tend to use over and over.
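The difference between matching a whole precoordinated string and searching its separated parts can be sketched as follows. This is not a description of how EPIC is implemented; it is a minimal illustration in Python that assumes headings are stored with "--" between subdivisions, and the record identifiers, sample headings, and function names are all hypothetical.

```python
from collections import defaultdict


def build_facet_index(records):
    """Map each heading component (lowercased) to the IDs of records using it."""
    index = defaultdict(set)
    for record_id, headings in records.items():
        for heading in headings:
            for part in heading.split("--"):
                index[part.strip().lower()].add(record_id)
    return index


def search_all(index, *parts):
    """Boolean AND: IDs of records carrying every requested component."""
    sets = [index.get(p.strip().lower(), set()) for p in parts]
    return set.intersection(*sets) if sets else set()


def search_any(index, *parts):
    """Boolean OR over alternative terms, for higher recall."""
    found = set()
    for p in parts:
        found |= index.get(p.strip().lower(), set())
    return found


# Hypothetical records and headings, for illustration only.
records = {
    "A": ["Virginia--History--Civil War, 1861-1865"],
    "B": ["Virginia--History--Civil War, 1861-1865--Correspondence"],
    "C": ["North Carolina--History"],
    "D": ["Diaries"],
}
index = build_facet_index(records)

# Component order no longer matters once the string is broken apart.
print(sorted(search_all(index, "history", "virginia")))        # ['A', 'B']
# Several alternative terms widen the net when recall is the goal.
print(sorted(search_any(index, "correspondence", "diaries")))  # ['B', 'D']
```

In this sketch the AND search operates at the record level, so components drawn from different headings on the same record would also match. That is one reason broad components such as "History" can still produce large sets even when combined with other terms.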
It would be a great service to the field if institutions with like collections cooperated in building "common term" lists and then made these available to other institutions and clients, complete with examples on how to make searches as specific as possible. These could even be mounted on Internet gopher servers for easy access.

Headings divided by geographic and temporal elements—facets found to be important to historians' information-seeking methodologies—work well to distinguish items that are topically related.32 Jackie Dooley also notes the importance of space and time delimiters for providing more refined subject access.33 Such delimiters, however, provide only a partial answer. As can be seen from examples given above, even when subject headings contain locales and date ranges, a large number of records may be retrieved, and thus the actual topical subject terms must also be specific. Conversely, many items may be omitted from date- or place-restricted retrieval sets if processors failed to include all possible specific delimiters and subheadings. When a collection covers several geographical areas and years, processors may be forced to use broader terms because they are restricted in the number of more specific designations they can make. Reference archivists should advise clients searching OCLC or similar databases to use geographical and temporal elements in search strategies, but clients should also be aware that many relevant records will not be retrieved with these limitations. Processors must assign geographic and temporal subheadings to LCSH when these notions are central to the collection being represented, and reference archivists must explain the realities and limitations of database searching to clients.

If only primary materials are desired, limiting a search to the manuscripts segment of the OCLC database seems a good strategy to limit set size.
Examination of records in the larger OCLC sets, however, reveals that many archival materials have been entered in the MARC book format. Thus, searches restricted to manuscripts will not retrieve all relevant items. Furthermore, such searches will not collocate published and unpublished sources, which may be what the researcher wants. This strategy should be used quite carefully and explained to the client.

32. Tibbo, Abstracting, Information Retrieval, and the Humanities.
33. Jackie Dooley, "Subject Indexing in Context," American Archivist 55 (Spring 1992): 348.

Subdivision by form is a useful retrieval strategy, but headings such as "Sermons" or "Diaries" by themselves get lost in the shuffle. It is very important in large databases to combine form headings with other topical, temporal, or geopolitical headings. Along this line, headings such as "Brown Family," while they may work in our local catalogs where there is only one Brown family, produce quite undesirable results in a national catalog. Ideally, each subject heading is supposed to denote only one concept. Although there may be linkages among the over 200 records in OCLC with the heading "Brown Family," in many cases individual families that are in no way related are represented. This indicates a total lack of authority control and results in excessive postings because separate concepts (different families) are represented by the same term.34 Searchers should usually try to limit queries with family names to particular geographic locations.

Entry of specific personal or corporate names, which can be expected to have very few hits even in large databases, seems to be one way to provide specific access without running the risk of unwieldy retrieval sets.
Without time-consuming name authority work, however, names may provide only partial access to relevant materials. Fortunately, searchers may be able to overcome many variations in names with truncation and other search tactics, but total pseudonyms will remain invisible to a searcher unless a link is made in the database.35 The primary drawback to retrieval by personal name is that the researcher must know the key players in the area being studied before finding the material. While names and institutions provide a type of subject access, they augment rather than replace topical access points.

34. There are two ways in which authority control (i.e., use of a controlled vocabulary) can be violated: (1) the same concept can be represented by different terms, and (2) different concepts can be represented by the same term. The former case is most often considered, but the latter may be more difficult to overcome from a retrieval point of view, particularly when large numbers of records are retrieved.
35. Actually, a sophisticated search system would be able to retrieve pseudonyms of any name entered without the searcher ever being worried with the matter, if so programmed. This is not the reality of major search services today and the upkeep cost of such a service makes it unlikely in the near future.

The Future

At this time, we just do not know enough about how researchers attempt to look for archival materials in national databases or in local on-line catalogs. This information should drive the design of our information systems and our document representations. In its absence, the cardinal rule of indexing—"Index at the most specific level possible"—should always apply, but this edict is often ambiguous. Even more problematic is the searcher's analog: "Search at the most specific level possible." Richard Pearce-Moses raised valuable questions in this regard in a posting to the ARCHIVES listserv in December 1992:

    Fixing up LCSH and MARC may be the last steps we should be worrying about. Maybe we need to define some common research strategies based on patron needs: What are patrons asking of our materials? and What tools do we need to match our material to those requests?36

In addition to user studies, much more research into the nature of retrieval from large bibliographic databases is needed. This work would benefit all players in the information community, as most databases are growing at an alarming rate. Retrieval studies comparing OCLC, RLIN, and Internet-available OPACs are also needed. Because the RLIN record structure is more felicitous to archival information, most archivists believe it is the information system of choice for archival materials. Only research will substantiate this belief. If researchers know which repositories hold the materials they want, searching individual catalogs via the Internet may produce the best retrieval results once users deal with all the OPAC search variations. This approach is the electronic equivalent to writing individual archivists to see what their collections hold in a given area. Many interesting studies wait to be conducted.

36. Richard Pearce-Moses, "LCSH—Summation and Opinions—Sources—1992," ARCHIVES listserv (15 December 1992).

In this day of information gluttony and those surfeited years that surely lie ahead, responsible appraisal and provision of access to significant materials are central to the archivist's function. We know we cannot save everything. Now we must learn that only a portion of what we do save will merit specialized avenues of access.
If we do not practice such restraint and temperance, the national bibliographic databases will grow to useless proportions and our processing backlogs will overwhelm us. We need to represent those materials deemed worthy with as much specificity as possible to stem the tide against the meaninglessness of massive retrievals from electronic systems. As noted earlier, catalogs need to describe works and collections while distinguishing them from a myriad of others. To achieve the former without the latter will produce databases that are both enormous and brutally random. They will become the archivist's, and the librarian's, Moby Dick: an obsession to maintain with an overwhelming whiteness and lack of meaning and direction.

Lester Asheim has observed that "the rich store of information to which librarians can now provide access has a tremendous potential for good—to the individual and to the society." He continues by noting that, "as collectors, librarians have contributed to the information overload which inhibits rather than promotes achievement of the goal we had in view." He asks librarians if they do not "have an obligation now to provide a solution to the problem [that they] have helped to create."37 Is it not time that archivists started to face the problem of information overload and stopped being lulled into a false sense of security offered by national databases and the allure of superficial subject access?

Some call for scrapping the information systems we now have and starting over, but this will not solve all the problems. There will never be a "perfect" information storage and retrieval system for archival materials, even if archivists design it from scratch specifically to meet their needs, because language and the human mind are the real problems. Subject retrieval—or for that matter, any form of text representation—will never be perfect.
Archivists must recognize this and move forward, balancing local and national needs and building systems that are useful and possible. In the long run, there is no substitute for well-selected index terms that represent the primary aspects of a collection. This is never easy, but the less effort put into representing materials in a database, the more difficult retrieval will be. Archivists must decide on which side of the retrieval equation they wish the effort and cost to fall.

37. Asheim, "Ortega Revisited," 225-26.