The Use of Automatic Indexing for Authority Control

Martin DILLON: University of North Carolina at Chapel Hill; Rebecca C. KNIGHT: Wichita State University, Wichita, Kansas; Margaret F. LOSPINUSO: University of North Carolina at Chapel Hill; and John ULMSCHNEIDER: National Library of Medicine.

Thesaurus-based automatic indexing and automatic authority control share common ground as word-matching processes. To demonstrate the resemblance, an experimental system utilizing automatic indexing as its core process was implemented to perform authority control on a collection of bibliographic records. Details of the system are given and results discussed. The benefits of exploiting the resemblance between the two systems are examined.

INTRODUCTION

It is not often realized how close the relationship is between automatic indexing using a thesaurus, on the one hand, and automatic authority control, on the other. Making the connection is worthwhile for many reasons.

The first has to do with terminology. Though one would be naive to hope for a reduction in specialized vocabulary, it is helpful to appreciate that what is called a thesaurus in one application is referred to as an authority file in the other; that the two have virtually the same structure, similar working parts, and play the same role in controlling the content of fields in a bibliographic file in their creation and, at least potentially, during retrievals by users.

A second reason emerges in system development. Below we discuss the various ways that a library can implement authority control. They range from a fully manual system, where the authority file exists only in card form, to online, automatic authority management. There are intermediate points as well. For each of the automated implementations, the system investment in software can be great.
Recognition of the close parallel in function of these two library needs allows for parallel development of software for any of these stages.

A third reason looks to the future. Successful system-patron interaction ought not to depend upon a patron's knowledge of the authorized entry forms currently in use for a library. First, the concept of a controlled vocabulary is far too narrow: authority control should encompass all fields available for searching. But the patron need not be aware of complicating details: substitutions of recognized variants for authorized forms ought to be carried out automatically during patron retrievals (with due regard, of course, for the intent of the patron).

Manuscript received April 1981; accepted September 1981.

This article describes a project in authority control in a specialized system environment, one that is increasingly typical in many of its features. The file of records is relatively small, currently below 10,000, and has a potential for growth not exceeding 100,000. The collection, derived from the Annabel Morris Buchanan Collection of American religious tune books at the University of North Carolina (Chapel Hill) Music Library, has many similarities with standard book collections, but its details vary greatly and cataloging conventions have been developed locally. Its use for scholarly research is similar to that for any standard collection of bibliographic records.

A great many such nonstandard collections exist: the morgue file in a newspaper, machine-readable data files, even properties marketed by cooperatives of real estate agencies. Developing an automated retrieval system for any such collection is a similar enterprise, with similar goals and problems. In particular, all require extensive authority control similar to that required by a tune-book collection.
The important feature of the method of authority control described here, one that makes it likely to be of interest to others, is its use of the same structures and software that are used for general vocabulary control. The three major software components we will refer to below are: thesaurus maintenance, automatic indexing, and automatic updating. These components antedated our effort to implement a similar system for authority control. When the problems of authority control per se were investigated, it was discovered that the system already available for subject control could be used exactly as it stood for authority control as well. Initial experiments confirmed this relationship. [1]

Journal of Library Automation Vol. 14/4 December 1981

Authority Control and Automatic Indexing

Automatic authority control has been approached largely as a unique problem requiring special software development for its implementation. But authority control shares common ground with automatic subject indexing. Both are term-matching activities based on a list of preferred terms plus a much larger list of match terms. Each preferred term is tied to a number of match terms, but each match term is tied to only one preferred term.

In the indexing environment, document text is examined for certain terms; these "free text" (uncontrolled vocabulary) terms are tied to equivalent (controlled vocabulary) terms in a thesaurus. When an uncontrolled vocabulary term is encountered in a document, its associated controlled vocabulary term is posted to the document as a descriptor. In authority control, document text is also examined for certain terms, e.g., author names. These "free-text" author names (i.e., names just as they appear on a title page) are tied to their authoritative name form (controlled vocabulary) in an authority file.
When a "free-text" author name is encountered, the authoritative name is posted to the document or book (i.e., assigned as a heading or entry point).

An automatic authority control system, then, is realizable by applying standard automatic subject-indexing software, which exploits the resemblance between the two processes. The input would consist of a thesaurus (in this case, an authority file) and bibliographic records; the indexing discovers matches between the list of possible terms in the thesaurus (variants of author names) and the "free-text" terms (title-page author names), and posts the appropriate controlled thesaurus terms (authoritative author name form) whenever a match occurs. (See figure 1.)

[Fig. 1. Authority Control by Indexing. The thesaurus (authority file) and the bibliographic records feed a matching and posting step, which produces updated records.]

THE TUNE-BOOK PROJECT

An experimental version of an authority control system using automatic indexing was implemented to test the feasibility of automatic indexing as the core process for authority control. The goal was automatic authority control for the Buchanan Collection index, the first step in work on a more comprehensive project, an index of American religious tune books, in particular, the shape-note tune books.

For the study of American cultural and musical history it is important to be able to trace the dissemination of these hymn tunes and texts, but the absence of a comprehensive index of American hymn tune books severely constrains such studies. Many factors have discouraged scholars from constructing an index, among them the magnitude of the repertory. Using computers to sort, file, and print reduces many of the problems associated with the size of the repertory, but does not address those created by the diverse forms of names and texts used by the tune-book compilers.
Correct hymn titles and especially accurate composer attributions were not important to the compilers of the tune books. Consequently, although many tune-book compilers did attempt to indicate who had composed the work, the names of the composers appeared in various forms. For example, the name "Israel Holdroyd" might appear as simply "Holdrad" or "Holdrayd" with no first name given, or a first initial might be added, or an abbreviated first name, such as "Is.," might be used with one of several forms of the family name. Automatic authority control over these names is necessary to the study of this collection, since only automatic means can address the problems of magnitude encountered in approaching the index as a whole.

The database now contains about 6,000 records for these tune books. They are stored in MARC format with variable-length fields giving a variety of information about each tune.

Creation of the Authority File

A thesaurus of authority records for the Buchanan Collection was manually created and placed in an online file. The initial authority file comprises a selection of composers whose names are present in conflicting forms in the present database. These were obtained by analyzing the file sorted by tune names, noting those tunes for which it appeared that the name of the same composer was given in more than one form. All forms of the name found were entered on cards along with the name of the tune (or tunes) through which the relationship was established. We used an explicit algorithm as a guide in determining which names were actually forms of the same name (see appendix for details).

This process resulted in a list of 266 distinct composers, each with one to four different name forms. All were compared with the list sorted by composers, noting additional forms. These names were then checked in several reference works, and authoritative forms (with dates) were established when possible.
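The pairing of variant forms with authoritative forms described above can be pictured as a many-to-one lookup table. The following is a minimal sketch in Python, not the BPS or indexing software used in the project. The names `AUTHORITY_PAIRS`, `normalize`, `build_authority_index`, and `post_authority_form` are hypothetical, the sample pairs are taken from name forms quoted in this article, and the normalization (lowercasing and dropping punctuation) is an assumption made for illustration.

```python
import re

# Hypothetical sample of (variant form, authority form) pairs,
# using name forms that appear in this article.
AUTHORITY_PAIRS = [
    ("Holdrad", "Holdroyd, Israel"),
    ("Holdrayd", "Holdroyd, Israel"),
    ("Walker, Wm.", "Walker, William 1809-1875"),
    ("White, B. F.", "White, Benjamin Franklin 1800-1879"),
]


def normalize(name):
    """Lowercase a name and reduce it to a tuple of alphanumeric words.

    This normalization is an assumption for the sketch.
    """
    return tuple(re.findall(r"[a-z0-9]+", name.lower()))


def build_authority_index(pairs):
    """Map each normalized variant to exactly one authority form.

    Many variants may point to one authority form, but a variant
    tied to two different authority forms is rejected.
    """
    index = {}
    for variant, authority in pairs:
        key = normalize(variant)
        if index.get(key, authority) != authority:
            raise ValueError(f"variant {variant!r} is tied to two authority forms")
        index[key] = authority
        # The authority form is also entered as a match term for itself.
        index.setdefault(normalize(authority), authority)
    return index


def post_authority_form(composer_field, index):
    """Return the authority form for a composer field, or None if unmatched."""
    return index.get(normalize(composer_field))
```

Calling `post_authority_form("Holdrayd", build_authority_index(AUTHORITY_PAIRS))` returns the authority form for Holdroyd, while a name absent from the table, such as "anon," returns `None` and would be left for manual review. The lookup covers only whole-field equality after normalization, which is the simplest possible form of matching.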
IMPLEMENTATION

Software Systems

File processing for the tune records and the authority thesaurus was accomplished using a local software product, Bibliographic/MARC Processing System (BPS). BPS is a general-purpose software package for the manipulation of MARC-format records. This experiment used BPS subsystems for creation of MARC-format records, sorting and formatting, and file updating (i.e., updating a master file with the contents of a transaction file).

The automatic indexing program used here was intended as part of a thesaurus-based document query system. [2] It is compatible with BPS, but utilizes generalized automatic indexing principles; its compatibility depends only on properly formatted thesaurus and bibliographic records. It includes file-processing programs for the thesaurus (authority file) and the bibliographic records (tune records) and a matching program that performs the indexing. Posting of the authoritative name forms to the proper MARC record is done with standard BPS updating procedures using output from the matching program.

Automatic Authority Control Process

As input the system uses a thesaurus and the text of fields selected from MARC-format document records. The thesaurus consists of pairs of terms: the first of each pair is the term searched for in a document; the second is the authority term assigned to the document whenever the first term is found. Figure 2 gives examples. The text may be abstracts, titles, or the contents of any field selected from the documents for authority control. In this case, the text is derived from the composer field; for authority work in general, any field requiring authority control would be input.

  Variant form       Authority form
  Cole, J.        /  Cole, John 1774-1855
  Clarke, Thos.   /  Clark, Thomas
  Cuzens, B.      /  Cuzens, Benjamin
  Holroyd         /  Holdroyd, Israel
  [remaining rows illegible in source]

Fig. 2. Thesaurus/Authority File Format.

The first step in authority control is as follows. The text sample and a stop-word list are input to the initial text-processing program. The incoming text (in this case, composer names) is separated into individual words. The stop-word list is used to remove designated words from the input, which in authority control might be titles of address and so on: terms such as "Miss," "Elder," or "Reverend." (Automatic indexing uses the stop-word list to eliminate similarly noncontributory terms, such as conjunctions and prepositions.)

The processing program can also convert plurals to singulars if desired. The purpose of this option in automatic indexing is to pare down variants in order to increase matches by standardizing term forms. However, plurals are not converted in authority control, since names are usually distinguished from one another by their full forms.

The processing produces a list of individual terms. Each term is given once along with the number of words in the term, then broken up with the document number attached to each piece.

The thesaurus authority records are edited by the thesaurus processing program into specially formatted matched pairs of variant and authoritative forms. Input is the match-term/variant-term file (figure 2) and the same stop-word list used for document processing. The stop-word list eliminates all unwanted words in the list of variant name forms. Output is a file containing all possible name forms (variants), the number of terms in each name and their positions in the name, and the authoritative name form, as in figure 3.

Next the two files are used as input to a matching program that creates an inverted file of the processed document text, then compares each match term from the prepared thesaurus with the inverted file. A match is discovered according to one of the following criteria:

1. Exact match: Match term and document term are the same words, in the same order, and adjacent.
2. Stop word exact match: Words are the same in match term and in document term, and in order, but deleted stop words may intervene between words in the document term.
3. Any order match: Term must be the same words and adjacent (i.e., without intervening words) and may be in any order.

[Fig. 3. Processed thesaurus file. Each row gives a variant form, its word count, the word positions, and the authority form; for example, "Hastings, Thos." (2 words, positions 1 and 2) is paired with "Hastings, Thomas 1784-1872". The remaining rows are illegible in the source.]

[The source text breaks off here; the passage between figure 3 and figure 4 is missing.]

[Fig. 4. Update File. The surviving fragment is a list of document IDs (AB-1054, AD-1248, and others that are partly illegible).]

Results

Table 1 gives some statistics on the experimental runs. In the 5,788 bibliographic records, 760 distinct composer names were present, the remainder (one composer per record) being duplicate forms; many of these are simply "anon," where the composer was not known. Earlier test runs on a subset of the file had fewer duplicates, and additions to the full database show few new composer name forms. Thus the database is nearing a stable state with an exhaustive list of composers; this stability contributes to decreasing errors and fewer unmatched composer names in the automated authority control process.

Table 1. Implementation Statistics

  File Statistics:
    Total number of bibliographic records                               5,788
    Number of composer names in biblio records                            760
    Average number of compositions per composer                          13.2
    Total number of authority name forms (in authority file)              266
    Total number of variant and authority names (in authority file)       599
  Run Statistics:
    Total number of variant thesaurus names matched                       372
    Total number of variant thesaurus names unmatched                     213
    Average number of documents per matched term                         5.87
    Average number of documents per term                                 3.61
    Total number of records updated by authority form                   2,110

  DOC ID:     AF-1147
  ANTHOLOGY:  The Union harmony
  IMPRINT:    selected by George Hendrickson
  TUNE NAME:  Jerusalem
  FIRST LINE: Jesus, my all to heav'n is gone,
  PCN:        Walker, William 1809-1875
  COMPOSER:   Walker, Wm.

  DOC ID:     AA-1353
  ANTHOLOGY:  The Sacred harp
  IMPRINT:    by B. F. White, E. J. King [and D. P. White] --- 4th ed. --- Atlanta : D. P. Byrd, 1870
  TUNE NAME:  the hill of zion
  FIRST LINE: The Hill of Zion yields,
  PCN:        White, Benjamin Franklin 1800-1879
  COMPOSER:   White, B. F.

  DOC ID:     A[illegible]-1100
  ANTHOLOGY:  The Dulcimer
  IMPRINT:    or, The New York collection of sacred music / by I. B. Woodbury. --- New York : F. J. Huntington
  TUNE NAME:  Carson
  FIRST LINE: Jesus and shall it ever be,
  PCN:        Bradbury, William Batchelder 1816-1868
  COMPOSER:   Br, W. B.

Fig. 5. Updated Records.

The total number of thesaurus records matched applies to variant forms, authoritative forms (matching occurs for these also), and those few forms that have no variants. The unmatched terms (213) are largely variants not in the database but gleaned from reference sources in anticipation of their occurrence, and authority forms, most of which do not occur in the database.

The 2,110 matched represent the total number of composer names matched out of the original 5,788 names. Most of the unmatched names are the "anon" entries (more than 2,000); the remainder are unanticipated forms not detected in the initial manual construction of the authority file. These unanticipated forms become new variants added to the authority file as described above.

CONCLUSIONS

Automated authority control as presented here has a number of advantages, either for libraries with their own processing facilities or for the management of information collections outside the standard library environment.
Unifying the processes of subject control and authority control by using the same procedures and software for both simplifies the tasks of systems personnel and information managers. Where catalog access is online, the patron benefits by applying subject access facilities to other searches. Ideally, substitutions for all variants would occur automatically, accompanied by an alert to the patron where it was felt necessary. At a minimum, the same command structure would be available for referencing names as would be normally available for consulting an online thesaurus. In either case, the difficulties of the patron are reduced, both in comprehending how the system works and in acquiring a facility for using system commands.

REFERENCES

1. Gordon Ellyson Jessee, "Authority Control: A Study of the Concept and Its Implementation Using an Automated Indexing System" (Master's paper, School of Library Science, University of North Carolina at Chapel Hill, 1980).
2. Margaret S. Strode, "Automatic Indexing Using a Thesaurus" (Master's thesis, Department of Computer Science, University of North Carolina at Chapel Hill, 1977).

APPENDIX

Rules for Decisions on Similar Names

The following conditions may exist:

A = identical tune name
B = identical surname
C = identical first initial
D = same first letter of surname and close match of the rest of the surname (55 percent match of letters in content, not in order; such a similarity is presumed to represent a similarity in sound)
E = similar tune name (same criteria as in D for percentage of match). EXCEPTION: the words "new" and "old" cancel any presumed relation between similar tune names.
F = information in CMP subfield x field is identical in content

The following combinations of conditions indicate the same person, expressed in decreasing order of reliability:

1. A & B
2. B & C
3. A & D
4. C & D
5. B & E
6. C & D & E
7. D & E
8. F & (B or D)

Note: points seven and eight are regarded as tentative, and matches using these combinations are flagged for later checking.

Martin Dillon is associate professor of library science at the University of North Carolina at Chapel Hill. Rebecca C. Knight is administrative services librarian at Wichita State University, Wichita, Kansas. Margaret F. Lospinuso is music librarian at the University of North Carolina at Chapel Hill. John Ulmschneider is library associate at the National Library of Medicine.
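Condition D and the rule combinations in the appendix lend themselves to a short illustration. The sketch below is in Python and rests on stated assumptions: the appendix does not say how the 55 percent figure is computed, so `letter_overlap` counts the letters two strings share regardless of order and divides by the length of the longer string, which is one possible reading. The function names `letter_overlap`, `surnames_similar`, and `same_person` are hypothetical; the rules were applied by hand as a guide in the project, not by this code.

```python
from collections import Counter


def letter_overlap(a, b):
    """Share of letters common to a and b, ignoring order.

    Assumption: overlap is measured against the longer string.
    """
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 1.0 if a == b else 0.0
    common = sum((Counter(a) & Counter(b)).values())
    return common / max(len(a), len(b))


def surnames_similar(a, b, threshold=0.55):
    """Condition D: same first letter, rest of the surname matches closely."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b or a[0] != b[0]:
        return False
    return letter_overlap(a[1:], b[1:]) >= threshold


# Rules 1 to 7 in decreasing order of reliability; each entry lists the
# conditions that must all hold. Rule 6 is kept for fidelity to the
# appendix, although rule 4 already fires whenever rule 6 would.
RULES = [(1, "AB"), (2, "BC"), (3, "AD"), (4, "CD"), (5, "BE"), (6, "CDE"), (7, "DE")]
TENTATIVE = {7, 8}


def same_person(conditions):
    """Return (rule number, tentative flag) for the first rule satisfied.

    `conditions` is the set of condition letters (A to F) that hold for a
    pair of names. Returns None when no rule applies.
    """
    for number, required in RULES:
        if set(required) <= conditions:
            return number, number in TENTATIVE
    # Rule 8: F together with either B or D.
    if "F" in conditions and conditions & {"B", "D"}:
        return 8, True
    return None
```

Under these assumptions, `surnames_similar("Holdroyd", "Holdrayd")` holds, since six of the seven letters after the initial are shared. A pair satisfying only D and E is reported under rule 7 with the tentative flag set, matching the appendix note that such matches are flagged for later checking.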