Corporate Author Entry Records Retrieved by Use of Derived Truncated Search Keys

Alan L. LANDGRAF, Kunj B. RASTOGI, and Philip L. LONG, The Ohio College Library Center.

An experiment was conducted to design a corporate author index to a large bibliographic file. The nature of corporate entries necessitates a different search key construction from that of personal names or titles. Derivation of a search key to select distinct corporate entry records is discussed.

INTRODUCTION

This paper describes the findings of an experiment conducted to design a corporate author index to entries in a large file of catalog records at the Ohio College Library Center; a companion paper describes findings of a similar investigation into retrieval employing a personal author index.1 The center has operated an on-line, shared cataloging system since August 1971. In addition to a Library of Congress card number index, the system maintains truncated name-title and title index files. The user is thus able to retrieve entries employing truncated search keys.

Three previous papers report results of experiments which led to the design of the name-title and title indexes.2-4 For monographs having personal names as main entries, a truncated 3,3 search key consisting of the first three letters of the author's name plus the first three letters of the first word of the title that is not an English article was judged to be satisfactory, in that this key yielded five or fewer entries per query in more than 99 percent of the cases when keys were selected at random.5 However, a recent study by Guthrie and Slifko reveals that a model which employs random selection of entries yields results closer to actual experience, and with a higher average number of entries per reply.6

For personal authors, a search key composed of the first five or four characters of the surname and the first or first and second initials makes efficient retrieval possible.7 However, the situation is different in the case of corporate entries because many corporate names begin with the same or similar words. For example, in the records examined, the initial words of more than 1,300 publications are "U.S. Congress, House Committee On ...." Obviously a type of search key different from that which proved efficient for retrieving personal authors is required for retrieval of corporate entries.

MATERIAL AND METHODS

The experiment used a file of approximately 200,000 MARC II records having a total of 68,169 corporate name entries. Corporate entries were extracted from the 110, 111, 410, 411, 710, 711, 810, and 811 fields in the records. A program edited the file to extract keys; initial English language articles were removed from each entry, the words "United States," "U.S.," and "U. S." appearing anywhere in the entry were replaced with "US," and the words "Great Brit." and "Great Britain" were replaced with "Gt Brit."
A blank was substituted for each subfield delimiter and associated code, and unwanted characters such as punctuation, diacritics, and special symbols were removed; the program also closed up the space that the unwanted character had occupied. One blank replaced multiple blanks. The elements extracted consisted of five segments of eight characters each, representing the initial eight characters of the first five words of the corporate entry. Segments containing fewer than eight characters were padded out with blanks. If a corporate name had fewer than five words, the remaining segments were blank. To study a given type of key, the file was sorted on a specified number of initial characters of each segment; these initial characters were then employed as search keys by a program which sequentially compared the characters in the key, counting distinct and identical keys.

RESULTS AND DISCUSSION

Table 1 presents the number of distinct keys and the maximum number of occurrences of identical keys for the structures studied in the experiment. The larger the number of distinct keys for a fixed number of entries in the file, the better the key will be for retrieval purposes. Given two search keys which are more or less equally specific, the one which is simpler to use is preferable.

The peculiarity of corporate-entry keys can be observed from Table 1. Even for the 8,8,8,8,8 key structure the percentage of distinct keys (33.7 percent) is low, and the maximum number of occurrences of an identical key (1304) is high. Another observation revealed by Table 1 is that as the key structure goes from five to three segments, there is a steady decrease in the percentage of distinct keys and consequently an increase in the maximum number of entries per key. However, a reduction in the number of characters in a segment does not cause a great deal of deterioration.
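
The editing, key derivation, and counting steps described under MATERIAL AND METHODS can be illustrated with a short Python sketch. It is not the program used in the experiment. The function names, the "$b" representation of a subfield delimiter and its code, the use of Unicode decomposition to drop diacritics, and the folding of keys to upper case are all assumptions made for illustration.

```python
import re
import unicodedata
from collections import Counter


def clean_words(entry):
    """Return the words of a corporate entry after the editing steps in the text."""
    # Remove an initial English-language article.
    text = re.sub(r"^\s*(?:the|an|a)\s+", "", entry, flags=re.IGNORECASE)
    # Standardize the two country names wherever they appear in the entry.
    text = re.sub(r"United States|U\.\s?S\.", "US", text)
    text = re.sub(r"Great Britain|Great Brit\.", "Gt Brit", text)
    # ASSUMPTION: a subfield delimiter and its code look like "$b"; substitute a blank.
    text = re.sub(r"\$\w", " ", text)
    # ASSUMPTION: diacritics are dropped by decomposing and removing combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation and special symbols, closing up the space they occupied.
    text = re.sub(r"[^A-Za-z0-9 ]", "", text)
    # split() also reduces multiple blanks to one.
    return text.split()


def derive_key(entry, structure=(2, 2, 2, 2, 2)):
    """Truncate the first words of the entry to the lengths given by structure."""
    words = clean_words(entry)
    segments = []
    for position, length in enumerate(structure):
        word = words[position] if position < len(words) else ""
        # ASSUMPTION: keys are folded to upper case; short segments are blank-padded.
        segments.append(word[:length].upper().ljust(length))
    return tuple(segments)


def key_statistics(entries, structure):
    """Return (number of distinct keys, maximum number of entries per key)."""
    counts = Counter(derive_key(entry, structure) for entry in entries)
    return len(counts), max(counts.values(), default=0)


# Hypothetical entries, not taken from the file studied.
sample = [
    "U.S. Congress. House. Committee on Agriculture",
    "United States.$bCongress.$bHouse.$bCommittee on Banking",
    "The Ohio College Library Center",
]
print(derive_key(sample[0]))
print(key_statistics(sample, (2, 2, 2, 2, 2)))
```

In this sketch the two congressional entries collapse to the same 2,2,2,2,2 key, which is the kind of collision the experiment measures; a segment length of 0 in the structure, as in 2,2,2,0,0, simply contributes an empty segment.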

The effect of shortening the segments is small: for 8,8,8,8,8 keys, the percentage of distinct keys and the maximum number of entries per key are 33.7 percent and 1304, respectively, while for 2,2,2,2,2 keys the corresponding figures are 32.3 percent and 1307. Thus, the 2,2,2,2,2 key structure seemed a good candidate for a corporate entries index, and the number of entries per reply for this key structure was therefore studied more intensively.

Journal of Library Automation Vol. 6/3 September 1973

Table 1. Number of Distinct Keys and Maximum Number of Identical Entries Per Key for Different Key Structures in 68,169 MARC II Records.

Key          Number of       Number of Distinct Keys    Maximum Number
Structure    Distinct Keys   as a Percent of Total      of Entries
                             Number of Records          Per Key
8,8,8,8,8    22982           33.7                       1304
8,8,8,8,0    20476           30.0                       1305
8,8,8,0,0    16283           23.9                       1802
4,2,2,2,2    22411           32.9                       1307
4,2,2,2,1    22120           32.4                       1308
4,2,2,2,0    19513           28.6                       1311
4,2,2,1,0    18589           27.3                       1311
4,2,2,0,0    14801           21.7                       1807
3,3,2,2,2    22417           32.9                       1307
3,3,2,2,1    22132           32.5                       1308
3,3,2,2,0    19560           28.7                       1311
3,3,2,1,0    18654           27.4                       1311
3,3,2,0,0    14922           21.9                       1806
2,2,2,2,2    22053           32.3                       1307
2,2,2,2,1    21743           31.9                       1308
2,2,2,2,0    19034           27.9                       1311
2,2,2,1,0    18036           26.5                       1311
2,2,2,0,0    13842           20.3                       1807
1,1,1,1,1    19028           27.9                       1308

On the average it is desirable that the number of replies per query be such that the information by which the user chooses among the possible replies can be displayed on a single CRT screen. This maximizes the utility of a computer system, since it minimizes the amount of system activity needed to satisfy a user's request promptly. Since some query keys produce but one reply while others produce hundreds of candidate records, it is necessary to use the mathematics of probability to determine the likely long-term effect of a given choice of system parameters. Using the approach indicated as useful by Guthrie and Slifko, the analysis of the effect of various choices of search key becomes the following.

Assume that every entry has an equal probability of being accessed. Then, in attempting to retrieve each entry once, each key having $i$ entries will cause a total of $i^2$ entries to be accessed. If $f_i$ denotes the frequency of keys having $i$ entries and $M$ denotes the maximum allowable occurrences of any key in the file, the average number of entries per reply, $y$, is given by:

$$ y = \frac{\sum_{i=1}^{M} i^2 f_i}{\sum_{i=1}^{M} i f_i} $$

where $\sum_{i=1}^{M} i f_i$ is the number of entries in the file whose derived keys have a frequency of $M$ or less.

For the 2,2,2,2,2 key, the above formula yields an average number of entries per reply much larger than 20 for $M > 100$, and some 2,2,2,2,2 keys corresponded to more than 500 file entries. A typical CRT display terminal can accommodate only ten or fewer entries per screen. Therefore, if the average number of entries per reply is desired to be ten or fewer, it is necessary either to ignore entries with high multiplicity or to adopt a different scheme of storing and retrieving such items, in which case the mathematical result would be the same as ignoring high-frequency items.

Table 2. Average Number of Entries Per Reply for Key Structure 2,2,2,2,2 for Various Multiplicity of Entries.

Maximum Frequency    Total Records   Percent of       Number of Distinct   Average Number
of Any Key           in File         Total Records    Keys Eliminated      of Entries
in File (M)                                                                Per Reply
19                   44174           64.8             389                  5.0
29                   48127           70.6             223                  6.6
39                   50854           74.6             142                  8.1
49                   52422           76.9             107                  9.1
59                   53513           78.5             87                   10.1

The average number of entries per reply was computed for five different values of $M$ (19, 29, 39, 49, and 59); the results of these computations are in Table 2, which reveals that if keys in the file are allowed a maximum recurrence of 39 entries per key, it would be possible to have keys in the main index for about 75 percent of total records, while entries for only 142 high-frequency keys would have to be shunted to a secondary index.
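
The calculation of the average number of entries per reply can be sketched in Python as follows. Under the equal-access assumption stated above, the average is the sum of i squared times f(i) divided by the sum of i times f(i), taken over keys with no more than M entries. The function name, the dictionary representation of the key frequencies, and the small frequency table are hypothetical and serve only to show the arithmetic.

```python
def average_entries_per_reply(key_frequencies, max_allowed):
    """Average reply size when every entry is equally likely to be sought.

    key_frequencies maps i (number of entries sharing one key) to f_i
    (number of keys with exactly i entries). Keys with more than
    max_allowed entries are left out, as if moved to a secondary index.
    """
    kept = {i: f for i, f in key_frequencies.items() if i <= max_allowed}
    # Entries remaining in the main index: sum of i * f_i.
    entries = sum(i * f for i, f in kept.items())
    # Entries returned when each remaining entry is retrieved once: sum of i^2 * f_i.
    accessed = sum(i * i * f for i, f in kept.items())
    if entries == 0:
        raise ValueError("no keys at or below the allowed maximum")
    return accessed / entries


# Hypothetical frequencies: three keys with one entry, one with two, one with five.
toy = {1: 3, 2: 1, 5: 1}
print(average_entries_per_reply(toy, max_allowed=5))  # 32 / 10 = 3.2
print(average_entries_per_reply(toy, max_allowed=2))  # 7 / 5 = 1.4
```

Lowering the allowed maximum from 5 to 2 drops the single five-entry key and cuts the average reply from 3.2 to 1.4 entries, the same trade-off between index coverage and reply size that the different values of M explore.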
With a maximum of 39 entries per key, the average number of entries per reply would be about eight.

Table 3 gives the probability of each number of entries per reply for the index file consisting of 50,854 (out of a total of 68,169) records, with the maximum frequency of any key in the file being 39. For preparing this table the assumption is made that each entry in the file has an equal probability of being accessed. Thus the probability of obtaining $i$ entries per reply is given by:

$$ P(i) = \frac{i f_i}{\sum_{j=1}^{M} j f_j} $$

where $f_i$ is the frequency of keys occurring exactly $i$ times in the index file. An inspection of this table shows that 87.7 percent of the time there would be 20 or fewer replies. This represents two screensful of information on a typical CRT display.

Table 3. Probability of Number of Entries Per Reply for an Index File Using 2,2,2,2,2 Key.

Number of    Frequency    Probability    Cumulative Probability
Entries                   Percentage     Percentage
1            14820        29.1           29.1
2            2893         11.4           40.5
3            1276         7.5            48.0
4            726          5.7            53.7
5            427          4.2            57.9
6            312          3.7            61.6
7            248          3.4            65.0
8            195          3.1            68.1
9            150          2.6            70.7
10           120          2.4            73.1
11           78           1.7            74.8
12           88           2.1            76.9
13           56           1.4            78.3
14           71           1.9            80.2
15           62           1.9            82.1
16           48           1.5            83.6
17           41           1.3            84.9
18           28           1.0            85.9
19           24           0.9            86.8
20           22           0.9            87.7
21           18           0.7            88.4
22           16           0.7            89.1
23           23           1.1            90.2
24           25           1.1            91.3
25           13           0.7            92.0
26           9            0.4            92.4
27           12           0.7            93.1
28           18           1.0            94.1
29           10           0.5            94.6
30           11           0.7            95.3
31           11           0.7            96.0
32           13           0.8            96.8
33           6            0.4            97.2
34           9            0.6            97.8
35           7            0.4            98.2
36           6            0.5            98.7
37           11           0.8            99.5
38           5            0.3            99.8
39           2            0.2            100.0

CONCLUSION

A file containing only those entries for which the frequency of the 2,2,2,2,2 search key is 39 or fewer would produce 20 or fewer entries per reply approximately 88 percent of the time, but such a file excludes 142 high-frequency keys, which account for 17,315 of the total of 68,169 entries.
Therefore, a special technique for handling corporate-entry derived keys of high multiplicity is desirable.

REFERENCES

1. A. L. Landgraf and F. G. Kilgour, "Catalog Records Retrieved by Personal Author Using Derived Search Keys," Journal of Library Automation 6:103-8 (June 1973).
2. F. G. Kilgour, P. L. Long, and E. B. Leiderman, "Retrieval of Bibliographic Entries from a Name-Title Catalog by Use of Truncated Search Keys," Proceedings of the American Society for Information Science 7:79-82 (1970).
3. F. G. Kilgour, P. L. Long, E. B. Leiderman, and A. L. Landgraf, "Title-Only Entries Retrieved by the Use of Truncated Search Keys," Journal of Library Automation 4:207-10 (Dec. 1971).
4. P. L. Long and F. G. Kilgour, "A Truncated Search Key Title Index," Journal of Library Automation 5:17-20 (March 1972).
5. Kilgour, Long, Leiderman, "Retrieval of Bibliographic Entries."
6. G. D. Guthrie and S. D. Slifko, "Analysis of Search Key Retrieval on a Large Bibliographic File," Journal of Library Automation 5:96-100 (June 1972).
7. Landgraf and Kilgour, "Catalog Records Retrieved."