BIBLIOGRAPHIC RETRIEVAL FROM BIBLIOGRAPHIC INPUT; THE HYPOTHESIS AND CONSTRUCTION OF A TEST

Frederick H. RUECKING, Jr.: Head, Data Processing Division, The Fondren Library, Rice University, Houston, Texas

Journal of Library Automation Vol. 1/4 December, 1968

A study of problems associated with bibliographic retrieval using unverified input data supplied by requesters. A code derived from compression of title and author information to four, four-character abbreviations each was used for retrieval tests on an IBM 1401 computer. Retrieval accuracy was 98.67%.

Current acquisitions systems which utilize computer processing have been oriented toward handling the order request only after it has been manually verified. Systems such as that of Texas A & I University (1) have proven useful in reducing certain clerical routines and in handling fund accounting (2). Lack of a larger bibliographic data base and lack of adequate computer time have prevented many libraries from studying more sophisticated acquisitions systems.

At the time the MARC Pilot Project (3) was started, the Fondren Library at Rice University did not have operating computer applications in acquisitions, serials, or cataloging. The University administration and the Research Computation Center provided sufficient access to the IBM 7040 to permit the study of problems associated with bibliographic retrieval using input data of varying accuracy.

In 1966, Richmond expressed the concern of many librarians about the lack of specific statements describing the techniques by which on-line retrieval could be accomplished without complicating the problems presented by the current card catalog (4). She had previously described some of the problems created by the kind and quality of data being utilized as references by library users (5).

An examination of the pertinent literature indicates that most of the current work in retrieval, while related to problems of bibliographic retrieval, does not offer much assistance when the input data is suspect (6, 7, 8). Tainiter and Toyoda, for example, have described different techniques of addressing storage using known input data (9, 10).

One of the best-known retrieval systems is that of the Chemical Abstracts Service, which provides a fairly sophisticated title-scan of journal articles with a surprising degree of flexibility in the logic and term structure used as input. Comparable systems are used by the Defense Documentation Center, Medlars Centers, and NASA Technology Centers. These systems have one specific feature in common: a high level of accuracy in the input data.

USER-SUPPLIED BIBLIOGRAPHIC DATA

The reliability of bibliographic data supplied to university libraries by faculty and students has long been questioned (5). Any search system which accepts such data must be designed 1) to increase the level of confidence through machine-generated search structures and variable thresholds and 2) to reduce the dependence upon spelling accuracy, punctuation, spacing and word order.

The initial task in formulating an approach to this problem is to determine the type, quality, and quantity of data generally supplied by a user. To derive a controlled set of data for this purpose, the Acquisition Department of the Fondren Library provided Xerox copies of all English language requests dated 1965 or later, and a random sample of 295 requests was drawn from that file of 5000 items.

This random sample was compared to the manually verified, original order-requests to determine 1) the frequency with which data was supplied by the requester and 2) the accuracy of the provided information. Results of this study are given in Table 1.

Table 1. Level of Confidence in the Input Data

Data        Times    Times      Accuracy    Level of
Elements    Given    Correct    (%)         Confidence (%)
Edition     295      294        99.6        99.6
Title       295      292        99.0        99.0
Author      290      264        91.0        82.7
Publish.    268      218        81.3        73.9
Date        265      215        81.1        72.8

The results suggest that edition can have great significance when specified and should be used as strong supporting evidence for retrieval. It should not necessarily be a restrictive element, because it was actually specified only five times in the sample. (Unstated editions were considered as first editions, and correct.)

Title is the most significant and most reliable element. As Richmond indicates, use of the entire title for searching would present distinct problems for retrieval systems (4). Consequently, an abbreviated version of the title must be derived from the input data in a way that reduces the impact and significance of the problems described by Richmond (5).

THE HYPOTHESIS

It is hypothesized that retrieval of correct bibliographic entries can be obtained from unverified, user-supplied input data through the use of a code derived from the compression of author and title information supplied by the user. It is assumed that a similar code is provided for all entries of the data base, using the same compression rules for main and added entry, title and added title information.

It is further hypothesized that use of weighting factors for individual segments of the code will provide accurate retrieval in those cases when exact matching does not occur.

Before the retrieval methodology can be described, it is necessary to outline the compression technique to be used with author and title words.

TITLE COMPRESSION

To gain some understanding of the problems to be faced in compressing title information, a random sample of 500 titles was drawn from the first half of the initial MARC I reel (about 4800 titles).
Each of these titles was analyzed for significant words, and tabulations were made on word strings and word frequencies. The following words were considered as non-significant: a, an, and, by, if, in, of, on, the, to.

The tabulated data, shown in Table 2, contain some surprising attributes. Approximately 90% of the titles contain fewer than five significant words, which suggests that four significant words will be adequate to match on title.

Table 2. Significant Word Strings in Titles

Length of Word String    1      2       3       4       5+       Total
Number of titles         42     151     179     76      52       500
Percentage               8.4    30.2    35.8    15.2    10.4     100.0
Cumulative Percentage    8.4    38.6    74.4    89.6    100.0

Letting $n$ stand for the corpus of words available for title use, the random chance of duplicating any specific word in another title can be stated as $\frac{1}{n}$. When a string of words is considered, the chance of randomly selecting the same word string may be considered as $\frac{1}{n^a}$, where $a$ is the number of words in the string.

Certain words are used more frequently than others, and the occurrence of such words in a given string reduces the uniqueness of that string. The curve displayed in Figure 1 shows the frequency distribution of words in the sample. The mean frequency of words in the title-sample is 1.33.

[Fig. 1. Frequency Distribution of Words in Sample. Vertical axis: number of words; horizontal axis: frequency.]

Therefore, the chance of selecting an identical word string can be more accurately expressed as:

$$\frac{1.33^a}{n^a}$$

An examination of word lengths, as shown in Table 3, shows that 95% of the significant title words contain fewer than ten characters. An examination of the word list revealed that some 70% of the title words contain inflections and/or suffixes.
If these suffixes and inflections are removed, approximately 43% of the remaining word stems contain fewer than five characters and 59% contain fewer than six.

Table 3. Distribution of Character Length and Stem Length

Length in     Total    Different
Characters    Words    Words        Percent    Stems    Percent
 1               7        5          0.5         5       0.8
 2              25       14          1.3        14       2.3
 3              87       48          4.6        48       7.9
 4             172      117         11.1       196      32.3
 5             229      163         15.5        92      15.2
 6             198      153         14.5        94      15.5
 7             202      159         15.3        64      10.6
 8             158      122         11.6        45       7.4
 9             121      102          9.7        15       2.5
10              84       69          6.6         8       1.3
11              54       48          4.6         7       1.2
12              38       28          2.7         2       0.3
13              14       12          1.1         2       0.3
14               6        4          0.4         0       0.0
15               3        3          0.3         0       0.0
16               2        2          0.2         0       0.0
Summary       1400     1049                    592

The reduction of word length does affect the uniqueness of the individual word, merging distinct words into common word stems at a mean rate of 2.5 to 1.0. In Table 3 the difference between 1049 words and 592 stems reflects the reduction of similar words into a common stem; for example: America, American, Americans, Americanism, etc., into Amer. Thus, the uniqueness of a string of title words is reduced to the following chance of duplication:

$$\frac{(2.5 \times 1.33)^a}{n^a} \quad \text{or} \quad \frac{3.3^a}{n^a}$$

An analysis of consonant strings made by Dolby and Resnikoff provides frequencies of initial and terminal consonant strings occurring in 7000 common English words (with suffixes and inflections removed) (11, 12, 13). These frequency lists clearly show that the terminal string of consonants has considerable information-carrying potential in terms of word identification. The starting string also carries information potential, but significantly less than the terminal string. By combining the initial and terminal strings, it is possible to generate an abbreviation which has adequate uniqueness and reduces the influence of spelling.

The high percentage of four-character word stems and the fact that the maximum terminal string contains four consonants suggest the use of a four-character abbreviation.
To compress a title word into four characters, it is necessary to specify a set of rules. The first rule is to delete all suffixes and inflections which terminate a title word. The second rule is to delete vowels from the stem until a consonant is located or the four-character stem is produced. The suffixes and inflections deleted in this procedure are listed in Table 4. When the stem contains more than four characters, the third compression rule states that the four-character field is filled with the terminal-consonant string and the remaining positions are filled from the initial-character string.

Table 4. Deleted Suffixes and Inflections

-ic      -ive      -in       -et
-ed      -ative    -ain      -est
-aged    -ize      -on       -ant
-oid     -ing      -ion      -ent
-ance    -og       -ation    -ient
-ence    -log      -ship     -ment
-ide     -olog     -er       -ist
-age     -ish      -or       -y
-able    -al       -s        -ency
-ible    -ial      -es       -ogy
-ite     -ful      -ies      -ology
-ine     -ism      -ives     -ly
-ure     -um       -ess      -ry
-ise     -ium      -us       -ary
-ose     -an       -ous      -ory
-ate     -ian      -ious     -ity
-ite

The relative uniqueness of the generated abbreviation can be calculated using the data supplied by Dolby and Resnikoff. For example, Carter and Bonk's Building Library Collections would be abbreviated BULD, LIBR, COCT.
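
The three compression rules can be illustrated with a minimal Python sketch. It is an interpretation under stated assumptions, not the program used in the study: only a subset of the Table 4 suffixes is included, suffixes are stripped longest-first and repeatedly while at least four characters remain, rule two is read as removing trailing vowels, and Y is counted as a vowel. The function names are hypothetical.

```python
import re

VOWELS = set("AEIOUY")  # assumption: Y is counted as a vowel

# Non-significant words listed in the paper.
NON_SIGNIFICANT = {"A", "AN", "AND", "BY", "IF", "IN", "OF", "ON", "THE", "TO"}

# Illustrative subset of Table 4, tried longest first (assumption).
SUFFIXES = sorted(
    ["ATION", "MENT", "ARY", "ING", "ION", "URE",
     "AL", "AN", "ED", "ER", "ES", "IC", "S", "Y"],
    key=len, reverse=True,
)


def reduce_to_stem(word, min_len=4):
    """Rules one and two: strip suffixes, then trailing vowels, until stable."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            # Assumption: never strip a suffix if fewer than four letters remain.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[: -len(suffix)]
                changed = True
                break
        # Rule two, read here as: drop trailing vowels until a consonant
        # is reached or a four-character stem is produced.
        while len(word) > min_len and word[-1] in VOWELS:
            word = word[:-1]
            changed = True
    return word


def abbreviate(stem):
    """Rule three: terminal consonant string, padded from the initial characters."""
    if len(stem) <= 4:
        return stem
    tail = ""
    for ch in reversed(stem):
        if ch in VOWELS:
            break
        tail = ch + tail
    tail = tail[-4:]
    return stem[: 4 - len(tail)] + tail


def title_code(title, max_words=4):
    """Compress up to four significant title words into four-character codes."""
    words = re.findall(r"[A-Z]+", title.upper())
    significant = [w for w in words if w not in NON_SIGNIFICANT]
    return [abbreviate(reduce_to_stem(w)) for w in significant[:max_words]]


print(title_code("Building Library Collections"))  # ['BULD', 'LIBR', 'COCT']
```

Under these assumptions the sketch reproduces the abbreviations quoted in the text for Building Library Collections; other titles may compress differently from the original program wherever the paper leaves a rule open.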
The random chance of duplicating any abbreviation can be stated as the product of the random chance of duplicating the initial string and the random chance of duplicating the terminal string:

$$\frac{f_i}{n_i} \times \frac{f_t}{n_t} \times 3.3^2$$

in which $f$ denotes the frequency of a string and $n$ the size of the list, with subscripts $i$ and $t$ for the initial and terminal strings.

The frequencies listed by Dolby and Resnikoff may be substituted in the above equation, producing the following chances for duplication:

$$\frac{324}{6800} \times \frac{63}{6800} \times 10.89 = \frac{1}{208} \text{ for BULD}$$

$$\frac{288}{6800} \times \frac{1}{6800} \times 10.89 = \frac{1}{14745} \text{ for LIBR}$$

$$\frac{277}{6800} \times \frac{16}{6800} \times 10.89 = \frac{1}{1041} \text{ for COCT}$$

The random chance of duplicating this string of three abbreviations can be calculated by multiplying the individual calculations, which yields a random chance of 1 in $32 \times 10^8$. This high uniqueness declines rapidly when the title contains fewer than three significant words and contains high-frequency words, such as the title Collected Works, for which the same uniqueness calculation produces a random chance of 1 in $44 \times 10^4$.

To increase the level of uniqueness on short titles, like Collected Works, it becomes necessary to provide supporting data to the title information. It is clear that the supporting data must come from supplied author text.

AUTHOR COMPRESSION

The same compression algorithms can be used for both personal and corporate names with some modifications. The frequent substitution of "conference" for "congress" and "symposia" for "symposium" suggests that meeting names should be considered as a secondary sub-set of non-significant words. Names of organizational divisions, such as bureau, department, ministry, and office, can be considered as part of the same sub-set.

The rules which govern the deletion of inflections, suffixes and vowels can be used for corporate names, but personal author names must be carried into the compression routine without modification. Only the last name of an author would be compressed into a code.
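
The author rules can be sketched in Python as well. This is a hedged illustration, not the study's routine: the function names are hypothetical, the surname is taken to be the text before the first comma, Y is counted as a vowel, and a surname ending in a vowel simply keeps its first four letters because the paper does not say how such names were handled. The surnames are chosen for illustration only.

```python
import re

VOWELS = set("AEIOUY")  # assumption: Y is counted as a vowel

# Primary non-significant words plus the secondary sub-set of meeting
# and organizational-division words named in the text.
NON_SIGNIFICANT = {"A", "AN", "AND", "BY", "IF", "IN", "OF", "ON", "THE", "TO"}
SECONDARY = {"CONFERENCE", "CONGRESS", "SYMPOSIA", "SYMPOSIUM",
             "BUREAU", "DEPARTMENT", "MINISTRY", "OFFICE"}


def fill_four(word):
    """Terminal consonant string, padded from the initial characters."""
    if len(word) <= 4:
        return word
    tail = ""
    for ch in reversed(word):
        if ch in VOWELS:
            break
        tail = ch + tail
    tail = tail[-4:]
    return word[: 4 - len(tail)] + tail


def personal_author_code(name):
    """Code the surname only, with no suffix or vowel deletion beforehand."""
    surname = re.sub(r"[^A-Z]", "", name.split(",")[0].upper())
    return fill_four(surname)


def corporate_significant_words(name):
    """Drop both sets of non-significant words before any compression."""
    words = re.findall(r"[A-Z]+", name.upper())
    return [w for w in words if w not in NON_SIGNIFICANT | SECONDARY]


print(personal_author_code("Black, Hillel"))              # BLCK
print(corporate_significant_words("Department of State"))  # ['STATE']
```

The words returned for a corporate name would then pass through the same suffix, vowel and fill rules as title words, which this sketch leaves out to stay self-contained.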
CONSTRUCTING THE TEST

Four, four-character abbreviations are allowed for title compression and four for author. Rather than use a 32-character fixed field for these codes, the lengths of the input and main-base codes are variable, with leading control digits to specify the individual code sizes for the title and author segments.

Provision is made for the inclusion of date, publisher and/or edition in the search-code structure, although these were not implemented in the test performed.

At the time the input data is read, the existence of title, author, edition, publisher and date is indicated by the setting of indicators which control the matching mask and which, in part, control the specification of the retrieve threshold. The title indicator specifies the number of compressed words in the supplied title which must be matched by the base code.

A simple algorithm is used to calculate the threshold values given in columns two through four of Table 5. Columns five through seven are obtained by adding two to the calculated thresholds. Each agreement within the mask adds to a retrieve counter the values indicated in the last five columns of Table 5, the values of X and Y being the number of matching code words in the title and author segments respectively.

CONDUCTING THE TEST

As mentioned above, the initial tests of the retrieve were based upon title and author matching exclusively and required three runs on the Fondren Library's 1401 computer. The first run loaded 2874 original order-requests, generated a search code utilizing the rules specified in this paper and created an input tape. The second run extracted title and author data from the MARC I data base and created multiple search codes for title, main entry, added title and added entry. Both tapes were sorted into ascending search-code sequence. The final run was the search program, which attempted to match input codes with the MARC I base codes.
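
The retrieve counter and the full-code thresholds for requests that supply both title and author can be sketched in Python from the agreement values of Table 5. The sketch rests on several assumptions: code words are compared position by position, Y in the threshold is taken as the number of author codes supplied, a hit is declared when the counter reaches the threshold, and the function names are hypothetical. The author codes in the usage lines are invented for illustration.

```python
def count_matches(input_codes, base_codes):
    """Number of code words that agree, compared position by position (assumption)."""
    return sum(1 for a, b in zip(input_codes, base_codes) if a == b)


def retrieve_value(title_matches, author_matches,
                   edition=False, publisher=False, date=False):
    """Sum the agreement values of Table 5: 4X, 2Y, then 3, 2 and 1."""
    return (4 * title_matches + 2 * author_matches
            + 3 * int(edition) + 2 * int(publisher) + 1 * int(date))


def full_code_threshold(title_words, author_words):
    """Full-code thresholds for rows where title and author are both given."""
    if title_words >= 3:
        return 12
    if title_words == 2:
        return 8 + 2 * author_words
    if title_words == 1:
        return 4 + 2 * author_words
    raise ValueError("a request without title codes is not covered by this sketch")


# Three title codes agree; the single author code does not.
x = count_matches(["BULD", "LIBR", "COCT"], ["BULD", "LIBR", "COCT"])
y = count_matches(["CARR"], ["CATR"])   # hypothetical author codes
value = retrieve_value(x, y)
threshold = full_code_threshold(title_words=3, author_words=1)
print(value, threshold, value >= threshold)  # 12 12 True
```

With these weights a three-word title match alone reaches the threshold of 12, which is why a misspelled author need not block retrieval.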
When there was agreement based on the relationship of threshold and retrieve counter, the printer displayed threshold, short author and short title on one line, and retrieve value, input author and title on the next line, as illustrated in Figure 2.

The printed results were compared to validate the accuracy of the retrieval. This comparison was cross-checked against the results of the acquisition department's manual procedures.

The search program also provided for an attempt to match titles on the basis of a rearrangement of title words. In such attempts the retrieve threshold was raised.

ANALYSIS OF RESULTS

The raw data obtained from this experimental run are shown in Table 6. Of the 2874 items represented in the input file, 48.4%, or 1392, were actually found to exist in the data base. Of those actually present, 90.4%, or 1200, were extracted with an overall accuracy of 98.67%.

An examination of the sixteen false drops revealed several omissions in the compression routines for the input data and for the data base. One of the more significant omissions was failing to compensate for multi-character abbreviations, particularly 'ST.' and 'STE.' for 'Saint.' A subroutine for acceptance of such abbreviations, added to the search-code generating program, would increase the retrieve accuracy to 99%.

Table 5. Values for Variable Threshold

Data     Threshold Values                                  Agreement Values
Given    Full-Code Test          Individual Code Test
TAEPD    3 or 4  2      1        3 or 4  2      1          Title  Author  Edition  Publish.  Date
XY111    12      8+2Y   4+2Y     14      10+2Y  6+2Y       4X     2Y      3        2         1
XY110    12      8+2Y   4+2Y     14      10+2Y  6+2Y       4X     2Y      3        2         1
XY101    12      8+2Y   4+2Y     14      10+2Y  6+2Y       4X     2Y      3        2         1
XY100    12      8+2Y   4+2Y     14      10+2Y  6+2Y       4X     2Y      3        2         1
XY011    12      8+2Y   4+2Y     14      10+2Y  6+2Y       4X     2Y      3        2         1
XY010    12      8+2Y   4+2Y     14      10+2Y  6+2Y       4X     2Y      3        2         1
XY001    12      8+2Y   4+2Y     14      10+2Y  6+2Y       4X     2Y      3        2         1
XY000    12      8+2Y   4+2Y     14      10+2Y  6+2Y       4X     2Y      3        2         1
X0111    12      11     7        13      12     7          4X     2Y      3        2         1
X0110    12      11     7        13      12     7          4X     2Y      3        2         1
X0101    12      11     7        13      12     7          4X     2Y      3        2         1
X0100    12      11     7        13      11     7          4X     2Y      3        2         1
X0011    12      10     6        13      11     7          4X     2Y      3        2         1
X0010    12      10     6        13      Not permitted     4X     2Y      3        2         1
X0001    12      9      5        13      Not permitted     4X     2Y      3        2         1
X0000    12      Not permitted   Not permitted

[Fig. 2. Sample of Retrieved Citations. Each retrieved item is printed as a pair of lines: threshold, search code, author and title from the data base, followed by retrieve value, search code, author and title from the input request.]

Table 6. Table of Results

Retrieve    Total    Correct    False    Percentage
Values      Hits     Hits       Hits     Correct
 6            14       14         0      100
 8             0        0         0        0
10           311      311         0      100
12           264      248        16       93.3
14           232      232         0      100
16           118      118         0      100
18           260      260         0      100
20             1        1         0      100
Totals      1200     1184        16       98.7

Table 7. Distribution of Errors

            Title Errors          Author Errors
No. of      Title                 Author     Author
Codes       Error     Spelling    Lacking    Error     Spelling    Other    Total
1           2         3           10         12        27          4        58
2           2         6           17         26        60          23       134
3           0         0           0          0         0           0        0
4           0         0           0          0         0           0        0
Total       4         9           27         38        87          27       192

The occurrence of titles with the words "selected" or "collected," etc., produced additional false drops when the title word string exceeded two words.
A modification to the search program to raise the threshold when the input data contain codes such as 'SECT' or 'COCT' would increase the retrieve accuracy to 99.17%.

The presence of personal names in titles, such as 'Charles Evans Hughes' and 'Franklin Delano Roosevelt', caused seven additional false drops. At present it seems unlikely that a simple method to prevent them can be included.

CONCLUSION

The experimental results indicate that the hypothesis suggested is valid. Multiple codes for added entry and added title, in addition to the main entry and main title data, are clearly necessary. Approximately 10% of the correctly retrieved items were produced by the existence of an added entry code.

The influence of spelling accuracy was lessened by use of a compression technique. An inspection of extracted titles revealed the existence of 43 spelling errors which did not affect retrieval. Thus, the search code reduced the significance of spelling by some 30%.

Utilizing table search followed by table look-up and linking of random-access addresses should enable the search-code approach to bibliographic retrieval to provide rapid, direct access to the title sought.

ACKNOWLEDGMENT

This study was supported in part by National Science Foundation grants GN-758 and GU-1153 and by the Regional Information and Communication Exchange. The assistance of the Acquisitions Department staff, the Research Computation Center staff and the staff of the Fondren Library's Data Processing Division is gratefully acknowledged.

REFERENCES

1. Morris, Ned C.: "Computer Based Acquisitions System at Texas A & I University," Journal of Library Automation, 1 (March 1968), 1-12.
2. Wedgeworth, Robert: "Brown University Library Fund Accounting System," Journal of Library Automation, 1 (March 1968), 51-65.
3. U. S. Library of Congress: Project MARC, an Experiment in Automating Library of Congress Catalog Data (Washington: 1967).
4. Richmond, Phyllis A.: "Note on Updating and Searching Computerized Catalogs," Library Resources and Technical Services, 10 (Spring 1966), 155-160.
5. Richmond, Phyllis A.: "Source Retrieval," Physics Today, 18 (April 1965), 46-48.
6. Atherton, P.; Yorich, J. C.: Three Experiments with Citation Indexing and Bibliographic Coupling of Physics Literature (New York: American Institute of Physics, 1962).
7. International Business Machines Corporation: Reference Manual, Index Organization for Information Retrieval (IBM, 1961).
8. International Business Machines Corporation: A Unique Computable Name Code for Alphabetic Account Numbering (White Plains, N.Y.: IBM, 1960).
9. Tainiter, M.: "Addressing Random-Access Storage with Multiple Bucket Capacities," Association for Computing Machinery Journal, 10 (July 1963), 307-315.
10. Toyoda, Junichi; Tazuka, Yoshikazu; Kasahara, Yoshiro: "Analysis of the Address Assignment Problems for Clustered Keys," Association for Computing Machinery Journal, 13 (October 1966), 526-532.
11. Dolby, James L.; Resnikoff, Howard L.: "On the Structure of Written English Words," Language, 40 (Apr-June 1964), 167-196.
12. Resnikoff, Howard L.; Dolby, James L.: "The Nature of Affixing in Written English, Part I," Mechanical Translation, 8 (March 1965), 84-89.
13. Resnikoff, Howard L.; Dolby, James L.: "The Nature of Affixing in Written English, Part II," Mechanical Translation, 9 (June 1966), 23-33.