105 Application of the Variety-Generator Approach to Searches of Personal Names in Bibliographic Data Bases-Part 1. Microstructure of Personal Authors' Names Dirk W. FOKKER and Michael F. LYNCH: Postgraduate School of Librarianship and Information Science, University of Sheffield, England. Conventional approaches to processing records of linguistic origin for storage and retrieval tend to regard the data as immutable. The data gen- erally exhibit great variety and disparate frequency distributions, which are largely ignored and which entail either the storage of extensive lists of items or the use of complex numerical algorithms such as hash coding. The results in each case are far fmm ideal. The variety-generator approach seeks to reflect the microstructure of data elements in their description for storage and search, and takes advan- tage of the consistency of statistical characteristics of data elements in homogeneous data bases. In this paper, the application of the variety-generator approach to the description of personal author names from the INSPEC data base by means of small sets of keys is detailed. It is shown that high degrees of partitioning of names can be obtained by key-sets generated from the ini- tial characters of surnames, fmm the terminal characters of surnames, and from the initials. The implications of the findings for computer-based bibliographical in- formation systems are discussed. INTRODUCTION The application of computer technology to the storage of bibliographic data bases and to the selection of items from them on the basis of the con- tent of specified data elements poses considerable problems. Among the most important of these, from the viewpoint of the efficiency of computer use, is the fact that many of the individual data elements exhibit great variety (i.e., lists of their contents are extensive), and show relatively dis- parate distributions. This behavior is encountered in different degrees in regard to items such as words in the titles of monograph or periodical ar- 106 ]oumal of Library Automation Vol. 7/2 June 1974 ticles, assigned subject headings, authors' names, and citations.1- 4 Such dis- tributions have been extensively studied in various contexts by Bradford, Zip£, and Mandelbrot.4-6 In general, the distributions are approximately hyperbolic, so that a small proportion of items may account for a substan- tial proportion of occurrences, while the majority of items occur only in- frequently. The studies have been well reviewed by Fairthorne.7 Of all the data elements, personal author names exhibit a distribution which is at its most exh·eme in one direction. As is shown later in this pa- per, the most frequent author name in a file of 50,000 names occurred only sixteen times, while over 35,000 of the names, or over 70 percent of the file, occurred once only. A simple and general strategy for dealing with searches of data ele- ments, the contents of which show large variety and disparate distribu- tions, is under development by the Research Unit at the Sheffield School, and has thus far been elaborated in regard to searches of chemical struc- tures and of natural-language data bases. 8• 9 Based on information-theoret- ic principles, it involves a two-stage search procedure in which in the first and rapid stage the majority of items which cannot possibly fulfill the search criteria are eliminated, while those which meet the criteria are ex- amined for an exact match at the second stage. The criteria (or attributes) are selected on the basis of an examination of the microstructure of the items in the data base, and are chosen so that their frequencies are ap- proximately equal. The number of criteria or attributes chosen for de- scription of the items is variable within a wide range; with their aid, the variety of items can be described so as to facilitate discrimination among them. In the context of substructure searching, the attributes are representa- tions of fragments of chemical structures,10 while in the case of text, they are strings of characters which are variable in length. These strings are long when the characters comprising them represent frequent combina- tions, and short when the characters are infrequent.11 Since the sets of at- tributes can generate, in an approximate manner, the variety of items en- countered in the data base, they are termed variety generato1·s. They are in- termediate in number between the primitive set of symbols ( alphanumer- ic characters in the case of text, atoms and bonds in that of chemical struc- tures) and the actual variety of items in the collection (words or word fragments in text in the first instance, and molecules in the second). The variety-generator approach involves recognition of the fact that the statistical properties of specific data elements within homogeneous data bases are relatively constant, and that the primitive symbols of the data elements themselves usually show hyperbolic distributions. New symbol sets can therefore be defined, consisting of sequences of primitive symbols such that their frequencies of occurrence become comparable. The new symbol sets then constitute the attributes which are employed, singly or in combination, to represent the items within a search file. These symbol sets Variety-Generator ApproachjFOKKER and LYNCH 107 approximate to the ideal of equifrequency postulated by Shannon for op- timal efficiency in communication. 12 Only an approximation can be ob- tained, however, since the distributions of the newly defined symbols still cover a relatively wide range, and since they are seldom entirely indepen- dent of one another in statistical terms, and may often be strongly asso- ciated. The variety-generator concept is not entirely novel. Indeed, it was antici- pated most closely in precisely the present context by Merrill and by Cutter with a view to subdividing a library's holdings into equal groups of items.13 • 14 However, the greater flexibility of computer techniques would appear to make its use today even more attractive. This paper thus describes a study of a large file of authors' names with a view to identifying attributes of the names which can be used for effi- cient reh·ieval purposes. Assessment of the effectiveness of the attributes in retrieval is described in Part 2 of this series. (t The main terms used here are n-gram, key, and key-set, where an n-gram is a string of n adjacent char- acters. A key consists of an n-gram, and keys are chosen so that the fre- quencies of a set of keys (or key-set) are approximately equivalent in a given file. The measures used in assessing frequency distributions are Shannon's ex- pressions for the entropy of a sequence of symbols: and relative entropy: i H = - I p1log2pi i= 1 H _ Hactual r- Hmaximum Hmaxlmum is reached when the probabilities of occurrence of the symbols of the sequence are equal; its value is the binary logarithm of the variety of symbols, since 1 1 H =- n(-log2-) =log2n n n The value of the relative entropy is thus a measure of the degree of equi- frequency of a set of symbols, and is independent of their variety. CHARACTERISTICS OF NAME FILE The file studied was a collection of 100,000 personal names taken from ten issues of the INSPEC data base dating from the period 1969 to 1972. The names are represented in variable-length format, surname followed by a comma, space and initials each followed by a period. For the present purpose, case and diacritic shift symbols were ignored. <~>To appear in the September 1974 issue of the Journal of Library Automation. 108 Journal of Library Automation Vol. 7/2 June 1974 Subsets of the file were first sorted into sequence on the basis of the full names, and distributions determined both for surnames and initials, and for surnames alone, as shown in Table 1 for the subset of 50,000 names. Since the great majority of full names occur once only, the relative en- tropy of this distribution, at 0.975 (computed with respect to the 50,000 names, i.e., Hmax= log250,000), is high, while that for surnames alone is lower, at 0.904. An analysis of the ratio of unique surnames to the total number of entries in files of 25,000, 50,000, 75,000 and 100,000 names showed that the proportion of different surnames added to the file as it in- creases in size is predictable. The relationship between the number of dif- ferent surnames (D) and the total number of entries ( N) conforms to the expression: D=aNtl where a = 5.89 and {3 = 0.78. Next, the frequencies of characters at different positions in the sur- names and of the initials were determined. The most important positions in the surname are the first and last characters, as will be seen shortly. The distributions of these characters and of the first and second initials are shown in Table 2. The relative entropy of the first initial is, interestingly, Table 1. Distribution of full names and surnames alone in a file of 50,000 INSPEC names. Frequency f 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 > 20 Full Names No. of Names with %of Names with Frequency f Frequency f 35,187 70.37 4,768 19.07 1,060 6.36 302 2.42 88 0.88 34 0.41 16 0.22 7 0.11 3 0.05 1 0.03 2 0.05 1 0.03 Total number of different full names = 41,469 H = 15.22 Hmax = 15.61 ( log250,000) Hr = 0.9753 SU1·names No. of Surnames with Frequencyf % of Surnames with Frequencyf 19,894 39.79 4,258 17.03 1,597 706 395 235 134 104 68 54 36 39 36 28 24 24 15 19 16 9 112 9.58 5.65 3.75 2.82 1.88 1.66 1.22 1.08 0.79 0.94 0.94 0.78 0.72 0.77 0.51 0.68 0.61 0.36 8.44 Total number of different surnames = 27,803 H = 14.11 Hmax = 15.61 ( log250,000) Hr = 0.9042 Variety-Generator ApproachjFOKKER and LYNCH 109 Table 2. Distributions of first and last characters of surname and of initials in 50,000 INSPEC name me. First Character Last Character First Second of Surname of Surname Initial Initial s 0.113 N 0.164 J 0.100 Space 0.371 B 0.083 R 0.102 A 0.083 A 0.066 M 0.080 A 0.084 R 0.081 M 0.045 K 0.076 s 0.082 M 0.064 J 0.043 H 0.056 I 0.074 G 0.058 s 0.035 G 0.055 E 0.068 v 0.051 L 0.033 p 0.053 v 0.067 D 0.050 E 0.033 c 0.052 y 0.043 H 0.050 R 0.031 R 0.047 T 0.042 s 0.047 p 0.031 L 0.047 0 0.041 E 0.043 G . 0.030 D 0.044 L 0.040 p 0.042 c 0.030 T 0.040 H 0.037 w 0.038 w 0.028 w 0.040 K 0.033 K 0.036 v 0.028 A 0.036 D 0.030 L 0.036 H 0.027 F 0.034 G 0.026 c 0.035 D 0.026 N 0.025 z 0.013 T 0.033 I 0.026 v 0.025 M 0.013 B 0.032 F 0.024 E 0,018 u 0.013 N 0.026 N 0.024 J 0.017 F 0.006 F 0.026 K 0.022 0 0.016 c 0.005 I 0.023 B 0.020 z 0.013 w 0.005 y 0.023 T 0.013 I 0.013 p 0.004 0 0.010 y 0.007 y 0.011 X 0.004 Space 0.005 0 0.005 u 0.005 B 0.003 z 0.005 z 0.002 Q 0.001 J 0.001 u 0.004 u 0.001 X Q 0.0002 Q 0.0002 Q 0.0002 X 0.0001 X 0.0001 H =4.309 H =4.039 H =4.374 H =3.688 Hmax = 4. 700 (log,26) Hmax = 4. 700 (log,26) Hmax = 4.755 (Iog.27) Hmax = 4, 755 (log,27) Hr = 0.917 Hr = 0.859 H. =0.920 H. =0.776 the highest of the four; the highest ranking initial is J, which is one of the least frequent characters in English text. Thereafter follow the first and last letters of the surname, and the second initial. The low relative entropy of the last is partly accounted for by the fact that a single initial occurred in 37 percent of the entries. Distributions were also obtained for the second and subsequent char- acters of the surname. These, and also the distributions of the first char- acter, are in general agreement with the results of earlier studies by Bourne and Ford, and by Ohlman, and indicate that consonants predom- inate in the first position, vowels in the second position, while thereafter the distributions become less disparate. 15• 16 However, due to the variable lengths of names, the dominant character at the sixth and subsequent po- sitions of the surname is the space character. KEY-SET GENERATION TECHNIQUE The basic key-set generation technique involves creating fixed-length 110 Journal of Library Automation Vol. 7/2 JuBe 1974 n-grams from some point or points of reference within each record, the strings generated being initially of length greater than those anticipated within the key-set. These strings are sorted into lexicographic order and counted. (The resultant distribution of the fixed-length strings is again hy- perbolic.) The frequencies are compared with a predetermined threshold frequency-at the first stage none of the string frequencies should exceed this value. The strings are then shortened by truncation of the right-hand character, and the frequencies of the strings which have become identical through truncation are accumulated. The new n-gram frequencies are compared with the threshold value; any strings which exceed the value are noted. The procedure is repeated until the single characters are reached. Two types of analysis are possible, redundant and nonredundant. ·In the latter, any string exceeding the threshold value is removed from the list and not processed further, while in the former they continue to the next processing stage. While redundant analysis is valuable at the exploratory stage, the nonredundant type is preferred for key-set generation. The procedure was first applied to strings of characters starting with the first character of each surname, as illustrated in Figure 1. n-gram FOREMAN FOREMA FOREM FORE FOR FO F Frequency 11 13 24 98 143 214 1685 Fig. 1. Successive right-hand truncations of a surname during key-set generation Here the frequency of the surname FOREMAN in a _file of 50,000 names is eleven. When successively shortened, other surnames with the same ini- tial n-gram are included in the count. Comparison of the count with a threshold value results in selection of a key. Here, if the threshold were 100, the key selected would be FOR. Application of the procedure to the surnames of the 50,000 name file (the name records had a maximum of eighteen characters, left-justified and space-filled if less than this length), with a threshold frequency of 300 (i.e., a probability of 0.006), gave a key-set consisting of eighty-seven keys, including all the alphabetic characters. The key-set is shown, in al- phabetic order, together with the probabilities, in Table 3. It is clear that the most frequent characters at the beginning of the surname have pro- duced most keys, S and M with eight keys each, B with seven, K with six, and H, G, P, and R each with five keys. Whereas the relative entropy of the initial surname letter was 0.917, that of the key-set is 0.977. The prob- abilities of no less than seventy of the eighty-seven keys now lie between 0.005 and 0.015. The key-set itself consists of the twenty-six alphabetic characters (one of these, X, is not represented in the collection), fifty- Variety-Generator ApproachjFOKKER and LYNCH 111 Table 3. Key-set of 87 keys produced from 50,000 surnames from INSPEC files. Key P1'0bability Key Probability Key Probability Key Probability A .023 GA .009 M .001 RO .016 AL .007 GO .011 MA .022 s .027 AN .006 GR .012 MAR .008 SA .016 B .012 GU .007 MC .007 SCH .014 BA .013 H ,006 ME .010 SE .008 BAR .006 HA .021 MI .012 SH .016 BE .017 HE .010 MO .012 SI .010 BO .014 HO .012 MU .008 so .007 BR .014 HU .007 N .011 ST .016 BU .009 I .013 NA .008 T .030 c .013 J .010 NI .006 TA .010 CA .011 JO .007 0 .017 u .005 CH .016 K .015 p .011 v .015 co .013 KA .018 PA .014 VA .010 D .015 KI .008 PE .011 w .011 DA .009 KO .017 PO .010 WA .011 DE .013 KR .008 PR .006 WE .008 DO .007 KU .010 Q .001 WI .010 E .018 L .013 R .007 X F .025 LA .012 RA .011 y .011 FR .008 LE .014 RE .008 z .013 G .015 LI .009 RI .006 H=6.2952 Hmax = 6.443 (log,87) H, =0.977 eight digram keys, and the three trigram keys BAR, MAR, and SCH. The predominance of vowels as the second character of keys is noticeable; for- ty-nine of the sixty-one n-grams have a vowel in the second position. The size of the key-set produced from a given data base can be varied arbitrarily by changing the threshold value. An approximately hyperbolic relation obtains between the value of the threshold and the number of keys selected. As the size of the key-set increases, the length of the longest n-gram in the key-set increases, and the distribution of n-grams shifts to- ward higher values, as shown in Figure 2. Stability of the key-sets with increase in file size is clearly an important factor. To determine the extent of this, successive portions of the entire file of 100,000 surnames were subjected to the analysis at a threshold value of 0.005. As illustrated in Table 4, the key-sets are remarkably stable in re- gard to total key-set size, the number of keys of each length, and to the actual keys. Table 4. Stability of size and composition of keys with increasing file size. Number of Number of Number of Number of Total Size Entries in File Characters Digrams Trigrams of Key-set 25,000 26 76 10 112 50,000 26 74 9 109 75,000 26 74 10 110 100,000 26 75 10 111 No, of keys common to key-sets 26 73 9 108 112 ]oU1'nal of Library Automation Vol. 7/2 June 1974 400 300 Number of n-grams 200 100 1 2 3 4 5 6 7 8 9 Length of n-grams Key-set size A 184 B 332 c 572 D 1034 Threshold probability 0.0025 0.0015 0.0010 0.0007 10 11 12 13 Fig. 2. Distribution characteristics of n-grams generated from 10,000 surnames from INSPEC for four different threshold values As the size of the key-set increases, the range of probabilities represent- ed among the keys narrows, and the relative entropy of the distribution in- creases, becoming eventually asymptotic with the value of one. This i~ illus- trated in Figure 3, for the surnames in a file of 50,000 entries. Beyond a key-set size of about 100, increases in the relative entropy of the resultant distribution are marginal. Furthermore, with increasing key-set size, the Va1'iety-Gene1'ato1' AppmachjFOKKER and LYNCH 113 shorter and more frequent surnames begin to appear in their entirety as keys. As an alternative to increasing the variety of the keys, the production of keys from character positions after the first letter of the surname was con- sidered. The problem of variations in name length, as well as the very dif- ferent distributions of the characters at these positions, were not encourag- ing, and instead the production of key-sets from the last letter of the sur- 1 .99 .98 .97 .96 .95 .94 .93 Hr .92 .91 .90 .89 .88 .87 .86 0 20 40 60 80 100 Total number of keys for the front of surnames Fig. 3. Increase in relative entropy with increase in key-set size; keys generated from 50,000 surnames 114 J oumal of Library Automation Vol. 7/2 June 1974 name was investigated, and proved much more ath·active, since it is largely independent of surname length. KEY-SETS FROM THE END OF THE SURNAME For this purpose, each surname in the file was reversed within a record and subjected to key-generation. The relative entropy of the last character of the surname is substantially lower than that of the first character, at 0.860. Accordingly, the key-sets have a higher proportion of longer keys than those produced from the front of the surname, as shown in Table 5. This key-set consists of the twenty-six characters, seventy-eight digrams, Table 5. Key-set of 155 n-grams produced from last letter of 50,000 INSPEC surnames at threshold of 0.003. Key P1'obability Key P!'obability Key Probability Key Probability A .012 VICH ,005 EIN .005 IS .012 CA .003 GH .003 KIN .007 NS .006 DA .008 SH .003 LIN .005 INS .003 KA .006 TH .005 TIN .003 OS .004 MA .007 ITH .004 NN .010 RS .006 NA .003 I .014 ON .009 ss .005 INA .004 AI .004 SON .013 TS .004 RA .010 HI .007 LSON .004 us .004 TA .008 II .009 NSON .006 T .012 VA .004 VSKII .005 RSON .004 DT .003 OVA .010 KI .006 TON .009 ET .004 WA .004 SKI .005 0 .017 NT .004 YA .005 WSKI .004 KO .003 RT .003 B .003 LI .005 NKO .010 ERT .004 c .005 NI .007 NO .004 ST .004 D .009 RI .005 TO .007 TT .005 LD .005 TI .004 p .004 ETT .003 ND .006 J .001 Q .001 u .013 RD .009 K .010 R .005 v .001 E .020 AK .006 AR .006 EV .018 DE .003 CK .009 ER .016 ov .012 EE .004 EK .004 BER .003 KOV .008 GE .004 IK .004 DER .006 IKOV .004 KE .006 L .007 GER .005 LOV .005 LE .008 AL .006 NGER .003 NOV .006 NE .008 EL .012 HER .006 ANOV .006 RE .006 LL .004 IER .005 ROV .006 SE .005 ALL .004 KER .007 sov .003 TE .004 ELL .008 LER .007 w .005 F .003 M .008 LLER .005 X .004 FF .003 AM .005 MER .003 y .017 G .004 N .009 NER .010 AY .004 NG .004 AN .017 SER .003 EY .006 ANG .003 MAN .014 TER .008 LEY .007 ING .007 RMAN .003 OR .004 KY .004 RG .007 YAN .003 s .016 RY .005 H .004 EN .018 AS .007 z .007 CH .009 SEN .007 ES .011 TZ .006 ICH .003 IN .019 NES .004 H=7.059 Hmax = 7.276(log.155) Hr = 0.970 Va1'iety-Generator ApproachjFOKKER and LYNCH 115 1 .99 .98 .97 .96 .95 .94 .93 .92 Hr .91 .90 .89 .88 .87 .86 0 40 80 120 160 200 Total number of keys for the end of sumames E!g. 4. Increase in relative entropy with increase in key-set size; keys generated from 50,000 surnames forty trigrams, ten tetragrams, and a single pentagram. The breakdown of the individual terminal characters of the surname is also more extreme, since the distribution is more skew. Thus N, the most frequent last char- acter, has no fewer than nineteen different keys in this set, closely followed by R, with seventeen keys. The relative entropy of the distribution is again high, at 0.970 for this key-set. Figure 4 shows the relation between key-set size and relative entropy, and indicates that a larger number of keys from the last character of the surname is required to reach the same relative en- 116 Journal of Library Automation Vol. 7/2 June 197 4 tropy as keys from the first character. There is an anomalous section of the curve, which may well derive from the much greater prevalence of suffixes than prefixes in personal names. CONCLUSIONS This study has demonstrated the feasibility of devising partial represen- tations of author names by applying the variety-generator approach to overcome the substantial frequency variations encountered in their dis- tributions. It has also been shown that within a homogeneous file, i.e., one of consistent provenance, there exists a substantial level of consistency in terms of character distributions, as illustrated in Table 4. The character- istics may vary substantially between data bases of different provenance, e.g., as between INSPEC and MARC files. 17 Conventional approaches to processing records comprising linguistic data tend to disregard the statistical properties of the items, and attempt to overcome the resultant problems either by storage of extensive lists of items or by using complex numerical algorithms. Typical of this latter ap- proach, in the present context, is the use of truncated search keys for ac- cess to bibliographical files in direct access stores, in which fixed-length character strings are the keys, as, for instance, in the system in operation at the Ohio College Library Center.18 The problems encountered in the use of fixed-length truncated author and title search keys for monograph data are indicated by the fact that the search files using hash-addressing are operated, on average, at a density of only 62.5 percent. Once the density reaches 75 percent, the proportion of collisions and the resultant degrada- tion in performance are such that the files are recreated at a density of only 50 percent. Fixed-length keys from author and title entries are demonstrably ineffi- cient in performance since the information content is low. The distribu- tion of the initial trigrams of 50,000 names from the INSPEC file pro- vides corroboration of this fact. The number of possible combinations of three characters is 17,576 (263 ), yet only 3,285 trigrams were represented in the file, or 18.7 percent of the total variety. Moreover, the relative en- tropy of the trigrams is much lower than that of the initial characters of the surnames, at 0.73. Performance figures for precision illustrate this point.19 The present work, together with other studies of the scope for applica- tion of the variety-generator approach, thus stands in considerable con- trast to prior work, and must be viewed as a means whereby the microstruc- ture of particular data elements is fully reflected in their manipulation, affording substantial advantages. 20 Part 2 of this paper illustrates this in re- gard to searches of personal names. ACKNOWLEDGMENTS We thank M. D. Martin of the Institution of Electrical Engineers for Vm·iety-Generator ApproachjFOKKER and LYNCH 117 provisiOn of a part of the INSPEC data base and of file-handling soft- ware, and the Potchefstroom University for C.H.E. (South Africa) for awarding a National Grant to D. Fokker to pursue this work. We also thank Dr. I. J. Barton and Dr. G. W. Adamson for valuable discussions, and the former for n-gram generation programs. REFERENCES I. P. B. Schipma, Term Fragment Analysis for Inversion of Large Files (Chicago: Illi- nois Institute of Technology Research Institute, 1971). 2. J. C. Costello and E. Wall, "Recent Improvements in Techniques for Storing and Retrieving Information," in Studies in Co-ordinate Indexing, vol. 5 (Washington, D.C.: Documentation Inc., 1959). 3. L. H. Thiel and H. S. Heaps, "Program Design for Retrospective Searches on Large Data Bases," Information Storage and Retrieval8:1-20 (Feb. 1972). 4. S.C. Bradford, Documentation (London: Crosby-Lockwood, 1948). 5. G. K. Zip£, Human Behaviour and the Principle of Least Effort (Cambridge, Mass: Addison-Wesley, 1949). 6. B. Mandelbrot, "An Informational Theory of the Statistical Structure of Language," in W. Jackson, ed., Communication Theory (London: Butterworth, 1953), p.486- 501. 7. R. A. Fairthorne, "Empirical Hyperbolic Distributions (Bradford-Zipf-Mandelbrot) for Bibliometric Description and Prediction," ]oumal of Documentation 25:319-43 (Dec. 1969). 8. M. F. Lynch, "The Microstructure of Chemical Data-bases, and Their Repre- sentation for Retrieval," Proceedings, CN AI NATO Advanced Study Institute on Computer Representation and Manipulation of Chemical Information (in press). 9. I. J. Barton, S. E. Creasey, M. F. Lynch, and M. J. Snell, "An Information-Theo- retic Approach to Text Searching in Direct-Access Systems," Communications of the ACM (in press). 10. G. W. Adamson, J. Cowell, M. F. Lynch, A. H. W. McLure, W. G. Town, and A. M. Yapp, "Strategic Considerations in the Design of Screening Systems for Substructure Searches of Chemical Structure Files," ]oumal of Chemical Docu- mentation 13:153-57 (Aug. 1973). 11. A. C. Clare, E. M. Cook, and M. F. Lynch, "The Identification of Variable-Length, Equifrequent Character Strings in a Natural Language Data Base," Computer Journal15:259-62 (Aug. 1972). 12. C. E. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal 27: 398-403 ( 1948) . 13. W. C. B. Sayers, A Manual of Classification for Librarians and Bibliographers (London: Grafton, 1926), 14. C. A. Cutter, C. A. Cutter's Alphabetic Order Table ... Altered and Fitted with Three Figures by Kate E. Sanborn (Boston: Boston Library Bureau, 1896). 15. C. P. Bourne and D. F. Ford, "A Study of the Statistics of Letters in English Words," Information & Control4:48-67 (1961). 16. H. Ohlman, "Subject Word Letter Frequencies; Applications to Superimposed Coding," Proceedings of the Inte1'national Conference of Scientific Information, Vol. 2 (Washington, D.C.: National Academy of Science, 1959), p.903-16. 17. D. W. Fokker and M. F. Lynch, "A Comparison of the Microstructure of Author Names in the INSPEC, Chemical Titles and B.N.B. MARC Data-bases" (in preparation). 118 ]oumalof Library Automation Vol. 7/2 June 1974 18. F. G. Kilgour, P. L. Long, A. L. Landgraf, and J. A. Wyckoff, "The Shared Cata- loging System of the Ohio College Library Center," Journal of Library Automation 5:157-83 (Sept. 1972). 19. F. G. Kilgour, P. L. Long, and E. B. Leiderman, "Retrieval of Bibliographic Entries from a Name-Title Catalog by Use of Truncated Search Keys," Proceedings of the ASIS 7:79-82 (1970). 20. I. J. Barton, M. F. Lynch, J. H. Petrie, and M. J. Snell, "Variable-Length Character String Analysis of Three Data-Bases, and Their Application for File Compression," Proceedings, 1st Informatics Con£., Durham, 1973 (in press).