201 Application of the Variety-Generator Approach to Searches of Personal Names in Bibliographic Data Bases-Part 2. Optimization of Key-Sets, and Evaluation of Their Retrieval Efficiency Dirk W. FOKKER and Michael F. LYNCH: Postgraduate School of Librarianship and Information Science, University of Sheffield, England. Keys consisting of variable-length chamcter strings from the front and rear of surnames, derived by analysis of author names in a particular data base, am used to provide approximate representations of author names. When combined in appropriate mtios, and used together with keys for each of the first two initials of personal names, they provide a high degme of discrimination in search. Methods for optimization of key-sets are desc1·ibed, and the perform- ance of key-sets varying in size between 150 and 300 is determined at file sizes of up to 50,000 name entries. The effects of varying the proportions of the queries present in the file are also examined. The results obtained with fixed-length keys are compared with those f01' variable-length keys, showing the latter to be greatly superior. Implications of the work for a variety of types of information systems a1'e discussed. INTRODUCTION In Part I of this series the development of variety generators, or sets of variable-length keys with high relative entropies of occurrence, from the initial and terminal character strings of authors' surnames was described.1 Their purpose, used singly or in combination, is to provide a high and con- stant degree of discrimination among personal names so as to facilitate searches for them. In this paper the selection of optimal combinations of the keys and evaluation of their efficiency in search are described. The performance of combined key-sets of various compositions is determined at a range of file sizes and compared with fixed-length keys. In addition, 202 1 ournal of Lib1'm'y Automation Vol. 7 I 3 September 197 4 the extent of statistical associations among keys from different positions in the names is determined. BALANCING OF KEY-SETS The relative entropies of distribution of the first and last letters of the surnames of authors in the file of 100,000 entries from the INSPEC data base differ significantly, the former being 0.92 and the latter 0.86. As a re- sult, a larger key-set has to be produced from the back of the surnames to reach the same value of the relative entropy as that of a key-set of given size from the front of the surname. For instance, the value of 0.954 is reached by a key-set comprising 41 keys from the front of the name, but a set of 101 keys from the back is needed to attain this value. It seemed reasonable to assume that keys from the front and rear should be com- bined in different proportions in order to maximize the relative entropy of the combined system, and that their proportions should reflect the redun- dancies of each distribution (redundancy = 1 - Hr). In order to test this, a series of combined key-sets of different total sizes was produced, in which the proportions of keys were varied around the ratio of the redun- dancies of the first and last character positions, i.e., ( 1 - 0.92): ( 1 - 0.86), or 8:14. The relative entropies of the name representations provided by combining these key-sets with keys for the first and second initials were de- termined by applying them to the 50,000 name file, and the entropy value used to determine the optimal ratio of keys. In one case, the correlation between the value of the relative entropy and retrieval efficiency, as mea- sured by the precision ratio, was also studied, and shown to be high. The sizes of the combined key-sets studied were 148 and 296, with an in- termediate set of 254 keys. The values of 148 and 296 were chosen in view of the projected implementation in the serial-parallel file organization.2 This relates the size of the key-set to the number of blocks on one cylinder of a disc. (The 30Mbyte disc cartridges available to us have 296 blocks per cylinder.) Otherwise the choice of key-set is arbitrary, and can be varied at will. The minimum key-set size is 106, consisting of 26 letters each for the first and last letter of the surname, and 27 ( 26 letters and the space sym- bol) each for the first and second initials. The numbers of n-gram keys ( n ::::,. 2) required for the key-sets numbering 148, 254, and 296 in size are . thus 42, 148, and 190. Full details are given of the composition of the first and third of these sets. A slight refinement to key-set generation was employed to ensure as close an approximation to equifrequency as possible, especially with the small- est key-sets. Precise application of a threshold frequency may occasionally result in arbitrary inclusion of either very high or very low frequency keys. Thus, if almost all the occurrences of a longer key are accounted for by a shorter key (as with -MANN and -ANN), only the shorter n-gram is included. Va1'iety-Generato1· Approach/FOKKER and LYNCH 203 OPTIMAL SET OF 148 KEYS The number of n-gram keys ( n ::::::,. 2) to be added to the minimum set of 106 keys is 42, the presumed optimum proportion being 8:14, which im- plies about 16 keys from the front of the name and 26 from the back. In order to examine the relationship between the ratio of keys from the front and rear of the surname and the relative entropy of the combined sets, the ratios were varied at intervals between 1:1 and 1:3 so that the numbers of n-grams varied from 21 and 21 to 11 and 31 respectively. For each ratio the keys were applied to the 50,000 name entries, and the distri- bution of the resultant descriptions determined. The ratios, the number of n-gram keys, and the relative entropies of the distributions are shown in Table 1. The maximum value of the entropy is taken to be log250,000. In this case the balancing point, with the key-set including 16 n-gram keys Table 1. Relation between Ratio of n-grams f1'0m F1'Dnt and Rear of Surname, Entropy of Combined Key-Sets, and Retrieval Efficiency for a Series of Sets of 148 Keys Ratio Numbm· of n-gram Number of Diffm·ent Relative · Precision(%) of n-gram Keys Representations Entropy (File Size= Keys Front Back in 50,000 Entries of System 25,000) 1:1 21 21 33,485 0.9450 71.5 3:4 18 24 33,501 0.9450 71.3 17:25 17 25 33,434 0.9447 70.9 8:13 16 26'* 33,454 0.9453 72.2 5:9 15 27 33,402 0.9450 72.0 1:2 14 28 33,378 0.9449 72.1. 1:3 11 31 33,126 0.9437 71.5 Total number of different name entries = 41,469. '* Key-set with highest relative entropy. from the front and 26 from the back, corresponds with the ratio of the redundancies of the first and last letters of the surnames. Table 2 shows the composition of the optimal key-set of 148 keys, while Table 3 gives the distribution of the name representations compiled from the combined key-set, and its corresponding relative entropy. OPTIMAL SET OF 296 KEYS A similar procedure to that used for the optimal148-key key-set was also applied in this instance. Here the ratios of front and rear n-gram keys varied from 57 and 133 to 69 and 121 respectively. For each of the sets chosen, the distributions of the entries resulting from application of the combined key-sets to the file of 50,000 names were determined. These showed virtually no difference in terms of the relative entropy alone, al- though the total number of different entries differed slightly between key- sets, and the highest value was used to choose the optimal set, detailed in Table 4. The range of combinations studied is shown in Table 5, and the distribution of the entries for the optimal set is given in Table 6. .. , -::: 204 Journal of Library Automation Vol. 7/3 September 197 4 Table 2. Composition of Balanced Key-Set of 148 Keys Keys from front of surname ( 42) : Key P• Key P• Key P• Key P• A .035 G .055 MA .030 SH .016 B .020 H .035 N .025 ST .016 BA .020 HA .021 0 .017 T .040 BE .017 I .013 p .038 u .005 BO .014 J .017 PA .014 v .025 BR .014 K .041 Q .001 w .040 c .036 KA .017 R .032 X CH .016 KO .017 RO .017 y .011 D .044 L .033 s .049 z .013 E .018 LE .014 SA ,016 F .034 M .050 sc .015 Keys from rear of surname (52) : A .060 II .015 NN .010 IS .012 RA .010 KI .015 ON .018 T .042 VA .015 J .001 SON .027 u .013 B .003 K .033 0 .028 v .001 c .005 L .013 KO .013 EV .018 D .030 EL .012 p .004 ov .026 E .068 LL .016 Q .001 ,KOV .012 F .006 M .013 R .016 NOV .on G .012 N .009 ER .064 w .005 NG .014 AN .020 LER .013 X .003 H .020 MAN .017 NER .010 y .031 CH .017 EN .025 s .055 EY .012 I .044 IN .039 ES .015 z .013 Keys from first initial: 27 characters Keys from second initial: 27 characters Table 3. Frequencies of Entries Represented by Optimall48-Key Key-Set in a File of 50,000 Names Frequency Number of Entries with f Frequencyf 1 24,363 2 5,622 3 1,850 4 757 5 372 6 193 7 103 8 68 9 32 10 24 11-15 54 16--20 11 21-30 4 33 1 Total number of different entries = 33,454 Maximum number of possible combinations= 1,592,136 (i.e., 42 x 52 x 27") H = 14.7553 Hmax = 15.6096{log,50,000) Hr = 0.9453 Variety-Generator Approach/FOKKER and LYNCH 205 Table 4. Composition of Balanced Key-Set of 296 Keys Keys from front of surname ( 87) : A BU E HA KI MA NI RA SI WA AL c F HE KO MAR 0 RE so WE AN QA FR HO KR MC p RI ST WI B CH G HU KU ME PA RO T X BA co GA I L MI PE s TA y BAR D GO J LA MO PO SA u z BE DA GR JO LE MU PR sc v BO DE GU K Ll N Q SE· VA BR DO H KA M NA R SH w Keys from rear of surname ( 155) : A LD NG VSKII EL LIN R OR NT sov CA ND ANG KI LL TIN AR s RT w DA RD lNG SKI ALL NN ER AS ERT X KA E RG WSKI ELL ON BER ES ST y MA DE H LI M SON DER NES TT AY NA EE CH NI AM LSON GER IS ETT EY INA GE ICH RI N NSON NGER NS u LEY RA KE VICH TI AN RSON HER INS v KY TA LE GH J MAN TON IER OS EV RY VA NE SH K RMAN 0 KER RS ov z OVA RE TH AK YAN KO LER ss KOV TZ WA SE ITH CK EN NKO LLER TS IKOV YA TE I EK SEN NO MER us LOV B F AI IK IN TO NER T NOV c FF HI L EIN p SER DT ANOV D G II AL KIN Q TER ET ROV Keys from first initial: 27 characters Keys from second initial: 27 characters Table 5. Relation between Ratio of n-grams from Front and Rear of Surname and Entropy of Combined Key-Sets for a Series of Sets of 296 Keys (File Size= 50,000) Ratio ofn-gram Keys 3:7 61:129 13:25 69:121 Number of n-gram Keys Front 57 61 65 69 Back 133 129'* 125 121 '* Key-set with highest number of different entries. Number of Different Representations 39,182 39,191 39,186 39,179 Relative Entropy of System 0.9679 0.9679 0.9679 0.9679 In this instance, the ratio of n-gram keys from the front and back of the surnames has been displaced from the ratio of the redundancies of the first and last characters of the surnames, i.e., 8:14 (1:1.7). Here the ratio is roughly 1:2. This is undoubtedly due to the fact that the relative entro- pies of key-sets from the back of the surname increase less rapidly than those of key-sets from the front, and hence larger sets must be employed. EVALUATION OF RETRIEVAL EFFECTIVENESS The keys in the optimized key-sets represent name entries in an approxi- ,, I' i: 206 ] oumal of Librm·y Automation Vol. 7 I 3 September 197 4 Table 6. Frequencies of Entries Represented by Optimal Key-Set of 296 Keys in a File of 50,000 Names Frequency f 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Total number of different entries = 39,191 Number of Entries with Frequencyf 31,705 5,394 1,371 442 164 63 27 12 4 3 2 2 1 1 Maximum number of possible combinations= 9,830,565 (i.e., 87 X 155 x 27') H = 15.108 Hmax = 15.6096(log,50,000) Hr = 0.9679 mate manner only, so that when a search for a name is performed, addi- tional entries represented by the same combination of keys are identified. While these may be eliminated in a subsequent character-by-character match of the candidate hits, the proportion of unwanted items should re- main low if the method is to offer advantages. In evaluating the effectiveness of the key-sets in the retrieval, the names in the search file were represented by concatenating the codes for the keys from the front and back of the surnames and the initials, and subjecting the query names to the same procedure. The matching procedure produced lists of candidate entries, of which the desired entries were a subset. The final determination was carried out manually. The tests were performed first with names sampled from the search file, so that correct items were retrieved for each query. Since searches for name entries may be performed with varying probabilities that the authors' names are present in the file (especially in current-awareness searches), varying proportions of names of the same provenance, but known not to be present in the search file, were also added. In these cases candidate items were selected which included none of the desired entries. Recall tests were also performed and recall shown to be complete. The measure used in determining the performance of the variety-gen- erator search method is the precision ratio, defined as the ratio of correctly identified names to all names retrieved. It is presented both as the ratio of averages (i.e., the summation of items retrieved in the search and cal- culation of the average) and as the average of ratios (i.e., averaging the Val'iety-Genemtor App1'0ach/FOKKER and LYNCH 207 figures for individual searches). The latter gives higher figures, since many of the individual searches give 100 percent precision ratios. The precision ratio was found to be dependent on file size and to fall somewhat as the size of file increases. This is due to the fact that the key- sets provided only a limited, if very high, total number of possible combi- nations, while the total possible variety of personal names is virtually un- limited. The evaluation was performed with a sample of 700 names, selected by interval sampling. This number ensured a 99 percent confidence limit in the results. A comparison of the interval sampled query names with ran- domly sampled names showed that no bias was introduced by interval sam- pling. A test to confirm that the retrieval effectiveness reached a peak at the maximum value of the relative entropy of a balanced key-set was per- formed first. This was carried out on a file of 25,000 names, using as queries names selected from the file and the optimal 148-key key-set. As shown in Table 1, the values of the precision ratio (ratio of averages) and of the relative entropy both peak at the same ratio of n-gram keys from the front and back of the surnames. The performance of the optimal key-sets of 148, 254, and 296 keys with files of 10,000, 25,000, and 50,000 names is shown in Table 7. Calculated as the ratio of averages, the smallest key-set ( 148 keys) shows a precision ratio of 64 percent with a file of 50,000 names, which means that of every three names identified in the variety-generator search, two are those de- sired. With the largest key-set ( 296 keys), this rises to nine correctly identi- fied names in every ten retrieved at this stage. On the other hand, calculat- ed as the average of ratios, the precision ratios rise to 81 percent and 94 percent respectively. For smaller file sizes-typical, for instance, of cur- rent-awareness searches-the figures for all of these are cotTespondingly higher. Table 7. Precision Ratios Obtained in Variety-Generator Searches of Personal Names-Queries Sampled from Sea1'ch File (Confidence Level= 99 Pm·cent) Precision as ratio of averages (%) : File Size 50,000 25,000 10,000 Precision as average of ratios (%) : File Size 50,000 25,000 10,000 148 64 71 84 148 81 87 93 Key-Set Size 254 87 90 93 Key-Set Size 254 91 95 97 296 90 91 94 296 94 96 97 '; ~;: 208 Journal of Library Automation Vol. 7/3 September 1974 The effect of sampling from a larger file, so that increasing proportions of the names searched for are not present in the search file, is shown in Table 8 for a file of 25,000 names. In this case, the proportion of correct- ly identified names in the total falls, so that overall performance is some- what reduced. Thus, depending both on file size and on the expected pro- portion of queries identifying hits, the key-set size can be adjusted to reach a desired level of performance. In addition, tests to determine the Table B. Effect of Varying Proportion of Query Names Not Present in Search File of 25,000 Names, Using 296 Keys (Ratio of Averages) %of Names Not Precision% Number of Names Number of Names in Search File (Ratio of Averages) Ret1·ieved Correctly Retrieved 21 90 766 691 42 85 595 505 61 83 449 371 74 76 319 242 84 68 228 154 applicability of a key-set optimized for one file of 50,000 names to another file of the same provenance and size were carried out. The three key-sets derived from the first file were applied to the second, query names sam- pled from the latter, and the precision ratios determined. Some reduction in performance was observed; expressed as ratio of averages, the precision with the 296-key key-set fell from 90 to 83 percent, with the 254-key key- set from 87 to 82 percent, and with the 148-key key-set from 64 to 56 per- cent, figures which seem unlikely to prejudice the net performance in any marked way. Nonetheless, monitoring of performance and of data base name characteristics over a period of operation might well be advisable. DISTRIBUTION CHARACTERISTICS OF OTHER TYPES OF KEYS It is particularly instructive to examine the distribution characteristics of other types of keys, including those of fixed length, generated from various positions in the names, and to compare them with those of the opti- mal key-sets employed in the variety-generator approach. To this end, the file of 50,000 names was processed to produce the following keys or key- sets: 1. Initial digram of surname. 2. Initial trigram of surname. 3. Key-set of ninety-four n-grams from the front of the surname, with first and second initials. 4. Key-set consisting of first and last character of surname, with first and second initials. The figures (Table 9) show clearly that all have distributions which leave no doubt as to their relative inadequacy in resolving power, where this is defined as the ratio of distinct name representations provided by the key-set used to the number of different name entries ( 41,469) in the file. At the digram level, the value of the resolving power is 0.009, i.e., each Vm·iety-Generator Approach/FOKKER and LYNCH 209 digram represents, on average, 110 different name entries, while no fewer than thirty-two specific digrams each represent between 500 and 1,000 dif- ferent names. At the trigram level, the value of the resolving power rises to 0.08, a tenfold increase; however, one trigram still represents between 500 and 1,000 different names. Use of the first and last letters of the surname plus the initials again in- creases the value of the resolving power to 0.627, or 1.6 distinct names per entry; eight of the representations now account for between thirty-one Table 9. Distributions of a Variety of Other Representations of Personal Names in a File of 50,000 Entries 94 n-grams from First and Last Frequency Initial Digram Initial Trigram Front of Surname Letter of Surname f of Surname of Surname Plus 2 Initials Plus 2 Initials 1 40 735 8,964 16,346 2 22 428 3,929 4,919 3 16 249 1,884 2,025 4 11 197 1,006 973 5 7 170 646 581 6 7 110 397 340 7 10 112 234 224 8 4 98 186 146 9 7 81 144 92 10 5 66 108 72 11 6 61 70 49 12 2 56 88 36 13 5 51 74 33 14 1 48 50 24 15 2 35 51 23 16 3 37 36 25 17 2 35 29 15 18 3 33 29 11 19 8 35 28 6 20 8 40 23 5 21-30 21 207 127 49 31-40 23 109 47 8 41-50 13 88 13 51-100 36 142 3 101-200 24 62 201-500 57 15 501-1000 32 1 Total 375 3,301 18,166 26,002 Resolving power .009 .080 .438 .627 and forty distinct entries. In contrast, however, the key-set of 148 keys comprising ninety-four n-gram keys from the front of the name and the first and second initials, although almost 50 percent larger than the four- character representation, has a resolving power of only 0.438 (or 2.28 en- tries per representation). This contrast provides particularly strong evi- dence for the superiority of keys from the front and rear of the surnames over those from the front alone, even when the latter are variable in •' 210 Journal of Library Automation Vol. 7/3 September 1974 length. As expected, the precision ratio of the four-character representa- tion is low, at 37 percent (ratio of averages), compared with 64 percent for the optimal148-key key-set. EXTENT OF STATISTICAL ASSOCIATION AMONG KEYS Thus far, the frequency of occurrence of variable-length character strings from the front and back of the surnames is the only factor consid- ered in their selection as keys. It is well known in other areas that statisti- cal associations among keys can influence the effectiveness of their combi- nations. 3 Where a strong positive association between two keys exists, their intersection results in only a small reduction of the number of items re- trieved over that obtained by using each independently. When the associa- tion is strongly negative, the result of intersection may be much greater than that predicted on the basis of the product of the individual proba- bilities of the keys. To assess the extent of associations among keys from the front and rear of surnames and initials, sets of both fixed- and variable-length keys from each of these positions were examined.· The Kendall correlation coefficient V was calculated for each of the twenty most frequent combinations of these. This is related to the chi-square value by the expression X2 =m V2 where m is the file size, or 50,000. Table 10 shows the values of the associa- tion coefficient for certain of the characters in the full name. Those above .012 are significant at a 99 percent confidence level. Positive associations are Table 10. A8sociation Coefficients for Sets of the Most Frequent Digrams from Various Posi- tions in Personal Names First and Last First Letter of Surname First and Second Letters of Surname and First Initial Initials Digram v Digram v Digram v KV .064 KV .054 HV .078 WR .050 HJ .027 MV .069 KA .038 BR -.024 KV .069 HN .028 SJ -.023 RV -.055 SA .024 DJ .022 DV -.053 SN .024 BG .018 TV .053 CN .022 KA .018 JV -.045 KN -.020 CJ ,018 SV .034 MA .014 SD .015 FV .033 KR -.011 sv .013 NV -.029 sv ,010 MM .011 GV .022 RN .010 MJ ,007 LV -.022 BN -.008 BJ ,005 IV -.019 BR .008 SG -.004 AV -.019 MN -.007 SR .004 CV -.018 SR .007 BA .004 PV .017 MR .004 MA ,004 WV -.014 SI -.002 SM -.003 YV .010 GN .001 MR .002 BV .005 LN .001 SA -.000 EV -.002 Variety-Generator App1'0ach/FOKKER and LYNCH 211 more frequent than negative. The figures indicate that intersection of cer- tain of these characters as keys in search would result in some slight dimi- nution in performance against that expected. The figures for the association coefficients among the twenty most fre- quent combinations of keys from the front and back of surnames in the 148- and 296-key key-sets show magnitudes (mostly positive) which are sub- stantially greater than those for single characters (see Table 11). The rea- sons for these values are obvious; in certain instances, e.g., MILLER, JONES, and MARTIN, common complete names are apparent, while in one case, LEE, an overlap between keys from the front and rear exists. In others, linguistic variations on common names can be discerned, as with BR N-BROWN or BRAUN. Table 11. Association Coefficients in the Twenty Most Frequent Key Combinations from Front and Back of Surnames in Two Key-Sets Key-Set Size Key-Set Size 148 296 Keys v Keys v s H .146 s ITH .343 J SON .127 JO NSON .297 sc ER .104 JO NES .278 w s .043 AN RSON .274 T A .038 SI GH .249 T I .038 LE EE .221 w ER .038 MU LLER .214 c E .034 TA OR .195 F ER .033 GU TA .168 p s .025 BR N .160 D E .023 MI LLER .151 L E .022 MAR TIN .145 w E .022 WI s .137 G IN .020 F HER .133 M E .009 sc DER .121 s A .008 SA TO .110 G E .006 T AS .084 M A .005 sc ER .069 M ER -.004 CH EN .055 G ER -.000 T SON .050 Such associations are inevitable. When the selection of keys is based solely on frequency, some deviation from the ideal of independence must result, becoming larger as the size of the key-sets increases, and as the length of certain of the keys increases. However, since its effect in the most extreme cases is merely to lead to virtually exact definition of the most fre- quent surnames, no particular disadvantage results. POSSIBLE IMPLEMENTATIONS OF THE VARIETY-GENERATOR NAME SEARCH APPROACH The variety-generator approach permits a number of possible implemen- tations of searches for personal names to be considered, if only in outline f ( f•j/ 212 Journal of Library Automation Vol. 7/3 September 1974 at this stage, using a variety of file organization methods. The most widely known methods (apart from purely sequential files) are direct access (uti- lizing hash-addressing), chained, and index sequential files. Direct application of the concatenated key-numbers as the basis for hash-address computation appears attractive in instances where the person- al name is used alone or in combination (as, for instance, with a part of the document title). The almost random distribution of the bits in this code should result in a general diminution of the collision and overflow problems commonly encountered with fixed-length keys. Since only four keys are used to represent each name, and the four sets of keys from which these are selected are limited in number and of ap- proximately equal probability, the keys can be used to construct chained indexes, to which, however, the usual constraints still apply. Index sequential storage again offers opportunities, in particular since the low variety of key types means that the sorting operations which this entails can be eliminated. In effect, each name entry would be represented by an entry in each of four lists of document numbers or addresses, and documents retrieved by intersection of the lists. While four such numbers are stored for each name, in contrast to a single entry for the more con- ventional name list, the removal of the name list itself would more than compensate for the additional storage required for the lists. In the index sequential mode, the lists of document addresses or num- bers stored with each key are more or less equally long. They may thus be replaced by bit-vectors in which the position of a bit corresponds to a name or document number. If the number of keys bears a simple relation to the number of blocks on a disc cylinder, the vectors can be stored in pre- determined positions within a cylinder, resulting in the serial-parallel file. The usefulness of this file organization has yet to be fully evaluated; however, it also promises substantial economies in storage. On average, only four of the bits are set at the positions in the vectors corresponding to the name or document entry. On average, then, the density of 1-bits is very low, and long runs of zeros occur in the vectors. They can, therefore, be compressed using run-length coding, for instance as applied by Brad- ley.3· 4 Preliminary work with the 296-key key-set has indicated already that a gross compression ratio of nine to one is attainable, so that the explicit storage requirements to identify the association between a name and a document number would be just over thirty bits. CONCLUSIONS The work described here relates solely to searches for individual occur- rences of personal names. Clearly, in operational systems in which one or more author names are associated with a particular bibliographical item, it will be necessary to provide for description of each of these for access. If this is provided solely on the basis of a document number, some false coordination will occur-for instance, when the initials of one entry are Variety-Generator Approach/FOKKER and LYNCH 213 combined with the surname of another. A number of strategies can be en- visaged to overcome this problem. , The performance figures show clearly that a small number of character- istics-between 100 and 300 in this study-are sufficient to characterize the entries in large files of personal names and to provide a high degree of resolution in searches for them. While performance in much larger files, involving the extension of key-set sizes to larger munbers, has yet to be studied, the logical application of the concept of variety generation would appear to open the way to novel approaches to searches for documents as- sociated with particular personal names, which seem likely to offer ad- vantages in terms of the overall economic performance of search systems, not only in bibliographic but also in more general computer-based infor- mation systems. ACKNOWLEDGMENTS We thank M. D. Martin of the Institution of Electrical Engineers for provision of a part of the INSPEC data base and of file-handling soft- ware, and the Potchefstroom University for C.H.E. (South Mrica) for awarding a National Grant to D. Fokker to pursue this work. We also thank Dr. I. J. Barton and Dr. G. W. Adamson for valuable discussions, and the former for n-gram generation programs. REFERENCES 1. D. W. Fokker and M. F. Lynch, "Application of the Variety-Generator Approach to Searches of Personal Names in Bibliographic Data Bases-Part 1. Microstructure of Personal Authors' Names," Journal of Library Automation 7:105-18 (June 1974). 2. I. J. Barton, S. E. Creasey, M. F. Lynch, and M. J. Snell, "An Information-Theoretic Approach to Text Searching in Direct-Access Systems," Communications of the ACM (in press). 3. S. D. Bradley, "Optimizing a Scheme for Run-Length Encoding," Proceedings of the IEEE 57:108-9 (1969). 4. M. F. Lynch, "Compression of Bibliographic Files Using an Adaptation of Run- Length Coding," Information Storage and Retrieval 9:207-14 (1973). r /' I,