AN ALGORITHM FOR VARIABLE-LENGTH PROPER-NAME COMPRESSION 257 James L. DOLBY: R & D Consultants Company, Los Altos, California Viable on-line search systems require reasonable capabilities to automa- tically detect (and hopefully correct) variations between request format and stored format. An important requirement is the solution of the prob- lem of matching proper names, not only because both input specificatiof.I,S and storage specifications are subject to error, but also because various transliteration schemes exist and can provide variant proper name forms in the same data base. This paper reviews several proper name matching schemes and provides an updated version of these schemes which tests out nicely on the proper name equivalence classes of a suburban telephone book. An appendix lists the corpus of names used for algorithm test. A viable on-line search system cannot reasonably assume that each user will invariably provide the proper input information without error. Human beings not only make errors, but also expect their correspondents, be they human or mechanical, to be able to cope with these errors, at least at some reasonable error-rate level. Many of the difficulties in implementing com- puter systems in many areas of human activity stem from failure to rec- ognize, and plan for, routine acceptance of errors in the systems. Indeed, computing did not become the widespread activity it is now until the so- called higher-level languages came into being. Although it is customary to think of higher-level languages as being "more English-like," the height of their level is better measured by the brevity with which various jobs can be expressed (for brevity tends to reduce errors) and the degree of sophistication of their automatic error detection and correction procedures. The processing of catalog information for the purposes of exposing and retrieving information presents at least two major areas for research in automatic error detection and correction. At the first stage, the data bank must be created, updated and maintained. Methods for dealing with input errors at this level have been derived by a number of groups and it seems reasonable to assert that something in the order of 60% of the input errors can be detected automatically ( 1,2,3 ). With the possibility of human proof- 258 Journal of Library Automation Vol. 3/4 December, 1970 reading and error detection through actual use, it is reasonable to expect a mature data base to have a very low over-all error rate. At the second stage, however, when a user approaches the data base through a terminal or other on-line device, the errors will be of a recurring nature. Each user will generate his own error set and, though experience will tend to minimize the error rate for a particular user, there will be an essentially irreducible minimum error rate even for an experienced user. If the system is to attract users other than professional interrogators, it must respond intelligently at this minimal error level. This paper explores certain problems associated with making "noisy matches" in catalog searches. Because preliminary information indicates that the most likely source of input errors is in the keyboarding of proper names, the main emphasis of the paper is on the problem of algorithmically compressing proper names in such a way as to identify similar names (and likely misspellings) without over-identifying the list of possible authors. EXISTING NAME-COMPRESSION ALGORITHMS The problem of providing equivalence classes of proper names is hardly new. Library catalogs, telephone directories and other major data bases have made use of "see-also"-type references for many years. Some years ago Remington-Rand derived an alphanumeric name compression algor- ithm, SOUNDEX, that could be applied either by hand or by machine for such purposes ( 4). Perhaps the most widely used on-line retrieval system presently in existence, the airline reservation system (such as SABRE), makes use of such an algorithm (5). The closely related problem of compressing English words (either to establish noisy matches, to elimi- nate misspelled words, or simply to achieve data bank compression) has also received some attention ( 6, 7, 8). Implementation of such algorithms has been described ( 9, 10, 11, 12, 13). Although English word structure differs from proper-name structure in some important respects (e.g., the existence of suffixes), three of the algorithms are constructed by giving varying degrees of attention to the following five areas of word structure: 1 ) The character in word initial position; 2) The character set: (A, E, I, 0, U, Y, H, W); 3) Doubled characters (e.g., tt); 4) Transformation of consonants (i.e., all alphabetic characters other than those in 2 above) into equivalence classes; 5) Truncation of the residual character string. The word-initial character receives varying attention. SOUNDEX places the initial consonant in the initial position of the compressed form and then transforms all other consonants into equivalence classes with numeric titles. SABRE maintains the word-initial character even if it is a vowel. In the Armour Research Foundation scheme (ARF), the word-initial character is also retained as is. Algorithm for Name CompressionjDOLBY 259 Both SOUNDEX and SABRE eliminate all characters in the set 2) above. The ARF scheme retains all characters in shorter words and deletes vowels only, to reduce the compressed form to four characters, deleting the "U" after "Q," the second vowel in a vowel string, and then all re- maining vowels. All three systems delete the second letter of a double-letter string. SABRE goes a step further and deletes the second letter of a double- letter string occurring after the vowels have been deleted. Thus, the second "R" of "BEARER" would be deleted. SOUNDEX maps the eighteen consonants into six equivalence classes: 1) B, F, P, V 2) C, G, J, K, Q, S, X, Z 3) D, T 4) L 5) M, N 6) R SABRE and ARF do not perform any transformations on these eighteen consonants. Finally, all three systems truncate the remaining string of characters to four characters. For shorter forms, padding in the form of zeros (SOUNDEX), blanks (SABRE), or hyphens (ARF) is added so that all codes are precisely four characters long. Variable-length coding schemes have been considered but generally rejected for implementation on major systems because of the attendant difficulties of programming and the fact that code compression is en- hanced by fixed-length codes where no interword space is necessary. Although fixed-length schemes of length greater than four have been considered, no definitive data appears to be available as to the enhanced ability of compressed codes to discriminate by introduction of more characters. The SABRE system does add a fifth character but makes use of the person's first initial for added discrimination. Tukey ( 14) has constructed a personal author code for his citation indexing and permuted title studies on an extensive corpus of the statistical literature. In this situation the author code is a semi-mnemonic code in a tag form to assist the user in identification rather than to be used as a basic entry point. However, Tukey does note that in his corpus a three- character code of the surname, plus two initials, is superior to a five- character surname code for purposes of unique identification. MEASURING ALGORITHMIC PERFORMANCE One of the main problems in constructing linguistic algorithms is to decide on appropriate measures of performance and to obtain data bases for implementing such measures. In this case it is clear that certain improvements in existing algorithms can be made- particularly by using more sophisticated b·ansformation rules for the consonants - and that 260 Journal of Librat·y Automation Vol. 3/4 December, 1970 the problems of implementing such changes are not so great in today's context as they were when the systems noted above were originally derived. Improvements in processing speeds and programming languages, how- ever, do not remove the need for keeping "linguistic frills" to a minimum. Ideally, it would be desirable to have a list of common errors in key- boarding names as a test basis for any proposed algorithms. Unfortunately, no such list of sufficient size appears to be available. Lacking this, one can speculate that certain formal properties of the predictability of language might be useful in deriving an algorithm. At the English word level, some effort has been made to exploit measures of entropy as developed by Shannon in this direction (6, 7). However, there is good reason to question whether entropy, at least when measured in the usual way, is strongly correlated with actually occurring errors ( 15). As an alternative, one can study existing lists of personal-name equiva- lence classes to derive such algorithms and then test the algorithm against such classes, measuring both the degree of over-identification and the de- gree of under-identification. Clearly, such tests will carry more weight if they are conducted under economic forcing conditions where weaknesses in the test set will lead to real and measurable expense to the organization publishing the list. The SABRE system operates under strong economic forcing conditions in the sense that airline passengers frequently have a number of competitive alternatives available to them and lost reservations can cause sufficient inconvenience for them to consider these alternatives. However, the main application of the SABRE system is to rather small groups of persons (at least when compared to the number of personal authors in a typical library catalog), so that errors of over-identification are essentially trivial in cost to the airlines. A readily available source of "see-also"-type equivalence classes of proper names is given in the telephone directory system. Here, the eco- nomic forcing system is not so strong as in the airline situation, but it is measurable in that failure to provide an adequate list will lead to increased user dependence on the Information Operator, with consequent increased cost to the telephone company. As a test of the feasibility of using such a set of equivalence classes, the 451 classes found in the Palo Alto-Los Altos (California) telephone directory were copied out by hand and used in deriving and testing the algorithm given in the next section and the SOUNDEX algorithm. There remains the question of deciding what is to constitute proper agreement between any algorithm and the set of equivalence classes chosen as a data base. At the grossest level it seems reasonable to argue that over- identification is less serious than under-identification. False drops only tend to clog the line. Lost reference points, on the other hand, lead to lost in- formation. Investigation of other applications of linguistic algorithms, such as algorithms to hyphenate words, identify semantically similar words through cutting off of suffixes, and so forth, indicates that it is usually Algorithm for Name CompressionjDOLBY 261 possible to reduce crucial error (in this case under-identification) to some- thing under 5%, while preserving something in the order of 80% of the original distinctions (or efficiency) of the system. Efforts to improve materially on the "five-and-eighty" rule generally lead to solutions involv- ing larger context and/or extensive exception dictionaries. In this study efforts are directed at achieving a "five-and-eighty" solution. A VARIABLE-LENGTH NAME-COMPRESSION SCHEME In light of the fact that no definitive information is available on the problems of truncating errors in name-compression algorithms, it is con- venient to break the problem into two pieces. First is derivation of a variable-length algorithm of the required accuracy and efficiency and then determination of the errors induced by truncation. A studying of the set of equivalence classes given in the Palo Alto-Los Altos telephone directory made fairly clear that with minor modifications of the basic five steps used in the other algorithms noted above, it would not be too difficult to provide a reasonably accurate match without requir- ing too much over-identification. The main modifications made consisted of maintaining the position of the first vowel and using local context to make transformations on the consonants. The algorithm is given below. (The rules given must be applied in the order given both with respect to the rules themselves and to the order of the lists within the rules, as the precedence relations are important to the performance of the algorithm.) A Spelling Equivalent Abbreviation Algorithm For Personal Names 1) Transform: "MeG" to "Mk", "Mag" to "Mk", "Mac" to "Mk", "Me" to "Mk". 2) Working from the right, recursively delete the second letter from th f II . I tt · "dt" "ld" " d" " t" " " " d" " t" " '' e o owmg e er parrs: , , n , n , rc , r , r , sc , "sk", "st''. 3) T f ,, , t "k ,, (( , t 1.( , " ., t " ., " , t " ,~ ,, rans orm: x o s , ce o se , c1 o s1 , cy o sy , con- sonant-ch" to "consonant-sh"; all other occurrences of "c" to "k", "z" to "s", "wr" to "r", "dg" to "g", "qu" to "k'', "t" to "d", "ph" to "f' (after the first letter). 4) Delete all consonants other than "1", "n", and Y' which precede the letter "k" (after the first letter). 5) Delete one letter from any doubled consonant. 6) Transform "pf#" to "p#", "#pf" to "#f", "vowel-gh#" to "vowel-£#", "consonant-gh" to "consonant-g#", and delete all other occurrences of "gh". ("#"is the word-beginning and word-ending marker.) 7) Replace the first vowel in the name by the symbol "•". 8) Delete all remaining vowels. 9) Delete all occurrences of "w" or "h" after the first letter in the word. The vowels are taken to be (A, E, I, 0, U, Y) . The remaining literal characters are treated as consonants. 262 Journal of Library Automation Vol. 3/4 December, 1970 The algorithm splits 22 ( 4.9%) of the 451 equivalence classes given by the phone directory. On the other hand, the algorithm provides 349 dis- tinct classes (not counting those classes that were broken off in error) or 77.4% of the 451 classes in the telephone directory data base. Thus has been achieved a reasonable approximation to the "five-and-eighty" per- formance found in other linguistic problem areas. To give a proper appreciation of the nature of these underidentification errors, they are discussed below individually. 1) The name Bryer is put in the same equivalence class with a variety of spellings of the name Bear. The algorithm fails to make this identification. 2) Blagburn is not equated to Blackburn. 3) The name Davison is equated to Davidson in its various forms. The algorithm fails to make this identification and this appears to be one of a modest class of difficulties that occur prior to the -son, -sen names. 4) The class of names Dickinson, Dickerson, Dickison, and Dickenson are all equated by the directory but kept separate, except for the two forms of Dickinson, by the algorithm. 5) The name Holm is not equated with the name Home. 6) The name Holmes is not equated with the name Homes. 7) The algorithm fails to equate Jaeger with two forms of Yaeger. 8) The algorithm fails to equate Lamb with Lamn. 9) The algorithm incorrectly assumes that the final "gh" of Leigh should be treated as an "f." Treating final "gh" either as a null sound or an "f' leads to about the same number of errors in either direction. 10) The algorithm fails on the pairing of Leicester and Lester. The difficulty is an intervening vowel. 11) The algorithm fails to equate the various forms of Lindsay with the forms of Lindsley. 12) The algorithm fails to equate the various forms of McLaughlin with McLachlan. 13) The algorithm fails to equate McCullogh with McCullah. This is again the final "gh" problem. 14) The algorithm fails to equate McCue with McHugh (again the final "gh" problem) . 15) The algorithm fails to equate Moretton with Morton. This is an intervening vowel problem. 16) The algorithm fails to equate Rauch with Roush. 17) The algorithm fails to equate Robinson with Robison (another -son type problem). 18) The algorithm incorrectly assumes that the interior "ph" of Shep- herd is an "£." 19) The algorithm fails to equate Speer with Speier. Algorithm for Name CompressionjDOLBY 263 20) The algorithm fails to equate Stevens with Stephens. 21) The algorithm fails to equate Stevenson with Stephenson. 22) The algorithm fails to equate the various forms of the word Thomp- son (an -son problem.) In several of the errors noted above it may be questioned whether the telephone directory is following its own procedures with complete rigor. Setting these aside, the primary errors occur with the final "gh," the words ending in "son," and the words with the extraneous interior vowels. Each of these problems can be resolved to any desired degree of accuracy, but only at the expense of noticeable ·increases in the degree of complexity of the algorithm. THE TRUNCATION PROBLEM Simple truncation does not introduce errors of under-identification; it can only lead to further over-identification. Examination of the results of applying the algorithm to the telephone directory data base shows that no new over-identification is introduced if the compressed codes are all reduced to the leftmost seven characters. Further truncation leads to the following results: Code Length 7 6 5 4 Cumulative Over-Identification Losses 0 1 6 45 Thus there is a strong argument for maintaining at least five characters in the compressed code. However, there is no real need for restriction to simple truncation. Following the procedures used in the ARF system, further truncation can be obtained by selectively removing some of the remaining characters. The natural candidate for such removal is the vowel marker. If the vowel marker is removed from all the five character codes, only six more over- identification errors are introduced. Removal of the vowel markers from all of the codes would have introduced 17 more errors of over-identification. The utility of the vowel marker is in the short codes. This in turn suggests that introduction of a second vowel marker in the very short codes may have some utility, and this is indeed the case. If the conception of vowel marker is generalized as marking the position of a vowel-string (i.e., a string of consecutive vowels), where for these purposes a vowel is any of the characters (A, E, I, 0, U, Y, H, W), and these markers are main- tained as "padding" in the very short words, 18 errors of over-identification are eliminated at the cost of two new errors of under-identification. In this way the following modification to the variable length algorithm is derived: 1) Mark the position of each of the first two vowel strings with an "o ," if there is more than one vowel. 264 Journal of Library Automation Vol. 3/4 December, 1970 2) Truncate to six characters. 3) If the six-character code has two vowel markers, remove the right- hand vowel marker. Otherwise, truncate the sixth character. 4) If the resulting five-character code has a vowel marker, remove it. Otherwise remove the fifth character. 5) For all codes having less than four characters in the variable-length fonn, pad to four characters by adding blanks to the right. Measured against the telephone directory data base, this fixed-length compression code provides 361 distinct classes (not counting improper class splits as separate classes) or 80% of the 451 given classes. Twenty- four ( 5.3 %) of the classes are improperly split. By way of comparison, the SOUND EX system improperly splits 135 classes ( 30%) and provides only 287 distinct classes (not counting improperly split classes), or 63.8% of the telephone directory data base. ACKNOWLEDGMENTS This research was carried out for the Institute of Library Research, University of California, under the sponsorship of the Office of Education, Research Grant No. OEG-1-7-071083-5068. The author would like to thank Ralph M. Shoffner and Kelley L. Cart- wright for suggesting the problem and for a number of useful comments on existing systems. Allan J. Humphrey was kind enough to program the variable-length version of the algorithm for test purposes. APPENDIX: CORPUS OF NAMES USED FOR ALGORITHM TEST A list of personal-name equivalence classes from the Palo Alto-Los Altos Telephone Directory is arranged according to the variable-length compres- sion code (with the vowel marked "•" treated as an "A" for ordering) . Names whose compressed codes do not match the one given in the first column (and hence represent weaknesses in the algorithm and/ or the directory groupings) are given in italics. A small number of directory entries that do not bear on the immediate problem have been deleted from the list : Bell's see also Bells; Co-op see also Co-operative; St. see also Saint; etc. 0 BL Abel, Abele, Abell, Able 0 BRMS Abrahams, Abrams 0 BRMSN Abrahamson, Abramson •D Eddy, Eddie 0 DMNS Edmonds, Edmunds 0 DMNSN Edmondson, Edmundson 0 DMS Adams, Addems 0 GN Eagen, Egan, Eggen 0 GR Jaeger, Yaeger, Yeager °KN Aiken, Aikin, Aitken °KNS Adkins, Akins °KR OKR ·Ks 0 LBRD ·Ln 0 LN 0 LSN 0 LVR •Ms 0 NGL 0 NL 0 NRS 0 NRSN •Ns 0 RKSN 0 RL 0 RN •RNs •Rs 0 RVN 0 RVNG 0 SBRN B•n B•ns B°KMN B0 L B0 L B0 L B0 L B.L B 0 LN B·M B 0 MN B•N B0 ND B·R B0 R B•R B•R B 0 RBR B•Rc B 0 RGR B 0 RK B 0 RN Algorithm for Name CompressionjDOLBY 265 Acker, Aker Eckard, Eckardt, Eckart, Eckert, Eckhardt Oakes, Oaks, Ochs Albright, Allbright Elliot, Elliott Allan, Allen, Allyn Ohlsen, Olesen, Olsen, Olson, Olsson Oliveira, Olivera, Olivero Ames, Eames Engel, Engle, Ingle O'Neal, O'Neil, O'Neill Andrews, Andrus Andersen, Anderson, Andreasen Ennis, Enos Enrichsen, Erickson, Ericson, Ericsson, Eriksen Earley, Early Erwin, Irwin Aarons, Ahrends, Ahrens, Arens, Arentz, Arons Ayers, Ayres Ervin, Ervine, Irvin, Irvine Erving, Irving Osborn, Osborne, Osbourne, Osburn Beatie, Beattie, Beatty, Beaty, Beedie Betts, Betz Bachman, Bachmann, Backman Bailey, Baillie, Bailly, Baily, Bayley Beal, Beale, Beall, Biehl Belew, Ballou, Bellew Buhl, Buell Belle, Bell Bolton, Boulton Baum, Bohm, Bohme Bauman, Bowman Bain, Bane, Bayne Bennet, Bennett Baer, Bahr, Baier, Bair, Bare, Bear, Beare, Behr, Beier, Bier, Bryer Barry, Beare, Beery, Berry Bauer, Baur, Bower Bird, Burd, Byrd Barbour, Barber Berg, Bergh, Burge Berger, Burger Boerke, Birk, Bourke, Burk, Burke Burn, Byrne 266 Journal of Library Automation Vol. 3/4 December, 1970 B 0 RNR B 0 RNS B 0 RNSN B0 RS BL°KBRN BL 0 M BR 0 D BR 0 N BR 0 N D 0 DS D°F D 0 GN D°K n•KNSN n•KsN n•L n•L n•L D 0 MN n•N n•N n•N n•N n•N D0 NL D.R n•R D 0 RM D 0 VDSN n•vs DR0 SL F• F°FR F 0 GN F 0 L F 0 L F 0 LKNR F 0 LPS F 0 NGN F 0 NL F0 RL F 0 RR F 0 RR F 0 RS Bernard, Bernhard, Bernhardt, Bernhart Berns, Bims, Burns, Byrns, Byrnes Bernstein, Bornstein Bertsch, Birch, Burch Blackburn, Blagburn Blom, Bloom, Bluhm, Blum, Blume Brode, Brodie, Brody Braun, Brown, Browne Brand, Brandt, Brant Diezt, Ditz Duffie, Duffy Dougan, Dugan, Duggan Dickey, Dicke Dickenson, Dickerson, Dickinson, Dickison Dickson, Dixon, Dixson Dailey, Daily, Daley, Daly Dahl, Dahle, Dall, Doll Deahl, Deal, Diehl Diamond, Dimond, Dymond Dean, Deane, Deen Denney, Denny Donahoo, Donahue, Donoho, Donohoe, Donohoo, Donohue, Dunnahoo Downey, Downie Dunn, Dunne Donley, Donnelley, Donnelly Daugherty, Doherty, Dougherty Dyar, Dyer Derham, Durham Davidsen, Davidson, Davison Davies, Davis Driscoll, Driskell Fay, Fahay, Fahey Fifer, Pfeffer, Pfeiffer Fagan, Feigan, Fegan Feil; Pfeil Feld, Feldt, Felt Faulkner, Falconer Philips, Phillips Finnegan, Finnigan Finlay, Finley Farrell, Ferrell Ferrara, Ferreira, Ferriera Foerster, Forester, Forrester, Forster Forrest, Forest F 0 RS F 0 RS F 0 SR FL 0 N FL 0 NGN FR0 FR0 DMN FR0 DRKSN FR°K FR0 NS FR0 NS FR 0 S FR0 SR G0 D G0 DS G°F G0 L G0 LMR G0 LR G0 MS G0 NR G 0 NSLS G0 NSLVS G0 RD c•Rn G 0 RN G 0 RNR c•RR G 0 S GR 0 GR.FD GR0 N GR•s H•n H°F H°FMN H0 G H 0 GN H°K H°KSN H 0 L H•L H•L H0 L H 0 LD Algorithm for Name CompressionjDOLBY 267 Faris, Farriss, Ferris, Ferriss First, Fuerst, Furst Fischer, Fisher Flinn, Flynn Flanagan, Flanigan, Flannigan Frei, Frey, Fry, Frye Freedman, Friedman Frederickson, Frederiksen, Fredickson, Fredriksson Franck, Frank France, Frantz, Franz Frances, Francis Freeze, Freese, Fries Fraser, Frasier, Frazer, Frazier Good, Goode Getz, Goetz, Goetze Goff, Gough Gold, Goold, Gould Gilmer, Gilmore, Gilmour Gallagher, Gallaher, Galleher Gomes, Gomez Guenther, Gunther Gonzales, Gonzalez Consalves, Gonzalves Garratt, Garrett Garrity, Geraghty, Geraty, Gerrity Gorden, Gordohn, Gordon Gardiner, Gardner, Gartner Garrard, Gerard, Gerrard, Girard Gauss, Goss Gray, Grey Griffeth, Griffith Green, Greene Gros, Grose, Gross Hyde, Heidt Hoff, Hough, Huff Hoffman, Hoffmann, Hofman, Hofmann, Huffman Hoag, Hoge, Hogue Hagan, Hagen Hauch, Hauck, Hauk, Hauke Hutcheson, Hutchison Holley, Holly Holl, Hall Halley, Haley Haile, Hale Holiday, Halliday, Holladay, Holliday I 268 Journal of Libra1·y Automation Vol. 3/4 December, 1970 H 0 LG H 0 LM H 0 LMS H 0 LN H0 M H 0 MR H 0 N H 0 N H0 NN H 0 NRKS H 0 NRKSN H0 NS H0 NS I-JONSN H 0 R H 0 R H 0 R H 0 R H 0 RMN H 0 RMN H 0 RMN H0 RN H 0 RN H 0 RN H 0 RNGDN H 0 S H 0 S H 0 S H 0 SN H 0 VR r tFR rFRS tKB rKBsN rKs rL rMs rMSN rNsN rs Ko K°F K°FMN Helwig, Hellwig Holm, Home Holmes, Homes Highland, Hyland Ham, Hamm Hammar, Hammer Hanna, Hannah Hahn, Hahne, Harm, Haun Hanan, Hannan, Hannon Hendricks, Hendrix, Henriques Hendrickson, Henriksen, Henrikson Heintz, Heinz, Heinze, Hindes, Hinds, Hines, Hinze Haines, Haynes Henson, Hansen, Hanson, Hanssen, Hansson, Hanszen Herd, Heard, Hird, Hurd Hart, Hardt, Harte, Heart Hare, Hair Hardey, Hardie, Hardy Hartman, Hardmen, Hardman, Hartmann Herman, Hermann, Herrmann Harman, Harmon Heron, Herrin, Herron Hardin, Harden Hom, Horne Herrington, Harrington Haas, Haase, Hasse Howes, House, Howse Hays, Hayes Houston, Huston Hoover, Hover Jew, Jue Jeffery, Jeffrey Jefferies, Jefferis, Jefferys, Jeffreys Jacobi, Jacoby Jacobsen, Jacobson, Jackobsen Jacques, Jacks, Jaques Jewell, Juhl Jaimes, James Jameson, Jamieson, Jamison Jahnsen, Jansen, Jansohn, Janssen, Jansson, Janzen, Jensen, Jenson Joice, Joyce Kay, Kaye Coffee, Coffey Coffman, Kauffman, Kaufman, Kaufmann K°K K0 L K0 L K0 LMN K0 LR K0 MBRLN K 0 MBS K0 MP K0 MPS K0 N K0 N K0 N K0 N K0 N K0 N K0 N K 0 NL K 0 NR K0 NS K0 P K0 PL K0 R K0 R K0 R K0 R K0 R K 0 RD K0 RLN K 0 RN K0 RSNR K0 S K0 S K0 S K0 SL K0 SLR K 0 SR KL 0 N KL.,RK KL 0 SN KR 0 KR 0 GR KR.,MR KR 0 N KR 0 S KR 0 S Algor·ithm. for Name CompressionfDOLBY 269 Cook, Cooke, Koch, Koche Cole, Kohl, Koll Kelley, Kelly Coleman, Cohnan Koehler, Koeller, Kohler, Koller Chamberlain, Chamberlin Combs, Coombes, Coombs Camp, Kampe, Kampf Campos, Campus Cahn, Conn, Kahn Cahen, Cain, Caine, Cane, Kain, Kane Chin, Chinn Chaney, Cheney Coen, Cohan, Cohen, Cohn, Cone, Koehn, Kahn Coon, Kuhn, Kuhne Kenney, Kenny, Kinney Conley, Conly, Connelly, Connolly Conner, Connor Coons, Koontz, Kuhns, Kuns, Kuntz, Kunz Coop, Co-op, Coope, Coupe, Koop Chapel, Chapell, Chappel, Chappell, Chappelle, Chapple Carrie, Carey, Cary Corey, Cory Carr, Kar, Karr Kurtz, Kurz Kehr, Ker, Kerr Cartwright, Cortright Carleton, Carlton Carney, Cerney, Kearney Kirschner, Kirchner Chace, Chase Cass, Kass Kees, Keyes, Keys Cassel, Cassell, Castle Kesler, Kessler, Kestler Kaiser, Kayser, Keizer, Keyser, Kieser, Kiser, Kizer Cline, Klein, Kleine, Kline Clark, Clarke Claussen, Clausen, Clawson, Closson Crow, Crowe Krieger, Kroeger, Krueger, Kruger Creamer, Cramer, Kraemer, Kl·amer, Kremer Craine, Crane Christie, Christy, Kristee Crouss, Kraus, Krausch, Krause, Krouse 270 Journal of Library Automation Vol. 3/4 December, 1970 KR 0 S KR 0 S KR 0 SNSN Lo Lo L 0 D L 0 DL L 0 DRMN L°K L°KS L 0 LN L 0 LR L 0 MB L 0 MN L 0 MN L0 N L0 N L0 N L0 N L 0 NG L 0 NN L 0 NS L 0 R L 0 RNS L 0 RNS L 0 RSN L 0 S L 0 S L 0 SR L0 V L 0 VD L 0 VL L 0 VN M 0 D M 0 DN M0 DS M 0 DSN M°KL M°KM M°KS M°KS M 0 LN M 0 LN M 0 LR M 0 LR Cross, Krost Crews, Cruz, Kruse Christensen, Christiansen, Christianson Loe, Loewe, Low, Lowe Lea, Lee, Leigh Lloyd, Loyd Litle, Littell, Little, Lytle Ledterman, Letterman Leach, Leech, Leitch Lucas, Lukas Laughlin, Loughlin Lawler, Lawlor Lamb, Lamm Lemen, Lemmon, Lemon Layman, Lehman, Lehmann Lind, Lynd, Lynde Lion, Lyon Lin, Linn, Lynn, Lynne Lain, Laine, Laing, Lane, Layne Lang, Lange London, Lundin Lindsay, Lindsey, Lindsley, Linsley Lawry, Lowery, Lowrey, Lowry Lawrence, Lowrance Laurence, Lawrance, Lawrence, Lorence, Lorenz Larsen, Larson Lewis, Louis, Luis, Luiz Lacey, Lacy Leicester, Lester Levey, Levi, Levy Leavett, Leavitt, Levit Lavell, Lavelle, Leavelle, Loveall, Lovell Lavin, Levin, Levine Mead, Meade M oretton, Morton Mathews, Matthews Madison, Madsen, Matson, Matteson, Mattison, Mattson Michael, Michel Meacham, Mechem Marques, Marquez, Marquis, Marquiss Marcks, Marks, Marx Maloney, Moloney, Molony Mullan, Mullen, Mullin Mallery, Mallory Moeller, Moller, Mueller, Muller M0 LR M 0 LS M 0 N M0 NR M0 NR M0 NSN M 0 R M 0 R M0 R M0 R M0 R M0 RF M0 RL M 0 RN M 0 RS M0 RS MK0 MK0 MK0 MK 0 MK 0 L MK 0 LF MK 0 LM MK 0 N MK 0 NR MK 0 NS MK0 NS MK0 R MK0 R MKD 0 NL MKF 0 RLN MKF 0 RSN MKL 0 D MKL 0 KLN MKL 0 LN MKL 0 N MKL•N MKL 0 S MKM 0 LN MKN°L MKR•o N°KL N°KLS N°KLS Algorithm for Name CompressionjDOLBY 271 Millar, Miller Miles, Myles Mahan, Mann Miner, Minor Monroe, Munro Monson, Munson Murray, Murrey Maher, Maier, Mayer Mohr, Moor, Moore Meyers, Myers Meier, Meyer, Mieir, Myhre Murphey, Murphy Merrell, Merrill Marten, Martin, Martine, Martyn Meyers, Myers Maurice, Morris, Morse McCoy, McCaughey Magee, McGee, McGehee, McGhie Mackey, MacKay, Mackie, McKay McCue, McHugh Magill, McGill McCollough, McCullah, McCullough McCallum, McCollum, McColm McKenney, McKinney Macintyre, McEntire, Mcintire, Mcintyre MacKenzie, McKenzie Maginnis, McGinnis, McGuinness, Mcinnes, Mcinnis Maguire, McGuire McCarthy, McCarty MacDonald, McDonald, McDonnell MacFarland, MacFarlane, McFarland, McFarlane MacPherson, McPherson MacLeod, McCloud, McLeod MacLachlan, Maclachlin, McLachlan, McLaughlin, McLoughlin McClellan, McClelland, McLellan McClain, McClaine, McLain, McLane MacLean, McClean, McLean McCloskey, McClosky, McCluskey MacMillan, McMillan, McMillin MacNeal, McNeal, McNeil, McNeill Magrath, McGrath Nichol, Nicholl, Nickel, Nickle, Nicol, Nicoll Nicholls, Nichols, Nickels, Nickles, Nicols Nicholas, Nicolas 272 Journal of Library Automation Vol. 3/4 D ecember, 1970 N°KLSN N°KSN N°L N°LSN N°MN N°RS N°SBD p•n P 0 DRSN p•c P 0 LK P0 LSN p•N p•R p•R P0 RK P 0 RKS p•Rs r•Rs p•Rs P 0 RSN PR°KR PR 0 NS PR 0 R R• R• R 0 BNSN R•n R•n R 0 D R 0 DR R•ns R 0 GN R•GR R°K R°K R°KR n•L R0 MNGTN R0 MR n•Ms n•N R0 NR R•s Nicholsen, Nicholson, Nicolaisen, Nicolson Nickson, Nixon Neal, Neale, Neall, Neel, Neil, Neill Neilsen, Neilson, Nelsen, Nelson, Nielsen, Nielson, Nilson, Nilssen, Nilsson Neumann, Newman Norris, Nourse Nesbit, Nesbitt, Nisbet Pettee, Petty Peterson, Pederson, Pedersen, Petersen, Petterson Page, Paige Polak, Pollack, Pollak, Pollock Polson, Paulsen, Paulson, Poulsen, Poulsson Paine, Payn, Payne Parry, Perry Parr, Paar Park, Parke Parks, Parkes Pierce, Pearce, Peirce, Piers Parish, Parrish Paris, Parris Pierson, Pearson, Pehrson, Peirson Prichard, Pritchard Prince, Prinz Prior, Pryor Roe, Rowe Rae, Ray, Raye, Rea, Rey, Wray Robinson, Robison Rothe, Roth Rudd, Rood, Rude Reed, Read, Reade, Reid Rider, Ryder Rhoades, Rhoads, Rhodes Regan, Ragon, Reagan Rodgers, Rogers Richey, Ritchey, Ritchie Reich, Reiche Reichardt, Richert, Rickard Reilley, Reilly, Reilli, Riley Remington, Rimington Reamer, Reimer, Riemer, Rimmer Ramsay, Ramsey Rhein, Rhine, Ryan Reinhard, Reinhardt, Reinhart, Rhinehart, Rinehart Reas, Reece, Rees, Reese, Reis, Reiss, Ries R0 S R0 S R0 S R•vs s•BR S°FL s•FN S°FNS S°FNSN S°FR S°FR s•cL S 0 GLR s•K s•Ks s•L s•L s•LR s•Ls s•Lv s•LvR S 0 MKR S 0 MN S 0 MN s•MRS s·Ms s•N S 0 N S 0 NR S0 NRS S 0 PR s·R s·R s·R S 0 R S0 R s•RL S 0 RLNG s•RMN S0 RN s•RR sos SM 0 D Algorithm for Name CompressionjDOLBY 273 Rauch, Rausch, Roach, Roche, Roush Rush, Rusch Russ, Rus Reaves, Reeves Seibert, Siebert Schofield, Scofield Stefan, Steffan, Steffen, Stephan, Stephen Steffens, Stephens, Stevens Steffensen, Steffenson, Stephenson, Stevenson Schaefer, Schaeffer, Schafer, Schaffer, Schafer, Shaffer, Sheaffer Stauffer, Stouffer Siegal, Sigal Sigler, Ziegler Schuck, Shuck Sachs, Sacks, Saks, Sax, Saxe Seeley, Seely, Seley Schell, Shell Schuler, Schuller Schultz, Schultze, Schulz, Schulze, Shults, Shultz Silva, Sylva Silveira, Silvera, Silveria Schomaker, Schumacher, Schumaker, Shoemaker, Shumaker Simon, Symon Seaman, Seemann, Semon Somers, Sommars, Sommers, Summers Simms, Sims Stein, Stine Sweeney, Sweeny, Sweney Senter, Center Sanders, Saunders Shepard, Shephard, Shepheard, Shepherd, Sheppard Stahr, Star, Starr Stewart, Stuart Storey, Story Saier, Sayre Schwartz, Schwarz, Schwarze, Swartz Schirle, Shirley Sterling, Stirling Scheuermann, Schurman, Sherman Stearn, Stem Scherer, Shearer, Sharer, Sherer, Sheerer Sousa, Souza Smith, Smyth, Smythe 274 Journal of Library Automation Vol. 3/4 December, 1970 SM 0 D SN°DR SN°L SP 0 LNG SP 0 R SP 0 R SR 0 DR SR0 DR T0 D T 0 MSN T0 RL TR 0 S v·L v·L v·R w•o W 0 DKR w·nL w·nMN W 0 DR W 0 DRS W 0 GNR W 0 L W 0 L W 0 L W 0 LBR W 0 LF W 0 LKNS W 0 LKS W 0 LN W 0 LR W 0 LRS W 0 LS W 0 LS W 0 LS W 0 LSN W 0 N W 0 R W 0 R W 0 RL W 0 RNR W 0 S w·sMN Schmid, Schmidt, Schmit, Schmitt, Smit Schneider, Schnieder, Snaider, Snider, Snyder Schnell, Snell Spalding, Spaulding Spear, Speer, Speirer Spears, Speers Schroder, Schroeder, Schroeter Schrader, Shrader Tait, Tate Thomason, Thompson, Thomsen, Thomson, Tomson Terrel, Terrell, Terrill Tracey, Tracy Vail, Vaile, Vale Valley, Valle Vieira, Vierra White, Wight Whitacre, Whitaker, Whiteaker, Whittaker Whiteley, Whitley Whitman, Wittman Woodard, Woodward Waters, Watters Wagener, Waggener, Wagoner, Wagner, Wegner, Waggoner Willey, Willi Wiley, Wylie Wahl, Wall Wilber, Wilbur Wolf, Wolfe, Wolff, Woolf, Woulfe, Wulf, Wulff Wilkens, Wilkins Wilkes, Wilks Whalen, Whelan Walter, Walther, Wolter Walters, Walthers, Wolters Wallace, Wallis Welch, Welsh Welles, Wells Willson, Wilson Winn, Wynn, Wynne Worth, Wirth Ware, Wear, Weir, Wier Wehrle, Wehrlie, Werle, Worley Warner, Werner Weis, Weiss, Wiese, Wise, Wyss Weismann, Weissman, Weseman, Wiseman, Wismonn, Wissman Algorithm for Name CompressionjDOLBY 275 REFERENCES 1. Cox, N.S.M.; Dolby, J. L.: "Structured Linguistic Data and the Automatic Detection of Errors." In Advances in Computer Type- setting (London: Institute of Printing, 1966), pp. 122-125. 2. Cox, N.S.M.; Dews, J. D.; Dolby, J. L.,: The Computer and the Library (Hamden, Conn.: Archon Press, 1967). 3. Dolby, J. L.; Forsyth, V. J.; Resnikoff, H. L.: Computerized Library Catalogs: Their Growth, Cost and Utility (Cambridge, Massachu- setts: The M.I.T. Press, 1969) . 4. Becker, Joseph; Hayes, Robert M. : Information Storage and Re- trieval (New York: Wiley, 1963 ), p. 143. 5. Davidson, Leon: "Retrieval of Misspelled Names in Airlines Pas- senger Record System," Communications of the ACM, 5 (1962), 169-171. 6. Blair, C. R.: "A Program for Correcting Spelling Errors," Informa- tion & Control, 3 ( 1960), 60-67. 7. Schwartz, E. S.: An Adaptive Information Transmission System Employing Minimum Redundancy Word Codes (Armour Research Foundation Report, April 1962). (AD 274-135). 8. Bourne, C. P.; Ford, D.: "A Study of Methods for Systematically Abbreviating English Words and Names," Journal of the ACM, 8 ( 1961), 538-552. 9. Kessler, M. M., "The "On-Line" Technical Information System at M.I.T.", in 1967 IEEE International Convention Record. (New York: Institute of Electrical and Electronic Engineers, 1967), pp. 40-43. 10. Kilgour, F. G.: "Retrieval of Single Entries from a Computerized Library Catalog File," American Society for Information Science, Proceedings, 5 ( 1968), 133-136. 11. Nugent, W. R.: "Compression Word Coding Techniques for In- formation Retrieval," Journal of Library Automation, 1 (December 1968), 250-260. 12. Rothrock, H. I.: Computer-Assisted Directory Search; A Dissertation in Electrical Engineering. (Philadelphia: University of Pennsylvania, 1968). 13. Ruecking, F. H.: "Bibliographic Retrieval from Bibliographic In- put; The Hypothesis and construction of a Test," Journal of Library Automation, 1 (December 1968), 227-238. 14. Tukey, J. W.: A Tagging System for Journal Articles and Other Citable Items: A Status Report (Princeton, N.J.: Statistical Tech- niques Research Group, Princeton University, 1963). 15. Resnikoff, H. L.; Dolby, J. L.: A Proposal to Construct a Linguistic and Statistical Programming System, (Los Altos, Cal.: R & D Con- sultants Company, 1967).