COMPRESSION WORD CODING TECHNIQUES FOR INFORMATION RETRIEVAL

William R. NUGENT: Vice President, Inforonics, Inc., Cambridge, Massachusetts

A description and comparison is presented of four compression techniques for word coding having application to information retrieval. The emphasis is on codes useful in creating directories to large data files. It is further shown how differing application objectives lead to differing measures of optimality for codes, though compression may be a common quality.

INTRODUCTION

Cryptographic studies have documented much useful language data having application to retrieval coding. Because unclassified cryptographic studies are few, Fletcher Pratt's 1939 work (1) remains the classic in its field. Gaines (2) has the virtue of being in print, and the more recent cryptographic history of Kahn (3), while comprehensive, lacks the statistical data that made the earlier works valuable. The word coding problem for language processing, as opposed to cryptography, has been extensively studied by Nugent and Vegh (4). Information theorists have contributed the greatest volume of literature on coding and have added to its mathematical basis, largely from the standpoint of communications and error avoidance.

A brief discussion of compression codes and their objectives is presented here, followed by a description of a project encompassing four compression codes having application to retrieval directories. Two of the codings are newly devised. One is Transition Distance Coding, a randomizing code that results in short codes of high resolving power. The second is Alphacheck, which combines high readability with good resolution and permits simple truncation by applying a randomized check character that acts as a surrogate of the omitted portion. It appears to have the greatest potential, in directory applications, of the codes considered here.
Recursive Decomposition is a selected letter code devised by the author several years ago (4). It has been tested and has the advantages of simple derivation and high resolution. Soundex (5) is the only compression code that has achieved wide usage. It was devised at Remington Rand for name matching under conditions of uncertain spelling.

OBJECTIVES OF COMPRESSION CODING

It is desired to transform sets of variable length words into fixed length codes that will maximally preserve word-to-word discrimination. In the final directories to be used, the codes for several elements will be accessible to enable the matching of several factors before a file record is selected. The separate codes for differing factors need not be the same length, though each type of code will be of uniform length; nor need the codes for differing factors be derived by the same process.

What we loosely call codes must be formally designated ciphers. That is, they must be derivable from the data words themselves, and not require "code books" to determine equivalences. This is so because the file directories must be derivable from file items, entries in directory form must be derivable from an input query, and these two directory items must match when a record is to be extracted. The ciphers need not be decipherable for the application under consideration, and in general are not. Fixed length codes, which provide the rough equivalent and simplicity of a margin entry in a paper directory, are generally desirable for machine directories.

The functions of the codes will determine their form, and a code or file key designed to meet one objective will generally not be satisfactory for any other objective. The following typical objectives serve as four examples:

(1) Create a file key for extraction of records in approximate file order, as is required for the common Sorting and Printout Problem. A typical code construction rule is to take the first six letters.

JOHNSEN → JOHNSE
JOHNSON → JOHNSO
JOHNSTON → JOHNST
JOHNSTONE → JOHNST

Journal of Library Automation, Vol. 1/4, December 1968

(2) Create a file key for extraction of records under conditions of uncertainty of spelling (airline reservation problem). A typical code construction rule is Vowel Elimination or Soundex. A typical matching rule is best match.

Vowel Elimination        Soundex
JOHNSEN → JHNSN          J525 → J52
JOHNSON → JHNSN          J525 → J52
JOHNSTON → JHNSTN        J5235 → J52
JOHNSTONE → JHNSTN       J5235 → J52

(3) Create a file key for extraction of records from accurate input, with the objective of maximum discrimination of similar entries (cataloging search problem). Typical code construction rules are Recursive Decomposition Coding or Transition Distance Coding.

Recursive Decomposition      Transition Distance
JOHNSEN → JHNSEN             BFTZ
JOHNSON → JHNSON             DNWU
JOHNSTON → JHSTON            ZIKY
JOHNSTONE → JHSONE           ECRC

For the file keys of primary concern, accurate input data is assumed and the objective is maximum discrimination. Desirably, a code would be as discriminating as Transition Distance Coding and as readable as truncation coding. This can be achieved to some degree by combining the two codes into one, with an initial portion truncated and a final check character representing the remainder via a compressed Transition Distance Code: Alphacheck.

(4) Create a file key for human readability and high word-to-word discrimination. Possible code construction rules are Alphacheck, and simple truncation plus a terminal check character.

JOHNSEN → JOHNSV
JOHNSON → JOHNSX
JOHNSTON → JOHNSD
JOHNSTONE → JOHNSS

METHODS

The algorithms for creating the preceding codes are described in the following sections. It is axiomatic that randomizing codes give the greatest possible discrimination for a given code space.
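The two construction rules of objective (2) can be sketched in a few lines. The Soundex shown is the common variant (first letter kept; remaining letters mapped to digit classes; adjacent repeated digits collapsed; vowels, H, W, and Y dropped). The exact Remington Rand rules may differ in minor details, but this variant reproduces the codes in the example, and truncating its output to three characters gives the best-match keys (J52).

```python
# Digit classes of the common Soundex variant.
SOUNDEX_MAP = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

def vowel_elimination(word):
    """Drop the vowels A, E, I, O, U; keep the remaining letters in order."""
    return "".join(c for c in word.upper() if c not in "AEIOU")

def soundex(word):
    """First letter, then digit classes of the rest, adjacent repeats collapsed."""
    word = word.upper()
    digits = [SOUNDEX_MAP.get(c, "") for c in word]  # vowels/H/W/Y map to ""
    out, prev = [], digits[0]
    for d in digits[1:]:
        if d and d != prev:
            out.append(d)
        prev = d
    return word[0] + "".join(out)

print(vowel_elimination("JOHNSTON"))  # JHNSTN
print(soundex("JOHNSTON"))           # J5235
print(soundex("JOHNSTON")[:3])       # J52
```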
The whole trick of creating a good compression code is to eliminate the natural redundancy of English orthography and preserve discrimination in a smaller word size. Letter-selection codes can only half accomplish this, due to the skewed distribution of letter usage. They can eliminate the higher-frequency components, but they cannot increase the use of the lower-frequency components. Randomizing codes, often called "hash" codes but properly quasi-random codes, can equalize letter usage and hence make best use of the code space. Prime examples here are the variants of Gödel coding devised by Vegh (4), in which the principle of obtaining uniqueness via the products of unrepeated primes is exploited, as it is in the randomizing codes considered here.

The problem in the design of a randomizing code is that the results can be skewed rather than uniformly distributed, due to the skewed nature of the letters and letter sequences that the codes operate on. In Transition Distance Coding, the natural bias of letters and letter sequences is overcome by operating on a word parameter that is itself semi-random in nature. The following principle, not quite a theorem, applies: "Considering letters in their normal ordinal alphabetic position, and considering letter transitions to be unidirectional and cyclic, the distribution of transition distances in English words is essentially uniform."

In view of the fact that letter usage has an extremely skewed distribution, with a probability ratio in excess of 170 to one for the extremes, it is seen that the more uniform parameter of transition distances is a superior one for achieving randomized codes.
The relative uniformity of transition distance needs further investigation, but one typical letter digram sample from Gaines (2) with 9999 transitions (mean number of occurrences of each distance = 385) yielded a mean deviation of 99 and a standard deviation of 123, and an extreme probability ratio of 3.3 to one for the different transition distances from 0 to 25. The distribution can be made more uniform by letter permutation. Permutation is used in the algorithm for Transition Distance Coding but not in Alphacheck.

Algorithm

The method of Transition Distance Coding is used to operate on a variable length word to achieve fixed length alphabetic or alphanumeric codes that exhibit quasi-random properties. The code is formed from the modulo product of primes associated with the transition distances of permuted letters. The method is intended strictly for computer operation, as it is a simple program but an extremely tedious manual operation. There are five steps:

(1) Permute the characters of the natural language word. This breaks the digram dependency that could make the transition distances less uniformly distributed. This step might be dispensed with if the resulting distributions prove satisfactory without it. The permutation process consists of taking the middle letter (or the letter right of middle for words with an even number of letters), the first, the last, the second, the next-to-last, etc., until all letters have been used. That is, for a letter sequence:

a1, a2, ..., ai, ..., an

the following permutation is taken:

a(int(n/2)+1), a1, an, a2, an-1, ...
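Step (1) can be sketched as follows; the index order (middle letter, then alternately first, last, second, next-to-last, and so on) follows the description above:

```python
def permute(word):
    """Middle letter first (the letter right of middle for even length),
    then alternately from the front and the back:
    first, last, second, next-to-last, ... until all letters are used."""
    n = len(word)
    mid = n // 2                         # zero-based index of a_(int(n/2)+1)
    front = list(range(0, mid))          # a1 .. up to the middle
    back = list(range(n - 1, mid, -1))   # an .. down to just past the middle
    order = [mid]
    for f, b in zip(front, back):
        order += [f, b]
    # For even-length words one front index is left over; append it.
    order += front[len(back):] + back[len(front):]
    return "".join(word[i] for i in order)

print(permute("JOHNSON"))  # NJNOOHS
```

For JOHNSON (seven letters) the middle letter is the fourth, N, giving N, J, N, O, O, H, S; the transition distances are then taken over this permuted sequence rather than the original word.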