The Binary Vector as the Basis of an Inverted Index File Donald R. KING: Rutgers University, New Brunswick, New Jersey. 307 The inverted index file is a frequently used file structure for the storage of indexing information in a document retrieval system. This paper de- scribes a novel method for the computer storage of such an index. The method not only offers the possibility of reducing storage requirements fot an index but also affords more mpid processing of query statements ex- pressed in Boolean logic. INTRODUCTION The inverted index file is a frequently used file structure for the storage of indexing information in document retrieval systems. An inverted index file may be used by itself or with a direct file in a so-called combined file system. The inverted index file contains a logical record for each of the subject headings or index terms which may be used to describe documents in the system. Within each logical record there is a list of pointers to those documents which have been indexed by the subject heading in question. The individual pointers are usually in the form of document numbers stored in fixed-length digital form. Obviously, the length of the lists will vary from record to record. The purpose of this paper is the presentation of a new technique for the storage of the lists of pointers to documents. It will be shown that this technique not only reduces storage requirements, but that in many cases the time required to search the index is reduced. The technique is useful in systems which use Boolean searches. The relative merits of Boolean and weighted term searches are beyond the scope of this paper, as are the rela- tive merits of the various possible file structures. THE BINARY VECTOR AS A STORAGE DEVICE The exact form of each document pointer is immaterial to the user of a document retrieval system as long as he is able to obtain the document he desires. The standard form for these pointers in most automated systems is a document number. Note that each pointer is by itself a piece of infor- mation. However, if one thinks of a "peek-a-boo" system, the document 308 Journal of Library Automation Vol. 7/4 December 1974 pointer becomes simply a hole punched in a card. In this case the position of the pointer, not the pointer itself, conveys the information. The new technique presented in this paper is an extension of the "peek- a-boo" concept. A vector or string of binary zeroes is constructed equal in length to the number of documents expected in the system. The position of each vector element corresponds to a document number. That is, the first position in a vector corresponds to document number one and the tenth vector posi- tion corresponds to document number ten. A vector is constructed for each subject heading in the system. As a document enters the system, ones are in- serted in place of the zeroes in the positions corresponding to the new doc- ument number in the vectors for the subject headings used to describe the document. As an example, assume the following document descriptions are presented to a system using binary vectors: Document Number 1 2 3 Subject Headings A,B,D C,E A,C The binary vectors for terms A, B, C, D, and E before the insertion of the indexing data would be as follows: Subject Heading A B c D E Vector 000 ... 0 000 ... 0 000 ... 0 000 ... 0 ooo ... ·o After the insertion of the indexing information, the same vectors would appear as follows: Subject Heading A B c D E Vector 101 ... 0 100 ... 0 011 ... 0 100 ... 0 010 ... 0 The binary vector seems to have several advantages over the standard form of storage of document numbers in an inverted file. First, the rec- ords are of fixed length since the vectors are all equal in length to the ex- pected number of documents in the system. Space may be left at the end of each vector for the addition of new documents. Periodic copying of the file may be used to expand the index records with additional zeroes added at the end of each record during the process. Consequently, unless Binary Vector/KING 309 there are limitations of size imposed by the equipment, only one access to the storage device will be needed to retrieve the index record for a term. The second advantage offered by the binary vector method appears in the search process. Most modern computers have a built-in capability of per- forming Boolean logical manipulations on binary digit vectors or strings. Thus, when Boolean operations are specified as part of a query, the imple- mentation of the operations within the· computer is considerably easier and faster for binary vectors than for the standard form of inverted files. Other investigators of the use of the binary digit patterns or vectors have not fully explored its advantages and disadvantages. Bloom suggests, without an explanation or evaluation, the use of bit patterns as the storage technique for inverted files in large data bases in the area of management information systems.1 Davis and Lin, again in the area of management in- formation systems, propose bit patterns as the means of locating pertinent records in a master file. 2 They do not compare the method with other pos- sible techniques. Sammon discusses briefly the use of binary vectors as a storage technique, but dismisses it on the basis that the two-valued ap- proach obviates the possible assignment of weights to index terms in de- scribing documents. 3 Gorokhov discusses the use of a modified binary vec- tor approach in a document retrieval system implemented on a small Soviet computer.4 Faced with the need to minimize storage requirements for his inverted file, Gorokhov concentrated on developing a technique for lo- cating and removing strings of zeroes occurring in the binary vectors used within the system. Since these zeroes represent the absence of information they could be removed if there were a way to indicate the position in the original vector of the ones that remained. He proposed the removal of strings of zeroes and the inclusion of numeric place values with the re- maining vector elements. His result is a file with variable-length index rec- ords. The abandoning of the pure binary vector obviates the process, and Gorokhov found it necessary to expand the vector elements into the orig- inal vector before logical operations could be applied. Even though he does not state so explicitly, Gorokhov seems to have found his method more efficient than the standard inverted file. Gorokhov' s suggestion has led to the development of an algorithm for the compression of binary vectors. Heaps and Thiel have also discussed the use of compressed binary vec- tors as the basis of an inverted index file. 5• 6 Aside from a brief descrip- tion of the method for implementing the concept, they offer no compari- son of the binary vector with the standard inverted file. STORAGE REQUIREMENTS An immediate reaction to the concept of binary vectors is to state that they will obviously take more storage space than the standard inverted file. A closer study shows that this is not always the case. The storage require- ments for the two types of files may be calculated as follows: 310 Journal of Library Automation Vol. 7/4 December 1974 D·N 1. MBv = 8 bytes 2. Msr = D · I · K where: ( binary vector file) (standard inverted file) M = Storage requirements in bytes D = Number of documents in the system N = Number of index terms in the system I = Average depth of indexing in the system K = Size in bytes of a document number stored in the file Using equations 1 and 2 we find that the storage requirements for the binary vector file are, in fact, less than the requirements for the standard inverted file if N < 8 •] • K. It is well lmown that the distribution of the use of index terms follows a logarithmic curve. In simple terms, one might say that a few terms are used very frequently and many terms are used infrequently. This condi- tion implies that in a binary vector file the records for many terms will contain segments in which there are no "ones" in any byte. A method for re- moving these "zero" bytes is called compression. COMPRESSION ALGORITHM The technique for the compression of binary vectors as described here is designed specifically for the IBM 360 family of computers and similar machines. The extension to other machines should be obvious. Within the IBM 360 the byte, which contains eight binary digits, is the basic storage unit, and with the eight binary digits it is possible to store a maximum integer value of 255. For the purpose of describing a proposed compression algorithm for the binary vector in the IBM 360, the term sub- vector will be defined as a string of contiguous bytes chosen from within the binary vector. A zero subvector will be a subvector each of whose bytes contains eight binary zeroes. A nonzero subvecto1· will be a subvector each of whose bytes contains at least one binary one. To compress a binary vec- tor in the IBM 360 the following steps may be taken: 1. Divide the binary vector into a series of zero subvectors and nonzero subvectors. Subvectors of either type may have a maximum length of 255 bytes. For zero subvectors longer than 255 bytes, the 256th byte is to be treated as a nonzero byte, thus dividing the long zero subvector. 2. Each nonzero subvector is prefixed with two bytes. The first of the prefix bytes contains the count of zero bytes which precede the non- zero subvector in the uncompressed vector. The second prefix byte contains a count of the bytes in the nonzero subvector. 3. The compressed vector then consists of only the nonzero subvectors together with their prefix bytes. 4. A two byte field of binary zeroes will end the compressed vector. Binmy Vector/KING 311 The compression of the vectors creates variable-length records and re- moves the advantage of having records which are directly amenable to Boolean manipulation. The effect of file compression on such manipula- tion in the search process is not as severe as it might appear. For the search process, the compressed vector may be expanded into its original form. The process of expansion of the binary vectors is relatively simple, and since only those index term records which are used in a query need to be expanded at the search time, the search time is not significantly affected. As an example of the use of the compression algorithm consider the fol- lowing binary vector. 01100000/10000000/ seven zero bytes j00000001j10000000j ... The slashes indicate the division of the vector into bytes. The vector might be read as indicating the following list of document numbers: 2, 3, 9, 80, and 81. In a standard inverted file with each document number assigned three bytes of storage, fifteen bytes would be required to store these numbers. The compressed vector which results from the application of the algo- rithm is the following: 00000000j00000010j01100000/10000000j00000111/00000010/ 00000001/10000000/ ... Again the slashes separate the vector into bytes. For the purpose of the fol- lowing discussion consider each byte in a vector to be numbered sequential- ly beginning with byte one at the left. In the uncompressed vector bytes one and two form a nonzero subvector. Consequently, the first four bytes in the compressed vector can be inter- preted as follows: Byte one. Binary zero indicating that no zero bytes were re- moved preceding this subvector. Byte two. Binary two indicating that the following nonzero sub- vector is two bytes long. Bytes three, four. Bytes one and two of the original vector. Bytes three through nine of the original vector are a zero subvector, and bytes ten and eleven form a second nonzero subvector. Consequently, the second four bytes of the compressed vector are interpreted as follows: Byte five. Binary seven indicating that a zero subvector of seven bytes has been removed. Byte six. Binary two indicating that the following two bytes are a nonzero subvector. Bytes seven, eight. Bytes ten and eleven of the original vector. Thus the binary vector has been reduced from eleven bytes to eight 312 Journal of Library Automation Vol. 7/4 December 1974 bytes while the space required to record the document numbers in the stan- dard inverted file remains fifteen bytes. MEMORY REQUIREMENTS FOR THE STANDARD INVERTED FILE AND THE BINARY VECTOR FILE To compare memory requirements for the standard inverted file and the compressed binary vector file, we base our comparison on the total number of postings in the file. In the standard inverted file the storage space for the postings is equal to the number of postings times the length of a sin- gle posting, which is usually two, three, or five bytes. Memory requirements for the compressed binary vector file are more difficult to estimate because the distribution of document numbers within the record for each index term is not known. The fact that a single byte in the binary vector file may contain between zero and eight postings is extremely important. The worst possible case occurs if the postings in the binary vector are spaced in such a way that each nonzero byte contains only one posting, and these bytes are separated by zero bytes. Consider the following example: ... /00000000/00010000/00000000/00000100/ ... In this case the compression algorithm will remove the zero bytes, but will add two bytes (the prefix bytes) for each nonzero byte. The resulting com- pressed vector will be essentially the same length as the standard inverted file record if each posting is three bytes long in the standard inverted file. It might seem that the distribution of one posting per byte for the entire vector represents an even worse situation. It is clear that the compression algorithm will, in this case, not reduce the size of the vector. However, it must be remembered that in the standard inverted file each posting will re- quire at least two bytes and perhaps three bytes. Thus, the length of the record in the standard inverted file is two or three times longer than the corresponding binary vector regardless of compression. In data used in two model retrieval systems prepared to compare the standard inverted file and the binary vector file there are 6,121 documents with a total of 94,542 postings. An examination of the binary inverted file for the model systems discloses that there are only 55,311 nonzero bytes in the binary vector file. Thus there seems to be some form of clustering of the document numbers in each index term record. If each nonzero byte in this binary vector is isolated by zero bytes, two prefix bytes would be added for each byte. Thus the total memory requirements for the postings in the compressed file would be 165,933 bytes. Less storage space is required if some nonzero bytes are contiguous. On the other hand, the standard in- verted file will require 189,084 bytes if a two-byte posting is used, or 283,626 bytes if a three-byte posting is used. Further study of the cluster- ing phenomenon is needed. Binary Vector /KING 313 MODEL RETRIEVAL SYSTEMS To test some of the conjectures about the differences between the stan- dard inverted file and the binary vector file, two model systems were pre- pared for operation on an IBM 360/67. Details of the systems and PL/1 program listings are available elsewhere. 7 The data base used was obtained from the Institute of Animal Behavior at Rutgers University. In the data base 6,121 documents were indexed by 1,484 index terms. A total of 94,542 postings in the system gives an average depth of indexing of 15.4 terms per document. Both inverted files were stored on IBM 2314 disc storage devices. To ease the problem of handling variable-length records in both files the logical records for each index term were divided into chains of fixed~ lehgth physical records. For the standard inverted file a physical record size of 331 bytes was chosen. The entire file required 702,713 bytes including record overhead. For the uncompressed binary vector file a physical record size of 1,286 bytes was chosen to include overhead and space for up to 10,216 document numbers. When the compression algorithm was applied, with a physical record length of 130 bytes, the memory requirements for the binary vector file were reduced to 281,450 bytes, or 41 percent of the space required to store the standard inverted file. A series of forty searches of varying complexities were run against both files. The "TIME" function of PL/1 made it possible to accumulate tim- ing statistics which excluded input/output functions. Search times for the binary vector file include expansion of the compressed vectors, Boolean manipulation of the vectors, and conversion of the resultant vector into digital document numbers. The times for the standard inverted file are for the Boolean manipulation of the lists. The following points were noted in the analysis of the times: 1. In twenty-two of the forty queries for which comparative timings were obtained, the search of the binary vector file was faster, in one case by a factor of thirty-five. In the eighteen cases in which the search of the standard inverted file was faster, the search of the stan- dard inverted file was at most 6.17 times faster. 2. The range of the total times for the binary vector file was .79 seconds to 9.72 seconds. The range for searching the standard inverted file was .15 seconds to 202.98 seconds. The fact that the search times for the binary vector file are within a fairly narrow range, in contrast to the wider range of times for searching the standard inverted file, has im- portant implications for the design of an on-line interactive docu- ment retrieval system. In such a system it is important that the com- puter respond to users' requests not only rapidly but consistently. The narrower range of the search times provided by the binary vector file will assist in producing consistent times. 3. The search times for the binary vector file, exclusive of expansion and conversion times, are unaffected by the number of postings con- 314 Journal of Library Automation Vol. 7/4 December 1974 tained in the index terms used in a query. On the other hand, the number of postings in the records used from the standard inverted file appears to cause the differences in search times for that file. To test the conjectures! that 1. search times for the binary vector file are related to the number of index terms in the query, and 2. search times for the standard inverted file are related to the num- ber of postings in the index terms in the query, a correlation analysis was performed. The following correlation co- efficients were obtained: V a1'iables 1' Number of terms in query and search .960 times for the binary vector file. Number of postings in query terms and .979 search times for standard inverted file. The relationships indicated above are significant at the .001 level. No attempt was made to compute an average search time per term for the binary vector file or average search time per posting for the standard in- verted file. Such times would have meaning only for the model systems. SUMMARY The binary vector is suggested as an alternative to the usual method of storing document pointers in an inverted index file. The binary vector file can provide savings in storage space, search times, and programming effort. REFERENCES 1. Burton H. Bloom, "Some Techniques and Trade-Offs Affecting Large Data Base Retrieval Times," Proceedings of the ACM 24 ( 1969). 2. D. R. Davis and A. D. Lin, "Secondary Key Retrieval Using an IBM 7090-1310 System," Communications of the ACM 8:243-46 (April1965). 3. John W. Sammon, Some Mathematics of Information Storage and Retrieval (Tech- nical Report RADC-Tr-68-178 [Rome, New York: Rome Air Development Center, 1968]). 4. S. A. Gorokhov, "The 'Setka-3' Automated IRS on the 'Minsk-22' with the Use of the Socket Associative-Address Method of Organization of Information" (Paper presented at the All-Union Conference on Information Retrieval Systems and Auto- matic Processing of Scientific and Technical Information, Moscow, 1967. Translated and published as part of AD 697 687, National Technical Information Service). 5. H. S. Heaps and L. H. Thiel, "Optimum Procedures for Economic Information Re- trieval," Information Storage & Retrieval6:131-53 (1970). 6. L. H. Thiel and H. S. Heaps, "Program Design for Retrospective Searches on Large Data Bases," Information Storage & Retrieval8:1-20 (1972). 7. D. R. King, "An Inverted File Structure for an Interactive Document Retrieval System" (Ph.D. dissertation, Rutgers University, 1971).