METHODS OF RANDOMIZATION OF LARGE FILES WITH HIGH VOLATILITY 79 Patrick C. MITCHELL: Senior Programmer, Washington State University, Pullman, Washington, and Thomas K. BURGESS: Project Manager, Institute of Library Research, University of California, Los Angeles, California Key-to-address conversion algorithms which have been used for a large, direct access file are compared with respect to record density and access time. Cumulative distribution functions are plotted to demonstrate the distribution of addresses generated by each method. The long-standing practice of counting address collisions is shown to be less valuable in fudging algorithm effectiveness than considering the maximum number of contiguously occupied file locations. The random access disk file used by the Washington State University Library Acquisition sub-system is a large file with a sizable number of records being added and deleted daily. This file represents not only mate- rials on order by the Acquisitions Section, but all materials which are in process within the Technical Services area of the Library. The size of the file currently varies from approximately 12,000 to 15,000 items and has a capacity of 18,000 items. Over 40,000 items are added and purged annually. Each record consists of both fixed length fields and variable length fields. Fixed fields primarily contain quantity and accounting in- formation; the variable length fields represent bibliographic data. Records are blocked at 1,000 characters for file structuring purposes; however the variable length information is treated as strings of characters with delimi- ters. The key to the file is a 16-character structure which is developed from the purchase order number. The structure of the key is as follows: six digits of the original purchase order number, two digits of partial order and credit information, and eight digits containing the computed relative record address. Proper development of this key turns out to be 80 Journal of Library Automation Vol 3/1 March, 1970 the most important factor in achieving efficiency in both file access time and record density within the file. The W.S.U. purchase order numbering system, developed from a basic six-digit purchase order number, allows up to one million entries. Of these, the Library currently uses four blocks: one block for standing orders, one block for orders originating from the University after the system becomes operational, another block used by the systems people in prototype testing of the system, and a fourth block which was given to one vendor who operates an approval book program. In mapping a possible million numbers into eighteen thousand disk lo- cations, there is a high probability that the disk addresses for more than one record will be the same. Disk location, also called disk address, home position, and relative record address ( RRA) in this paper, refers to the computed offset address of a record in the file, relative to the starting address of the file. Currently, the file resides on an IBM 2316 disk pack which can store six 1000-character records per track. Thus if the starting address of the file is track 40, a record with RRA = 5 would have its home position on track 40, while a record with RRA = 6 would have its home position on track 41. It should be noted that routines in this system are required to calculate neither absolute track address nor relative track address and therefore the file could be moved to any direct access device supported by OS/BDAM without program modification. When two records map into the same address, it is called a collision. For a WRITE statement under the IBM 360 Operating System, Basic Direct Access Methods, the system locates that disk address generated and if another record is found there, it sequentially searches from that point forward until a vacant space is found and then stores the new rec- ord in that space. The sequential search is done by a hardware program in the I/ 0 channel and proceeds at the rotational speed of the device on which the file resides. The CPU is free during this period to service other users. Similarily, when searching for a record, the system locates the disk address and matches keys; if they do not match, it sequentially searches forward from that point. Long sequential searches sharply de- grade the operating efficiency of on-line systems. In initial experimentation with this file, it was discovered that some records were 2,500 disk positions away from their computed locations. This seriously reduced response time to the terminals which were operat- ing against those records. The necessity to develop a method for placing each record close to its calculated location became quite obvious. How- ever, the methodology for doing this was not as clear. The upper bound delay for a direct access read/write operation can be defined as the largest number of contiguously occupied record locations within the file. The problem of minimizing this upper bound for a par- ticular file is equivalent to finding an algorithm which maps the keys in such a way that unoccupied locations are interspersed throughout the Randomization of Large Files/MITCHELL and BURGESS 81 file space. One method for doing this is to triple the amount of space required for the file. This has been a traditional approach but is unsatis- factory in terms of its efficiency in space utilization. The method first used by the Library was motivated by the necessity to "get on the air." Its requirements were that it be easily implemented and perform to a reasonable degree. The prime modulo scheme seemed to qualify and was selected. As this algorithm was used, the largest prime number within the file size was divided into the purchase order number and the modulo remainder was used as an address; that is, RRA = [Po Modulo Pr] where RRA is the relative record address, Po is the Purchase Order Number, and Pr is a prime number. During the initial period file size grew to about 8,000 records. Because the Acquisitions Section was converting from its manual operation, the file continued to grow in size and the collision problem became pronounced. When the file reached about 70% capacity-that is when 70% of the space allocated for the file was being occupied by records-this method became unusable; rec- ords were then located so far from their original addresses that terminal response times became degraded and batch process routines began to have significant increases in run times. With no additional space available to expand the size of the file, it became necessary to increase the record density within the existing file bounds. Therefore an adaptation of the original algorithm was developed. In addition to generating the original number by dividing a prime num- ber into the purchase order number and keeping the modulo remainder, the purchase order number was multiplied by 300 and divided by that same prime number to get an additional modulo remainder; the latter was added to the first modulo remainder and the sum then divided by 2: (Po Modulo Pr) + (300 • Po Modulo Pr) 2 RRA = Again this scheme brought some relief, but the file continued to grow as the system was implemented, and it became obvious that this procedure would also fail because of over-crowded areas in the file. A search of the literature using W. B. Climenson's chapter on file struc- ture ( 2) as a start provided some other methods for reducing the colli- sion problem ( 1, 3, 4, 5, 6). Several randomization or hashing schemes were examined. However, none of these methods appeared to be particu- larly pertinent to the set of conditions at Washington State. In order to bring relief from the continuing problem of file and pro- gram maintenance involved with changing the file-mapping algorithm, research was initiated to devise an algorithm which would, independent of the input data, map records uniformly across the available file space. The algorithm which resulted utilizes a pseudo-random number gen- erator, RAND (7) developed at the W.S.U. Computing Center RANDL, Program 360L-13.5.004, Computing Center Library, Computing Center, 82 Journal of Library Automation Vol 3/ 1 March, 1970 Washington State University, Pullman, Washington. The normal use of RAND is to generate a sequence of uniformly distributed integers over the interval [1, M], where M is a specified upper bound in the interval [1, 231 -1]. In addition to M, RAND has a second input parameter: N, which is the last number generated by RAND. Given M and N, RAND generates a result R. RAND is used by the algorithm to generate relative disk addresses by setting M to the size or capacity of the file, by setting N to the purchase order number of the record to be located, and by using R as the relative address of the record. RRA =RAND (Po, M ) . In order to test the effectiveness of this algorithm and others which might be devised, a file simulation program was written BDAMSIM, Pro- gram 360L-06.7.008, Computing Center Library, Computing Center, Washington State University, Pullman, Washington. Inputs to this pro- gram are: a) an algorithm to generate relative record locations; b) a sequential file which contains the input data for "a"; c) various scalar values such as file capacity, approximate number of records in the file, title of output, etc. The program analyzes the numbers generated by "a" operating on "b" within the constraints of "c". The outputs of the program are some sta- tistical results and a graphical plot showing the cumulative distribution function of the generated addresses. Figures 1, 2, and 3 show the plotted output of the three algorithms operating against the current acquisitions file. The abscissas of the plots 8 • )!! II! li 1i :;! 5I ::! !':! ~ ~ N N ~~ a= .. - ~, ,~ -' -' ~)11 I'! a; ·:5 ~li Ma! 0.. 0.. .. .. it ,:: ~ ~ ~--~~~~-±~~~--~~~~~~~~--~--~--~--~~0 21 , 10 '12.20 83.30 111,'10 105. 51 I:M.61 1~7.71 1811,81 IM.tl 211.01 2$!,11 253.2f RELRT IVE RECORD ADDRESSES lX I 02 l Fig. 1. RRA =Po Modulo Pr Randomization of Large Files/ MITCHELL and BURGESS 83 Fig. 2. RRA = ( (Po Modulo Pr) + (300 x Po Modulo Pr) )/ 2. 8 i )C II! ~ ~ Z! 5I Fl !! l':! I<; ~ ;;; :::::: ;::: ~8 z 8::: .; .; ::~ ~ ,.. ~ ..J ..J ~~ ~iii ~M :ti~ a: a: "- "- "' "' ~ ~ ~ ~ Fig. 3. RRA =RAND (Po, Pr). 84 Journal of Library A