A SCATTER STORAGE SCHEME FOR DICTIONARY LOOKUPS 

D. M. MURRAY: Department of Computer Science, Cornell University, 
Ithaca, New York 

Scatter storage schemes are examined with respect to their applicability 
to dictionary lookup procedures. Of particular interest are virtual scatter 
methods which combine the advantages of rapid search speed and reason- • 
able storage requirements. The theoretical aspects of computing hash 
addresses are developed, and several algorithms are evaluated. Finally, 
experiments with an actual text lookup process are described, and a possible 
library application is discussed. 

A document retrieval system must have some means of recording the 
subject matter of each document in its data base. Some systems store the 
actual text words, while others store keywords or similar content indicators. 
The SMART system ( 1) uses concept numbers for this purpose, each 
number indicating that a certain word appears in the document. Two 
advantages are apparent. First, a concept number can be held in a fixed-
sized storage element. This produces faster processing than if variable-
sized keywords were used. Second, the amount of storage required to hold 
a concept number is less than that needed for most text words. Hence, 
storage space is used more efficiently. 

SMART must be able to find the concept numbers for the words in any 
document or query. This is done by a dictionary lookup. There are two 
reasons why the lookup must be rapid. For text lookups, a slow scheme is 
costly because of the large number of words to be processed. For handling 
user queries in an on-line system, a slow lookup adds to the user response 
time. 


174 Journal of Library Automation Vol. 3/3 September, 1970 

Storage space is also an important consideration. Even for moderate 
sized subject areas the dictionary can become quite large-too large for 
computer main memory, or so large that the operation of the rest of the 
retrieval system is penalized. In most cases a certain amount of core 
storage is allotted to the dictionary, and the lookup scheme must do the 
best possible job within this allotment. This usually means keeping the 
overhead for the scheme as low as possible, so that a large portion of the 
allotted core is available to hold dictionary words. The rest of the dic-
tionary is placed in auxiliary storage and parts of it are brought in as needed. 
Obviously the number of accesses to auxiliary storage must be minimized. 

This paper presents a study of scatter storage schemes for application 
to dictionary lookup, methods which appear to be fast and yet conservative 
with storage. The next two sections describe scatter storage schemes in 
general. They are followed by a section presenting the results of various 
experiments with hash coding algorithms and a section discussing the 
design and use of a practical lookup scheme. The final sections deal 
with extensions and conclusions. 

BASIC SCATTER STORAGE 

Method 

A basic scatter storage scheme consists of a . transformation algorithm 
and a table. The table serves as the dictionary and is constructed as follows: 
given a natural language word, the algorithm operates on its bit pattern 
to produce an address, and the concept number for the word is placed in 
the table slot indicated by this address. This process is repeated for every 
word to be placed in the dictionary. The generated addresses . are called 
hash addresses; and the table, a hash table.· · 

There are many possible algorithms for producing hash addresses ( 2,3,4). 
Some of the most common are: 1 ) choosing bits from the square of the 
integer represented by the input word; 2) cutting the bit pattern into 
pieces and adding these pieces; 3) dividing the integer represented by the 
input word by the length of the hash table and using the remainder. · 

Collisions 
In an ideal situation every word placed in the dictionary would 

have a unique hash address. However, as soon as a few slots in the hash 
table have been filled, the possibility of a collision arises-two or more 
words producing the same hash address. To differentiate among collided 
entries, the characters of the dictionary words· must be stored along with 
their concept numb~rs. During lookup, the input word can then be com-
pared with the character string to verify that the correct table entry has 
been located. · · 

The problem of where to store the collided items has several · methods 
of solution ( 3,5). The linear scan method places a collided item in the first 
free table slot after the slot indicated by the hash address. The scan· is 


Scatter Storage for Dictionary Lookups/MURRAY 175 

circular over the end of the table. The random probe method uses a crude 
algorithm to generate random offsets R(i) in the interval [1,H] where H 
is the length of the hash table. If the colliding address is A, slot A+R( 1) 
mod H is examined. The process is repeated until an empty slot is found. 
Both of these methods work best when the hash table is lightly loaded; 
that is, when the ratio between the number of words entered and the 
number of table slots is small. In such cases the expected length of scan 
or average number of random probes is small. 

Chaining methods provide a satisfactory method of resolving collisions 
regardless of the load on the hash table. However, they require a second 
storage table-a bump table-for holding the collided items. When a 
collision occurs, both entries are linked together by a pointer and placed 
in the bump table. A pointer to this collision chain is placed in the hash 
table along with an identifying flag. Further colliding items are simply 
added to the end of the collision chain. 

Table Layout and Search Procedure 

In the virtual scatter storage system described later, the hash table has 
a high load factor. Hence the chained method (or rather a variation of it) 
is used to resolve collisions. Further discussion involves only scatter storage 
systems using collision chains. With this restriction, then, a scatter storage 
system consists of a hash table, a bump table, and the associated algorithm 
for producing hash addresses. A dictionary entry consists of a concept 
number and the character string for the word it represents. These entries 
are placed in the hash-bump table as described above. Consequently there 
are three types of slots in the hash table-slots that are empty, slots holding 
a single dictionary entry, and slots containing a pointer to a collision chain 
held in the bump table. Figure 1 is a typical table layout. 

Hash Table 

0 empty slot 
• 
• 

Concept + Char • nary entry single dictio . . 
Pointer -..J.,ntry 11 'r --~)~Entry 21 

\ 
Collision Cha in 

Fig. 1. Typical Table Layout. 


176 Journal of Libmry Automation Vol. 3/3 September, 1970 

One of the advantages of scatter storage systems is that the search 
strategy is the same as the strategy for constructing the hash-bump tables. 
A word being given, its hash address is computed and the tables searched to 
find the proper slot. During construction, dictionary information is placed 
in the slot; during lookup, information is extracted from the slot. The basic 
search procedure is illustrated by the flow diagram in Figure 2. The 
construction procedure is similar. 

Pointer,----< 

Get Next 
Bump Table 
Entry 

Input the 

Text ~rd 

COIIlpUte 

Hasii. Address 

Return 
Concept 
Number 

Word Never 
Entered in 
Dictionary 

Fig. 2. Flow Diagram for the Lookup Procedure in Basic Scatter 
Storage Systems. 


Scatter Storage for Dictionary Lookups/MURRAY 177 

Theoretical Expectations 

An ideal transformation algorithm produces a unique hash address for 
each dictionary word and thereby eliminates collisions. From a practical 
point of view, the best algorithms are those which spread their addresses 
uniformly over the table space. Producing a hash address is simply the 
process of generating a uniform random number from a given character 
string. If the addresses are truly random, a probability model may be 
used to predict various facts about the storage system. 

Suppose a hash table has H slots and that N words are to be entered in 
the hash-bump tables. Let H, be the expected number of hash table slots 
with i entries for i=0,1, ... N. In other words, Ho is the expected number 
of empty slots, H1 is the expected number of single entries, and H2,Hs, ... , 
HN are the expected number of slots with various numbers of colliding 
items. Even though the items are physically located in the bump table, 
they may be considered to "belong .. to the same slot in the hash table. 

It is expected that: 
N 

1) H=~ H. 
i=O 
N 

2) N="S i H, 
i=O 

Now let 
X _ (1 if exactly i items occur in the r~~ slot 

'
1 

- ~0 if exactly i items do not occur in the j'11 slot 
for i = 1,2, ... , H 

Then H, = E [Xu + X.2 + ... + Xm] 
H 

= ~ E [X,J] 
i= 1 

Assume that any chosen table slot is independent of the others so that 
the probability of getting any single item in the slot is 1/H. Then the 
probability of getting exactly i items in that slot is 

3)P·= (~X1r(1-1f 
Then E[X,1] = 1·P, + 0· (1-P,) 

= P, 
Substituting into the above 

4) H,= H·P, 

= n( ~X~) i ( 1- ~ ri for j = 0,1, ... > N 


178 Journal of Library Automation Vol. 3/3 September, 1970 

For the cases of interest H and N are large, and the Poisson approximation 
can be used in equation 3: 

P - ·NIH (N/H)' •- e il 

The ratio N f H is the load factor mentioned previously. It is usually 
designated by a so that 

a• 
5) H, = He·a if i=O,l, . . . , N 

Equation 5 is sufficient to describe the state of the scatter storage system 
after the entry of N items. Most of the statistics of interest can be pre-
dicted using this expression; a few of them are listed in Table 1. 

The time required for a single lookup using a hash scheme depends on 
the number of probes into the table space, that is, how many slots must be 
examined. Suppose the word is actually found; if it is a single entry, only 
one probe is required. If the word is located in a collision chain, the 
number of probes is one (for the hash table) plus one additional probe 
for each element of the collision chain that must be examined. Suppose 
that the word is not in the dictionary; if its hash address corresponds to 
an empty table slot, again only one probe is needed. However, if the 
address points to a collision chain, the number is one plus the length of 
the chain. 

For words found in the dictionary the average number of probes per 
lookup is : 

I 
6) P = 1 + N[(O)Ht + (1+2)Hz + (1+2+3)Hs + ... 

N i 
= 1 + ~ H , ~ f 

i=2 f=1 

1 N 
= 1 + 2 ~ (i+1)Fi-1 

i=2 

1 N 1 N 
=I+ 2 ~ (i-1) F i-1 +2 l F •. t 

i=2 i=2 

1 N+1 1 N+1 
= 1 + 2 ~ (i-1)Fi-t + 2 ~ F1-1 

i=2 i=2 

(probes) 

+ (1+2+ ... + N)HN] 


Scatter Storage for Dictionary Lookups/MURRAY 179 

Table 1. Expected Storage and Search Properties fo1' Basic Scatter Storage 
Schemes 

Measure 
Load factor 
Number of empty table slots 
Number of single entries 
Number of collision chains of 

length i 
Expected sums 

Fraction of hash table empty 

Fraction of table filled with 
single entries 

Fraction of hash table slots 
with i entries 

Expected sums 

Number of collisions 

Number of entries in the bump 
table 

Total table slots required 
Average lookup time (probes) 

H = number of hash table slots 
N = number of words to be entered 

Formula 
a=N/H 
Ho =He-a 
H1 = Ne-a 

ai 
Hi = H e-a--:-r i = 2,3, ... , N 

z. 
N 

H = ~--Hi 
i=O 
N 

N =~--Hi 
i=O 

1 
Fo = H Ho= e-a 

\ 1 -
F1 = H H1 = aea 

1 a' 
F, = H Hi = e-a if i = 2,3, ... , N 

N 
1 = ~ F, 

i=O 
N 

a = ~ i F, 
i=O 

No = H2 + H a + ... + HN 
= H - Ho-Hl 

B = N-Hl 

S = H+B 


180 Journal of Library Automation Vol. 3/ 3 September, 1970 

VIHTUAL SCATTER STORAGE 

Method 

From Table 1, the expected number of collisions is 
Nc= H- Ho- Ht 

= H( 1 - e ·Ntn _ ~e·NIH) 

For a fixed N, this number decreases as H increases. At the same time 
the number of empty hash table slots 

Ho = H e·Nt n 
increases as H increases. Both of these results are expected; as the hash 
addresses are spread over a larger and larger table space ( H slots), the 
number of collisions should decrease and the number of empties increase 
for a fixed number of entries ( N). 

A virtual scatter storage scheme tries to balance these opposing strains 
by combining hash coding with a sparse storage technique. Large or 
virtual hash addresses are used to obtain the collision properties associated 
with a very large hash table, and the storage technique is used to achieve 
the storage and search properties of a reasonably sized hash table. If the 
virtual hash address is taken large enough, the expected number of collisions 
can be reduced to essentially zero. With no expected collisions, it is possible 
to dispense with verifying that a query word and the dictionary word are 
the same. It is enough to check that they produce the same virtual address. 
Hence, the character strings need not be stored in the hash-bump tables 
at all. 

To implement the virtual scheme a large hash address is computed, say 
in the range ( 0, V), and the address is split into a major and minor part. 
The major portion is used just as before-as an index on a hash table of 
size H. The minor portion is stored in the hash or bump table, in place of 
the character string. With this difference, the virtual scheme works just 
as the basic scheme does. The lookup procedure is identical, but the minor 
portions are used for comparison rather than character strings. All the 
results of the previous section apply as storage and timing estimates. 

The advantage of virtual scatter storage systems is economy of storage 
space. The minor portion is much smaller in size than that of the 
character string it replaces. It is true that the virtual scheme assigns the 
same concept number to two different words if they have the same virtual 
address. This need not be disastrous for document retrieval applications. 
Presumably V is chosen large enough to keep the number of collisions 
small. On the one hand, errors could be neglected because of their low 
probability of occurrance and their small effect on the total performance 
of the retrieval system. On the other hand, it is always possible to resolve 
detected collisions even in a virtual scheme. Collisions may be detected 
during dictionary construction or updating, and the characters for the 


Scatter Storage for Dictionary Lookups/MURRAY 181 

colliding words appended to the bump table. The hash or bump table 
entry must contain a pointer to these characters along with an identifying 
flag. Collisions occurring during actual lookups cannot be detected. 

Collision Problem 

"' In order to use a virtual hash scheme, the virtual table must be large enough to reduce the expected number of collisions to an acceptable level. 
From a practical point of view, a collision may be considered to involve 
only two words, rather than three, four, or more. It is assumed that the 
probability of these other types of collisions is negligible. Let V be the size 
of the virtual hash table. Then the expected number of collisions is simply 

N. =H2 
a2 

= V2e·a 

where a = ~ . In this case V> > N so that a is small and e·a is approxi-
mately 1. 

a2 
7) N.=V2 

N2 
=2v 

Suppose, for example, the dictionary has N = 213 words. If the size of 
the virtual hash table is chosen to be V = 226, then the expected number 
of collisions is 

(213)2 1 
Nc = 2(226) = ]' 

Suppose further that this table size is adopted for the dictionary, and that 
the hash code algorithm produces three collisions. The question arises 
whether the algorithm is a good one-whether it produces uniform random 
addresses. The answer is found by extending the previous probability model. 

Consider a virtual scatter storage scheme in which the virtual table 
size is V, and N items are to be entered into the hash-bump tables. Again 
assume that collisions involve only two items. Let 

P(i) =Prob [i collisions] 
= Prob [i table slots have 2 items and N-2i slots have 1 item] 

The number of ways of choosing the i pairs of colliding words (in an 
ordered way) is: 

(~X N22} .. ( N-~+2 ) - 2' (~~2i)l 


182 Journal of Library Automation Vol. 3/3 September, 1970 

There are il ways of ordering these pairs and 
VI 

(V)N-i = (V-N+i)l 

ways of placing the pairs in the hash table, so that 

( . Nl (V)N-' t N
2 
J 8) P l) = 21il (N-2i)! -yr fori= 0,1, ... , 

In a form for hand computation, 
1 2 N-1 

9) P(0)=(1--y) (1-y-) ... (1---y) 

P( ') =P('- 1 ) (N-2i+ 2) (N-2i+1) f t ' 2i(V-N+ i) or i=1,2, ... , 

These results are exact, but the following approximations can be used with 
accuracy 

N-1 . 
log P ( 0) = ~ log ( 1 - ~ ) 

i=l 
N-1 . 
=~ - -' . 1 v 7= 

Nz 
- -2V 

Let f3 = ~; . Terms linear in N may be neglected in equation 9, giving 
P(O) = exp(-fi) 

P(i) = ~ P(i-1) 
1 

This is also a Poisson distribution: 

10) P(i) = exp(-fi) ft for i = 0,1,2, .. . , l ~ J 
This equation gives the approximate probability of i collisions for a virtual 
scatter storage scheme. It may be used to form a confidence interval 
around the expected number of collisions Nc = /3. 

For the previous example in which V = 22 6, N = 213, N c = ~'the follow-
ing table of values can be made: 

i P(i) ~P(i) 
0 .607 .607 
1 .303 .910 
2 .076 .986 
3 .012 .998 


\ 

Scatter Storage for Dictionary Lookups/MURRAY 183 

The probability is .986 that the number of collisions is less than or equal 
to 2. Since the algorithm gave 3 collisions, it appears to be a poor one. 
The results for the collision properties are summarized in Table 2. 

Table 2. Expected Collision Properties for Virtual Scatter Storage Systems 

Measure 

Collision factor 

Expected number of collisions 

Probability of i collisions 

Probability that the number of 
collisions C lies in [ a,b] 

V virtual hash table size 

Formula 
N2 

{3= 2V 

N.=P 

P( i) = exp(-{3) ~' i=O, 1, ... , [ ~ J 
b 

Prob = ~ P(i) 
i=a 

N number of words to be entered 

EXPERIMENTS WITH ALGORITHMS FOR GENERATING HASH 
ADDRESSES 

Any scatter storage scheme depends on a good algorithm for producing 
hash addresses. This is especially true for virtual schemes in which colli-
sions are to be eliminated. In these experiments three basic algorithms 
are evaluated for use in virtual schemes. The words in two dictionaries-
the ADI Wordform and CRAN 1400 \iVordform-are used. The hash-bump 
tables are filled using these words and the resulting collision and storage 
statistics compared with the expected values. 

Dictionaries 

The ADI Wordform contains 7822 words pertaining to the field of docu-
mentation. It contains 206 common words (previously judged) averaging 
3.93 characters. The remaining 7616 noncommon words average 8.00 
characters. In all there are 61,712 characters. 

The CRAN 1400 Wordform contains 8926 words dealing with aero-
nautics. The common word list consists of that of the ADI, plus four 
additional entries. The 8716 noncommon words average 8.40 characters. 
There is a total of 74,074 characters. 

Figures 3 and 4 show the distribution of the length of the words versus 
percentage of collection. The abrupt end to the curves in Figure 3 is due to 
truncation of words to 18 characters. 

Both dictionaries have approximately the same size and proportions of 
words of various length. However, their vocabularies are considerably 
different. A good hash scheme should work equally well on both dic-
tionaries. 


184 Journal of Library Automation Vol. 3/3 September, 1970 

1/) 
0 Common Words "E 

~ ~ ADI 

>- 0 CRAN 1400 
~ 

0 
c 
0 
+= 
0 ·-
0 -0 -c 
Q) 

~ 8 Q) 
a.. 

0 2 4 6 8 10 14 

Word Length 

Fig. 3. Distribution of Dictionary Words According to Their Lengths. 


>. 
'-
0 
c 

.Q -u 
0 -0 
Q) 

.~ -a 
:; 
E 
::J 
u 

\ 

Scatter Storage for Dictionary Lookups/MURRAY 185 

0 Common Words 
6 ADI 

0 CRAN 1400 

0 2 4 6 8 10 12 14 16 18 20 

Word Length 

Fig. 4. Cumulative Distribution of Dictionary Words According to 
Th eir L engths. 


186 Journal of Library Automation Vol. 3/3 September, 1970 

Hash Coding Algorithms 

By their nature, hash coding algorithms are machine dependent. The 
computer representation of the alphabetic characters, the way in which 
arithmetic operations are done, and other factors all affect the randomness 
of the generated address. The algorithms described below are intended 
for use on the IBM S /360. 

Words are padded with some character to fill an integral number of 
S /360 full words. Then the full words are combined in some manner to 
form a single fullword key, and the final hash address is computed from 
this key. In the experiments which follow, the blank is used as a fill 
character. This is an unfortunate choice because of the binary representa-
tion of the blank 01000000. In some algorithms the zeroes may propagate 
or otherwise affect the randomness. A good fill character is one that 1) is 
not available on a keypunch or teletype, 2) will not propagate zeroes, 3) will 
generate a few carries during key formation, and 4) has the majority of its 
bits equal to 0, so their positions may be filled. A likely candidate for the 
S/360 is 01000101. 

Three basic methods of generating virtual hash addresses-addition, 
multiplication, and division-are studied. The first and second provide 
contrasting ways of forming the single fullword keys. The second and third 
differ in the way the hash address is computed from the key. Variations of 
each basic method are also tested to try to improve speed, programming 
ease, or collision-storage properties. 
l. Addition Methods 

AC-addition and center 
The fullwords of characters are logically added to form the key. The 
key is squared and the centermost bits are selected as the major. 
The minor is obtained from bits on both sides of the major. 

AS-addition with shifting 
Same as AC, except the second, third, etc. fullwords are shifted two 
positions to the left before their addition in forming the key. (An 
attempt to improve collision-storage properties) 

AM-addition with masking 
Same as AC, except the second, third, etc. fullwords have certain 
nonsignificant bits altered by masks before their addition in forming 
the key. (An attempt to improve collision-storage properties) 

2. Multiplication Methods 
MC-multiply and center 

The fullwords of characters are multiplied together to form the key. 
The center bits of the previous product are saved as the multiplier 
for the next product. The key is squared and the centermost bits 
selected as the major. The minor is obtained from the bits on both 
sides of the major. 


Scatter Storage fo1' Dictionary Lookups/MURRAY 187 

MSL-multiply and save left 
Same as MC, but during formation of the key, the high order bits of 
the products, rather than the center, are used as successive multipliers. 
(An attempt to improve speed) 

MLM-multiply with left major 
Same as MC, but taking the major from the left half of the square 
of the key and the minor from the right half. (An attempt to improve 
speed) 

3. Division Methods 
DP-divide by prime 

The fullwords of characters are multiplied together to form the key. 
The center bits of the previous product are saved as the multiplier 
for the next product. The key is divided by the length of the virtual 
hash table-a prime number in this case-and the remainder used as 
the virtual hash address. The major is drawn from the left end of 
the virtual address and the minor from the right. 

DO-divide by odd number 
Same as DP, except using a hash table whose length is odd. (An 
attempt to provide more flexibility of hash table sizes ) 

DT -divide twice 
Same as DP, except two divisions are made. The major is produced 
by dividing the key by the actual hash table size. The minor results 
from a second division. Primes are used throughout as divisors. (An 
attempt to improve storage-collision properties) 

Evaluation 

In the experiments to evaluate each variation of the above hash schemes, 
the size of the virtual hash table varies from 220 to 228 slots. The actual 
hash table varies in size from 212 to 214 slots. Bump table space is used as 
needed. The tables are filled by the words from either the ADI or CRAN 
dictionaries and the collision and storage statistics taken. Because good 
collision properties are most important, they are examined first. The storage 
properties are dealt with later. 

The number of collisions obtained from each scheme versus the virtual 
table length is plotted in Figures 5 to 8. The ADI dictionary is shown in 
Figures 5 and 7, and the CRAN in Figures 6 and 8. The circled lines cor-
respond to curves generated from equations 7 and 10. The horizontal 
one shows the expected number of collisions and the lines above and below 
it enclose a 95% confidence interval about the expected curve. In other 
words, if an algorithm is generating random . addresses, the probability is 
95% that the curve for that scheme lies between the heavy lines. 

Consider Figures 5 and 6 showing the results for all the addition methods 
and the MC variation of the multiplication variation. The AC and MC 
algorithms differ only in that addition is used in forming the key in the 


188 Journal of Library Automation Vol. 3/3 September, 1970 

-0 

ooooooo Theoretical Curves 
(Equations (7) and ( 1 0) 
Experimental Curves 

---- Interpolated Curves 

Virtual Hash Table Size (Power of two) 

Fig. 5. Collisions in the ADI Dictionary for Addition and 
Multiplication Hash Schemes. 

first one and multiplication in the second one. Yet the curves are spec-
tacularly different. The result seems to have the following explanation. 

The purpose of a hash address computation is to generate a random 
number from a string of characters. If the bits in the characters are as 
varied as possible, then the algorithm has a headstart in the right direction. 
However, the S/360 bit patterns for the alphabet and numbers are: 

A to I 1100 xxxx 
J to R 1101 xxxx 
S to Z 1110 xxxx 
0 to 9 1111 xxxx 


Scatter Storage for Dictiona1·y Lookups/MURRAY 189 

en 
c: 
0 
en 

0 
(.) -0 ... 
Q) 
.0 
E 
::t 
z 

20 

ooooooo Theoretical Curves 
(Equations (7) and (l 0) 

Experimental Curves 

--- - Interpolated Curves 

26 

Virtual Hash Tobl e Size (Power of two) 

Fig. 6. Collisions in the GRAN Dictionary for Addition and 
Multiplication Hash Schemes. 

c 

28 

In each case the two initial bits of a character are l's, so that in any given 
word one-fourth of the bits are the same. 

In forming a key, the successive additions in the AC algorithm may 
obscure these nonrandom bits if a sufficient number of carries are generated. 
However, the number of additions performed is usually small-2 or 3-
and it appears that the pattems are not broken sufficiently. The MC 
algorithm uses multiplication to form its keys, which involves many 
additions-certainly enough to make the resulting key random. 

The multiplications in the MC algorithm are costly in terms of computa-
tion time. Therefore the AS and AM algorithms are tried. These addition 


190 Journal of Library Automation Vol. 3/3 September, 1970 

en 
c 
0 
·~ 

0 
u 
.... 
0 

ooooo Theoretical Curves 

Experimental Curves 

Interpolated Curves 

22 

20 
Virtual Hosh Table Size (Power of two} 

Fig. 7. Collisions in the ADI Dictionary for Division and 
Multiplication Hash Schemes. 

variants try to hasten the breakup of the nonrandom bits by shifting and 
masking respectively. Although these variants reduce the number of 
collisions somewhat, none of the addition schemes could be called random. 
Typically a few words are singled out at some point and continue to collide 
regardless of the length of the virtual address. Several collision pairs are 
listed below. Note the similarities between the words. 

COUNT 
WORTH 
TOLERATED 
WHEEL 

-SOUND 
-FORTY 
-TELEMETER 
-SHEET 


In 
1: 
0 
·;;; 

0 
0 ... 
0 

... 
Cl) 

.D 
E 
:I 
z 

20 

Scatter Storage for Dictionary Lookups/MURRAY 191 

0000000 

\ 
\ ,----

\ 
\ 

\ 
\ 

26 

Theoretical Curves 
(Equations (7) and (1 0) 
Experimental Curves 
Interpolated Curves 

28 

Virtual Hash Table Size (Power of two) 

Fig. 8. Collisions in the GRAN Dictionary for Division and Multiplication 
Hash Schemes. 

Consider the multiplication algorithms. During key formation, the pro-
cess of saving the center of successive products adds to the computation 
time. The MSL variation attempts to remedy this by saving only the high 
order bits between multiplications (on the S /360 this means saving the 
upper 32 bits of the 64-bit product) . This method is so inferior that its 
collision graph could not be included with the others. The poor results 
stem from the fact that characters at the end of fullwords have little effect 
on the key and that the later multiplications swamped the effects of the 
earlier ones. Examples of collision pairs are given below. For convenience 
the fullwords are separated by blanks. 


192 Journal of Library Automation Vol. 3/3 September, 1970 

CERTAINTY 
PREVENTED 
HEAVING 
EXPE NSE 
CHARTER 

- CERTAINLY 
-PRESENTED 
-HEAT lNG 
-EXPANSE 
-CHAPTER 

The MC and MLM variants are identical with respect to collision proper-
ties. In general these algorithms produce good results, reducing the number 
of collisions to zero in both dictionaries. The collision curve is always 
beneath the expected one. 

Consider Figures 7 and 8 showing the results for all division methods and 
the MC method. All of the division algorithms display a distinct rise in 
the number of collisions when the virtual table size is near 224-regardless 
of the dictionary. The majority of the colliding word pairs are 4-character 
words having the same two middle letters. This brings to light a curious 
fact about division algorithms. For virtual tables, the divisor of the key 
is large and the initial few bits determine the quotient, leaving the rest 
for the remainder. For words of less than 4 characters (which require no 
multiplications during key formation), dividing by 224 is equivalent to 
selecting the last 3 characters of the word as the hash address. Because 
the divisors are not exactly equal to 224, only the two middle characters 
tend to be the same. Examples are: 

DEAL -BEAR 
TOOK -SOON 
HELD -CELL 
VERB -TERM 

This phenomenon apparently continues for table sizes around 226 and 
228, but there are few or no words of 4 characters or less which agree in 
26 or 28 bits. For divisors smaller than 22 \ a larger part of the key 
determines the quotient and apparently breaks up the pattern. Because 
the above effect occurs only for V = 22 \ these points are passed over on the 
graphs. 

In general, the DT algorithm is superior to the rest of the division 
methods, mostly because each of its two divisors is smaller than those used 
in other methods. Prime numbers seem to produce better results than 
other divisors. 

On the basis of collision properties, the MC, MLM, DT, and possibly 
AS algorithms are the best. Storage-search evaluations are included for 
these methods only. 

The experiments with each hash coding method also include counting 
the frequency of various length of collision chains. Here a collision chain 
refers to chains of words producing the same major. The frequency counts 
are compared with the expected counts given by equation 5. The com-
parison is in terms of a chi-square goodn ess-of-fit test with a 10 % level of 


0 8 
:;::: 
t/) -0 -(j) 
Q) 
~ 

0 
;:) 
CT 
(j) 

I 
.c 
u 

2 

0 

Scatter Storage for Dictionary Lookups/MURRAY 193 

x·---x·---x 

or----DT 

/ 
DT 

A~AS 
MLM 

AS 
----AS,DT 

MLM--MLM 

MLM 

MC MC-MC ~C 
----MC 

Virtual Hash Table Size (Power of two) 

X- curve for 10% level of significance 
Fig. 9. Deviations of Storage-Search Properties from Expected Values 

for Selected Hash Schemes Using the ADI Dictionary. 

significance. Figures 9 and 10 show the results of this test for each diction-
ary. Included in the graphs is the line corresponding to the 10% level of 
significance. If the major portions of the hash addresses are really random, 
there is a probability of 0.90 that the 10% line will lie above the curve 
for the algorithm tested. 

Consider the MC and MLM algorithms which differ only in that the 
major is selected from the center and left of the virtual address. From 
the graphs, it is clear that the multiplication methods produce their most 


194 Journal of Library Automation Vol. 3/3 September, 1970 

.~ -.!:!! -0 -CJ) 
~ 
0 8 
~ 
c:1' 

CJ) 

I 
..c 
(.) 6 

4 

MLM 

x---

DT 

Virtual Hash Table Size (Power of two) 
X- curve for 10% level of significance 

Fig. 10. Deviations of Storage-Search Properties from Expected Values 
for Selected I-Iash Schemes Using the GRAN Dictionary. 

random bits in the center of their product. This is somewhat as expected, 
because the center bits are involved in more additions than other bits. 

The division algorithm, which had fairly good collision properties, seems 
to have rather mediocre storage properties. This is probably due to the 


Scatter Storage for Dictionary Lookups/MURRAY 195 

same causes as the collision problems, but working at a lower level, and 
not affecting the results as much. 

The AS curve is included simply for completeness. The scheme displays 
a well behaved storage curve, but it has poor collision properties. 

In summary, the MC scheme seems to be the best for both dictionaries 
in terms of collision and search properties. In terms of computing time, 
the method is more time consuming than the addition methods, but less 
expensive than the division methods. The difference in computation times 
is not an extremely big factor. All methods required from 35 to 55 micro-
seconds for an 8-character word on the S/360/65. The routines are coded 
in assembly language and called from a Fortran executive. The times above 
include the necessary bookkeeping for linkage between the routines. 

A PRACTICAL LOOKUP SCHEME 
General Description 

The lookup scheme described below is designed for use with dictionaries 
of about 21 :. words. The virtual table size selected is 229 and the actual 
table size is 216• On the basis of the results presented in previous sections, 
when the dictionary is full, it is expected that 

1) 36.8% of the hash table will be empty, 
2) 36.8% of the hash table will be single entries, 
3) the bump table will require ( 0.632 )215 entries, 
4) 1 collision is expected, 
5) the probability of 5 or fewer collisions is 0.999, and 
6) the average lookup will require 2.13 probes. 

Table Layout 
In all previous discussions a dictionary entry has included a minor and 

a concept number. A concept number is simply a unique number assigned 
to each word. The hash address of a word is also unique, and hence can 
be used. There is no need to store and use a previously assigned concept 
number. 

A dictionary entry contains a 14-bit minor and a single bit indicating 
whether the word is common or noncommon: 

1 2 15 

IC Minor 
C = 0 implies the word is common; C = 1 implies the word is noncommon. 

A hash table entry contains 16 bits arranged as : 
0 1 15 

I Flag I Information 
Flag = 0 implies that the information is a dictionary entry; Flag = 1 implies 
that the information is a pointer to the bump table. 

Words that have the same major are stored in a block of consecutive 


196 Journal of Library Automation Vol. 3/3 September, 1970 

locations in the bump table. This eliminates the need for pointers in the 
collision "chains". A bump table entry also has 16 bits structured as: 

0 1 2 w 
I End I C Minor 

End= 0 implies that the entry is not the last in the collision block; End 
= 1 implies that the entry is the last in the block. 

Some convention must be adopted to signify an empty hash table slot. 
A zero is most convenient in the above scheme. Unfortunately a zero is 
also a legitimate minor. However, to cause trouble the word generating 
the zero minor would have to be a common word and a single table entry 
(zero minors in the bump table are no problem). Hopefully this occurs 
rarely because of the size of the minor ( 14 bits) and the small number of 
common words. However, even if this combination of circumstances occurs, 
the common word could be placed in the bump table anyway. 

In designing the tables, it is important to make the hash table entries 
large enough to accommodate the largest pointer anticipated for the bump 
table. For the above scheme, the expected bump table size is less than 215 

so that the 15 bits allocated for pointers is sufficient. 
Search Considerations 

The number of probes needed to locate any given word depends on the 
place that the word occupies in a collision block. The average search time 
is improved if the most common words occupy the initial slots in each block. 
A study of ADI text yields the statistics given in Tables 3 and 4. 

Table 3. Division of Words by Categm·y. 

Number of Words Percent of Total 
17270 Total words 100.0 
8716 Common words 50.5 
8554 Noncommon words 49.5 

Table 4. Distribution of L engths. 

Number of All Common 
Non-

Characters Words 
Percent 

Words 
Percent common Percent 

Words 
1-4 10145 58.8 8057 92.5 2097 24.5 
5-8 4630 26.8 627 7.2 4003 46.8 
9-12 2249 13.0 32 0.3 2217 25.9 

13-16 221 1.3 0 0.0 221 2.6 
17-20 11 0.1 0 0.0 11 0.1 
21-24 5 0.0 0 0.0 5 0.1 

Totals 17270 100.0 8716 100.0 8554 100.0 

Av. Length 6.3 4.3 8.3 


Scatter Storage for Dictionary Lookups/MURRAY 197 

Using the categorical information, it appears that in filling the hash-bump 
tables, the common words should be entered first. Within each category, 
all words should be entered in frequency order if such information is 
known. If frequency information is not available, the distribution by 
lengths can be used as an approximation to it. For common words, this 
means entering the shorter words first. For noncommon words, the words 
of 5 to 8 characters should be entered first. 

The greater the number of single entries, the greater the average search 
speed. Figure 11 shows the fraction of single entries ( F 1) and fraction of 
empty slots ( F o) for various load factors. The fraction of single entries 

.l: 
IJI 
0 
I -0 
c 
0 -(.J 
0 
~ 

u. 

0 .4 .8 

Load Factor 

Fig. 11 . Theoretical Hash Table Usage. 

0 Fraction Empty Slots 
A Fraction of Single Entries 

1.6 


198 Journal of Library Automation Vol. 3/3 September, 1970 

F1=ae-a reaches a maximum for a= 1, but since the slope of the curve is 
small around this point, the load factor in the interval ( 0.8, 1.2 ) is prac-
tically the same. Table usage is better, however, for the larger values of a. 
These facts imply that scatter storage schemes make most efficient use of 
space and time for a=l. 

Most text words can be assumed to be in the dictionary. Thus the order 
of comparisons during lookup should be: 

Hash Table Scan 
1) check minor assuming the text word is a common word 
2) check minor assuming the word is non common 
3) check if the entry is a pointer to the bump table 
4) check if the entry is empty 

First Bump Table Entry (must be at least two) 
5) check minor assuming the word is a common word 
6) check minor assuming the word is non common 

Other Bump Table Entries 
7) check minor assuming the word is non common 
8) check minor assuming the word is common 
9) check if at end of collision block. 

The search pattern can be varied to take advantage of the storage condi-
tions. For example, if all common words are either single entries or the 
first element of a collision block, then step 8 may be eliminated. 

Performance 

The lookup system described above has been implemented and tested 
on the IBM S/360/65. A modified form of the MC algorithm is used to 
compute a 29-bit virtual address and divide it into a 15-bit major and a 14-
bit minor. The modification is the inclusion of a single left shift of the 
fullwords of characters during key formation. This breaks up certain types 
of symmetries between words such as WINGTAIL and TAILWING. 
Without this, such words will always collide. The hash-bump tables were 
filled with entries from the ADI dictionary-common words first, followed 
by noncommon words. The shortest words were entered first. Table 5 gives 
comparison of the expected and actual results. 

Table 5. Lookup System Results. 
a=.239 

Number of empty table slots 
Number of single entries 
Number of collision blocks 
Longest collision block 
Average length of collision blocks 
Size of bump table 
Number of collisions 
Average probes per lookup 

Expected 
25810 
6161 
797 

4 
2.1 

1663 
.06 

1.33 

Actual 
25762 
6250 
756 

4 
2.1 

1572 
0 
1.33 


Scatter Storage for Dictionary Lookups/MURRAY 199 

To obtain the actual lookup times 627 words were processed. The words 
were read from cards and all punctuation removed. Each word was passed 
to the lookup program as a continuous string of characters with the proper 
number of fill characters added. The resulting times are given in Table 6 
(in microseconds); a larger sample of the category of "not-found" words 
processed with less accurate timings indicates that the average time for 
words in this category is about 62 microseconds (standard deviation 26). 

Table 6. Lookup Times 

Category Number 
of Words of Words 

All 627 
Common 288 
Noncommon 338 
Not found 1 

Percent 
of Total 

100.0 
45.9 
53.9 

0.2 

Average 
Time 

57.9 
49.9 
64.7 
53.1 

Standard 
Deviation 

11.7 
6.7 

10.7 
0.0 

Average 
Probes 

1.18 
1.12 
1.24 
1.00 

The time to compute a hash address depends on the length of the word . 
Let n be the number of S /360 full words needed to hold these characters. 
The time to form the initial address is 

I ( n) = 34.5 + 10.2 ( n-1) microseconds. 
The average total lookup time, then, is 

T = I(n) + cP 
where c is the average time per probe into the table space and P is the 
average number of probes. For the words in the experiment n = 2.32 
(average), I ( n) = 40.3, and T = 57.9, so that each probe required about 
15 microseconds. 

C ompadsons 

Timing information for other lookup schemes is difficult to obtain. A tree-
structured dictionary is used for a similar purpose at Harvard. Published 
information indicates 6pq microseconds are needed to process p words in 
a dictionary of q entries. This time is for the IBM 7094. Translating this 
time to the S/360/65, which is roughly four times faster, and using the ADI 
dictionary ( q = 7822), it appears that each lookup averages 11,000 micro-
seconds. Exactly how much computation and input-output this includes 
is unknown. 

EXTENSIONS 

Larger Dictionaries 

As more words are added to the dictionary, the size of the virtual address 
must increase in order to prevent collisions. As a result, the number of 
bits per table slot must also increase in order to accommodate the larger 
minors and pointers that are used. For a fixed-sized hash table, the number 
of entries in the bump table grows as new words are added. At some point 
the space required for tables will exceed the amount of core allotted for 


200 Journal of Library Automation Vol. 3/ 3 September, 1970 

dictionary use. To salvage the scheme, it may be possible to split the buinp 
table into parts-one part for more frequently used words and one for 
words in rather rare usage. During dictionary construction common words 
are entered first, then noncommon, then rare. When a rare word must 
be placed in a collision block, a marker is stored instead, and the item is 
placed in the secondary bump table. Presumably the nature of the words 
in the second bump table will make its usage rather infrequent, thus 
saving access to auxiliary storage to fetch it. 

Suffix Removal 

Many dictionary schemes store only word stems; the lookup attempts 
to match only the stem, disregarding suffixes in the process. This is not 
easily done with scatter storage schemes. One solution is to try to remove 
the suffix after an initial search has failed. Each of the various possible 
stems must be looked up independently until a match is found. Another 
solution is to use a table of correspondences between the various forms of 
a word and its stem. The concept number could be used as an index on th is 
table containing pointers to information about the actual stem. A thesauru s 
lookup can be handled the same way. 

Application to Library Files 

Library fil es-characterized by a large number of entries, personal and 
corporate names, foreign language excerpts, etc.-present special problems 
to lookups. With regard to size, there is no particular reason that scatter 
storage cannot b e extended to such files. The only genuine requirement 
is the ability to compute a virtual address long enough to insure a reasonably 
low number of collisions. As mentioned previously, table space can become 
a problem. For really large files, a two-stage process looks most promising. 
A small hash table is used to address high frequency items and a larger 
hash table is used for addressing all other data. Lookup starts with the 
small tables and continues to the larger ones if the initial search fails. The 
same virtual address can be used in both lookups by shifting a few bits 
from the high-frequency minor to the low-frequency major. This two-stage 
technique should keep the amount of table shufBing to a minimum and 
provide rapid lookup for all textual data in titles, abstracts, etc. 

With respect to bibliographic information, personal and corporate names 
are bothersome because they can occur in several forms. Unfortunately, 
scatter storage schemes do not guarantee that dictionary entries for R. A. 
Jones and Robert A. Jones are near each other, so that if an initial lookup 
fails, the rest of the search can be confined to a local area of the file. There 
are two approaches to the problem : ( 1 ) standardization of names before 
input or ( 2) repeated lookups using variants of a name as it occurs in text. 
Standardization, along with delimiting and formatting bibliographic data, 
is probably the most effective and least expensive approach. In addition, 
it reduces the amount of redundant data in the file. 


Scatter Storage for Dictionary Lookups/MURRAY 201 

Phrases in foreign languages present a difficulty, since the character sets 
on most computing equipment are limited to English letters and symbols. 
However, if an encoding for such symbols is used, lookup can proceed 
normally. The problem of obtaining the dictionary entry for an English 
equivalent of a foreign word is a completely different matter and will not 
be dealt with here. 

CONCLUSIONS 

Virtual scatter storage schemes are well suited for dictionaries, having 
both rapid lookup and economy of storage. The rapid lookup is due to 
the fact that the initial table probe limits the search to only a few items. 
The space savings come from the fact that the actual character strings for 
words are not part of the dictionary. The schemes depend heavily on a 
good algorithm for producing random hash addresses. The theory de-
veloped in the first two sections of this paper gives a basis for judging the 
worth of proposed algorithms. 

For any particular application, the table organization may vary to suit 
different needs and to store different information. However, the advantages 
of scatter storage schemes are still present. 

REFERENCES 

1. Salton, G.: "A Document Retrieval System for Man-Machine Interac-
tion." In Association for Computing Machinery. Proceedings of the 
19th National Conference, Philadelphia, Pennsylvania, August 25-27, 
1964, pp. L2.3-l-L2.3-20. 

2. Mcilroy, M. D.: Dynamic Storage Allocation (Bell Telephone Labora-
tories, Inc., 1965). 

3. Morris, R.: "Scatter Storage Techniques," Communications of the ACM 
(January, 1968 ). 

4. Maurer, W. D.: "An Improved Hash Code for Scatter Storage," Com-
munications of the ACM (January, 1968) . 

5. Johnson, L. R.: "Indirect Chaining Method for Addressing on Secondary 
Keys," Communications of the ACM (May, 1961).