CPS222 Lecture: Hashing last revised 1/25/2013
I. Introduction
- ------------
A. Recall that a map is a data structure that can be used to store and
retrieve information associated with a certain key. The operations on
such a structure can be pictured as follows:
1. Insertion:
__________________
Key, value | Map |
----------> | (key,value pairs)|
|__________________|
2. Lookup:
___________
Key | Map | Value
----------> | | ------->
|___________|
3. Deletion: ____________________________
| Map |
----------> | (key and its value removed)|
|____________________________|
B. We have, at various times in the past, considered the following search
structures:
Structure Insert Lookup Delete
"Pile" O(1) O(n) O(n) [ to find "victim" ]
(Sequential search)
Ordered array O(n) O(log n) O(n)
(Binary search)
Linked list - O(1) O(n) O(n) [ to find "victim" ]
not ordered (Sequential search)
Binary search
tree O(log n)-O(n) O(log n)-O(n) O(log n)-O(n)
(All operations can be guaranteed to be O(log n) if we use
a balancing strategy as we will discuss shortly)
C. Today we introduce the last search structure (for main memory) that we
will consider: the hash table. A hash table exhibits the following
performance:
Hash table O(1)-O(n) O(1)-O(n) O(1)-O(n)
Note that there is something of a gamble involved here. A hash table
can perform unbeatably well; but it can also do as badly as the worst
of the structures we have considered.
II. Hashing
-- -------
A. In a map, the key may be an integer, a character string, or (more rarely)
something else such as a real. For the purpose of hashing, though, we
will want to always work with integer keys. This is possible,
because any non-integer key can be converted into an equivalent integer.
Example: a character string composed only of the 26 letters A-Z plus space
can be regarded as an integer radix 27.
Example: Treated as an integer radix 27, BJORK =
2*27^4 + 10*27^3 + 15*27^2 + 18*27 + 11 = 1,271,144
Example: The real number 3.14 has the (32 bit) binary representation
using IEEE floating point 0100 0000 0100 1000 1111 0101 1100 0011
which is equivalent to the integer 1078523331
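The radix-27 conversion above can be sketched in Python (an illustrative
sketch; the function name is ours, with space = 0, A = 1, ..., Z = 26):

```python
def string_to_int(key: str) -> int:
    """Treat a string over {space, A..Z} as an integer radix 27."""
    value = 0
    for ch in key:
        digit = 0 if ch == ' ' else ord(ch) - ord('A') + 1
        value = value * 27 + digit
    return value

print(string_to_int("BJORK"))  # 1271144, as computed above
```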
B. If we had unlimited storage, we could create a very fast search structure
by allocating an array whose subscripts would range over all possible
values of the key. If a given key exists, it would be found in the
slot indexed by its value.
1. Example: in such a table an entry for BJORK would be in slot 1271144
2. Obviously, however, such a table could not be reasonably created for
keys of any significant size - e.g. 10 letter keys would require
27^10 slots (= about 2 x 10^14) - the vast majority of which would be
empty!
3. Even if such a table could be created, initializing it would be
computationally costly, since we would have to put an indicator in
each slot showing that there is no value there as yet to prevent
wrong results when searching for a nonexistent key.
C. Hashing builds on this basic idea, as follows: A hash table is an array
of BUCKETS, each of which consists of one or more SLOTS. The buckets are
typically numbered 0..b-1 or 1..b. (There are b buckets in all.) The
number of slots, then, is s*b, where s is the number of slots per bucket.
1. A single slot is able to hold one key-value pair.
2. When hashing is used for tables maintained in main memory (our focus
here), it is common for each bucket to consist of a single slot -
in which case the hash table becomes simply an array of slots. We
will develop most of our initial examples along these lines -
extension to the case of buckets having multiple slots is easy.
3. The total number of slots is greater than the total number of keys
expected to be entered in the table, but typically much less (often
by many orders of magnitude) than the number of POSSIBLE keys.
4. Initially, all the slots are set to some special value that indicates
that the slot is empty - e.g. a key of 0 or the like. This is now
feasible because the number of slots is comparable in magnitude to
the number of keys we intend to store, not the number of possible keys.
D. The basic idea behind hashing is this: we devise a key to address
transformation algorithm that converts a LOGICAL key (such as a name) to a
PHYSICAL "home" bucket for the data associated with that key.
________________
| Key to address |
Logical ---->| transformation | ----> Physical bucket
key | algorithm | number
----------------
1. This transformation must allow for the fact that the number of
POSSIBLE logical keys is MUCH greater than the number of possible
resulting bucket numbers.
Example: Let's say we decide to use a hash table with 2000 buckets
for Gordon students (to allow room for growth etc.)
Further, let's suppose we use student IDs as the logical key.
The transformation from Gordon ID to bucket number maps 10,000,000
possible logical keys into only 2000 possible bucket numbers.
2. As a result, any given algorithm has the possibility of mapping two
different logical keys to the same physical bucket. Such keys are said
to be SYNONYMS.
a. As long as the number of synonyms for any given bucket is <= the number
of slots per bucket, we have no problem. We simply store successive
synonyms in successive slots in the same bucket. When we go to look
up a key, we calculate its home bucket using the hashing function
and then search all slots in the bucket to see if it is in one of
them.
b. Of course, if we have only one slot per bucket, then there is no
room to put two or more synonyms in the same bucket. Even if we
have multiple slots in the same bucket, a problem can arise if
there are more actual synonym keys for a bucket than there are
slots.
c. The resultant condition is said to be a COLLISION. Since only
one key-value pair can be stored in any given slot in
the table, we will have to devise some strategy for handling these
collisions.
3. Thus, any hash table scheme is characterized by the following
parameters:
a. The number of buckets (b).
b. The number of slots per bucket (s).
c. The number of keys actually present in the table (n) - where
n <= b*s.
d. The hashing function that maps a key to a bucket number in the
range 0..b-1 or 1..b.
e. The strategy for handling collisions.
Later in the lecture, we will consider the last two items in detail,
exploring various alternatives. For now, we will consider one
commonly used hashing function and one commonly used collision-
handling strategy.
E. As we shall see, there are many hashing strategies that could be used.
The simplest is one called the DIVISION REMAINDER METHOD.
home-bucket = key mod b (to produce a result in the range 0..b-1)
or 1 + key mod b (to produce a result in the range 1..b)
F. Likewise, there are many possible strategies for handling collisions.
The simplest is one called LINEAR PROBING, or LINEAR OPEN ADDRESSING:
1. To insert a record:
a. Compute the address of the home bucket, using the hash function.
b. If that bucket has room, put the record there.
c. Otherwise, begin looking at adjacent buckets (in increasing bucket
number order) until a bucket with room is found. Put the record in
the first vacant bucket.
i. If you reach the last bucket in the table, then continue
searching with the first bucket (i.e. treat the bucket numbers
as if they wrap around modulo b).
ii. If you come full circle back to the home bucket, then give up;
the table is full. (In this case, one can replace the table
with a larger one dynamically; but a new hash function would
also be needed to take advantage of the added buckets. This
would mean repositioning every record already in the table as
well.)
Example: Table with 5 buckets, 1 slot per bucket (initially empty);
Entries consist of a numeric ID (key) plus a name.
hash function = key mod 5.
Insert 17 AARDVARK: goes into bucket 2
Insert 23 BUFFALO: goes into bucket 3
Insert 12 CAT: should go into bucket 2, but ends up in 4
Insert 44 DOG: should go into bucket 4, but ends up in 0
2. To locate a record:
a. Compute the address of the home bucket, using the hash function.
b. If that bucket contains the record, we have succeeded. (One must
actually check the data stored to be sure the key matches.)
If the home bucket is vacant, then the record is not in the table.
c. If the home bucket is full - but does not contain the desired key -
begin searching successive buckets (as on insert), until either
- The desired record is found.
- A vacant bucket is found (in which case, we conclude the record is
not in the table, since otherwise insert would have found this
bucket and put the record there.)
- You come full circle to the home bucket (in which case conclude
the record is not there because you have tried every one!)
Example: Trace lookup of each of 17, 23, 12, 44 in turn
Trace lookup of records with key = 31, 30
3. To delete a record:
a. First locate the record as above.
b. Now, can we simply mark the slot vacant again? No. Why not?
Because then a later lookup on another record may fail.
Example: suppose we deleted 17 AARDVARK by vacating slot 2.
What would happen when we try to look up 12?
c. Therefore, we instead must replace the record with a dummy record
that fills the slot, but will never match any key we are looking
for. (E.G. if our key is numeric, we might store the letter D
in the key field of the record.)
- On insert, we treat such a slot as if it were, in fact, vacant,
and put a new record there if we need to.
- On lookup, we treat such a slot as occupied, since it once was.
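The insert, lookup, and delete procedures of sections E and F above can be
sketched as follows. This is an illustrative Python sketch with one slot
per bucket and hash = key mod b; the class and sentinel names are our own:

```python
EMPTY, DELETED = object(), object()  # sentinel markers for a slot

class LinearProbeTable:
    """Hash table using the division-remainder method and linear probing."""
    def __init__(self, b):
        self.b = b
        self.slots = [EMPTY] * b

    def insert(self, key, value):
        home = key % self.b
        for i in range(self.b):
            j = (home + i) % self.b
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = (key, value)
                return j                      # bucket actually used
        raise RuntimeError("table is full")   # came full circle

    def lookup(self, key):
        home = key % self.b
        for i in range(self.b):
            j = (home + i) % self.b
            if self.slots[j] is EMPTY:        # never-used slot: not present
                return None
            if self.slots[j] is not DELETED and self.slots[j][0] == key:
                return self.slots[j][1]
        return None                           # came full circle

    def delete(self, key):
        home = key % self.b
        for i in range(self.b):
            j = (home + i) % self.b
            if self.slots[j] is EMPTY:
                return
            if self.slots[j] is not DELETED and self.slots[j][0] == key:
                self.slots[j] = DELETED       # tombstone, not EMPTY
                return

t = LinearProbeTable(5)
print(t.insert(17, "AARDVARK"))  # 2
print(t.insert(23, "BUFFALO"))   # 3
print(t.insert(12, "CAT"))       # 4  (home bucket 2 full)
print(t.insert(44, "DOG"))       # 0  (home bucket 4 full; wraps around)
```

Note that after deleting 17, a lookup of 12 still succeeds, because the
tombstone in slot 2 keeps the probe chain from bucket 2 intact.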
III. Additional Key-To-Address Transformation Techniques (Hashing Functions)
--- ---------- -------------- -------------- ---------- -------- ----------
A. Any hashing function must meet two basic criteria:
1. For all possible logical keys, it must produce a value in the
range 0..b-1 or 1..b.
2. It must disperse the logical keys uniformly - i.e. the probability
that a randomly chosen key hashes to any particular bucket should be
1/b - or very close to this.
3. Actually, the second criterion can be more complicated if the keys
to be used exhibit some pattern or bias. For example, the following
scheme would uniformly distribute random integer keys over 10 buckets:
bucket = key mod 10
However, if the last digits of the key were not uniformly distributed
(e.g. if there were a bias toward even keys), then the resulting
distribution would be non-uniform.
4. A further consideration is that the hashing function should not
be computationally-expensive, since we are trying to compete with
an O(log n) search strategy and can lose our advantage if too much
computation is required.
B. We have already discussed the division-remainder method
1. home-bucket = key mod b (to produce a result in the range 0..b-1)
or key mod b + 1 (to produce a result in the range 1..b)
2. Advantages:
a. Computationally simple if the key is an integer to begin with,
or if converting it to an integer is not too expensive.
b. Provides good dispersion if b is a prime or at least has no
prime factors <= 20.
c. Flexible choice of b values - many sizes to choose from.
C. The mid-square method
1. home-bucket := middle m bits of sqr(key).
2. This requires that b be a power of 2.
3. Example: Let b be 64, and let keys be integers ranging from 1 to 1000.
Then the square of the key is 20 bits long, and we choose bits 7..12.
50 would hash as follows (bit positions numbered 0..19 from the left):
sqr(50) = 2500 = 0000 0000 1001 1100 0100 (binary)
                         _  ____ _        <- bits 7..12 taken
home-bucket = 010011 (binary) = 19
4. Advantages:
a. Computationally simple if the key is an integer to begin with,
or if converting it to an integer is not too expensive.
b. Provides good dispersion, since the hash function depends on all
bits of the original key.
c. Tables whose size is a power of 2 are often natural anyway.
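A minimal sketch of the mid-square computation for the example above
(assuming 20-bit squares, b = 64, and bits numbered from 0 at the left;
the parameter names are ours):

```python
def mid_square(key, total_bits=20, m=6, first_bit=7):
    """Take bits first_bit..first_bit+m-1 of sqr(key), numbering
    bit positions from 0 at the left of a total_bits-bit value."""
    sq = key * key
    shift = total_bits - first_bit - m   # bits to the right of the field
    return (sq >> shift) & ((1 << m) - 1)

print(mid_square(50))  # 2500 = 0000 0000 1001 1100 0100; bits 7..12 = 19
```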
D. Folding
1. Folding is one way of avoiding the need to convert a non-integer key
to an integer - which often poses a problem if the key is long enough
that the resultant value would not fit in the word length of the
underlying machine. (E.g. even a 7-letter alphabetic key, treated as
a number radix 27, could have a value bigger than the largest 32
bit integer.)
2. The key is divided into a number of pieces, each of which is treated
as an integer. All of the pieces are added together, either straight
or with alternate pieces reversed.
Example: 123456789012 might be treated as four pieces
123 456 789 012
which could be added together one of two ways:
a. Shift folding:
123
456
789
012
---
1380
b. Boundary folding:
123
654
789
210
---
1776
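Both folding variants can be sketched as follows (an illustrative Python
sketch; the piece length of 3 matches the example above):

```python
def shift_fold(key: str, piece_len: int = 3) -> int:
    """Cut the key into pieces and add them as-is."""
    pieces = [key[i:i + piece_len] for i in range(0, len(key), piece_len)]
    return sum(int(p) for p in pieces)

def boundary_fold(key: str, piece_len: int = 3) -> int:
    """Cut the key into pieces, reversing the digits of every other piece."""
    pieces = [key[i:i + piece_len] for i in range(0, len(key), piece_len)]
    return sum(int(p if i % 2 == 0 else p[::-1])
               for i, p in enumerate(pieces))

print(shift_fold("123456789012"))     # 123 + 456 + 789 + 12  = 1380
print(boundary_fold("123456789012"))  # 123 + 654 + 789 + 210 = 1776
```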
E. Digit analysis
1. The previous methods required no advance knowledge of the actual set
of keys to be used. If the keys are available in advance, though,
a hashing function might be developed based on an analysis of them.
2. One approach is to calculate the frequency distribution for different
values of each digit (letter) of the keys.
Example: frequency analysis on last names of students in class
F. One last topic we mention briefly is the notion of ORDER-PRESERVING
hash functions.
1. In general, it is not the case that if key1 < key2 then
hash(key1) < hash(key2). However, there are some hash functions that
have this property. They are known as order-preserving hash functions.
2. An order-preserving hash function would be used in a case where one
wishes to have the ability to process table entries in ascending order
of key value, starting at some given point. Such processing is needed
when looking for a RANGE of key values - e.g.
JOHNS <= last_name < JOHNT
IV. Additional Collision Resolution Strategies
-- ---------- ---------- ---------- ----------
A. Though a good hashing scheme that disperses keys uniformly can reduce
the number of synonyms and hence the probability of a collision, we
cannot avoid having to deal somehow with such collisions as do occur.
B. We have already considered linear probing, or linear open addressing:
1. Basically, when we have a collision in a given bucket on insert, we
try the next successive bucket, and we keep trying successive buckets
(wrapping around from the last to the first if need be) until we
find a bucket with a vacant slot - which must always occur, unless
the table is 100% full. (A special case we should detect to
prevent infinite looping!)
2. Looking up a key mirrors the process of inserting a key. We keep
trying successive buckets until we either find our key, or we find
a bucket with a vacant slot - which is where it would have gone if
it were in the table.
3. To delete a key, we need to leave behind some sort of marker that
says the slot was occupied at one point in time. On insert, we
consider such a slot vacant and use it; on lookup, we consider it
occupied and keep looking.
4. Comments on efficiency of linear open addressing
a. At first glance, it appears that hashing with linear open addressing
could be terribly inefficient: it could degenerate to searching the
entire table.
b. On the other hand, if the record we want is, in fact, in its home
bucket or very near to it, then this method works quite well.
c. The success of this method depends on two things:
i. Allocating enough space in the table so that there are sufficient
vacant slots to break up long searches. (A good rule of thumb is
to never allow more than 80% of the slots to actually be used - e.g.
if we wish to store records on 1600 students, then use a table
with at least 2000 slots, plus an appropriate hash function.)
ii. Choose a hash function that disperses the keys uniformly over the
slots.
d. One remaining problem that is hard to avoid, however, is the
problem of CLUSTERING.
i. Consider the following portion of a hashtable:
|_____________|
| Bucket x |
|_____________|
| Bucket x+1 |
|_____________|
| Bucket x+2 |
|_____________|
| Bucket x+3 |
....
Suppose bucket x overflows. Then a key belonging to bucket x
is inserted into bucket x+1.
ii. Of course, the effect of this overflow is to increase the
probability that bucket x+1 will also overflow, since it is now
receiving keys that map to two different buckets.
iii. When bucket x+1 overflows, it begins adding keys to bucket x+2.
This also becomes the place where further overflows from bucket
x must go, of course. So now bucket x+2 becomes the target for
keys hashing to three different buckets. This further enhances
the chances of bucket x+2 overflowing, which would make bucket
x+3 the target for keys hashing to four different buckets ...
iv. As you can see, linear probing suffers from the problem that
clusters of overflowing buckets can develop such that several
buckets "compete" for the same overflow space. (In the above
case, buckets x, x+1, x+2, x+3 and x+4 would all compete for
space in bucket x+4.) Further, once this clustering starts to
occur, it feeds on and compounds itself.
v. There are several alternatives available to reduce this
clustering problem.
C. Quadratic probing
1. Quadratic probing addresses the clustering problem of linear open
addressing by using a quadratic function to choose overflow buckets.
a. If a key belongs in bucket x, the following series of buckets is
examined until one is found with room to hold it:
x
(x+1) mod b
(x+4) mod b
(x+9) mod b
...
i.e. the buckets probed are of the form (home + i^2) mod b
b. Notice how this breaks up clusters. In the above example, bucket x
would first overflow to bucket x+1, increasing the probability of
overflow there. But once bucket x+1 overflows, further overflows
from home bucket x would go into bucket x+4, while overflows from
home bucket x+1 would go into bucket x+2. Thus, buckets x and x+1
would not compete with each other for overflow space, and the
reinforcing effect of local overflows would not occur.
2. Of course, in looking for a key we must probe buckets in the same
order as we do for insertion. We continue probing until we find the
key, or we are forced to abandon the search if some probe leads us to
a bucket with an empty slot (not the result of a deletion).
3. Of course, we want to be sure that an insertion into a nearly full
table will succeed if at all possible, which means that - if
necessary - we will eventually probe each bucket in the table
exactly once. It is possible to get this behavior if we use a
variant of quadratic probing in which we go both forward and
backward from a slot (i.e. we go +1, -1, +4, -4 ...). We will get
the desired behavior if we use a table size that is a prime of the
form 4j + 3 - i.e. a prime one less than a multiple of four.
Example: Consider a table of size 7 (buckets numbered 0..6), and
a key that hashes to bucket 3. If the table is nearly full,
the following buckets will be probed in the order shown in
an attempt to find room for the key:
(home) 3
(home +/- 1) 4 2
(home +/- 4) 0 6
(home +/- 9) 5 1
since we have now tried all buckets, any future probe must fail
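The +/- i^2 probe order traced in this example can be sketched as follows
(an illustrative Python sketch):

```python
def probe_sequence(home: int, b: int):
    """Buckets tried for a key whose home bucket is `home`, using the
    +/- i^2 variant of quadratic probing; when b is a prime of the
    form 4j + 3, every bucket is probed exactly once."""
    seq = [home]
    for i in range(1, (b + 1) // 2):
        seq.append((home + i * i) % b)
        seq.append((home - i * i) % b)
    return seq

print(probe_sequence(3, 7))  # [3, 4, 2, 0, 6, 5, 1] - every bucket once
```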
D. Rehashing
1. Another approach to solving the clustering problem is REHASHING.
Instead of using a single hash function, we use a series of hash
functions f1, f2, f3 ...
2. When an attempt is made to insert a key in the table, it is first
hashed using f1. If the resultant bucket is full, then the key is
hashed again using f2. If that bucket is full, then f3 is used, etc,
until some hash function hashes the key to a bucket that is empty.
3. The same series of hash functions is used, in turn, when searching
for a key until it is found or some probe takes us to a bucket with
an empty slot, in which case we abandon the search.
4. One obvious challenge is developing a suitable series of hash
functions. Ideally, the hash functions should have the property
that if f1 hashes two keys to the same bucket, then f2 hashes them
to different buckets, etc. In contrast to other overflow handling
methods, this has the effect of causing two keys that collide initially
to not also collide on overflow. On the other hand, it is hard to
find functions having this property that also guarantee that every
bucket will eventually be tried in the case of insertion into an
almost full table.
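A minimal sketch of rehashing with a series of hash functions. The
particular family used here (multiplicative functions mod b, with
hypothetical multipliers) is invented purely for illustration - as noted
above, designing a good family is the hard part:

```python
B = 11                         # number of buckets (illustrative)
MULTIPLIERS = [1, 3, 7, 9, 5]  # hypothetical choices for f1, f2, f3, ...

def f(i, key):
    """The (i+1)-th hash function in the series: f1 when i == 0, etc."""
    return (key * MULTIPLIERS[i] + i) % B

def insert(table, key):
    """Try f1, f2, f3, ... in turn until an empty bucket is found."""
    for i in range(len(MULTIPLIERS)):
        bucket = f(i, key)
        if table[bucket] is None:
            table[bucket] = key
            return bucket
    raise RuntimeError("all hash functions tried; insertion fails")

table = [None] * B
print(insert(table, 22))  # f1(22) = 22 mod 11 = 0: bucket 0
print(insert(table, 33))  # f1(33) = 0 is full; f2(33) = 100 mod 11 = 1
```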
E. Chaining
1. One problem that all of the variants of hashing we have considered
thus far suffer from is that overflows are handled by using the same
space in the table that is used for the home buckets. Thus, it can
eventually happen that an insertion will fail because the entire table
is full. (E.g. the 1001st insertion into a table of 500 buckets of
2 slots each must fail.)
a. The normal solution to this problem with the schemes we have
considered previously is to rebuild the table with a larger size -
either by increasing the number of buckets or by increasing the
number of slots per bucket.
b. This, of course, requires a recopying of the entire table - and
may involve a change of hash function and a rehashing of all
existing entries if the number of buckets changes. Naturally,
this would make the insertion that triggers the restructuring
appear to be very slow.
2. An alternate approach is to make use of linked lists. This takes
two forms.
a. The hashtable may be structured as an array of buckets, as
before. But now we add to each bucket a link, initially NULL.
b. Insertions are initially made into the home bucket, as before.
However, should the home bucket overflow the following
approach is used:
i. A new bucket is allocated from outside the table structure.
ii. The new key is put into it.
iii. The link of the home bucket is made to point to the overflow
bucket, and the link of the overflow bucket is made NULL.
iv. Should the overflow bucket itself become full, additional
overflow bucket(s) are added to the chain as needed.
Example: Table with 5 buckets, 1 slot per bucket (initially empty);
Entries consist of a numeric ID (key) plus a name.
hash function = key mod 5.
Insert 17 AARDVARK: goes into bucket 2
Insert 23 BUFFALO: goes into bucket 3
Insert 12 CAT: since bucket 2 is full, goes into an
overflow bucket pointed to by 2
Insert 44 DOG: goes into bucket 4 (collision with
overflow from 2 does not occur)
c. Alternately, the hashtable itself may be simply an array of
pointers to lists of buckets. In this way, no bucket need be
allocated for a given hash value until a key with that hash
value actually occurs.
Example: rework the above
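The second form of chaining (the table as an array of pointers to lists)
can be sketched as follows, reworking the example above; Python lists
stand in for the linked chains:

```python
class ChainedTable:
    """Hash table where each of b entries heads a chain of synonyms."""
    def __init__(self, b):
        self.b = b
        self.chains = [[] for _ in range(b)]

    def insert(self, key, value):
        self.chains[key % self.b].append((key, value))

    def lookup(self, key):
        for k, v in self.chains[key % self.b]:
            if k == key:
                return v
        return None

t = ChainedTable(5)
for key, name in [(17, "AARDVARK"), (23, "BUFFALO"),
                  (12, "CAT"), (44, "DOG")]:
    t.insert(key, name)
print(t.chains[2])   # [(17, 'AARDVARK'), (12, 'CAT')] - synonyms chained
print(t.lookup(44))  # DOG - no collision with the overflow from bucket 2
```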
F. Another approach to handling collisions is EXTENDIBLE HASHING.
1. Several schemes have been proposed to allow the size of a hash table
to grow dynamically in a smooth, efficient way. We consider only
one here. For others, see Smith and Barnes: Files and Databases
pp 124-135.
2. All such schemes use a hash function that generates a large range of
values. For example, on a 32-bit computer, a typical hash function
used with such a scheme would produce a full 32-bit value.
3. Initially, only a limited number of bits from the hash function are
actually used; the rest are ignored. When a bucket overflows,
however, it is split in two and an additional bit of the hash function
is then used to redistribute the keys between the halves.
4. The scheme we consider here makes use of a table called the directory,
whose size is a power of two. Each entry in the table points to a
bucket of keys, but not necessarily a unique bucket. (That is,
several table entries may point to the same bucket.) When we do
lookups or insertions, we use as many bits from the hash function
as are needed to compute an index for this table, and then follow
the pointer to the correct bucket.
Example: the following is a hashtable with bucket size 2, with
keys and hash values as shown. (Hash values are sums of
ASCII values of characters of keys with bits in reverse
order - not great, but OK). At present, three bits of
the hash function are used to distribute the keys.
------
000 ------------------------> HIPPO 0000000110
------ CAT 0001101100
001 ---------------\
------ \-------> AARDVARK 0011001001
010 -------------\
------ ----------> DOG 0101101100
011 -------------/ JACKAL 0110010110
------
100 ------------------------> ELEPHANT 1000101001
------
101 ------------------------> GOPHER 1010001110
------ FOX 1011011100
110 --------------\
------ ---------> BUFFALO 1111111110
111 --------------/
We now consider the following insertions:
MONKEY 1100101110 - would go in bucket with BUFFALO
OSPREY 0100011110 - would cause bucket containing DOG, JACKAL
to split. As a result, 010 entry in the
table would point to a bucket containing
DOG and OSPREY, while 011 entry would now
point to a bucket with JACKAL
IGUANA 1010110110 - would force us to go to using 4 bits to
differentiate keys, since there are already
two entries with 101 as their first 3 bits.
The new table (including MONKEY and OSPREY
from before) would look like this:
0000------------\
------ -----------> HIPPO 0000000110
0001------------/ CAT 0001101100
------
0010------------\
------ ------------> AARDVARK 0011001001
0011------------/
------
0100------------\
------ ------------> DOG 0101101100
0101------------/ OSPREY 0100011110
------
0110------------\
------ ------------> JACKAL 0110010110
0111------------/
------
1000------------\
------ ------------> ELEPHANT 1000101001
1001------------/
------
1010-------------------------> GOPHER 1010001110
------                         IGUANA 1010110110
1011-------------------------> FOX 1011011100
------
1100----------\
------ \
1101------------\
------ ------------> MONKEY 1100101110
1110------------/ BUFFALO 1111111110
------ /
1111----------/
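The whole extendible-hashing scheme, including the directory doubling and
bucket splits traced above, can be sketched as follows. This is an
illustrative Python sketch - the class and method names are our own - but
the hash function is exactly the one described above: the 10-bit sum of
the ASCII codes of the key, with its bits in reverse order.

```python
BUCKET_SIZE = 2
HASH_BITS = 10

def h(key: str) -> int:
    """Sum of ASCII codes as a 10-bit value, bits reversed."""
    s = sum(ord(c) for c in key) & ((1 << HASH_BITS) - 1)
    return int(format(s, '010b')[::-1], 2)

class Bucket:
    def __init__(self, depth):
        self.depth = depth     # number of leading hash bits this bucket uses
        self.keys = []

class ExtendibleTable:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, hval):
        # use the leading global_depth bits of the hash value
        return hval >> (HASH_BITS - self.global_depth)

    def lookup(self, key):
        return key in self.directory[self._index(h(key))].keys

    def insert(self, key):
        bucket = self.directory[self._index(h(key))]
        if len(bucket.keys) < BUCKET_SIZE:
            bucket.keys.append(key)
            return
        # Bucket overflows: split it, doubling the directory if necessary.
        if bucket.depth == self.global_depth:
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.global_depth += 1
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth)
        # Directory entries whose next bit is 1 now point to the new bucket.
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> (self.global_depth - bucket.depth)) & 1:
                self.directory[i] = new_bucket
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys + [key]:   # redistribute, splitting again if needed
            self.insert(k)

t = ExtendibleTable()
for k in ["HIPPO", "CAT", "AARDVARK", "DOG", "JACKAL", "ELEPHANT",
          "GOPHER", "FOX", "BUFFALO", "MONKEY", "OSPREY", "IGUANA"]:
    t.insert(k)
print(t.global_depth)      # 4 - as in the final directory above
print(t.lookup("IGUANA"))  # True
```

Inserting the twelve keys in the order shown reproduces the trace above:
AARDVARK forces the directory from 1 to 3 bits, and IGUANA forces the
move to 4 bits, after which GOPHER and IGUANA share the 1010 bucket.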