| |||||||||
computer science, a hash table is an associative array data structure that provides fast lookup of a record indexed by a key. Like arrays, hash tables can provide constant-time (O(1)) lookup on average, regardless of the number of items in the table. However, their (very improbable) worst-case time can be as bad as O(n).
Hash tables use an array to hold, or reference, the stored records. However, because we want to allow keys which are much larger than the range of valid indexes, or might even be a different type of data like strings, we need a way to convert each key into a valid index. We achieve this using a hash function, which is a simple function that takes a key and produces a valid index into the array; thus the name hash table.
The problem is that, since we have more potential keys than array indexes, multiple keys will map to the same array index. If we try to insert two keys that map to the same index, this is called a collision. To resolve this, a collision resolution strategy is used. Most collision resolution strategies are some form of linear search through the set of inserted keys that map to the hash index of the one being searched for; it can be shown that it is very improbable any of these sets will become large enough that such a search is expensive. However, to maintain this property, occasionally the hash table's array must be enlarged, and when this happens all the keys in it have to be added all over again. This is a very expensive, but also very infrequent, operation.
Hash tables allow reasonably simple implementations of the essential operations of associative arrays: lookup of a record given a key; setting a record at a given a key (adding or replacing as necessary); and removing a record with a given key. Their main competitor in practice is the self-balancing binary search tree. See a comparison of hash tables and self-balacing binary search trees.
Hash tables are found in a wide variety of programs. Most programming languages provide them in standard libraries. Most interpreted or scripting languages have special syntactic support (examples being Python, Perl, Ruby, Io, Smalltalk, Lua and ICI). In these languages, hash tables tend to be used extensively as data structures, sometimes replacing records and arrays.
Hash tables are commonly used for symbol tables, caches, and sets.
A good hashing of the key is essential for good hash table performance. Hash collisions are generally resolved by some form of linear search, so if a hash function tends to produce similar values for some set of keys, poor performance linear searches will result.
In an ideal hash function, changing any single bit in the key (including extending or shortening the key) will produce a pseudo-random change in all of the bits of the hash, which are independent of the changes caused by any other bits of the key. Because a good hash function can be hard to design, or computationally expensive to execute, much research has been devoted to collision resolution strategies that mitigate poor hashing performance. However, in reality there is no substitute for a good hash.
In practice, the creation of a range reduced index from a key is almost always carried out as a two step process. The first is the generation of a generic hash value which fills a natural machine integer. The second step is its reduction by finding its modulus with respect to the hash table array size.
Hash table array sizes are sometimes set, by construction, to be prime numbers. This is done to avoid any tendency for the large integer hash to have common divisors with the hash table size, which would otherwise induce collisions after the modulus operation.
A common alternative to prime sizes is to use a power of 2 size, with simple bit masking to achieve the modulus operation. Such bit masking may be significantly computationally cheaper than the division operation.
In either case it is often a good idea to arrange the generic hash value to be constructed using numbers that share no common divisors (is coprime) with the table length.
One surprisingly common problem that can occur with hash functions is clustering. Clustering occurs when the structure of the hash function causes commonly used keys to tend to fall closely spaced or even consecutively within the hash table. This can cause significant performance degradation as the table fills. However, not all forms of hash tables are sensitive to clustering, and it is often best to choose an insensitive hash table form.
During debugging, a commonly used hash function is one that returns a constant number, typically the number 1. This reduces the performance of the hash table to linear.
If two keys hash to the same value, they cannot be stored in the same location. We must find a place to store a new value if its ideal location is already occupied. There are a number of ways to do this, but the most popular are chaining and open addressing.
In the simplest, and probably most frequently used chained hash table, each slot in the array references a linked list of records that collide to the same slot.
Chaining hash tables have advantages over open addressing hash tables that the removal operation is simple, resizing the table can be postponed for a much longer time because performance degrades more gracefully and the hash functions are easier to write.
Indeed, many open addressed hash tables may not require resizing- a hash table with a 1000 slots is about a 1000 times faster than a linear search- performance degradation is graceful as the table fills.
Alternative datastructures can be used instead of the linked list. By using an efficient data structure, such as a red-black tree, for each slot, the theoretical worst-case time of a hash table can be brought down, but this is inefficient in practice unless the table grows far beyond its capacity, where, generally, avoiding massively overfilling the table is a better strategy.
From the point of view of suitable writing hash functions, as we shall see in the next section, open hash tables can require really excellent hash functions to avoid clustering. By way of contrast, chained hash tables are insensitive to clustering, only requiring minimisation of collisions which is easier to arrange.
Overall, chained hash tables are simple to implement, robust and high performance. Designers should probably first consider use of this design in preference to open addressed hash tables wherever possible due to the lower complexity.
Chaining hash tables use slightly more memory, however, since the linked lists require storage for pointers; at least one additional pointer needs to be associated with each allocated record.
Open addressing hash tables store all the records within the array. A hash collision is resolved by searching through alternate locations in the array (the probe sequence) until either the target record is found, or an unused array slot is found, which indicates the there is no such key in the table. Well known probe sequences include:
A critical influence on performance of an open addressing hash table is the load factor, that is, the proportion of the slots in the array that are used. As the load factor increases towards 100%, the number of probes that may be required to find or insert a given key rises dramatically. Once the table becomes full, most algorithms will fail to terminate. Even with good hash functions, load factors are normally limited to 80%. A poor hash function can exhibit poor performance even at very low load factors by generating significant clustering.
The following pseudocode is an implementation of an open addressing hash table with linear probing and single-slot stepping, a common approach that is effective if the hash function is good. Each of the lookup, set and remove functions use a common internal function findSlot to locate the array slot that either does or should contain a given key.
The following is wikicode, a proposed pseudocode for BambooWeb articles.
Another technique for removal is simply to mark the slot as deleted. However this eventually requires rebuilding the table simply to remove deleted records. The methods above provide O(1) updating and removal of existing records, with occasional rebuilding if the high water mark of the table size grows.
The O(1) remove method above is only possible in linearly probed hash tables with single-slot stepping. In the case where many entries are to be deleted in one operation, marking the slots for deletion and later rebuilding may be more efficient.
If all of the keys that will be used are known ahead of time, perfect hashing can be used to create a perfect hash table, in which there will be no collisions. If minimal perfect hashing is used, every location in the hash table can be used as well.
Perhaps the simplest solution to a collision is simply to drop the item we're inserting, giving up the position to the item that is already there. In later searches, this may result in a small chance of false negatives, a search not finding an item which has been inserted.
An even more space-efficient solution which is similar to this is to keep only one bit for each bucket, each indicating whether or not a value has been inserted in its bucket. False negatives cannot occur, but false positives can, since if the search finds a 1 bit, it will say the value was found, even if it was just another value that hashed into the same bucket by coincidence. In reality, such a hash table is merely a specific type of Bloom filter.