# 10Data Structures---Hash Tables

## 展开查看详情

1.Hash Tables A map (also called dictionary) is an ADT that contains key-value pairs. It helps retrieve the value using the key as the “address.” As an example, consider a set of two-letter words. You want to build a dictionary so that you can look up the definition of any two-letter word, very quickly. The two- letter word is the key that addresses the definition of the word. Key Definition am First person singular number indicative of be, also, before noon an The form of ‘a’ before a vowel sound … be To exist or to live by (Preposition) near or next to …

2.More applications of maps (University record) Key = student id. Value = information about the student (Social media) Key = user name / email id. Value = user information. So, how will you implement a dictionary of all two-letter words? Take an array with 26x26= 676 elements. Each two-letter will map to a unique index of the array. The content of that array index will be the definition. Is this scalable? NO. Consider large words … Hash tables implement dictionaries Number of keys = n Array size = N, and The number of possible keys M >> N > n

3.A hash function maps a huge set of keys into N buckets by applying a compression function, called a hash function h h (key) = array index (in the dictionary). Set hash compression bucket of code function array key integer keys index hash function Maps or dictionaries are associative arrays, where searching is based on the key, rather than an index to the array storing the values.

4. (Source: Wikipedia) Hash codes key K h(key) Java relies on 32-bit hash codes, so for the base types byte, short, int, char, we can get a hash code by casting its value x0 x1 x2 ... xn−2 xn−1 to the type int, and using the integer representation.

5.For a string object s = s[0] s[1] … s[n-1], the following method is used to compute the hash code h(s): h(s) = s[0]*31^(n-1)+s[1]*31^(n-2)+ ... + s[n-1] This is called polynomial hash code. When using a hash table to implement a map, for consistency, equivalent keys must map to the same bucket. So, if x.equals(y) then x.hashCode == y.hashCode Compression functions Let us get back to the dictionary of all 2-letter words. There are 26 x 26 = 676 possible keys, but perhaps 70 possible meaningful words. Let us convert each 2-letter word xy to an integer i as follows (using f(a) = 0, f(b) = 1, f(c) = 2, …, f(z) = 25): i = 26.f(x) + f(y)

6.Division method. If we plan on storing these words in an array of size 100 (i.e. N=100) then a simple compression function is mod 100 (if N = bucket array size, then use mod N). Thus, the integer i will be placed in location i mod N of the hash table MAD (Multiply-Add-Divide) method Map i to [(a.i + b) mod p] mod N Here p is a prime number and a, b are random integers chosen from the interval [0..p-1] Depending on the value of n/N, it works in most cases, but what if there are two words xy and pq such that h(xy) = h(pq)? This is called collision. Computing the compression function for collision avoidance is a form of black art. Use a function so that two different keys do not map to the same index of the hash table.

7.Hash table operations A hash table supports at least three operations: insert(key, value) Compute the key's hash code. Compress it to determine the entry's bucket. Insert the entry (key and value together) into that bucket (and deal with collision) find(key) Hash the key to determine its bucket. Search the entry with the given key. If found, return the entry, else, return null. delete(key) Hash the key to determine its bucket. Search the list for an entry with the given key. Remove it from the list if found. Return the entry or null if not found.

8.Collision avoidance Hashing with separate chaining (Also called Chain hashing) Each bucket entry references a linked list of entries, called a chain. If several keys are mapped to the same bucket, their definitions all reside in that bucket's linked list. 0 1 2 3 4 5 6 7 8 9 10 But how do we know which definition corresponds to which word? The answer is that we must store each key in the table along with its definition (i.e. both key and value).

9.Open addressing Linear Probing i hash Key Let i be the hash function of an entry X, but assume that slot i of the hash table is not free, since it has been allocated to entry Y (so that is a collision). Try looking for slots i+1, i+2, i+3, … until a free slot is found. Quadratic Probing In case of a collision at slot i, try looking for slots i+12, i+22, i+32, … until a free slot is found.

10.Typically we expect O(1) time performance for each of the operations. This may not be feasible if the load factor n/N is large (> 0.75) or there are too many collisions. The performance can slowly degenerate towards O(n). Resizing the hash table It is not always possible to foresee the number of entries we'll need to store. So, what to do when the load factor increases? One option is to enlarge the hash table when the load factor becomes too large. Allocate a new array (typically at least twice as long as the old), and then walk through all the entries in the old array and “rehash” them into the new. [Note: you just can’t copy the entries of the old array into the slots of the new array, because the compression functions of the two arrays will be different. You have to rehash each entry individually.] For best performance, hash codes often need to be designed specially for each new object.

11.Complexities of insert, delete, search Initially all empty buckets contain “null”. To insert a key K, first find if it K already exists. If so, the new value will replace the old value. Otherwise insert it into a blank slot. To delete a key, first find it, and then replace it by null. Case 1. Chain hashing Needs additional space. Each bucket has to maintain an independent linked list. The lengths of these lists should be as small as possible otherwise search complexity will increase. If the space is tight, this can scale up to O(n).

12.Case 2. Open addressing with linear probing 6 10 Y Z W T X Key X index 6 6 10 Y 10 X Key X As n/N increases, insert and search takes more time. What happens when you use open addressing and delete the key Y (see figure above)? Will it erase the link 10 also? No. So, to delete a key, simply mark it as “defunct” that will preserve the link. A new key can be inserted into the defunct slot only if it does not exist at all (for this look beyond the defunct entry).

13.Computing the hashCode of a given key should be fast, while minimizing the probability of collision. For a string s, the hash is computed as follows: int hash = 0; for (int i = 0; i < s.length(); i++) hash = (31 * hash + s.charAt(i)) % N Why can it be computed fast? Because it avoids the use of a multiplier (which is slow)! How? Horner’s Rule for computing polynomials h(s) = s[0]*31^(n-1)+s[1]*31^(n-2)+ ... + s[n-1] y0 = s[0] y1 = 31*y0 + s[1] y2 = 31*y1 + s[2] y3 = 31*y2 + s[3] … … … h(s) = yn-‐1 = 31*yn-‐2 + s[n-‐1]

14.Example. Consider the following keys to be entered into a hash table, first into a table of size 5. “Tom”, “Dick”, “Harry”, “Sam”, “Pete” Key hashCode() hashCode()%5 Tom 84274 4 Dick 2129869 4 Harry 69496448 3 Sam 82879 4 Pete 2484038 3 The load factor is 100%. Try with a larger table of size 11

15.Tidbits If you use hashCode()%N as compression function, then note that it may sometimes return a negative value (when the argument is negative) leading to an array out-‐of-‐bound exception. Add N to make it +ve. Also, the function Math.abs(s.hashCode()%N can return a negative integer when the 32-‐bit argument is 10000000000…00, since its positive version cannot be represented using 32 bits (needs 33 bits)! One way out is the following private int hash(Key key) { return (key.hashCode() & 0x7fffffff) % M; } It masks the sign bit and converts it into a positive integer.

16.Quadratic probing is slow since it involves multiplication. Here is how you can speed it up to calculate the next index. Initially k= -‐1 k = k+2 index = (index + k) % hashTableSize Why does it work?