# Hashing [video][mit-ocw] ## Motivation * database * compilers and interpreters * network router * substring search * string commonalities * file/dir synchronization Direct Access table: ---------- 0 | | ---------- 1 | | ---------- 2 | | ---------- 3 | | ---------- 4 | | ---------- | ... | ---------- | ... | ---------- | ... | ---------- > Hash requires mapping to an integer. ideally: * prehash hash(x) == hash(y) only when x == y badness: 1. keys may not be non negative integers 1. gigantic memory hog hashing -> hatchet - reduce universe of all possible keys and reduce them down to some reasonable set of integers. - idea: m = theta(n) - size of table proportial to size of # of things. ----------- 0 | | | | | | ---------- 1 | | ---------- 2 | | ---------- 3 | | ---------- 4 | | ---------- | ... | ---------- | ... | ---------- m-1 | ... | ---------- Collision: Multiple keys map to same slot in hash table. * h(ki) == h(kj) but ki <> kj * we can use chaining How to define? * function (h) * all keys map to `h = { 0, ..., m-1 }` Two ways to deal with collisions: * Chaining: store collisions as a list. (i.e. linked list) * Open addressing ## Chaining ---------- 0 | | --> (item1) --> (item2) --> (nil) ---------- 1 | | ---------- 2 | | ---------- 3 | | ---------- 4 | | ---------- | ... | ---------- | ... | ---------- m-1 | ... | ---------- Worst case: * theta(n) (i.e. all items have a collision) Simple uniform hashing: * Convenient for analysis. * each key is equally likely to be hashed to any slot of the table, independent of where other keys hashing. Analysis: * expected length of a chain for n keys, m slots. * n/m == alpha == load factor * O(1 + alpha) ## Hash functions: 1. division method: * `h(k) == k mod m` * works okay if `m` is prime and not close to power of 2 or power of 10. 2. multiplication method: * `h(k) = [(a*k) mod 2^w] >> (w - r)` 3. Universal hasing: * `h(k) = [(ak+b) mod p] mod m` * `a`, `b` are random between 0 and `p-1`. p is a big prime number larger than the universe. * `{0, ..,p-1}` * worst case keys: k1 != k2 * `1/m` ## How to choose m? * Want `m = theta(n)` * `0(1)` Idea: > Start small; grow/shrink as necessary E.g. * If n > m: grow table * grow table: * m -> m' * make table of size `m'` * build new hash function `h'` * rehash * for item in T: * t:insert(item) * `m' = 2m` table doubling Amortization: * operation takes `T(n) amortized` * if `k` operations take <= `k*T(n)` time * spread out the high cost so you pay a little bit, but more often. * ~ Think of meaning `T(n)` on average, where average is taken over all the oeprations. > spread out the high cost so that it's cheap on average all the time. ## Open Adressing * no chaining * use arrays * m: # of slots in the table * m >= # of elements ```plaintext ---------- | item 1 | ---------- | item 2 | ---------- | | ---------- | | ---------- ``` probing: try to see if we can insert something into the hashtable. if we fail, we will compute a slightly different hash until we find an empty slot that we can insert into this table. Hash function specifies the order of slots to probe for a key. (for insert/search/delete) hash function `h(U, trial_count)` * U: universe of keys * trial_count: integer between 0 -> m-1 `h(k,1)` arbitrary key k h(k,1), h(k,2), ... h(k, m-1) to be a permutation of 0,1 ... m-1 ```plaintext ---------- 0 | | ---------- 1 | | ---------- 2 | | ---------- 3 | | ---------- 4 | | ---------- 5 | | ---------- 6 | | ---------- 7 | | ---------- ``` * `insert(k,v)`: keep probing until an empty slot is found. insert item when found. * `search(k)`: As long as the slots encountered are occupied by keys != k. keep probing until you either encounter k or find an empty slot. ## Cryptographic Hash [mit-ocw]: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/lecture-8-hashing-with-chaining/