doc/mit-ocw/hash.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162

# Hashing

[video][mit-ocw]

## Motivation

* database
* compilers and interpreters
* network router
* substring search
* string commonalities
* file/dir synchronization

Direct Access table:

  ----------
0 |        |
  ----------
1 |        |
  ----------
2 |        |
  ----------
3 |        |
  ----------
4 |        |
  ----------
  | ...    |
  ----------
  | ...    |
  ----------
  | ...    |
  ----------

> Hash requires mapping to an integer.

ideally:

* prehash hash(x) == hash(y) only when x == y

badness:

1. keys may not be non negative integers
1. gigantic memory hog

hashing -> hatchet

- reduce universe of all possible keys and reduce them down to some reasonable set of integers.
- idea: m = theta(n)
  - size of table proportial to size of # of things.

    -----------
0   | | | | | |
    ----------
1   |         |
    ----------
2   |         |
    ----------
3   |         |
    ----------
4   |         |
    ----------
    | ...     |
    ----------
    | ...     |
    ----------
m-1 | ...     |
    ----------

Collision: Multiple keys map to same slot in hash table.
* h(ki) == h(kj) but ki <> kj
* we can use chaining

How to define?

* function (h)
* all keys map to `h = { 0, ..., m-1 }`

Two ways to deal with collisions:

* Chaining: store collisions as a list. (i.e. linked list)
* Open addressing

## Chaining

    ----------
0   |         | --> (item1) --> (item2) --> (nil)
    ----------
1   |         |
    ----------
2   |         |
    ----------
3   |         |
    ----------
4   |         |
    ----------
    | ...     |
    ----------
    | ...     |
    ----------
m-1 | ...     |
    ----------

Worst case:

* theta(n) (i.e. all items have a collision)

Simple uniform hashing:

* Convenient for analysis.
* each key is equally likely to be hashed to any slot of the table, independent of where other keys hashing.

Analysis:

* expected length of a chain for n keys, m slots.
  * n/m == alpha == load factor
  * O(1 + alpha)

## Hash functions:

1. division method:
  * `h(k) == k mod m`
  * works okay if `m` is prime and not close to power of 2 or power of 10.
2. multiplication method:
  * `h(k) = [(a*k) mod 2^w] >> (w - r)`
3. Universal hasing:
  * `h(k) = [(ak+b) mod p] mod m`
    * `a`, `b` are random between 0 and `p-1`. p is a big prime number larger than the universe.
    * `{0, ..,p-1}`
  * worst case keys: k1 != k2
    * `1/m`

## How to choose m?

* Want `m = theta(n)`
  * `0(1)`

Idea:

> Start small; grow/shrink as necessary

E.g.

* If n > m: grow table

* grow table:
  * m -> m'
    * make table of size `m'`
    * build new hash function `h'`
    * rehash
      * for item in T:
        * t:insert(item)
    * `m' = 2m` table doubling

Amortization:
* operation takes `T(n) amortized`
  * if `k` operations take <= `k*T(n)` time
* spread out the high cost so you pay a little bit, but more often.
* ~ Think of meaning `T(n)` on average, where average is taken over all the oeprations.

> spread out the high cost so that it's cheap on average all the time.

[mit-ocw]: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/lecture-8-hashing-with-chaining/