doc/mit-ocw/hash.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226

# Hashing

[video][mit-ocw]

## Motivation

* database
* compilers and interpreters
* network router
* substring search
* string commonalities
* file/dir synchronization

Direct Access table:

  ----------
0 |        |
  ----------
1 |        |
  ----------
2 |        |
  ----------
3 |        |
  ----------
4 |        |
  ----------
  | ...    |
  ----------
  | ...    |
  ----------
  | ...    |
  ----------

> Hash requires mapping to an integer.

ideally:

* prehash hash(x) == hash(y) only when x == y

badness:

1. keys may not be non negative integers
1. gigantic memory hog

hashing -> hatchet

- reduce universe of all possible keys and reduce them down to some reasonable set of integers.
- idea: m = theta(n)
  - size of table proportial to size of # of things.

    -----------
0   | | | | | |
    ----------
1   |         |
    ----------
2   |         |
    ----------
3   |         |
    ----------
4   |         |
    ----------
    | ...     |
    ----------
    | ...     |
    ----------
m-1 | ...     |
    ----------

Collision: Multiple keys map to same slot in hash table.
* h(ki) == h(kj) but ki <> kj
* we can use chaining

How to define?

* function (h)
* all keys map to `h = { 0, ..., m-1 }`

Two ways to deal with collisions:

* Chaining: store collisions as a list. (i.e. linked list)
* Open addressing

## Chaining

    ----------
0   |         | --> (item1) --> (item2) --> (nil)
    ----------
1   |         |
    ----------
2   |         |
    ----------
3   |         |
    ----------
4   |         |
    ----------
    | ...     |
    ----------
    | ...     |
    ----------
m-1 | ...     |
    ----------

Worst case:

* theta(n) (i.e. all items have a collision)

Simple uniform hashing:

* Convenient for analysis.
* each key is equally likely to be hashed to any slot of the table, independent of where other keys hashing.

Analysis:

* expected length of a chain for n keys, m slots.
  * n/m == alpha == load factor
  * O(1 + alpha)

## Hash functions:

1. division method:
  * `h(k) == k mod m`
  * works okay if `m` is prime and not close to power of 2 or power of 10.
2. multiplication method:
  * `h(k) = [(a*k) mod 2^w] >> (w - r)`
3. Universal hasing:
  * `h(k) = [(ak+b) mod p] mod m`
    * `a`, `b` are random between 0 and `p-1`. p is a big prime number larger than the universe.
    * `{0, ..,p-1}`
  * worst case keys: k1 != k2
    * `1/m`

## How to choose m?

* Want `m = theta(n)`
  * `0(1)`

Idea:

> Start small; grow/shrink as necessary

E.g.

* If n > m: grow table

* grow table:
  * m -> m'
    * make table of size `m'`
    * build new hash function `h'`
    * rehash
      * for item in T:
        * t:insert(item)
    * `m' = 2m` table doubling

Amortization:
* operation takes `T(n) amortized`
  * if `k` operations take <= `k*T(n)` time
* spread out the high cost so you pay a little bit, but more often.
* ~ Think of meaning `T(n)` on average, where average is taken over all the oeprations.

> spread out the high cost so that it's cheap on average all the time.

## Open Adressing

* no chaining
* use arrays
* m: # of slots in the table
* m >= # of elements

```plaintext
----------
| item 1 |
----------
| item 2 |
----------
|        |
----------
|        |
----------
```

probing: try to see if we can insert something into the hashtable.
if we fail, we will compute a slightly different hash until we
find an empty slot that we can insert into this table.

Hash function specifies the order of slots to probe
for a key. (for insert/search/delete)

hash function `h(U, trial_count)`

* U: universe of keys
* trial_count: integer between 0 -> m-1

`h(k,1)`
arbitrary key k

h(k,1), h(k,2), ... h(k, m-1)

to be a permutation of 0,1 ... m-1

```plaintext
  ----------
0 |        |
  ----------
1 |        |
  ----------
2 |        |
  ----------
3 |        |
  ----------
4 |        |
  ----------
5 |        |
  ----------
6 |        |
  ----------
7 |        |
  ----------
```

* `insert(k,v)`: keep probing until an empty slot is found. insert item when found.
* `search(k)`: As long as the slots encountered are occupied by keys != k. keep probing until you either encounter k or find an empty slot.

## Cryptographic Hash


[mit-ocw]: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/lecture-8-hashing-with-chaining/