<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>CS 2150: 06-hashes slide set</title>
<meta name="description" content="A set of slides for a course on Program and Data Representation">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="../slides/reveal.js/dist/reset.css">
<link rel="stylesheet" href="../slides/reveal.js/dist/reveal.css">
<link rel="stylesheet" href="../slides/reveal.js/dist/theme/black.css" id="theme">
<link rel="stylesheet" href="../slides/css/pdr.css">
<!-- Code syntax highlighting -->
<link rel="stylesheet" href="../slides/reveal.js/plugin/highlight/zenburn.css">
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../slides/reveal.js/css/print/pdf.scss' : '../slides/reveal.js/css/print/paper.scss';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<!--[if lt IE 9]>
<script src="../slides/reveal.js/lib/js/html5shiv.js"></script>
<![endif]-->
<style>.reveal li { font-size:93%; line-height:120%; }</style>
</head>
<body>
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="slides">
<section data-markdown id="cover"><script type="text/template">
# CS 2150
### Program and Data Representation
<p class='titlep'> </p>
<div class="titlesmall"><p>
<a href="http://www.cs.virginia.edu/~mrf8t">Mark Floryan</a> (mrf8t@virginia.edu)<br>
<a href="http://www.cs.virginia.edu/~asb">Aaron Bloomfield</a> (aaron@virginia.edu)<br>
<a href="http://github.com/uva-cs/pdr">@github</a> | <a href="index.html">↑</a> | <a href="./06-hashes.html?print-pdf"><img class="print" width="20" src="../slides/images/print-icon.png" style="top:0px;vertical-align:middle"></a>
</p></div>
<p class='titlep'> </p>
## Hash Tables
</script></section>
<section>
<h2>CS 2150 Roadmap</h2>
<table class="wide">
<tr><td colspan="3"><p class="center">Data Representation</p></td><td></td><td colspan="3"><p class="center">Program Representation</p></td></tr>
<tr>
<td class="top"><small> <br> <br>string<br> <br> <br> <br>int x[3]<br> <br> <br> <br>char x<br> <br> <br> <br>0x9cd0f0ad<br> <br> <br> <br>01101011</small></td>
<!-- image adapted from http://openclipart.org/detail/3677/arrow-left-right-by-torfnase -->
<td><img class="noborder" src="images/red-double-arrow.png" height="500" alt="vertical red double arrow"></td>
<td class="top"> <br>Objects<br> <br>Arrays<br> <br>Primitive types<br> <br>Addresses<br> <br>bits</td>
<td> </td>
<td class="top"><small> <br> <br>Java code<br> <br> <br>C++ code<br> <br> <br>C code<br> <br> <br>x86 code<br> <br> <br>IBCM<br> <br> <br>hexadecimal</small></td>
<!-- image adapted from http://openclipart.org/detail/3677/arrow-left-right-by-torfnase -->
<td><img class="noborder" src="images/green-double-arrow.png" height="500" alt="vertical green double arrow"></td>
<td class="top"> <br>High-level language<br> <br>Low-level language<br> <br>Assembly language<br> <br>Machine code</td>
</tr>
</table>
</section>
<section data-markdown><script type="text/template">
# Contents
[ADTs Covered So Far](#/adtssofar)
[Hash Tables](#/hashtables)
[Separate Chaining](#/separatechaining)
[Open Addressing](#/openaddressing)
[Miscellaneous](#/miscellaneous)
</script></section>
<section>
<section id="adtssofar" data-markdown class="center"><script type="text/template">
# ADTs Covered So Far
</script></section>
<section data-markdown><script type="text/template">
## Lists
- Operations
- find
- insert
- remove
- findKth
- Implementations
- Array (vector)
- Linked list
</script></section>
<section data-markdown><script type="text/template">
## Lists
| | Array (vector) | Linked List |
|-|-|-|
| find | Θ(*n*) | Θ(*n*) |
| insert | Θ(*n*) worst case,<br>but often Θ(1) | Θ(1) |
| remove | Θ(*n*) | Θ(*n*) |
| findKth | Θ(1) | Θ(*n*) |
<center>The operations are <i>generally</i> linear-time operations</center>
</script></section>
<section data-markdown><script type="text/template">
## Stacks
- List with data handled last-in first-out
- Operations:
- push
- pop
- top
- Implementations
- Array (vector)
- Linked list
</script></section>
<section data-markdown><script type="text/template">
## Stacks
| | Array (vector) | Linked List |
|-|-|-|
| push | Θ(*n*) worst case,<br>but often Θ(1) | Θ(1) |
| pop | Θ(1) | Θ(1) |
| top | Θ(1) | Θ(1) |
<center>The operations are <i>generally</i> constant-time operations</center>
</script></section>
<section data-markdown><script type="text/template">
## Queues
- First-in first-out list
- Operations:
- enqueue
- dequeue
- Implementations
- Array (vector)
- Linked lists
</script></section>
<section data-markdown><script type="text/template">
## Queues
| | Array (vector) | Linked List |
|-|-|-|
| enqueue | Θ(*n*) worst case,<br>but often Θ(1) | Θ(1) |
| dequeue | Θ(1) | Θ(1) |
<center>The operations are <i>generally</i> constant-time operations</center>
</script></section>
<section data-markdown><script type="text/template">
## Trees
- Goal is Θ(log *n*) runtime for most operations
- Binary search trees
- AVL Trees
- Red-black trees
- Splay trees
</script></section>
<section data-markdown><script type="text/template">
## Trees
| | BST | AVL | Red-black | Splay |
|-|-|-|-|-|
| find | Θ(*h*), where<br>⌊log<sub>2</sub> *n*⌋ ≤ *h* ≤ *n*-1;<br>worst case is Θ(*n*) | Θ(log *n*) | Θ(log *n*) | Θ(log *n*)<br>amortized |
| insert | Θ(*h*), where<br>⌊log<sub>2</sub> *n*⌋ ≤ *h* ≤ *n*-1;<br>worst case is Θ(*n*) | Θ(log *n*) | Θ(log *n*) | Θ(log *n*)<br>amortized |
| remove | Θ(*h*), where<br>⌊log<sub>2</sub> *n*⌋ ≤ *h* ≤ *n*-1;<br>worst case is Θ(*n*) | Θ(log *n*) | Θ(log *n*) | Θ(log *n*)<br>amortized |
<center>Operations on balanced trees are <i>generally</i> logarithmic-time operations</center>
</script></section>
<section data-markdown><script type="text/template">
## Is There Anything Faster?
- Fastest possible search using binary comparison: Θ(log *n*)
- Rephrased: binary comparison searches are Ω(log *n*)
- We can do better: (almost) constant (Θ(1)) is possible with *hash tables*
- Hash tables (lookup table)
- Standard set of operations: find, insert, delete
- No ordering property!
- Thus, no findMin or findMax
</script></section>
</section>
<section>
<section id="hashtables" data-markdown class="center"><script type="text/template">
# Hash Tables
</script></section>
<section data-markdown><script type="text/template">
## Key-value pairs
- Hash tables store key-value pairs
- Each value has a specific key associated with it
- Keys and values need not be the same type!
- Examples
- Definitions: "set", "v.tr. 1. To put in a specified position..."
- UVa e-mail redirects: "aaron@", "asb2t@cms.virginia.edu"
- Anything that can be stored in a tree
- Userid / IDnum pairs
- Userid / lots_of_info_about_them_in_an_object pairs
</script></section>
<section>
<h2>Hash Tables</h2>
<table class="transparent"><tr><td>
<table class="transparent"><tr><td>
<div style="font-size:130%;line-height:110%">
<ul>
<li>Hash table<ul>
<li>A fixed-size <i>array</i> whose size is usually a prime number<ul>
<li>Should be larger than the number of elements</li></ul></li></ul></li>
<li>Given a key space:</li>
</ul>
</div>
</td></tr>
<tr><td>
<table class="transparent"><tr><td><img alt="blob" src="images/06-hashes/blob.png"></td><td class="middle">
<table class="transparent"><tr><td><div style="font-size:130%;line-height:110%">hash function</div></td></tr><tr><td><div style="font-size:130%;line-height:110%"><i>hash</i>(<i>k</i>)</div></td></tr><tr><td><div style="font-size:200%">→</div></td></tr></table>
</td></tr></table>
</td></tr></table>
</td><td class="middle">
<table class="transparent">
<tr><td> </td><td></td></tr>
<tr><td> </td><td></td></tr>
<tr><td style="text-align:right;">hash</td><td style="text-align:left;">table</td></tr>
<tr><td> </td><td></td></tr>
<tr><td>0</td><td class="border" style="width:100px"></td></tr>
<tr><td>1</td><td class="border" style="width:100px"></td></tr>
<tr><td>2</td><td class="border" style="width:100px"></td></tr>
<tr><td> </td><td class="border" style="width:100px"></td></tr>
<tr><td>...</td><td class="border" style="width:100px"></td></tr>
<tr><td> </td><td class="border" style="width:100px"></td></tr>
<tr><td>tablesize‑1</td><td class="border" style="border-bottom:medium solid;"></td></tr>
</table>
</td></tr></table>
</section>
<section data-markdown><script type="text/template">
## Hash function
- A hash function takes in a "thing"...
- string, int, object, etc.
- and returns an *unsigned* integer value
- which is then mod'ed by the size of the hash table to yield a spot within the bounds of the hash table array
- Three *required* properties
- Must be *deterministic*
- Meaning it must return the same value each time for the same "thing"
- Must be *fast*
- Must be *evenly distributed*
- Technically, only the first is required for *correctness*, but the other two are required for fast running times
</script></section>
<section>
<h2>Hash functions KLA</h2>
<ul>
<li>I'm going to hash all of you into 10 buckets (0-9) by your birthday<ul>
<li>(you are welcome to make up another birthday, as long as you are consistent)</li></ul></li>
<li>The hash functions:<ul>
<li class="fragment" data-fragment-index="1">By the decade of your birth year<ul>
<li class="fragment" data-fragment-index="1"><i>hash</i>(<i>birthday</i>) = (<i>year</i>/10) % 10</li></ul></li>
<li class="fragment" data-fragment-index="2">By the last digit of your birth year<ul>
<li class="fragment" data-fragment-index="2"><i>hash</i>(<i>birthday</i>) = <i>year</i> % 10</li></ul></li>
<li class="fragment" data-fragment-index="3">By the last digit of your birth month<ul>
<li class="fragment" data-fragment-index="3"><i>hash</i>(<i>birthday</i>) = <i>month</i> % 10</li></ul></li>
<li class="fragment" data-fragment-index="4">By the last digit of your birth day<ul>
<li class="fragment" data-fragment-index="4"><i>hash</i>(<i>birthday</i>) = <i>day</i> % 10</li></ul></li>
</ul></li></ul>
</section>
<section>
<h2>Keys</h2>
<table class="wide">
<tr>
<td class="top">
<div style="width:400px;font-size:130%;line-height:110%">
<ul>
<li>How can we hash the keys if the keys can be anything?</li>
<li>The best a single binary comparison can do is eliminate half of the elements, which gives Θ(log <i>n</i>)</li>
<li>We want Θ(1)</li>
<li>But every key is ultimately just bits, so we can do better!</li>
</ul></div>
</td>
<td style="width:75px"></td>
<td class="top">
<br><small>"Hello"</small><br> <br><small>['H','i',\0]</small><br> <br><small>3.14</small><br> <br><small>'x'</small><br> <br><small>0x42381a</small><br> <br><small>01001010</small></td>
<!-- image adapted from http://openclipart.org/detail/3677/arrow-left-right-by-torfnase -->
<td style="width:150px;vertical-align:top"><img class="noborder" src="images/red-double-arrow.png" height="500" alt="vertical red double arrow"></td>
<td class="top"> <br> <br><small>Objects</small><br> <br><small>Arrays</small><br> <br><small>Primitive types</small><br> <br><small>Addresses</small><br> <br><small>bits</small></td>
<td> </td>
</tr>
</table>
</section>
<section data-markdown><script type="text/template">
## Lookup Table
| hash(key) | key |
|-|-|
| 000000 | "red" |
| 000001 | "orange" |
| 000010 | "blue" |
| 000011 | `null` |
| 000100 | "green" |
| 000101 | ... |
This can work, unless the key space is sparse, or we don't know the keys ahead of time. But it's slow to look up a value in a table!
</script></section>
<section>
<h2>Example</h2>
<table class="transparent"><tr><td>
<div style="font-size:120%">
<ul>
<li>Key space: integers<br> </li>
<li>Table size: 10<br> </li>
<li><i>hash</i>(<i>k</i>) = <i>k</i> mod 10<ul>
<li>Technically, <i>hash</i>(<i>k</i>) = <i>k</i>,<br>which is <i>then</i> mod'ed by<br>the table size of 10<br> </li>
</ul></li>
<li>Insert: 7, 18, 41, 34<br> </li>
<li>How do we find them?</li>
</ul></div>
</td><td style="width:100px"></td><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"></td></tr>
<tr><td>1</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="3">41</span></td></tr>
<tr><td>2</td><td class="border" style="width:100px"></td></tr>
<tr><td>3</td><td class="border" style="width:100px"></td></tr>
<tr><td>4</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="4">34</span></td></tr>
<tr><td>5</td><td class="border" style="width:100px"></td></tr>
<tr><td>6</td><td class="border" style="width:100px"></td></tr>
<tr><td>7</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="1">7</span></td></tr>
<tr><td>8</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="2">18</span></td></tr>
<tr><td>9</td><td class="border" style="border-bottom:medium solid;"></td></tr>
</table>
</td></tr></table>
</section>
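<section data-markdown><script type="text/template">
## Example in Code
The example on the previous slide can be sketched in Python. This is only an illustration, assuming no two keys collide; the names `hash_key`, `slot`, `insert`, and `find` are made up for this sketch and are not part of the course code.

```python
TABLE_SIZE = 10

def hash_key(k):
    """Identity hash for integer keys: hash(k) = k."""
    return k

def slot(k, table_size=TABLE_SIZE):
    """Mod the hash value by the table size to get an array index."""
    return hash_key(k) % table_size

def insert(table, k):
    """Store k at its slot; this sketch assumes the slot is free."""
    table[slot(k, len(table))] = k

def find(table, k):
    """Recompute the slot and check only that one cell."""
    return table[slot(k, len(table))] == k

table = [None] * TABLE_SIZE
for key in (7, 18, 41, 34):
    insert(table, key)
print(table)           # 41 in cell 1, 34 in cell 4, 7 in cell 7, 18 in cell 8
print(find(table, 41))
```

A find does not search the table: it recomputes the slot and looks in that one cell.
</script></section>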
<section data-markdown><script type="text/template">
## Table size issues...
- Why not just have a table of size 100...
- and map each element directly to the location given by its key?
- We assume that the key space is too large
- Example: mapping social security numbers for students at UVa
- There are not 999,999,999 students at UVa, even if taken across all time
- Do you see why find max and find min are not easy?
- We have not preserved any ordering info
</script></section>
<section>
<h2>Another Example</h2>
<table class="transparent"><tr><td>
<div style="font-size:120%">
<ul>
<li>Key space: integers<br> </li>
<li>Table size: 6<br> </li>
<li><i>hash</i>(<i>k</i>) = <i>k</i> mod 6<br> </li>
<li>Insert: 7, 18, 41, 34, <span class='red'>12</span><br> </li>
<li>How do we find them?</li>
</ul></div>
</td><td style="width:100px"></td><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="2">18</span><span class="fragment" data-fragment-index="5"><span class="red"> 12</span></span></td></tr>
<tr><td>1</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="1">7</span></td></tr>
<tr><td>2</td><td class="border" style="width:100px"></td></tr>
<tr><td>3</td><td class="border" style="width:100px"></td></tr>
<tr><td>4</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="4">34</span></td></tr>
<tr><td>5</td><td class="border" style="border-bottom:medium solid;"><span class="fragment" data-fragment-index="3">41</span></td></tr>
</table>
</td></tr></table>
</section>
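<section data-markdown><script type="text/template">
## Spotting the Collision
A short Python sketch of why the smaller table on the previous slide runs into trouble. The helper names `slots_by_index` and `collisions` are hypothetical; the keys and table sizes come from the two examples.

```python
from collections import defaultdict

def slots_by_index(keys, table_size):
    """Group keys by the slot that k mod table_size sends them to."""
    groups = defaultdict(list)
    for k in keys:
        groups[k % table_size].append(k)
    return dict(groups)

def collisions(keys, table_size):
    """Return only the slots that received more than one key."""
    grouped = slots_by_index(keys, table_size)
    return {i: ks for i, ks in grouped.items() if len(ks) > 1}

keys = [7, 18, 41, 34, 12]
print(collisions(keys, 6))    # 18 and 12 both map to slot 0
print(collisions(keys, 10))   # no slot is shared with a table of size 10
```

The same keys collide or not depending on the table size, which is why a collision resolution strategy is needed.
</script></section>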
<section data-markdown><script type="text/template">
## Hash Table
- Hash function: *hash*: *key* → [0, *m*-1]
- Really to any unsigned integer, which is then mod'ed by *m*, the table size
- Here, *hash*(*key*) = `firstletter`(*key*)<br>
| Location | Key | Value |
|-|-|-|
| 0 | "Alice" | "red" |
| 1 | "Bob" | "orange" |
| 2 | "Colleen" | "blue" |
| 3 | `null` | `null` |
| 4 | "Eve" | "green" |
| ... | ... | ... |
| *m*-1 | "Zeus" | "purple" |
</script></section>
<section data-markdown><script type="text/template">
## Hash Functions
- Required properties described earlier
- Must be deterministic
- Must be fast
- Must be evenly distributed
- This implies avoiding collisions
- A perfect hash function has:
- No blanks (i.e., no empty cells)
- No collisions
</script></section>
<section>
<h2>Sample String Hash Functions</h2>
<ul>
<li>Key space: strings</li>
<li>A string <i>s</i> is made up of characters <i>s<sub>i</sub></i></li>
<li>\( s = s_0s_1s_2s_3\ldots s_{k-1} \)</li>
</ul>
<p> </p>
<ol>
<li class="fragment">\( hash(s) = s_0 \mod table\_size \)<br> </li>
<li class="fragment">\( hash(s) = \left( \sum_{i=0}^{k-1}s_i \right) \mod table\_size \)<br> </li>
<li class="fragment">\( hash(s) = \left( \sum_{i=0}^{k-1}s_i*37^i \right) \mod table\_size \)<br> </li>
</ol>
</section>
<section data-markdown><script type="text/template">
## Hash function notes
- They should always return an *unsigned* int
- Otherwise your program may try to access a negative array index
- Integer overflow is fine, as long as it overflows *deterministically*
- Meaning the same way each time
- Overflow is especially likely with the last of the string hash functions presented on the previous slide
</script></section>
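<section data-markdown><script type="text/template">
## String Hash Functions in Code
The three sample string hash functions can be sketched in Python. Python integers never overflow, so this sketch *assumes* a 32-bit unsigned int and emulates its wrap-around with a mask. The function names, the mask, and returning 0 for an empty string are all assumptions of this sketch.

```python
MASK32 = 0xFFFFFFFF  # assumption: emulate a 32-bit unsigned int

def hash_first_char(s):
    """Option 1: the code of the first character only (0 if s is empty)."""
    return ord(s[0]) if s else 0

def hash_char_sum(s):
    """Option 2: the sum of all character codes."""
    return sum(ord(c) for c in s) & MASK32

def hash_polynomial(s):
    """Option 3: sum of s_i * 37^i, wrapping around like unsigned overflow."""
    total = 0
    power = 1  # holds 37^i, kept within 32 bits
    for c in s:
        total = (total + ord(c) * power) & MASK32
        power = (power * 37) & MASK32
    return total

def slot(hash_value, table_size):
    """Mod the unsigned hash value by the table size."""
    return hash_value % table_size

# Anagrams collide under option 2 but not (here) under option 3.
print(hash_char_sum("abc") == hash_char_sum("cba"))
print(hash_polynomial("abc") == hash_polynomial("cba"))
```

The wrap-around is deterministic: the same string always produces the same value, which is all that correctness requires.
</script></section>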
<section data-markdown><script type="text/template">
## Collision Resolution
- Collision: when two keys map to the same location in the hash table
- Two primary ways to resolve collisions:
1. Separate Chaining (make each spot in the table a 'bucket' or a collection)
2. Open Addressing, of which there are 3 types:
- Linear probing
- Quadratic probing
- Double hashing
</script></section>
</section>
<section>
<section id="separatechaining" data-markdown class="center"><script type="text/template">
# Separate Chaining
</script></section>
<section data-markdown><script type="text/template">
## Separate Chaining
- An animation of this can be found [here](https://www.cs.usfca.edu/~galles/visualization/OpenHash.html)
- Although it is really called "separate chaining", that site calls it "open hashing"
</script></section>
<section>
<h2>Separate Chaining</h2>
<table class="transparent"><tr><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"></td></tr>
<tr><td>1</td><td class="border" style="width:100px"></td></tr>
<tr><td>2</td><td class="border" style="width:100px"></td></tr>
<tr><td>3</td><td class="border" style="width:100px"></td></tr>
<tr><td>4</td><td class="border" style="width:100px"></td></tr>
<tr><td>5</td><td class="border" style="width:100px"></td></tr>
<tr><td>6</td><td class="border" style="width:100px"></td></tr>
<tr><td>7</td><td class="border" style="width:100px"></td></tr>
<tr><td>8</td><td class="border" style="width:100px"></td></tr>
<tr><td>9</td><td class="border" style="border-bottom:medium solid;width:100px"></td></tr>
</table>
</td><td style="width:200px"></td><td class="top">
<div style="font-size:120%;line-height:110%">
<ul>
<li>All keys that map to the same hash value are kept in a "bucket"<ul>
<li>This "bucket" is another data structure, typically a linked list</li></ul><br> </li>
<li><i>hash</i>(<i>k</i>) = <i>k</i> mod 10<br> </li>
<li>Insert: 10, 22, 107, 12, 42</li>
</ul></div>
</td></tr></table>
<script type="text/javascript">insertCanvas();</script>
</section>
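<section data-markdown><script type="text/template">
## Separate Chaining in Code
A minimal Python sketch of separate chaining with linked-list buckets, using the keys from the previous slide. The class and method names are hypothetical, and inserting at the *front* of each chain is an assumption of this sketch.

```python
class Node:
    """One link in a bucket's chain."""
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node

class ChainedHashTable:
    def __init__(self, table_size=10):
        # Each cell holds the head of a chain, or None if the bucket is empty.
        self.buckets = [None] * table_size

    def _slot(self, key):
        return key % len(self.buckets)   # hash(k) = k mod table size

    def insert(self, key):
        # Assumption: new keys go on the front of the chain.
        i = self._slot(key)
        self.buckets[i] = Node(key, self.buckets[i])

    def find(self, key):
        node = self.buckets[self._slot(key)]
        while node is not None:          # walk only this one chain
            if node.key == key:
                return True
            node = node.next
        return False

    def chain(self, i):
        """Return the keys in bucket i, front to back."""
        keys, node = [], self.buckets[i]
        while node is not None:
            keys.append(node.key)
            node = node.next
        return keys

table = ChainedHashTable(10)
for k in (10, 22, 107, 12, 42):
    table.insert(k)
print(table.chain(2))   # 22, 12, and 42 share bucket 2
```

Keys 22, 12, and 42 all hash to bucket 2, so a find for any of them walks that one chain.
</script></section>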
<section data-markdown><script type="text/template">
## Analysis of find
- Definition: The *load factor*, λ, of a hash table is the number of elements divided by the table size
- For separate chaining, λ is the average number of elements in a bucket
- Average time on unsuccessful find: λ
- Average length of a list at *hash*(*k*)
- Average time on successful find: 1 + (λ/2)
- One node, plus half the average length of a list (not including the item)
</script></section>
<section data-markdown><script type="text/template">
## Load factor
- How big should we make the hash table?
- Well, we want "constant" time for find and insert...
- Possible sizes for hash table with separate chaining
- λ = 1
- Make the table size equal to the expected number of elements; the average bucket size is then 1
- Also make it a prime number
- λ = 0.75
- The default for Java's [Hashtable](http://docs.oracle.com/javase/7/docs/api/java/util/Hashtable.html), but it can be set to another value
- Table will always be bigger than number of elements
- This reduces the chance of a collision!
- Good trade-off between memory use and time
- λ = 0.5
- Uses more memory, but fewer collisions
</script></section>
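<section data-markdown><script type="text/template">
## Picking a Table Size in Code
One way to turn a target load factor into a table size, sketched in Python: divide the expected number of elements by λ, then round up to the next prime. The function names are hypothetical, and `math.isqrt` assumes Python 3.8 or later.

```python
import math

def is_prime(n):
    """Trial division; fine for table-size-scale numbers."""
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

def next_prime(n):
    """Smallest prime that is >= n."""
    while not is_prime(n):
        n += 1
    return n

def table_size_for(expected_elements, target_load_factor):
    """Smallest prime size that keeps the load factor at or below the target."""
    return next_prime(math.ceil(expected_elements / target_load_factor))

def load_factor(num_elements, table_size):
    return num_elements / table_size

for lam in (1.0, 0.75, 0.5):
    size = table_size_for(100, lam)
    print(lam, size, load_factor(100, size))
```

A lower target load factor gives a larger table: more memory, but fewer expected collisions.
</script></section>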
<section data-markdown><script type="text/template">
## Separate Chaining: find()
- Note that we now have to keep each key in the chain, as well as the value!
- What is the worst case?
- Hint: [Wikipedia](http://en.wikipedia.org/wiki/Hash_table) is wrong on this one...
- In the worst case, every key could hash to the same spot!
- Since the secondary data structure is almost always a linked list, this means a find will be a Θ(*n*) operation!
- What is the "hopeful" case?
</script></section>
<section data-markdown><script type="text/template">
## What data structure to use for the buckets?
- AVL & red-black trees will give the best running time
- But that's a lot of overhead!
- Vectors are easier and simpler, but take up a *lot* of space
- All those extra, unused, cells
- Don't *ever* use vectors for this!
- Linked lists are quick and easy, and take up very little extra space
- But finding an item in a linked-list bucket is Θ(*n*)!
- Still faster *in practice* than trees due to having a very small number of items in the bucket
</script></section>
<section data-markdown><script type="text/template">
## Requirements for "Hopeful" Case
- Our ideal hash function and hash table:
- Function *hash*(*k*) is well distributed for the key space
- For a randomly selected *k* ∈ *K*,
- probability(*hash*(*k*) = *i*) = 1/*table_size*
- Size of table scales linearly with number of elements
- Expected bucket size is Θ(*num_elements* / *table_size*)
- Finding a good hash function can be tough
</script></section>
<section data-markdown><script type="text/template">
## Separate chaining insert is Θ(1)
- In an unsorted linked list, you can just put it on the front
- So all inserts into a separate chaining hash table that uses linked lists take constant time
- If you were to keep the linked list *sorted*, each insert would be linear time
- And finds (and thus deletes) are still linear time
</script></section>
</section>
<section>
<section id="openaddressing" data-markdown class="center"><script type="text/template">
# Open Addressing
</script></section>
<section data-markdown><script type="text/template">
## Open addressing
- An animation of all three open addressing strategies can be found [here](https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html)
- Although it is really called "open addressing", that site calls it "closed hashing"
</script></section>
<section data-markdown><script type="text/template">
## Saving Memory
![separate chaining diagram](images/06-hashes/separate-chaining-diagram.png)
<center>Can we avoid the overhead of all those linked lists?</center>
</script></section>
<section data-markdown><script type="text/template">
## Three Types of Probing Strategies
- The three types:
- Linear
- Quadratic
- Double hashing
- The general idea with all of them: if a spot is occupied, 'probe' (try) other spots in the table
- How we determine where else to probe depends on which strategy we are using
</script></section>
<section>
<h2>Linear Probing</h2>
<table class="transparent"><tr><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="3">37</span></td></tr>
<tr><td>1</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="4">14</span></td></tr>
<tr><td>2</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="5">21</span></td></tr>
<tr><td>3</td><td class="border" style="width:100px"></td></tr>
<tr><td>4</td><td class="border" style="width:100px"></td></tr>
<tr><td>5</td><td class="border" style="width:100px"></td></tr>
<tr><td>6</td><td class="border" style="width:100px"></td></tr>
<tr><td>7</td><td class="border" style="width:100px"></td></tr>
<tr><td>8</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="2">27</span></td></tr>
<tr><td>9</td><td class="border" style="border-bottom:medium solid;width:100px"><span class="fragment" data-fragment-index="1">4</span></td></tr>
</table>
</td><td style="width:200px"></td><td class="top">
<ul>
<li>Check spots in this order:<ul>
<li><i>hash</i>(<i>k</i>)</li>
<li><i>hash</i>(<i>k</i>)+1</li>
<li><i>hash</i>(<i>k</i>)+2</li>
<li><i>hash</i>(<i>k</i>)+3</li>
<li>etc.</li>
</ul> </li>
<li><i>hash</i>(<i>k</i>) = 3<i>k</i>+7<ul><li>Which is then mod'ed by the table size (10)</li><li>Result: <i>hash</i>(<i>k</i>) = (3<i>k</i>+7) mod 10</li></ul> </li>
<li>Insert: 4, 27, 37, 14, 21<ul>
<li><i>hash</i>(<i>k</i>) values: 19, 88, 118, 49, 70, respectively</li>
</ul></li>
</ul>
</td></tr></table>
</section>
<section data-markdown><script type="text/template">
## Linear Probing
- With all open addressing schemes, we examine ('probe') the cells in the order:
- *p*<sub>0</sub>(*k*), *p*<sub>1</sub>(*k*), *p*<sub>2</sub>(*k*), ...
- where: *p*<sub>i</sub>(*k*) = (*hash*(*k*) + *f*(*i*)) mod *table_size*
- With *linear probing*, <span class="red">*f*(*i*) = *i*</span>
- After searching spot *hash*(*k*) in the array, look in:
- *hash*(*k*) + 1
- *hash*(*k*) + 2
- *hash*(*k*) + 3
- etc.
</script></section>
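<section data-markdown><script type="text/template">
## Linear Probing in Code
A minimal Python sketch of insertion with linear probing, using the hash function and keys from the earlier Linear Probing example. The function names are hypothetical, and this sketch does not handle removal.

```python
def hash_key(k):
    """The example's hash function, before modding by the table size."""
    return 3 * k + 7

def linear_probe_insert(table, k):
    """Insert k using f(i) = i and return the index where it landed."""
    size = len(table)
    for i in range(size):
        p = (hash_key(k) + i) % size   # p_i(k) = (hash(k) + i) mod table_size
        if table[p] is None:
            table[p] = k
            return p
    raise RuntimeError("table is full")

table = [None] * 10
for k in (4, 27, 37, 14, 21):
    print(k, "->", linear_probe_insert(table, k))
print(table)
```

Keys 37, 14, and 21 each have to step past occupied cells, which is the start of a primary cluster around cells 8, 9, 0, 1, and 2.
</script></section>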
<section data-markdown><script type="text/template">
## Problems with Linear Probing
- Primary clustering
- Large blocks of occupied cells
- As the table fills, more probes are required to resolve a collision
- And thus slower lookup times
- "Holes" when an element is removed
- We'll see how to solve this later
- When to stop looking?
</script></section>
<section data-markdown><script type="text/template">
## Quadratic Probing
- With all open addressing schemes, we examine ('probe') the cells in the order:
- *p*<sub>0</sub>(*k*), *p*<sub>1</sub>(*k*), *p*<sub>2</sub>(*k*), ...
- where: *p*<sub>i</sub>(*k*) = (*hash*(*k*) + *f*(*i*)) mod *table_size*
- With *quadratic probing*, <span class="red">*f*(*i*) = *i*<sup>2</sup></span>
- After searching spot *hash*(*k*) in the array, look in:
- *hash*(*k*) + 1
- *hash*(*k*) + 4
- *hash*(*k*) + 9
- etc.
</script></section>
<section>
<h2>Quadratic Probing</h2>
<table class="transparent"><tr><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="3">14 </span></td></tr>
<tr><td>1</td><td class="border" style="width:100px"></td></tr>
<tr><td>2</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="4">37</span></td></tr>
<tr><td>3</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="5">22</span></td></tr>
<tr><td>4</td><td class="border" style="width:100px"></td></tr>
<tr><td>5</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="6">34</span></td></tr>
<tr><td>6</td><td class="border" style="width:100px"></td></tr>
<tr><td>7</td><td class="border" style="width:100px"></td></tr>
<tr><td>8</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="2">27</span></td></tr>
<tr><td>9</td><td class="border" style="border-bottom:medium solid;width:100px"><span class="fragment" data-fragment-index="1">4</span></td></tr>
</table>
</td><td style="width:200px"></td><td class="top">
<ul>
<li>Check spots in this order:<ul>
<li><i>hash</i>(<i>k</i>)</li>
<li><i>hash</i>(<i>k</i>)+1<sup>2</sup> = <i>hash</i>(<i>k</i>)+1</li>
<li><i>hash</i>(<i>k</i>)+2<sup>2</sup> = <i>hash</i>(<i>k</i>)+4</li>
<li><i>hash</i>(<i>k</i>)+3<sup>2</sup> = <i>hash</i>(<i>k</i>)+9</li>
<li>etc.</li>
</ul> </li>
<li><i>hash</i>(<i>k</i>) = 3<i>k</i>+7<ul><li>Which is then mod'ed by the table size (10)</li><li>Result: <i>hash</i>(<i>k</i>) = (3<i>k</i>+7) mod 10</li></ul> </li>
<li>Insert: 4, 27, 14, 37, 22, 34<ul>
<li><i>hash</i>(<i>k</i>) values: 19, 88, 49, 118, 73, 109, respectively</li>
</ul></li>
</ul>
</td></tr></table>
</section>
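<section data-markdown><script type="text/template">
## Quadratic Probing in Code
The example on the previous slide, sketched in Python. The function names are hypothetical, and giving up after `table_size` probes is an assumption of this sketch.

```python
def hash_key(k):
    """The example's hash function, before modding by the table size."""
    return 3 * k + 7

def quadratic_probe_insert(table, k):
    """Insert k using f(i) = i^2 and return the index where it landed."""
    size = len(table)
    for i in range(size):
        p = (hash_key(k) + i * i) % size   # p_i(k) = (hash(k) + i^2) mod table_size
        if table[p] is None:
            table[p] = k
            return p
    # Quadratic probing can miss free cells, so this is not the same as "full".
    raise RuntimeError("no free cell found in the probe sequence")

table = [None] * 10
for k in (4, 27, 14, 37, 22, 34):
    print(k, "->", quadratic_probe_insert(table, k))
print(table)
```

Key 34 needs five probes (cells 9, 0, 3, 8, then 5) before it finds a free cell.
</script></section>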
<section data-markdown><script type="text/template">
## Double Hashing
- With all open addressing schemes, we examine ('probe') the cells in the order:
- *p*<sub>0</sub>(*k*), *p*<sub>1</sub>(*k*), *p*<sub>2</sub>(*k*), ...
- where: *p*<sub>i</sub>(*k*) = (*hash*(*k*) + *f*(*i*)) mod *table_size*
- With *double hashing*, <span class="red">*f*(*i*) = *i* \* hash<sub>2</sub>(*k*)</span>
- Which means we have to define a *secondary* hash function!
- After searching spot *hash*(*k*) in the array, look in:
- *hash*(*k*) + 1 \* *hash*<sub>2</sub>(*k*)
- *hash*(*k*) + 2 \* *hash*<sub>2</sub>(*k*)
- *hash*(*k*) + 3 \* *hash*<sub>2</sub>(*k*)
- etc.
</script></section>
<section>
<h2>Double Hashing</h2>
<table class="transparent"><tr><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="5">69</span></td></tr>
<tr><td>1</td><td class="border" style="width:100px"></td></tr>
<tr><td>2</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="6">60</span></td></tr>
<tr><td>3</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="3">58</span></td></tr>
<tr><td>4</td><td class="border" style="width:100px"></td></tr>
<tr><td>5</td><td class="border" style="width:100px"></td></tr>
<tr><td>6</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="4">49</span></td></tr>
<tr><td>7</td><td class="border" style="width:100px"></td></tr>
<tr><td>8</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="2">18</span></td></tr>
<tr><td>9</td><td class="border" style="border-bottom:medium solid;width:100px"><span class="fragment" data-fragment-index="1">89</span></td></tr>
</table>
</td><td style="width:200px"></td><td class="top">
<ul>
<li>Check spots in this order:<ul>
<li><i>hash</i>(<i>k</i>)</li>
<li><i>hash</i>(<i>k</i>) + 1 * <i>hash</i><sub>2</sub>(<i>k</i>)</li>
<li><i>hash</i>(<i>k</i>) + 2 * <i>hash</i><sub>2</sub>(<i>k</i>)</li>
<li><i>hash</i>(<i>k</i>) + 3 * <i>hash</i><sub>2</sub>(<i>k</i>)</li>
<li>etc.</li>
</ul> </li>
<li><i>hash</i>(<i>k</i>) = <i>k</i><ul>
<li>The hash function was made simpler for this example...</li>
<li>Which is then mod'ed by the table size (10)</li>
<li>Result: <i>hash</i>(<i>k</i>) = <i>k</i> mod 10</li></ul></li>
<li><i>hash</i><sub>2</sub>(<i>k</i>) = 7 - (<i>k</i> mod 7)<br> </li>
<li>Insert: 89, 18, 58, 49, 69, 60</li>
</ul>
</td></tr></table>
</section>
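<section data-markdown><script type="text/template">
## Double Hashing: A Sketch in Code
A minimal Python sketch that replays the example from the previous slide, using *hash*(*k*) = *k* mod 10 and *hash*<sub>2</sub>(*k*) = 7 - (*k* mod 7). The function name `double_hash_insert` and the decision to stop after `table_size` probes are assumptions made for illustration.

```python
def double_hash_insert(table, key):
    """Insert key using double hashing; return the slot used, or None."""
    size = len(table)
    home = key % size                 # hash(k) = k mod table size
    step = 7 - (key % 7)              # hash2(k) = 7 - (k mod 7)
    for i in range(size):             # assumption: give up after `size` probes
        slot = (home + i * step) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    return None


table = [None] * 10
for key in [89, 18, 58, 49, 69, 60]:
    double_hash_insert(table, key)

print(table)  # [69, None, 60, 58, None, None, 49, None, 18, 89]
```
</script></section>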
<section>
<h2>Double Hashing Thrashing</h2>
<table class="transparent"><tr><td class="top">
<table class="transparent">
<tr><td>0</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="1">10</span></td></tr>
<tr><td>1</td><td class="border" style="width:100px"></td></tr>
<tr><td>2</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="2">12</span></td></tr>
<tr><td>3</td><td class="border" style="width:100px"></td></tr>
<tr><td>4</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="3">14</span></td></tr>
<tr><td>5</td><td class="border" style="width:100px"></td></tr>
<tr><td>6</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="4">16</span></td></tr>
<tr><td>7</td><td class="border" style="width:100px"></td></tr>
<tr><td>8</td><td class="border" style="width:100px"><span class="fragment" data-fragment-index="5">18</span></td></tr>
<tr><td>9</td><td class="border" style="border-bottom:medium solid;width:100px"></td></tr>
</table>
</td><td style="width:200px"></td><td class="top">
<ul>
<li><i>hash</i>(<i>k</i>) = <i>k</i> <ul>
<li>Same as the previous slide</li>
<li>Result: <i>hash</i>(<i>k</i>) = <i>k</i> mod 10</li></ul> </li>
<li><i>hash</i><sub>2</sub>(<i>k</i>) = (<i>k</i> mod 5) +1<br> </li>
<li>Insert: 10, 12, 14, 16, 18, <span class='red'>36</span></li>
</ul>
</td></tr></table>
</section>
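<section data-markdown><script type="text/template">
## Double Hashing Thrashing: A Sketch in Code
A minimal Python sketch of why inserting 36 never succeeds on the previous slide: with *hash*<sub>2</sub>(36) = 2 and a table size of 10, the probes cycle through the same five occupied cells. The helper name `probes_until_repeat` is an assumption for illustration.

```python
def probes_until_repeat(key, table_size):
    """List the cells probed for key until the sequence starts repeating."""
    step = (key % 5) + 1              # hash2(k) = (k mod 5) + 1
    slot = key % table_size           # hash(k) = k mod table size
    seen = []
    while slot not in seen:
        seen.append(slot)
        slot = (slot + step) % table_size
    return seen


# The table after inserting 10, 12, 14, 16, 18
table = [10, None, 12, None, 14, None, 16, None, 18, None]
cycle = probes_until_repeat(36, len(table))

print(cycle)                                        # [6, 8, 0, 2, 4]
print(all(table[s] is not None for s in cycle))     # True: every probed cell is full
```
</script></section>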
<section data-markdown><script type="text/template">
## Table size must be prime!
- The table size must always be a prime number
- It will prevent the thrashing from the previous slide
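- Thrashing can only occur when the secondary hash value shares a *factor* (other than 1) with the table size
- On the previous slide, *hash*<sub>2</sub>(36) = 2 and the table size is 10, so only 5 of the 10 cells are ever probed
- The only factors of a prime number *p* are 1 and *p*
- A step of 1 is effectively linear probing, which is fine
- A step that is a multiple of *p* will mod to zero, which is an invalid return value for a secondary hash function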
- It will provide better distribution of the hash keys into the table
- Less clustering, etc.
- A prime number table size does not remove the need for a good hash function!
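
A short Python sketch can check this claim by counting how many distinct cells each step size reaches. The helper name `cells_reached` and the sizes 10 and 11 are assumptions chosen for illustration.

```python
from math import gcd


def cells_reached(step, table_size):
    """Count the distinct cells visited by probing 0, step, 2*step, ... (mod table_size)."""
    return len({(i * step) % table_size for i in range(table_size)})


# Table size 10 (not prime): steps sharing a factor with 10 miss some cells
print([cells_reached(s, 10) for s in range(1, 10)])  # [10, 5, 10, 5, 2, 5, 10, 5, 10]

# Table size 11 (prime): every non-zero step reaches all 11 cells
print([cells_reached(s, 11) for s in range(1, 11)])  # [11, 11, 11, 11, 11, 11, 11, 11, 11, 11]

# In general the number of cells reached is table_size / gcd(step, table_size)
for size in (10, 11):
    for step in range(1, size):
        assert cells_reached(step, size) == size // gcd(step, size)
```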
</script></section>
</section>
<section>
<section id="miscellaneous" data-markdown class="center"><script type="text/template">
# Miscellaneous
</script></section>
<section data-markdown><script type="text/template">
## Rehashing
- Problem: when the table gets too full, running time for operations increases
- Solution: create a bigger table and hash all the items from the original table into the new table
- The position in a table is dependent on the table size, which means we have to *rehash* each value
- This means we have to re-compute the hash value for *each* element, and insert it into the new table!
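
A minimal Python sketch of rehashing, assuming *hash*(*k*) = *k* mod *table_size* with linear probing; the function names `insert` and `rehash` and the sizes 5 and 11 are illustrative assumptions. Note how each key lands in a different spot once the table size changes.

```python
def insert(table, key):
    """Insert key with linear probing; return True on success."""
    size = len(table)
    for i in range(size):
        slot = (key + i) % size       # hash(k) = k mod table size, then probe linearly
        if table[slot] is None:
            table[slot] = key
            return True
    return False                      # table is full


def rehash(old_table, new_size):
    """Build a bigger table by re-inserting every key from the old one."""
    new_table = [None] * new_size
    for key in old_table:
        if key is not None:
            insert(new_table, key)    # the slot depends on the NEW table size
    return new_table


old = [None] * 5
for key in [12, 7, 22]:
    insert(old, key)
print(old)   # [None, None, 12, 7, 22]

new = rehash(old, 11)
print(new)   # [22, 12, None, None, None, None, None, 7, None, None, None]
```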
</script></section>
<section data-markdown><script type="text/template">
## Rehashing
- When to rehash?
- When half full (λ = 0.5)
- When mostly full (λ = 0.75)
- Java's hashtable does this by default
- When an insertion fails
- Some other threshold
- Cost of rehashing
- Let's assume that the hash function computation is constant
- We have to do *n* inserts, and if each key hashes to the same spot, then it will be a Θ(*n*<sup>2</sup>) operation!
- Although it is not likely to ever run that slow
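
One way the load factor threshold might be wired in is sketched below in Python. The class name `ProbingSet`, the use of linear probing, the 0.75 threshold, and growing to the next prime at least twice the old size are all assumptions for illustration, not a prescribed design.

```python
def is_prime(n):
    """Trial-division primality test (fine for small table sizes)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n):
    """Smallest prime >= n."""
    while not is_prime(n):
        n += 1
    return n


class ProbingSet:
    MAX_LOAD = 0.75                   # rehash once lambda would exceed this

    def __init__(self, size=5):
        self.table = [None] * size
        self.count = 0

    def _place(self, table, key):
        """Linear-probe key into table; return False if it is already there."""
        slot = key % len(table)
        while table[slot] is not None:    # terminates: the table is never full
            if table[slot] == key:
                return False
            slot = (slot + 1) % len(table)
        table[slot] = key
        return True

    def _rehash(self):
        new_table = [None] * next_prime(2 * len(self.table))
        for key in self.table:
            if key is not None:
                self._place(new_table, key)
        self.table = new_table

    def insert(self, key):
        if (self.count + 1) / len(self.table) > self.MAX_LOAD:
            self._rehash()
        if self._place(self.table, key):
            self.count += 1


s = ProbingSet()
for key in range(10):
    s.insert(key)
print(len(s.table), s.count)  # 23 10  (the table grew 5 -> 11 -> 23)
```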
</script></section>
<section data-markdown><script type="text/template">
## Removing an element
- How to handle this?
- You could:
- Rehash upon each delete, which is *very* expensive
- Put in a 'placeholder' or 'sentinel' value
- But the table gets filled with these rather fast
- Perhaps rehashing after a certain number of deletes
- Disallow deletes entirely; you can do this for [lab 6](../labs/lab06/index.html)
- Hash tables are not an ideal data structure if you need to perform a lot of deletions
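
A minimal Python sketch of the 'placeholder' approach, assuming linear probing with *hash*(*k*) = *k* mod *table_size*; the names `EMPTY`, `DELETED`, `find_slot`, `insert`, and `remove` are illustrative. For simplicity, `insert` assumes the key is not already in the table.

```python
EMPTY = None
DELETED = object()                    # the placeholder ('tombstone') value


def find_slot(table, key):
    """Return the slot holding key, or None if it is absent."""
    size = len(table)
    for i in range(size):
        slot = (key + i) % size
        if table[slot] is EMPTY:      # a truly empty cell ends the search
            return None
        if table[slot] is not DELETED and table[slot] == key:
            return slot               # placeholders are skipped, not treated as empty
    return None


def insert(table, key):
    """Insert key (assumed not present), reusing placeholder cells."""
    size = len(table)
    for i in range(size):
        slot = (key + i) % size
        if table[slot] is EMPTY or table[slot] is DELETED:
            table[slot] = key
            return slot
    return None


def remove(table, key):
    """Replace key with the placeholder; return True if it was found."""
    slot = find_slot(table, key)
    if slot is None:
        return False
    table[slot] = DELETED
    return True


table = [EMPTY] * 10
for key in (5, 15, 25):               # all hash to 5, so they fill slots 5, 6, 7
    insert(table, key)
remove(table, 15)                     # slot 6 becomes a placeholder
found = find_slot(table, 25)          # 7: the search continues past the placeholder
reused = insert(table, 35)            # 6: the placeholder cell is reused
print(found, reused)                  # 7 6
```

Had slot 6 simply been reset to empty, the search for 25 would have stopped there and wrongly reported it missing.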
</script></section>
<section data-markdown><script type="text/template">
## Hashing: MD5
- [MD5: Message Digest 5](http://en.wikipedia.org/wiki/Md5)
- Given a string (or file contents, etc.) generate a 128-bit hash
- 2<sup>128</sup> ≈ 3.4 × 10<sup>38</sup> (coincidentally, this is also roughly the [maximum finite value](03-numbers.html#/maxfloatvalue) of a `float`)
- An MD5 hash is typically written as 32 hex digits: 16e28b7986fd74f65b061de89dc8b78e
- This could then be used as the key
- After mod'ing it by the table size, of course
- (Was) good for checking if a download completed successfully
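
A short Python sketch of using an MD5 digest as a hash table key, using the standard library's `hashlib`. The helper name `md5_slot` and the table size of 101 are assumptions for illustration.

```python
import hashlib


def md5_slot(text, table_size):
    """Map a string to a table slot via its 128-bit MD5 digest."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()   # 32 hex digits
    return int(digest, 16) % table_size                      # mod by the table size


digest = hashlib.md5(b"hello").hexdigest()
print(digest)                     # 5d41402abc4b2a76b9719d911017c592
print(len(digest) * 4)            # 128 (bits)
print(md5_slot("hello", 101))     # some slot in the range 0..100
```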
</script></section>
<section data-markdown><script type="text/template">
## Can you reverse an MD5 hash?
- Technically, no
- A 129-bit file has 2<sup>129</sup> possibilities, and if you were to hash each one, it would go into 2<sup>128</sup> buckets
- By the pigeonhole principle, there would be at least one hash value (pigeonhole) with multiple keys (pigeons), and you don't know which one
- In reality, *many* (and eventually *all*) would have multiple keys
- But if a password is stored by its MD5 hash...
- ... then there are enough online hash libraries that you can find at least one password that hashes to that value
- Try Googling for [3858f62230ac3c915f300c664312c63f](https://www.google.com/search?q=3858f62230ac3c915f300c664312c63f)
- Plus there are lots of weaknesses in MD5...
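
The pigeonhole argument can be seen at a small scale in Python. This sketch uses a deliberately tiny 8-bit hash (the first byte of an MD5 digest, an assumption made only to keep the numbers small) and hashes all 2<sup>9</sup> nine-bit inputs into at most 2<sup>8</sup> buckets.

```python
import hashlib
from collections import defaultdict


def tiny_hash(n):
    """An 8-bit hash: the first byte of the MD5 of n's 2-byte encoding."""
    return hashlib.md5(n.to_bytes(2, "big")).digest()[0]


buckets = defaultdict(list)
for n in range(2 ** 9):               # 512 possible 9-bit inputs (pigeons)
    buckets[tiny_hash(n)].append(n)   # at most 256 hash values (pigeonholes)

largest = max(len(keys) for keys in buckets.values())
print(len(buckets) <= 2 ** 8)         # True
print(largest >= 2)                   # True: some hash value has multiple inputs
```

Given only a hash value from a crowded bucket, there is no way to tell which of its inputs produced it.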
</script></section>
<section data-markdown><script type="text/template">
## More hashing: SHA
- MD5 has been "broken"
- One can generate two files that have the same hash; this is called a [collision attack](https://en.wikipedia.org/wiki/Collision_attack)
- In fact, I have the students do something similar when I teach Defense Against the Dark Arts (albeit with a [weaker hashing algorithm](https://en.wikipedia.org/wiki/Crc32))
- So it is useless for any security-related purposes
- [SHA (Secure Hash Algorithm)](http://en.wikipedia.org/wiki/Secure_Hash_Algorithm) is a family of algorithms, the more recent of which are much more secure
- Same overall idea: it generates a hash value of up to 512 bits
- SHA-1 has also been broken, but the more recent SHAs (SHA-2 and SHA-3) are still considered secure
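
A short Python sketch comparing digest sizes across the algorithms mentioned here, using the standard library's `hashlib`; the input `b"hello"` is an arbitrary example.

```python
import hashlib

data = b"hello"
bits = {}
for name in ("md5", "sha1", "sha256", "sha512"):
    h = hashlib.new(name, data)
    bits[name] = h.digest_size * 8    # digest_size is in bytes
    print(name, bits[name], h.hexdigest())

print(bits)  # {'md5': 128, 'sha1': 160, 'sha256': 256, 'sha512': 512}
```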
</script></section>
</section>
</div>
</div>
<script src='../slides/reveal.js/dist/reveal.js'></script><script src='../slides/reveal.js/plugin/zoom/zoom.js'></script><script src='../slides/reveal.js/plugin/notes/notes.js'></script><script src='../slides/reveal.js/plugin/search/search.js'></script><script src='../slides/reveal.js/plugin/markdown/markdown.js'></script><script src='../slides/reveal.js/plugin/highlight/highlight.js'></script><script src='../slides/reveal.js/plugin/math/math.js'></script>
<script src="js/settings.js"></script>
</body>
</html>