Hash Functions FTW
1. Hash Functions FTW*
Fast Hashing, Bloom Filters & Hash-Oriented Storage
Sunny Gleason
* For the win (see urbandictionary FTW[1]); this expression has nothing to do with hash functions
2. What’s in this Presentation
• Hash Function Survey
• Hash Performance
• Bloom Filters
• HashFile : Hash Storage
3. Hash Functions
int getIntHash(byte[] data);  // 32-bit
long getLongHash(byte[] data); // 64-bit
int v1 = hash("foo".getBytes()); int v2 = hash("goo".getBytes());
static final int PRIME = 1000003; // example modulus (any large prime)
int hash(byte[] value) { // a simple rotate-and-xor hash
  int h = 0;
  for (byte b : value) { h = (h << 5) ^ (h >>> 27) ^ b; } // rotate left by 5, mix in byte
  return (h & 0x7fffffff) % PRIME; // mask the sign bit so the result is non-negative
}
4. Hash Functions
• Goal: v1 has many bit differences from v2 (see the quick check below)
• Desirable Properties:
• Uniform Distribution - as few collisions as possible
• Very Fast Computation
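As a quick illustration of that goal (a sketch, not from the original deck), you can count how many bits differ between the two hash values produced by the simple hash above:

// Sketch: measure bit differences between hash("foo") and hash("goo").
// A well-mixed 32-bit hash should flip roughly half the bits for a small input change.
int v1 = hash("foo".getBytes());
int v2 = hash("goo".getBytes());
int differingBits = Integer.bitCount(v1 ^ v2); // ideally around 16 of 32
System.out.println(differingBits + " of 32 bits differ");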
8. A Strawman “Set”
• N keys, K bytes per key
• Allocate array of size K * N bytes
• Utilize array storage as:
• a heap or tree: O(lg N) insert/lookup/remove
• a hash: O(1) insert/lookup/remove (see the sketch below)
• What if we don’t have room for K*N bytes?
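To make the strawman concrete, here is a minimal sketch (not from the original deck) of the K * N byte array used as an open-addressed hash set; the class and helper names are hypothetical, and an all-zero key is assumed not to occur so that zeroed slots can mean "empty":

// Sketch: N fixed-width keys of K bytes each, stored directly in a K * N byte array
// and probed as an open-addressed hash table (O(1) expected insert/lookup).
class StrawmanSet {
  private final int keyBytes;   // K
  private final int capacity;   // N
  private final byte[] slots;   // K * N bytes of storage

  StrawmanSet(int keyBytes, int capacity) {
    this.keyBytes = keyBytes;
    this.capacity = capacity;
    this.slots = new byte[keyBytes * capacity];
  }

  boolean add(byte[] key) {
    int start = slotFor(key);
    for (int probe = 0; probe < capacity; probe++) {
      int off = ((start + probe) % capacity) * keyBytes;
      if (isEmpty(off)) { System.arraycopy(key, 0, slots, off, keyBytes); return true; }
      if (matches(key, off)) return false; // already present
    }
    return false; // table is full
  }

  boolean contains(byte[] key) {
    int start = slotFor(key);
    for (int probe = 0; probe < capacity; probe++) {
      int off = ((start + probe) % capacity) * keyBytes;
      if (isEmpty(off)) return false;
      if (matches(key, off)) return true;
    }
    return false;
  }

  private int slotFor(byte[] key) {
    return (java.util.Arrays.hashCode(key) & 0x7fffffff) % capacity;
  }
  private boolean isEmpty(int off) {
    for (int j = 0; j < keyBytes; j++) if (slots[off + j] != 0) return false;
    return true;
  }
  private boolean matches(byte[] key, int off) {
    for (int j = 0; j < keyBytes; j++) if (slots[off + j] != key[j]) return false;
    return true;
  }
}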
9. Bloom Filter
• Key Point: give up on storing all the keys
• Store r bits per key instead of K bytes
• Allocate a bit vector of size M = r * N, where N is the expected number of entries
• Use multiple hash functions of the key to determine which bits to set (see the sketch below)
• Premise: if the hash functions are well distributed, there are few collisions and accuracy stays high
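As a concrete illustration (a minimal sketch, not the author's implementation), a Bloom filter in Java could look like this; deriving the k bit positions from two base hashes (double hashing) is an assumption, one common choice among several:

import java.util.Arrays;
import java.util.BitSet;

// Minimal Bloom filter sketch: M = r * N bits, k hash functions per key.
// The k bit positions are derived from two base hashes (double hashing).
class BloomSketch {
  private final BitSet bits;
  private final int m;   // number of bits (M = r * N)
  private final int k;   // number of hash functions

  BloomSketch(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

  void add(byte[] key) {
    int h1 = Arrays.hashCode(key);
    int h2 = h1 * 0x9E3779B1;                        // second hash via a multiplicative mix
    for (int i = 0; i < k; i++) bits.set(index(h1, h2, i));
  }

  boolean mightContain(byte[] key) {
    int h1 = Arrays.hashCode(key);
    int h2 = h1 * 0x9E3779B1;
    for (int i = 0; i < k; i++) if (!bits.get(index(h1, h2, i))) return false;
    return true;                                     // "yes" may be a false positive
  }

  private int index(int h1, int h2, int i) { return ((h1 + i * h2) & 0x7fffffff) % m; }
}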
11. Tuning Bloom Filters
Let r = M bits / N keys (r: number of bits per key)
Let k = 0.7 * r (k: number of hashes to use)
Let p = 0.6185^r (p: probability of false positives)
Working backwards, we can use the desired false positive rate p to tune the data structure’s space consumption:
r = 8, p = 2.1e-2
r = 16, p = 4.5e-4
r = 24, p = 9.8e-6
r = 32, p = 2.1e-7
r = 40, p = 4.5e-9
r = 48, p = 9.6e-11
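For example, inverting p ≈ 0.6185^r gives r ≈ ln(p) / ln(0.6185); a small helper along those lines (a sketch using the slide's approximations, with hypothetical names) might be:

// Sketch: derive Bloom filter sizing from the approximations above.
//   p ≈ 0.6185^r  =>  r ≈ ln(p) / ln(0.6185),  k ≈ 0.7 * r
class BloomTuning {
  static int bitsPerKey(double targetFalsePositiveRate) {
    return (int) Math.ceil(Math.log(targetFalsePositiveRate) / Math.log(0.6185));
  }
  static int numHashes(int bitsPerKey) {
    return Math.max(1, (int) Math.round(0.7 * bitsPerKey));
  }
  public static void main(String[] args) {
    int r = bitsPerKey(1e-6); // ~29 bits per key for a one-in-a-million false positive rate
    int k = numHashes(r);     // ~20 hash functions
    System.out.println("r=" + r + ", k=" + k);
  }
}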
12. Bloom Filter Performance
100MM entries, 8 bits/key: 833k ops/s
100MM entries, 32 bits/key: 256k ops/s
1BN entries, 8 bits/key: 714k ops/s
1BN entries, 32 bits/key: 185k ops/s
Hypothesis: the difference between 100MM and 1BN entries is due to locality of memory access in the smaller bit vector
13. Hash-Oriented Storage
• HashFile: a 64-bit clone of djb’s constant db “CDB”
• Plain ol’ Key/Value storage: add(byte[] k, byte[] v), byte[] lookup(byte[] k)
• Constant aka “Immutable” Data Store: create(), add(k, v) ..., build() ... before lookup(k)
• Uses properties of the hash table to achieve O(1) disk seeks per lookup (see the sketch below)
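A rough sketch of that write-once lifecycle as a Java API (the interface split and any names beyond the add/build/lookup calls listed above are assumptions):

// Sketch of the constant (write-once) key/value lifecycle: create a writer,
// add all pairs, build the tables, then serve lookups from the finished file.
interface HashFileWriter {
  void add(byte[] key, byte[] value); // append a key/value pair to the body
  void build();                       // finalize: write the hash tables, backfill the header
}

interface HashFileReader {
  byte[] lookup(byte[] key);          // O(1) disk seeks; returns null if the key is absent
}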
14. HashFile Structure
• Header (fixed width): table pointers; contains the offsets of the hash tables and the count of elements per table
• Body (variable width): contains the concatenation of all keys and values (with data lengths)
• Footer (fixed width): hash “tables” containing long hash values of keys alongside long offsets into the body
15. HashFile Diagram
HEADER: p1 s3 p2 s4 p3 s2 p4 s1 (table pointers & sizes)
BODY: k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6 k7 v7 (keys & values)
FOOTER: hk7 o7 hk3 o3 hk4 o4 hk1 o1 (key hashes & body offsets)
• Create: initialize an empty header, start appending keys/values while recording offsets and hash values of keys
• Build: take the list of hash values and offsets and turn them into hash tables, backfill the header with those values
• Lookup: compute hash(key), compute the offset into the table (hash modulo size of table), use the table to find the offset into the body, return the value from the body (see the sketch below)
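A minimal sketch of that lookup path over in-memory stand-ins for the footer and body (the parallel arrays, the linear probing, and the "zero hash means empty slot" convention are assumptions for illustration):

import java.util.Arrays;

// Sketch: hash the key, take (hash mod table size) as the starting slot, follow
// the recorded offset to the key/value record, and verify the key bytes there.
class HashFileLookupSketch {
  long[] slotHash;     // footer stand-in: 64-bit key hash per slot (0 = empty)
  int[]  slotOffset;   // footer stand-in: record index ("offset into body") per slot
  byte[][] keys;       // body stand-in: key bytes per record
  byte[][] values;     // body stand-in: value bytes per record

  byte[] lookup(byte[] key, long keyHash) {
    int n = slotHash.length;
    int slot = (int) Long.remainderUnsigned(keyHash, n);  // hash modulo size of table
    for (int probe = 0; probe < n; probe++) {             // linear probing
      int i = (slot + probe) % n;
      if (slotHash[i] == 0) return null;                  // empty slot: key is absent
      if (slotHash[i] == keyHash && Arrays.equals(keys[slotOffset[i]], key)) {
        return values[slotOffset[i]];                     // verified match: value from the "body"
      }
    }
    return null;
  }
}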
16. HashFile Performance
• Spec: ≤ 2 disk seeks per lookup
• Number of seeks is independent of the number of entries
• X25E SSD: 1BN 8-byte keys and values (41GB): 650μs lookup w/ cold cache, up to 700x faster as the filesystem cache warms, 0.9μs when in-memory
• With 100MM entries (4GB), cold cache is ~600μs (from locality), 0.6μs warm
17. Conclusions
• Be aware of different Hash Functions and their collision / performance tradeoffs
• Bloom Filters are extremely useful for fast, large-scale set membership
• HashFile provides excellent performance in cases where a static K/V store suffices
18. Future Work
• Implement cWow hash in Java
• Extend HashFile with configurable hash, pointer, and key/value lengths to conserve space (reduce the 24 bytes-per-KV overhead)
• Implement a read-write (non-constant) version of HashFile
• Bloom Filter that spills to SSD