Skip to content

Consistent hashing

Sam Burdick edited this page Jun 23, 2018 · 1 revision

Consistent hashing is a technique for mapping resources (in our case, bit vectors) to machines. The client will write a query involving vectors v1, v2, ..., vn and send it to the master node. The master will map each vector to its location (really a set of locations; it can pick one) and distribute the query across the machines that have that machine. There are two algorithms, Ring and Jump, that we have implemented for this system with ease.
The advantage of consistent hashing over conventional hashing is that on average only K/n keys, where K is the number of keys and n the number of buckets, need to be remapped when we add or remove a cache. This provides us with a bit of fault/addition tolerance since the master node won't be kept busy for terribly long when the network changes.

Ring

Ring is the classical CH algorithm introduced in Karger et al. It basically maps each machine (cache) to different points on a circle. To find where a resource is located,
Implemented by Sam.

Algorithm

Each cache's identifier is added as a node in a red-black tree a number of times equal to the cache's duplication factor df. Each node associated with the cache should contain the hash value of the cache's identifier plus some number between zero and its df. To find which cache a given resource should map to, hash the resource id and find the node with the next greater hash value in the tree.

Analysis

Because we're using an RBT we have amortized O(log N) performance for all operations, where N = Σ cache.df. However, for large numbers of nodes, performance may suffer if cache misses are frequent.

Jump

Implemented by Jahrme.

Clone this wiki locally