Algorithm notes from
A. Limasset, G. Rizk, R. Chikhi, and P. Peterlongo: Fast and scalable minimal perfect hashing for massive key sets. https://arxiv.org/pdf/1702.03154.pdf
The goal of this algorithm is to create a conflict-free hash of the keys in the collection. The strategy is to create a zero-valued bit vector with at least 1 bit per key, and hash each key to a position in that vector. Any key that hashes uniquely is assigned that position, and removed from the working set.
Any remaining keys become the working set for the next iteration: We create a fresh zero-valued bit vector with at least one bit per key, and repeat the process with a new hash function. This bit vector is concatenated with the ones generated during previous rounds.
In principle, we continue this iteration until there are no more keys left to be assigned. However, it may be that near the "end" of the collection, we cannot find a hash function that distinguishes the remaining keys any further. In that case, we can insert the remaining keys into an ordinary hash table.
At the end of this process, we have a concatenated series of bit vectors, plus a hash table.
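A minimal construction sketch in Python may make the loop concrete. The details here are assumptions for illustration rather than the paper's implementation: seeded_hash fakes a per-round seeded hash with Python's built-in hash, and GAMMA and MAX_ROUNDS are placeholder parameters.

import random

GAMMA = 1.0        # bits allocated per remaining key in each round (illustrative)
MAX_ROUNDS = 32    # give up and fall back to a plain dict after this many rounds

def seeded_hash(seed, key):
    # Stand-in for a fast seeded hash function; Python's hash of a (seed, key)
    # tuple is enough for a sketch, though it is not stable across processes.
    return hash((seed, key))

def build(keys):
    ranks = []                 # one (seed, bit list) pair per round
    remaining = list(keys)
    for _ in range(MAX_ROUNDS):
        if not remaining:
            break
        seed = random.getrandbits(32)
        size = max(1, int(GAMMA * len(remaining)))
        counts = [0] * size
        for k in remaining:
            counts[seeded_hash(seed, k) % size] += 1
        # A position hit exactly once is collision-free; its key is assigned there.
        bits = [1 if c == 1 else 0 for c in counts]
        ranks.append((seed, bits))
        remaining = [k for k in remaining
                     if counts[seeded_hash(seed, k) % size] != 1]
    # Keys that were never separated go into an ordinary hash table.
    fallback = {k: i for i, k in enumerate(remaining)}
    return ranks, fallback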
The cost of constructing the index is governed by how many of the remaining keys hash without conflict in each round. The more collisions there are, the more rounds we need, and the larger the concatenated bit vectors grow overall.
One way to mitigate this is to increase the number of bits we allocate per key, to a value > 1. This increases the size of the bit vector, but decreases the probability of collisions.
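As a rough back-of-envelope estimate (mine, not a figure from the paper): hashing n keys uniformly into about gamma * n bits leaves a given key alone in its bit with probability roughly exp(-1/gamma), so raising gamma from 1 to 2 lifts the expected fraction of keys placed per round from about 37% to about 61%.

import math

def expected_placed_fraction(gamma):
    # Probability that a key shares its bit position with no other key when n
    # keys hash uniformly into ~gamma*n bits: (1 - 1/(gamma*n))**(n-1),
    # which tends to exp(-1/gamma) for large n.
    return math.exp(-1.0 / gamma)

for gamma in (1.0, 2.0, 5.0):
    print(f"gamma={gamma}: ~{expected_placed_fraction(gamma):.2f} of keys placed per round")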
Given a key k, the algorithm to locate the key in the index is:
def lookup(k):
    # Probe the ranks of the bit vector in order.
    # A hit in a given rank was non-conflicting by construction, so the first
    # hit for k is its assigned position.
    #
    # Note that this presumes k is known to be in the original set.
    # There may be false positives due to hash collisions for keys
    # not from the original set.
    for rank in bitvector:
        pos = rank.hash(k) % len(rank)
        if rank.isSet(pos):
            return (rank, pos)

    # If the key was not found in any of the bitvector ranks, it must
    # be in the hash table somewhere.
    pos = hashtable.hash(k) % len(hashtable)
    loc = hashtable.probe(k, pos)
    return (hashtable, loc)
As noted in the comment, this query may produce false positives if given a k that was not from the original set of keys. The objective of the perfect hash is not to distinguish member keys from non-member keys, but to uniquely (and quickly) assign each member key a small index.
The perfect hash can be used for probabilistic membership queries, but the rate of false positives may be high. For approximate membership queries, a Bloom filter or Cuckoo filter is probably a better choice.
The basic query for a minimal perfect hash is to map each of the n input keys to a distinct value in [0, n). What we have after completing the construction above is a mapping of each input key to a distinct bit position in the concatenated vector or slot in the hash table.
To map these positions into the interval we want, we calculate, for each position of the vector that is set to 1, the number of 1 bits to its left in the concatenated vector. For the leftmost set bit this is 0, for the second-leftmost it is 1, and so on. This distribution can be precomputed so that it does not need to be recomputed at query time.
The calculation is similar for the hash table at the end, except using the positions of the occupied slots in the table as the "set bits". Since the hash table must preserve the keys that mapped into it, we do not need to do anything fancy to remember which keys hashed to which positions—we simply look up and probe into the table as necessary using the query key.
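Continuing the construction sketch above (the same stand-in seeded_hash, and the (seed, bits) pairs plus fallback dict returned by the hypothetical build), the mapping into [0, n) might look like the following. A real implementation would precompute rank/popcount tables rather than summing bits at query time.

def seeded_hash(seed, key):
    return hash((seed, key))   # same stand-in seeded hash as in the build sketch

def build_distribution(ranks):
    # Cumulative number of set bits before each rank; the final total is the
    # offset at which indices for the fallback hash table begin.
    dist, total = [], 0
    for _seed, bits in ranks:
        dist.append(total)
        total += sum(bits)
    return dist, total

def index_of(k, ranks, dist, table_offset, fallback):
    # Final index in [0, n): offset of the rank the key landed in, plus the
    # number of set bits to the left of its position within that rank.
    for (seed, bits), offset in zip(ranks, dist):
        pos = seeded_hash(seed, k) % len(bits)
        if bits[pos]:
            return offset + sum(bits[:pos])
    return table_offset + fallback[k]

# Putting it together (build comes from the construction sketch above):
#   ranks, fallback = build(keys)
#   dist, table_offset = build_distribution(ranks)
#   sorted(index_of(k, ranks, dist, table_offset, fallback) for k in keys)
#   == list(range(len(keys)))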
Each rank requires a new hash function. In practice, if you have a good (uniform) fast hash function with unsigned integer output, you can generate a "new" hash function per rank by fixing a random constant seed and mixing it into the input (for example, by hashing the seed together with the key, or by using a keyed/seeded hash). Simply XOR-ing the seed with the hash function's output is not enough: keys that collided under one rank's hash would keep colliding under the next. Choose a fresh seed for each new rank of the table, and store the seed with that rank so that lookups can reproduce the same hash function.
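One concrete way to do this in Python, offered as an assumed example rather than the paper's method, is to key a standard hash with the per-rank seed (the seeds 2321 and 9163 below are just the ones from the schematic that follows):

import hashlib

def make_rank_hash(seed: int):
    # Derive an effectively independent hash function per rank by feeding the
    # seed into a keyed hash (BLAKE2b from the standard library). XOR-ing the
    # seed with the *output* of a fixed hash would not help: keys that collide
    # once would keep colliding in every rank.
    key = seed.to_bytes(8, "little")
    def h(data: bytes) -> int:
        return int.from_bytes(
            hashlib.blake2b(data, digest_size=8, key=key).digest(), "little")
    return h

h1, h2 = make_rank_hash(2321), make_rank_hash(9163)
print(h1(b"apple") % 11, h2(b"apple") % 8)   # generally lands in different positions per rank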
Here is an informal schematic of how the data structure looks after construction:
  Rank 1                   Rank 2             Rank 3       Hashtable
[ 0 1 0 0 0 0 1 1 0 0 1 ][ 1 0 0 0 0 1 0 1 ][ 0 0 1 0 1 ][ x _ _ y _ z _ ]
  0                        4                  7            9                ← distribution
  2321                     9163               1001         79013            ← hash seed
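For example, reading the schematic: a member key whose Rank 2 hash (seed 9163) lands on the set bit at position 5 of Rank 2 gets index 4 + 1 = 5, i.e. the 4 set bits contributed by Rank 1 plus the one set bit (position 0) to its left within Rank 2. A key stored at slot y of the hashtable gets index 9 + 1 = 10, since y sits in the second occupied slot.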