Created
February 17, 2015 17:19
-
-
Save brandt/8f9ab3ceae37562a2841 to your computer and use it in GitHub Desktop.
Calculate the required bloom filter size and optimal number of hashes from the expected number of items in the collection and acceptable false-positive rate
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Optimal bloom filter size and number of hashes | |
# Tips: | |
# 1. One byte per item in the input set gives about a 2% false positive rate. | |
# 2. The optimal number of hash functions is ~0.7x the number of bits per item. | |
# 3. The number of hashes dominates performance. | |
# Expected number of items in the collection | |
# n = (m * ln(2))/k; | |
n = 300_000 | |
# Acceptable false-positive rate (0.01 = 1%) | |
# p = e^(-(m/n) * (ln(2)^2)); | |
fpr = 0.01 | |
# Optimal size (number of elements in the bit array) | |
# m = -((n*ln(p))/(ln(2)^2)); | |
m = (n * Math.log(fpr).abs) / (Math.log(2) ** 2) | |
# Optimal number of hash functions | |
# k = (m/n) * ln(2); | |
k = (m / n) * Math.log(2) | |
puts "Optimal bloom filter size: #{m.ceil} bits" | |
puts "Optimal number of hash functions: #{k.ceil}" |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Thanks..I was really looking for this kind of an answer.