Count unique visitors (IPs, user IDs) without storing all IDs.
Benefits:
- Fixed memory (16KB for any # of visitors)
- ~2% error rate
- Lock-free
- No storage of actual IPs
Use cases:
- Unique visitor counting
- Cardinality estimation
- Distinct value counting
HyperLogLog (HLL) is a probabilistic algorithm for estimating the cardinality (number of distinct elements) in a dataset. Instead of storing all unique values, it uses statistical properties of hash functions to estimate how many unique items have been observed.
Core Concept:
The fundamental insight behind HyperLogLog is based on the observation that when you hash random values uniformly, the probability of seeing certain bit patterns is predictable. Specifically:
- The probability that a hash starts with at least 1 leading zero bit is 1/2
- The probability of at least 2 leading zeros is 1/4
- The probability of at least k leading zeros is 1/2^k
If you observe a hash with (for example) 10 leading zeros, you can estimate that you've likely seen around 2^10 = 1024 unique values. This is the core intuition: rare events indicate large cardinality.
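A quick simulation makes this concrete (a Python sketch; uniform random 64-bit integers stand in for hash values, and leading_zeros is an illustrative helper):

    import random

    # Simulate the core intuition: generate n "hashes" (uniform random
    # 64-bit ints), track the maximum number of leading zeros observed,
    # and compare 2^max against the true n.
    def leading_zeros(h: int, width: int = 64) -> int:
        return width - h.bit_length() if h else width

    n = 100_000
    max_lz = max(leading_zeros(random.getrandbits(64)) for _ in range(n))
    print(f"max leading zeros = {max_lz}, estimate 2^{max_lz} = {2 ** max_lz}, true n = {n}")

A single run of this gives a very rough estimate, which is exactly the variance problem described next.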
However, using just one register would be extremely inaccurate (high variance). HyperLogLog solves this by:
- Dividing observations into many buckets (registers)
- Tracking the maximum leading zeros in each bucket independently
- Combining these estimates using harmonic mean to reduce variance
This ensemble approach dramatically improves accuracy while maintaining constant memory usage.
Each unique value (IP address, user ID, etc.) is hashed using a uniform hash function. This produces a bit string that appears random and uniformly distributed.
Example: hash("192.168.1.1") → 0b00101110001101... (64-bit hash)
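A minimal sketch of such a hash in Python, taking the first 8 bytes of a BLAKE2b digest (any well-mixed 64-bit hash would do; hash64 is an illustrative name, not a standard function):

    import hashlib

    # A uniform 64-bit hash: the first 8 bytes of a BLAKE2b digest,
    # interpreted as a big-endian integer.
    def hash64(value: str) -> int:
        return int.from_bytes(hashlib.blake2b(value.encode(), digest_size=8).digest(), "big")

    print(f"{hash64('192.168.1.1'):064b}")  # the 64-bit hash, printed as bits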
The hash is divided into two parts:
- First p bits: Determine which bucket (register) to update
- With p=14, we get 2^14 = 16,384 buckets
- Remaining bits: Used to count leading zeros
Example with p=14:

    Hash: [14 bits for bucket][50 bits for counting]
           ^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^
              bucket index     count leading zeros here
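In code, this split is two bit operations (a sketch assuming the 64-bit hash64 from above and p = 14):

    # Split the 64-bit hash: the top p bits pick the bucket, the
    # remaining bits are used for counting leading zeros.
    P = 14

    def split(h: int) -> tuple[int, int]:
        bucket = h >> (64 - P)              # first 14 bits: index 0..16383
        rest = h & ((1 << (64 - P)) - 1)    # remaining 50 bits
        return bucket, rest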
For the remaining bits, count how many leading zeros appear before the first 1-bit. This count is called ρ (rho).
Examples:
- 1101010... → ρ = 0 (no leading zeros)
- 0001101... → ρ = 3 (three leading zeros)
- 0000001... → ρ = 6 (six leading zeros)
The value ρ+1 is stored in the register.
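A sketch of ρ for the 50 counting bits, using Python's int.bit_length (the all-zero case is capped at the bit width):

    # ρ: leading zeros before the first 1-bit of the counting bits,
    # which are held right-aligned in a Python int.
    def rho(bits: int, width: int = 50) -> int:
        return width - bits.bit_length() if bits else width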
Each of the m buckets maintains a register that stores the maximum ρ value ever observed for that bucket.
Bucket[i] = max(Bucket[i], ρ + 1)
This is why 6 bits per register are enough: a 6-bit register holds values 0..63, and with 50 counting bits ρ+1 can never exceed 51, even with billions of unique values.
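Putting the hashing, splitting, and ρ steps together, the whole update is a couple of lines (a sketch reusing the illustrative hash64, split, and rho helpers from above; a plain Python list stands in for the packed registers):

    # Full update path: hash the value, pick a bucket from the top bits,
    # and keep the maximum ρ+1 ever seen in that bucket.
    M = 1 << 14                  # 16,384 buckets
    registers = [0] * M

    def add(value: str) -> None:
        bucket, rest = split(hash64(value))
        registers[bucket] = max(registers[bucket], rho(rest) + 1)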
To estimate the total number of unique elements, HyperLogLog uses the harmonic mean of 2^M[j] across all buckets:
E = α_m × m² / Σⱼ 2^(−M[j])
Where:
- m = number of buckets (16,384)
- M[j] = value in bucket j
- α_m = correction constant (~0.7213 for large m)
The harmonic mean is crucial because it's resistant to outliers (a few buckets with very high values don't skew the estimate).
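A sketch of this raw estimator, omitting the small- and large-range corrections the full algorithm applies (alpha uses the standard large-m approximation):

    # Raw HyperLogLog estimate: α_m · m² / Σ 2^(-M[j]).
    def estimate(regs: list[int]) -> float:
        m = len(regs)
        alpha = 0.7213 / (1 + 1.079 / m)          # bias-correction constant
        return alpha * m * m / sum(2.0 ** -r for r in regs)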
Why This Works:
The average maximum leading zero count in a bucket is approximately log₂(n/m), where n is the true cardinality. By combining estimates from all buckets using harmonic mean, we get:
- Individual buckets have high variance
- Harmonic mean of many buckets reduces variance significantly
- Result: ~2% standard error with 16KB memory
HyperLogLog's accuracy depends on the number of buckets (m):
- Standard error = 1.04 / √m
- With m = 16,384 (2^14): error ≈ 1.04 / 128 ≈ 0.81% (theoretical)
- In practice: ~2% error accounting for edge cases
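A quick check of the standard-error formula for a few register counts:

    import math

    # Standard error 1.04/√m for several choices of p (m = 2^p).
    for p in (10, 12, 14, 16):
        m = 2 ** p
        print(f"p = {p:2d}  m = {m:6,d}  error ≈ {1.04 / math.sqrt(m):.2%}")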
Why 16KB?
The implementation uses:
- m = 16,384 buckets (2^14)
- 6 bits per register
- Total: 16,384 × 6 bits = 98,304 bits = 12 KB
- With overhead: ~16KB total
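The 6-bits-per-register figure assumes packed storage; the earlier sketches used a plain list for clarity. A sketch of the packing (get_register/set_register are illustrative helpers, not a library API):

    # Packed 6-bit registers in a bytearray: 16,384 × 6 bits = 12,288 bytes.
    M = 1 << 14
    packed = bytearray(M * 6 // 8)   # 12 KB of raw register storage

    def get_register(i: int) -> int:
        byte, off = divmod(i * 6, 8)
        # Read 16 bits so a register straddling a byte boundary is handled.
        word = packed[byte] | (packed[byte + 1] << 8 if byte + 1 < len(packed) else 0)
        return (word >> off) & 0x3F

    def set_register(i: int, value: int) -> None:
        byte, off = divmod(i * 6, 8)
        word = packed[byte] | (packed[byte + 1] << 8 if byte + 1 < len(packed) else 0)
        word = (word & ~(0x3F << off)) | ((value & 0x3F) << off)
        packed[byte] = word & 0xFF
        if byte + 1 < len(packed):
            packed[byte + 1] = (word >> 8) & 0xFF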
This configuration provides:
- Excellent accuracy (~2% error)
- Reasonable memory footprint
- Ability to count up to billions of unique values
Traditional exact counting requires O(n) space where n is the number of unique elements:
- 1 million unique IPs: ~16 MB (storing 4-byte IPs in a hash set)
- 1 billion unique IPs: ~16 GB
HyperLogLog requires O(m) space where m is the number of buckets:
- Any cardinality: 16 KB (fixed)
- 1 million, 1 billion, or 1 trillion unique values: still 16 KB
The space savings become dramatic as cardinality grows.
The harmonic mean is used instead of arithmetic mean because:
- Harmonic mean is more resistant to outliers (buckets with unusually high values)
- It provides a better bias correction for the estimator
- Arithmetic mean would overestimate cardinality significantly
Consider if one bucket accidentally gets a very high value (say, 50 leading zeros):
- Arithmetic mean: would be heavily influenced, causing overestimation
- Harmonic mean: dampens the effect of this outlier, maintaining accuracy
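A toy comparison makes the difference concrete (standard-library statistics; nine typical buckets and one extreme outlier):

    import statistics

    # Nine buckets estimating 2^4 = 16, plus one outlier bucket at 2^50.
    values = [2.0 ** 4] * 9 + [2.0 ** 50]
    print(f"arithmetic mean: {statistics.mean(values):.3g}")           # ~1.13e+14, dominated by the outlier
    print(f"harmonic mean:   {statistics.harmonic_mean(values):.3g}")  # ~17.8, barely moved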
The algorithm leverages the law of large numbers:
- Each bucket provides a noisy estimate
- With 16,384 independent estimates, the noise averages out
- The combined estimate converges to the true cardinality
- Standard error decreases proportionally to √m
This is why doubling the number of buckets (and memory) only shrinks the standard error by a factor of √2 ≈ 1.41, demonstrating diminishing returns.
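Putting the earlier sketches together shows the end-to-end behavior (add, estimate, and registers are the illustrative helpers defined above, not a published library):

    # Add a million unique strings, then estimate the cardinality.
    for i in range(1_000_000):
        add(f"user-{i}")

    print(f"estimate: {estimate(registers):,.0f}  (true: 1,000,000)")
    # A typical run lands within a couple of percent of the true count,
    # while register storage stays fixed at 16,384 entries throughout.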