Frequency sketches (also called "Count-Min" or "Heavy Hitters" sketches) are used when you need to know how often a specific item has appeared without storing every single occurrence. They are the go-to for identifying "Top-K" items or finding popular hashtags/IP addresses in real-time.
| Algorithm | Memory | CPU / Op | Accuracy | Notes |
|---|---|---|---|---|
| Count-Min Sketch | Very Fast | Probabilistic (Overestimates) | The standard for frequency; easy to implement and merge. | |
| Count-Sketch | Higher than Count-Min | Fast | Unbiased (Lower Variance) | Uses |
| Space-Saving | Medium | High for "Heavy Hitters" | Deterministic; maintains a "Stream Summary" of top elements. | |
| Misra-Gries | Fast | High for Top-$K$ | Classic algorithm; finds elements with frequency |
|
| HeavyKeeper | Low | Fast | Exceptional for Top-$K$ | Uses "decay" strategy to evict small items; state-of-the-art for Zipfian data. |
| SketchLearn | Variable | Slow | High (ML-based) | Uses automated modeling to correct sketch bias; higher overhead. |
- Use Count-Min Sketch for basic frequency estimation where overcounting is acceptable (e.g., rate limiting).
- Use HeavyKeeper if your goal is strictly finding the Top-K most frequent items with the highest precision.
- Use Misra-Gries if you need a simple, deterministic way to ensure you don't miss any items above a certain frequency threshold.
Would you like me to explain the "decay" mechanism in HeavyKeeper and why it’s better for skewed data?