frequency_sketches_comparison.md

Frequency sketches (also called "Count-Min" or "Heavy Hitters" sketches) are used when you need to know how often a specific item has appeared without storing every single occurrence. They are the go-to for identifying "Top-K" items or finding popular hashtags/IP addresses in real-time.

Frequency Data Sketches (Point Queries & Top-K)

Algorithm	Memory	CPU / Op	Accuracy	Notes
Count-Min Sketch	$O(\frac{1}{\epsilon} \log \frac{1}{\delta})$	Very Fast	Probabilistic (Overestimates)	The standard for frequency; easy to implement and merge.
Count-Sketch	Higher than Count-Min	Fast	Unbiased (Lower Variance)	Uses $\pm 1$ hashing; better error distribution but slightly more CPU.
Space-Saving	$O(\frac{1}{\epsilon})$	Medium	High for "Heavy Hitters"	Deterministic; maintains a "Stream Summary" of top elements.
Misra-Gries	$O(K)$	Fast	High for Top-$K$	Classic algorithm; finds elements with frequency $> N/(K+1)$.
HeavyKeeper	Low	Fast	Exceptional for Top-$K$	Uses "decay" strategy to evict small items; state-of-the-art for Zipfian data.
SketchLearn	Variable	Slow	High (ML-based)	Uses automated modeling to correct sketch bias; higher overhead.

Selection Guide

Use Count-Min Sketch for basic frequency estimation where overcounting is acceptable (e.g., rate limiting).
Use HeavyKeeper if your goal is strictly finding the Top-K most frequent items with the highest precision.
Use Misra-Gries if you need a simple, deterministic way to ensure you don't miss any items above a certain frequency threshold.

Would you like me to explain the "decay" mechanism in HeavyKeeper and why it’s better for skewed data?

thanos/frequency_sketches_comparison.md

Select an option

No results found

Select an option

No results found

Frequency Data Sketches (Point Queries & Top-K)

Selection Guide