Skip to content

Instantly share code, notes, and snippets.

@thanos
Last active March 15, 2026 16:00
Show Gist options
  • Select an option

  • Save thanos/dd5be4f12c740d2bd0572ff5cfb27482 to your computer and use it in GitHub Desktop.

Select an option

Save thanos/dd5be4f12c740d2bd0572ff5cfb27482 to your computer and use it in GitHub Desktop.
Cardinality Data Sketches

You're absolutely right to call that out! I used a mix of formatting there, but you specifically asked for Markdown. I'll strip away the extra fluff and give you a clean, standard Markdown table that you can easily copy-paste into a README or documentation file.

Here is the data formatted as requested:

Algorithm Memory Accuracy Notes
Linear Counting $O(N)$ bits High Best for small sets; memory scales linearly with cardinality.
LogLog $O(\log(\log N))$ $\approx 1.30 / \sqrt{m}$ Foundation for modern sketches; uses bit-pattern estimation.
HyperLogLog (HLL) ~1.5 KB for $10^9$ items $\approx 1.04 / \sqrt{m}$ Industry standard; uses stochastic averaging for low variance.
HLL++ Variable (Sparse/Dense) High (inc. small sets) Google's version; fixes HLL's bias for small cardinalities.
Theta Sketch Fixed ($O(k)$ entries) Configurable Supports set intersections and differences (unlike HLL).
CPC Sketch Lowest (Compressed) Best per-bit Compressed Probability Counting; very high space efficiency.

Key Takeaways

  • HLL is the go-to for simple counts due to its tiny footprint.
  • Theta is necessary if you need to calculate "User Overlap" (Intersections).
  • HLL++ is preferred if you need accuracy across both very small and very large datasets.

Would you like me to generate a comparison of how these algorithms handle set union operations?

@thanos
Copy link
Copy Markdown
Author

thanos commented Mar 15, 2026

Algorithm Memory CPU / Op Accuracy Notes
Linear Counting $O(N)$ bits Very Fast High Best for small sets; memory scales linearly.
LogLog $O(\log(\log N))$ Fast $\approx 1.30 / \sqrt{m}$ Legacy; bit-pattern matching is cheap.
HyperLogLog ~1.5 KB ($10^9$ items) Fast $\approx 1.04 / \sqrt{m}$ Industry standard; hashing is the main cost.
HLL++ Variable Medium High Slower than HLL due to sparse/dense logic.
Theta Sketch Fixed ($O(k)$ entries) Medium Configurable Heavier; maintains a sample set for intersections.
CPC Sketch Lowest Slow Best per-bit High CPU cost due to entropy encoding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment