Cardinality Data Sketches

You're absolutely right to call that out! I used a mix of formatting there, but you specifically asked for Markdown. I'll strip away the extra fluff and give you a clean, standard Markdown table that you can easily copy-paste into a README or documentation file.

Here is the data formatted as requested:

Algorithm	Memory	Accuracy	Notes
Linear Counting	$O(N)$ bits	High	Best for small sets; memory scales linearly with cardinality.
LogLog	$O(\log(\log N))$	$\approx 1.30 / \sqrt{m}$	Foundation for modern sketches; uses bit-pattern estimation.
HyperLogLog (HLL)	~1.5 KB for $10^9$ items	$\approx 1.04 / \sqrt{m}$	Industry standard; uses stochastic averaging for low variance.
HLL++	Variable (Sparse/Dense)	High (inc. small sets)	Google's version; fixes HLL's bias for small cardinalities.
Theta Sketch	Fixed ($O(k)$ entries)	Configurable	Supports set intersections and differences (unlike HLL).
CPC Sketch	Lowest (Compressed)	Best per-bit	Compressed Probability Counting; very high space efficiency.

Key Takeaways

HLL is the go-to for simple counts due to its tiny footprint.
Theta is necessary if you need to calculate "User Overlap" (Intersections).
HLL++ is preferred if you need accuracy across both very small and very large datasets.

Would you like me to generate a comparison of how these algorithms handle set union operations?

Algorithm	Memory	CPU / Op	Accuracy	Notes
Linear Counting	$O(N)$ bits	Very Fast	High	Best for small sets; memory scales linearly.
LogLog	$O(\log(\log N))$	Fast	$\approx 1.30 / \sqrt{m}$	Legacy; bit-pattern matching is cheap.
HyperLogLog	~1.5 KB ($10^9$ items)	Fast	$\approx 1.04 / \sqrt{m}$	Industry standard; hashing is the main cost.
HLL++	Variable	Medium	High	Slower than HLL due to sparse/dense logic.
Theta Sketch	Fixed ($O(k)$ entries)	Medium	Configurable	Heavier; maintains a sample set for intersections.
CPC Sketch	Lowest	Slow	Best per-bit	High CPU cost due to entropy encoding.

Algorithm

Memory

CPU / Op

Accuracy

Notes

Linear Counting

$O(N)$ bits

Very Fast

High

Best for small sets; memory scales linearly.

LogLog

$O(\log(\log N))$

Fast

$\approx 1.30 / \sqrt{m}$

Legacy; bit-pattern matching is cheap.

HyperLogLog

~1.5 KB ($10^9$ items)

Fast

$\approx 1.04 / \sqrt{m}$

Industry standard; hashing is the main cost.

HLL++

Variable

Medium

High

Slower than HLL due to sparse/dense logic.

Theta Sketch

Fixed ($O(k)$ entries)

Medium

Configurable

Heavier; maintains a sample set for intersections.

CPC Sketch

Lowest

Slow

Best per-bit

High CPU cost due to entropy encoding.

thanos/cardinality_data_sketches.md

Select an option

No results found

Select an option

No results found

Key Takeaways

thanos commented Mar 15, 2026

Uh oh!