Skip to content

Instantly share code, notes, and snippets.

@thanos
Last active March 15, 2026 16:51
Show Gist options
  • Select an option

  • Save thanos/15ca4c6341f5f4db0937944946cc0b8d to your computer and use it in GitHub Desktop.

Select an option

Save thanos/15ca4c6341f5f4db0937944946cc0b8d to your computer and use it in GitHub Desktop.
et Data Sketches (Advanced Operations)

Set Data Sketches (Advanced Operations)

These sketches are used when you need more than just a count; they allow for complex set operations like Intersections and Differences while maintaining a probabilistic estimate.

Algorithm Memory CPU / Op Accuracy Delete Merge Notes
Theta Sketch Fixed ($O(k)$ entries) Medium Configurable Yes Yes Gold standard for Intersections and Subtractions.
Invertible Bloom Lookup Table (IBLT) $O(D)$ where $D$ is diff size Medium Exact (if $D$ is small) Yes Yes Specifically designed for Set Reconciliation; can list actual missing keys.
Tuple Sketch $O(k)$ + Attributes Medium Configurable Yes Yes Theta extension that tracks metadata/attributes per key.
HLL (HyperLogLog) Extremely Low Fast $\approx 1.04 / \sqrt{m}$ No Yes Efficient for Unions; poor for intersections/differences.
KMV (Min-Wise) $O(k)$ entries Fast $1/\sqrt{k}$ No Yes Foundations for Jaccard Similarity and MinHash.
Amanatidis Sketch Low Slow High No Yes Optimized for Subset/Containment estimation.

Comparison of Set Capabilities

  • Unions: Almost every sketch (HLL, Theta, CPC) handles unions natively and efficiently.
  • Intersections/Differences: Only Theta and Tuple sketches handle these accurately. HLL uses the "Inclusion-Exclusion" principle for intersections, which causes the error rate to explode as the overlap between sets decreases.
  • Associative Data: Tuple Sketches are unique because they allow you to perform set logic (e.g., "users in Group A AND Group B") while simultaneously aggregating a secondary metric (e.g., "total revenue from those users").

Would you like me to provide a Markdown table specifically comparing the Jaccard Similarity performance of these sketches?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment