Skip to content

Instantly share code, notes, and snippets.

View thanos's full-sized avatar

thanos vassilakis thanos

View GitHub Profile
@thanos
thanos / mining_hardware.md
Created April 3, 2026 16:01
Mining Hardware
Hardware Generation Architecture Comparative Efficiency (2026 Context) Status
CPU General-purpose logic Negligible for BTC; strong for Monero/Verus Consumer-grade
GPU Parallel graphics processing Inefficient for SHA-256; versatile for altcoins/AI Enthusiast-grade
FPGA Reprogrammable logic Niche; algorithm-agile Specialist-grade
ASIC Fixed-logic circuit Maximum efficiency for specific algorithms Industrial-grade
@thanos
thanos / master_comparison_table_of_sketch_algorithms.md
Created March 16, 2026 00:52
Master Comparison Table of Sketch Algorithms
Family Example Algorithms Answers What Question? Typical Error Mergeable
Cardinality HyperLogLog, CPC How many unique elements exist? ~1% Yes
Frequency Count-Min, SpaceSaving Which items appear most often? Additive Yes
Quantile KLL, DDSketch What are the percentiles? Rank or relative Yes
Membership Bloom, Cuckoo, XOR Have we seen this element? False positives No (usually)
Set Theta, KMV What is the overlap between sets? ~1–2% Yes
Reconciliation IBLT What differs between datasets? Capacity bound Yes
@thanos
thanos / Set-Reconciliation-Data-Sketches.md
Created March 16, 2026 00:49
Set Reconciliation Data Sketches
Algorithm Approach Trade-off
IBLT Hash-based balance sheet Simple, widely used
Characteristic Polynomial Filters (CPF) Invertible polynomials Better space efficiency, higher CPU cost
Eppstein's Straggler Detection Simplified IBLT variant Best for finding a small number of missing elements in a stream
@thanos
thanos / set_data_sketches.md
Last active March 15, 2026 16:51
et Data Sketches (Advanced Operations)

Set Data Sketches (Advanced Operations)

These sketches are used when you need more than just a count; they allow for complex set operations like Intersections and Differences while maintaining a probabilistic estimate.

Algorithm Memory CPU / Op Accuracy Delete Merge Notes
Theta Sketch Fixed ($O(k)$ entries) Medium Configurable Yes Yes Gold standard for Intersections and Subtractions.
Invertible Bloom Lookup Table (IBLT) $O(D)$ where $D$ is diff size Medium Exact (if $D$ is small) Yes Yes Specifically designed for Set Reconciliation; can list actual missing keys.
Tuple Sketch $O(k)$ + Attributes Medium Configurable Yes Yes Theta extension that tracks metadata/attributes per key.
HLL (HyperLogLog) Extremely Low Fast $\approx 1.04 / \sqrt{m}$ No Yes Efficient for Unions; poor for intersections/differences.
@thanos
thanos / membership_filters.md
Last active March 15, 2026 16:24
Membership Filter data sketches

Membership filters (also called Approximate Membership Query or AMQ structures) are used to quickly check if an element is in a set. They are designed to never return a false negative—if the filter says "No," it is definitely not there. If it says "Yes," there's a small chance it might be a false positive.

Here is the table for Membership Filter data sketches:

Algorithm Memory (vs Bloom) CPU / Op Accuracy (vs Bloom) Delete Merge Notes
Bloom Filter $1.44 \log_2(1/\epsilon)$ bits/item Fast Baseline No Yes The industry standard; simple bit-array.
Counting Bloom 3x – 4x Larger Medium Same Yes Yes Adds counters to allow for item removal.
Cuckoo Filter ~20% Smaller Fast Better at low $\epsilon$ Yes No Fingerprint-based; better space efficiency for low error rates.

Frequency sketches (also called "Count-Min" or "Heavy Hitters" sketches) are used when you need to know how often a specific item has appeared without storing every single occurrence. They are the go-to for identifying "Top-K" items or finding popular hashtags/IP addresses in real-time.

Frequency Data Sketches (Point Queries & Top-K)

Algorithm Memory CPU / Op Accuracy Notes
Count-Min Sketch $O(\frac{1}{\epsilon} \log \frac{1}{\delta})$ Very Fast Probabilistic (Overestimates) The standard for frequency; easy to implement and merge.
Count-Sketch Higher than Count-Min Fast Unbiased (Lower Variance) Uses $\pm 1$ hashing; better error distribution but slightly more CPU.
Space-Saving $O(\frac{1}{\epsilon})$ Medium High for "Heavy Hitters" Deterministic; maintains a "Stream Summary" of top elements.
Misra-Gries $O(K)$ Fast High for Top-$K$ Classic algorithm; finds elements with frequency $> N/(K+1)$.

Quantile sketches are a different beast—they don't just count unique items; they help you understand the distribution of your data (like finding the 95th percentile latency or the median salary) without storing every single value.

Here is the breakdown for Quantile data sketches in Markdown:

Algorithm Memory Accuracy Notes
GK Array $O(\frac{1}{\epsilon} \log(\epsilon N))$ Deterministic $\epsilon$-approx The classic foundation. Provides a guaranteed error bound but can be memory-heavy as $N$ grows.
KLL Sketch $O(\frac{1}{\epsilon})$ Probabilistic $\epsilon$-approx Near-optimal space complexity. Excellent for merging multiple sketches (map-reduce friendly).
T-Digest $O(K)$ (clusters) High at extremes Exceptional for tail latencies (99th, 99.9th percentiles) by using "centroids" that get smaller at the edges.
DDSketch $O(\frac{1}{\alpha} \log(\text{max}/\text{min}))$ Relative Error Maintains a fixed relative error (
@thanos
thanos / cardinality_data_sketches.md
Last active March 15, 2026 16:00
Cardinality Data Sketches

You're absolutely right to call that out! I used a mix of formatting there, but you specifically asked for Markdown. I'll strip away the extra fluff and give you a clean, standard Markdown table that you can easily copy-paste into a README or documentation file.

Here is the data formatted as requested:

Algorithm Memory Accuracy Notes
Linear Counting $O(N)$ bits High Best for small sets; memory scales linearly with cardinality.
LogLog $O(\log(\log N))$ $\approx 1.30 / \sqrt{m}$ Foundation for modern sketches; uses bit-pattern estimation.
HyperLogLog (HLL) ~1.5 KB for $10^9$ items $\approx 1.04 / \sqrt{m}$ Industry standard; uses stochastic averaging for low variance.
HLL++ Variable (Sparse/Dense) High (inc. small sets) Google's version; fixes HLL's bias for small cardinalities.
@thanos
thanos / Membership-Filter-Comparison.md
Last active March 13, 2026 21:09
Membership Filter Comparison
Algorithm Deletions Merges Notes
Bloom Filter difficult yes - not dynamic classic
Cuckoo Filter YES YES modern
XOR Filter no no modern
Quotient Filter YES YES SSD friendly
CQF yes yes high performance
Binary Fuse Filter no no extremely compact modern
Ribbon Filter no no space efficient new design
@thanos
thanos / Quantile-Algorithm-Comparison.md
Last active March 13, 2026 19:05
Quantile Algorithm Comparison
Algorithm Strength
KLL General purpose
REQ Optimised for high quantiles (99th, 99.9th)
DDSketch Constant relative error across all quantiles
t-Digest Highly accurate at the tails