Created
May 14, 2026 09:33
-
-
Save samcm/6b8b05346484eb3b278f6b2bbbcc140c to your computer and use it in GitHub Desktop.
ClickHouse skip-index benchmark for log substring search — tokenbf_v1 vs text() vs ngrambf_v1
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| <!doctype html> | |
| <html lang="en"> | |
| <head> | |
| <meta charset="utf-8"> | |
| <title>ClickHouse skip-index benchmark for log substring search</title> | |
| <style> | |
| :root { | |
| --fg: #1a1a1a; | |
| --muted: #666; | |
| --line: #e3e3e3; | |
| --accent: #2a6dd9; | |
| --winner: #e7f5e9; | |
| --loser: #fbeaea; | |
| --code-bg: #f6f8fa; | |
| } | |
| html, body { background: #fff; color: var(--fg); } | |
| body { | |
| font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif; | |
| line-height: 1.55; | |
| max-width: 920px; | |
| margin: 2rem auto 4rem; | |
| padding: 0 1.25rem; | |
| font-size: 15.5px; | |
| } | |
| h1 { font-size: 1.85rem; margin: 0 0 .4rem; line-height: 1.2; } | |
| h2 { font-size: 1.3rem; margin: 2.4rem 0 .6rem; padding-top: .4rem; border-top: 1px solid var(--line); } | |
| h3 { font-size: 1.05rem; margin: 1.4rem 0 .3rem; } | |
| .lede { color: var(--muted); margin: 0 0 1.5rem; } | |
| table { border-collapse: collapse; width: 100%; margin: .6rem 0 1.2rem; font-size: 14px; } | |
| th, td { padding: .35rem .55rem; border-bottom: 1px solid var(--line); text-align: left; vertical-align: top; } | |
| th { font-weight: 600; background: #fafafa; } | |
| td.num, th.num { text-align: right; font-variant-numeric: tabular-nums; } | |
| tr.winner td { background: var(--winner); } | |
| tr.loser td { background: var(--loser); } | |
| code, pre { font-family: "SF Mono", Menlo, Consolas, monospace; font-size: 13px; } | |
| pre { background: var(--code-bg); padding: .8rem 1rem; border-radius: 6px; overflow-x: auto; line-height: 1.4; } | |
| code { background: var(--code-bg); padding: .08em .35em; border-radius: 3px; } | |
| pre code { background: none; padding: 0; } | |
| .tldr { | |
| background: #f0f6ff; | |
| border-left: 3px solid var(--accent); | |
| padding: .9rem 1.1rem; | |
| border-radius: 0 6px 6px 0; | |
| margin: 1rem 0 1.6rem; | |
| } | |
| .tldr p { margin: .35rem 0; } | |
| .small { color: var(--muted); font-size: 13px; } | |
| .footnote { color: var(--muted); font-size: 13px; margin-top: 3rem; padding-top: 1rem; border-top: 1px solid var(--line); } | |
| ul { padding-left: 1.4rem; } | |
| li { margin: .25rem 0; } | |
| </style> | |
| </head> | |
| <body> | |
| <h1>ClickHouse skip-index benchmark for log substring search</h1> | |
| <p class="lede">Tested whether <code>text()</code>, <code>ngrambf_v1</code>, or any tunable variant can beat <code>tokenbf_v1</code> for typical <code>WHERE Message LIKE '%pattern%'</code> workloads on real log data. <strong>Surprise: none of them prune granules meaningfully, and no-index is the fastest.</strong></p> | |
| <div class="tldr"> | |
| <p><strong>TL;DR</strong></p> | |
| <p>For substring <code>LIKE</code> queries on log data, the skip-index family is essentially decorative — every common token appears in every granule, so the bloom filter always says "maybe", and pruning is < 0.01% of granules.</p> | |
| <p>The current production choice (<code>tokenbf_v1(32768, 3, 0)</code>) is fine but oversized. <code>tokenbf_v1(8192, 3, 0)</code> gives a <strong>3.5× smaller index at no query cost</strong>. The <code>text()</code> family is 50–500× the size of <code>tokenbf_v1</code> and slightly <em>slower</em>.</p> | |
| <p>If your <code>LIKE</code> patterns are common log tokens (<code>error</code>, <code>timeout</code>, hex prefixes), the index is decorative; you can probably drop it.</p> | |
| </div> | |
| <h2>Setup</h2> | |
| <p>A single day of a production log table (<code>Message String</code> column) was copied 14 times onto one shard, each copy carrying a different skip-index configuration. After <code>OPTIMIZE TABLE … FINAL</code> on every variant, size and query-latency measurements were taken.</p> | |
| <table> | |
| <tr><th>Item</th><th>Value</th></tr> | |
| <tr><td>ClickHouse</td><td>26.2.5</td></tr> | |
| <tr><td>Rows per variant</td><td class="num">437,216,102</td></tr> | |
| <tr><td>Compressed data per variant</td><td class="num">10.82 GiB</td></tr> | |
| <tr><td>Engine</td><td><code>MergeTree</code>, one partition (one day)</td></tr> | |
| <tr><td><code>ORDER BY</code></td><td><code>(tenant, node, timestamp)</code></td></tr> | |
| <tr><td><code>index_granularity</code></td><td class="num">8,192</td></tr> | |
| <tr><td>Total granules per variant</td><td class="num">~53,400</td></tr> | |
| </table> | |
| <p>Each query (<code>SELECT count() FROM <variant> WHERE Message LIKE '%pattern%'</code>) was run twice with cold caches (<code>SYSTEM DROP MARK CACHE</code> + <code>SYSTEM DROP UNCOMPRESSED CACHE</code> between runs; filesystem cache was deliberately left alone to avoid impacting other workloads). Numbers below are run-2 unless otherwise noted, taken from <code>system.query_log</code>.</p> | |
| <h2>Index size after OPTIMIZE FINAL</h2> | |
| <table> | |
| <tr> | |
| <th>Variant</th> | |
| <th>Index spec</th> | |
| <th class="num">Index size</th> | |
| <th class="num">idx / data</th> | |
| </tr> | |
| <tr><td>baseline</td><td><em>(no skip index)</em></td><td class="num">0</td><td class="num">—</td></tr> | |
| <tr class="winner"><td><strong>tokenbf_8k</strong></td><td><code>tokenbf_v1(8192, 3, 0) GRANULARITY 4</code></td><td class="num"><strong>70.3 MiB</strong></td><td class="num"><strong>0.63%</strong></td></tr> | |
| <tr><td>tokenbf_v1 <span class="small">(common default)</span></td><td><code>tokenbf_v1(32768, 3, 0) GRANULARITY 4</code></td><td class="num">244.7 MiB</td><td class="num">2.21%</td></tr> | |
| <tr><td>ngrambf_4</td><td><code>ngrambf_v1(4, 32768, 3, 0) GRANULARITY 4</code></td><td class="num">321.0 MiB</td><td class="num">2.90%</td></tr> | |
| <tr><td>text_split_lower</td><td><code>text(splitByString([' ','=',',',':','"',"'"]), lower())</code></td><td class="num">12.00 GiB</td><td class="num">110.9%</td></tr> | |
| <tr><td>text_split</td><td><code>text(splitByString([' ','=',',',':','"',"'"]))</code></td><td class="num">12.02 GiB</td><td class="num">111.1%</td></tr> | |
| <tr><td>text_optimal <span class="small">(kitchen-sink tuning)</span></td><td><code>text(splitByNonAlpha, lower, dict_block_size=1024, dict_frontcoding=1)</code></td><td class="num">13.84 GiB</td><td class="num">127.9%</td></tr> | |
| <tr><td>text_lower</td><td><code>text(splitByNonAlpha, preprocessor=lower)</code></td><td class="num">13.86 GiB</td><td class="num">128.2%</td></tr> | |
| <tr><td>text_dict_big</td><td><code>text(splitByNonAlpha, dictionary_block_size=1024)</code></td><td class="num">13.94 GiB</td><td class="num">128.9%</td></tr> | |
| <tr><td>text_default</td><td><code>text(splitByNonAlpha)</code></td><td class="num">13.97 GiB</td><td class="num">129.1%</td></tr> | |
| <tr><td>text_sparse_cut</td><td><code>text(sparseGrams(3, 5, 4))</code></td><td class="num">40.89 GiB</td><td class="num">377.9%</td></tr> | |
| <tr class="loser"><td>text_ngrams3</td><td><code>text(ngrams(3))</code></td><td class="num">67.86 GiB</td><td class="num">627.2%</td></tr> | |
| <tr class="loser"><td>text_ngrams4</td><td><code>text(ngrams(4))</code></td><td class="num">77.29 GiB</td><td class="num">714.4%</td></tr> | |
| <tr class="loser"><td>text_sparse <span class="small">(partial, killed)</span></td><td><code>text(sparseGrams)</code> <span class="small">— default</span></td><td class="num">~17×</td><td class="num"><em>extrapolated</em></td></tr> | |
| </table> | |
| <h3>Observations on sizing</h3> | |
| <ul> | |
| <li><code>tokenbf_v1(8192, 3, 0)</code> is the size winner — a quarter the size of the common <code>(32768)</code> default at no measurable cost.</li> | |
| <li>Every <code>text()</code> variant is <strong>orders of magnitude larger</strong> than <code>tokenbf_v1</code>. The smallest text variant is <strong>170× larger</strong> than the smallest tokenbf.</li> | |
| <li>The text-index "advanced" knobs (<code>preprocessor</code>, <code>dictionary_block_size</code>, <code>dictionary_block_frontcoding_compression</code>) move the index size by <strong>under 2%</strong>. Tokenizer choice is the only meaningful lever.</li> | |
| <li><code>ngrams</code>-tokenizer text indexes are uniformly disastrous on size (627–714% of the data).</li> | |
| <li><code>sparseGrams</code> default is the worst of all (~17× data) — the explicit cutoff form (<code>sparseGrams(3,5,4)</code>) tames it to ~4× but still terrible.</li> | |
| </ul> | |
| <h2>Query latency</h2> | |
| <p>Five representative <code>LIKE</code> patterns covering different selectivity:</p> | |
| <table> | |
| <tr><th>Pattern</th><th class="num">Match count</th><th class="num">Match rate</th><th>Notes</th></tr> | |
| <tr><td><code>%error%</code></td><td class="num">34,061,067</td><td class="num">7.8%</td><td>common log word</td></tr> | |
| <tr><td><code>%0x%</code></td><td class="num">33,832,079</td><td class="num">7.7%</td><td>short alphanumeric — interesting case for tokenbf since it splits on non-alpha</td></tr> | |
| <tr><td><code>%failed to%</code></td><td class="num">1,330,814</td><td class="num">0.3%</td><td>multi-word phrase with whitespace</td></tr> | |
| <tr><td><code>%attestation%</code></td><td class="num">44,450,697</td><td class="num">10.2%</td><td>single rare-ish word</td></tr> | |
| <tr><td><code>%timeout%</code></td><td class="num">25,602,941</td><td class="num">5.9%</td><td>common log word</td></tr> | |
| </table> | |
| <p>Wall-clock latency (ms, lower is better):</p> | |
| <table> | |
| <tr> | |
| <th>Variant</th> | |
| <th class="num"><code>error</code></th> | |
| <th class="num"><code>0x</code></th> | |
| <th class="num"><code>failed to</code></th> | |
| <th class="num"><code>attestation</code></th> | |
| <th class="num"><code>timeout</code></th> | |
| <th class="num"><strong>avg</strong></th> | |
| </tr> | |
| <tr class="winner"><td><strong>baseline (no index)</strong></td><td class="num">1107</td><td class="num">952</td><td class="num">344</td><td class="num">955</td><td class="num">541</td><td class="num"><strong>780</strong></td></tr> | |
| <tr><td>text_default</td><td class="num">1129</td><td class="num">924</td><td class="num">363</td><td class="num">1001</td><td class="num">589</td><td class="num">801</td></tr> | |
| <tr><td>text_split_lower</td><td class="num">1132</td><td class="num">988</td><td class="num">384</td><td class="num">931</td><td class="num">665</td><td class="num">820</td></tr> | |
| <tr><td>tokenbf_8k</td><td class="num">1188</td><td class="num">977</td><td class="num">387</td><td class="num">1009</td><td class="num">573</td><td class="num">827</td></tr> | |
| <tr><td>tokenbf_v1</td><td class="num">1334</td><td class="num">1062</td><td class="num">438</td><td class="num">1114</td><td class="num">640</td><td class="num">918</td></tr> | |
| <tr><td>ngrambf_4</td><td class="num">1386</td><td class="num">1064</td><td class="num">567</td><td class="num">1193</td><td class="num">810</td><td class="num">1004</td></tr> | |
| <tr class="loser"><td>text_ngrams3</td><td class="num">1569</td><td class="num">999</td><td class="num">574</td><td class="num">1424</td><td class="num">850</td><td class="num">1083</td></tr> | |
| </table> | |
| <p><strong>No-index is the fastest across every pattern.</strong> Every skip-index variant pays a non-zero index-consultation cost on every query without buying back any pruning savings.</p> | |
| <h2>Why? <code>EXPLAIN indexes=1</code></h2> | |
| <p>For <code>WHERE Message LIKE '%attestation%'</code> on each variant, how many granules survived the skip-index check:</p> | |
| <table> | |
| <tr><th>Variant</th><th class="num">Granules kept</th><th class="num">Pruned</th></tr> | |
| <tr><td>tokenbf_v1</td><td class="num">53,397 / 53,397</td><td class="num">0</td></tr> | |
| <tr><td>tokenbf_8k</td><td class="num">53,382 / 53,382</td><td class="num">0</td></tr> | |
| <tr><td>text_ngrams3</td><td class="num">53,379 / 53,383</td><td class="num"><strong>4</strong> (0.007%)</td></tr> | |
| </table> | |
| <p>Even single-word common terms get zero useful pruning. <code>attestation</code> appears in nearly every 8,192-row granule, so the bloom filter says "maybe present" for ~all granules.</p> | |
| <p>The reduced <code>read_rows</code> reported by <code>system.query_log</code> (e.g. 104M of 437M for <code>%failed to%</code>) is <strong>not</strong> coming from skip-index pruning — it comes from ClickHouse's intra-granule <code>LIKE</code> evaluator short-circuiting per row. Compare:</p> | |
| <pre><code>-- bench_no_index: 53,397 / 53,397 granules selected, no skip index | |
| -- bench_tokenbf_v1: same granules selected; index consulted but didn't prune | |
| -- bench_text_ngrams3: skip index tokenizes "failed to" into 7 trigrams, | |
| -- prunes 12 of 53,383 granules (0.02%)</code></pre> | |
| <p>The "useless granule" problem isn't a tokenizer issue — it's a property of the data. With ~8,000 rows per granule and a token appearing in >0.5% of rows, every granule contains it. The bloom filter cannot help.</p> | |
| <h2>What about <code>hasToken()</code> instead of <code>LIKE</code>?</h2> | |
| <p>The <code>text()</code> index is genuinely <em>designed</em> for the <code>hasToken</code>, <code>hasAnyToken</code>, <code>searchAny</code>, etc. operators — not <code>LIKE</code>. Those operators speak directly to the inverted index without round-tripping through pattern matching. This benchmark only covers <code>LIKE</code> because that's what real-world Grafana / dashboard queries against log data generate. If your workload is <code>hasToken()</code>-shaped, re-run with that operator.</p> | |
| <h2>Recommendations</h2> | |
| <ul> | |
| <li><strong>For new schemas:</strong> use <code>tokenbf_v1(8192, 3, 0) GRANULARITY 4</code>, or omit the index entirely.</li> | |
| <li><strong>For existing schemas with <code>tokenbf_v1(32768, …)</code>:</strong> consider downsizing — saves ~70% of the index disk for free.</li> | |
| <li><strong>For existing schemas with <code>text()</code> on Message/Body for LIKE-shaped queries:</strong> rip it out. It's costing 50–500× the disk of an equivalent <code>tokenbf_v1</code> and offering no query benefit.</li> | |
| <li><strong>Avoid <code>ngrambf_v1</code> for this workload:</strong> bigger index, slower queries.</li> | |
| <li><strong>If you genuinely need pruning for substring search:</strong> the lever isn't the index — it's the <code>ORDER BY</code>. Move the column you actually filter on to the prefix of <code>ORDER BY</code>, and pre-tokenize known patterns into materialized columns.</li> | |
| </ul> | |
| <h2>Variants tested — full list (16)</h2> | |
| <details> | |
| <summary>Click to expand DDL snippets for each variant</summary> | |
| <pre><code>-- baseline | |
| CREATE TABLE bench_no_index (... Message String CODEC(ZSTD(1))) | |
| ENGINE = MergeTree() ORDER BY (tenant, node, timestamp); | |
| -- tokenbf_v1 family | |
| INDEX idx_msg Message TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 4 -- winner | |
| INDEX idx_msg Message TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 4 -- control | |
| -- ngrambf_v1 family | |
| INDEX idx_msg Message TYPE ngrambf_v1(4, 32768, 3, 0) GRANULARITY 4 | |
| -- text() family | |
| INDEX idx_msg Message TYPE text(tokenizer = 'splitByNonAlpha') GRANULARITY 1 | |
| INDEX idx_msg Message TYPE text(tokenizer = splitByString([' ','=',',',':','"','\''])) GRANULARITY 1 | |
| INDEX idx_msg Message TYPE text(tokenizer = splitByNonAlpha, preprocessor = lower(Message)) GRANULARITY 1 | |
| INDEX idx_msg Message TYPE text(tokenizer = splitByString(...), preprocessor = lower(Message)) GRANULARITY 1 | |
| INDEX idx_msg Message TYPE text(tokenizer = splitByNonAlpha, dictionary_block_size = 1024) GRANULARITY 1 | |
| INDEX idx_msg Message TYPE text(tokenizer = splitByNonAlpha, | |
| preprocessor = lower(Message), | |
| dictionary_block_size = 1024, | |
| dictionary_block_frontcoding_compression = 1) GRANULARITY 1 | |
| INDEX idx_msg Message TYPE text(tokenizer = ngrams(3)) GRANULARITY 1 | |
| INDEX idx_msg Message TYPE text(tokenizer = ngrams(4)) GRANULARITY 1 | |
| INDEX idx_msg Message TYPE text(tokenizer = sparseGrams) GRANULARITY 1 | |
| INDEX idx_msg Message TYPE text(tokenizer = sparseGrams(3, 5, 4)) GRANULARITY 1 | |
| </code></pre> | |
| <p class="small">Note: ClickHouse silently rewrites <code>GRANULARITY 1</code> on <code>text()</code> indexes to <code>GRANULARITY 100000000</code> (100M-row blocks) regardless of what you specify. The pruning unit for <code>text()</code> is much coarser than for <code>tokenbf_v1</code>.</p> | |
| </details> | |
| <p class="footnote">Bench methodology: one shard, single MergeTree (not Replicated) for clean per-shard size measurements. Each variant inserted from a local 1-part source table to avoid re-reading the production replicated source 14 times. Inserts run in waves of 4 concurrent. <code>OPTIMIZE TABLE … FINAL</code> on every variant before measuring. Sizes from <code>system.parts</code>; query latency from <code>system.query_log</code> with <code>SYSTEM FLUSH LOGS</code> for sync.</p> | |
| </body> | |
| </html> |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment