Skip to content

Instantly share code, notes, and snippets.

@samcm
Created May 14, 2026 09:33
Show Gist options
  • Select an option

  • Save samcm/6b8b05346484eb3b278f6b2bbbcc140c to your computer and use it in GitHub Desktop.

Select an option

Save samcm/6b8b05346484eb3b278f6b2bbbcc140c to your computer and use it in GitHub Desktop.
ClickHouse skip-index benchmark for log substring search — tokenbf_v1 vs text() vs ngrambf_v1
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>ClickHouse skip-index benchmark for log substring search</title>
<style>
:root {
--fg: #1a1a1a;
--muted: #666;
--line: #e3e3e3;
--accent: #2a6dd9;
--winner: #e7f5e9;
--loser: #fbeaea;
--code-bg: #f6f8fa;
}
html, body { background: #fff; color: var(--fg); }
body {
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif;
line-height: 1.55;
max-width: 920px;
margin: 2rem auto 4rem;
padding: 0 1.25rem;
font-size: 15.5px;
}
h1 { font-size: 1.85rem; margin: 0 0 .4rem; line-height: 1.2; }
h2 { font-size: 1.3rem; margin: 2.4rem 0 .6rem; padding-top: .4rem; border-top: 1px solid var(--line); }
h3 { font-size: 1.05rem; margin: 1.4rem 0 .3rem; }
.lede { color: var(--muted); margin: 0 0 1.5rem; }
table { border-collapse: collapse; width: 100%; margin: .6rem 0 1.2rem; font-size: 14px; }
th, td { padding: .35rem .55rem; border-bottom: 1px solid var(--line); text-align: left; vertical-align: top; }
th { font-weight: 600; background: #fafafa; }
td.num, th.num { text-align: right; font-variant-numeric: tabular-nums; }
tr.winner td { background: var(--winner); }
tr.loser td { background: var(--loser); }
code, pre { font-family: "SF Mono", Menlo, Consolas, monospace; font-size: 13px; }
pre { background: var(--code-bg); padding: .8rem 1rem; border-radius: 6px; overflow-x: auto; line-height: 1.4; }
code { background: var(--code-bg); padding: .08em .35em; border-radius: 3px; }
pre code { background: none; padding: 0; }
.tldr {
background: #f0f6ff;
border-left: 3px solid var(--accent);
padding: .9rem 1.1rem;
border-radius: 0 6px 6px 0;
margin: 1rem 0 1.6rem;
}
.tldr p { margin: .35rem 0; }
.small { color: var(--muted); font-size: 13px; }
.footnote { color: var(--muted); font-size: 13px; margin-top: 3rem; padding-top: 1rem; border-top: 1px solid var(--line); }
ul { padding-left: 1.4rem; }
li { margin: .25rem 0; }
</style>
</head>
<body>
<h1>ClickHouse skip-index benchmark for log substring search</h1>
<p class="lede">Tested whether <code>text()</code>, <code>ngrambf_v1</code>, or any tunable variant can beat <code>tokenbf_v1</code> for typical <code>WHERE Message LIKE '%pattern%'</code> workloads on real log data. <strong>Surprise: none of them prune granules meaningfully, and no-index is the fastest.</strong></p>
<div class="tldr">
<p><strong>TL;DR</strong></p>
<p>For substring <code>LIKE</code> queries on log data, the skip-index family is essentially decorative — every common token appears in every granule, so the bloom filter always says "maybe", and pruning is &lt; 0.01% of granules.</p>
<p>The current production choice (<code>tokenbf_v1(32768, 3, 0)</code>) is fine but oversized. <code>tokenbf_v1(8192, 3, 0)</code> gives a <strong>3.5× smaller index at no query cost</strong>. The <code>text()</code> family is 50–500× the size of <code>tokenbf_v1</code> and slightly <em>slower</em>.</p>
<p>If your <code>LIKE</code> patterns are common log tokens (<code>error</code>, <code>timeout</code>, hex prefixes), the index is decorative; you can probably drop it.</p>
</div>
<h2>Setup</h2>
<p>A single day of a production log table (<code>Message String</code> column) was copied 14 times onto one shard, each copy carrying a different skip-index configuration. After <code>OPTIMIZE TABLE … FINAL</code> on every variant, size and query-latency measurements were taken.</p>
<table>
<tr><th>Item</th><th>Value</th></tr>
<tr><td>ClickHouse</td><td>26.2.5</td></tr>
<tr><td>Rows per variant</td><td class="num">437,216,102</td></tr>
<tr><td>Compressed data per variant</td><td class="num">10.82 GiB</td></tr>
<tr><td>Engine</td><td><code>MergeTree</code>, one partition (one day)</td></tr>
<tr><td><code>ORDER BY</code></td><td><code>(tenant, node, timestamp)</code></td></tr>
<tr><td><code>index_granularity</code></td><td class="num">8,192</td></tr>
<tr><td>Total granules per variant</td><td class="num">~53,400</td></tr>
</table>
<p>Each query (<code>SELECT count() FROM &lt;variant&gt; WHERE Message LIKE '%pattern%'</code>) was run twice with cold caches (<code>SYSTEM DROP MARK CACHE</code> + <code>SYSTEM DROP UNCOMPRESSED CACHE</code> between runs; filesystem cache was deliberately left alone to avoid impacting other workloads). Numbers below are run-2 unless otherwise noted, taken from <code>system.query_log</code>.</p>
<h2>Index size after OPTIMIZE FINAL</h2>
<table>
<tr>
<th>Variant</th>
<th>Index spec</th>
<th class="num">Index size</th>
<th class="num">idx / data</th>
</tr>
<tr><td>baseline</td><td><em>(no skip index)</em></td><td class="num">0</td><td class="num">—</td></tr>
<tr class="winner"><td><strong>tokenbf_8k</strong></td><td><code>tokenbf_v1(8192, 3, 0) GRANULARITY 4</code></td><td class="num"><strong>70.3 MiB</strong></td><td class="num"><strong>0.63%</strong></td></tr>
<tr><td>tokenbf_v1 <span class="small">(common default)</span></td><td><code>tokenbf_v1(32768, 3, 0) GRANULARITY 4</code></td><td class="num">244.7 MiB</td><td class="num">2.21%</td></tr>
<tr><td>ngrambf_4</td><td><code>ngrambf_v1(4, 32768, 3, 0) GRANULARITY 4</code></td><td class="num">321.0 MiB</td><td class="num">2.90%</td></tr>
<tr><td>text_split_lower</td><td><code>text(splitByString([' ','=',',',':','"',"'"]), lower())</code></td><td class="num">12.00 GiB</td><td class="num">110.9%</td></tr>
<tr><td>text_split</td><td><code>text(splitByString([' ','=',',',':','"',"'"]))</code></td><td class="num">12.02 GiB</td><td class="num">111.1%</td></tr>
<tr><td>text_optimal <span class="small">(kitchen-sink tuning)</span></td><td><code>text(splitByNonAlpha, lower, dict_block_size=1024, dict_frontcoding=1)</code></td><td class="num">13.84 GiB</td><td class="num">127.9%</td></tr>
<tr><td>text_lower</td><td><code>text(splitByNonAlpha, preprocessor=lower)</code></td><td class="num">13.86 GiB</td><td class="num">128.2%</td></tr>
<tr><td>text_dict_big</td><td><code>text(splitByNonAlpha, dictionary_block_size=1024)</code></td><td class="num">13.94 GiB</td><td class="num">128.9%</td></tr>
<tr><td>text_default</td><td><code>text(splitByNonAlpha)</code></td><td class="num">13.97 GiB</td><td class="num">129.1%</td></tr>
<tr><td>text_sparse_cut</td><td><code>text(sparseGrams(3, 5, 4))</code></td><td class="num">40.89 GiB</td><td class="num">377.9%</td></tr>
<tr class="loser"><td>text_ngrams3</td><td><code>text(ngrams(3))</code></td><td class="num">67.86 GiB</td><td class="num">627.2%</td></tr>
<tr class="loser"><td>text_ngrams4</td><td><code>text(ngrams(4))</code></td><td class="num">77.29 GiB</td><td class="num">714.4%</td></tr>
<tr class="loser"><td>text_sparse <span class="small">(partial, killed)</span></td><td><code>text(sparseGrams)</code> <span class="small">— default</span></td><td class="num">~17×</td><td class="num"><em>extrapolated</em></td></tr>
</table>
<h3>Observations on sizing</h3>
<ul>
<li><code>tokenbf_v1(8192, 3, 0)</code> is the size winner — a quarter the size of the common <code>(32768)</code> default at no measurable cost.</li>
<li>Every <code>text()</code> variant is <strong>orders of magnitude larger</strong> than <code>tokenbf_v1</code>. The smallest text variant is <strong>170× larger</strong> than the smallest tokenbf.</li>
<li>The text-index "advanced" knobs (<code>preprocessor</code>, <code>dictionary_block_size</code>, <code>dictionary_block_frontcoding_compression</code>) move the index size by <strong>under 2%</strong>. Tokenizer choice is the only meaningful lever.</li>
<li><code>ngrams</code>-tokenizer text indexes are uniformly disastrous on size (627–714% of the data).</li>
<li><code>sparseGrams</code> default is the worst of all (~17× data) — the explicit cutoff form (<code>sparseGrams(3,5,4)</code>) tames it to ~4× but still terrible.</li>
</ul>
<h2>Query latency</h2>
<p>Five representative <code>LIKE</code> patterns covering different selectivity:</p>
<table>
<tr><th>Pattern</th><th class="num">Match count</th><th class="num">Match rate</th><th>Notes</th></tr>
<tr><td><code>%error%</code></td><td class="num">34,061,067</td><td class="num">7.8%</td><td>common log word</td></tr>
<tr><td><code>%0x%</code></td><td class="num">33,832,079</td><td class="num">7.7%</td><td>short alphanumeric — interesting case for tokenbf since it splits on non-alpha</td></tr>
<tr><td><code>%failed to%</code></td><td class="num">1,330,814</td><td class="num">0.3%</td><td>multi-word phrase with whitespace</td></tr>
<tr><td><code>%attestation%</code></td><td class="num">44,450,697</td><td class="num">10.2%</td><td>single rare-ish word</td></tr>
<tr><td><code>%timeout%</code></td><td class="num">25,602,941</td><td class="num">5.9%</td><td>common log word</td></tr>
</table>
<p>Wall-clock latency (ms, lower is better):</p>
<table>
<tr>
<th>Variant</th>
<th class="num"><code>error</code></th>
<th class="num"><code>0x</code></th>
<th class="num"><code>failed to</code></th>
<th class="num"><code>attestation</code></th>
<th class="num"><code>timeout</code></th>
<th class="num"><strong>avg</strong></th>
</tr>
<tr class="winner"><td><strong>baseline (no index)</strong></td><td class="num">1107</td><td class="num">952</td><td class="num">344</td><td class="num">955</td><td class="num">541</td><td class="num"><strong>780</strong></td></tr>
<tr><td>text_default</td><td class="num">1129</td><td class="num">924</td><td class="num">363</td><td class="num">1001</td><td class="num">589</td><td class="num">801</td></tr>
<tr><td>text_split_lower</td><td class="num">1132</td><td class="num">988</td><td class="num">384</td><td class="num">931</td><td class="num">665</td><td class="num">820</td></tr>
<tr><td>tokenbf_8k</td><td class="num">1188</td><td class="num">977</td><td class="num">387</td><td class="num">1009</td><td class="num">573</td><td class="num">827</td></tr>
<tr><td>tokenbf_v1</td><td class="num">1334</td><td class="num">1062</td><td class="num">438</td><td class="num">1114</td><td class="num">640</td><td class="num">918</td></tr>
<tr><td>ngrambf_4</td><td class="num">1386</td><td class="num">1064</td><td class="num">567</td><td class="num">1193</td><td class="num">810</td><td class="num">1004</td></tr>
<tr class="loser"><td>text_ngrams3</td><td class="num">1569</td><td class="num">999</td><td class="num">574</td><td class="num">1424</td><td class="num">850</td><td class="num">1083</td></tr>
</table>
<p><strong>No-index is the fastest across every pattern.</strong> Every skip-index variant pays a non-zero index-consultation cost on every query without buying back any pruning savings.</p>
<h2>Why? <code>EXPLAIN indexes=1</code></h2>
<p>For <code>WHERE Message LIKE '%attestation%'</code> on each variant, how many granules survived the skip-index check:</p>
<table>
<tr><th>Variant</th><th class="num">Granules kept</th><th class="num">Pruned</th></tr>
<tr><td>tokenbf_v1</td><td class="num">53,397 / 53,397</td><td class="num">0</td></tr>
<tr><td>tokenbf_8k</td><td class="num">53,382 / 53,382</td><td class="num">0</td></tr>
<tr><td>text_ngrams3</td><td class="num">53,379 / 53,383</td><td class="num"><strong>4</strong> (0.007%)</td></tr>
</table>
<p>Even single-word common terms get zero useful pruning. <code>attestation</code> appears in nearly every 8,192-row granule, so the bloom filter says "maybe present" for ~all granules.</p>
<p>The reduced <code>read_rows</code> reported by <code>system.query_log</code> (e.g. 104M of 437M for <code>%failed to%</code>) is <strong>not</strong> coming from skip-index pruning — it comes from ClickHouse's intra-granule <code>LIKE</code> evaluator short-circuiting per row. Compare:</p>
<pre><code>-- bench_no_index: 53,397 / 53,397 granules selected, no skip index
-- bench_tokenbf_v1: same granules selected; index consulted but didn't prune
-- bench_text_ngrams3: skip index tokenizes "failed to" into 7 trigrams,
-- prunes 12 of 53,383 granules (0.02%)</code></pre>
<p>The "useless granule" problem isn't a tokenizer issue — it's a property of the data. With ~8,000 rows per granule and a token appearing in &gt;0.5% of rows, every granule contains it. The bloom filter cannot help.</p>
<h2>What about <code>hasToken()</code> instead of <code>LIKE</code>?</h2>
<p>The <code>text()</code> index is genuinely <em>designed</em> for the <code>hasToken</code>, <code>hasAnyToken</code>, <code>searchAny</code>, etc. operators — not <code>LIKE</code>. Those operators speak directly to the inverted index without round-tripping through pattern matching. This benchmark only covers <code>LIKE</code> because that's what real-world Grafana / dashboard queries against log data generate. If your workload is <code>hasToken()</code>-shaped, re-run with that operator.</p>
<h2>Recommendations</h2>
<ul>
<li><strong>For new schemas:</strong> use <code>tokenbf_v1(8192, 3, 0) GRANULARITY 4</code>, or omit the index entirely.</li>
<li><strong>For existing schemas with <code>tokenbf_v1(32768, …)</code>:</strong> consider downsizing — saves ~70% of the index disk for free.</li>
<li><strong>For existing schemas with <code>text()</code> on Message/Body for LIKE-shaped queries:</strong> rip it out. It's costing 50–500× the disk of an equivalent <code>tokenbf_v1</code> and offering no query benefit.</li>
<li><strong>Avoid <code>ngrambf_v1</code> for this workload:</strong> bigger index, slower queries.</li>
<li><strong>If you genuinely need pruning for substring search:</strong> the lever isn't the index — it's the <code>ORDER BY</code>. Move the column you actually filter on to the prefix of <code>ORDER BY</code>, and pre-tokenize known patterns into materialized columns.</li>
</ul>
<h2>Variants tested — full list (16)</h2>
<details>
<summary>Click to expand DDL snippets for each variant</summary>
<pre><code>-- baseline
CREATE TABLE bench_no_index (... Message String CODEC(ZSTD(1)))
ENGINE = MergeTree() ORDER BY (tenant, node, timestamp);
-- tokenbf_v1 family
INDEX idx_msg Message TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 4 -- winner
INDEX idx_msg Message TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 4 -- control
-- ngrambf_v1 family
INDEX idx_msg Message TYPE ngrambf_v1(4, 32768, 3, 0) GRANULARITY 4
-- text() family
INDEX idx_msg Message TYPE text(tokenizer = 'splitByNonAlpha') GRANULARITY 1
INDEX idx_msg Message TYPE text(tokenizer = splitByString([' ','=',',',':','"','\''])) GRANULARITY 1
INDEX idx_msg Message TYPE text(tokenizer = splitByNonAlpha, preprocessor = lower(Message)) GRANULARITY 1
INDEX idx_msg Message TYPE text(tokenizer = splitByString(...), preprocessor = lower(Message)) GRANULARITY 1
INDEX idx_msg Message TYPE text(tokenizer = splitByNonAlpha, dictionary_block_size = 1024) GRANULARITY 1
INDEX idx_msg Message TYPE text(tokenizer = splitByNonAlpha,
preprocessor = lower(Message),
dictionary_block_size = 1024,
dictionary_block_frontcoding_compression = 1) GRANULARITY 1
INDEX idx_msg Message TYPE text(tokenizer = ngrams(3)) GRANULARITY 1
INDEX idx_msg Message TYPE text(tokenizer = ngrams(4)) GRANULARITY 1
INDEX idx_msg Message TYPE text(tokenizer = sparseGrams) GRANULARITY 1
INDEX idx_msg Message TYPE text(tokenizer = sparseGrams(3, 5, 4)) GRANULARITY 1
</code></pre>
<p class="small">Note: ClickHouse silently rewrites <code>GRANULARITY 1</code> on <code>text()</code> indexes to <code>GRANULARITY 100000000</code> (100M-row blocks) regardless of what you specify. The pruning unit for <code>text()</code> is much coarser than for <code>tokenbf_v1</code>.</p>
</details>
<p class="footnote">Bench methodology: one shard, single MergeTree (not Replicated) for clean per-shard size measurements. Each variant inserted from a local 1-part source table to avoid re-reading the production replicated source 14 times. Inserts run in waves of 4 concurrent. <code>OPTIMIZE TABLE … FINAL</code> on every variant before measuring. Sizes from <code>system.parts</code>; query latency from <code>system.query_log</code> with <code>SYSTEM FLUSH LOGS</code> for sync.</p>
</body>
</html>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment