samcm · May 14, 2026 09:33
diff --git a/report.html b/report.html
 <!doctype html>
 <html lang="en">
 <head>
  <meta charset="utf-8">
  <title>ClickHouse skip-index benchmark for log substring search</title>
  <style>
    :root {
      --fg: #1a1a1a;
      --muted: #666;
      --line: #e3e3e3;
      --accent: #2a6dd9;
      --winner: #e7f5e9;
      --loser: #fbeaea;
      --code-bg: #f6f8fa;
    }
    html, body { background: #fff; color: var(--fg); }
    body {
      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif;
      line-height: 1.55;
      max-width: 920px;
      margin: 2rem auto 4rem;
      padding: 0 1.25rem;
      font-size: 15.5px;
    }
    h1 { font-size: 1.85rem; margin: 0 0 .4rem; line-height: 1.2; }
    h2 { font-size: 1.3rem; margin: 2.4rem 0 .6rem; padding-top: .4rem; border-top: 1px solid var(--line); }
    h3 { font-size: 1.05rem; margin: 1.4rem 0 .3rem; }
    .lede { color: var(--muted); margin: 0 0 1.5rem; }
    table { border-collapse: collapse; width: 100%; margin: .6rem 0 1.2rem; font-size: 14px; }
    th, td { padding: .35rem .55rem; border-bottom: 1px solid var(--line); text-align: left; vertical-align: top; }
    th { font-weight: 600; background: #fafafa; }
    td.num, th.num { text-align: right; font-variant-numeric: tabular-nums; }
    tr.winner td { background: var(--winner); }
    tr.loser td { background: var(--loser); }
    code, pre { font-family: "SF Mono", Menlo, Consolas, monospace; font-size: 13px; }
    pre { background: var(--code-bg); padding: .8rem 1rem; border-radius: 6px; overflow-x: auto; line-height: 1.4; }
    code { background: var(--code-bg); padding: .08em .35em; border-radius: 3px; }
    pre code { background: none; padding: 0; }
    .tldr {
      background: #f0f6ff;
      border-left: 3px solid var(--accent);
      padding: .9rem 1.1rem;
      border-radius: 0 6px 6px 0;
      margin: 1rem 0 1.6rem;
    }
    .tldr p { margin: .35rem 0; }
    .small { color: var(--muted); font-size: 13px; }
    .footnote { color: var(--muted); font-size: 13px; margin-top: 3rem; padding-top: 1rem; border-top: 1px solid var(--line); }
    ul { padding-left: 1.4rem; }
    li { margin: .25rem 0; }
  </style>
 </head>
 <body>

 <h1>ClickHouse skip-index benchmark for log substring search</h1>
 <p class="lede">Tested whether <code>text()</code>, <code>ngrambf_v1</code>, or any tunable variant can beat <code>tokenbf_v1</code> for typical <code>WHERE Message LIKE '%pattern%'</code> workloads on real log data. <strong>Surprise: none of them prune granules meaningfully, and no-index is the fastest.</strong></p>

 <div class="tldr">
  <p><strong>TL;DR</strong></p>
  <p>For substring <code>LIKE</code> queries on log data, the skip-index family is essentially decorative — every common token appears in every granule, so the bloom filter always says "maybe", and pruning is &lt; 0.01% of granules.</p>
  <p>The current production choice (<code>tokenbf_v1(32768, 3, 0)</code>) is fine but oversized. <code>tokenbf_v1(8192, 3, 0)</code> gives a <strong>3.5× smaller index at no query cost</strong>. The <code>text()</code> family is 50–500× the size of <code>tokenbf_v1</code> and slightly <em>slower</em>.</p>
  <p>If your <code>LIKE</code> patterns are common log tokens (<code>error</code>, <code>timeout</code>, hex prefixes), the index is decorative; you can probably drop it.</p>
 </div>

 <h2>Setup</h2>

 <p>A single day of a production log table (<code>Message String</code> column) was copied 14 times onto one shard, each copy carrying a different skip-index configuration. After <code>OPTIMIZE TABLE … FINAL</code> on every variant, size and query-latency measurements were taken.</p>

 <table>
  <tr><th>Item</th><th>Value</th></tr>
  <tr><td>ClickHouse</td><td>26.2.5</td></tr>
  <tr><td>Rows per variant</td><td class="num">437,216,102</td></tr>
  <tr><td>Compressed data per variant</td><td class="num">10.82 GiB</td></tr>
  <tr><td>Engine</td><td><code>MergeTree</code>, one partition (one day)</td></tr>
  <tr><td><code>ORDER BY</code></td><td><code>(tenant, node, timestamp)</code></td></tr>
  <tr><td><code>index_granularity</code></td><td class="num">8,192</td></tr>
  <tr><td>Total granules per variant</td><td class="num">~53,400</td></tr>
 </table>

 <p>Each query (<code>SELECT count() FROM &lt;variant&gt; WHERE Message LIKE '%pattern%'</code>) was run twice with cold caches (<code>SYSTEM DROP MARK CACHE</code> + <code>SYSTEM DROP UNCOMPRESSED CACHE</code> between runs; filesystem cache was deliberately left alone to avoid impacting other workloads). Numbers below are run-2 unless otherwise noted, taken from <code>system.query_log</code>.</p>

 <h2>Index size after OPTIMIZE FINAL</h2>

 <table>
  <tr>
    <th>Variant</th>
    <th>Index spec</th>
    <th class="num">Index size</th>
    <th class="num">idx / data</th>
  </tr>
  <tr><td>baseline</td><td><em>(no skip index)</em></td><td class="num">0</td><td class="num">—</td></tr>
  <tr class="winner"><td><strong>tokenbf_8k</strong></td><td><code>tokenbf_v1(8192, 3, 0) GRANULARITY 4</code></td><td class="num"><strong>70.3 MiB</strong></td><td class="num"><strong>0.63%</strong></td></tr>
  <tr><td>tokenbf_v1 <span class="small">(common default)</span></td><td><code>tokenbf_v1(32768, 3, 0) GRANULARITY 4</code></td><td class="num">244.7 MiB</td><td class="num">2.21%</td></tr>
  <tr><td>ngrambf_4</td><td><code>ngrambf_v1(4, 32768, 3, 0) GRANULARITY 4</code></td><td class="num">321.0 MiB</td><td class="num">2.90%</td></tr>
  <tr><td>text_split_lower</td><td><code>text(splitByString([' ','=',',',':','"',"'"]), lower())</code></td><td class="num">12.00 GiB</td><td class="num">110.9%</td></tr>
  <tr><td>text_split</td><td><code>text(splitByString([' ','=',',',':','"',"'"]))</code></td><td class="num">12.02 GiB</td><td class="num">111.1%</td></tr>
  <tr><td>text_optimal <span class="small">(kitchen-sink tuning)</span></td><td><code>text(splitByNonAlpha, lower, dict_block_size=1024, dict_frontcoding=1)</code></td><td class="num">13.84 GiB</td><td class="num">127.9%</td></tr>
  <tr><td>text_lower</td><td><code>text(splitByNonAlpha, preprocessor=lower)</code></td><td class="num">13.86 GiB</td><td class="num">128.2%</td></tr>
  <tr><td>text_dict_big</td><td><code>text(splitByNonAlpha, dictionary_block_size=1024)</code></td><td class="num">13.94 GiB</td><td class="num">128.9%</td></tr>
  <tr><td>text_default</td><td><code>text(splitByNonAlpha)</code></td><td class="num">13.97 GiB</td><td class="num">129.1%</td></tr>
  <tr><td>text_sparse_cut</td><td><code>text(sparseGrams(3, 5, 4))</code></td><td class="num">40.89 GiB</td><td class="num">377.9%</td></tr>
  <tr class="loser"><td>text_ngrams3</td><td><code>text(ngrams(3))</code></td><td class="num">67.86 GiB</td><td class="num">627.2%</td></tr>
  <tr class="loser"><td>text_ngrams4</td><td><code>text(ngrams(4))</code></td><td class="num">77.29 GiB</td><td class="num">714.4%</td></tr>
  <tr class="loser"><td>text_sparse <span class="small">(partial, killed)</span></td><td><code>text(sparseGrams)</code> <span class="small">— default</span></td><td class="num">~17×</td><td class="num"><em>extrapolated</em></td></tr>
 </table>

 <h3>Observations on sizing</h3>
 <ul>
  <li><code>tokenbf_v1(8192, 3, 0)</code> is the size winner — a quarter the size of the common <code>(32768)</code> default at no measurable cost.</li>
  <li>Every <code>text()</code> variant is <strong>orders of magnitude larger</strong> than <code>tokenbf_v1</code>. The smallest text variant is <strong>170× larger</strong> than the smallest tokenbf.</li>
  <li>The text-index "advanced" knobs (<code>preprocessor</code>, <code>dictionary_block_size</code>, <code>dictionary_block_frontcoding_compression</code>) move the index size by <strong>under 2%</strong>. Tokenizer choice is the only meaningful lever.</li>
  <li><code>ngrams</code>-tokenizer text indexes are uniformly disastrous on size (627–714% of the data).</li>
  <li><code>sparseGrams</code> default is the worst of all (~17× data) — the explicit cutoff form (<code>sparseGrams(3,5,4)</code>) tames it to ~4× but still terrible.</li>
 </ul>

 <h2>Query latency</h2>

 <p>Five representative <code>LIKE</code> patterns covering different selectivity:</p>

 <table>
  <tr><th>Pattern</th><th class="num">Match count</th><th class="num">Match rate</th><th>Notes</th></tr>
  <tr><td><code>%error%</code></td><td class="num">34,061,067</td><td class="num">7.8%</td><td>common log word</td></tr>
  <tr><td><code>%0x%</code></td><td class="num">33,832,079</td><td class="num">7.7%</td><td>short alphanumeric — interesting case for tokenbf since it splits on non-alpha</td></tr>
  <tr><td><code>%failed to%</code></td><td class="num">1,330,814</td><td class="num">0.3%</td><td>multi-word phrase with whitespace</td></tr>
  <tr><td><code>%attestation%</code></td><td class="num">44,450,697</td><td class="num">10.2%</td><td>single rare-ish word</td></tr>
  <tr><td><code>%timeout%</code></td><td class="num">25,602,941</td><td class="num">5.9%</td><td>common log word</td></tr>
 </table>

 <p>Wall-clock latency (ms, lower is better):</p>

 <table>
  <tr>
    <th>Variant</th>
    <th class="num"><code>error</code></th>
    <th class="num"><code>0x</code></th>
    <th class="num"><code>failed to</code></th>
    <th class="num"><code>attestation</code></th>
    <th class="num"><code>timeout</code></th>
    <th class="num"><strong>avg</strong></th>
  </tr>
  <tr class="winner"><td><strong>baseline (no index)</strong></td><td class="num">1107</td><td class="num">952</td><td class="num">344</td><td class="num">955</td><td class="num">541</td><td class="num"><strong>780</strong></td></tr>
  <tr><td>text_default</td><td class="num">1129</td><td class="num">924</td><td class="num">363</td><td class="num">1001</td><td class="num">589</td><td class="num">801</td></tr>
  <tr><td>text_split_lower</td><td class="num">1132</td><td class="num">988</td><td class="num">384</td><td class="num">931</td><td class="num">665</td><td class="num">820</td></tr>
  <tr><td>tokenbf_8k</td><td class="num">1188</td><td class="num">977</td><td class="num">387</td><td class="num">1009</td><td class="num">573</td><td class="num">827</td></tr>
  <tr><td>tokenbf_v1</td><td class="num">1334</td><td class="num">1062</td><td class="num">438</td><td class="num">1114</td><td class="num">640</td><td class="num">918</td></tr>
  <tr><td>ngrambf_4</td><td class="num">1386</td><td class="num">1064</td><td class="num">567</td><td class="num">1193</td><td class="num">810</td><td class="num">1004</td></tr>
  <tr class="loser"><td>text_ngrams3</td><td class="num">1569</td><td class="num">999</td><td class="num">574</td><td class="num">1424</td><td class="num">850</td><td class="num">1083</td></tr>
 </table>

 <p><strong>No-index is the fastest across every pattern.</strong> Every skip-index variant pays a non-zero index-consultation cost on every query without buying back any pruning savings.</p>

 <h2>Why? <code>EXPLAIN indexes=1</code></h2>

 <p>For <code>WHERE Message LIKE '%attestation%'</code> on each variant, how many granules survived the skip-index check:</p>

 <table>
  <tr><th>Variant</th><th class="num">Granules kept</th><th class="num">Pruned</th></tr>
  <tr><td>tokenbf_v1</td><td class="num">53,397 / 53,397</td><td class="num">0</td></tr>
  <tr><td>tokenbf_8k</td><td class="num">53,382 / 53,382</td><td class="num">0</td></tr>
  <tr><td>text_ngrams3</td><td class="num">53,379 / 53,383</td><td class="num"><strong>4</strong> (0.007%)</td></tr>
 </table>

 <p>Even single-word common terms get zero useful pruning. <code>attestation</code> appears in nearly every 8,192-row granule, so the bloom filter says "maybe present" for ~all granules.</p>

 <p>The reduced <code>read_rows</code> reported by <code>system.query_log</code> (e.g. 104M of 437M for <code>%failed to%</code>) is <strong>not</strong> coming from skip-index pruning — it comes from ClickHouse's intra-granule <code>LIKE</code> evaluator short-circuiting per row. Compare:</p>

 <pre><code>-- bench_no_index: 53,397 / 53,397 granules selected, no skip index
 -- bench_tokenbf_v1: same granules selected; index consulted but didn't prune
 -- bench_text_ngrams3: skip index tokenizes "failed to" into 7 trigrams,
 --   prunes 12 of 53,383 granules (0.02%)</code></pre>

 <p>The "useless granule" problem isn't a tokenizer issue — it's a property of the data. With ~8,000 rows per granule and a token appearing in &gt;0.5% of rows, every granule contains it. The bloom filter cannot help.</p>

 <h2>What about <code>hasToken()</code> instead of <code>LIKE</code>?</h2>

 <p>The <code>text()</code> index is genuinely <em>designed</em> for the <code>hasToken</code>, <code>hasAnyToken</code>, <code>searchAny</code>, etc. operators — not <code>LIKE</code>. Those operators speak directly to the inverted index without round-tripping through pattern matching. This benchmark only covers <code>LIKE</code> because that's what real-world Grafana / dashboard queries against log data generate. If your workload is <code>hasToken()</code>-shaped, re-run with that operator.</p>

 <h2>Recommendations</h2>

 <ul>
  <li><strong>For new schemas:</strong> use <code>tokenbf_v1(8192, 3, 0) GRANULARITY 4</code>, or omit the index entirely.</li>
  <li><strong>For existing schemas with <code>tokenbf_v1(32768, …)</code>:</strong> consider downsizing — saves ~70% of the index disk for free.</li>
  <li><strong>For existing schemas with <code>text()</code> on Message/Body for LIKE-shaped queries:</strong> rip it out. It's costing 50–500× the disk of an equivalent <code>tokenbf_v1</code> and offering no query benefit.</li>
  <li><strong>Avoid <code>ngrambf_v1</code> for this workload:</strong> bigger index, slower queries.</li>
  <li><strong>If you genuinely need pruning for substring search:</strong> the lever isn't the index — it's the <code>ORDER BY</code>. Move the column you actually filter on to the prefix of <code>ORDER BY</code>, and pre-tokenize known patterns into materialized columns.</li>
 </ul>

 <h2>Variants tested — full list (16)</h2>

 <details>
 <summary>Click to expand DDL snippets for each variant</summary>
 <pre><code>-- baseline
 CREATE TABLE bench_no_index (... Message String CODEC(ZSTD(1)))
 ENGINE = MergeTree() ORDER BY (tenant, node, timestamp);

 -- tokenbf_v1 family
 INDEX idx_msg Message TYPE tokenbf_v1(8192, 3, 0)  GRANULARITY 4    -- winner
 INDEX idx_msg Message TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 4    -- control

 -- ngrambf_v1 family
 INDEX idx_msg Message TYPE ngrambf_v1(4, 32768, 3, 0) GRANULARITY 4

 -- text() family
 INDEX idx_msg Message TYPE text(tokenizer = 'splitByNonAlpha')                                 GRANULARITY 1
 INDEX idx_msg Message TYPE text(tokenizer = splitByString([' ','=',',',':','"','\'']))         GRANULARITY 1
 INDEX idx_msg Message TYPE text(tokenizer = splitByNonAlpha, preprocessor = lower(Message))    GRANULARITY 1
 INDEX idx_msg Message TYPE text(tokenizer = splitByString(...), preprocessor = lower(Message)) GRANULARITY 1
 INDEX idx_msg Message TYPE text(tokenizer = splitByNonAlpha, dictionary_block_size = 1024)     GRANULARITY 1
 INDEX idx_msg Message TYPE text(tokenizer = splitByNonAlpha,
                                preprocessor = lower(Message),
                                dictionary_block_size = 1024,
                                dictionary_block_frontcoding_compression = 1)                  GRANULARITY 1
 INDEX idx_msg Message TYPE text(tokenizer = ngrams(3))                                          GRANULARITY 1
 INDEX idx_msg Message TYPE text(tokenizer = ngrams(4))                                          GRANULARITY 1
 INDEX idx_msg Message TYPE text(tokenizer = sparseGrams)                                        GRANULARITY 1
 INDEX idx_msg Message TYPE text(tokenizer = sparseGrams(3, 5, 4))                               GRANULARITY 1
 </code></pre>

 <p class="small">Note: ClickHouse silently rewrites <code>GRANULARITY 1</code> on <code>text()</code> indexes to <code>GRANULARITY 100000000</code> (100M-row blocks) regardless of what you specify. The pruning unit for <code>text()</code> is much coarser than for <code>tokenbf_v1</code>.</p>
 </details>

 <p class="footnote">Bench methodology: one shard, single MergeTree (not Replicated) for clean per-shard size measurements. Each variant inserted from a local 1-part source table to avoid re-reading the production replicated source 14 times. Inserts run in waves of 4 concurrent. <code>OPTIMIZE TABLE … FINAL</code> on every variant before measuring. Sizes from <code>system.parts</code>; query latency from <code>system.query_log</code> with <code>SYSTEM FLUSH LOGS</code> for sync.</p>

 </body>
 </html>
	<!doctype html>
	<html lang="en">
	<head>
	<meta charset="utf-8">
	<title>ClickHouse skip-index benchmark for log substring search</title>
	<style>
	:root {
	--fg: #1a1a1a;
	--muted: #666;
	--line: #e3e3e3;
	--accent: #2a6dd9;
	--winner: #e7f5e9;
	--loser: #fbeaea;
	--code-bg: #f6f8fa;
	}
	html, body { background: #fff; color: var(--fg); }
	body {
	font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif;
	line-height: 1.55;
	max-width: 920px;
	margin: 2rem auto 4rem;
	padding: 0 1.25rem;
	font-size: 15.5px;
	}
	h1 { font-size: 1.85rem; margin: 0 0 .4rem; line-height: 1.2; }
	h2 { font-size: 1.3rem; margin: 2.4rem 0 .6rem; padding-top: .4rem; border-top: 1px solid var(--line); }
	h3 { font-size: 1.05rem; margin: 1.4rem 0 .3rem; }
	.lede { color: var(--muted); margin: 0 0 1.5rem; }
	table { border-collapse: collapse; width: 100%; margin: .6rem 0 1.2rem; font-size: 14px; }
	th, td { padding: .35rem .55rem; border-bottom: 1px solid var(--line); text-align: left; vertical-align: top; }
	th { font-weight: 600; background: #fafafa; }
	td.num, th.num { text-align: right; font-variant-numeric: tabular-nums; }
	tr.winner td { background: var(--winner); }
	tr.loser td { background: var(--loser); }
	code, pre { font-family: "SF Mono", Menlo, Consolas, monospace; font-size: 13px; }
	pre { background: var(--code-bg); padding: .8rem 1rem; border-radius: 6px; overflow-x: auto; line-height: 1.4; }
	code { background: var(--code-bg); padding: .08em .35em; border-radius: 3px; }
	pre code { background: none; padding: 0; }
	.tldr {
	background: #f0f6ff;
	border-left: 3px solid var(--accent);
	padding: .9rem 1.1rem;
	border-radius: 0 6px 6px 0;
	margin: 1rem 0 1.6rem;
	}
	.tldr p { margin: .35rem 0; }
	.small { color: var(--muted); font-size: 13px; }
	.footnote { color: var(--muted); font-size: 13px; margin-top: 3rem; padding-top: 1rem; border-top: 1px solid var(--line); }
	ul { padding-left: 1.4rem; }
	li { margin: .25rem 0; }
	</style>
	</head>
	<body>

	<h1>ClickHouse skip-index benchmark for log substring search</h1>
	<p class="lede">Tested whether <code>text()</code>, <code>ngrambf_v1</code>, or any tunable variant can beat <code>tokenbf_v1</code> for typical <code>WHERE Message LIKE '%pattern%'</code> workloads on real log data. <strong>Surprise: none of them prune granules meaningfully, and no-index is the fastest.</strong></p>

	<div class="tldr">
	<p><strong>TL;DR</strong></p>
	<p>For substring <code>LIKE</code> queries on log data, the skip-index family is essentially decorative — every common token appears in every granule, so the bloom filter always says "maybe", and pruning is < 0.01% of granules.</p>
	<p>The current production choice (<code>tokenbf_v1(32768, 3, 0)</code>) is fine but oversized. <code>tokenbf_v1(8192, 3, 0)</code> gives a <strong>3.5× smaller index at no query cost</strong>. The <code>text()</code> family is 50–500× the size of <code>tokenbf_v1</code> and slightly <em>slower</em>.</p>
	<p>If your <code>LIKE</code> patterns are common log tokens (<code>error</code>, <code>timeout</code>, hex prefixes), the index is decorative; you can probably drop it.</p>
	</div>

	<h2>Setup</h2>

	<p>A single day of a production log table (<code>Message String</code> column) was copied 14 times onto one shard, each copy carrying a different skip-index configuration. After <code>OPTIMIZE TABLE … FINAL</code> on every variant, size and query-latency measurements were taken.</p>

	<table>
	<tr><th>Item</th><th>Value</th></tr>
	<tr><td>ClickHouse</td><td>26.2.5</td></tr>
	<tr><td>Rows per variant</td><td class="num">437,216,102</td></tr>
	<tr><td>Compressed data per variant</td><td class="num">10.82 GiB</td></tr>
	<tr><td>Engine</td><td><code>MergeTree</code>, one partition (one day)</td></tr>
	<tr><td><code>ORDER BY</code></td><td><code>(tenant, node, timestamp)</code></td></tr>
	<tr><td><code>index_granularity</code></td><td class="num">8,192</td></tr>
	<tr><td>Total granules per variant</td><td class="num">~53,400</td></tr>
	</table>

	<p>Each query (<code>SELECT count() FROM <variant> WHERE Message LIKE '%pattern%'</code>) was run twice with cold caches (<code>SYSTEM DROP MARK CACHE</code> + <code>SYSTEM DROP UNCOMPRESSED CACHE</code> between runs; filesystem cache was deliberately left alone to avoid impacting other workloads). Numbers below are run-2 unless otherwise noted, taken from <code>system.query_log</code>.</p>

	<h2>Index size after OPTIMIZE FINAL</h2>

	<table>
	<tr>
	<th>Variant</th>
	<th>Index spec</th>
	<th class="num">Index size</th>
	<th class="num">idx / data</th>
	</tr>
	<tr><td>baseline</td><td><em>(no skip index)</em></td><td class="num">0</td><td class="num">—</td></tr>
	<tr class="winner"><td><strong>tokenbf_8k</strong></td><td><code>tokenbf_v1(8192, 3, 0) GRANULARITY 4</code></td><td class="num"><strong>70.3 MiB</strong></td><td class="num"><strong>0.63%</strong></td></tr>
	<tr><td>tokenbf_v1 <span class="small">(common default)</span></td><td><code>tokenbf_v1(32768, 3, 0) GRANULARITY 4</code></td><td class="num">244.7 MiB</td><td class="num">2.21%</td></tr>
	<tr><td>ngrambf_4</td><td><code>ngrambf_v1(4, 32768, 3, 0) GRANULARITY 4</code></td><td class="num">321.0 MiB</td><td class="num">2.90%</td></tr>
	<tr><td>text_split_lower</td><td><code>text(splitByString([' ','=',',',':','"',"'"]), lower())</code></td><td class="num">12.00 GiB</td><td class="num">110.9%</td></tr>
	<tr><td>text_split</td><td><code>text(splitByString([' ','=',',',':','"',"'"]))</code></td><td class="num">12.02 GiB</td><td class="num">111.1%</td></tr>
	<tr><td>text_optimal <span class="small">(kitchen-sink tuning)</span></td><td><code>text(splitByNonAlpha, lower, dict_block_size=1024, dict_frontcoding=1)</code></td><td class="num">13.84 GiB</td><td class="num">127.9%</td></tr>
	<tr><td>text_lower</td><td><code>text(splitByNonAlpha, preprocessor=lower)</code></td><td class="num">13.86 GiB</td><td class="num">128.2%</td></tr>
	<tr><td>text_dict_big</td><td><code>text(splitByNonAlpha, dictionary_block_size=1024)</code></td><td class="num">13.94 GiB</td><td class="num">128.9%</td></tr>
	<tr><td>text_default</td><td><code>text(splitByNonAlpha)</code></td><td class="num">13.97 GiB</td><td class="num">129.1%</td></tr>
	<tr><td>text_sparse_cut</td><td><code>text(sparseGrams(3, 5, 4))</code></td><td class="num">40.89 GiB</td><td class="num">377.9%</td></tr>
	<tr class="loser"><td>text_ngrams3</td><td><code>text(ngrams(3))</code></td><td class="num">67.86 GiB</td><td class="num">627.2%</td></tr>
	<tr class="loser"><td>text_ngrams4</td><td><code>text(ngrams(4))</code></td><td class="num">77.29 GiB</td><td class="num">714.4%</td></tr>
	<tr class="loser"><td>text_sparse <span class="small">(partial, killed)</span></td><td><code>text(sparseGrams)</code> <span class="small">— default</span></td><td class="num">~17×</td><td class="num"><em>extrapolated</em></td></tr>
	</table>

	<h3>Observations on sizing</h3>
	<ul>
	<li><code>tokenbf_v1(8192, 3, 0)</code> is the size winner — a quarter the size of the common <code>(32768)</code> default at no measurable cost.</li>
	<li>Every <code>text()</code> variant is <strong>orders of magnitude larger</strong> than <code>tokenbf_v1</code>. The smallest text variant is <strong>170× larger</strong> than the smallest tokenbf.</li>
	<li>The text-index "advanced" knobs (<code>preprocessor</code>, <code>dictionary_block_size</code>, <code>dictionary_block_frontcoding_compression</code>) move the index size by <strong>under 2%</strong>. Tokenizer choice is the only meaningful lever.</li>
	<li><code>ngrams</code>-tokenizer text indexes are uniformly disastrous on size (627–714% of the data).</li>
	<li><code>sparseGrams</code> default is the worst of all (~17× data) — the explicit cutoff form (<code>sparseGrams(3,5,4)</code>) tames it to ~4× but still terrible.</li>
	</ul>

	<h2>Query latency</h2>

	<p>Five representative <code>LIKE</code> patterns covering different selectivity:</p>

	<table>
	<tr><th>Pattern</th><th class="num">Match count</th><th class="num">Match rate</th><th>Notes</th></tr>
	<tr><td><code>%error%</code></td><td class="num">34,061,067</td><td class="num">7.8%</td><td>common log word</td></tr>
	<tr><td><code>%0x%</code></td><td class="num">33,832,079</td><td class="num">7.7%</td><td>short alphanumeric — interesting case for tokenbf since it splits on non-alpha</td></tr>
	<tr><td><code>%failed to%</code></td><td class="num">1,330,814</td><td class="num">0.3%</td><td>multi-word phrase with whitespace</td></tr>
	<tr><td><code>%attestation%</code></td><td class="num">44,450,697</td><td class="num">10.2%</td><td>single rare-ish word</td></tr>
	<tr><td><code>%timeout%</code></td><td class="num">25,602,941</td><td class="num">5.9%</td><td>common log word</td></tr>
	</table>

	<p>Wall-clock latency (ms, lower is better):</p>

	<table>
	<tr>
	<th>Variant</th>
	<th class="num"><code>error</code></th>
	<th class="num"><code>0x</code></th>
	<th class="num"><code>failed to</code></th>
	<th class="num"><code>attestation</code></th>
	<th class="num"><code>timeout</code></th>
	<th class="num"><strong>avg</strong></th>
	</tr>
	<tr class="winner"><td><strong>baseline (no index)</strong></td><td class="num">1107</td><td class="num">952</td><td class="num">344</td><td class="num">955</td><td class="num">541</td><td class="num"><strong>780</strong></td></tr>
	<tr><td>text_default</td><td class="num">1129</td><td class="num">924</td><td class="num">363</td><td class="num">1001</td><td class="num">589</td><td class="num">801</td></tr>
	<tr><td>text_split_lower</td><td class="num">1132</td><td class="num">988</td><td class="num">384</td><td class="num">931</td><td class="num">665</td><td class="num">820</td></tr>
	<tr><td>tokenbf_8k</td><td class="num">1188</td><td class="num">977</td><td class="num">387</td><td class="num">1009</td><td class="num">573</td><td class="num">827</td></tr>
	<tr><td>tokenbf_v1</td><td class="num">1334</td><td class="num">1062</td><td class="num">438</td><td class="num">1114</td><td class="num">640</td><td class="num">918</td></tr>
	<tr><td>ngrambf_4</td><td class="num">1386</td><td class="num">1064</td><td class="num">567</td><td class="num">1193</td><td class="num">810</td><td class="num">1004</td></tr>
	<tr class="loser"><td>text_ngrams3</td><td class="num">1569</td><td class="num">999</td><td class="num">574</td><td class="num">1424</td><td class="num">850</td><td class="num">1083</td></tr>
	</table>

	<p><strong>No-index is the fastest across every pattern.</strong> Every skip-index variant pays a non-zero index-consultation cost on every query without buying back any pruning savings.</p>

	<h2>Why? <code>EXPLAIN indexes=1</code></h2>

	<p>For <code>WHERE Message LIKE '%attestation%'</code> on each variant, how many granules survived the skip-index check:</p>

	<table>
	<tr><th>Variant</th><th class="num">Granules kept</th><th class="num">Pruned</th></tr>
	<tr><td>tokenbf_v1</td><td class="num">53,397 / 53,397</td><td class="num">0</td></tr>
	<tr><td>tokenbf_8k</td><td class="num">53,382 / 53,382</td><td class="num">0</td></tr>
	<tr><td>text_ngrams3</td><td class="num">53,379 / 53,383</td><td class="num"><strong>4</strong> (0.007%)</td></tr>
	</table>

	<p>Even single-word common terms get zero useful pruning. <code>attestation</code> appears in nearly every 8,192-row granule, so the bloom filter says "maybe present" for ~all granules.</p>

	<p>The reduced <code>read_rows</code> reported by <code>system.query_log</code> (e.g. 104M of 437M for <code>%failed to%</code>) is <strong>not</strong> coming from skip-index pruning — it comes from ClickHouse's intra-granule <code>LIKE</code> evaluator short-circuiting per row. Compare:</p>

	<pre><code>-- bench_no_index: 53,397 / 53,397 granules selected, no skip index
	-- bench_tokenbf_v1: same granules selected; index consulted but didn't prune
	-- bench_text_ngrams3: skip index tokenizes "failed to" into 7 trigrams,
	-- prunes 12 of 53,383 granules (0.02%)</code></pre>

	<p>The "useless granule" problem isn't a tokenizer issue — it's a property of the data. With ~8,000 rows per granule and a token appearing in >0.5% of rows, every granule contains it. The bloom filter cannot help.</p>

	<h2>What about <code>hasToken()</code> instead of <code>LIKE</code>?</h2>

	<p>The <code>text()</code> index is genuinely <em>designed</em> for the <code>hasToken</code>, <code>hasAnyToken</code>, <code>searchAny</code>, etc. operators — not <code>LIKE</code>. Those operators speak directly to the inverted index without round-tripping through pattern matching. This benchmark only covers <code>LIKE</code> because that's what real-world Grafana / dashboard queries against log data generate. If your workload is <code>hasToken()</code>-shaped, re-run with that operator.</p>

	<h2>Recommendations</h2>

	<ul>
	<li><strong>For new schemas:</strong> use <code>tokenbf_v1(8192, 3, 0) GRANULARITY 4</code>, or omit the index entirely.</li>
	<li><strong>For existing schemas with <code>tokenbf_v1(32768, …)</code>:</strong> consider downsizing — saves ~70% of the index disk for free.</li>
	<li><strong>For existing schemas with <code>text()</code> on Message/Body for LIKE-shaped queries:</strong> rip it out. It's costing 50–500× the disk of an equivalent <code>tokenbf_v1</code> and offering no query benefit.</li>
	<li><strong>Avoid <code>ngrambf_v1</code> for this workload:</strong> bigger index, slower queries.</li>
	<li><strong>If you genuinely need pruning for substring search:</strong> the lever isn't the index — it's the <code>ORDER BY</code>. Move the column you actually filter on to the prefix of <code>ORDER BY</code>, and pre-tokenize known patterns into materialized columns.</li>
	</ul>

	<h2>Variants tested — full list (16)</h2>

	<details>
	<summary>Click to expand DDL snippets for each variant</summary>
	<pre><code>-- baseline
	CREATE TABLE bench_no_index (... Message String CODEC(ZSTD(1)))
	ENGINE = MergeTree() ORDER BY (tenant, node, timestamp);

	-- tokenbf_v1 family
	INDEX idx_msg Message TYPE tokenbf_v1(8192, 3, 0) GRANULARITY 4 -- winner
	INDEX idx_msg Message TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 4 -- control

	-- ngrambf_v1 family
	INDEX idx_msg Message TYPE ngrambf_v1(4, 32768, 3, 0) GRANULARITY 4

	-- text() family
	INDEX idx_msg Message TYPE text(tokenizer = 'splitByNonAlpha') GRANULARITY 1
	INDEX idx_msg Message TYPE text(tokenizer = splitByString([' ','=',',',':','"','\''])) GRANULARITY 1
	INDEX idx_msg Message TYPE text(tokenizer = splitByNonAlpha, preprocessor = lower(Message)) GRANULARITY 1
	INDEX idx_msg Message TYPE text(tokenizer = splitByString(...), preprocessor = lower(Message)) GRANULARITY 1
	INDEX idx_msg Message TYPE text(tokenizer = splitByNonAlpha, dictionary_block_size = 1024) GRANULARITY 1
	INDEX idx_msg Message TYPE text(tokenizer = splitByNonAlpha,
	preprocessor = lower(Message),
	dictionary_block_size = 1024,
	dictionary_block_frontcoding_compression = 1) GRANULARITY 1
	INDEX idx_msg Message TYPE text(tokenizer = ngrams(3)) GRANULARITY 1
	INDEX idx_msg Message TYPE text(tokenizer = ngrams(4)) GRANULARITY 1
	INDEX idx_msg Message TYPE text(tokenizer = sparseGrams) GRANULARITY 1
	INDEX idx_msg Message TYPE text(tokenizer = sparseGrams(3, 5, 4)) GRANULARITY 1
	</code></pre>

	<p class="small">Note: ClickHouse silently rewrites <code>GRANULARITY 1</code> on <code>text()</code> indexes to <code>GRANULARITY 100000000</code> (100M-row blocks) regardless of what you specify. The pruning unit for <code>text()</code> is much coarser than for <code>tokenbf_v1</code>.</p>
	</details>

	<p class="footnote">Bench methodology: one shard, single MergeTree (not Replicated) for clean per-shard size measurements. Each variant inserted from a local 1-part source table to avoid re-reading the production replicated source 14 times. Inserts run in waves of 4 concurrent. <code>OPTIMIZE TABLE … FINAL</code> on every variant before measuring. Sizes from <code>system.parts</code>; query latency from <code>system.query_log</code> with <code>SYSTEM FLUSH LOGS</code> for sync.</p>

	</body>
	</html>
No results found