This is a very good write-up, and your intuition ("doing it correctly should fix perf") is understandable, but the disappointing result actually makes sense once you look closely at what PyArrow's CDC is optimizing for versus what Polars' writer already does well.
I’ll be blunt up front, then unpack it carefully:
You implemented CDC correctly, but you applied it to a system that was already more dedup-friendly than PyArrow, and your CDC implementation is actively destroying the accidental alignment properties that Xet’s deduper was exploiting.
So this isn’t “CDC is broken”, and it isn’t “your implementation is wrong”. It’s that CDC is not universally beneficial, and especially not on top of Polars’ existing Parquet layout.
You assumed this implication:
“PyArrow CDC helps → correct CDC in Polars should also help (or at least not hurt)”
That implication is false, because:
- PyArrow’s baseline Parquet layout is worse for dedup
- Polars’ baseline layout is already highly chunk-stable
- CDC adds entropy when the baseline layout is already aligned
Your numbers prove this conclusively.
- Polars NO CDC → 50.1%; Polars CDC → 50.6% (+0.5%). CDC barely helps, which means page alignment was already good.
- Polars NO CDC → 82.6%; Polars CDC → 78.7% (−3.9%). CDC actively makes dedup worse.
- Polars NO CDC → 25.3%; Polars CDC → 10.3% (−15%). That is a catastrophic relative regression: CDC is destroying something Xet relied on.
Polars’ Parquet writer has three properties that PyArrow’s lacks; it tends to emit:
- fewer row groups
- more regular page boundaries
- consistent page sizes across runs
This means that simple insertion often shifts only one page, not the entire tail.
Xet’s deduper works on compressed chunk content, not Parquet semantics.
If page boundaries stay aligned → dedup works even without CDC.
This is critical and non-obvious.
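To make that concrete, here is a deliberately simplified model of content-addressed dedup (plain chunk hashing in Python, nothing Xet-specific); the page sizes and contents below are made up for illustration:

```python
# A chunk only costs storage the first time its hash is seen. If two Parquet
# files contain the same page bytes at the same boundaries, almost everything
# dedupes, with no CDC involved.
import hashlib

def dedup_ratio(chunks_a: list[bytes], chunks_b: list[bytes]) -> float:
    """Fraction of file B's bytes already covered by chunks of file A."""
    seen = {hashlib.sha256(c).digest() for c in chunks_a}
    total = sum(len(c) for c in chunks_b)
    reused = sum(len(c) for c in chunks_b if hashlib.sha256(c).digest() in seen)
    return reused / total if total else 0.0

# Toy "pages": inserting one new page near the front leaves every other page
# byte-identical, so almost everything in the second file dedupes.
pages_v1 = [bytes([i]) * 4096 for i in range(16)]
pages_v2 = pages_v1[:2] + [b"\xff" * 4096] + pages_v1[2:]
print(f"{dedup_ratio(pages_v1, pages_v2):.1%}")   # ~94.1%
```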
For short strings (≤12 bytes):
- Polars stores the values inline
- the encoded Parquet PLAIN layout becomes:
  [len][inline_bytes][len][inline_bytes]...
- when you insert rows:
  - the inline payloads stay byte-identical
  - large contiguous regions remain unchanged
This gives you massive dedup ratios (82%), even with naive chunking.
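A rough sketch of that byte-level stability, assuming the standard PLAIN BYTE_ARRAY layout (a 4-byte little-endian length prefix followed by the raw bytes); the row values are invented and this is not Polars’ actual encoder:

```python
# PLAIN BYTE_ARRAY: each value is written as [4-byte LE length][raw bytes].
# Inserting rows adds new records but leaves every existing record
# byte-identical, so the tail of the old encoding reappears verbatim.
import struct

def plain_encode(values: list[bytes]) -> bytes:
    return b"".join(struct.pack("<I", len(v)) + v for v in values)

rows_v1 = [f"user_{i}".encode() for i in range(1000)]
rows_v2 = rows_v1[:10] + [b"inserted"] + rows_v1[10:]

enc_v1, enc_v2 = plain_encode(rows_v1), plain_encode(rows_v2)
prefix_len = len(plain_encode(rows_v1[:10]))
assert enc_v2.endswith(enc_v1[prefix_len:])   # the entire old tail is reused as-is
```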
Your CDC implementation breaks this accidental stability by:
- recombining buffers
- re-splitting pages at content-defined offsets
- emitting different page boundaries across runs
So even though the content is identical, the page slicing is not, which kills chunk reuse.
Xet dedup happens after compression.
Polars’ default layout often yields:
- identical compressed pages across files
- same compressor window alignment
- same entropy blocks
CDC breaks this because:
- each CDC page is slightly different sized
- compression blocks no longer line up
- compressed output diverges even when raw bytes are similar
This is why your “Total” size barely changes, but “After Dedupe” explodes.
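A small illustration of that last point, using zlib from the standard library as a stand-in for whatever codec the writer is configured with; the data and cut offsets are arbitrary:

```python
# The same raw bytes, cut into pages at different offsets, compress to
# completely different byte strings, so dedup over compressed chunks finds
# nothing to reuse.
import hashlib
import random
import zlib

random.seed(0)
raw = random.randbytes(1 << 16)   # 64 KiB of stand-in column data

def compressed_page_hashes(data: bytes, cuts: list[int]) -> set[bytes]:
    bounds = [0, *cuts, len(data)]
    return {
        hashlib.sha256(zlib.compress(data[a:b])).digest()
        for a, b in zip(bounds, bounds[1:])
    }

stable  = compressed_page_hashes(raw, [16384, 32768, 49152])
same    = compressed_page_hashes(raw, [16384, 32768, 49152])
shifted = compressed_page_hashes(raw, [16000, 33000, 50000])

print(len(stable & same))     # 4: identical slicing -> every compressed page dedupes
print(len(stable & shifted))  # 0: shifted slicing -> nothing dedupes
```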
PyArrow:
- eagerly flushes pages
- has unstable page boundaries
- small changes cause global misalignment
So CDC is a net win — it replaces chaos with content-defined stability.
Polars:
- already emits chunk-stable layouts
- already resyncs after insertions
- already aligns compression blocks well
So CDC becomes:
a second chunking layer fighting an already-good layout
This is exactly what your numbers show.
Two more subtle but decisive reasons:
First: PyArrow’s CDC does not chunk the page payload naïvely. It effectively chunks on the leaf column data streams, not on the entire page payload blob.
By hashing rep_levels | def_levels | values together, you let tiny changes in null distribution, nesting structure, or encoding detail invalidate otherwise identical value regions.
That makes CDC too sensitive.
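A toy demonstration of that sensitivity (hypothetical byte layouts, not real Parquet level encodings): the values are identical across two writes, but one inserted null changes the definition levels, so a hash over the combined stream diverges while a hash over the values alone still matches.

```python
import hashlib
import struct

# Same 1000 non-null values in both writes; the second write has one extra null row,
# which changes the def levels but adds nothing to the values stream.
values_a = struct.pack("<1000q", *range(1000))
values_b = struct.pack("<1000q", *range(1000))
def_levels_a = bytes([1] * 1000)
def_levels_b = bytes([1] * 500 + [0] + [1] * 500)

digest = lambda b: hashlib.sha256(b).hexdigest()[:12]
print(digest(def_levels_a + values_a) == digest(def_levels_b + values_b))  # False
print(digest(values_a) == digest(values_b))                                # True: values-only identity survives
```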
Second: PyArrow’s CDC works inside the column writer, before page materialization.
You’re doing:
encode → CDC → page slicing
PyArrow does:
stream values → CDC decides when to flush
That difference matters because:
- PyArrow never has to re-slice encoded data
- it only decides flush timing
- compression context remains stable
Your approach forces reassembly, which kills downstream stability.
CDC is not a universal improvement knob.
For Polars specifically:
- Baseline Parquet layout is already highly dedup-friendly
- CDC only helps in narrow cases (large numeric columns with early insertions)
- In many realistic datasets, CDC reduces dedup efficiency
This makes a CDC-by-default PR very hard to justify.
Your instinct here is correct.
If this is to be salvaged at all, it needs to be much narrower in scope:
Only enable CDC for:
- large primitive columns
- high-entropy numeric data
- append-heavy workloads
Never enable for:
- Utf8View / BinaryView
- short strings
- dictionary-encoded columns
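As a sketch of what that per-column gating could look like (the profile fields, thresholds, and function name below are all hypothetical, not Polars or Arrow API):

```python
# Sketch of per-column gating; ColumnProfile, the thresholds, and
# cdc_worthwhile() are invented for illustration.
from dataclasses import dataclass

@dataclass
class ColumnProfile:
    physical_type: str        # e.g. "INT64", "DOUBLE", "BYTE_ARRAY"
    uses_dictionary: bool
    avg_value_len: float      # average encoded bytes per value
    total_encoded_bytes: int

def cdc_worthwhile(col: ColumnProfile) -> bool:
    """Enable CDC only for large, non-dictionary primitive columns."""
    if col.uses_dictionary:
        return False                      # mirrors the "never enable" list above
    if col.physical_type == "BYTE_ARRAY" and col.avg_value_len <= 12:
        return False                      # short strings: layout already stable
    if col.total_encoded_bytes < (8 << 20):
        return False                      # tiny columns aren't worth the added entropy
    return col.physical_type in {"INT32", "INT64", "FLOAT", "DOUBLE"}
```

Everything that is not a plain numeric physical type falls through to False, which keeps Utf8View/BinaryView-backed string columns and dictionary-encoded columns on the existing write path.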
Chunk only:
encoded_values_stream
Let def/rep levels follow page boundaries naturally.
This alone would improve overlap ratios.
Instead of:
encode everything → CDC → slice
Use CDC only to answer:
“Should I flush the current page now?”
This preserves compression context and page stability.
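A minimal sketch of that flush-decision style, assuming a gear-style rolling hash over encoded value bytes; the constants and the CdcFlushPolicy class are invented for illustration and are not the PyArrow or Polars implementation:

```python
# Gear-style rolling hash that only answers "should the current page flush now?".
# Nothing already encoded is ever recombined or re-sliced.
import random

random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte gear table
MASK = (1 << 16) - 1                                 # a cut point fires roughly every 64 KiB of input
MIN_PAGE, MAX_PAGE = 16 << 10, 256 << 10             # clamp page sizes

class CdcFlushPolicy:
    def __init__(self) -> None:
        self.hash = 0
        self.page_bytes = 0

    def observe(self, encoded_value: bytes) -> bool:
        """Feed one encoded value; return True if the page should flush after it."""
        cut = False
        for b in encoded_value:
            self.hash = ((self.hash << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
            if (self.hash & MASK) == 0:
                cut = True                # content-defined cut point seen
        self.page_bytes += len(encoded_value)
        if self.page_bytes < MIN_PAGE:
            return False                  # never emit tiny pages
        if cut or self.page_bytes >= MAX_PAGE:
            self.hash = 0
            self.page_bytes = 0
            return True
        return False
```

The column writer would call observe() once per encoded value and materialize the accumulated page when it returns True, so page boundaries are content-defined but always land on value boundaries, and the encoded bytes are flushed exactly as produced.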
Your own numbers show:
- Polars NO CDC beats PyArrow CDC in many cases
- CDC mostly closes a gap that Polars doesn’t have
That’s actually a win for Polars, not a failure.
You didn’t misunderstand CDC. You didn’t implement it wrong. You didn’t benchmark incorrectly.
You discovered something more interesting:
Polars’ Parquet writer is already so stable that CDC often makes things worse.
That’s a completely valid outcome — and honestly, a strong argument against landing CDC broadly.
If you want, next we can:
- design a “CDC only when it helps” heuristic
- isolate a primitive-only CDC path
- or write a Polars issue comment explaining why CDC should be scoped or dropped
But you are not crazy — your results are internally consistent and technically sound.