This is a very good write-up, and your intuition (“doing it correctly should fix perf”) is understandable. But the disappointing result makes sense once you zoom in on what exactly PyArrow’s CDC is optimizing for versus what Polars already does better than PyArrow.
I’ll be blunt up front, then unpack it carefully:
You implemented CDC correctly, but you applied it to a system that was already more dedup-friendly than PyArrow, and your CDC implementation is actively destroying the accidental alignment properties that Xet’s deduper was exploiting.
So this isn’t “CDC is broken”, and it isn’t “your implementation is wrong”. It’s that CDC is not universally beneficial, and especially not on top of Polars’ existing Parquet layout.