TL;DR - I think the paper is a good contribution and basically holds up, but Figure 2 seems suspicious and the released repo doesn't include the pieces (AE training code and pretrained 4096-element AEs) that would be needed to make DC-AE practically competitive with SD/SDXL VAEs.
DC-AE is an MIT / Tsinghua / NVIDIA paper about improving generative autoencoders (like the SD VAE) in the high-spatial-compression-ratio regime.
I am interested in improved autoencoders, so this gist/thread is my attempt to analyze and review some key claims from the DC-AE paper.
(Disclaimer: I work at NVIDIA in an unrelated org :) - this review is written in my personal capacity as an autoencoder buff).
DC-AE Paper Claim 1 - Existing autoencoder recipes work poorly at high spatial compression ratios - AGREED
Although in principle a bigger spatial receptive field should allow better reconstructions, in practice existing autoencoder training recipes haven't worked that well once the latents are spatially compressed beyond 16x. I've seen this in my own tests, and it also shows up in the original SD-VAE paper, Table 8: the f=32 4096-element VAE is surprisingly worse than the f=16 and f=8 4096-element VAEs, despite all three having the same latent volume!
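To make the "same latent volume" point concrete, here's a quick back-of-the-envelope sketch (mine, not from either paper). It assumes 256x256 inputs, and the channel counts are simply whatever makes each spatial factor land on 4096 latent elements; I believe they match the Table 8 configs, but treat them as illustrative.

```python
# Latent volume = (H/f) * (W/f) * c. For a 256x256 image, scaling channels c up
# with the spatial compression factor f keeps the total element count fixed.
# Channel counts are assumptions chosen to hit 4096 elements at each f.
image_hw = 256

configs = {  # spatial compression factor f -> latent channels c
    8: 4,
    16: 16,
    32: 64,
}

for f, c in configs.items():
    latent_hw = image_hw // f
    volume = latent_hw * latent_hw * c
    print(f"f={f:2d}: latent {latent_hw}x{latent_hw}x{c} = {volume} elements")

# f= 8: latent 32x32x4  = 4096 elements
# f=16: latent 16x16x16 = 4096 elements
# f=32: latent 8x8x64   = 4096 elements
```

So the f=32 model isn't being asked to store less information overall, it's only being asked to trade spatial extent for channels, which is why its worse reconstructions are surprising.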
The DC-AE paper has a dedicated chart showing this, Figure 2a (the SD-VAE line), and I believe the general claim, although I don't know how much I trust Figure 2a specifically (see the discussion of Claim 3 below).