TL;DR - I think the paper is a good contribution and basically holds up, but Figure 2 seems suspicious and the released repo doesn't include the pieces (AE training code and pretrained 4096-element AEs) that would be needed to make DC-AE practically competitive with SD/SDXL VAEs.
DC-AE is an MIT / Tsinghua / NVIDIA paper about improving generative autoencoders (like the SD VAE) in the high-spatial-compression-ratio regime.
I am interested in improved autoencoders, so this gist/thread is my attempt to analyze and review some key claims from the DC-AE paper.
(Disclaimer: I work at NVIDIA in an unrelated org :) - this review is written in my personal capacity as an autoencoder buff).
DC-AE Paper Claim 2 - DC-AE maintains their reconstruction quality across spatial compression ratios - AGREED
The DC-AE authors adjust the network architecture so the VAE holds up at higher spatial compression ratios, by adding shortcuts that make it easy to fold spatial information into channels on the encoder side and unfold those channels back into spatial detail on the decoder side.
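To make the "fold spatial information into channels" idea concrete, here's a minimal numpy sketch of a space-to-channel rearrangement and its inverse (my paraphrase of the general technique, not DC-AE's actual code; function names are mine):

```python
import numpy as np

def space_to_channel(x, r):
    """Rearrange an (N, C, H, W) array so each r x r spatial block
    becomes r*r extra channels: output shape (N, C*r*r, H//r, W//r).
    A lossless rearrangement like this is the kind of shortcut that
    spares the network from *learning* to pack pixels into channels."""
    n, c, h, w = x.shape
    x = x.reshape(n, c, h // r, r, w // r, r)
    x = x.transpose(0, 1, 3, 5, 2, 4)          # (n, c, r, r, h', w')
    return x.reshape(n, c * r * r, h // r, w // r)

def channel_to_space(x, r):
    """Inverse rearrangement, as a decoder-side shortcut would use."""
    n, cr2, h, w = x.shape
    c = cr2 // (r * r)
    x = x.reshape(n, c, r, r, h, w)
    x = x.transpose(0, 1, 4, 2, 5, 3)          # (n, c, h, r, w, r)
    return x.reshape(n, c, h * r, w * r)

x = np.random.rand(1, 3, 8, 8)
y = space_to_channel(x, 2)                     # shape (1, 12, 4, 4)
assert np.allclose(channel_to_space(y, 2), x)  # round trip is exact
```

Because the rearrangement is exact and invertible, a residual branch built on it gives the encoder/decoder a "free" way to trade spatial resolution for channels, which is the intuition behind why reconstruction quality stays flat as the compression ratio grows.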
I think their claim (that this modification leads to similar reconstruction quality across different `f` values) is well-substantiated in the paper and by their released autoencoder zoo, and it's not a particularly surprising claim, so I'm inclined to believe it.