TL;DR - I think the paper is a good contribution and basically holds up, but Figure 2 seems suspicious and the released repo doesn't include the pieces (AE training code and pretrained 4096-element AEs) that would be needed to make DC-AE practically competitive with SD/SDXL VAEs.
DC-AE is an MIT / Tsinghua / NVIDIA paper about improving generative autoencoders (like the SD VAE) in the high-spatial-compression-ratio regime.
I am interested in improved autoencoders, so this gist/thread is my attempt to analyze and review some key claims from the DC-AE paper.
(Disclaimer: I work at NVIDIA in an unrelated org :) - this review is written in my personal capacity as an autoencoder buff).
Priors
A generative autoencoder is a machine that takes in images, lossily compresses them into a compact latent representation, and then decodes a new image from the latent (by making up new details).
The "reconstruction quality" of an autoencoder is some score for how closely the decoded images match the original images.
All else being equal, the following things will generally lead to better reconstruction quality: