TL;DR - I think the paper is a good contribution and basically holds up, but Figure 2 seems suspicious and the released repo doesn't include the pieces (AE training code and pretrained 4096-element AEs) that would be needed to make DC-AE practically competitive with SD/SDXL VAEs.
DC-AE is an MIT / Tsinghua / NVIDIA paper about improving generative autoencoders (like the SD VAE) in the high-spatial-compression-ratio regime.
I am interested in improved autoencoders, so this gist/thread is my attempt to analyze and review some key claims from the DC-AE paper.
(Disclaimer: I work at NVIDIA in an unrelated org :) - this review is written in my personal capacity as an autoencoder buff).
DC-AE Paper Claim 2 - DC-AE maintains their reconstruction quality across spatial compression ratios - AGREED
The DC-AE authors adjust the network architecture so the VAE holds up at higher spatial compression ratios, by adding shortcuts that make it easy to fold spatial information into channels on the encoder side and unfold those channels back into spatial detail on the decoder side.
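To make the "fold spatial information into channels" idea concrete, here's a minimal numpy sketch of a space-to-channel rearrangement and its inverse (my paraphrase of the general technique, not DC-AE's actual code; function names are mine):

```python
import numpy as np

def space_to_channel(x, r):
    """Rearrange an (N, C, H, W) array so each r x r spatial block
    becomes r*r extra channels: output shape (N, C*r*r, H//r, W//r).
    A lossless rearrangement like this is the kind of shortcut that
    spares the network from *learning* to pack pixels into channels."""
    n, c, h, w = x.shape
    x = x.reshape(n, c, h // r, r, w // r, r)
    x = x.transpose(0, 1, 3, 5, 2, 4)          # (n, c, r, r, h', w')
    return x.reshape(n, c * r * r, h // r, w // r)

def channel_to_space(x, r):
    """Inverse rearrangement, as a decoder-side shortcut would use."""
    n, cr2, h, w = x.shape
    c = cr2 // (r * r)
    x = x.reshape(n, c, r, r, h, w)
    x = x.transpose(0, 1, 4, 2, 5, 3)          # (n, c, h, r, w, r)
    return x.reshape(n, c, h * r, w * r)

x = np.random.rand(1, 3, 8, 8)
y = space_to_channel(x, 2)                     # shape (1, 12, 4, 4)
assert np.allclose(channel_to_space(y, 2), x)  # round trip is exact
```

Because the rearrangement is exact and invertible, a residual branch built on it gives the encoder/decoder a "free" way to trade spatial resolution for channels, which is the intuition behind why reconstruction quality stays flat as the compression ratio grows.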
I think their claim (that this modification leads to similar reconstruction quality across different `f` values) is well-substantiated in the paper and by their released autoencoder zoo, and it's not a particularly surprising claim, so I'm inclined to believe it.