Reviewing the Claims of DC-AE

TL;DR - I think the paper is a good contribution and basically holds up, but Figure 2 seems suspicious and the released repo doesn't include the pieces (AE training code and pretrained 4096-element AEs) that would be needed to make DC-AE practically competitive with SD/SDXL VAEs.


DC-AE is an MIT / Tsinghua / NVIDIA paper about improving generative autoencoders (like the SD VAE) under the high-spatial-compression ratio regime.

I am interested in improved autoencoders, so this gist/thread is my attempt to analyze and review some key claims from the DC-AE paper.

(Disclaimer: I work at NVIDIA in an unrelated org :) - this review is written in my personal capacity as an autoencoder buff).

@madebyollin
Copy link
Author

Priors

A generative autoencoder is a machine that takes in images, lossily compresses them into a compact latent representation, and then decodes a new image from the latent (by making up new details).

The "reconstruction quality" of an autoencoder is some score for how closely the decoded images match the original images.

All else being equal, the following things will generally lead to better reconstruction quality:

  1. Bigger latents (storing more information leads to better reconstructions; see the latent-volume arithmetic sketched after this list)
  2. Bigger spatial/temporal receptive field (seeing more of the image lets the autoencoder deduplicate redundant information and store non-redundant information instead)
  3. Bigger encoder / decoder (having more compute/parameters lets the autoencoder use more sophisticated encodings)
  4. Closer train / test datasets (autoencoders can only guarantee good reconstructions for images that look like what they trained on)
  5. A less diverse dataset (autoencoders can make more assumptions about what valid images will look like)
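
To make point 1 concrete: for spatial compression factor f and c latent channels, the latent volume is just (resolution / f)² × c, so very different f / c combinations can store the same number of elements. A tiny hypothetical `latent_volume` helper, plugging in the f8c4 / f32c64 / f64c128 configurations that come up later in this review:

```python
def latent_volume(f: int, c: int, resolution: int = 256) -> int:
    """Number of latent elements for a resolution x resolution image encoded
    with spatial compression factor f and c latent channels."""
    hw = resolution // f
    return c * hw * hw

print(latent_volume(f=8,  c=4))    # SD-VAE f8c4    -> 4   x 32 x 32 = 4096
print(latent_volume(f=32, c=64))   # SD-VAE f32c64  -> 64  x 8  x 8  = 4096
print(latent_volume(f=64, c=128))  # DC-AE f64c128  -> 128 x 4  x 4  = 2048
```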

DC-AE Paper Claim 1 - Existing autoencoder recipes work poorly at high spatial compression ratios - AGREED

Although in principle a bigger spatial receptive field should allow better reconstructions, in practice existing autoencoder training recipes haven't worked that well when the latents are spatially compressed beyond 16x. I've seen this in my own tests, and it also shows up in Table 8 of the original SD-VAE paper (the f=32 4096-element VAE is surprisingly worse than the f=16 and f=8 4096-element VAEs, despite having the same latent volume!).

[screenshots: SD-VAE paper Table 8]

The DC-AE paper has a dedicated chart showing this in Figure 2a (the SD-VAE line), and I believe the general claim, although I don't know how much I trust Figure 2a specifically (see the discussion of Claim 3 below).

[screenshot: DC-AE paper Figure 2a]

DC-AE Paper Claim 2 - DC-AE maintains its reconstruction quality across spatial compression ratios - AGREED

The DC-AE authors adjust the network architecture to let the VAE achieve better results at higher spatial compression ratios (by adding some shortcuts that let the VAE easily encode spatial information into channels and decode those channels back into spatial information).

[screenshot: DC-AE paper figure showing the architecture modification]
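
My loose reading of these shortcuts (not the authors' exact code) is a non-learned space-to-channel residual around the usual strided convolution on the way down, with the mirror-image channel-to-space residual on the way up. A rough PyTorch sketch of the downsampling side, with a hypothetical `SpaceToChannelDownsample` module and assuming the unshuffled channel count divides evenly by the output channel count:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceToChannelDownsample(nn.Module):
    """Downsampling block with a non-learned space-to-channel shortcut
    (a sketch of the idea, not the DC-AE authors' implementation)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        assert (in_ch * 4) % out_ch == 0, "sketch assumes channels divide evenly"
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.out_ch = out_ch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shortcut path: losslessly pack each 2x2 spatial patch into channels,
        # then average channel groups so the shape matches the learned path.
        s = F.pixel_unshuffle(x, 2)  # (B, in_ch * 4, H / 2, W / 2)
        b, c, h, w = s.shape
        s = s.view(b, self.out_ch, c // self.out_ch, h, w).mean(dim=2)
        # Learned path: an ordinary strided convolution.
        return self.conv(x) + s
```

The idea, as I read it, is that the network no longer has to learn from scratch how to move information between the spatial and channel dimensions; the shortcut already does it.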

I think their claim (that this modification leads to similar reconstruction quality across different f values) is well-substantiated in the paper and by their released autoencoder zoo, and it's not a particularly surprising claim, so I'm inclined to believe it.

[screenshot: DC-AE reconstruction quality across compression ratios]

DC-AE Paper Claim 3 - DC-AE achieves uniformly faster + better results vs. the SD-VAE - MIXED

The paper shows lots of specific tests where the DC-AE recipe does better than the SD-VAE recipe (authors' reproduction?) at various tasks, and I'm okay with these claims.

However, the DC-AE paper implies it's better than SD-VAE in general, for all purposes, and I think that claim needs some qualification.

  • The 0.90 and 2.04 ImageNet-Val numbers for SD-VAE f8c4 and f32c64 appear to be taken directly from the original SD paper (or else reproduced with incredible accuracy?), but those autoencoders were trained on OpenImages, not ImageNet (per the SD paper's Table 8 caption), and those numbers should be better if SD-VAE were finetuned on ImageNet-Train.
    [screenshots: the 0.90 / 2.04 numbers as they appear in the SD-VAE and DC-AE papers]
    The revised paper includes additional results in Table 8, but doesn't say what those results represent (pretrained SD-VAE or a reproduction? trained on which dataset / evaluated on which dataset?), so I don't know how to interpret them.
    [screenshot: the revised DC-AE paper's Table 8]

  • The released models / code are definitely not yet better than the SD-VAE release, for two reasons:

    • Unlike SD-VAE, the DC-AE training code is not available yet. This means users are currently restricted to the set of pretrained DC-AEs provided by the authors.
    • Unlike SD-VAE, the DC-AE pretrained models that were released all have latent-volume = 2048 for a 256x256 source image, which means they will necessarily have higher reconstruction error than SD-VAE and aren't yet a practical replacement for the SD/SDXL VAEs (which have latent-volume = 4096).
      [screenshot: the released DC-AE checkpoints annotated with their latent volumes]
Snippet for adding latent sizes to the HF page:

```js
// For each model-card header on the Hugging Face page, parse the f/c values
// out of the "dc-ae-f{f}c{c}" name and append the latent shape / volume for
// a 256x256 input image.
Array.from(document.querySelectorAll("header")).forEach(el => {
    if (!el.title.includes("dc-ae")) return;
    // remove any labels added by a previous run
    Array.from(el.querySelectorAll(".volume-label")).forEach(child => el.removeChild(child));
    let [f, c] = el.title.match(/f(\d+)c(\d+)/).slice(1).map(x => parseInt(x));
    let hw = Math.floor(256 / f);
    let numelFor256 = Math.pow(hw, 2) * c;
    console.log(el.title, f, c, numelFor256);
    // build the "c x hw x hw = numel" label and append it to the header
    let label = document.createElement("span");
    label.classList.add("volume-label");
    label.style.color = "orangered";
    label.style.display = "inline-block";
    label.style.paddingLeft = "1em";
    label.style.fontSize = "0.9em";
    label.style.fontFamily = "monospace";
    label.textContent = `${c}x${hw}x${hw} = ${numelFor256}`;
    el.appendChild(label);
});
```

To substantiate the claim of "the released DC-AE checkpoints are not yet a practical replacement for the SDXL VAE", I checked two of the pretrained DC-AE models on my "challenge set" of 5 difficult images, and verified that they're worse than the SDXL VAE (as expected due to the 2x smaller latent size).
[images: challenge-set reconstructions from the DC-AE checkpoints vs. the SDXL VAE]
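
For reference, the SDXL-VAE side of this kind of comparison is just an encode / decode round trip; a minimal diffusers sketch is below (a hypothetical `round_trip` helper using the sdxl-vae-fp16-fix weights, not my exact challenge-set script, and the challenge images themselves aren't included here).

```python
import torch
from diffusers import AutoencoderKL

# Load the (fp16-safe) SDXL VAE from the Hugging Face hub.
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix",
                                    torch_dtype=torch.float16).to("cuda")

@torch.no_grad()
def round_trip(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, H, W) float tensor scaled to [-1, 1]."""
    latents = vae.encode(images.half().to("cuda")).latent_dist.mode()
    return vae.decode(latents).sample.float().cpu()
```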

Additionally, I evaluated the mit-han-lab_dc-ae-f64c128-in-1.0 VAE on COCO 2017 Val 256 and verified that its rFID is higher than that of the SD or SDXL VAEs on this dataset (the first screenshot is from the SDXL paper, but I've previously verified that those numbers match the results of my eval script).
[screenshots: the rFID table from the SDXL paper, and my eval results on COCO 2017 Val 256]
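
For anyone wanting to reproduce this sort of number: rFID is just FID computed between the original images and their autoencoder reconstructions. A minimal sketch using torchmetrics (a hypothetical `rfid` helper; my actual eval script may differ in resizing / preprocessing details):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def rfid(originals: torch.Tensor, reconstructions: torch.Tensor) -> float:
    """originals / reconstructions: uint8 image tensors of shape (N, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(originals, real=True)          # ground-truth images
    fid.update(reconstructions, real=False)   # autoencoder round trips
    return fid.compute().item()
```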
