Ollin Boer Bohan madebyollin

Notes / Links about Stable Diffusion VAE

Stable Diffusion's VAE is a neural network that encodes images into a compressed "latent" format and decodes them back. The encoder performs 48x lossy compression, and the decoder generates new detail to fill in the gaps.
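
For reference, here's a quick sketch of where the 48x figure comes from. It assumes the usual Stable Diffusion VAE layout (8x spatial downsampling and 4 latent channels, neither of which is stated above), and it counts values rather than bytes. The compression_ratio helper is a made-up name for illustration only.

```python
def compression_ratio(height, width, channels=3, downsample=8, latent_channels=4):
    """Ratio of image values to latent values (assumed SD VAE layout)."""
    image_values = height * width * channels
    # Assumption: the encoder shrinks each spatial dimension by `downsample`
    # and emits `latent_channels` channels per latent pixel.
    latent_values = (height // downsample) * (width // downsample) * latent_channels
    return image_values / latent_values


# A 512x512 RGB image becomes a 64x64x4 latent.
ratio = compression_ratio(512, 512)
print(ratio)  # 48.0
```

The ratio is the same for any resolution that divides evenly by 8, since it reduces to (8 * 8 * 3) / 4.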

(Calling this model a "VAE" is sort of a misnomer - it's an encoder with some very slight KL regularization, and a conditional GAN decoder)

This document is a big pile of various links with more info.

Consistency Decoder PyTorch Model Code

Cleaned up version of https://gist.github.com/mrsteyk/74ad3ec2f6f823111ae4c90e168505ac, which is in turn based on the public_diff_vae.ConvUNetVAE from https://github.com/openai/consistencydecoder.

Example Usage

Install the consistency decoder code (for the inference logic) and download the extracted weights:

Variational Autoencoders Will Never Work

So you want to generate images with neural networks. You're in luck! VAEs are here to save the day. They're simple to implement, they generate images in one inference step (unlike those awful slow autoregressive models) and (most importantly) VAEs are 🚀🎉🎂🥳 theoretically grounded 🚀🎉🎂🥳 (unlike those scary GANs - don't look at the GANs)!

The idea

The idea of a VAE is so simple, even an AI chatbot could explain it:

Your goal is to train a "decoder" neural network that consumes blobs of random noise from a fixed distribution (like torch.randn(1024)), interprets that noise as decisions about what to generate, and produces corresponding real-looking images. You want to train this network with nice simple image-space MSE loss against your dataset of real images.
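
The setup described above can be sketched as a toy training loop. This is a minimal illustration, not the real thing: it uses only the Python standard library instead of PyTorch, and the linear "decoder", the 8-dimensional noise, and the two made-up 2x2 "images" are all hypothetical stand-ins chosen to keep the example small.

```python
import random

random.seed(0)

NOISE_DIM = 8  # stand-in for something like torch.randn(1024)
IMAGE_DIM = 4  # a tiny 2x2 grayscale "image", flattened

# Hypothetical toy dataset of "real images" (made-up values).
dataset = [
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]

# Toy linear decoder: image = W @ noise + b.
W = [[0.0] * NOISE_DIM for _ in range(IMAGE_DIM)]
b = [0.0] * IMAGE_DIM


def decode(noise):
    """Map a noise vector to a flattened image."""
    return [sum(w * n for w, n in zip(row, noise)) + bias for row, bias in zip(W, b)]


def mse(pred, target):
    """Image-space mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_step(lr=0.01):
    """Sample noise, decode it, and take one SGD step on MSE against a real image."""
    noise = [random.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
    target = random.choice(dataset)
    pred = decode(noise)
    loss = mse(pred, target)
    for i in range(IMAGE_DIM):
        # d(loss)/d(pred[i]) = 2 * (pred[i] - target[i]) / IMAGE_DIM
        grad = 2.0 * (pred[i] - target[i]) / IMAGE_DIM
        for j in range(NOISE_DIM):
            W[i][j] -= lr * grad * noise[j]
        b[i] -= lr * grad
    return loss


losses = [train_step() for _ in range(2000)]
sample = decode([random.gauss(0.0, 1.0) for _ in range(NOISE_DIM)])
```

Note that each noise sample here is paired with a randomly chosen image, which is exactly the naive pairing the paragraph describes.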

List of good image generator training logs

A list of public training logs from neural network image generation models, since I think they're interesting.

The Criteria

Publicly accessible link
Losses plotted every so often
Samples generated every so often
Nontrivial dataset (i.e. not MNIST - 64x64 output RGB or better)

	def summarize_tensor(x):
	    return f"\033[34m{str(tuple(x.shape)).ljust(24)}\033[0m (\033[31mmin {x.min().item():+.4f}\033[0m / \033[32mmean {x.mean().item():+.4f}\033[0m / \033[33mmax {x.max().item():+.4f}\033[0m)"


	class ModelActivationPrinter:
	    def __init__(self, module, submodules_to_log):
	        self.id_to_name = {
	            id(module): str(name) for name, module in module.named_modules()
	        }
	        self.submodules = submodules_to_log

	#!/usr/bin/env python3
	from pathlib import Path
	from safetensors.torch import load_file

	def summarize_tensor(x):
	    if x is None:
	        return "None"
	    x = x.float()
	    return f"({x.min().item():.3f}, {x.mean().item():.3f}, {x.max().item():.3f})"

	# PyTorch <=2.0 doesn't support bfloat16 F.interpolate natively,
	# so we have to do things the old-fashioned way.

	import torch
	import torch.nn as nn

	# functional implementation
	def nearest_neighbor_upsample(x: torch.Tensor, scale_factor: int):
	    """Upsample {x} (NCHW) by scale factor {scale_factor} using nearest neighbor interpolation."""
	    s = scale_factor

	# ------------------------------------------------------------------
	# EDIT: I eventually found a faster way to run SD on macOS, via MPSGraph (~0.8s / step on M1 Pro):
	# https://github.com/madebyollin/maple-diffusion
	# The original CoreML-related code & discussion is preserved below :)
	# ------------------------------------------------------------------

	# you too can run stable diffusion on the apple silicon GPU (no ANE sadly)
	#
	# quick test portraits (each took 50 steps x 2s / step ~= 100s on my M1 Pro):
	# * https://i.imgur.com/5ywISvm.png