Learn, unlearn and relearn.

Sayak Paul sayakpaul

sayakpaul / inference.md
Last active March 10, 2025 15:31
Not so rigorously validated FP8 training of Flux (dev) DreamBooth LoRA
from diffusers import AutoPipelineForText2Image
import torch
pipeline = AutoPipelineForText2Image.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipeline.load_lora_weights("sayakpaul/yarn_art_lora_flux", weight_name="pytorch_lora_weights.safetensors")
image = pipeline("a puppy in a pond, yarn art style", guidance_scale=3.5, height=768).images[0]
image.save("yarn.png")
sayakpaul / inference_with_torchao_serialized.py
Last active January 13, 2025 01:51
Shows how to run Flux schnell under 17 GB without bells and whistles. It also shows how to serialize the quantized checkpoint and load it back.
import torch
from huggingface_hub import hf_hub_download
from diffusers import FluxTransformer2DModel, DiffusionPipeline
dtype, device = torch.bfloat16, "cuda"
ckpt_id = "black-forest-labs/FLUX.1-schnell"
with torch.device("meta"):
    config = FluxTransformer2DModel.load_config(ckpt_id, subfolder="transformer")
    model = FluxTransformer2DModel.from_config(config).to(dtype)
sayakpaul / distributed_inference_diffusers.py
Last active September 10, 2024 02:04
Minimal example to show how to run distributed inference from a set of prompts with diffusers and accelerate.
# Originally by jiwooya1000, put together by sayakpaul.
# Documentation: https://huggingface.co/docs/diffusers/main/en/training/distributed_inference
"""
Run:
accelerate launch distributed_inference_diffusers.py --batch_size 8
# Enable memory optimizations for large models like SD3
accelerate launch distributed_inference_diffusers.py --batch_size 8 --low_mem=1
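The core idea behind running a set of prompts across several processes is that each process takes its own slice of the prompt list. The following is a minimal pure-Python sketch of that idea, not the gist's code: `shard_prompts` is a hypothetical helper written for illustration. In practice, `accelerate` provides `PartialState.split_between_processes` for this purpose.

```python
from typing import List


def shard_prompts(prompts: List[str], rank: int, world_size: int) -> List[str]:
    """Return the contiguous slice of `prompts` owned by process `rank`.

    Hypothetical helper for illustration only. When the prompts do not
    divide evenly, the lowest-ranked processes each take one extra prompt.
    """
    base, extra = divmod(len(prompts), world_size)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return prompts[start:end]


prompts = ["a cat", "a dog", "a fox", "an owl", "a whale"]
# Simulate two processes; a real launcher would give each process its own rank.
shards = [shard_prompts(prompts, rank, world_size=2) for rank in range(2)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Each process would then run the pipeline only on its own shard, so no prompt is generated twice.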
sayakpaul / run_flux_with_limited_resources.md
Last active February 23, 2025 07:10
This document lists resources that show how to run Black Forest Labs' Flux with Diffusers under limited resources.
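A rough way to see why lower-precision weights help under limited resources is to multiply the parameter count by the bytes stored per parameter. The sketch below does only that. The function name `weight_memory_gib` and the 12e9 parameter count are assumptions chosen for illustration, not figures taken from the gists. The estimate ignores activations, other pipeline components, and framework overhead.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# ASSUMPTION: a placeholder parameter count, used only for illustration.
NUM_PARAMS = 12e9

for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Halving the bits per parameter halves the weight footprint, which is the effect quantization relies on.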
sayakpaul / run_flux_under_24gbs.py
Last active April 7, 2025 21:44
This gist shows how to run Flux on a 24GB 4090 card with Diffusers.
from diffusers import FluxPipeline, AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
from transformers import T5EncoderModel, T5TokenizerFast, CLIPTokenizer, CLIPTextModel
import torch
import gc
def flush():
    gc.collect()
    torch.cuda.empty_cache()
sayakpaul / benchmark_pixart-900m-1024-ft.py
Created July 2, 2024 08:59
Benchmarks the "ptx0/pixart-900m-1024-ft" model with `torch.compile()`.
import torch
torch.set_float32_matmul_precision("high")
from diffusers import DiffusionPipeline
import time
pipeline_id = "ptx0/pixart-900m-1024-ft"
pipeline = DiffusionPipeline.from_pretrained(
    pipeline_id,
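When benchmarking a compiled model, the first calls include compilation time, so they are usually discarded as warmup before timing. The following is a minimal sketch of such a timing loop, not the gist's code: `benchmark` and `fake_pipeline` are hypothetical names, and the stand-in workload runs on the CPU. With a real CUDA pipeline you would also call `torch.cuda.synchronize()` before reading the clock, because GPU kernels run asynchronously.

```python
import time
from statistics import mean


def benchmark(fn, warmup: int = 3, runs: int = 10) -> float:
    """Return the mean wall-clock seconds per call of `fn`, after warmup."""
    for _ in range(warmup):
        fn()  # Discarded: a compiled model would compile during these calls.
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


calls = []


def fake_pipeline():
    """Stand-in workload (ASSUMPTION: replaces a real pipeline call)."""
    calls.append(1)
    sum(i * i for i in range(10_000))


avg = benchmark(fake_pipeline, warmup=2, runs=5)
print(f"{avg * 1e3:.3f} ms per call over 5 timed runs")
```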
sayakpaul / run_sd3_compile.py
Last active August 14, 2024 21:30
The script shows how to run SD3 with `torch.compile()`
import torch
torch.set_float32_matmul_precision("high")
from diffusers import StableDiffusion3Pipeline
import time
id = "stabilityai/stable-diffusion-3-medium-diffusers"
pipeline = StableDiffusion3Pipeline.from_pretrained(
    id,
sayakpaul / run_sd3_8bit.py
Last active November 25, 2024 21:50
The code snippet shows how to run Stable Diffusion 3 with an 8-bit T5-XXL, drastically reducing the memory requirements.
from diffusers import StableDiffusion3Pipeline
from transformers import T5EncoderModel
import torch
import time
import gc
def flush():
    gc.collect()
    torch.cuda.empty_cache()
sayakpaul / run_hunyuan_dit_compile.py
Created June 5, 2024 05:56
Benchmarking script for running Hunyuan DiT with `torch.compile()`.
import torch
torch.set_float32_matmul_precision("high")
from diffusers import HunyuanDiTPipeline
import argparse
import time
def load_pipeline(args):
sayakpaul / run_hunyuan_dit_less_memory.py
Created June 5, 2024 03:55
Run `HunyuanDiTPipeline` from Diffusers under 6 GB of GPU VRAM.
"""
Make sure you have `diffusers`, `accelerate`, `transformers`, and `bitsandbytes` installed.
You also need PyTorch and CUDA set up.
Once the dependencies are installed, you can run `python run_hunyuan_dit_less_memory.py`.
"""
from diffusers import HunyuanDiTPipeline
from transformers import T5EncoderModel