Sayak Paul sayakpaul

:octocat: Learn, unlearn and relearn.
sayakpaul / pipeline_flux_with_cfg_batched.py
Last active September 20, 2024 18:41
Flux with CFG (batched) 💣
# Copyright 2024 Black Forest Labs and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
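The preview cuts off at the license header; the gist itself implements true classifier-free guidance for Flux as a single batched forward pass. Below is a minimal, self-contained sketch of just the combine step, assuming the prediction stacks the unconditional and conditional halves along the batch dimension (the function name and shapes are illustrative, not the gist's actual code):

import torch

def batched_cfg(noise_pred: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    # Split the batched [uncond, cond] prediction and blend with the CFG formula.
    noise_uncond, noise_cond = noise_pred.chunk(2, dim=0)
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# Toy check with random latents: batch of 2 = uncond + cond.
pred = torch.randn(2, 16, 64, 64)
guided = batched_cfg(pred, guidance_scale=3.5)
print(guided.shape)  # torch.Size([1, 16, 64, 64])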
sayakpaul / README.md
Last active March 18, 2025 08:37
This code snippet shows how to split the Flux transformer across two 16GB GPUs and run inference with the full pipeline.
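The code itself isn't shown in the preview; a minimal sketch of one way to achieve this with diffusers' accelerate-backed placement, assuming the installed version supports `device_map="balanced"` for pipelines (the per-device memory caps are assumptions for two 16GB cards):

import torch
from diffusers import FluxPipeline

# "balanced" lets accelerate spread the pipeline components across all
# visible GPUs; max_memory caps how much each device may hold.
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    device_map="balanced",
    max_memory={0: "16GB", 1: "16GB"},
)
image = pipeline(
    "a photo of a dog", num_inference_steps=28, guidance_scale=3.5
).images[0]
image.save("flux_split.png")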
sayakpaul / inference.md
Last active June 5, 2025 05:04
Not-so-rigorously validated FP8 training of a Flux (dev) DreamBooth LoRA
from diffusers import AutoPipelineForText2Image
import torch
pipeline = AutoPipelineForText2Image.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
# Load the LoRA weights trained in this gist on top of the base pipeline.
pipeline.load_lora_weights("sayakpaul/yarn_art_lora_flux", weight_name="pytorch_lora_weights.safetensors")
image = pipeline("a puppy in a pond, yarn art style", guidance_scale=3.5, height=768).images[0]
image.save("yarn.png")
sayakpaul / inference_with_torchao_serialized.py
Last active January 13, 2025 01:51
Shows how to run Flux (schnell) under 17GB without bells and whistles. It additionally shows how to serialize the quantized checkpoint and load it back.
import torch
from huggingface_hub import hf_hub_download
from diffusers import FluxTransformer2DModel, DiffusionPipeline
dtype, device = torch.bfloat16, "cuda"
ckpt_id = "black-forest-labs/FLUX.1-schnell"
with torch.device("meta"):
    config = FluxTransformer2DModel.load_config(ckpt_id, subfolder="transformer")
    model = FluxTransformer2DModel.from_config(config).to(dtype)
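The preview stops before the quantization and serialization steps. A hedged sketch of how they might continue, using torchao's `quantize_` API (the save path and the int8 weight-only choice are assumptions; `ckpt_id`, `dtype`, `device`, and the meta-device `model` come from the preview):

from torchao.quantization import quantize_, int8_weight_only

# Materialize a real copy of the transformer, quantize it in place,
# and serialize the quantized state dict.
transformer = FluxTransformer2DModel.from_pretrained(
    ckpt_id, subfolder="transformer", torch_dtype=dtype
)
quantize_(transformer, int8_weight_only())
torch.save(transformer.state_dict(), "flux_schnell_int8.pt")  # hypothetical path

# Load back into the meta-device skeleton; assign=True swaps the saved
# (quantized) tensors in without allocating twice. weights_only=False is
# needed because torchao uses tensor subclasses.
state_dict = torch.load("flux_schnell_int8.pt", map_location="cpu", weights_only=False)
model.load_state_dict(state_dict, assign=True)
model.to(device)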
sayakpaul / distributed_inference_diffusers.py
Last active September 10, 2024 02:04
Minimal example to show how to run distributed inference from a set of prompts with diffusers and accelerate.
# Originally by jiwooya1000, put together by sayakpaul.
# Documentation: https://huggingface.co/docs/diffusers/main/en/training/distributed_inference
"""
Run:
accelerate launch distributed_inference_diffusers.py --batch_size 8
# Enable memory optimizations for large models like SD3
accelerate launch distributed_inference_diffusers.py --batch_size 8 --low_mem=1
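Per the linked documentation, the core of the script splits the prompt list across processes with accelerate's `PartialState`; a minimal sketch (the model choice and prompts are illustrative):

import torch
from accelerate import PartialState
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
distributed_state = PartialState()
pipeline.to(distributed_state.device)

prompts = ["a dog", "a cat", "a frog", "a bird"]
# Each process receives its own shard of the prompts and runs independently.
with distributed_state.split_between_processes(prompts) as shard:
    for i, prompt in enumerate(shard):
        image = pipeline(prompt).images[0]
        image.save(f"result_{distributed_state.process_index}_{i}.png")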
sayakpaul / run_flux_with_limited_resources.md
Last active June 18, 2025 08:53
This document lists resources that show how to run Black Forest Labs' Flux with Diffusers under limited resources.
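The simplest of those techniques is sequential CPU offloading, which trades speed for a much smaller VRAM footprint; a sketch assuming a single small GPU:

import torch
from diffusers import FluxPipeline

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
# Stream weights to the GPU module by module instead of holding the whole
# pipeline in VRAM; slow, but fits far smaller cards.
pipeline.enable_sequential_cpu_offload()
image = pipeline(
    "a tiny astronaut", num_inference_steps=4, guidance_scale=0.0
).images[0]
image.save("flux_offload.png")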
sayakpaul / run_flux_under_24gbs.py
Last active June 28, 2025 22:53
This gist shows how to run Flux on a 24GB 4090 card with Diffusers.
from diffusers import FluxPipeline, AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
from transformers import T5EncoderModel, T5TokenizerFast, CLIPTokenizer, CLIPTextModel
import torch
import gc
def flush():
    gc.collect()
    torch.cuda.empty_cache()
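The preview stops after the `flush()` helper. The gist's overall pattern is to encode the prompt with the text encoders first, free them, and only then load the transformer and VAE; a condensed sketch of that flow (the prompt and argument values are assumptions):

ckpt_id = "black-forest-labs/FLUX.1-dev"

# Stage 1: load only the text encoders, encode the prompt, then free them.
pipeline = FluxPipeline.from_pretrained(
    ckpt_id, transformer=None, vae=None, torch_dtype=torch.bfloat16
).to("cuda")
with torch.no_grad():
    prompt_embeds, pooled_prompt_embeds, _ = pipeline.encode_prompt(
        prompt="a mystic cat", prompt_2=None, max_sequence_length=256
    )
del pipeline
flush()

# Stage 2: load the denoiser and VAE, then run from the cached embeddings.
pipeline = FluxPipeline.from_pretrained(
    ckpt_id, text_encoder=None, text_encoder_2=None,
    tokenizer=None, tokenizer_2=None, torch_dtype=torch.bfloat16
).to("cuda")
image = pipeline(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=28, guidance_scale=3.5,
).images[0]
image.save("flux_24gb.png")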
sayakpaul / benchmark_pixart-900m-1024-ft.py
Created July 2, 2024 08:59
Benchmarks the "ptx0/pixart-900m-1024-ft" model with `torch.compile()`.
import torch
torch.set_float32_matmul_precision("high")
from diffusers import DiffusionPipeline
import time
pipeline_id = "ptx0/pixart-900m-1024-ft"
pipeline = DiffusionPipeline.from_pretrained(
    pipeline_id,
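The preview truncates inside the `from_pretrained` call. A hedged sketch of how such a benchmark typically continues (the dtype, step count, and warmup count are assumptions, not the gist's exact values):

pipeline = DiffusionPipeline.from_pretrained(
    pipeline_id, torch_dtype=torch.float16
).to("cuda")

# Compile the denoiser; fullgraph + max-autotune is the usual benchmarking setup.
pipeline.transformer = torch.compile(
    pipeline.transformer, mode="max-autotune", fullgraph=True
)

prompt = "a photograph of an astronaut riding a horse"
for _ in range(3):  # warmup so compilation cost isn't measured
    _ = pipeline(prompt, num_inference_steps=30)

start = time.time()
_ = pipeline(prompt, num_inference_steps=30)
print(f"Latency: {time.time() - start:.3f}s")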
sayakpaul / run_sd3_compile.py
Last active August 14, 2024 21:30
The script shows how to run SD3 with `torch.compile()`.
import torch
torch.set_float32_matmul_precision("high")
from diffusers import StableDiffusion3Pipeline
import time
id = "stabilityai/stable-diffusion-3-medium-diffusers"
pipeline = StableDiffusion3Pipeline.from_pretrained(
    id,
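Here too the preview cuts off mid-call. For SD3 the usual `torch.compile()` recipe covers both the transformer and the VAE decode; a sketch under those assumptions (`id` and the imports come from the preview):

pipeline = StableDiffusion3Pipeline.from_pretrained(
    id, torch_dtype=torch.float16
).to("cuda")

# channels_last plus fullgraph compilation of the two heavy components.
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
pipeline.transformer = torch.compile(
    pipeline.transformer, mode="max-autotune", fullgraph=True
)
pipeline.vae.decode = torch.compile(
    pipeline.vae.decode, mode="max-autotune", fullgraph=True
)

prompt = "a photo of a cat holding a sign that says hello world"
for _ in range(3):  # warmup runs so compilation isn't measured
    _ = pipeline(prompt, num_inference_steps=28)
start = time.time()
_ = pipeline(prompt, num_inference_steps=28)
print(f"Latency: {time.time() - start:.3f}s")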
sayakpaul / run_sd3_8bit.py
Last active November 25, 2024 21:50
The code snippet shows how to run Stable Diffusion 3 with an 8-bit T5-XXL, drastically reducing the memory requirements.
from diffusers import StableDiffusion3Pipeline
from transformers import T5EncoderModel
import torch
import time
import gc
def flush():
    gc.collect()
    torch.cuda.empty_cache()
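The preview ends at the `flush()` helper. The 8-bit part loads T5-XXL through transformers with bitsandbytes and hands it to the pipeline; a sketch assuming bitsandbytes is installed (`T5EncoderModel` and `StableDiffusion3Pipeline` are imported in the preview):

from transformers import BitsAndBytesConfig

model_id = "stabilityai/stable-diffusion-3-medium-diffusers"

# Load only T5-XXL in 8-bit; the other two text encoders stay in fp16.
text_encoder = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_3",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
pipeline = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    text_encoder_3=text_encoder,
    device_map="balanced",
    torch_dtype=torch.float16,
)
image = pipeline("a photo of a cat", num_inference_steps=28).images[0]
image.save("sd3_8bit.png")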