Learn, unlearn and relearn.

Sayak Paul sayakpaul

@sayakpaul
sayakpaul / inference.md
Last active February 5, 2025 14:13
(Not so rigorously tested) example showing how to use `bitsandbytes`, `peft`, etc. to LoRA fine-tune Flux.1 Dev.

When loading LoRA params that were obtained on a quantized base model and merging them into the base model, it is recommended to first dequantize the base model, merge the LoRA params into it, and then quantize the model again. This is because merging directly into a 4-bit quantized model can introduce rounding errors. Below, we provide an end-to-end example:

  1. First, load the original model and merge the LoRA params into it:
from diffusers import FluxPipeline 
import torch 

ckpt_id = "black-forest-labs/FLUX.1-dev"
pipeline = FluxPipeline.from_pretrained(
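The rounding concern described above can be illustrated without a GPU. The following is a toy sketch, not the gist's code: `fake_quantize` is a hypothetical uniform absmax quantizer standing in for the real 4-bit scheme in `bitsandbytes`, and the weight and LoRA-update values are made up. It compares merging into already-quantized weights (two rounding steps) with merging in full precision and quantizing once.

```python
def fake_quantize(weights, levels=7):
    """Toy symmetric absmax quantizer (hypothetical, NOT the bitsandbytes scheme).

    Snaps every value to the nearest multiple of `scale`, giving 2 * levels + 1
    grid points, and returns the dequantized values together with the step size.
    """
    scale = max(abs(w) for w in weights) / levels or 1.0
    return [round(w / scale) * scale for w in weights], scale


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


# Made-up full-precision weights and a made-up merged LoRA update.
base = [0.82, -0.41, 0.13, -0.07, 0.56, -0.93, 0.29, 0.04]
delta = [0.03, -0.02, 0.05, 0.01, -0.04, 0.02, -0.01, 0.03]
target = [w + d for w, d in zip(base, delta)]

# Path 1: the base is already quantized; merge into it and quantize again.
quantized_base, _ = fake_quantize(base)
merged_on_quantized, _ = fake_quantize(
    [w + d for w, d in zip(quantized_base, delta)]
)

# Path 2: merge in full precision first, then quantize a single time.
merged_then_quantized, step = fake_quantize(target)

print("merge into quantized base:", max_abs_error(merged_on_quantized, target))
print("merge, then quantize once:", max_abs_error(merged_then_quantized, target))
```

Quantizing once keeps every weight within half a grid step of the merged target. The two-step path stacks two rounding errors, so it only guarantees roughly a full step.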
@sayakpaul
sayakpaul / low_rank_lora.py
Last active December 15, 2024 22:41
Make a high-rank LoRA low-rank.
"""
Usage:
python low_rank_lora.py --repo_id=glif/how2draw --filename="How2Draw-V2_000002800.safetensors" \
--new_rank=4 --new_lora_path="How2Draw-V2_000002800_rank_4.safetensors"
"""
import torch
from huggingface_hub import hf_hub_download
import safetensors.torch
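The preview does not show how the script reduces the rank. One common approach, offered here as an assumption rather than a description of this script, is a truncated SVD of the LoRA update. The symbols are introduced for illustration: LoRA factors $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{r \times n}$, and a target rank $r' < r$.

```latex
\begin{aligned}
\Delta W &= B A = U \Sigma V^{\top} \\
\Delta W &\approx U_{r'} \, \Sigma_{r'} \, V_{r'}^{\top} \\
B' &= U_{r'} \, \Sigma_{r'}^{1/2}, \qquad A' = \Sigma_{r'}^{1/2} \, V_{r'}^{\top}
\end{aligned}
```

Here $U_{r'}$, $\Sigma_{r'}$, and $V_{r'}$ keep only the $r'$ largest singular values and their vectors. By the Eckart-Young theorem, $B'A'$ is then the best rank-$r'$ approximation of $\Delta W$ in the Frobenius norm.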
@sayakpaul
sayakpaul / pipeline_flux_with_cfg_batched.py
Last active September 20, 2024 18:41
Flux with CFG (batched) 💣
# Copyright 2024 Black Forest Labs and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
@sayakpaul
sayakpaul / README.md
Last active August 29, 2025 01:29
This code snippet shows how to split the Flux transformer across two 16GB GPUs and run inference with the full pipeline.
@sayakpaul
sayakpaul / inference.md
Last active June 5, 2025 05:04
Not so rigorously validated FP8 training of Flux (dev) DreamBooth LoRA
from diffusers import AutoPipelineForText2Image
import torch
pipeline = AutoPipelineForText2Image.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipeline.load_lora_weights("sayakpaul/yarn_art_lora_flux", weight_name="pytorch_lora_weights.safetensors")
image = pipeline("a puppy in a pond, yarn art style", guidance_scale=3.5, height=768).images[0]
image.save("yarn.png")
@sayakpaul
sayakpaul / inference_with_torchao_serialized.py
Last active August 29, 2025 11:42
Shows how to run Flux schnell in under 17GB without bells and whistles. It additionally shows how to serialize the quantized checkpoint and load it back.
import torch
from huggingface_hub import hf_hub_download
from diffusers import FluxTransformer2DModel, DiffusionPipeline
dtype, device = torch.bfloat16, "cuda"
ckpt_id = "black-forest-labs/FLUX.1-schnell"
with torch.device("meta"):
    config = FluxTransformer2DModel.load_config(ckpt_id, subfolder="transformer")
    model = FluxTransformer2DModel.from_config(config).to(dtype)
@sayakpaul
sayakpaul / distributed_inference_diffusers.py
Last active September 10, 2024 02:04
Minimal example to show how to run distributed inference from a set of prompts with diffusers and accelerate.
# Originally by jiwooya1000, put together by sayakpaul.
# Documentation: https://huggingface.co/docs/diffusers/main/en/training/distributed_inference
"""
Run:
accelerate launch distributed_inference_diffusers.py --batch_size 8
# Enable memory optimizations for large models like SD3
accelerate launch distributed_inference_diffusers.py --batch_size 8 --low_mem=1
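The core idea behind running a set of prompts on several processes is that each process takes its own slice of the prompt list and works through it in batches. The sketch below shows that idea in plain Python. `shard_for_rank` and `chunk` are hypothetical helpers written for illustration, not functions from `accelerate` or from this script, and the prompts are placeholders.

```python
def shard_for_rank(items, num_processes, rank):
    """Return the contiguous slice of `items` handled by process `rank`.

    Leftover items (when the list does not divide evenly) go to the
    lowest-numbered ranks, one extra item each.
    """
    per_rank, extras = divmod(len(items), num_processes)
    start = rank * per_rank + min(rank, extras)
    end = start + per_rank + (1 if rank < extras else 0)
    return items[start:end]


def chunk(items, batch_size):
    """Split a list into consecutive batches of at most `batch_size` items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


prompts = [f"prompt {i}" for i in range(10)]  # placeholder prompts
num_processes = 4

for rank in range(num_processes):
    local_prompts = shard_for_rank(prompts, num_processes, rank)
    print(rank, chunk(local_prompts, batch_size=2))
```

Every prompt lands on exactly one process, so the per-process outputs can be gathered afterwards without deduplication.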
@sayakpaul
sayakpaul / run_flux_with_limited_resources.md
Last active June 18, 2025 08:53
This document lists resources that show how to run Black Forest Labs' Flux with Diffusers under limited resources.
@sayakpaul
sayakpaul / run_flux_under_24gbs.py
Last active June 28, 2025 22:53
This gist shows how to run Flux on a 24GB 4090 card with Diffusers.
from diffusers import FluxPipeline, AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
from transformers import T5EncoderModel, T5TokenizerFast, CLIPTokenizer, CLIPTextModel
import torch
import gc
def flush():
    gc.collect()
    torch.cuda.empty_cache()
@sayakpaul
sayakpaul / benchmark_pixart-900m-1024-ft.py
Created July 2, 2024 08:59
Benchmarks the "ptx0/pixart-900m-1024-ft" model with `torch.compile()`.
import torch
torch.set_float32_matmul_precision("high")
from diffusers import DiffusionPipeline
import time
pipeline_id = "ptx0/pixart-900m-1024-ft"
pipeline = DiffusionPipeline.from_pretrained(
    pipeline_id,
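When benchmarking a compiled model, the first calls include compilation time, so they are usually discarded as warm-up before timing begins. The following is a minimal timing harness along those lines. `benchmark` is a hypothetical helper, and the workload is a cheap stand-in for the pipeline call, so none of this is taken from the gist itself.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=10):
    """Time `fn` over `runs` calls after `warmup` untimed calls.

    Returns (mean, standard deviation) in seconds. For GPU work, which is
    launched asynchronously, a real benchmark would also synchronize the
    device before each clock read; that step is omitted in this CPU-only sketch.
    """
    for _ in range(warmup):
        fn()  # untimed: absorbs one-off costs such as compilation
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


# Stand-in workload; in practice this would wrap the pipeline call.
mean_s, stdev_s = benchmark(lambda: sum(range(10_000)))
print(f"{mean_s * 1e3:.3f} ms +/- {stdev_s * 1e3:.3f} ms")
```

Reporting the spread alongside the mean makes it easier to tell a real speedup from run-to-run noise.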