Learn, unlearn and relearn.

Sayak Paul sayakpaul

ML at @huggingface | One PR at a time

2.9k followers · 8 following

sayakpaul / pipeline_flux_with_cfg_batched.py

Last active September 20, 2024 18:41

Flux with CFG (batched) 💣

	# Copyright 2024 Black Forest Labs and The HuggingFace Team. All rights reserved.
	#
	# Licensed under the Apache License, Version 2.0 (the "License");
	# you may not use this file except in compliance with the License.
	# You may obtain a copy of the License at
	#
	# http://www.apache.org/licenses/LICENSE-2.0
	#
	# Unless required by applicable law or agreed to in writing, software
	# distributed under the License is distributed on an "AS IS" BASIS,

sayakpaul / README.md

Last active March 18, 2025 08:37

This code snippet shows how to split the Flux transformer across two 16GB GPUs and run inference with the full pipeline.

Detailed writeup: https://huggingface2.notion.site/How-to-split-Flux-transformer-and-run-inference-aa1583ad23ce47a78589a79bb9309ab0

TL;DR: we split the models where possible and decouple the different stages of the pipeline.

sayakpaul / inference.md

Last active June 5, 2025 05:04

Not so rigorously validated FP8 training of Flux (dev) DreamBooth LoRA

from diffusers import AutoPipelineForText2Image
import torch
pipeline = AutoPipelineForText2Image.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipeline.load_lora_weights("sayakpaul/yarn_art_lora_flux", weight_name="pytorch_lora_weights.safetensors")
image = pipeline("a puppy in a pond, yarn art style", guidance_scale=3.5, height=768).images[0]
image.save("yarn.png")

sayakpaul / inference_with_torchao_serialized.py

Last active January 13, 2025 01:51

Shows how to run Flux schnell in under 17GB without bells and whistles. It also shows how to serialize the quantized checkpoint and load it back.

	import torch
	from huggingface_hub import hf_hub_download
	from diffusers import FluxTransformer2DModel, DiffusionPipeline

	dtype, device = torch.bfloat16, "cuda"
	ckpt_id = "black-forest-labs/FLUX.1-schnell"

	with torch.device("meta"):
	    config = FluxTransformer2DModel.load_config(ckpt_id, subfolder="transformer")
	    model = FluxTransformer2DModel.from_config(config).to(dtype)

sayakpaul / distributed_inference_diffusers.py

Last active September 10, 2024 02:04

Minimal example to show how to run distributed inference from a set of prompts with diffusers and accelerate.

	# Originally by jiwooya1000, put together by sayakpaul.
	# Documentation: https://huggingface.co/docs/diffusers/main/en/training/distributed_inference

	"""
	Run:

	accelerate launch distributed_inference_diffusers.py --batch_size 8

	# Enable memory optimizations for large models like SD3
	accelerate launch distributed_inference_diffusers.py --batch_size 8 --low_mem=1
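
The idea behind running a set of prompts across several processes can be sketched in plain Python: each process takes a contiguous slice of the prompt list and generates only for that slice. The function name `split_between_workers` and its parameters are hypothetical and exist only for illustration; the gist itself relies on accelerate, which provides its own utility for this, so treat the block as a sketch of the concept rather than the script's implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_between_workers(
    items: Sequence[T], num_workers: int, worker_index: int
) -> List[T]:
    """Return the contiguous slice of `items` owned by `worker_index`.

    Hypothetical helper: when the split is uneven, the earliest workers
    each receive one extra item.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index out of range")
    base, extra = divmod(len(items), num_workers)
    # Workers before this one that received an extra item shift the start.
    start = worker_index * base + min(worker_index, extra)
    end = start + base + (1 if worker_index < extra else 0)
    return list(items[start:end])


# Illustration only: 10 prompts shared by 4 workers.
prompts = [f"prompt {i}" for i in range(10)]
shards = [split_between_workers(prompts, 4, rank) for rank in range(4)]
```

In a real launch, each process would compute only its own shard (using its own rank) and run the pipeline on it, so no prompt is generated twice.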

sayakpaul / run_flux_with_limited_resources.md

Last active June 18, 2025 08:53

This document lists resources that show how to run Black Forest Labs' Flux with Diffusers under limited resources.

Running Flux under limited resources with Diffusers

Flux: https://blackforestlabs.ai/announcing-black-forest-labs/

The first resource even allows you to run the pipeline in under 16GB of VRAM.

sayakpaul / run_flux_under_24gbs.py

Last active June 28, 2025 22:53

This gist shows how to run Flux on a 24GB 4090 card with Diffusers.

	from diffusers import FluxPipeline, AutoencoderKL
	from diffusers.image_processor import VaeImageProcessor
	from transformers import T5EncoderModel, T5TokenizerFast, CLIPTokenizer, CLIPTextModel
	import torch
	import gc


	def flush():
	    gc.collect()
	    torch.cuda.empty_cache()

sayakpaul / benchmark_pixart-900m-1024-ft.py

Created July 2, 2024 08:59

Benchmarks the "ptx0/pixart-900m-1024-ft" model with `torch.compile()`.

	import torch

	torch.set_float32_matmul_precision("high")

	from diffusers import DiffusionPipeline
	import time

	pipeline_id = "ptx0/pixart-900m-1024-ft"
	pipeline = DiffusionPipeline.from_pretrained(
	    pipeline_id,
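
When benchmarking a compiled model, the first calls include compilation time, so they are usually excluded from the measurement. A minimal sketch of such a timing loop, using only the standard library, might look like the following. The helper name `benchmark` and its `warmup` and `runs` parameters are assumptions made for illustration and are not taken from the gist.

```python
import time
from statistics import mean
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 3, runs: int = 5) -> float:
    """Return the mean wall-clock seconds per call, measured after warmup.

    Hypothetical helper: `fn` is any zero-argument callable, for example a
    lambda that wraps a pipeline call.
    """
    # Warmup calls are not timed; with torch.compile() they absorb compilation.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Illustration only: a trivial callable that records each invocation.
calls = []
avg_seconds = benchmark(lambda: calls.append(1), warmup=2, runs=3)
```

For GPU workloads, CUDA kernels are launched asynchronously, so the callable should synchronize the device (for example with `torch.cuda.synchronize()`) before it returns; otherwise the timer may stop before the work has finished.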

sayakpaul / run_sd3_compile.py

Last active August 14, 2024 21:30

The script shows how to run SD3 with `torch.compile()`

	import torch

	torch.set_float32_matmul_precision("high")

	from diffusers import StableDiffusion3Pipeline
	import time

	id = "stabilityai/stable-diffusion-3-medium-diffusers"
	pipeline = StableDiffusion3Pipeline.from_pretrained(
	    id,

sayakpaul / run_sd3_8bit.py

Last active November 25, 2024 21:50

The code snippet shows how to run Stable Diffusion 3 with an 8-bit T5-XXL, drastically reducing the memory requirements.

	from diffusers import StableDiffusion3Pipeline
	from transformers import T5EncoderModel
	import torch
	import time
	import gc

	def flush():
	    gc.collect()
	    torch.cuda.empty_cache()
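
To see why an 8-bit text encoder helps, a back-of-the-envelope estimate of weight memory is enough: storage scales with bits per parameter, so going from 16-bit to 8-bit weights roughly halves the footprint. The sketch below assumes an approximate parameter count of 4.7 billion for the T5-XXL encoder; that figure, the helper name `weights_gib`, and the constant name are assumptions for illustration. The estimate covers weights only and ignores activations and quantization overhead.

```python
def weights_gib(num_params: float, bits_per_param: int) -> float:
    """Estimate the weight storage in GiB for a given parameter precision."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: approximate size of the T5-XXL text encoder.
T5_XXL_ENCODER_PARAMS = 4.7e9

fp16_gib = weights_gib(T5_XXL_ENCODER_PARAMS, 16)
int8_gib = weights_gib(T5_XXL_ENCODER_PARAMS, 8)
print(f"16-bit weights: {fp16_gib:.1f} GiB, 8-bit weights: {int8_gib:.1f} GiB")
```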
