@FNGarvin
Created May 9, 2026 15:37

Simplest API Use of Stable-Diffusion.cpp

I own an Nvidia GPU and prefer Podman containers, so I'm using the official stable-diffusion.cpp CUDA image. The same general idea works for running the binaries directly, though.
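For the no-container route, here is a rough sketch of how the same server launch might look. The flags are copied from the Podman invocation in tim.py further down; the `./sd-server` path and the `server_command` helper name are my assumptions, so adjust them to wherever your build puts the binary.

```python
import os

def server_command(binary="./sd-server", models_dir="models", port=1234):
    """Build the argv for running sd-server directly (no container).

    The binary location is an assumption; the flags mirror the
    container invocation used in tim.py.
    """
    models = os.path.abspath(models_dir)
    return [
        binary,
        "--listen-ip", "127.0.0.1",  # local only; the container version binds 0.0.0.0
        "--listen-port", str(port),
        "--diffusion-model", os.path.join(models, "4b.gguf"),
        "--vae", os.path.join(models, "vae.st"),
        "--llm", os.path.join(models, "qwen3.gguf"),
        "--lora-model-dir", "/tmp",
        "--hires-upscalers-dir", "/tmp",
        "--diffusion-fa",
        "--threads", "8",
    ]

# Possible usage (runs the server in the background, like `podman run -d`):
#   import subprocess
#   server = subprocess.Popen(server_command())
#   ... talk to http://127.0.0.1:1234 ...
#   server.terminate()
```

Building the argument list in one place keeps the container and bare-binary variants from drifting apart.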

I am using FLUX.2 Klein 4B in Q4 for this example because it's tiny, it's new, and it supports edit features. You probably need a ~16GB GPU to do 1MP (1024x1024) as done here, but even running on the CPU alone should work at ~512x512 (very slowly) if you have 12GB+ of system RAM.

The directory structure for this particular example is set up like this:

.
├── models
│   ├── 4b.gguf
│   ├── qwen3.gguf
│   └── vae.st
├── edit.png
├── gen.png
└── tim.py

edit.png and gen.png are the outputs we're producing, not prerequisites.
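Since the server will not get far without the three model files, a small preflight check can save a confusing container failure. This is only a sketch: `missing_models` is a hypothetical helper, and the file names are simply the ones from the listing above.

```python
from pathlib import Path

# File names taken from the directory listing above.
REQUIRED_MODELS = ("4b.gguf", "qwen3.gguf", "vae.st")

def missing_models(models_dir="models"):
    """Return the required model files that are absent from models_dir."""
    base = Path(models_dir)
    return [name for name in REQUIRED_MODELS if not (base / name).is_file()]

# Possible usage before launching the server:
#   missing = missing_models()
#   if missing:
#       raise SystemExit(f"missing model files: {', '.join(missing)}")
```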

tim.py looks like this:

#!/usr/bin/env python3
import base64, requests, subprocess, time, os
GEN_PROMPT = "A lone, intricate tumbleweed rolling across a cracked asphalt highway in the high desert. Cinematic wide shot, golden hour lighting with long shadows, heat haze shimmering off the ground. 8k resolution, highly detailed dry brush textures, dust kicking up behind the tumbleweed as it bounces."
EDIT_PROMPT = "change this to a snowy scene with the same general content depicted in a very cold environment with a subtle blue filter"
def run():
    print("launching server & loading models...")
    addr = "http://127.0.0.1:1234"
    # Start sd-server in a detached container (-d); --rm discards it once stopped.
    subprocess.run([
        "podman", "run", "-d", "--rm", "--name", "sd-mini",
        "--device", "nvidia.com/gpu=all", "--entrypoint", "/sd-server",
        # Only ./models is mounted: images travel over HTTP as base64, and
        # Podman refuses to start if a bind-mount source directory is missing.
        "-v", f"{os.getcwd()}/models:/models",
        "-p", "1234:1234", "ghcr.io/leejet/stable-diffusion.cpp:master-cuda",
        "--listen-ip", "0.0.0.0", "--listen-port", "1234",
        "--diffusion-model", "/models/4b.gguf", "--vae", "/models/vae.st",
        "--llm", "/models/qwen3.gguf", "--lora-model-dir", "/tmp",
        "--hires-upscalers-dir", "/tmp", "--diffusion-fa", "--threads", "8"
    ], check=True)

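    # Poll until the server answers; loading the models can take a while.
    while True:
        try:
            if requests.get(f"{addr}/v1/models", timeout=2).status_code == 200:
                break
        except requests.RequestException:
            pass  # server is not accepting connections yet
        time.sleep(2)  # sleep on every miss so a non-200 reply doesn't busy-loop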

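    def wait_job(payload):
        # Submit the job, then poll its status until the server reports completion.
        resp = requests.post(f"{addr}/sdcpp/v1/img_gen", json=payload)
        resp.raise_for_status()  # surface HTTP errors instead of a KeyError on "id"
        jid = resp.json()["id"]
        while True:
            r = requests.get(f"{addr}/sdcpp/v1/jobs/{jid}").json()
            if r["status"] == "completed":
                return r["result"]["images"][0]["b64_json"]
            time.sleep(0.5)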

    print("queuing t2i job")
    b64_gen = wait_job({"prompt": GEN_PROMPT, "width": 1024, "height": 1024, "sample_params": {"sample_steps": 4}})
    with open("gen.png", "wb") as f: f.write(base64.b64decode(b64_gen))

    print("queuing edit job")
    b64_edit = wait_job({"prompt": EDIT_PROMPT, "width": 1024, "height": 1024, "ref_images": [b64_gen], "sample_params": {"sample_steps": 4}})
    with open("edit.png", "wb") as f: f.write(base64.b64decode(b64_edit))
    
    print("tearing down container")
    subprocess.run(["podman", "rm", "-f", "sd-mini"], capture_output=True)

if __name__ == "__main__":
    run()

#END OF tim.py
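Both polling loops in tim.py spin forever if the server never comes up or a job never completes. One way to bound that is a small generic helper with a deadline. This is a sketch, not part of the script: `poll_until` and its default timeout are my own choices.

```python
import time

def poll_until(check, timeout=300.0, interval=0.5):
    """Call check() until it returns a non-None value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"gave up after {timeout} seconds")
        time.sleep(interval)

# Possible use inside wait_job, replacing the `while True` loop:
#   def job_image():
#       r = requests.get(f"{addr}/sdcpp/v1/jobs/{jid}").json()
#       return r["result"]["images"][0]["b64_json"] if r["status"] == "completed" else None
#   return poll_until(job_image)
```

Using `time.monotonic()` keeps the deadline immune to wall-clock adjustments while the job runs.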

Running it looks like this:

time ./tim.py
launching server & loading models...
6a8d73db31f88f419dff4b9116423ed6053352665351dfcbfaec7f1ffac7b4c1
queuing t2i job
queuing edit job
tearing down container

real    0m50.151s
user    0m0.333s
sys     0m0.197s

And ~50 seconds later, these images have been produced:

[Image: original t2i image]
[Image: i2i edit image]
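A quick way to sanity-check gen.png and edit.png without opening them is to read the dimensions straight from the PNG header. This is a minimal sketch: `png_size` is a hypothetical helper, and it assumes the server really returns PNG data, as the file names in tim.py suggest.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_size(data):
    """Return (width, height) from a PNG's IHDR chunk, or raise ValueError."""
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR", then
    # big-endian 32-bit width and height.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return struct.unpack(">II", data[16:24])

# Possible usage after tim.py has run:
#   with open("gen.png", "rb") as f:
#       print(png_size(f.read(24)))
```

If the request above asked for 1024x1024, both files would be expected to report that size.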
