I have an Nvidia GPU and prefer Podman containers, so I'm using the official stable-diffusion.cpp CUDA image. The same general idea works if you run the binaries directly, though.
I'm using FLUX.2 Klein 4B at Q4 for this example because it's tiny, new, and supports editing. You probably need a ~16GB GPU to do 1MP (1024x1024) as done here, but even running on the CPU alone should work at ~512x512 (very slowly) if you have 12GB+ of system RAM.
The directory structure for this example is set up like this:
.
├── models
│   ├── 4b.gguf
│   ├── qwen3.gguf
│   └── vae.st
├── edit.png
├── gen.png
└── tim.py
edit.png and gen.png are the outputs we're producing, not prerequisites.
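The server is pointed at those three files under ./models, so a quick check before launching can save a confusing startup failure. A minimal sketch, using only the file names from the listing above; `REQUIRED_MODELS` and `missing_models` are hypothetical names and not part of tim.py:

```python
from pathlib import Path

# File names copied from the directory listing; adjust if yours differ.
REQUIRED_MODELS = ("4b.gguf", "qwen3.gguf", "vae.st")

def missing_models(models_dir):
    """Return the expected model files that are absent from models_dir."""
    root = Path(models_dir)
    return [name for name in REQUIRED_MODELS if not (root / name).is_file()]

# Example use before starting the container:
#   missing = missing_models("models")
#   if missing:
#       raise SystemExit("missing model files: " + ", ".join(missing))
```

tim.py also bind-mounts ./inputs and ./outputs, which aren't shown in the listing. If Podman complains about a missing host path, creating those two directories first (mkdir -p inputs outputs) should sort it out.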
tim.py looks like this:
#!/usr/bin/env python3
import base64, requests, subprocess, time, os

GEN_PROMPT = "A lone, intricate tumbleweed rolling across a cracked asphalt highway in the high desert. Cinematic wide shot, golden hour lighting with long shadows, heat haze shimmering off the ground. 8k resolution, highly detailed dry brush textures, dust kicking up behind the tumbleweed as it bounces."
EDIT_PROMPT = "change this to a snowy scene with the same general content depicted in a very cold environment with a subtle blue filter"

def run():
    print("launching server & loading models...")
    addr = "http://127.0.0.1:1234"
    # Start sd-server detached; podman prints the new container ID.
    subprocess.run([
        "podman", "run", "-d", "--rm", "--name", "sd-mini",
        "--device", "nvidia.com/gpu=all", "--entrypoint", "/sd-server",
        "-v", f"{os.getcwd()}/models:/models",
        "-v", f"{os.getcwd()}/inputs:/inputs",
        "-v", f"{os.getcwd()}/outputs:/outputs",
        "-p", "1234:1234", "ghcr.io/leejet/stable-diffusion.cpp:master-cuda",
        "--listen-ip", "0.0.0.0", "--listen-port", "1234",
        "--diffusion-model", "/models/4b.gguf", "--vae", "/models/vae.st",
        "--llm", "/models/qwen3.gguf", "--lora-model-dir", "/tmp",
        "--hires-upscalers-dir", "/tmp", "--diffusion-fa", "--threads", "8"
    ], check=True)

    # Poll until the server answers; it refuses connections while models load.
    while True:
        try:
            if requests.get(f"{addr}/v1/models", timeout=2).status_code == 200:
                break
        except requests.RequestException:
            pass  # not listening yet
        time.sleep(2)  # also back off on non-200 replies instead of spinning

    def wait_job(payload):
        # Submit a job, then poll its status until the image is ready.
        jid = requests.post(f"{addr}/sdcpp/v1/img_gen", json=payload).json()["id"]
        while True:
            r = requests.get(f"{addr}/sdcpp/v1/jobs/{jid}").json()
            if r["status"] == "completed":
                return r["result"]["images"][0]["b64_json"]
            time.sleep(0.5)

    print("queuing t2i job")
    b64_gen = wait_job({"prompt": GEN_PROMPT, "width": 1024, "height": 1024, "sample_params": {"sample_steps": 4}})
    with open("gen.png", "wb") as f:
        f.write(base64.b64decode(b64_gen))

    print("queuing edit job")
    # The generated image goes straight back in as the reference for the edit.
    b64_edit = wait_job({"prompt": EDIT_PROMPT, "width": 1024, "height": 1024, "ref_images": [b64_gen], "sample_params": {"sample_steps": 4}})
    with open("edit.png", "wb") as f:
        f.write(base64.b64decode(b64_edit))

    print("tearing down container")
    subprocess.run(["podman", "rm", "-f", "sd-mini"], capture_output=True)

if __name__ == "__main__":
    run()
#END OF tim.py
Running it looks like this:
❯ time ./tim.py
launching server & loading models...
6a8d73db31f88f419dff4b9116423ed6053352665351dfcbfaec7f1ffac7b4c1
queuing t2i job
queuing edit job
tearing down container
real 0m50.151s
user 0m0.333s
sys 0m0.197s
And ~50 seconds later, these images have been produced:
Original t2i
i2i edit
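The edit step in tim.py feeds the freshly generated image straight back in as ref_images. To edit a PNG that already sits on disk, the same request body can be built from a file. A minimal sketch: `edit_payload` is a hypothetical helper, the field names are copied from the payload tim.py sends, and it assumes the server accepts any base64-encoded image file in ref_images the same way it accepts its own b64_json output.

```python
import base64
from pathlib import Path

def edit_payload(image_path, prompt, width=1024, height=1024, steps=4):
    """Build an edit request body shaped like the one tim.py sends."""
    # Base64-encode the raw file bytes (assumption: same format the
    # server returns in b64_json, which tim.py passes back unchanged).
    ref_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "ref_images": [ref_b64],
        "sample_params": {"sample_steps": steps},
    }
```

Inside run(), this would slot in as wait_job(edit_payload("gen.png", EDIT_PROMPT)), which lets you re-run edits on an earlier output without repeating the t2i step.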
