llama.cpp time sharing
Yes — this is exactly what llama-server's slot system (plus your queue dispatcher) was built for. You can run one llama-server instance on a single GGUF, expose several independent message queues, and give each queue explicit time slices that are long enough to amortize the cost of flushing the KV cache for a fresh context window. Model weights stay hot in VRAM/CPU the whole time; only per-slot KV caches are flushed and recomputed when you want a truly fresh start.

Why this works (no cache thrashing)

- Slots = isolated KV caches: Start the server with -np N (N = number of queues plus 1–2 of buffer). Each slot gets its own KV cache. The server automatically assigns requests to slots (or you pin with "slot_id": X in the JSON payload). Model weights (the GGUF) are loaded once and stay cached.
- Fresh context = intentional flush: In every request from a queue, include "cache_prompt": false. This forces the server to discard and recompute the KV cache for that request: a fresh prefill, with no reuse from prior work in the slot (see the minimal request sketch after this list).
- No thrashing on switch: Because model weights never leave cache, and you control slice length, the expensive prefill only happens once per request (or per batch); the rest of the slice is pure, fast decode. If each slice processes enough tokens overall, utilization stays high.
- OS scheduler still applies: Run the single llama-server daemon under systemd/launchd plus cgroups/CPUQuota exactly as in my previous reply. The queues themselves are time-sliced in user space (your dispatcher).
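A minimal sketch of a single such request, assuming httpx and a server on localhost:8080; the "model" value is a placeholder (llama-server largely ignores it) and the slot-pinning field name can differ between server versions:

```python
import httpx

# One-shot request against a local llama-server.
# "cache_prompt": False asks the server not to reuse any cached prefix,
# so the prompt is prefilled from scratch in whichever slot it lands on.
payload = {
    "model": "local",   # placeholder; OpenAI-compatible clients expect a name
    "messages": [{"role": "user", "content": "Summarize the queue backlog."}],
    "max_tokens": 256,
    "cache_prompt": False,   # force a fresh prefill (no KV reuse)
    "slot_id": 0,            # optional pin; field name may vary by server version
}

resp = httpx.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```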
Recommended architecture

```
Message Queues (Redis Streams / RabbitMQ / file-based / etc.)
        ↓  (dispatcher polls one queue at a time)
Queue Dispatcher (Python service — enforces time slices)
        ↓  (HTTP POST to localhost:8080/v1/chat/completions)
Single llama-server (-np = num_queues, one GGUF)
 ├── Slot 0  ← Queue A (fresh KV flush each request)
 ├── Slot 1  ← Queue B
 └── ...
```
1. Start the llama-server daemon (one GGUF, multiple hot slots)

Use the same systemd/launchd setup as before, but add these flags:

```bash
# Linux example: flags to append to ExecStart in the .service unit.
# (Keep the comments below out of the actual ExecStart line.)
--model /path/to/your-model.gguf \
--ctx-size 32768 \
--parallel 8 \
--n-gpu-layers 99 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--slot-save-path /var/cache/llama-slots \
--host 0.0.0.0 --port 8080

# --ctx-size        total context; increase it along with --parallel, each slot needs room
# --parallel        = number of queues + buffer (tune to VRAM)
# --cache-type-k/v  optional: quantize the KV cache to save memory
# --slot-save-path  optional slot persistence
```
VRAM rule of thumb: per-slot KV cache ≈ 2 (K and V) × n_layers × slot_context_tokens × n_kv_heads × head_dim × bytes per element (2 for f16, roughly 1 for q8_0). Test with your model (see the sizing sketch below).
For strict serialization (one queue at a time): set --parallel 1 and let the dispatcher enforce slices.
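A rough sizing sketch of that rule of thumb; the model geometry below (32 layers, 8 KV heads, head dim 128, Llama-3-8B-like) is an illustrative assumption, so read the real values from the GGUF metadata or the llama-server startup log:

```python
# Rough per-slot KV-cache sizing. The geometry constants are assumptions
# for a Llama-3-8B-style model; substitute your model's real values.
N_LAYERS = 32               # transformer layers offloaded
N_KV_HEADS = 8              # KV heads (GQA), not attention heads
HEAD_DIM = 128              # dimension per head
CTX_PER_SLOT = 32768 // 8   # --ctx-size is shared across --parallel slots
BYTES_PER_ELEM = 2          # f16 KV cache; ~1 with --cache-type-k/v q8_0

kv_bytes = 2 * N_LAYERS * CTX_PER_SLOT * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
print(f"KV cache per slot: {kv_bytes / 2**20:.0f} MiB")  # ~512 MiB for these numbers
```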
Reload/restart the daemon as before.

2. The queue dispatcher (enforces time slices + fresh flushes)

Write a simple Python supervisor (run as a separate systemd service or launchd agent). It:

- Polls N queues in round-robin.
- Activates one queue per slice.
- Processes a batch (or runs until the timer expires).
- Sends requests with "cache_prompt": false (forces a fresh KV prefill).
Slice length is tunable — set it long enough that decode time dominates prefill (e.g. 30–120 s or “until 500+ tokens generated across requests”).
Example skeleton (uses redis + asyncio for illustration; adapt to RabbitMQ/Celery/etc.):

```python
import asyncio
import json
import time

import httpx
import redis.asyncio as redis  # or your queue lib

QUEUES = ["queue:agent1", "queue:agent2", "queue:agent3"]  # your message queues
SLICE_SECONDS = 60  # ← tune: long enough to amortize prefill
SERVER_URL = "http://localhost:8080/v1/chat/completions"


async def pop_from_queue(r: redis.Redis, queue_name: str):
    """Pop one JSON-encoded message from a Redis list, or return None if empty."""
    raw = await r.lpop(queue_name)
    return json.loads(raw) if raw else None


async def process_slice(r: redis.Redis, client: httpx.AsyncClient,
                        queue_name: str, duration: float):
    """Serve one queue exclusively until its time slice expires."""
    start = time.time()
    while time.time() - start < duration:
        msg = await pop_from_queue(r, queue_name)
        if not msg:
            await asyncio.sleep(0.1)
            continue
        payload = {
            "model": "local",
            "messages": msg["messages"],
            "max_tokens": msg.get("max_tokens", 512),
            "temperature": msg.get("temp", 0.7),
            "cache_prompt": False,                    # ← forces fresh context / KV flush
            "slot_id": QUEUES.index(queue_name) % 8,  # optional: pin to slot (field name may vary by version)
        }
        await client.post(SERVER_URL, json=payload, timeout=None)  # stream or not
    print(f"Slice for {queue_name} ended after {duration}s")


async def main():
    r = redis.from_url("redis://localhost")
    async with httpx.AsyncClient() as client:
        while True:
            for q in QUEUES:
                await process_slice(r, client, q, SLICE_SECONDS)
                # optional: explicit flush via the server API
                # await client.post(f"http://localhost:8080/slots/{slot_id}?action=erase")


if __name__ == "__main__":
    asyncio.run(main())
```
- Tune SLICE_SECONDS: Measure your model's prefill time once (e.g. 2–10 s for long context) and make the slice 5–10× longer. Monitor with Prometheus, the llama-server logs, or the /slots endpoint.
- Batch inside slice: Instead of one request per loop iteration, drain 5–20 messages per slice for even better amortization (see the sketch after this list).
- Hermes Agent tie-in: Put the above logic (or a simplified version) in your pre-run Python script. It can poll queues, enforce slices, then feed stdout/context to the agent.
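A minimal sketch of the batching variant, reusing the pop_from_queue helper and SERVER_URL from the skeleton above:

```python
# Drain up to BATCH_SIZE messages before issuing requests, so one slice
# amortizes many decodes against its prefills. Assumes pop_from_queue(),
# SERVER_URL, asyncio, and time from the dispatcher skeleton above.
BATCH_SIZE = 10

async def process_slice_batched(r, client, queue_name: str, duration: float):
    start = time.time()
    while time.time() - start < duration:
        batch = []
        while len(batch) < BATCH_SIZE:
            msg = await pop_from_queue(r, queue_name)
            if msg is None:
                break
            batch.append(msg)
        if not batch:
            await asyncio.sleep(0.1)
            continue
        # Fire the whole batch; llama-server spreads it across its slots.
        await asyncio.gather(*(
            client.post(SERVER_URL, json={
                "messages": m["messages"],
                "max_tokens": m.get("max_tokens", 512),
                "cache_prompt": False,   # still a fresh prefill per request
            }, timeout=None)
            for m in batch
        ))
```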
3. Explicit cache control (if you need it)

- Before a queue's slice: POST /slots/{id}?action=erase to drop that slot's KV cache (or use save/restore via --slot-save-path).
- After a slice: optionally save the slot state for persistence across restarts.
- Check slot status: GET /slots (shows per-slot state and KV usage); see the sketch after this list.
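A small sketch of those calls, assuming a llama-server build whose /slots endpoint exposes the erase/save/restore actions (action names, response fields, and whether the endpoint is enabled by default vary by version, so check GET /slots on your build first):

```python
import httpx

BASE = "http://localhost:8080"

# Inspect all slots; field names in the response depend on the server version.
for slot in httpx.get(f"{BASE}/slots").json():
    print(slot.get("id"), slot.get("is_processing"))

# Drop slot 0's KV cache before handing it a fresh queue slice.
httpx.post(f"{BASE}/slots/0?action=erase")

# Optional persistence (requires --slot-save-path on the server).
httpx.post(f"{BASE}/slots/0?action=save", json={"filename": "queue_a.bin"})
httpx.post(f"{BASE}/slots/0?action=restore", json={"filename": "queue_a.bin"})
```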
4. OS scheduler + resource limits (unchanged from before)

- Linux systemd: Keep CPUQuota=XX%, CPUShares=XXX, Nice=10 on llama-server.service. The dispatcher runs at normal priority.
- macOS launchd: Keep the plist with Nice, and add ThrottleInterval if you want coarse scheduling.
- The dispatcher itself can be a systemd timer unit or a cron job that restarts every few minutes if you prefer fire-and-forget slices.
Quick tuning checklist

1. Start the server with the appropriate -np and a large enough -c.
2. Run the dispatcher with your slice duration.
3. Test one queue: time a fresh request vs. full-slice throughput (see the sketch after this list).
4. Monitor: llama-server logs plus nvidia-smi / htop / the /slots endpoint. KV should only recompute on the fresh requests you asked for; the model stays hot.
5. Scale: Add queues by bumping -np and adding entries to the QUEUES list (VRAM permitting). If VRAM is tight, drop to -np 1 and rely purely on dispatcher serialization.
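One way to run the single-queue test from item 3 is a quick timing script like this sketch; it assumes the server from step 1 is up on localhost:8080, and the numbers are only indicative:

```python
import time
import httpx

URL = "http://localhost:8080/v1/chat/completions"
PROMPT = [{"role": "user", "content": "Write a haiku about message queues."}]

def timed_request(n_tokens: int = 128) -> float:
    """Time one fresh-prefill request (cache_prompt disabled)."""
    t0 = time.time()
    httpx.post(URL, json={"messages": PROMPT, "max_tokens": n_tokens,
                          "cache_prompt": False}, timeout=None)
    return time.time() - t0

single = timed_request()
burst = [timed_request() for _ in range(5)]
print(f"one fresh request: {single:.1f}s")
print(f"5-request burst:   {sum(burst):.1f}s total, {sum(burst)/5:.1f}s avg")
```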
This gives you clean isolation per queue, deliberate fresh contexts, and time slices long enough that the KV-flush overhead becomes negligible. No duplicate GGUF loads, no cross-queue thrashing, and full use of the Linux/macOS schedulers for the underlying daemon.

Drop your exact number of queues, model size/context, or queue backend (Redis/Rabbit/etc.) and I'll give you the full ready-to-deploy files (systemd + dispatcher + example curl payloads).