Hindsight + Osaurus local LLM setup
This note shows how to run Hindsight in Docker while using an Osaurus-hosted local OpenAI-compatible model for LLM calls.
- Osaurus is running on the host machine.
- Osaurus exposes an OpenAI-compatible API on http://127.0.0.1:1337.
- Hindsight runs in Docker and exposes its API on http://localhost:8888.
- The model ID is minimax-m2.7-small-jangtq.
First, check that Osaurus is serving the model on the host:

curl http://127.0.0.1:1337/v1/models

You should see a model list containing something like:
{
"data": [
{ "id": "minimax-m2.7-small-jangtq" }
]
}

Optional direct smoke test:
curl http://127.0.0.1:1337/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-local' \
-d '{
"model": "minimax-m2.7-small-jangtq",
"messages": [{"role":"user","content":"Return exactly OK."}],
"max_tokens": 32768,
"temperature": 0.01
}'

Inside Docker, 127.0.0.1 means the container itself, not the host. Use host.docker.internal so Hindsight can reach Osaurus on the host.
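If you prefer to script these checks, the same smoke test is sketched below in Python using only the standard library and the values from this note. On the host, target 127.0.0.1; from inside a container, swap in host.docker.internal.

import json
import urllib.request

# Same smoke test as the curl command above. Swap the host for
# host.docker.internal when running from inside a container.
BASE_URL = "http://127.0.0.1:1337"

body = {
    "model": "minimax-m2.7-small-jangtq",
    "messages": [{"role": "user", "content": "Return exactly OK."}],
    "max_tokens": 32768,
    "temperature": 0.01,
}
req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer sk-local"},
)
with urllib.request.urlopen(req, timeout=120) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])

Once this returns a completion, start Hindsight in Docker: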
docker run -d --name hindsight \
-p 8888:8888 \
-p 9999:9999 \
-e HINDSIGHT_ENABLE_API=true \
-e HINDSIGHT_ENABLE_CP=true \
-e HINDSIGHT_API_HOST=0.0.0.0 \
-e HINDSIGHT_API_PORT=8888 \
-e HINDSIGHT_CP_DATAPLANE_API_URL=http://localhost:8888 \
-e HINDSIGHT_API_LLM_PROVIDER=openai \
-e HINDSIGHT_API_LLM_BASE_URL=http://host.docker.internal:1337 \
-e HINDSIGHT_API_LLM_API_KEY=sk-local \
-e HINDSIGHT_API_LLM_MODEL=minimax-m2.7-small-jangtq \
-e HINDSIGHT_API_LLM_TIMEOUT=300 \
-e HINDSIGHT_API_LLM_MAX_CONCURRENT=1 \
-e HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT=1 \
-e HINDSIGHT_API_CONSOLIDATION_LLM_MAX_CONCURRENT=1 \
-e HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS=32768 \
-e HINDSIGHT_API_CONSOLIDATION_MAX_TOKENS=32768 \
-e HINDSIGHT_API_RECALL_MAX_TOKENS=32768 \
-e HINDSIGHT_API_RECALL_CHUNKS_MAX_TOKENS=32768 \
-e HINDSIGHT_API_WORKER_MAX_SLOTS=3 \
-e HINDSIGHT_API_WORKER_CONSOLIDATION_MAX_SLOTS=1 \
-e HINDSIGHT_API_WORKER_RETAIN_MAX_SLOTS=1 \
-e HINDSIGHT_API_WORKER_FILE_CONVERT_RETAIN_MAX_SLOTS=1 \
-v "$HOME/.hindsight-docker:/home/hindsight/.pg0" \
ghcr.io/vectorize-io/hindsight:latest

Notes:
- HINDSIGHT_API_LLM_PROVIDER=openai is used because Osaurus exposes an OpenAI-compatible API.
- HINDSIGHT_API_LLM_BASE_URL=http://host.docker.internal:1337 is the key Docker networking setting.
- HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS=32768 makes Hindsight send a large output token cap during retain/fact extraction. For a custom OpenAI-compatible base URL, Hindsight maps this to max_tokens.
- The concurrency settings keep local inference from being flooded with simultaneous requests.
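To confirm these settings actually reached the container, you can read them back from inside it, using the same docker exec pattern as in the troubleshooting section below:

docker exec hindsight python - <<'PY'
import os

# Print every HINDSIGHT_* setting the container actually sees.
for key in sorted(os.environ):
    if key.startswith("HINDSIGHT_"):
        print(f"{key}={os.environ[key]}")
PY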
Startup can take a bit because Hindsight initializes local embeddings/reranker models and verifies the LLM connection.
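If you are scripting the setup rather than watching logs, you can poll the health endpoint (shown again below) until the API answers. A minimal sketch, assuming an HTTP 200 response means startup is complete:

import time
import urllib.error
import urllib.request

# Poll Hindsight's health endpoint until it responds; startup can take a while
# while embedding/reranker models are initialized and the LLM connection is verified.
for attempt in range(60):
    try:
        with urllib.request.urlopen("http://localhost:8888/health", timeout=5) as resp:
            print(f"healthy after {attempt + 1} attempts (HTTP {resp.status})")
            break
    except (urllib.error.URLError, OSError):
        time.sleep(5)
else:
    raise SystemExit("Hindsight did not become healthy in time")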
To follow startup interactively, watch the logs:

docker logs -f hindsight

Look for lines like:
LLM: provider=openai, model=minimax-m2.7-small-jangtq
OpenAI-compatible client initialized: provider=openai, model=minimax-m2.7-small-jangtq, base_url=http://host.docker.internal:1337
Connection verified: openai/minimax-m2.7-small-jangtq
Application startup complete.
Then check health:
curl http://localhost:8888/health

The CLI talks to Hindsight, not directly to Osaurus:
export HINDSIGHT_API_URL=http://localhost:8888
hindsight health
hindsight version

Do not set HINDSIGHT_API_URL to the Osaurus port. That variable is the Hindsight API URL used by the CLI; Osaurus is configured via HINDSIGHT_API_LLM_BASE_URL in the Hindsight container.
hindsight bank create demo \
--name "Demo" \
--mission "Test memory bank for local Osaurus-backed Hindsight."
hindsight memory retain demo \
"Hindsight is connected to a local Osaurus OpenAI-compatible model." \
--context "setup smoke test"
hindsight bank stats demo
hindsight memory recall demo "What model is Hindsight connected to?" --max-tokens 32768

If these commands fail or hang, first check that the container can reach Osaurus:
docker exec hindsight python - <<'PY'
import urllib.request
print(urllib.request.urlopen('http://host.docker.internal:1337/v1/models', timeout=5).read().decode()[:500])
PY

If this fails, verify Osaurus is running and listening on port 1337 on the host.
Check Osaurus health on the host:

curl http://127.0.0.1:1337/health

If inflight stays non-empty for a long time, restart Osaurus, then restart Hindsight or wait for it to recover.
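To watch this over time instead of running a single curl, a small polling loop; the exact shape of the health payload depends on your Osaurus version, so this just prints whatever it returns:

import time
import urllib.request

# Print the raw Osaurus health payload every few seconds so you can watch
# whether in-flight requests drain. Stop with Ctrl-C.
while True:
    with urllib.request.urlopen("http://127.0.0.1:1337/health", timeout=5) as resp:
        print(resp.read().decode().strip())
    time.sleep(5)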
Local models can be slow for fact extraction and consolidation. Keep concurrency low:
HINDSIGHT_API_LLM_MAX_CONCURRENT=1
HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT=1
HINDSIGHT_API_CONSOLIDATION_LLM_MAX_CONCURRENT=1

Look at Docker logs for APIConnectionError or extraction failures:
docker logs --tail 200 hindsight

If LLM calls are failing, Hindsight may store documents but extract zero facts.
If Hindsight Reflect works poorly in the web UI (for example, it answers that no data was retrieved even though hindsight memory recall finds relevant facts), check the Docker logs:
docker logs --tail 300 hindsight 2>&1 | grep -iE 'reflect|tool|APIStatus|Invalid request format|ERROR|WARNING'

A known failure mode with Hindsight 0.6.1 + Osaurus is:
APIStatusError in tool call ... HTTP 400: {"message": "Invalid request format", "type": "invalid_request_error"}
Hindsight Reflect uses OpenAI tool calling. During forced retrieval steps, Hindsight may rewrite a named tool choice like:
{"type":"function","function":{"name":"search_observations"}}into:
"required"Osaurus accepts the named OpenAI tool-choice object, but rejects tool_choice: "required" with HTTP 400. The result is that Reflect cannot call its retrieval tools, then falls back to a final answer with no retrieved context.
Patch openai_compatible_llm.py inside the Hindsight container so named tool choice is preserved for custom OpenAI-compatible endpoints such as Osaurus.
Find the block in:
/app/api/hindsight_api/engine/providers/openai_compatible_llm.py

that normalizes every named function tool_choice to "required", and restrict that rewrite to providers that actually need it, such as Ollama / LM Studio.
The patched logic should be equivalent to:
# Preserve OpenAI named tool_choice for Osaurus/custom OpenAI-compatible endpoints.
# Osaurus rejects tool_choice="required" with HTTP 400 Invalid request format,
# but accepts {"type":"function","function":{"name":"..."}}.
if (
    isinstance(request_tool_choice, dict)
    and request_tool_choice.get("type") == "function"
    and self.provider in ("ollama", "lmstudio")
):
    forced_name = request_tool_choice.get("function", {}).get("name")
    if forced_name:
        filtered = [t for t in tools if t.get("function", {}).get("name") == forced_name]
        if filtered:
            tools = filtered
            request_tool_choice = "required"

Then restart Hindsight:
docker restart hindsight

If you want the patch to survive container recreation, commit the patched container to a local image:
docker commit hindsight hindsight-osaurus:tool-choice-patched

Then use that image instead of ghcr.io/vectorize-io/hindsight:latest in your docker run command:
hindsight-osaurus:tool-choice-patched

Large recall/chunk budgets can overload local models during Reflect. For a local Osaurus setup, consider lowering the bank-level Reflect retrieval budgets:
curl -sS -X PATCH http://localhost:8888/v1/default/banks/wiki/config \
-H 'Content-Type: application/json' \
-d '{"updates":{"recall_max_tokens":4096,"recall_chunks_max_tokens":1000,"reflect_source_facts_max_tokens":4096}}'Verify:
hindsight bank config wiki -o json

If Reflect is competing with background ingestion, disable the worker for interactive use:
-e HINDSIGHT_API_WORKER_ENABLED=false

This keeps the API and web UI available while preventing background retain/consolidation jobs from consuming the local model.