@lfsmoura
Last active May 14, 2026 19:32
Hindsight + Osaurus local LLM setup

This note shows how to run Hindsight in Docker while using an Osaurus-hosted local OpenAI-compatible model for LLM calls.

Assumptions

  • Osaurus is running on the host machine.
  • Osaurus exposes an OpenAI-compatible API on http://127.0.0.1:1337.
  • Hindsight runs in Docker and exposes its API on http://localhost:8888.
  • The model ID is minimax-m2.7-small-jangtq.

1. Confirm Osaurus is reachable from the host

curl http://127.0.0.1:1337/v1/models

You should see a model list containing something like:

{
  "data": [
    { "id": "minimax-m2.7-small-jangtq" }
  ]
}
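If you script the setup, it helps to assert the model ID is actually in that list before wiring Hindsight up. A minimal Python sketch (the helper name is mine, not part of Osaurus or Hindsight; feed it the body returned by the curl above):

```python
import json

def model_available(models_body: str, model_id: str) -> bool:
    """Return True if model_id appears in an OpenAI-style /v1/models response body."""
    data = json.loads(models_body)
    return any(m.get("id") == model_id for m in data.get("data", []))
```

Combined with `urllib.request.urlopen("http://127.0.0.1:1337/v1/models")`, this gives a scriptable version of the curl check.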

Optional direct smoke test:

curl http://127.0.0.1:1337/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-local' \
  -d '{
    "model": "minimax-m2.7-small-jangtq",
    "messages": [{"role":"user","content":"Return exactly OK."}],
    "max_tokens": 32768,
    "temperature": 0.01
  }'

2. Run Hindsight Docker against Osaurus

Inside Docker, 127.0.0.1 refers to the container itself, not the host. Use host.docker.internal so Hindsight can reach Osaurus on the host. (On Linux, host.docker.internal is not defined by default; add --add-host=host.docker.internal:host-gateway to the docker run command below.)

docker run -d --name hindsight \
  -p 8888:8888 \
  -p 9999:9999 \
  -e HINDSIGHT_ENABLE_API=true \
  -e HINDSIGHT_ENABLE_CP=true \
  -e HINDSIGHT_API_HOST=0.0.0.0 \
  -e HINDSIGHT_API_PORT=8888 \
  -e HINDSIGHT_CP_DATAPLANE_API_URL=http://localhost:8888 \
  -e HINDSIGHT_API_LLM_PROVIDER=openai \
  -e HINDSIGHT_API_LLM_BASE_URL=http://host.docker.internal:1337 \
  -e HINDSIGHT_API_LLM_API_KEY=sk-local \
  -e HINDSIGHT_API_LLM_MODEL=minimax-m2.7-small-jangtq \
  -e HINDSIGHT_API_LLM_TIMEOUT=300 \
  -e HINDSIGHT_API_LLM_MAX_CONCURRENT=1 \
  -e HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT=1 \
  -e HINDSIGHT_API_CONSOLIDATION_LLM_MAX_CONCURRENT=1 \
  -e HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS=32768 \
  -e HINDSIGHT_API_CONSOLIDATION_MAX_TOKENS=32768 \
  -e HINDSIGHT_API_RECALL_MAX_TOKENS=32768 \
  -e HINDSIGHT_API_RECALL_CHUNKS_MAX_TOKENS=32768 \
  -e HINDSIGHT_API_WORKER_MAX_SLOTS=3 \
  -e HINDSIGHT_API_WORKER_CONSOLIDATION_MAX_SLOTS=1 \
  -e HINDSIGHT_API_WORKER_RETAIN_MAX_SLOTS=1 \
  -e HINDSIGHT_API_WORKER_FILE_CONVERT_RETAIN_MAX_SLOTS=1 \
  -v "$HOME/.hindsight-docker:/home/hindsight/.pg0" \
  ghcr.io/vectorize-io/hindsight:latest

Notes:

  • HINDSIGHT_API_LLM_PROVIDER=openai is used because Osaurus exposes an OpenAI-compatible API.
  • HINDSIGHT_API_LLM_BASE_URL=http://host.docker.internal:1337 is the key Docker networking setting.
  • HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS=32768 makes Hindsight send a large output token cap during retain/fact extraction. For a custom OpenAI-compatible base URL, Hindsight maps this to max_tokens.
  • The concurrency settings keep local inference from being flooded with simultaneous requests.
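Under the hood, each LLM call Hindsight makes is an ordinary OpenAI-style chat completion against the base URL above. A rough hand-rolled equivalent of what reaches Osaurus (a sketch, not Hindsight's actual client code; the helper name is mine):

```python
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str,
                       max_tokens: int = 32768) -> urllib.request.Request:
    """Build a POST to {base_url}/v1/chat/completions using the settings above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # HINDSIGHT_API_RETAIN_MAX_COMPLETION_TOKENS is mapped to max_tokens
        # for custom OpenAI-compatible base URLs.
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
```

From inside the container, base_url would be http://host.docker.internal:1337, matching HINDSIGHT_API_LLM_BASE_URL.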

3. Wait for Hindsight to start

Startup can take a bit because Hindsight initializes local embeddings/reranker models and verifies the LLM connection.

docker logs -f hindsight

Look for logs like:

LLM: provider=openai, model=minimax-m2.7-small-jangtq
OpenAI-compatible client initialized: provider=openai, model=minimax-m2.7-small-jangtq, base_url=http://host.docker.internal:1337
Connection verified: openai/minimax-m2.7-small-jangtq
Application startup complete.

Then check health:

curl http://localhost:8888/health
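If the startup is scripted, a small poll loop beats eyeballing the logs. A sketch (the function name is mine; the injectable fetch parameter only exists so the loop can be exercised without a running server):

```python
import time
import urllib.request
import urllib.error

def wait_for_health(url: str, timeout_s: float = 300, poll_s: float = 5,
                    fetch=None) -> bool:
    """Poll url until it returns HTTP 200; give up after timeout_s seconds."""
    fetch = fetch or (lambda u: urllib.request.urlopen(u, timeout=5).status)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if fetch(url) == 200:
                return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; keep polling
        time.sleep(poll_s)
    return False
```

Call wait_for_health("http://localhost:8888/health") right after docker run; the generous default timeout covers the embeddings/reranker initialization.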

4. Configure the Hindsight CLI

The CLI talks to Hindsight, not directly to Osaurus:

export HINDSIGHT_API_URL=http://localhost:8888
hindsight health
hindsight version

Do not set HINDSIGHT_API_URL to the Osaurus port. That variable is for the Hindsight API URL. Osaurus is configured via HINDSIGHT_API_LLM_BASE_URL in the Hindsight container.

5. Create a bank and retain content

hindsight bank create demo \
  --name "Demo" \
  --mission "Test memory bank for local Osaurus-backed Hindsight."

hindsight memory retain demo \
  "Hindsight is connected to a local Osaurus OpenAI-compatible model." \
  --context "setup smoke test"

hindsight bank stats demo
hindsight memory recall demo "What model is Hindsight connected to?" --max-tokens 32768

Troubleshooting

Hindsight logs show connection errors to the LLM

Check that the container can reach Osaurus:

docker exec hindsight python - <<'PY'
import urllib.request
print(urllib.request.urlopen('http://host.docker.internal:1337/v1/models', timeout=5).read().decode()[:500])
PY

If this fails, verify Osaurus is running and listening on port 1337 on the host.

Osaurus has stuck inflight requests

Check:

curl http://127.0.0.1:1337/health

If inflight stays non-empty for a long time, restart Osaurus, then restart or wait for Hindsight.
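That check can be scripted too. A sketch that flags stuck inflight requests, assuming the /health body reports an "inflight" field as either a count or a list (the field shape is inferred from the behavior described above, not from Osaurus documentation):

```python
import json

def has_inflight(health_body: str) -> bool:
    """True if the Osaurus /health payload reports any inflight requests.

    Assumes an "inflight" field holding a count or a list of requests;
    adjust to whatever shape your Osaurus version actually returns.
    """
    inflight = json.loads(health_body).get("inflight")
    if isinstance(inflight, int):
        return inflight > 0
    return bool(inflight)
```

Poll this every few seconds; if it stays True for minutes with no progress in the Osaurus logs, restart Osaurus.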

Hindsight is very slow

Local models can be slow for fact extraction and consolidation. Keep concurrency low:

HINDSIGHT_API_LLM_MAX_CONCURRENT=1
HINDSIGHT_API_RETAIN_LLM_MAX_CONCURRENT=1
HINDSIGHT_API_CONSOLIDATION_LLM_MAX_CONCURRENT=1

Retain creates documents but no facts

Look at Docker logs for APIConnectionError or extraction failures:

docker logs --tail 200 hindsight

If LLM calls are failing, Hindsight may store documents but extract zero facts.

Reflect / Web UI tool-calling issue with Osaurus

If Hindsight Reflect works poorly in the web UI (for example, it answers that no data was retrieved even though hindsight memory recall finds relevant facts), check the Docker logs:

docker logs --tail 300 hindsight 2>&1 | grep -iE 'reflect|tool|APIStatus|Invalid request format|ERROR|WARNING'

A known failure mode with Hindsight 0.6.1 + Osaurus is:

APIStatusError in tool call ... HTTP 400: {"message": "Invalid request format", "type": "invalid_request_error"}

Cause

Hindsight Reflect uses OpenAI tool calling. During forced retrieval steps, Hindsight may rewrite a named tool choice like:

{"type":"function","function":{"name":"search_observations"}}

into:

"required"

Osaurus accepts the named OpenAI tool-choice object, but rejects tool_choice: "required" with HTTP 400. The result is that Reflect cannot call its retrieval tools, then falls back to a final answer with no retrieved context.
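The two shapes can be contrasted directly. A minimal sketch (the tool name search_observations and the acceptance behavior come from the logs above; the helper function is mine):

```python
# Named tool choice: the OpenAI-style object Osaurus accepts.
NAMED_CHOICE = {"type": "function", "function": {"name": "search_observations"}}

# The string form Hindsight 0.6.1 rewrites it into; Osaurus answers HTTP 400.
REQUIRED_CHOICE = "required"

def chat_body(tool_choice):
    """Chat-completions body differing only in the tool_choice field."""
    return {
        "model": "minimax-m2.7-small-jangtq",
        "messages": [{"role": "user", "content": "look up facts"}],
        "tool_choice": tool_choice,
    }
```

Both bodies are valid per the OpenAI chat-completions schema; only the server's support for the string form differs, which is why the patch below keeps the named object for Osaurus.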

Patch

Patch openai_compatible_llm.py inside the Hindsight container so named tool choice is preserved for custom OpenAI-compatible endpoints such as Osaurus.

Find the block in:

/app/api/hindsight_api/engine/providers/openai_compatible_llm.py

that normalizes every named function tool_choice to "required", and restrict that rewrite to providers that actually need it, such as Ollama / LM Studio.

The patched logic should be equivalent to:

# Preserve OpenAI named tool_choice for Osaurus/custom OpenAI-compatible endpoints.
# Osaurus rejects tool_choice="required" with HTTP 400 Invalid request format,
# but accepts {"type":"function","function":{"name":"..."}}.
if (
    isinstance(request_tool_choice, dict)
    and request_tool_choice.get("type") == "function"
    and self.provider in ("ollama", "lmstudio")
):
    forced_name = request_tool_choice.get("function", {}).get("name")
    if forced_name:
        filtered = [t for t in tools if t.get("function", {}).get("name") == forced_name]
        if filtered:
            tools = filtered
        request_tool_choice = "required"

Then restart Hindsight:

docker restart hindsight

If you want the patch to survive container recreation, commit the patched container to a local image:

docker commit hindsight hindsight-osaurus:tool-choice-patched

Then use that image instead of ghcr.io/vectorize-io/hindsight:latest in your docker run command:

hindsight-osaurus:tool-choice-patched

Keep Reflect contexts smaller

Large recall/chunk budgets can overload local models during Reflect. For a local Osaurus setup, consider lowering the bank-level Reflect retrieval budgets:

curl -sS -X PATCH http://localhost:8888/v1/default/banks/wiki/config \
  -H 'Content-Type: application/json' \
  -d '{"updates":{"recall_max_tokens":4096,"recall_chunks_max_tokens":1000,"reflect_source_facts_max_tokens":4096}}'

Verify:

hindsight bank config wiki -o json

Worker contention

If Reflect is competing with background ingestion, disable the worker for interactive use:

-e HINDSIGHT_API_WORKER_ENABLED=false

This keeps the API and web UI available while preventing background retain/consolidation jobs from consuming the local model.
