@grahama1970
Created September 27, 2025 14:03
CodeWorld — External Review Questions (context + code anchors)

CodeWorld — External Review Questions, Context, and Code Anchors

Context: CodeWorld is a prompt‑driven, multi‑variant orchestrator for agentic code generation. It emits per‑instance prompts, autostarts a tiny FastAPI ingest backend, runs agents (or a local fallback), and aggregates a reproducible scorecard. Observability flows to ArangoDB with a thin proto dashboard. Memory hooks integrate a Graph Memory service for recall and timeline context.

Inspiration: CWM: An Open‑Weights LLM for Research on Code Generation with World Models (Meta AI, Sept 24, 2025). Local copy: docs/papers/CWM_ An Open-Weights LLM for Research on Code Generation with World Models _ Research - AI at Meta.md. Our aim is to explore world‑model style signals for agentic coding by capturing observation→action episodes during runs and enabling recall‑driven guidance.

Objective: Harden the orchestrator for research‑grade iteration while keeping it thin and deterministic by default. We want principled process lifecycle, secure defaults, graceful degradation without Arango, clear contracts for prompts/outputs, and CI‑friendly smokes. We’re seeking a focused review to de‑risk architectural edges and align with the world‑model framing from CWM.

Where we’re blocked or want deeper input

  • Security and policy: autostarted backend exposure; /runs endpoint spawns processes; DB creation policy in hosted Arango.
  • Reliability without Arango: keep /stream useful; minimal RAM mode; schema choices for episodes/logs.
  • Concurrency + lifecycle: child process cleanup; detach safety; dashboard proc.
  • Canonicalization: duplicate algos module with risk of drift.
  • Prompt optimization rules: strictness and error surfacing.
  • MCP adapter ergonomics and failure modes.

1) Canonicalize algos module (prevent drift)

Question

  • Should we keep src/codeworld/algos/ as the canonical module and remove src/algos/? Any import edge cases in tests/tools to fix?

Why it matters

  • Two identical copies invite divergence and subtle import bugs during refactors.

Code anchors

  • src/codeworld/algos/multiply_variants.py
  • src/algos/multiply_variants.py

Acceptance

  • One canonical module under src/codeworld/algos/; imports updated; smokes stay green.

Example patch shape (pseudo)

diff --git a/src/algos/multiply_variants.py b/src/algos/multiply_variants.py
deleted file mode 100644
--- a/src/algos/multiply_variants.py
+++ /dev/null
@@
-# identical to src/codeworld/algos/multiply_variants.py; remove duplicate

2) Backend lifecycle and zombie/pipe safety

Questions

  • Is our backend autostart/stop logic robust across platforms and detach modes?
  • Could stdout/stderr pipes fill and deadlock? Do we risk zombies on failures?

Why it matters

  • The CLI frequently spawns uvicorn and optionally a dashboard; leaks or pipe deadlocks hurt CI and long‑running workflows.

Code anchors

  • _start_backend src/codeworld/cli.py:660–699
  • Bring‑up and wait src/codeworld/cli.py:920–1010
  • _stop_process src/codeworld/cli.py:284–306

Concrete asks

  • Recommend hardened _stop_process (process group termination, bounded read of pipes, platform nuances).
  • Suggest a small “proc supervisor” helper if warranted.

Snippet

# src/codeworld/cli.py:660–699 (excerpt; port and env are set in lines elided here)
def _start_backend(api_base: str, extra_env: Optional[Dict[str, str]] = None) -> Optional[subprocess.Popen]:
    # Prefer uv if present; fall back to invoking uvicorn module
    uv_cmd = shutil.which("uv")
    if uv_cmd:
        return subprocess.Popen([uv_cmd, "run", "uvicorn", "codeworld.logger:app", "--host", "127.0.0.1", "--port", str(port)],
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, env=env)
    return subprocess.Popen([sys.executable, "-m", "uvicorn", "codeworld.logger:app", "--host", "127.0.0.1", "--port", str(port)],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, env=env)
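
One possible shape for the hardened stop logic asked for in this section, offered as a sketch rather than the project's actual _stop_process. It assumes the child is started in its own process group and writes to a log file instead of pipes, which sidesteps the pipe-fill deadlock entirely. The names start_detached, stop_process, and log_path are hypothetical.

```python
import os
import signal
import subprocess
import sys
from typing import List, Optional


def start_detached(cmd: List[str], log_path: str = os.devnull) -> subprocess.Popen:
    """Start cmd in its own process group, logging to a file instead of pipes."""
    log = open(log_path, "ab")
    kwargs = {}
    if os.name == "posix":
        kwargs["start_new_session"] = True  # child becomes its own group leader
    else:
        kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
    try:
        return subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT, **kwargs)
    finally:
        log.close()  # the child keeps its own copy of the descriptor


def _signal_tree(proc: subprocess.Popen, force: bool) -> None:
    """Signal the child's whole group on POSIX; otherwise only the child."""
    try:
        if os.name == "posix":
            pgid = os.getpgid(proc.pid)
            if pgid != os.getpgid(0):  # never signal our own group
                os.killpg(pgid, signal.SIGKILL if force else signal.SIGTERM)
                return
        if force:
            proc.kill()
        else:
            proc.terminate()
    except ProcessLookupError:
        pass  # the child exited between the liveness check and the signal


def stop_process(proc: Optional[subprocess.Popen], grace: float = 5.0) -> Optional[int]:
    """Terminate politely, escalate after `grace` seconds, and always reap."""
    if proc is None:
        return None
    if proc.poll() is None:
        _signal_tree(proc, force=False)
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            _signal_tree(proc, force=True)
            proc.wait()  # reap so no zombie is left behind
    return proc.returncode
```

If the captured output is needed by the caller, a bounded alternative is proc.communicate(timeout=...), which drains both pipes while waiting. On Windows this sketch only terminates the direct child; killing a whole tree there would need a job object or similar.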

3) Security posture: loopback binding + /runs authorization

Questions

  • Are we guaranteed loopback‑only bindings in all autostart paths?
  • Should /runs require an auth token or capability to spawn work?

Why it matters

  • Spawning runs via HTTP without auth is unsafe on shared hosts; even if we bind to loopback by default, misconfigurations happen.

Code anchors

  • Bind addresses src/codeworld/cli.py:673,680 (127.0.0.1)
  • Spawn endpoint src/codeworld/logger.py:500–526 (/runs → codeworld.cli run)

Concrete asks

  • Propose a minimal token gate for /runs (env CW_RUNS_TOKEN); reject if absent or mismatch.
  • Hard guard that autostart only binds to loopback; document override as opt‑in.

Snippet

# src/codeworld/logger.py:500–526 (excerpt; target, run_id, and env are set in lines elided here)
@app.post('/runs')
async def spawn_run(payload: dict):
    # TODO: optional token check (e.g., X-Run-Token header)
    cmd = [sys.executable, '-m', 'codeworld.cli', 'run', '--spec', str(target), '--run-id', run_id]
    subprocess.Popen(cmd, env=env)
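
A minimal sketch of the two guards requested in this section, kept framework-free so the logic can be unit-tested without starting the backend. CW_RUNS_TOKEN and the X-Run-Token header come from the text above; the helper names runs_token_ok and is_loopback are hypothetical, and failing closed when the env var is unset is one reading of "reject if absent".

```python
import hmac
import ipaddress
import os
from typing import Mapping, Optional


def runs_token_ok(presented: Optional[str], env: Mapping[str, str] = os.environ) -> bool:
    """Return True only if CW_RUNS_TOKEN is set and matches the presented token."""
    expected = env.get("CW_RUNS_TOKEN", "")
    if not expected or not presented:
        return False  # fail closed: no configured token means no spawning
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


def is_loopback(host: str) -> bool:
    """Return True for localhost and loopback IP literals, False otherwise."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # other hostnames are rejected rather than resolved
```

In the /runs handler, the value of the X-Run-Token header would be passed to runs_token_ok and a failure mapped to a 401 or 403 response. The autostart path could call is_loopback on the bind host and refuse to start unless an explicit opt-in override is set.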

4) Arango policy: safe DB creation gating

Question

  • Creating DB/collections from _system may be blocked in hosted Arango. Should this be gated via ALLOW_DB_CREATE=1 and degrade gracefully?

Code anchors

  • _connect_arango() src/codeworld/logger.py:60–88 (creates DB + collections)

Concrete asks

  • Add ALLOW_DB_CREATE (default off). If creation is denied, return 503 with an actionable message; keep /stream and the proto UI working.

Snippet

# src/codeworld/logger.py:60–88 (excerpt)
db = client.db("_system", username=user, password=password)
if not db.has_database(db_name):
    db.create_database(db_name, users=[{"username": user, "password": password, "active": True}])
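
The gate could look roughly like the sketch below. It is written against any object exposing has_database and create_database, as the excerpt above does, so it can be tested with a stub. DatabaseUnavailable, db_create_allowed, and ensure_database are hypothetical names; the caller would translate the exception into the 503 response. It also assumes has_database itself is permitted, which may not hold on every hosted deployment.

```python
import os
from typing import Any, Dict, List, Mapping


class DatabaseUnavailable(RuntimeError):
    """Raised when the target database is missing and creation is not allowed."""


def db_create_allowed(env: Mapping[str, str] = os.environ) -> bool:
    """Creation is opt-in: only the exact value "1" enables it."""
    return env.get("ALLOW_DB_CREATE", "0") == "1"


def ensure_database(sys_db: Any, db_name: str, users: List[Dict[str, Any]],
                    allow_create: bool) -> bool:
    """Return True if the database was created, False if it already existed."""
    if sys_db.has_database(db_name):
        return False
    if not allow_create:
        raise DatabaseUnavailable(
            f"Database {db_name!r} does not exist and ALLOW_DB_CREATE is not 1. "
            "Create it manually or set ALLOW_DB_CREATE=1."
        )
    sys_db.create_database(db_name, users=users)
    return True
```

Keeping the decision in one function means /stream and the proto UI can catch DatabaseUnavailable once at connect time and continue in a degraded mode.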

5) Arango‑down: RAM fallback for /stream, /episodes, /logs

Question

  • Should we introduce an in‑memory ring buffer when Arango is unavailable so the proto UI remains useful? What is a minimal, safe design?

Code anchors

  • 503 fallbacks: src/codeworld/logger.py:136–143, :210–233, :244–267
  • SSE broadcast queue: src/codeworld/logger.py:106–133

Concrete asks

  • Add CW_RAM_FALLBACK=1 to enable ring buffers (size N) for logs/episodes; mark responses as degraded: true.
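
As a starting point, the ring buffer could be as small as a lock-protected collections.deque with a fixed maxlen. The class name RingStore and the response shape (an items list next to degraded: true) are assumptions for illustration, not the existing API.

```python
import threading
from collections import deque
from typing import Any, Deque, Dict, List


class RingStore:
    """Bounded in-memory store; the oldest entries are dropped when full."""

    def __init__(self, maxlen: int = 1000) -> None:
        self._items: Deque[Dict[str, Any]] = deque(maxlen=maxlen)
        self._lock = threading.Lock()

    def append(self, item: Dict[str, Any]) -> None:
        with self._lock:
            self._items.append(item)

    def snapshot(self, limit: int = 100) -> Dict[str, Any]:
        """Return the newest `limit` items, flagged as a degraded response."""
        with self._lock:
            items: List[Dict[str, Any]] = list(self._items)
        newest = items[-limit:] if limit > 0 else []
        return {"degraded": True, "items": newest}
```

One store per collection (logs, episodes) keeps memory bounded at roughly N entries each. Nothing survives a restart, which is why marking every response as degraded matters.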

6) Variant agent concurrency and file mutation

Questions

  • The agent writes/edits a shared variants module; is this safe under concurrent runs? Should we isolate per‑run or lock?

Code anchors

  • Write helpers src/codeworld/variant_agent.py:14–46
  • Template emit :89–118; mutations :146–189

Concrete asks

  • Prefer per‑run variants file under workspace/runs/<run_id>/variants.py and import dynamically; or implement file locks.
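
The per‑run option might be sketched as follows: write the variants source atomically under workspace/runs/<run_id>/variants.py and load it under a run-specific module name so concurrent runs never share a module object. The function names and the cw_variants_ module prefix are hypothetical.

```python
import importlib.util
import os
from pathlib import Path
from types import ModuleType


def write_variants(workspace: Path, run_id: str, source: str) -> Path:
    """Write variants.py for one run via a temp file and an atomic rename."""
    run_dir = workspace / "runs" / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    target = run_dir / "variants.py"
    tmp = run_dir / "variants.py.tmp"
    tmp.write_text(source, encoding="utf-8")
    os.replace(tmp, target)  # readers see the old file or the new one, never half
    return target


def load_variants(path: Path, run_id: str) -> ModuleType:
    """Import the run's variants file without touching sys.modules."""
    spec = importlib.util.spec_from_file_location(f"cw_variants_{run_id}", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load variants module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

Because the module is never registered in sys.modules, two runs can load different variants side by side. The trade-off is that ordinary import statements cannot find it, so callers must pass the module object around.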

7) POP rules strictness and error surfacing

Questions

  • Are the optimization and validation rules strict enough to reject malformed prompts and surface actionable errors?

Code anchors

  • Help examples src/codeworld/tools/prompt_opt.py:8–10
  • Rules file src/codeworld/rules/prompt_optimization.yaml

Concrete asks

  • Add smokes for missing sections; ensure CLI exits non‑zero with clear messages.
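
A smoke for missing sections could exercise a check along these lines. This is only an illustration: the section names in REQUIRED_SECTIONS are invented, since the real list would come from prompt_optimization.yaml, and find_missing_sections and check_prompt are hypothetical helpers rather than functions in prompt_opt.py.

```python
import sys
from typing import List, Optional, Sequence, TextIO

# Hypothetical section headings; the real ones belong to the rules file.
REQUIRED_SECTIONS = ("## Context", "## Task", "## Acceptance")


def find_missing_sections(text: str,
                          required: Sequence[str] = REQUIRED_SECTIONS) -> List[str]:
    """Return required headings that do not appear as a line of their own."""
    lines = {line.strip() for line in text.splitlines()}
    return [section for section in required if section not in lines]


def check_prompt(text: str, required: Sequence[str] = REQUIRED_SECTIONS,
                 stream: Optional[TextIO] = None) -> int:
    """Print one error per missing section and return a process exit code."""
    out = stream if stream is not None else sys.stderr
    missing = find_missing_sections(text, required)
    for section in missing:
        print(f"error: missing required section {section!r}", file=out)
    return 1 if missing else 0
```

A CLI entry point would end with sys.exit(check_prompt(text)), so the smoke can assert both the non‑zero exit status and the message naming the missing section.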

8) MCP adapter ergonomics and failure modes

Questions

  • Are tool names, concurrency caps, and error paths sensible when the Python side is down? What should the host expect?

Code anchors

  • mcp/codeworld-mcp/ (Node adapter)

Concrete asks

  • Provide ND smoke or doc note for CW_MAX_CONCURRENCY; ensure graceful degradation when the backend is unreachable.

9) World‑model alignment (CWM framing)

Questions

  • Given CWM’s emphasis on observation→action trajectories and RL in verifiable coding environments, are our episode and log schemas adequate for downstream world‑model research? What signals are missing (state hashes, interpreter traces, reward shaping)?

Code anchors

  • Episode ingest src/codeworld/logger.py:200–233
  • Score aggregation outputs workspace/runs/<run_id>/scorecard.json (runtime artifact)

Concrete asks

  • Propose a minimal schema extension for episodes to include execution state summaries and reward signals compatible with CWM‑style training data.
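
To make the ask concrete, one candidate shape for an extended episode record is sketched below. Every field name here is a proposal for discussion, not the current ingest schema. The state hash is computed over canonical JSON so that identical states hash identically regardless of key order.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


def hash_state(state: Dict[str, Any]) -> str:
    """SHA-256 over canonical JSON (sorted keys, compact separators)."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class EpisodeStep:
    """One observation→action step with proposed world-model signals."""
    run_id: str
    step: int
    observation: str
    action: str
    state_hash: str                      # digest of the execution state summary
    reward: Optional[float] = None       # None until a verifier scores the step
    trace_summary: Dict[str, Any] = field(default_factory=dict)  # e.g. interpreter trace digest
    terminal: bool = False

    def to_doc(self) -> Dict[str, Any]:
        """Plain dict suitable for JSON ingest."""
        return asdict(self)
```

Keeping reward optional lets steps be ingested live and scored later from the scorecard, at the cost of a second write per step.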

Repro and Validation Clues

Fast dev readiness

make project-ready

Live probe (strict backend)

READINESS_LIVE=1 STRICT_READY=1 make project-ready-live

Smokes (deterministic)

GAMIFIED_FAST_BENCH=1 uv run -q python tests/smoke/run_all.py

Release smokes (no Arango)

uv run python release_smokes/00_quick_check.py
uv run python release_smokes/10_run_from_prompt.py
uv run python release_smokes/20_emit_only_then_aggregate.py
uv run python release_smokes/30_run_from_spec.py

Expected Deliverables from Reviewer

  • Patches or diffs for items 1–5 (canonicalization, lifecycle, security, Arango gate, RAM fallback).
  • Schema proposal for episodes/logs aligned with world‑model research needs.
  • Notes on POP rule gaps and MCP ergonomics.