Context: CodeWorld is a prompt‑driven, multi‑variant orchestrator for agentic code generation. It emits per‑instance prompts, autostarts a tiny FastAPI ingest backend, runs agents (or a local fallback), and aggregates a reproducible scorecard. Observability flows to ArangoDB with a thin proto dashboard. Memory hooks integrate a Graph Memory service for recall and timeline context.
Inspiration: CWM: An Open‑Weights LLM for Research on Code Generation with World Models (Meta AI, Sept 24, 2025). Local copy: docs/papers/CWM_ An Open-Weights LLM for Research on Code Generation with World Models _ Research - AI at Meta.md. Our aim is to explore world‑model style signals for agentic coding by capturing observation→action episodes during runs and enabling recall‑driven guidance.
Objective: Harden the orchestrator for research‑grade iteration while keeping it thin and deterministic by default. We want principled process lifecycle, secure defaults, graceful degradation without Arango, clear contracts for prompts/outputs, and CI‑friendly smokes. We’re seeking a focused review to de‑risk architectural edges and align with the world‑model framing from CWM.
Where we’re blocked or want deeper input
- Security and policy: autostarted backend exposure; /runs endpoint spawns processes; DB creation policy in hosted Arango.
- Reliability without Arango: keep /stream useful; minimal RAM mode; schema choices for episodes/logs.
- Concurrency + lifecycle: child process cleanup; detach safety; dashboard proc.
- Canonicalization: duplicate algos module with risk of drift.
- Prompt optimization rules: strictness and error surfacing.
- MCP adapter ergonomics and failure modes.
Question
- Should we keep src/codeworld/algos/ as the canonical module and remove src/algos/? Are there import edge cases in tests/tools to fix?
Why it matters
- Two identical copies invite divergence and subtle import bugs during refactors.
Code anchors
- src/codeworld/algos/multiply_variants.py
- src/algos/multiply_variants.py
Acceptance
- One canonical module under src/codeworld/algos/; imports updated; smokes stay green.
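One way to look for import edge cases before deleting the duplicate is to scan for absolute imports of the legacy top-level package. The sketch below is only an illustration: the helper names `find_legacy_imports` and `scan_tree` are hypothetical, and it assumes the legacy copy is imported as `algos` (which depends on how `src/` is placed on the import path). Relative imports and dynamic imports via `importlib` are not detected.

```python
import ast
from pathlib import Path
from typing import Iterator, List, Tuple

LEGACY_PACKAGE = "algos"  # assumption: src/algos/ is importable under this name


def find_legacy_imports(source: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, module_name) for absolute imports of the legacy package."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            if name == LEGACY_PACKAGE or name.startswith(LEGACY_PACKAGE + "."):
                yield node.lineno, name


def scan_tree(root: Path) -> List[Tuple[str, int, str]]:
    """Collect legacy imports from every parseable .py file under root."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        try:
            source = path.read_text(encoding="utf-8")
            found = sorted(find_legacy_imports(source))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse; report them separately if needed
        hits.extend((str(path), lineno, name) for lineno, name in found)
    return hits
```

Running `scan_tree(Path("tests"))` and `scan_tree(Path("src"))` would list the files to update; an empty result is a reasonable precondition for the deletion.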
Example patch shape (pseudo)
    diff --git a/src/algos/multiply_variants.py b/src/algos/multiply_variants.py
    deleted file mode 100644
    --- a/src/algos/multiply_variants.py
    +++ /dev/null
    @@
    -# identical to src/codeworld/algos/multiply_variants.py; remove duplicate
Questions
- Is our backend autostart/stop logic robust across platforms and detach modes?
- Could stdout/stderr pipes fill and deadlock? Do we risk zombies on failures?
Why it matters
- The CLI frequently spawns uvicorn and optionally a dashboard; leaks or pipe deadlocks hurt CI and long‑running workflows.
Code anchors
- _start_backend: src/codeworld/cli.py:660–699
- Bring‑up and wait: src/codeworld/cli.py:920–1010
- _stop_process: src/codeworld/cli.py:284–306
Concrete asks
- Recommend hardened _stop_process (process group termination, bounded read of pipes, platform nuances).
- Suggest a small “proc supervisor” helper if warranted.
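As a starting point for the hardening ask, the sketch below shows one common pattern: start the child in its own process group, send its output to a file rather than to pipes nobody drains, and stop it with a bounded terminate-then-kill sequence that always reaps the child. The names `start_child` and `stop_child` are hypothetical and are not the existing `_start_backend` / `_stop_process`. On Windows the sketch only terminates the direct child, so grandchildren would need separate handling (for example job objects).

```python
import os
import signal
import subprocess
import sys
import tempfile
from typing import IO, List, Optional


def start_child(cmd: List[str], log_file: IO[bytes]) -> subprocess.Popen:
    """Start cmd in its own process group, logging to a file instead of pipes."""
    kwargs = {}
    if os.name == "nt":
        kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
    else:
        # New session: the child's pid is also its process group id.
        kwargs["start_new_session"] = True
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=log_file,           # a file cannot fill up and block the child like a pipe
        stderr=subprocess.STDOUT,
        **kwargs,
    )


def _signal_group(proc: subprocess.Popen, sig: int) -> None:
    """Signal the child's whole process group (POSIX only)."""
    try:
        os.killpg(proc.pid, sig)
    except (ProcessLookupError, PermissionError):
        pass  # group already gone (some platforms report EPERM for zombies)


def stop_child(proc: subprocess.Popen, grace_s: float = 5.0) -> Optional[int]:
    """Terminate politely, escalate to kill after grace_s, and reap the child."""
    if proc.poll() is not None:
        return proc.returncode
    if os.name == "nt":
        proc.terminate()
    else:
        _signal_group(proc, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        if os.name == "nt":
            proc.kill()
        else:
            _signal_group(proc, signal.SIGKILL)
        return proc.wait(timeout=grace_s)  # wait() reaps, so no zombie is left


if __name__ == "__main__":
    with tempfile.TemporaryFile() as log:
        child = start_child([sys.executable, "-c", "import time; time.sleep(60)"], log)
        print("exit code:", stop_child(child))
```

If the CLI needs the backend's output for diagnostics, reading the tail of the log file after exit gives a bounded read without the deadlock risk of `stdout=PIPE`.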
Snippet
    # src/codeworld/cli.py:660–699 (excerpt; `port` and `env` are defined in code elided here)
    def _start_backend(api_base: str, extra_env: Optional[Dict[str, str]] = None) -> Optional[subprocess.Popen]:
        # Prefer uv if present; fall back to invoking uvicorn module
        uv_cmd = shutil.which("uv")
        if uv_cmd:
            return subprocess.Popen([uv_cmd, "run", "uvicorn", "codeworld.logger:app", "--host", "127.0.0.1", "--port", str(port)],
                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, env=env)
        return subprocess.Popen([sys.executable, "-m", "uvicorn", "codeworld.logger:app", "--host", "127.0.0.1", "--port", str(port)],
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, env=env)
Questions
- Are we guaranteed loopback‑only bindings in all autostart paths?
- Should /runs require an auth token or capability to spawn work?
Why it matters
- Spawning runs via HTTP without auth is unsafe on shared hosts; even if we bind to loopback by default, misconfigurations happen.
Code anchors
- Bind addresses: src/codeworld/cli.py:673,680 (127.0.0.1)
- Spawn endpoint: src/codeworld/logger.py:500–526 (/runs → codeworld.cli run)
Concrete asks
- Propose a minimal token gate for /runs (env CW_RUNS_TOKEN); reject if absent or mismatch.
- Hard guard that autostart only binds to loopback; document override as opt‑in.
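Both asks can be prototyped as small pure functions that the endpoint and the autostart path would call. This is a minimal sketch under stated assumptions: the function names and the `allow_non_loopback` flag are hypothetical, and how the token reaches the server (for example the `X-Run-Token` header mentioned in the TODO) is left to the HTTP layer. The gate fails closed: with no `CW_RUNS_TOKEN` configured, every request is rejected.

```python
import hmac
import ipaddress
from typing import Mapping, Optional

RUNS_TOKEN_ENV = "CW_RUNS_TOKEN"


def run_token_ok(presented: Optional[str], env: Mapping[str, str]) -> bool:
    """True only if a token is configured and the presented value matches it."""
    expected = env.get(RUNS_TOKEN_ENV, "")
    if not expected or not presented:
        return False  # absent on either side: reject
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


def is_loopback_host(host: str) -> bool:
    """Accept only literal loopback IPs; names such as 'localhost' are not trusted."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False


def resolve_bind_host(requested: str, allow_non_loopback: bool = False) -> str:
    """Return the host to bind, refusing non-loopback unless explicitly opted in."""
    if is_loopback_host(requested) or allow_non_loopback:
        return requested
    raise ValueError(
        f"refusing to bind autostarted backend to {requested!r}; "
        "use a loopback address or opt in explicitly"
    )
```

In the endpoint, a failed `run_token_ok` check would translate into a 401 or 403 before any process is spawned.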
Snippet
    # src/codeworld/logger.py:500–526 (excerpt; `target`, `run_id`, and `env` are defined in code elided here)
    @app.post('/runs')
    async def spawn_run(payload: dict):
        # TODO: optional token check (e.g., X-Run-Token header)
        cmd = [sys.executable, '-m', 'codeworld.cli', 'run', '--spec', str(target), '--run-id', run_id]
        subprocess.Popen(cmd, env=env)
Question
- Creating DB/collections from _system may be blocked in hosted Arango. Should this be gated via ALLOW_DB_CREATE=1 and degrade gracefully?
Code anchors
- _connect_arango(): src/codeworld/logger.py:60–88 (creates DB + collections)
Concrete asks
- Add ALLOW_DB_CREATE (default off). If denied, return 503 with actionable message; keep /stream and proto UI working.
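The gate could take roughly the following shape. This is a sketch, not the existing `_connect_arango()`: `ensure_database` and `ArangoUnavailable` are hypothetical names, `sys_db` is assumed to be a handle on `_system` exposing the `has_database` / `create_database` calls used in the current code, and the `users=` argument from the original is omitted for brevity. The stub class exists only to make the example runnable without a server.

```python
from typing import Any, Mapping


class ArangoUnavailable(RuntimeError):
    """Hypothetical error; the HTTP layer would map it to a 503 with this message."""


def ensure_database(sys_db: Any, db_name: str, env: Mapping[str, str]) -> bool:
    """Return True if the database was created, False if it already existed."""
    try:
        if sys_db.has_database(db_name):
            return False
    except Exception as exc:  # hosted deployments may deny listing from _system
        raise ArangoUnavailable(
            f"cannot inspect databases via _system ({exc}); "
            f"pre-create '{db_name}' or grant access"
        ) from exc
    if env.get("ALLOW_DB_CREATE") != "1":
        raise ArangoUnavailable(
            f"database '{db_name}' does not exist and ALLOW_DB_CREATE is not set; "
            "create it manually or set ALLOW_DB_CREATE=1"
        )
    try:
        sys_db.create_database(db_name)
    except Exception as exc:
        raise ArangoUnavailable(f"creating '{db_name}' failed: {exc}") from exc
    return True


class StubSystemDb:
    """In-memory stand-in for the _system handle, for demonstration only."""

    def __init__(self, existing=()):
        self.existing = set(existing)

    def has_database(self, name: str) -> bool:
        return name in self.existing

    def create_database(self, name: str) -> bool:
        self.existing.add(name)
        return True
```

Raising a dedicated exception keeps the policy decision in one place, so `/stream` and the proto UI can keep serving while the ingest routes report the 503.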
Snippet
    # src/codeworld/logger.py:60–88 (excerpt)
    db = client.db("_system", username=user, password=password)
    if not db.has_database(db_name):
        db.create_database(db_name, users=[{"username": user, "password": password, "active": True}])
Question
- Introduce an in‑memory ring buffer when Arango is unavailable so proto UI remains useful; what’s a minimal, safe design?
Code anchors
- 503 fallbacks: src/codeworld/logger.py:136–143, :210–233, :244–267
- SSE broadcast queue: src/codeworld/logger.py:106–133
Concrete asks
- Add CW_RAM_FALLBACK=1 to enable ring buffers (size N) for logs/episodes; mark responses as degraded: true.
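A bounded, thread-safe ring buffer is probably the smallest design that satisfies this ask. The sketch below assumes one buffer per stream (logs, episodes); the class name, the `dropped` counter, and the response shape beyond `degraded: true` are illustrative choices rather than an existing contract.

```python
import threading
from collections import deque
from typing import Any, Deque, Dict, List, Optional


class RingBuffer:
    """Keep only the most recent maxlen items; older items are discarded."""

    def __init__(self, maxlen: int) -> None:
        if maxlen <= 0:
            raise ValueError("maxlen must be positive")
        self._items: Deque[Any] = deque(maxlen=maxlen)
        self._lock = threading.Lock()
        self._dropped = 0

    def append(self, item: Any) -> None:
        with self._lock:
            if len(self._items) == self._items.maxlen:
                self._dropped += 1  # the oldest item is about to be evicted
            self._items.append(item)

    def snapshot(self, limit: Optional[int] = None) -> List[Any]:
        """Return a copy of the newest items, oldest first."""
        with self._lock:
            items = list(self._items)
        if limit is not None:
            items = items[max(len(items) - limit, 0):]
        return items

    @property
    def dropped(self) -> int:
        with self._lock:
            return self._dropped


def degraded_response(buf: RingBuffer, limit: int = 100) -> Dict[str, Any]:
    """Shape a read response served from RAM instead of Arango."""
    return {"degraded": True, "items": buf.snapshot(limit), "dropped": buf.dropped}
```

Because memory use is capped by `maxlen`, the fallback stays safe for long runs; the `dropped` count tells the UI that history is incomplete.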
Questions
- The agent writes/edits a shared variants module; is this safe under concurrent runs? Should we isolate per‑run or lock?
Code anchors
- Write helpers: src/codeworld/variant_agent.py:14–46
- Template emit: :89–118; mutations: :146–189
Concrete asks
- Prefer per‑run variants file under workspace/runs/<run_id>/variants.py and import dynamically; or implement file locks.
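The per-run option can be sketched with `importlib.util`, which loads a module from an explicit path without touching the shared package. The helper names and the run-id pattern are assumptions, and `multiply` in the usage below is a placeholder function name, not necessarily what the variants module exports.

```python
import importlib.util
import os
import re
from pathlib import Path
from types import ModuleType

_RUN_ID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]*")  # assumed run-id alphabet


def run_variants_path(workspace: Path, run_id: str) -> Path:
    """Return workspace/runs/<run_id>/variants.py, rejecting path-like run ids."""
    if not _RUN_ID_RE.fullmatch(run_id):
        raise ValueError(f"unsafe run id: {run_id!r}")
    return workspace / "runs" / run_id / "variants.py"


def write_variants(workspace: Path, run_id: str, source: str) -> Path:
    """Write the module atomically so a reader never sees a half-written file."""
    path = run_variants_path(workspace, run_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(source, encoding="utf-8")
    os.replace(tmp, path)
    return path


def load_variants(path: Path, run_id: str) -> ModuleType:
    """Import the per-run module under a unique name, outside sys.modules."""
    spec = importlib.util.spec_from_file_location(f"cw_variants_{run_id}", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load variants from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

For example, `load_variants(write_variants(ws, "run-a", "def multiply(x, y):\n    return x * y\n"), "run-a").multiply(3, 4)` evaluates the run's own copy. Since each run owns its file and its module object, concurrent runs no longer contend on one shared source file, which removes the need for locks in the common case.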
Questions
- Are the optimization and validation rules strict enough to reject malformed prompts and surface actionable errors?
Code anchors
- Help examples: src/codeworld/tools/prompt_opt.py:8–10
- Rules file: src/codeworld/rules/prompt_optimization.yaml
Concrete asks
- Add smokes for missing sections; ensure CLI exits non‑zero with clear messages.
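The check behind such a smoke could look like the sketch below. Everything specific here is an assumption: the required section names are placeholders (the real list would come from `prompt_optimization.yaml`), the heading formats recognised are a guess, and exit code 2 is just one conventional choice for a validation failure.

```python
import sys
from typing import List, Sequence

# Hypothetical section names; the real list would come from the rules file.
REQUIRED_SECTIONS = ("Context", "Objective", "Acceptance")


def _heading(line: str) -> str:
    """Normalize 'Context: ...' or '## Context' to 'context'."""
    return line.strip().lstrip("#").strip().split(":", 1)[0].strip().lower()


def missing_sections(prompt_text: str, required: Sequence[str] = REQUIRED_SECTIONS) -> List[str]:
    """List required sections that have no matching heading line."""
    headings = {_heading(line) for line in prompt_text.splitlines()}
    return [name for name in required if name.lower() not in headings]


def validate(prompt_text: str, source: str = "<prompt>") -> int:
    """Return a process exit code: 0 when valid, 2 when sections are missing."""
    missing = missing_sections(prompt_text)
    for name in missing:
        print(f"{source}: missing required section '{name}'", file=sys.stderr)
    return 2 if missing else 0
```

A smoke would feed a prompt with one section removed and assert both the non-zero code and that the message names the missing section.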
Questions
- Are tool names, concurrency caps, and error paths sensible when Python side is down? What should the host expect?
Code anchors
- mcp/codeworld-mcp/ (Node adapter)
Concrete asks
- Provide ND smoke or doc note for CW_MAX_CONCURRENCY; ensure graceful degradation when backend unreachable.
Questions
- Given CWM’s emphasis on observation→action trajectories and RL in verifiable coding environments, are our episode and log schemas adequate for downstream world‑model research? What signals are missing (state hashes, interpreter traces, reward shaping)?
Code anchors
- Episode ingest: src/codeworld/logger.py:200–233
- Score aggregation outputs: workspace/runs/<run_id>/scorecard.json (runtime artifact)
Concrete asks
- Propose a minimal schema extension for episodes to include execution state summaries and reward signals compatible with CWM‑style training data.
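To make the discussion concrete, here is one possible shape for an extended episode record. It is a strawman for review, not the current schema and not a format defined by CWM: every field name is an assumption, chosen to cover the signals listed in the question (state hashes, a trace summary, reward components).

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Mapping, Optional


def state_hash(files: Mapping[str, str]) -> str:
    """Order-independent SHA-256 over (path, content) pairs of a workspace snapshot."""
    digest = hashlib.sha256()
    for path in sorted(files):
        for part in (path, files[path]):
            data = part.encode("utf-8")
            digest.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguity
            digest.update(data)
    return digest.hexdigest()


@dataclass
class EpisodeStep:
    """One observation→action transition within a run (hypothetical fields)."""

    run_id: str
    step: int
    observation: str                      # what the agent saw (e.g. test output)
    action: str                           # what it did (e.g. a patch or command)
    state_hash_before: str
    state_hash_after: str
    reward: Optional[float] = None        # scalar signal, if the step is scored
    reward_components: Dict[str, float] = field(default_factory=dict)
    trace_summary: Optional[str] = None   # condensed interpreter/execution trace
    done: bool = False

    def to_doc(self) -> dict:
        """Plain dict suitable for JSON or document-store ingest."""
        return asdict(self)


if __name__ == "__main__":
    before = {"variants.py": "def multiply(x, y):\n    return 0\n"}
    after = {"variants.py": "def multiply(x, y):\n    return x * y\n"}
    step = EpisodeStep("run-a", 0, "1 test failed", "edit variants.py",
                       state_hash(before), state_hash(after),
                       reward=1.0, reward_components={"tests_passed": 1.0}, done=True)
    print(json.dumps(step.to_doc(), indent=2))
```

Hashing before and after each action lets downstream tooling verify that replayed trajectories reach the same states without storing full snapshots in every record.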
Fast dev readiness
    make project-ready
Live probe (strict backend)
    READINESS_LIVE=1 STRICT_READY=1 make project-ready-live
Smokes (deterministic)
    GAMIFIED_FAST_BENCH=1 uv run -q python tests/smoke/run_all.py
Release smokes (no Arango)
    uv run python release_smokes/00_quick_check.py
    uv run python release_smokes/10_run_from_prompt.py
    uv run python release_smokes/20_emit_only_then_aggregate.py
    uv run python release_smokes/30_run_from_spec.py
- Patches or diffs for items 1–5 (canonicalization, lifecycle, security, Arango gate, RAM fallback).
- Schema proposal for episodes/logs aligned with world‑model research needs.
- Notes on POP rule gaps and MCP ergonomics.