All‑Smokes Gate Still Timing Out/Failing in Split: Targeted Debug + Patch Requests
Created: 2025-09-27
TTL: Private, delete within 15 minutes after review
Summary
- We split the composite all_smokes gate into all_smokes_core + all_smokes_nd and added per‑check timeouts, xdist and PYTEST_ADDOPTS pass‑through.
- The orchestrator no longer fails on every run, but we still see:
  - Harness timeouts on some runs (improved, but still possible on slow hosts).
  - True FAILs in all_smokes_nd caused by env/base mismatches (see below); these tests pass when run in isolation.
Key Observations (from runs on this host)
- When the shim sees 8788 already bound, it starts on a free port (ma_port). Our configured checks set MINI_AGENT_API_HOST/PORT to ma_port, but CODEX_AGENT_API_BASE in readiness.yml is still hard‑coded to http://127.0.0.1:8788 — so codex‑agent tests can hit the wrong server and fail with 500/connection issues.
- In long runs, occasional port collisions and docker tools‑stub mapping :8791 caused non‑deterministic behavior. We added preflight port freeing and a docker stop for tools‑stub, which helped.
- After deduping pytest.ini sections, deterministic suites are clean. Remaining red items concentrate in ND/E2E where bases or models weren’t fully aligned to the shim’s dynamic port.
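The port fallback described in the first observation can be sketched as follows. This is a minimal illustration, not the shim's actual code: `resolve_port` is a hypothetical helper name, and it assumes the shim binds a plain IPv4 TCP socket.

```python
import socket


def resolve_port(host: str = "127.0.0.1", preferred: int = 8788) -> int:
    """Return `preferred` if it can be bound on `host`, else an OS-assigned free port.

    Hypothetical helper: the real shim may probe ports differently.
    """
    for candidate in (preferred, 0):  # port 0 asks the OS for any free port
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, candidate))
            except OSError:
                # Typically "address already in use": try the next candidate.
                continue
            return sock.getsockname()[1]
    raise RuntimeError(f"no free port available on {host}")
```

Note that a probe like this is inherently racy: the port is released before the server binds it, so another process can still grab it in between. That is one plausible contributor to the non-deterministic collisions seen in long runs.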
Failing tests (latest all_smokes_nd pass, trimmed)
- tests/ndsmoke/test_codex_agent_live_optional.py::test_codex_agent_live_optional
- tests/ndsmoke/test_loop_exec_python_ndsmoke.py::test_loop_exec_python_ndsmoke
- tests/ndsmoke/test_mini_agent_api_live_minimal_ndsmoke.py::test_agent_api_live_minimal_optional
- tests/ndsmoke/test_mini_agent_docker_live_optional.py::test_mini_agent_docker_codex_code_live_optional
- tests/ndsmoke/test_mini_agent_lang_ndsmoke.py::test_lang_javascript_live_optional
- tests/ndsmoke/test_ollama_generate_ndsmoke.py::test_ollama_generate_optional
- tests/ndsmoke_e2e/test_codex_agent_e2e_low.py::test_codex_agent_router_low_optional
- tests/ndsmoke_e2e/test_mini_agent_e2e_high_escalation.py::test_mini_agent_escalation_high_optional
- tests/ndsmoke_e2e/test_mini_agent_e2e_low.py::test_mini_agent_finalize_via_api_low
Hypotheses
- Codex‑agent path: wrong base URL (using 8788) when the shim moved to a free port; solution is to override CODEX_AGENT_API_BASE to http://{ma_host}:{ma_port} for configured checks (same as we already do for MINI_AGENT_API_*).
- Mini‑agent API minimal/low E2E: same root cause (hitting the wrong port when 8788 is occupied).
- JavaScript live optional: Node is either not detectable or the PATH check is bypassed. Options: force a skip via tool detection, ensure Node is installed, or fix the "which" probe path in the ND test environment so the test skips cleanly when Node is missing.
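To confirm the first two hypotheses before a long run, a preflight check along these lines could flag a base URL that does not match the shim's host and port. This is a sketch only: `base_mismatch` is a hypothetical helper, and it assumes the three environment variables named above are the only inputs.

```python
from typing import Mapping, Optional
from urllib.parse import urlsplit


def base_mismatch(env: Mapping[str, str]) -> Optional[str]:
    """Describe a CODEX_AGENT_API_BASE / MINI_AGENT_API_* mismatch, or return None.

    Hypothetical preflight helper; returns None when there is nothing to compare.
    """
    base = env.get("CODEX_AGENT_API_BASE")
    host = env.get("MINI_AGENT_API_HOST")
    port = env.get("MINI_AGENT_API_PORT")
    if not (base and host and port):
        return None
    parts = urlsplit(base)
    actual = (parts.hostname, parts.port)
    expected = (host, int(port))
    if actual != expected:
        return (
            f"CODEX_AGENT_API_BASE points at {actual[0]}:{actual[1]}, "
            f"but the shim is on {host}:{port}"
        )
    return None
```

The comparison is literal, so `localhost` and `127.0.0.1` are reported as a mismatch even though they usually reach the same server; normalise hosts first if that matters.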
Please review and supply clean diffs for:
- scripts/mvp_check.py: override CODEX_AGENT_API_BASE with the shim port for configured checks

    @@ def main():
    -    env.update({k: str(v) for (k, v) in env_add.items()})
    +    env.update({k: str(v) for (k, v) in env_add.items()})
         # If the configured check targets the mini-agent API, prefer the shim port we resolved above
    -    if name == 'mini_agent_e2e_low' or 'MINI_AGENT_API_PORT' in env:
    +    if name in ('mini_agent_e2e_low','all_smokes','all_smokes_core','all_smokes_nd') or 'MINI_AGENT_API_PORT' in env:
             try:
                 env['MINI_AGENT_API_HOST'] = locals().get('ma_host', env.get('MINI_AGENT_API_HOST','127.0.0.1'))
                 env['MINI_AGENT_API_PORT'] = str(locals().get('ma_port', env.get('MINI_AGENT_API_PORT','8788')))
    +            # Ensure codex-agent base hits the same shim
    +            env['CODEX_AGENT_API_BASE'] = f"http://{env['MINI_AGENT_API_HOST']}:{env['MINI_AGENT_API_PORT']}"
             except Exception:
                 pass

- readiness.yml: remove the hard-coded CODEX_AGENT_API_BASE=…8788 from the split checks or set it dynamically (mvp_check will override it anyway). If you prefer to keep it:

    -  CODEX_AGENT_API_BASE: http://127.0.0.1:8788
    +  # Base will be overridden at runtime to the shim port by mvp_check
    +  CODEX_AGENT_API_BASE: http://127.0.0.1:8788

- scripts/run_all_smokes.py: ensure a consistent env for the codex endpoint and allow xdist (already present on our branch, but included for completeness)

    @@
    -    cmd = ["pytest", "-q"] + targets
    +    cmd = ["pytest", "-q"] + targets
         if importlib.util.find_spec("xdist") and os.environ.get("NO_XDIST") != "1":
             workers = os.environ.get("PYTEST_XDIST_AUTO_NUM_WORKERS") or "auto"
             cmd += ["-n", workers]
         extra = shlex.split(os.environ.get("PYTEST_ADDOPTS", "")) if os.environ.get("PYTEST_ADDOPTS") else []
         cmd += extra

- pytest.ini: keep a single [pytest] section (we fixed a duplicate section locally; this note is included to avoid recurrence)
- Optional: Node tool detection patch
  - If you want lang_javascript_live_optional to skip cleanly when Node is missing, either ensure Node is available or add a more robust PATH probe (per your policy, we prefer features working; skip only when truly absent).
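A more robust probe for the optional Node item might look like the sketch below. `node_available` is a hypothetical helper, not something that exists in the repo; the injectable `which` parameter is there only to make the probe easy to exercise without Node installed.

```python
import shutil
import subprocess
from typing import Callable, Optional


def node_available(which: Callable[[str], Optional[str]] = shutil.which) -> bool:
    """Return True only if a `node` executable is on PATH and answers `--version`.

    Hypothetical helper: checks that the binary actually runs, not just that
    a file with that name exists on PATH.
    """
    path = which("node")
    if path is None:
        return False
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        # Broken symlink, non-executable file, or a hung wrapper script.
        return False
    return result.returncode == 0 and result.stdout.strip().startswith("v")
```

In the ND test this could feed `pytest.mark.skipif(not node_available(), reason="node not found")`, so the test skips only when Node is truly absent and still runs (and can fail) whenever it is present.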
Run Plan After Patch
- Strict split:
    ALL_SMOKES_CORE_TIMEOUT=720 ALL_SMOKES_ND_TIMEOUT=1500 make project-ready-all-split
- If CI supports xdist: set PYTEST_XDIST_AUTO_NUM_WORKERS (e.g., 6–8) and PYTEST_ADDOPTS="--durations=25".
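For reference, the sketch below shows one way a per-check timeout read from an environment variable can be applied so that a harness timeout is reported separately from a true FAIL. `run_with_timeout` and its return labels are hypothetical; only the variable names ALL_SMOKES_CORE_TIMEOUT and ALL_SMOKES_ND_TIMEOUT come from the run plan above.

```python
import os
import subprocess
import sys
from typing import Sequence


def run_with_timeout(cmd: Sequence[str], env_var: str, default_s: float) -> str:
    """Run `cmd`, reading its timeout in seconds from `env_var` (hypothetical helper).

    Returns "PASS", "FAIL", or "TIMEOUT" so callers can tell a slow host
    apart from a genuinely failing check.
    """
    timeout_s = float(os.environ.get(env_var, default_s))
    try:
        proc = subprocess.run(list(cmd), timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child on timeout; grandchildren
        # (e.g. xdist workers) may need separate cleanup.
        return "TIMEOUT"
    return "PASS" if proc.returncode == 0 else "FAIL"


if __name__ == "__main__":
    # Example: a trivial command standing in for the ND smoke run.
    print(run_with_timeout([sys.executable, "-c", "pass"], "ALL_SMOKES_ND_TIMEOUT", 1500))
```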
What I did (already applied locally)
- Port-freeing logic, docker tools-stub stop, echo shim for the codex-agent OpenAI endpoint, parallel + timeout fixes, storage trace schema addition, Agent Proxy echo via router, HTTP invoker tolerant POST.
Ask
- Provide surgical diffs for (1) the CODEX_AGENT_API_BASE override in configured checks and (2) the readiness.yml change that keeps the base value with a comment noting the dynamic override.
- Optional: confirm the maximum end-to-end wall-clock time you expect for the full suite so we can set conservative timeouts.