@jleechan2015
Created May 20, 2026 20:20
PR #6958 level_up_entry_offer evidence iteration_014 — 3/3 PASS @ 91174d019b0bb2c677065042530f0967f1cf183c

Evidence Package: level_up_entry_offer_pr6958

Package Manifest

  • Test Name: level_up_entry_offer_pr6958
  • Run ID: level_up_entry_offer_pr6958-014-20260520T201714
  • Iteration: 14
  • Bundle Version: 1.2.0
  • Collected At (UTC): 2026-05-20T20:17:14.029698+00:00
  • Repository: worldarchitect.ai
  • Branch: fix/6926-review-comments
  • Commit: 91174d019b0bb2c677065042530f0967f1cf183c
  • Merge Base: f457ae58ab501c948aab8e9ff110c54899836f20
  • Commits Ahead of Main: 70

Git Provenance

.beads/issues.jsonl                                |   9 +
 .github/workflows/design-doc-gate.yml              |   3 +-
 docs/design/pr-designs/pr-6958.html                | 311 ++++++++++
 docs/design/pr-designs/pr-6958.md                  | 104 ++++
 mvp_site/agents.py                                 |  29 +-
 mvp_site/llm_parser.py                             |  41 ++
 mvp_site/llm_providers/gemini_provider.py          |   9 +-
 mvp_site/prompts/level_up_instruction.md           |  34 +-
 mvp_site/prompts/planning_protocol.md              |  27 +-
 mvp_site/prompts/rewards_system_instruction.md     |  26 +-
 mvp_site/rewards_engine.py                         | 630 +++++++++++----------
 mvp_site/schemas/prompt_tool_contracts.json        |   4 +-
 mvp_site/tests/data/modal_routing_fixtures.json    |   3 +-
 mvp_site/tests/test_agents.py                      |  47 +-
 mvp_site/tests/test_canonicalize_invariants.py     |  31 +-
 mvp_site/tests/test_freeze_time_choices.py         |  67 ++-
 mvp_site/tests/test_prompts.py                     |  23 +
 mvp_site/tests/test_rewards_engine.py              | 545 ++++++++++++++++--
 mvp_site/tests/test_rewards_engine_wiring.py       |  39 +-
 mvp_site/tests/test_streaming_orchestrator.py      | 297 +++++++++-
 .../tests/test_testing_utils_centralization.py     | 129 +++--
 mvp_site/tests/test_world_logic.py                 | 152 ++++-
 mvp_site/world_logic.py                            |  98 +++-
 roadmap/README.md                                  |   1 +
 .../nextsteps-2026-05-19-pr6958-review-fixes.md    |  94 +++
 testing_mcp/lib/server_utils.py                    |   9 +-
 testing_mcp/test_level_up_entry_offer_pr6958.py    | 386 +++++++++++++
 .../test_level_up_rewards_planning_atomicity.py    |  64 ++-
 ..._level_up_rewards_planning_atomicity_browser.py |  51 +-
 29 files changed, 2674 insertions(+), 589 deletions(-)

Server Runtime

  • Port: 58917
  • PID: 73637
  • Command: /opt/homebrew/Cellar/python@3.11/3.11.13/Frameworks/Python.framework/Versions/3.11/Resources/Python.app/Contents/MacOS/Python -m gunicorn mvp_site.main:app --bind 0.0.0.0:58917 --workers 1 --worker-class gthread --threads 4 --timeout 600 --max-requests 1000 --access-logfile - --error-logfile - --log-level info

Environment Variables

  • WORLDAI_DEV_MODE: true
  • TESTING: None
  • MOCK_SERVICES_MODE: false
  • GOOGLE_APPLICATION_CREDENTIALS: [SET - file:serviceAccountKey.json]
  • WORLDAI_GOOGLE_APPLICATION_CREDENTIALS: [SET - file:serviceAccountKey.json]
  • FIRESTORE_EMULATOR_HOST: None
  • PORT: 58917
  • FIREBASE_PROJECT_ID: worldarchitecture-ai
  • GEMINI_API_KEY: [SET - 39 chars]
  • LLM_REQUEST_RESPONSE_CAPTURE_PATH: /tmp/worldarchitect.ai/fix_6926-review-comments/level_up_entry_offer_pr6958/iteration_014/llm_request_responses_1779308040617.jsonl
  • HTTP_REQUEST_RESPONSE_CAPTURE_PATH: /tmp/worldarchitect.ai/fix_6926-review-comments/level_up_entry_offer_pr6958/iteration_014/http_request_responses_1779308040617.jsonl
  • GEMINI_HTTP_REQUEST_RESPONSE_CAPTURE_PATH: /tmp/worldarchitect.ai/fix_6926-review-comments/level_up_entry_offer_pr6958/iteration_014/gemini_http_request_responses_1779308040617.jsonl
  • MCP_TEST_PROVIDER_HTTP_CAPTURE_PATH: /tmp/worldarchitect.ai/fix_6926-review-comments/level_up_entry_offer_pr6958/iteration_014/provider_http_request_responses_1779308040617.jsonl

Files in This Bundle

  • README.md - This manifest
  • methodology.md - Testing methodology
  • evidence.md - Evidence summary with Claim→Artifact Map and Coverage Matrix
  • notes.md - Additional context, TODOs, follow-ups
  • metadata.json - Machine-readable metadata
  • assertions.json - Strict before/after parity assertions (if present)
  • run.json - Test results
  • streaming_evidence.json - Normalized streaming evidence summary
  • request_responses.jsonl - Raw MCP request/response payloads (if present)
  • llm_request_responses.jsonl - Raw LLM request/response payloads (if present)
  • http_request_responses.jsonl - Raw local-server HTTP request/response payloads (if present)
  • gemini_http_request_responses.jsonl - Raw Gemini transport HTTP traces (if present)
  • artifacts/ - Additional evidence files

Evidence Summary: level_up_entry_offer_pr6958

Test Results

  • Total Scenarios: 3
  • Scenario Validation Passed: 3
  • Scenario Validation Failed: 0
  • Scenario Validation Pass Rate: 100.0%
  • Raw LLM Layer Passed: 2/2 (100.0%)

⚠️ Multi-Campaign Isolation Note

This evidence bundle contains 2 campaigns:

  • 0 shared campaign(s) reused across multiple tests
  • 2 independent campaign(s) each used by one test only

Why: Each test uses its own campaign to prevent state bleed

Claim Scoping: Each scenario result below includes its campaign_id. Claims about a specific scenario reference ONLY that scenario's campaign. Aggregate claims (e.g., "3/3 passed") span all campaigns, but each individual result is traceable to its campaign.

  • Post-Processing Campaign Capture Passed: 2
  • Post-Processing Campaign Capture Failed: 0
  • Post-Processing Campaign Capture Pass Rate: 100.0%

Scenario Results

entry_offer_level_up_now_only

  • Status: ✅ PASS
  • Campaign ID: v20ZBNHIYFTZh0P0Kaym

modal_mechanic_plus_finish_freeze_time

  • Status: ✅ PASS
  • Campaign ID: l0UR1tf0k3qEiBxyX9QY

EVIDENCE_SIGNATURE_GUARD

  • Status: ✅ PASS

Provenance Chain

  • Git HEAD: 91174d019b0bb2c677065042530f0967f1cf183c
  • Test Timestamp: 2026-05-20T20:17:14.029698+00:00
  • Server PID: 73637

Claim → Artifact Map

| Claim | File | Key Field(s) |
| --- | --- | --- |
| Scenario validation passed: 3/3 | run.json | scenarios[].passed, scenarios[].errors |
| Campaign post-processing capture passed: 2/2 | run.json | campaign_capture_status[*].status |
| Streaming evidence normalized | streaming_evidence.json | summary., scenarios[].chunk_count_observed |
| Bundle artifact inventory | artifacts/collection_log.txt | core_files, jsonl_captures, campaigns_dir |
| MCP request/response captured | request_responses.jsonl | Full request/response pairs |
| Local server HTTP request/response captured | http_request_responses.jsonl | http_request/http_response entries |
| LLM request/response stream captured | llm_request_responses.jsonl | request/response entries (type field) |
| Gemini HTTP transport captured | gemini_http_request_responses.jsonl | http_request/http_response/transport_error entries |
| Server execution log | artifacts/server.log | Raw server output |
| Git provenance | metadata.json | git_provenance.git_head = 91174d01... |
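
The first two claims in the map can be rechecked directly from run.json. The following is a minimal sketch, not the bundle's own tooling: it assumes run.json has a top-level `scenarios` list whose entries carry `passed` and `errors`, and a `campaign_capture_status` collection whose entries carry `status`, as the key fields above suggest. The function names and the exact container shapes are assumptions.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_run(run: dict) -> dict:
    """Recount scenario results and capture statuses from a parsed run.json."""
    scenarios = run.get("scenarios", [])
    # A scenario counts as passed only if it is flagged passed and lists no errors.
    passed = sum(1 for s in scenarios if s.get("passed") and not s.get("errors"))

    # Assumption: campaign_capture_status is either a list of entries or a
    # mapping keyed by campaign ID; both shapes are handled here.
    capture = run.get("campaign_capture_status", [])
    entries = capture.values() if isinstance(capture, dict) else capture
    statuses = Counter(str(entry.get("status")) for entry in entries)

    return {
        "total": len(scenarios),
        "passed": passed,
        "capture_statuses": dict(statuses),
    }


def load_and_summarize(path: str = "run.json") -> dict:
    """Read run.json from disk and summarize it."""
    return summarize_run(json.loads(Path(path).read_text(encoding="utf-8")))
```

Tallying statuses with a counter avoids guessing which string the bundle uses for a successful capture; the reader compares the tally against the 2/2 claim.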

Coverage Matrix

| Scenario | Status | Campaign ID |
| --- | --- | --- |
| entry_offer_level_up_now_only | ✅ Pass | v20ZBNHI... |
| modal_mechanic_plus_finish_freeze_time | ✅ Pass | l0UR1tf0... |
| EVIDENCE_SIGNATURE_GUARD | ✅ Pass | N/A |

Evidence Integrity

  • All files in this bundle have corresponding .sha256 checksum files
  • Checksums use local basename paths so per-file verification works from each artifact directory
  • ⚠️ Server warnings detected (see artifacts/server.log)
    • Warning: ACTION_RESOLUTION_MISSING_FIELDS
    • Warning: INVENTORY_SAFEGUARD

What This Evidence Proves vs. Does NOT Prove

Proves:

  • Core logic and scenario validation for level_up_entry_offer_pr6958
  • Scenario execution pass rates (3/3)

Does NOT Prove:

  • Production server behavior (tested on local server unless otherwise noted)
  • Performance under load (single-request tests)
  • Edge cases not covered by scenarios

Methodology: level_up_entry_offer_pr6958

Test Type

Real API test against MCP server (not mock mode).

Test Mode

  • TESTING env var: None
  • MOCK_SERVICES_MODE env var: false
  • Mode: Real API calls via MCP HTTP JSON-RPC
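
For readers unfamiliar with the transport, the sketch below shows roughly what a single MCP call over HTTP JSON-RPC looks like. It only builds the request and does not send it. The `tools/call` method and the `name`/`arguments` parameters follow the Model Context Protocol; the `/mcp` endpoint path, the `process_action` tool name, and the argument keys are hypothetical placeholders, not taken from this repository.

```python
import json
import urllib.request


def build_tool_call(base_url: str, tool: str, arguments: dict,
                    request_id: int = 1) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC 2.0 request for an MCP tool call."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return urllib.request.Request(
        url=f"{base_url}/mcp",  # hypothetical endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical tool name and argument keys; the port matches this run.
req = build_tool_call(
    "http://127.0.0.1:58917",
    "process_action",
    {"campaign_id": "v20ZBNHIYFTZh0P0Kaym", "user_input": "..."},
)
```

Passing `req` to `urllib.request.urlopen` would send it; each such request/response pair is what the raw MCP capture records.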

Execution Environment

  • Server running at port 58917
  • Process: /opt/homebrew/Cellar/python@3.11/3.11.13/Frameworks/Python.framework/Versions/3.11/Resources/Python.app/Contents/MacOS/Python -m gunicorn mvp_site.main:app --bind 0.0.0.0:58917 --workers 1 --worker-class gthread --threads 4 --timeout 600 --max-requests 1000 --access-logfile - --error-logfile - --log-level info

Test Isolation Design

Multi-campaign architecture is BY DESIGN for test isolation.

  • Total Campaigns: 2
  • Shared Campaigns: 0 (used by multiple scenarios)
  • Independent Campaigns: 2 (single-scenario campaigns)
  • Isolated Tests: 0 (explicit isolated: True scenarios)
  • Rationale: Each test uses its own campaign to prevent state bleed

No scenarios in this run were marked isolated: True. Campaign separation still prevents state bleed across scenarios that use different campaign IDs.
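
The shared versus independent counts can be derived from the scenario-to-campaign mapping reported in the scenario results. This is an illustrative sketch using the campaign IDs from this run; the function name is hypothetical, and treating EVIDENCE_SIGNATURE_GUARD as having no campaign reflects the N/A shown for it.

```python
from collections import Counter

# Scenario -> campaign ID, as listed in the scenario results for this run.
scenario_campaigns = {
    "entry_offer_level_up_now_only": "v20ZBNHIYFTZh0P0Kaym",
    "modal_mechanic_plus_finish_freeze_time": "l0UR1tf0k3qEiBxyX9QY",
    "EVIDENCE_SIGNATURE_GUARD": None,  # no campaign reported for this scenario
}


def classify_campaigns(mapping: dict) -> tuple:
    """Split campaign IDs into shared (used by >1 scenario) and independent."""
    usage = Counter(c for c in mapping.values() if c is not None)
    shared = sorted(c for c, n in usage.items() if n > 1)
    independent = sorted(c for c, n in usage.items() if n == 1)
    return shared, independent


shared, independent = classify_campaigns(scenario_campaigns)
print(len(shared), len(independent))  # prints: 0 2
```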

Evidence Capture

  • Git provenance captured at test start
  • Raw request/response payloads captured for each MCP call
  • Server runtime info captured via lsof/ps
  • Streaming evidence normalized into streaming_evidence.json
  • Raw local-server HTTP request/response payloads captured in http_request_responses.jsonl
  • Raw LLM request/response payloads captured in llm_request_responses.jsonl
  • Raw Gemini HTTP transport payloads captured in gemini_http_request_responses.jsonl
  • Raw LLM response text captured in server.log (artifacts/server.log)
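
The .jsonl captures hold one JSON object per line, so they can be inspected without loading a whole file. A minimal sketch, assuming each entry carries a `type` field as the Claim → Artifact Map indicates for llm_request_responses.jsonl; the function name and the `<missing>` placeholder are illustrative.

```python
import json
from collections import Counter
from typing import Iterable


def count_entry_types(lines: Iterable[str]) -> Counter:
    """Tally JSONL entries by their 'type' field, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Entries without a type are grouped under a placeholder key.
        counts[entry.get("type", "<missing>")] += 1
    return counts
```

For example, `with open("llm_request_responses.jsonl", encoding="utf-8") as fh: print(count_entry_types(fh))` would show how many request and response entries were captured.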

Evidence Mode

  • System instruction capture: filenames + char_count (lightweight). Raw LLM request/response payloads captured in request_responses.jsonl when raw payload capture is enabled.

Validation Criteria

Test scenarios validate that:

  1. MCP server processes actions correctly
  2. State updates are returned as expected
  3. Server processes all requests successfully (validation warnings may be logged but requests succeed)

Note: Server warnings (e.g., validation, entity tracking) may appear in logs. Check artifacts/server.log for full server output.

Warning parser for notes: counts each log line matching \bWARNING\b|SYSTEM WARNING: once.
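
The line-level counting rule can be reproduced as follows. This is a sketch of the stated rule, not the bundle's actual parser: it assumes the match is case-sensitive, and the function name is hypothetical.

```python
import re
from typing import Iterable

# Pattern quoted in the notes; a line counts once however many times it matches.
WARNING_LINE = re.compile(r"\bWARNING\b|SYSTEM WARNING:")


def count_warning_lines(lines: Iterable[str]) -> int:
    """Count log lines that contain at least one warning marker."""
    return sum(1 for line in lines if WARNING_LINE.search(line))
```

Running it over an open handle to artifacts/server.log gives a figure to compare with the reported warning count. Because the count is per line, a line mentioning WARNING twice still contributes one.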

Notes: level_up_entry_offer_pr6958

Run Information

  • Run ID: level_up_entry_offer_pr6958-014-20260520T201714
  • Iteration: 14
  • Bundle Version: 1.2.0
  • Timestamp: 2026-05-20T20:17:14.029698+00:00

Evidence Integrity

  • All files in this bundle have corresponding .sha256 checksum files
  • Checksums use local basename paths so per-file verification works from each artifact directory
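
Per-file verification can be sketched as below. The sidecar layout is an assumption: the code expects `<name>.sha256` next to each artifact, in the common `sha256sum` style where the first whitespace-separated token is the hex digest. The function name is hypothetical.

```python
import hashlib
from pathlib import Path


def verify_checksum(artifact: Path) -> bool:
    """Compare an artifact's SHA-256 digest with its .sha256 sidecar file."""
    sidecar = artifact.with_name(artifact.name + ".sha256")
    # Assumption: sidecar content looks like "<hex digest>  <basename>".
    recorded = sidecar.read_text(encoding="utf-8").split()[0].lower()
    actual = hashlib.sha256(artifact.read_bytes()).hexdigest()
    return recorded == actual
```

Because the sidecars reference basenames only, the check works from whichever directory holds the artifact, for example `verify_checksum(Path("artifacts/server.log"))`.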

Scenario Summary

  • Total: 3
  • Passed: 3
  • Failed: 0

Post-Processing Capture Summary

  • Campaigns with capture status: 2
  • Capture Passed: 2
  • Capture Failed: 0

Warning/Error Summary

  • Server Warnings: 72 warnings in server.log
  • Warning Parser: line-level regex \bWARNING\b|SYSTEM WARNING: (one count per matching line)
  • Key Warning Categories:
    • ACTION_RESOLUTION_MISSING_FIELDS
    • INVENTORY_SAFEGUARD

Follow-up Items

Additional Context
