Task UUID: 378d049f-e957-4c78-9fa8-d1b84684d37e
Task ID: jaeger-query-service-indexmapping-conflict
Category: Platform Engineering
Author: chrorrala
Eval version: v12 (local files: v11 — API 500'd on download)
Backend: Docker (author confirmed in Discord: "I reverted to the normal docker as instructed")
Reviewer: Dylan Fitzgerald
Mean score 0.64 exceeds the docker backend threshold (<0.50). The task has excellent structural quality and the grader is clean, but it needs more difficulty for agents who get past the rename direction gate.
| Criterion | Threshold | Actual | Status |
|---|---|---|---|
| Solvable | solution.sh passes | 4/7 agents pass | ✅ |
| Challenging | Mean score <0.50 (docker) | 0.64 | ❌ |
| Substantial | ≥4H for senior engineer | Multi-layer Jaeger/Badger/GitOps investigation | ✅ |
| Metric | Value |
|---|---|
| Model | biggie-nebula |
| Version | v12 |
| Rollouts | 7 |
| Mean score | 0.64 |
| Full-pass rate (1.0) | 4/7 (57%) |
| Score distribution | [0, 0.25, 1.0, 1.0, 1.0, 1.0, 0.25] |
| Subscore | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7 | Pass rate |
|---|---|---|---|---|---|---|---|---|
| index_cleanup (0.25) | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 57% |
| e2e_trace (0.25) | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 57% |
| config_standardized (0.25) | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 57% |
| storage_hardening (0.25) | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 86% |
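The headline numbers can be recomputed from the subscore table above. This is a minimal Python sketch, assuming each subscore is binary and weighted 0.25 as the row labels indicate; the variable names are illustrative and not taken from the grader.

```python
# Per-run subscore results transcribed from the table above (Run 1..Run 7).
SUBSCORES = {
    "index_cleanup":       [0, 0, 1, 1, 1, 1, 0],
    "e2e_trace":           [0, 0, 1, 1, 1, 1, 0],
    "config_standardized": [0, 0, 1, 1, 1, 1, 0],
    "storage_hardening":   [0, 1, 1, 1, 1, 1, 1],
}
WEIGHT = 0.25  # equal weight per subscore

# Each run's score is the weighted sum of its four subscores.
run_scores = [WEIGHT * sum(col) for col in zip(*SUBSCORES.values())]
mean_score = sum(run_scores) / len(run_scores)
full_passes = sum(1 for s in run_scores if s == 1.0)
pass_rates = {name: round(100 * sum(r) / len(r)) for name, r in SUBSCORES.items()}

print(run_scores)
print(round(mean_score, 2), full_passes)
print(pass_rates)
```

The recomputed per-run scores match the reported score distribution, and the mean rounds to the reported 0.64.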
All 3 failures share the identical root cause: agents interpreted bleat-service as the canonical name and treated the sed rename to bleat-api as the problem to undo.
- Run 1 (score 0.0): Reversed the sed command, removed OTEL_SERVICE_NAME from the ConfigMap entirely. All checks failed. Spent 39 min / 425 messages.
- Run 2 (score 0.25): Same wrong direction. Set `OTEL_SERVICE_NAME` to `bleat-service`, but did correctly configure TTL via Helm. Only `storage_hardening` passed.
- Run 7 (score 0.25): Correctly identified the mismatch, then resolved it backwards: removed the sed, kept `bleat-service`. Configured TTL via `kubectl set env`. Only `storage_hardening` passed.
Classification: Genuine difficulty. The Mattermost breadcrumb ("the bleat-api rename...is working fine in the code, but OTEL_SERVICE_NAME still says bleat-service") is read by all agents but overridden by 3/7 who prioritize structural pattern-matching ("all other services use -service suffix") over human institutional context. This is a real LLM reasoning bias — agents that trust the human communication succeed; agents that trust code patterns fail.
All 4 passing runs (3, 4, 5, 6) converge on the same solution:
- Update ConfigMap `OTEL_SERVICE_NAME` to `bleat-api` via Gitea + ArgoCD
- Wipe Badger storage (cleanup pod or `kubectl exec rm -rf`)
- Configure TTL via env var or Helm
- Wait for traffic generator to produce fresh `bleat-api` traces
Turn efficiency varies widely: Run 5 completed in 11 min / 216 messages, while Run 3 took 60 min / 457 messages. The variance comes from TTL configuration discovery — Jaeger v2's YAML schema doesn't accept span_store_ttl where agents expect it, leading to crash-loop debugging. Agents that reach for kubectl set env early bypass this.
The score distribution is strictly bimodal: agents score either 1.0 (got the name right) or 0.0–0.25 (got it wrong). No agent who chose the correct rename direction failed any check. This means:
- The rename direction gate accounts for ~100% of the difficulty
- Badger cleanup, TTL config, service audit, and CI/CD pipeline work are effectively free once past the gate
- The 4 subscores collapse to 2 independent outcomes: "got the name right" and "configured TTL"
The grader is well-designed:
- 4 subscores, equal weights (0.25 each) — correct, no rounding issues
- ConfigMap detection for both OTEL config and TTL — fixed in recent versions (earlier versions only checked deployment args/env vars)
- Anti-gaming measures are thorough: requires recent traces with parent-child verification, can't game by wiping all data (e2e_trace needs live traffic), checks deployment spec not running pod
- Detailed feedback messages for each check — good for debugging
- `check_legacy_traces_cleaned` is diagnostic-only (not scored), which is a good design choice
- 60-second stabilization wait before grading, 60-minute trace freshness window: generous enough to avoid timing false negatives
No grader defects identified.
index_cleanup, e2e_trace, and config_standardized have perfectly correlated pass/fail across all 7 runs. They all depend on the same root decision (choosing bleat-api), so they always move together. This isn't a defect — it's a consequence of the task's layered structure where the rename decision gates everything downstream — but it means the apparent 4-way granularity overstates the actual score diversity.
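The correlation can be checked directly against the per-run results. This is a small Python sketch, with the 0/1 values transcribed from the subscore table and hypothetical variable names.

```python
# Per-run pass/fail (Run 1..Run 7), transcribed from the subscore table.
results = {
    "index_cleanup":       [0, 0, 1, 1, 1, 1, 0],
    "e2e_trace":           [0, 0, 1, 1, 1, 1, 0],
    "config_standardized": [0, 0, 1, 1, 1, 1, 0],
    "storage_hardening":   [0, 1, 1, 1, 1, 1, 1],
}
gated = ["index_cleanup", "e2e_trace", "config_standardized"]

# True when the three gated subscores are identical in every run.
gated_identical = all(results[name] == results[gated[0]] for name in gated)

# Distinct per-run outcome vectors actually observed, out of 2**4 possible.
patterns = sorted(set(zip(*results.values())))

print(gated_identical)
print(patterns)
```

Only three of the sixteen possible outcome vectors appear across the 7 runs, and no run passed the gated checks while failing storage_hardening.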
- Multi-layer sabotage is creative and well-crafted: sed rename + OTEL misconfiguration + synthetic legacy spans + operation pollution + no Badger TTL
- CI/CD constraint (changes via Gitea → ArgoCD, selfHeal reverts kubectl) is realistic and adds meaningful operational complexity
- Mattermost breadcrumb is subtle but unambiguous — tests whether agents trust human communication over code patterns
- Continuous traffic generator deployed by setup.sh ensures traces flow post-fix — smart design that prevents timing-based false negatives
- The `authentication_service` underscore is a well-designed distractor: every run notices it, failing runs waste 10-30 turns "fixing" it, but the grader's `EXPECTED_SERVICES` uses `authentication_service` (underscore) as canonical. Tests prioritization skills.
- The incident report is realistic and well-written
The task needs additional difficulty for agents who get past the rename direction gate. Currently 100% of correct-direction agents score 1.0. Here are concrete options:
Currently agents just run `rm -rf /mnt/data/badger/*` and restart Jaeger. Consider:
- Requiring cleanup while keeping Jaeger running (can't just wipe the whole store)
- Adding other services' traces that must be preserved through the cleanup
- Making the PVC access path less obvious (e.g., non-default mount path)
Among passing runs, TTL discovery causes the most turn variance (11 min vs 60 min) but never causes a score difference because all agents eventually find a working approach. Consider:
- Requiring TTL to be configured via the Helm chart (not just `kubectl set env`, which ArgoCD would revert)
- Requiring a specific TTL range that the agent must reason about (not just "any value ≤168h")
The authentication_service underscore is currently a distractor. Making it a scored check would add a parallel difficulty axis that doesn't depend on the rename direction gate. The agent would need to fix both bleat-service → bleat-api AND authentication_service → authentication-service through the CI/CD pipeline.
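One way such a scored check might look is sketched below in Python. This is purely hypothetical: the mapping, the function name, and the feedback strings are illustrative and do not come from the existing grader, whose `EXPECTED_SERVICES` currently treats `authentication_service` as canonical. The input is assumed to be the list of service names observed in recent traces.

```python
# Hypothetical sketch of a scored naming check; not the real grader's code.
# Maps each legacy name to the canonical name this proposal would require.
LEGACY_TO_CANONICAL = {
    "bleat-service": "bleat-api",
    "authentication_service": "authentication-service",
}

def check_service_names(observed):
    """Return (passed, feedback) for service names seen in recent traces."""
    seen = set(observed)
    legacy = sorted(seen & set(LEGACY_TO_CANONICAL))
    missing = sorted(set(LEGACY_TO_CANONICAL.values()) - seen)
    problems = []
    if legacy:
        problems.append("legacy names still reporting: " + ", ".join(legacy))
    if missing:
        problems.append("canonical names not seen: " + ", ".join(missing))
    return (not problems, "; ".join(problems) or "all service names canonical")

# An agent that fixed only the bleat rename would fail this check.
print(check_service_names(["bleat-api", "authentication_service"]))
```

Because this check does not depend on which way the agent resolved the bleat rename, it would vary independently of the three currently correlated subscores.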
Currently agents handle the Gitea → ArgoCD flow without trouble. Consider:
- Requiring changes in multiple repos (e.g., source code in `bleater-app` AND manifests in `bleater-manifests`)
- Adding ArgoCD sync issues that the agent must resolve
- Requiring the agent to verify the deployment actually rolled out (not just that the commit was pushed)
setup.sh: Well-structured 6-step sabotage. Commits deployment changes through Gitea (respecting the GitOps constraint), waits for ArgoCD sync, injects synthetic spans via OTLP/HTTP, deploys traffic generator, posts Mattermost breadcrumb. The sabotage verification step is a nice touch.
solution.sh: Clean 5-step fix. Fixes OTEL_SERVICE_NAME via Gitea commit, triggers ArgoCD refresh, wipes Badger, configures TTL via env vars, waits for stabilization. Correctly notes that the sed command is intentional (the application's init_tracing() hardcodes the service name via a Resource() that ignores OTEL_SERVICE_NAME).
Dockerfile: Minimal, no issues.
- Solution passes grader
- Mean score <0.50 (docker backend) — 0.64 exceeds threshold
- Scope ≥4H
- Task.yaml is clear, doesn't reveal grading criteria
- Grader checks align with task.yaml deliverables
- Equal subscore weights
- No dead-weight subscores (all 4 have variance)
- Agent can't score >0.3 without meaningful progress
- No answer leakage or reward hacking vectors
- Wait times sufficient (60s stabilization + 60min freshness window)
- No grader defects detected