Task Review: Jaeger Query Service Index Mapping Conflict

Task UUID: 378d049f-e957-4c78-9fa8-d1b84684d37e
Task ID: jaeger-query-service-indexmapping-conflict
Category: Platform Engineering
Author: chrorrala
Eval version: v12 (local files: v11; API 500'd on download)
Backend: Docker (author confirmed in Discord: "I reverted to the normal docker as instructed")
Reviewer: Dylan Fitzgerald

Verdict: NEEDS_WORK

The mean score of 0.64 is above the Docker backend threshold, which requires a mean below 0.50. The task has excellent structural quality and the grader is clean, but it needs more difficulty for agents who get past the rename direction gate.

Acceptance criteria

Criterion	Threshold	Actual	Status
Solvable	solution.sh passes	4/7 agents pass	✅
Challenging	Mean score <0.50 (docker)	0.64	❌
Substantial	≥4H for senior engineer	Multi-layer Jaeger/Badger/GitOps investigation	✅

Eval Summary

Metric	Value
Model	biggie-nebula
Version	v12
Rollouts	7
Mean score	0.64
Full-pass rate (1.0)	4/7 (57%)
Score distribution	`[0, 0.25, 1.0, 1.0, 1.0, 1.0, 0.25]`
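
As a quick cross-check of the summary figures, the following minimal sketch recomputes the mean score and full-pass rate from the score distribution and compares the mean against the Docker threshold. The variable names are illustrative and not taken from any eval tooling.

```python
from statistics import mean

# Per-rollout scores copied from the "Score distribution" row.
scores = [0, 0.25, 1.0, 1.0, 1.0, 1.0, 0.25]

# "Challenging" criterion for the Docker backend: mean must be below this.
DOCKER_THRESHOLD = 0.50

mean_score = mean(scores)
full_passes = sum(1 for s in scores if s == 1.0)
is_challenging = mean_score < DOCKER_THRESHOLD

print(f"mean score: {mean_score:.2f}")
print(f"full-pass rate: {full_passes}/{len(scores)} ({full_passes / len(scores):.0%})")
print(f"meets challenging criterion: {is_challenging}")
```

The recomputed values agree with the table: a mean of 0.64, 4/7 full passes, and a failed "Challenging" criterion.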

Per-subscore breakdown

Subscore	Run 1	Run 2	Run 3	Run 4	Run 5	Run 6	Run 7	Pass rate
index_cleanup (0.25)	0	0	1	1	1	1	0	57%
e2e_trace (0.25)	0	0	1	1	1	1	0	57%
config_standardized (0.25)	0	0	1	1	1	1	0	57%
storage_hardening (0.25)	0	1	1	1	1	1	1	86%
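
The subscore table can be checked against the score distribution with a short sketch. It assumes Run 1 failed every check, as stated in the failure analysis (score 0.0, all checks failed); the dictionary layout is illustrative only.

```python
# Pass (1) / fail (0) per run, in order Run 1 .. Run 7.
# Run 1 is all zeros per the failure analysis ("All checks failed").
subscores = {
    "index_cleanup":       [0, 0, 1, 1, 1, 1, 0],
    "e2e_trace":           [0, 0, 1, 1, 1, 1, 0],
    "config_standardized": [0, 0, 1, 1, 1, 1, 0],
    "storage_hardening":   [0, 1, 1, 1, 1, 1, 1],
}
WEIGHT = 0.25  # each subscore carries equal weight

# Fraction of runs that passed each check.
pass_rates = {name: sum(runs) / len(runs) for name, runs in subscores.items()}

# Total score per run: transpose the matrix and weight the passed checks.
run_totals = [WEIGHT * sum(column) for column in zip(*subscores.values())]

for name, rate in pass_rates.items():
    print(f"{name}: {rate:.0%}")
print(run_totals)
```

The per-run totals reproduce the reported score distribution, and the pass rates come out at 57% for the three gated checks and 86% for storage_hardening.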

Failure Analysis

All 3 failures share the identical root cause: agents interpreted bleat-service as the canonical name and treated the sed rename to bleat-api as the problem to undo.

Run 1 (score 0.0): Reversed the sed command, removed OTEL_SERVICE_NAME from the ConfigMap entirely. All checks failed. Spent 39 min / 425 messages.
Run 2 (score 0.25): Same wrong direction. Set OTEL_SERVICE_NAME to bleat-service, but did correctly configure TTL via Helm. Only storage_hardening passed.
Run 7 (score 0.25): Correctly identified the mismatch, then resolved it backwards — removed the sed, kept bleat-service. Configured TTL via kubectl set env. Only storage_hardening passed.

Classification: Genuine difficulty. The Mattermost breadcrumb ("the bleat-api rename...is working fine in the code, but OTEL_SERVICE_NAME still says bleat-service") is read by all agents but overridden by 3/7 who prioritize structural pattern-matching ("all other services use -service suffix") over human institutional context. This is a real LLM reasoning bias — agents that trust the human communication succeed; agents that trust code patterns fail.

Passing runs

All 4 passing runs (3, 4, 5, 6) converge on the same solution:

Update ConfigMap OTEL_SERVICE_NAME to bleat-api via Gitea + ArgoCD
Wipe Badger storage (cleanup pod or kubectl exec rm -rf)
Configure TTL via env var or Helm
Wait for traffic generator to produce fresh bleat-api traces

Turn efficiency varies widely: Run 5 completed in 11 min / 216 messages, while Run 3 took 60 min / 457 messages. The variance comes from TTL configuration discovery — Jaeger v2's YAML schema doesn't accept span_store_ttl where agents expect it, leading to crash-loop debugging. Agents that reach for kubectl set env early bypass this.

The core problem: bimodal scores

The score distribution is strictly bimodal: agents score either 1.0 (got the name right) or 0.0–0.25 (got it wrong). No agent who chose the correct rename direction failed any check. This means:

The rename direction gate accounts for ~100% of the difficulty
Badger cleanup, TTL config, service audit, and CI/CD pipeline work are effectively free once past the gate
The 4 subscores collapse to 2 independent outcomes: "got the name right" and "configured TTL"

Grader Quality

The grader is well-designed:

4 subscores, equal weights (0.25 each) — correct, no rounding issues
ConfigMap detection for both OTEL config and TTL — fixed in recent versions (earlier versions only checked deployment args/env vars)
Anti-gaming measures are thorough: requires recent traces with parent-child verification, can't game by wiping all data (e2e_trace needs live traffic), checks deployment spec not running pod
Detailed feedback messages for each check — good for debugging
check_legacy_traces_cleaned is diagnostic-only (not scored) — good design choice
60-second stabilization wait before grading, 60-minute trace freshness window — generous enough to avoid timing false negatives

No grader defects identified.
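
To illustrate why wiping all data cannot satisfy e2e_trace, here is a minimal sketch of a freshness plus parent-child check. It is not the grader's code: the span dictionary shape (`trace_id`, `span_id`, `parent_id`, `service`, `start`) and the function name are assumptions made for illustration, and only the 60-minute window comes from the review.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_WINDOW = timedelta(minutes=60)  # trace freshness window from the review


def has_fresh_parent_child_trace(spans, service, now):
    """Return True if `service` owns a recent span whose parent is also recent.

    `spans` is a list of dicts with hypothetical keys:
    trace_id, span_id, parent_id (None for a root span), service, start.
    """
    cutoff = now - FRESHNESS_WINDOW
    recent = [s for s in spans if s["start"] >= cutoff]
    # Index recent spans by (trace, span) so a parent is only matched
    # within the same trace.
    span_keys = {(s["trace_id"], s["span_id"]) for s in recent}
    return any(
        s["service"] == service and (s["trace_id"], s["parent_id"]) in span_keys
        for s in recent
    )


# Arbitrary reference time for the example.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh_spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None,
     "service": "api-gateway", "start": now - timedelta(minutes=5)},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a",
     "service": "bleat-api", "start": now - timedelta(minutes=5)},
]
print(has_fresh_parent_child_trace(fresh_spans, "bleat-api", now))  # fresh pair
print(has_fresh_parent_child_trace([], "bleat-api", now))           # wiped store
```

An empty store fails the check in this model, which is the property the anti-gaming note describes: a passing result needs live traffic under the correct service name, not just the absence of legacy data.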

Structural observation: three correlated subscores

index_cleanup, e2e_trace, and config_standardized have perfectly correlated pass/fail across all 7 runs. They all depend on the same root decision (choosing bleat-api), so they always move together. This isn't a defect — it's a consequence of the task's layered structure where the rename decision gates everything downstream — but it means the apparent 4-way granularity overstates the actual score diversity.
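
The correlation can be made concrete with a small sketch that groups the seven runs by outcome. The per-run tuples restate the subscore table, with Run 1 taken as all zeros per the failure analysis; the names are illustrative.

```python
from collections import Counter

# Columns: index_cleanup, e2e_trace, config_standardized, storage_hardening.
runs = {
    1: (0, 0, 0, 0),
    2: (0, 0, 0, 1),
    3: (1, 1, 1, 1),
    4: (1, 1, 1, 1),
    5: (1, 1, 1, 1),
    6: (1, 1, 1, 1),
    7: (0, 0, 0, 1),
}

# The first three checks are the ones gated by the rename decision.
gated_move_together = all(len(set(result[:3])) == 1 for result in runs.values())

# Collapse each run to (got the name right, configured TTL).
patterns = Counter((result[0], result[3]) for result in runs.values())

print(gated_move_together)
print(dict(patterns))
```

In this data the three gated checks never disagree, and only three of the four possible (name, TTL) combinations occur: no run got the name right while failing storage_hardening.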

Task Design Quality

Strengths

Multi-layer sabotage is creative and well-crafted: sed rename + OTEL misconfiguration + synthetic legacy spans + operation pollution + no Badger TTL
CI/CD constraint (changes via Gitea → ArgoCD, selfHeal reverts kubectl) is realistic and adds meaningful operational complexity
Mattermost breadcrumb is subtle but unambiguous — tests whether agents trust human communication over code patterns
Continuous traffic generator deployed by setup.sh ensures traces flow post-fix — smart design that prevents timing-based false negatives
The authentication_service underscore is a well-designed distractor: every run notices it, failing runs waste 10-30 turns "fixing" it, but the grader's EXPECTED_SERVICES uses authentication_service (underscore) as canonical. Tests prioritization skills.
The incident report is realistic and well-written
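
The effect of the authentication_service distractor can be sketched as a set comparison. This is a simplified model, not the grader's implementation: the review confirms only that EXPECTED_SERVICES treats authentication_service (underscore) as canonical and that bleat-api is the correct name, so the two-element set below is a deliberately partial assumption, and `audit_services` is a hypothetical helper.

```python
# Partial, assumed contents; the real EXPECTED_SERVICES list is not shown in the review.
EXPECTED_SERVICES = {"bleat-api", "authentication_service"}


def audit_services(observed):
    """Compare observed service names against the expected canonical set."""
    return {
        "missing": sorted(EXPECTED_SERVICES - observed),
        "unexpected": sorted(observed - EXPECTED_SERVICES),
    }


# Wrong rename direction: the legacy name is kept.
wrong_direction = audit_services({"bleat-service", "authentication_service"})

# Correct rename, but the underscore was "fixed" to a hyphen.
fixed_distractor = audit_services({"bleat-api", "authentication-service"})

# Correct rename with the underscore left alone.
correct = audit_services({"bleat-api", "authentication_service"})

print(wrong_direction)
print(fixed_distractor)
print(correct)
```

Under these assumptions, renaming authentication_service moves a service out of the canonical set rather than into it, which is why time spent on it is wasted at best.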

Recommendations

The task needs additional difficulty for agents who get past the rename direction gate. Currently 100% of correct-direction agents score 1.0. Here are concrete options:

1. Make the Badger cleanup less straightforward (recommended)

Currently agents just rm -rf /mnt/data/badger/* and restart Jaeger. Consider:

Requiring cleanup while keeping Jaeger running (can't just wipe the whole store)
Adding other services' traces that must be preserved through the cleanup
Making the PVC access path less obvious (e.g., non-default mount path)

2. Turn the TTL configuration into a scored difficulty source

Among passing runs, TTL discovery causes the most turn variance (11 min vs 60 min) but never causes a score difference because all agents eventually find a working approach. Consider:

Requiring TTL to be configured via the Helm chart (not just kubectl set env, which ArgoCD would revert)
Requiring a specific TTL range that the agent must reason about (not just "any value ≤168h")
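
A range-based TTL check could look like the sketch below. It assumes TTL values are written as simple duration strings such as `72h` or `1h30m`; the parser, the function names, and the 24-hour lower bound used in the example are hypothetical. Only the 168h upper bound comes from the current task.

```python
import re
from datetime import timedelta

MAX_TTL = timedelta(hours=168)  # current upper bound mentioned in the review

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)([hms])")


def parse_duration(text):
    """Parse strings like '72h' or '1h30m' into a timedelta (h, m, s units only)."""
    position, total_seconds = 0, 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != position:
            raise ValueError(f"unrecognized duration: {text!r}")
        total_seconds += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    # Reject empty input and trailing characters the pattern did not consume.
    if position == 0 or position != len(text):
        raise ValueError(f"unrecognized duration: {text!r}")
    return timedelta(seconds=total_seconds)


def ttl_in_range(text, minimum, maximum=MAX_TTL):
    """Return True if the TTL lies within [minimum, maximum]."""
    return minimum <= parse_duration(text) <= maximum


# Hypothetical lower bound for illustration only.
MIN_TTL = timedelta(hours=24)
print(ttl_in_range("72h", MIN_TTL))
print(ttl_in_range("169h", MIN_TTL))
print(ttl_in_range("1h", MIN_TTL))
```

A two-sided bound like this would turn the TTL step into something the agent has to reason about, since very short values would fail alongside values above 168h.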

3. Add a second service with real naming drift

The authentication_service underscore is currently a distractor. Making it a scored check would add a parallel difficulty axis that doesn't depend on the rename direction gate. The agent would need to fix both bleat-service → bleat-api AND authentication_service → authentication-service through the CI/CD pipeline.

4. Make the CI/CD path harder

Currently agents handle the Gitea → ArgoCD flow without trouble. Consider:

Requiring changes in multiple repos (e.g., source code in bleater-app AND manifests in bleater-manifests)
Adding ArgoCD sync issues that the agent must resolve
Requiring the agent to verify the deployment actually rolled out (not just that the commit was pushed)

Setup & Solution Review

setup.sh: Well-structured 6-step sabotage. Commits deployment changes through Gitea (respecting the GitOps constraint), waits for ArgoCD sync, injects synthetic spans via OTLP/HTTP, deploys traffic generator, posts Mattermost breadcrumb. The sabotage verification step is a nice touch.

solution.sh: Clean 5-step fix. Fixes OTEL_SERVICE_NAME via Gitea commit, triggers ArgoCD refresh, wipes Badger, configures TTL via env vars, waits for stabilization. Correctly notes that the sed command is intentional (the application's init_tracing() hardcodes the service name via a Resource() that ignores OTEL_SERVICE_NAME).

Dockerfile: Minimal, no issues.
