@arubis
Last active April 3, 2026 00:13
Jaeger Query Service Index Mapping Conflict — Difficulty Recommendations for chrorrala

| Task | 378d049f (v12) by Christian (chrorrala) |
| --- | --- |
| Backend | Docker |
| Threshold | mean < 0.50 |
| Current mean | 0.64 (7 biggie-nebula runs) |
| Verdict | NEEDS_WORK |

The Problem: One Gate, Two Outcomes

All four subscores (index_cleanup, e2e_trace, config_standardized, storage_hardening) are perfectly correlated — they all hinge on a single decision: which direction to rename (bleat-service → bleat-api or the reverse).

| Rename direction | Runs | Score |
| --- | --- | --- |
| Correct (bleat-service → bleat-api) | 4 | 1.0 |
| Wrong (bleat-api → bleat-service) | 3 | 0.0 |

No agent scored a partial. The task is effectively a coin flip, not a graduated challenge.

```mermaid
graph LR
    A[Agent starts] --> B{Rename direction?}
    B -->|bleat-service → bleat-api| C[All 4 checks pass]
    B -->|bleat-api → bleat-service| D[All 4 checks fail]
    C --> E["Score: 1.0"]
    D --> F["Score: 0.0"]

    style C fill:#22c55e,color:#fff
    style D fill:#ef4444,color:#fff
    style E fill:#22c55e,color:#fff
    style F fill:#ef4444,color:#fff
```

Target: Break this bimodality by adding independent failure modes that create partial scores and real variance.


Recommended Additions

Two additions that target different subscores with different difficulty types:

| # | Addition | Subscore affected | Difficulty type | Independence |
| --- | --- | --- | --- | --- |
| P3 | Helm reconciliation trap | storage_hardening | Temporal/order-sensitive | Fully independent of rename |
| P5 | OTEL propagator mismatch | e2e_trace | Cascading dependency | Triggered after correct rename |

How They Reshape the Score Landscape

```mermaid
graph TD
    A[Agent starts] --> B{Rename direction?}
    B -->|Wrong| F0["0.0 — same as today"]
    B -->|Correct| C[index_cleanup ✅ config_standardized ✅]

    C --> D{Discovers propagator mismatch?}
    D -->|No| G1["e2e_trace ✗ — spans exist but unlinked"]
    D -->|Yes, fixes B3→W3C| G2["e2e_trace ✅"]

    C --> E{TTL fix survives CronJob revert?}
    E -->|"No (kubectl set env)"|H1["storage_hardening ✗ — reverted before grader"]
    E -->|"Yes (Helm upgrade)"|H2["storage_hardening ✅"]

    G1 --> S1["Score: 0.50"]
    G2 --> S2["Score: 0.75 or 1.0"]
    H1 --> S1
    H2 --> S2

    style F0 fill:#ef4444,color:#fff
    style G1 fill:#f59e0b,color:#fff
    style H1 fill:#f59e0b,color:#fff
    style G2 fill:#22c55e,color:#fff
    style H2 fill:#22c55e,color:#fff
    style S1 fill:#f59e0b,color:#fff
    style S2 fill:#22c55e,color:#fff
```

The key shift: an agent that gets the rename right now faces two more independent gates instead of coasting to 1.0.


P3 — Helm Reconciliation Trap

Concept

A CronJob in the observability namespace silently reverts Jaeger's TTL environment variables every 2–3 minutes. The agent's kubectl set env fix works momentarily, then disappears. The durable fix requires helm upgrade.

Why It Works

This exploits a common agent behavior: apply fix → verify immediately → move on. Real DevOps engineers know that kubectl set env on a Helm-managed deployment is ephemeral. The CronJob makes that lesson concrete.

Discovery Path (agent-fair)

  1. kubectl get cronjobs -n observability — CronJob is visible
  2. helm list -n observability — Jaeger is a Helm release
  3. Agent notices TTL disappeared after a few minutes → investigates → finds CronJob

Implementation Sketch

setup.sh — add after Jaeger deployment:

```bash
# CronJob that reverts TTL env vars every 2 minutes
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: jaeger-config-enforcer
  namespace: observability
spec:
  schedule: "*/2 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: enforcer
            image: bitnami/kubectl:latest
            command: ["kubectl", "set", "env", "deployment/jaeger",
                      "-n", "observability",
                      "SPAN_STORAGE_BADGER_SPAN_STORE_TTL-",
                      "BADGER_SPAN_STORE_TTL-"]
          restartPolicy: OnFailure
EOF
```

Air-gap note: bitnami/kubectl (or equivalent) must be pre-pulled into the snapshot. Alternatively, use a lightweight image already present in the cluster.

solution.sh — replace kubectl set env with:

```bash
# Disable the CronJob that reverts TTL settings
kubectl patch cronjob jaeger-config-enforcer -n observability \
  -p '{"spec":{"suspend":true}}'

# Apply TTL durably via Helm
helm upgrade jaeger <chart-path> -n observability \
  --set badger.spanStoreTTL=72h \
  --reuse-values
```

grader.py — storage_hardening check should:

  • Wait ≥3 minutes (outlast one CronJob cycle) before checking TTL
  • Verify TTL is set on the running pod (not just the deployment spec)
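A minimal sketch of that pod-side check, assuming the grader obtains the running pod's JSON (e.g. via `kubectl get pod <jaeger-pod> -o json`); the helper name `ttl_set_on_running_pod` is hypothetical, and the TTL variable names simply mirror the ones the CronJob strips:

```python
# Hypothetical storage_hardening helper — not grader.py's actual code.
# Checks env vars on the *running pod*, not the deployment spec.
# time.sleep(180)  # in the real grader: outlast one 2-minute CronJob cycle first

REQUIRED_TTL_VARS = {"SPAN_STORAGE_BADGER_SPAN_STORE_TTL", "BADGER_SPAN_STORE_TTL"}

def ttl_set_on_running_pod(pod: dict) -> bool:
    """True iff every required TTL env var survives on the pod's containers."""
    env_names = {
        e["name"]
        for container in pod["spec"]["containers"]
        for e in container.get("env", [])
    }
    return REQUIRED_TTL_VARS <= env_names
```

If the agent's `kubectl set env` fix was reverted by an intervening CronJob run, the pod spec no longer carries the vars and the check fails, which is exactly the trap's intent.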

P5 — OTEL Propagator Mismatch

Concept

The api-gateway is configured with OTEL_PROPAGATORS=b3 while backend services use the default W3C TraceContext (tracecontext,baggage). Traces from individual services appear in Jaeger, but cross-service spans aren't linked — no parent-child references.

Why It Works

This creates a cascading dependency: the agent fixes the rename, generates new traces, checks Jaeger, and sees bleat-api traces. But the e2e_trace check fails because spans aren't correlated. The agent must dig deeper into why traces exist but aren't connected.

Discovery Path (agent-fair)

  1. kubectl describe pod <api-gateway-pod> -n bleater — shows OTEL_PROPAGATORS=b3
  2. Jaeger UI: traces for api-gateway and bleat-api exist independently but never appear in the same trace
  3. Standard OTel knowledge: B3 and W3C TraceContext use different HTTP headers (X-B3-TraceId vs traceparent), so context doesn't propagate across the boundary

Implementation Sketch

setup.sh — add to the Gitea sabotage commit (bleater-manifests):

```yaml
# In api-gateway deployment template, add env var:
- name: OTEL_PROPAGATORS
  value: "b3"
```


Since api-gateway is ArgoCD-managed, this must go through Gitea like the other sabotage.

solution.sh — add to the Gitea fix commit:

```yaml
# Remove or correct the propagator override:
- name: OTEL_PROPAGATORS
  value: "tracecontext,baggage"
```

grader.py — the existing check_e2e_trace() already requires parent-child span references between api-gateway and bleat-api. It should naturally fail when propagation is broken. Verify that:

  • The check distinguishes "spans exist but unlinked" from "no spans at all"
  • It doesn't pass if both services have traces but in separate trace IDs
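That distinction can be sketched as follows — a hypothetical helper (not the actual `check_e2e_trace()`), using a simplified span shape loosely modeled on Jaeger's query API JSON (`spanID`, `process.serviceName`, `references` with `refType: CHILD_OF`):

```python
# Hypothetical linkage check: both services must appear in the SAME trace
# AND be joined by a cross-service CHILD_OF reference. Coexisting spans in
# separate traces ("spans exist but unlinked") must NOT pass.

def services_linked(traces: list[dict], svc_a: str, svc_b: str) -> bool:
    for trace in traces:
        spans = trace["spans"]
        svc_by_id = {s["spanID"]: s["process"]["serviceName"] for s in spans}
        if not {svc_a, svc_b} <= set(svc_by_id.values()):
            continue  # this trace doesn't even contain both services
        for span in spans:
            for ref in span.get("references", []):
                parent_svc = svc_by_id.get(ref.get("spanID"))
                if (ref.get("refType") == "CHILD_OF"
                        and parent_svc is not None
                        and parent_svc != span["process"]["serviceName"]):
                    return True  # genuine cross-service parent-child link
    return False
```

With the propagator mismatch in place, api-gateway and bleat-api each produce their own root traces, so no single trace satisfies both conditions and the check correctly fails.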

Projected Impact

| Scenario | Probability | Score | Subscores |
| --- | --- | --- | --- |
| Wrong rename | ~3/7 (43%) | 0.00 | all fail |
| Right rename, misses both P3+P5 | ~1/7 (14%) | 0.50 | index+config pass |
| Right rename, solves one of P3/P5 | ~2/7 (29%) | 0.75 | 3 of 4 pass |
| Right rename, solves both | ~1/7 (14%) | 1.00 | all pass |

Estimated new mean: ~0.43 (vs current 0.64)

These are rough projections. The actual distribution depends on how discoverable the breadcrumbs are and how much time agents spend investigating vs. brute-forcing. The key structural improvement is that four different score values are now possible (0.0, 0.50, 0.75, 1.0), breaking the bimodal pattern.
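For transparency, the projection arithmetic implied by the table's per-scenario probabilities is just a weighted mean:

```python
# Weighted mean over the projected scenarios (probabilities from the table,
# treated as rough fractions of 7 runs).
scenarios = [
    (3/7, 0.00),  # wrong rename
    (1/7, 0.50),  # right rename, misses both P3+P5
    (2/7, 0.75),  # right rename, solves one of P3/P5
    (1/7, 1.00),  # right rename, solves both
]
projected_mean = sum(p * score for p, score in scenarios)  # 3/7 ≈ 0.43
```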


Optional Addition: P4 (authentication_service Underscore Fix)

If P3+P5 overshoots (mean drops too far below 0.50), or if you want a fifth subscore for finer granularity:

What: Score the existing authentication_service → authentication-service naming drift as a real check under config_standardized. The agent must generalize the bleat-service fix pattern to a second service.

Why it's lower priority: It tests pattern repetition rather than new investigation skills. Agents that solve the bleat rename will likely solve this too, so it adds time pressure more than cognitive difficulty.

If added, move to 5 subscores × 0.20 each.


Other Review Findings

The full review is at gist/9ea8d25. Beyond the difficulty threshold, no blockers were identified. Highlights:

Grader quality: Clean. Four equal-weight subscores, reasonable anti-gaming measures (e2e_trace requires real traffic with parent-child refs; can't score by wiping all data). 60-second stabilization wait and 60-minute trace freshness window are generous enough to avoid timing false negatives. No grader defects found.

Information isolation: Good. Dockerfile doesn't COPY setup.sh/solution.sh/grader.py (harness mounts externally). The Gitea commit message ("standardize bleat-service OTEL naming") is visible to agents but doesn't reveal the correct fix direction. The Mattermost breadcrumb is more revealing — it explicitly says "OTEL_SERVICE_NAME env var still says bleat-service" — but 3/7 agents override this with structural reasoning, so it's functioning as a calibrated hint rather than giving away the answer.

Task design: Strong. Multi-layer sabotage is creative (sed rename + OTEL misconfiguration + synthetic legacy spans + operation pollution + no Badger TTL). The CI/CD constraint (Gitea → ArgoCD, selfHeal reverts kubectl) is realistic. The authentication_service underscore is a well-designed distractor that costs failing agents 10–30 turns. The incident report reads like a real page.

One minor note: The Dockerfile uses base image nebula-devops:1.0.3. If a newer base version has relevant changes (e.g., different service configurations), the task may need rebasing. Not a blocker for this review.


Summary

| Change | Files touched | Effort |
| --- | --- | --- |
| P3: CronJob revert trap | setup.sh, solution.sh, grader.py | Medium — new K8s resource + Helm upgrade path |
| P5: Propagator mismatch | setup.sh (Gitea sabotage), solution.sh, grader.py (verify existing check) | Low-medium — env var addition + verify grader |

Both additions are agent-fair (discoverable from standard investigation), independently gradeable (target different subscores), and structurally sound (create real variance, not coin flips). Together they should bring the mean comfortably below the 0.50 docker threshold.
