| Task | 378d049f (v12) by Christian (chrorrala) |
| Backend | Docker |
| Threshold | mean < 0.50 |
| Current mean | 0.64 (7 biggie-nebula runs) |
| Verdict | NEEDS_WORK |
All four subscores (index_cleanup, e2e_trace, config_standardized, storage_hardening) are perfectly correlated — they all hinge on a single decision: which direction to rename (bleat-service → bleat-api or the reverse).
| Rename direction | Runs | Score |
|---|---|---|
| Correct (bleat-service → bleat-api) | 4 | 1.0 |
| Wrong (bleat-api → bleat-service) | 3 | 0.0 |
No agent scored a partial. The task is effectively a coin flip, not a graduated challenge.
```mermaid
graph LR
    A[Agent starts] --> B{Rename direction?}
    B -->|bleat-service → bleat-api| C[All 4 checks pass]
    B -->|bleat-api → bleat-service| D[All 4 checks fail]
    C --> E["Score: 1.0"]
    D --> F["Score: 0.0"]
    style C fill:#22c55e,color:#fff
    style D fill:#ef4444,color:#fff
    style E fill:#22c55e,color:#fff
    style F fill:#ef4444,color:#fff
```
Target: Break this bimodality by adding independent failure modes that create partial scores and real variance.
Two additions that target different subscores with different difficulty types:
| # | Addition | Subscore affected | Difficulty type | Independence |
|---|---|---|---|---|
| P3 | Helm reconciliation trap | storage_hardening | Temporal/order-sensitive | Fully independent of rename |
| P5 | OTEL propagator mismatch | e2e_trace | Cascading dependency | Triggered after correct rename |
```mermaid
graph TD
    A[Agent starts] --> B{Rename direction?}
    B -->|Wrong| F0["0.0 — same as today"]
    B -->|Correct| C[index_cleanup ✅ config_standardized ✅]
    C --> D{Discovers propagator mismatch?}
    D -->|No| G1["e2e_trace ✗ — spans exist but unlinked"]
    D -->|Yes, fixes B3→W3C| G2["e2e_trace ✅"]
    C --> E{TTL fix survives CronJob revert?}
    E -->|"No (kubectl set env)"| H1["storage_hardening ✗ — reverted before grader"]
    E -->|"Yes (Helm upgrade)"| H2["storage_hardening ✅"]
    G1 --> S1["Score: 0.50"]
    G2 --> S2["Score: 0.75 or 1.0"]
    H1 --> S1
    H2 --> S2
    style F0 fill:#ef4444,color:#fff
    style G1 fill:#f59e0b,color:#fff
    style H1 fill:#f59e0b,color:#fff
    style G2 fill:#22c55e,color:#fff
    style H2 fill:#22c55e,color:#fff
    style S1 fill:#f59e0b,color:#fff
    style S2 fill:#22c55e,color:#fff
```
The key shift: an agent that gets the rename right now faces two more independent gates instead of coasting to 1.0.
A CronJob in the observability namespace silently reverts Jaeger's TTL environment variables every 2–3 minutes. The agent's `kubectl set env` fix works momentarily, then disappears. The durable fix requires `helm upgrade`.
This exploits a common agent behavior: apply fix → verify immediately → move on. Real DevOps engineers know that `kubectl set env` on a Helm-managed deployment is ephemeral. The CronJob makes that lesson concrete.
- `kubectl get cronjobs -n observability` — CronJob is visible
- `helm list -n observability` — Jaeger is a Helm release
- Agent notices TTL disappeared after a few minutes → investigates → finds CronJob
`setup.sh` — add after Jaeger deployment:

```bash
# CronJob that reverts TTL env vars every 2 minutes
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: jaeger-config-enforcer
  namespace: observability
spec:
  schedule: "*/2 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: enforcer
            image: bitnami/kubectl:latest
            command: ["kubectl", "set", "env", "deployment/jaeger",
                      "-n", "observability",
                      "SPAN_STORAGE_BADGER_SPAN_STORE_TTL-",
                      "BADGER_SPAN_STORE_TTL-"]
          restartPolicy: OnFailure
EOF
```

Air-gap note: `bitnami/kubectl` (or equivalent) must be pre-pulled into the snapshot. Alternatively, use a lightweight image already present in the cluster.
`solution.sh` — replace `kubectl set env` with:

```bash
# Disable the CronJob that reverts TTL settings
kubectl patch cronjob jaeger-config-enforcer -n observability \
  -p '{"spec":{"suspend":true}}'

# Apply TTL durably via Helm
helm upgrade jaeger <chart-path> -n observability \
  --set badger.spanStoreTTL=72h \
  --reuse-values
```

`grader.py` — storage_hardening check should:
- Wait ≥3 minutes (outlast one CronJob cycle) before checking TTL
- Verify TTL is set on the running pod (not just the deployment spec)
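A minimal sketch of that check's durable-TTL logic (`get_pod_env` is a hypothetical callable; the real grader would read the running pod's env via kubectl or the Kubernetes API):

```python
import time

def check_storage_hardening(get_pod_env, wait_seconds: int = 200) -> bool:
    """Pass only if a TTL env var is still set on the *running pod*
    after outlasting at least one 2-minute CronJob cycle; an ephemeral
    `kubectl set env` fix will have been reverted by then.
    `get_pod_env` returns the Jaeger pod's env vars as a dict."""
    time.sleep(wait_seconds)  # outlast at least one enforcer run
    env = get_pod_env()
    return bool(env.get("SPAN_STORAGE_BADGER_SPAN_STORE_TTL")
                or env.get("BADGER_SPAN_STORE_TTL"))
```

Checking the pod (not the deployment spec) is what makes the Helm path mandatory: the enforcer strips the vars from the live deployment regardless of how they were first applied, so only an agent who suspends it and applies the value durably passes.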
The api-gateway is configured with `OTEL_PROPAGATORS=b3` while backend services use the default W3C TraceContext (`tracecontext,baggage`). Traces from individual services appear in Jaeger, but cross-service spans aren't linked — no parent-child references.
This creates a cascading dependency: the agent fixes the rename, generates new traces, checks Jaeger, and sees bleat-api traces. But the e2e_trace check fails because spans aren't correlated. The agent must dig deeper into why traces exist but aren't connected.
- `kubectl describe pod <api-gateway-pod> -n bleater` — shows `OTEL_PROPAGATORS=b3`
- Jaeger UI: traces for `api-gateway` and `bleat-api` exist independently but never appear in the same trace
- Standard OTel knowledge: B3 and W3C TraceContext use different HTTP headers (`X-B3-TraceId` vs `traceparent`), so context doesn't propagate across the boundary
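The boundary failure can be shown with a toy sketch (not the real OpenTelemetry SDK): a B3 injector writes `X-B3-*` headers, while a W3C extractor reads only `traceparent`, so the backend never finds a parent context and starts a fresh trace.

```python
# Toy propagator sketch -- illustrates the header mismatch only.
def inject_b3(trace_id: str, span_id: str) -> dict:
    """B3 propagation writes X-B3-* headers."""
    return {"X-B3-TraceId": trace_id, "X-B3-SpanId": span_id}

def extract_w3c(headers: dict):
    """W3C TraceContext extraction looks only at `traceparent`."""
    tp = headers.get("traceparent")
    if tp is None:
        return None  # no parent found -> backend starts a fresh trace
    _version, trace_id, span_id, _flags = tp.split("-")
    return trace_id, span_id

headers = inject_b3("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
assert extract_w3c(headers) is None  # context lost at the boundary
```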
`setup.sh` — add to the Gitea sabotage commit (bleater-manifests):

```yaml
# In api-gateway deployment template, add env var:
- name: OTEL_PROPAGATORS
  value: "b3"
```

Since api-gateway is ArgoCD-managed, this must go through Gitea like the other sabotage.
`solution.sh` — add to the Gitea fix commit:

```yaml
# Remove or correct the propagator override:
- name: OTEL_PROPAGATORS
  value: "tracecontext,baggage"
```

`grader.py` — the existing `check_e2e_trace()` already requires parent-child span references between api-gateway and bleat-api, so it should naturally fail when propagation is broken. Verify that:
- The check distinguishes "spans exist but unlinked" from "no spans at all"
- It doesn't pass if both services have traces but in separate trace IDs
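That three-way distinction can be sketched against Jaeger-HTTP-API-shaped trace JSON (the real `check_e2e_trace()` may structure this differently):

```python
# Sketch: classify e2e-trace state from Jaeger-API-style trace objects:
# [{"processes": {pid: {"serviceName": ...}},
#   "spans": [{"processID": pid, "references": [...]}, ...]}, ...]
WANTED = {"api-gateway", "bleat-api"}

def classify_e2e(traces: list) -> str:
    """Return LINKED, UNLINKED, or NO_SPANS for the two target services."""
    seen = set()
    for t in traces:
        services = {t["processes"][s["processID"]]["serviceName"]
                    for s in t["spans"]}
        seen |= services & WANTED
        # both services inside ONE trace, with at least one span reference
        if WANTED <= services and any(s["references"] for s in t["spans"]):
            return "LINKED"
    # both services emitted spans, but never in the same trace
    return "UNLINKED" if seen == WANTED else "NO_SPANS"
```

Only `LINKED` should score; reporting `UNLINKED` separately also gives agents (and reviewers) a useful failure signal for the propagator trap.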
| Scenario | Probability | Score | Subscores |
|---|---|---|---|
| Wrong rename | ~3/7 (43%) | 0.00 | all fail |
| Right rename, misses both P3+P5 | ~1/7 (14%) | 0.50 | index+config pass |
| Right rename, solves one of P3/P5 | ~2/7 (29%) | 0.75 | 3 of 4 pass |
| Right rename, solves both | ~1/7 (14%) | 1.00 | all pass |
Estimated new mean: ~0.43 (vs current 0.64)
These are rough projections. The actual distribution depends on how discoverable the breadcrumbs are and how much time agents spend investigating vs. brute-forcing. The key structural improvement is that three different score values are now possible (0.0, 0.50, 0.75, 1.0), breaking the bimodal pattern.
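As a sanity check, the mean implied by the table's exact run fractions (rather than the rounded percentages) can be recomputed:

```python
# Expected score under the scenario table above, using exact fractions.
from fractions import Fraction

scenarios = [  # (probability, score)
    (Fraction(3, 7), 0.00),  # wrong rename
    (Fraction(1, 7), 0.50),  # right rename, misses both P3+P5
    (Fraction(2, 7), 0.75),  # right rename, solves one of P3/P5
    (Fraction(1, 7), 1.00),  # right rename, solves both
]
mean = sum(p * s for p, s in scenarios)
print(round(float(mean), 2))  # -> 0.43
```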
If P3+P5 overshoots (mean drops too far below 0.50), or if you want a fifth subscore for finer granularity:
What: Score the existing authentication_service → authentication-service naming drift as a real check under config_standardized. The agent must generalize the bleat-service fix pattern to a second service.
Why it's lower priority: It tests pattern repetition rather than new investigation skills. Agents that solve the bleat rename will likely solve this too, so it adds time pressure more than cognitive difficulty.
If added, move to 5 subscores × 0.20 each.
The full review is at gist/9ea8d25. Beyond the difficulty threshold, no blockers were identified. Highlights:
Grader quality: Clean. Four equal-weight subscores, reasonable anti-gaming measures (e2e_trace requires real traffic with parent-child refs; can't score by wiping all data). 60-second stabilization wait and 60-minute trace freshness window are generous enough to avoid timing false negatives. No grader defects found.
Information isolation: Good. Dockerfile doesn't COPY setup.sh/solution.sh/grader.py (harness mounts externally). The Gitea commit message ("standardize bleat-service OTEL naming") is visible to agents but doesn't reveal the correct fix direction. The Mattermost breadcrumb is more revealing — it explicitly says "OTEL_SERVICE_NAME env var still says bleat-service" — but 3/7 agents override this with structural reasoning, so it's functioning as a calibrated hint rather than giving away the answer.
Task design: Strong. Multi-layer sabotage is creative (sed rename + OTEL misconfiguration + synthetic legacy spans + operation pollution + no Badger TTL). The CI/CD constraint (Gitea → ArgoCD, selfHeal reverts kubectl) is realistic. The authentication_service underscore is a well-designed distractor that costs failing agents 10–30 turns. The incident report reads like a real page.
One minor note: The Dockerfile uses base image nebula-devops:1.0.3. If a newer base version has relevant changes (e.g., different service configurations), the task may need rebasing. Not a blocker for this review.
| Change | Files touched | Effort |
|---|---|---|
| P3: CronJob revert trap | setup.sh, solution.sh, grader.py | Medium — new K8s resource + Helm upgrade path |
| P5: Propagator mismatch | setup.sh (Gitea sabotage), solution.sh, grader.py (verify existing check) | Low-medium — env var addition + verify grader |
Both additions are agent-fair (discoverable from standard investigation), independently gradeable (target different subscores), and structurally sound (create real variance, not coin flips). Together they should bring the mean comfortably below the 0.50 docker threshold.