Task ID: jaeger-query-service-indexmapping-conflict
UUID: 378d049f-e957-4c78-9fa8-d1b84684d37e
Version: 12
Author: chrorrala (Christian Orrala)
Category: Platform Engineering
Reviewer: Dylan Fitzgerald
Date: 2026-04-03
Eval version: v12 (local files are v11; the API returned a 500 on download)
Backend: Docker (author confirmed in Discord: "I reverted to the normal docker as instructed")
| Task UUID | a4bc3f9c-fe11-4473-960c-2717c39f2417 |
| Version | 19 |
| Thread | https://discord.com/channels/1427397917685321919/1487685828166553737 |
| Review date | 2026-04-02 |
A systematic process for adjusting AI agent task difficulty when scores are too high (task too easy) or too low (task too hard).
For acceptance criteria and pass rate thresholds, see Task Review Guide. For failure analysis methodology, see Task Eval Analysis.
- Task scores above the target threshold (>70% pass rate: too easy)
- Task scores below the target threshold (0% pass rate even though the solution works: artificial failures)
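The two triggers above can be sketched as a small triage helper. This is only an illustration: the function name, the return labels, and the fallback label for all other cases are assumptions, and only the 70% cutoff and the "0% but solution works" condition come from the text.

```python
def triage(pass_rate: float, solution_passes: bool, easy_threshold: float = 0.70) -> str:
    """Classify a task by its agent pass rate (hypothetical helper).

    pass_rate is a fraction in [0, 1]; solution_passes says whether the
    reference solution itself passes the grader.
    """
    if pass_rate > easy_threshold:
        # More than 70% of runs pass: the task is too easy.
        return "too_easy"
    if pass_rate == 0.0 and solution_passes:
        # No agent run passes although the solution works: the failures
        # are likely artificial rather than a sign of real difficulty.
        return "artificial_failures"
    # Assumed fallback label; the text names no other categories.
    return "no_adjustment_indicated"
```

For example, `triage(0.85, True)` returns `"too_easy"`, and a rate of exactly 0.70 is not flagged, because the text says "above" the threshold.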
"""Grader for postgres-cve-2024-7348-pg-dump-privesc.
Checks:
1. version_remediated — PostgreSQL image is >= 16.4 (not 16.3 or earlier)
2. search_path_hardened — POSTGRES_OPTIONS env var no longer sets unsafe search_path
3. deployment_healthy — Deployment has desired replicas ready
4. database_reachable — PostgreSQL responds to a basic health query
"""
import subprocess
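The grader body is not shown here, so the following is only a sketch of how the first check, `version_remediated`, might compare an image tag against 16.4. The helper names and the parsing rules are assumptions, not the grader's actual code. Comparing integer tuples rather than strings avoids ranking `16.10` below `16.4`.

```python
import re

# Minimum fixed version named in the docstring above.
MIN_VERSION = (16, 4)

def parse_pg_version(image: str):
    """Return (major, minor) from an image reference such as
    'postgres:16.4-alpine', or None if no major.minor tag is present."""
    # Drop any registry/repository prefix so a registry port is not
    # mistaken for the tag separator.
    name = image.rsplit("/", 1)[-1]
    if ":" not in name:
        return None
    tag = name.split(":", 1)[1]
    match = re.match(r"(\d+)\.(\d+)", tag)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def version_remediated(image: str) -> bool:
    """True only when the tag parses and is at least MIN_VERSION."""
    version = parse_pg_version(image)
    return version is not None and version >= MIN_VERSION
```

In this sketch, tags without an explicit minor version (for example `postgres:latest`) are treated as not remediated, which is a conservative assumption.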
Task: 9cd2b86e-15bf-4c76-8685-efffd1114c8f
Version: v16
Task ID: argocd_hook_reconciliation_deadlock
Category: platform-eng
Author: hafis_83579
Reviewer: Dylan (primary)
Discord: https://discord.com/channels/1427397917685321919/1483534668413407262
Task: bd24c35b-157b-400b-bcdb-88e539b2467c
Version: 18 · Category: SRE · Difficulty: hard
Verdict: NEEDS_WORK
Solution passes (1.0). Mean score is 0.60 across 8 biggie-nebula runs, below 0.70. Every subscore varies across runs. The task is well-designed and close to approval, but it has one grader defect that produces non-deterministic failures unrelated to agent skill.
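A summary like the one above (mean score against 0.70, and which subscores vary) could be computed from per-run results roughly as follows. This is a sketch under stated assumptions: the input shape (one dict of subscore name to score per run) and the unweighted mean as the run score are hypothetical, since the actual scoring formula is not given here.

```python
from statistics import mean, pvariance

def summarize_runs(runs, threshold=0.70):
    """Summarize eval runs (hypothetical helper).

    runs: non-empty list of dicts, each mapping subscore name -> score
    in [0, 1] for one run. All runs are assumed to share the same keys.
    """
    names = sorted(runs[0])
    # Assumption: a run's score is the unweighted mean of its subscores.
    run_scores = [mean(run[name] for name in names) for run in runs]
    # Population variance of each subscore across runs.
    variance = {name: pvariance([run[name] for run in runs]) for name in names}
    overall = mean(run_scores)
    return {
        "mean_score": overall,
        "below_threshold": overall < threshold,
        "varying_subscores": [name for name in names if variance[name] > 0],
    }
```

A subscore that appears in `varying_subscores` did not take the same value in every run. On its own that does not separate grader non-determinism from differences in agent behavior, so the failing runs still need to be read individually.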