A systematic process for adjusting AI agent task difficulty when scores are too high (task too easy) or too low (task too hard).
For acceptance criteria and pass rate thresholds, see Task Review Guide. For failure analysis methodology, see Task Eval Analysis.
- Task scoring above target threshold (>70% pass rate - too easy)
- Task scoring below target threshold (0% but solution works - artificial failures)
- New task needs calibration before deployment
The task should feel like one realistic incident with interconnected problems, NOT a punchlist of unrelated jobs.
Good: "The GitOps pipeline is broken" -> agent investigates -> finds cascading failures Bad: "Fix these 5 things: runner, credentials, disk, ArgoCD path, image updater"
All problems should trace back to a single narrative:
- "Deployment outage" that requires investigating multiple symptoms
- Problems that reveal each other (fixing A exposes B)
- Unified success criteria: "the system works end-to-end"
The user MUST write all English-language content that will be fed to the model:
- task.yaml prompt text
- Gitea issue titles and bodies
- Any documentation the agent will read
- Contents of data/ directory
Claude's role:
- Provide samples with guidance
- Indicate file path and line number for each change
- Validate afterward that user's text fits requirements
- Verify grader remains fair to user's wording
| Technique | Example |
|---|---|
| Vague issue titles | "Harbor push failures" -> "Image deployment pipeline stuck" |
| Remove backup secrets | Force credential discovery from password manager |
| Add red herrings | Issues about unrelated problems |
| Remove acceptance criteria | Single outcome statement vs bullet checklist |
| Technique | Example |
|---|---|
| Break infrastructure itself | Not just credentials, but Harbor deployment scaled to 0 |
| Corrupt automation | Break ArgoCD image updater config, not just app |
| Multi-layer credential chain | Creds -> vault -> token -> Keycloak |
| Break networking/DNS | Corrupt resolution for *.devops.local |
| Technique | Example |
|---|---|
| CronJob re-corruption | Periodically re-breaks something agent fixed |
| Trap fixes | Obvious fix triggers webhook breaking something else |
| Require git commits | Must edit manifests in repo, not just kubectl patch |
| Authentication barriers | Need to auth to ArgoCD CLI, not just use kubectl |
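The "require git commits" row can be graded by comparing the manifests repo against a baseline commit recorded at setup time. A minimal sketch, assuming a local clone at `repo_dir` and a hypothetical `manifests/` directory (the real repo layout may differ):

```python
import subprocess

def manifests_changed_since(repo_dir: str, baseline_sha: str) -> bool:
    """True if any commit after the setup-time baseline touched manifests/.

    A live `kubectl patch` leaves no trace in git history, so this
    distinguishes real GitOps fixes from cluster-side shortcuts.
    """
    log = subprocess.run(
        ["git", "-C", repo_dir, "log", "--oneline",
         f"{baseline_sha}..HEAD", "--", "manifests/"],
        capture_output=True, text=True, check=True,
    ).stdout
    return bool(log.strip())
```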
| Technique | Example |
|---|---|
| Partial scoring | Use equal-weight subscores instead of all-or-nothing scoring. To shift the difficulty signal, group related checks into a single subscore rather than adjusting weights (equal weights are a platform requirement); see the sketch after this table |
| Anti-cheat checks | Verify agent read issues before fixing |
| Require explanation | Must create post-mortem markdown file |
| Order validation | Check agent understood cascading failures |
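A minimal sketch of the grouped-subscore idea, with hypothetical stand-ins for real grader probes:

```python
def harbor_running() -> bool:     # placeholder for a real kubectl probe
    return True

def creds_valid() -> bool:        # placeholder
    return True

def image_pull_works() -> bool:   # placeholder
    return True

def runner_stable() -> bool:      # placeholder
    return True

def grade() -> dict[str, int]:
    """Equal-weight subscores; difficulty is tuned by grouping, not weights."""
    return {
        # Grouped: the whole registry chain must work to earn this point,
        # which raises difficulty without touching the (fixed) weights.
        "registry_chain": int(
            harbor_running() and creds_valid() and image_pull_works()
        ),
        "ci_works": int(runner_stable()),
    }
```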
Create problems that require ordered fixes:

Problem A (surface)  <- Agent sees this first
  | fixing A reveals...
Problem B (hidden)   <- Only visible after A fixed
  | fixing B reveals...
Problem C (deep)     <- Final layer
Example:

Harbor scaled to 0 (can't push/pull)
  | scaling up reveals...
Harbor credentials wrong (auth fails)
  | fixing creds reveals...
Image updater config broken (auto-updates fail)
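On the setup side, a cascade like this is just ordered breakage. A sketch in Python for consistency with the other examples here (a real setup.sh would shell these commands out directly); all resource names and namespaces are placeholders:

```python
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# Layer 1 (surface): Harbor is down, so nothing can push or pull.
run("kubectl", "-n", "harbor", "scale", "deployment/harbor-registry",
    "--replicas=0")

# Layer 2 (hidden until layer 1 is fixed): the pull secret is corrupt, so
# scaling Harbor back up still leaves auth failing. "e30=" is base64 for
# "{}", i.e. an empty dockerconfig.
run("kubectl", "-n", "default", "patch", "secret", "harbor-creds",
    "-p", '{"data":{".dockerconfigjson":"e30="}}')

# Layer 3 (deep) would corrupt the image-updater config the same way.
```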
| UNDERSPECIFIED | BALANCED | OVERSPECIFIED |
|---|---|---|
| "Fix it" | "System works E2E, investigate issues" | "Fix runner env vars, restore harbor-creds, scale harbor-registry..." |
| Unfair (agent can't know success) | Challenging (requires thinking) | Too easy (checklist execution) |
| Sign | Example | Problem |
|---|---|---|
| Bullet-point acceptance criteria | "- Runner stable, - Harbor accessible" | Becomes a checklist |
| Explicit component names | "Fix the harbor-creds secret" | Names the solution |
| Step-by-step hints | "Check Harbor, then credentials, then..." | Removes investigation |
| Issues name exact fixes | "Scale harbor-registry to 1" | No thinking required |
| Sign | Example | Problem |
|---|---|---|
| No success criteria | "Make it work" | Agent can't verify completion |
| Grader checks undiscoverable things | Checks secret name never mentioned | Unfair gotcha |
| No entry point | No issues, no error messages | Where to start? |
| Hidden requirements | Grader expects specific approach | Agent might solve differently |
For each grader check, ensure ONE of:
- Explicit in task.yaml: "ArgoCD must be Synced+Healthy"
- Hinted in Gitea issues: "Argo can't sync, manifest errors" (task.yaml points to issues)
- In referenced docs: Wiki, data/ files, or other docs task.yaml links to
- Discoverable in cluster: Agent can find via kubectl, logs, existing configs
- Implicit in E2E: If the E2E test passes, this must be working
Rule: If you can't point to where the agent would learn about a requirement, it's underspecified.
Specification chain: task.yaml -> points to issues/docs -> issues hint at problems -> agent investigates -> discovers specifics
Issues should read like reports from the same outage, not separate tickets:
Bad (disconnected):
- Issue 1: "Runner has wrong env vars"
- Issue 2: "Harbor credentials are bad"
- Issue 3: "Disk is full"
- Issue 4: "ArgoCD path is wrong"
Good (cohesive incident):
- Issue 1: "CI builds failing intermittently" (symptom of runner)
- Issue 2: "Images not deploying, registry issues" (symptom of Harbor chain)
- Issue 3: "ArgoCD broken, auto-updates stopped" (symptom of ArgoCD chain)
- Issue 4: "Sporadic failures, resource errors" (symptom of disk)
All issues should:
- Reference the same outage/incident
- Describe symptoms an engineer would actually report
- Cross-reference each other naturally ("might be related to the registry issues")
Instead of checking N separate things, frame as one outcome:
Bad: "Runner stable AND deployments available AND no pull errors AND ArgoCD synced AND..."
Good: "The GitOps pipeline works end-to-end: commit triggers successful deployment"
The E2E test becomes the primary success signal; component checks are supporting validation.
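A sketch of what the E2E signal can look like in grader.py, assuming an ArgoCD Application named `bleater` in the `argocd` namespace (both placeholders), after the grader has pushed a trigger commit (omitted here):

```python
import json
import subprocess
import time

def argocd_app_status(name: str, namespace: str = "argocd") -> dict:
    """Read the Application CR's status via kubectl."""
    out = subprocess.run(
        ["kubectl", "get", "application", name, "-n", namespace,
         "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out).get("status", {})

def e2e_pipeline_works(app: str = "bleater", timeout_s: int = 300) -> bool:
    """Primary check: the app eventually converges to Synced + Healthy."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = argocd_app_status(app)
        if (status.get("sync", {}).get("status") == "Synced"
                and status.get("health", {}).get("status") == "Healthy"):
            return True
        time.sleep(10)
    return False
```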
| Technique | Example |
|---|---|
| Symptom overlap | "stuck pending" could be Harbor OR disk |
| Vague cross-reference | "might be related to the registry issues from earlier" |
| Shared timeline | "started happening after yesterday's deploy" |
| Uncertainty | "not sure if this is the same problem or different" |
- Describe symptoms, not causes
- Mention errors vaguely ("auth errors in some logs")
- Add realistic developer uncertainty ("idk, didn't have time to check")
- Hint at multiple problems without separating them
- Never name the fix ("scale", "credentials", "env vars")
| Don't Say | Why |
|---|---|
| "credentials", "secret" | Names the solution |
| "scale", "replicas" | Names the kubectl command |
| "env vars", "DOCKER_HOST" | Names exact problem |
| "path", "source path" | Names ArgoCD fix |
| "disk-filler", "pod" | Names what to delete |
| Pattern | Example |
|---|---|
| Symptom + uncertainty | "Builds failing, might be resource related?" |
| Vague error reference | "I think I saw auth errors in some logs" |
| Multiple hints combined | "Registry unreachable sometimes, stuck pending" |
| Developer casualness | "idk, didn't have time to check" |
Replace bullet-point acceptance criteria with a single outcome statement.
Instead of:
Restore the system so that:
- Gitea Runner is stable
- Bleater workloads are deployed
- Kubernetes can pull images
- ArgoCD becomes Synced
Suggested pattern:
Investigate and restore the system to full working order.
The system is considered restored when [single E2E outcome statement].
When changes require English content, provide:
### Issue N: [What it covers] (line XXX)
**File:** `path/to/file.sh:XXX`
**Samples:**
| Title | Body |
|-------|------|
| "Sample title 1" | "Sample body text 1" |
| "Sample title 2" | "Sample body text 2" |
| "Sample title 3" | "Sample body text 3" |Include in the code itself:
# Issue N: Covers [problem X] and [problem Y]
# GUIDANCE: Mention [symptoms]. Hint at [vague cause].
# Don't mention [explicit solution terms].
create_issue "[USER: WRITE TITLE]" \
"[USER: WRITE BODY]"## Quick Navigation
task.yaml:XX - [What to change]
setup.sh:XXX - Issue 1 (covers X)
setup.sh:XXX - Issue 2 (covers Y)After user writes content:
| Issue | Covers | Verdict | Notes |
|---|---|---|---|
| #1 | [problem] | pass/fail | [feedback] |
For each grader check:
Check: [check_name]
Tests: [what it validates]
Specified by: [task.yaml line / issue # / docs / discoverable via X]
Verdict: Fair / Unfair - [reason]
Flag any of the following:
- Typos or formatting errors
- Content too explicit (gives away solution)
- Content too vague (unfair to agent)
- Grader checks something not hinted
The grader must actually verify the agent solved the problem correctly, not just that symptoms disappeared.
| Red Flag | Problem | Fix |
|---|---|---|
| Only checks pod status | Agent could delete broken pods | Check deployments have available replicas |
| Only checks "no errors" | Doesn't verify functionality | Add E2E functional test |
| No timing window | Flaky services might pass momentarily | Add stability window (e.g., 60s observation) |
| Checks can be cheated | Agent could fake the outcome | Add anti-cheat validation |
- Functional E2E test: Does the system actually work? (e.g., commit -> ArgoCD converges)
- Stability: Is the fix stable over time? (e.g., restart count not increasing)
- Root cause fixed: Not just symptoms masked (e.g., check config is correct, not just pod running)
- All components: Each broken thing has a corresponding check
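The stability check can be as simple as sampling restart counts twice, one observation window apart. A sketch; namespace and label selector are placeholders:

```python
import json
import subprocess
import time

def restart_counts(namespace: str, selector: str) -> dict:
    """Map pod name -> total container restart count."""
    pods = json.loads(subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector,
         "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout)
    return {
        p["metadata"]["name"]: sum(
            c.get("restartCount", 0)
            for c in p["status"].get("containerStatuses", [])
        )
        for p in pods["items"]
    }

def stable_for(namespace: str, selector: str, window_s: int = 60) -> bool:
    before = restart_counts(namespace, selector)
    time.sleep(window_s)
    after = restart_counts(namespace, selector)
    # Same pods must still exist and none may have restarted in the window.
    return set(after) == set(before) and all(
        after[name] <= before[name] for name in before
    )
```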
At least one subscore must show variance across 10 rollouts. The Nebula reviewer bot enforces this automatically.
Variance means the subscore fluctuates between runs — some rollouts score 0 and others score 1 for the same check. This proves the task has a learnable dimension where agent behavior matters.
If every subscore is the same across all rollouts, one of two things is true:
- All 1s: The check is trivially solvable — every agent gets it regardless of strategy
- All 0s: The check is impossibly hard or broken — no agent can pass it
Neither case provides a learning signal. A task where the agent can't improve through different strategies is not useful for training.
Rollout 1: subscores = {'ci_works': 0, 'cd_works': 1, 'end-to-end-test': 1}
Rollout 2: subscores = {'ci_works': 1, 'cd_works': 1, 'end-to-end-test': 1}
...
Rollout 10: subscores = {'ci_works': 0, 'cd_works': 1, 'end-to-end-test': 1}
Here ci_works has variance (fluctuates between 0 and 1) — this is the learnable dimension. The agent's approach to fixing CI matters and produces different outcomes.
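Given rollout results in the shape above, the variance check is a set comprehension per subscore. A minimal sketch:

```python
rollouts = [
    {"ci_works": 0, "cd_works": 1, "end-to-end-test": 1},
    {"ci_works": 1, "cd_works": 1, "end-to-end-test": 1},
    {"ci_works": 0, "cd_works": 1, "end-to-end-test": 1},
]

def subscores_with_variance(rollouts: list[dict]) -> list[str]:
    """Subscores that took more than one value across rollouts."""
    return [k for k in rollouts[0] if len({r[k] for r in rollouts}) > 1]

print(subscores_with_variance(rollouts))  # ['ci_works']
```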
| All Subscores | Diagnosis | Fix |
|---|---|---|
| All 1s every run | Task too easy | Add difficulty per Tier 1-3 above |
| All 0s every run | Task too hard or grader broken | Check solution.sh passes; add grader waits |
| Mixed but identical across runs | Checks are deterministic on setup, not agent behavior | Redesign so at least one check tests something the agent must actively solve |
- The subscore most likely to vary should test the core skill the task measures
- Break complex grading into granular subscores — more subscores = more chances for variance to emerge naturally
- Avoid subscores that only test environment setup (these will be the same every run)
- Good variance candidates: E2E functional tests, multi-step fixes, integration checks
A deterministic grader produces consistent pass/fail for equivalent solutions. Non-determinism causes artificial failures where valid solutions fail due to undiscoverable assumptions.
Core principle: If grader checks something specific, either:
- It's specified somewhere the agent can discover (task.yaml, Gitea issues, docs, wiki), OR
- It's discoverable through normal investigation (kubectl, logs, existing configs), OR
- Grader detects it dynamically (not assume it)
| Source | Problem | Fix |
|---|---|---|
| Pattern-based pod detection | Agent uses valid but non-matching names | Specify naming OR use label selectors |
| Hardcoded ports/protocols | Agent configures different valid protocol | Specify protocol OR detect dynamically |
| Hardcoded resource names | Agent uses different valid names | Specify names OR discover via labels |
| Resource type assumptions | Grader expects Deployment, agent uses StatefulSet | Specify type OR check pods regardless of controller |
| Label selector assumptions | Agent uses different but valid labels | Specify labels OR detect by other means |
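The first row's fix in practice: detect pods by label selector instead of name pattern, so any valid naming (and any controller type) passes. Namespace and label are placeholders:

```python
import json
import subprocess

def pods_by_label(namespace: str, selector: str) -> list:
    """Label-based detection: works for Deployments, StatefulSets, etc."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector,
         "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)["items"]

# Fragile: breaks if the agent deploys Harbor under a different valid name.
# harbor_pods = [p for p in all_pods
#                if p["metadata"]["name"].startswith("harbor-registry-")]

# Robust: anything carrying the documented label is accepted.
harbor_pods = pods_by_label("harbor", "app.kubernetes.io/name=harbor")
```

The audit checklist below asks the same question for each category of hardcoded value.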
| Grader Check | Question | If No -> Action |
|---|---|---|
| Pod name pattern | Is this naming specified/discoverable? | Add to task.yaml/issues OR fix grader |
| Specific port/protocol | Is this protocol specified/discoverable? | Add hint OR detect dynamically |
| Label selector | Are these labels specified/discoverable? | Add hint OR use alternative detection |
| Resource type | Is this type specified/discoverable? | Add hint OR check pods regardless |
| Specific config values | Are these values specified/discoverable? | Add hint OR test behavior instead |
Check grader.py for hardcoded:
- Deployment names -> Must be discoverable via kubectl
- Secret names -> Must be referenced somewhere agent can find
- Namespace names -> Must be in .allowed_namespaces or documented
- App names -> Must be discoverable
Some cases require judgment. Flag explicitly:
DETERMINISM: Grader assumes [X].
- Specified where: [task.yaml / issue #N / discoverable via Y / NOWHERE]
- Alternatives: [what agent could legitimately do differently]
- Eval impact: [did this cause failures?]
- Recommend: [add hint to issues OR make grader flexible]
Claude CAN directly write:
- setup.sh breakage commands
- grader.py check functions
- solution.sh fix commands
- Code comments and guidance markers
| Pitfall | Solution |
|---|---|
| Grader checks something not hinted | Add vague hint to issues or task.yaml |
| Issues too explicit | User rewrites focusing on symptoms |
| Agent can pattern-match fix | Add investigation layer |
| Solution requires internet | Ensure all resources available in cluster |
| Timing issues in grader | Add appropriate waits/retries |
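For the timing row, a generic polling helper keeps waits and retries out of individual checks. A sketch, not a platform API; the wrapped check in the usage comment is hypothetical:

```python
import time
from typing import Callable

def wait_until(check: Callable[[], bool],
               timeout_s: int = 180, interval_s: int = 5) -> bool:
    """Poll `check` until it returns True or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False

# Usage: wrap a point-in-time assertion instead of evaluating it once.
# ok = wait_until(lambda: deployment_available("bleater", "default"))
```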