A systematic process for adjusting AI agent task difficulty when scores are too high (task too easy) or too low (task too hard).
For acceptance criteria and pass rate thresholds, see Task Review Guide. For failure analysis methodology, see Task Eval Analysis.
- Task scoring above target threshold (>70% pass rate - too easy)
- Task scoring below target threshold (0% but solution works - artificial failures)
- New task needs calibration before deployment
The task should feel like one realistic incident with interconnected problems, NOT a punchlist of unrelated jobs.
Good: "The GitOps pipeline is broken" -> agent investigates -> finds cascading failures Bad: "Fix these 5 things: runner, credentials, disk, ArgoCD path, image updater"
All problems should trace back to a single narrative:
- "Deployment outage" that requires investigating multiple symptoms
- Problems that reveal each other (fixing A exposes B)
- Unified success criteria: "the system works end-to-end"
The user MUST write all English-language content that will be fed to the model:
- task.yaml prompt text
- Gitea issue titles and bodies
- Any documentation the agent will read
- Contents of data/ directory
Claude's role:
- Provide samples with guidance
- Indicate file path and line number for each change
- Validate afterward that user's text fits requirements
- Verify grader remains fair to user's wording
| Technique | Example |
|---|---|
| Vague issue titles | "Harbor push failures" -> "Image deployment pipeline stuck" |
| Remove backup secrets | Force credential discovery from password manager |
| Add red herrings | Issues about unrelated problems |
| Remove acceptance criteria | Single outcome statement vs bullet checklist |
| Technique | Example |
|---|---|
| Break infrastructure itself | Not just credentials, but Harbor deployment scaled to 0 |
| Corrupt automation | Break ArgoCD image updater config, not just app |
| Multi-layer credential chain | Creds -> vault -> token -> Keycloak |
| Break networking/DNS | Corrupt resolution for *.devops.local |
| Technique | Example |
|---|---|
| CronJob re-corruption | Periodically re-breaks something agent fixed |
| Trap fixes | Obvious fix triggers webhook breaking something else |
| Require git commits | Must edit manifests in repo, not just kubectl patch |
| Authentication barriers | Need to auth to ArgoCD CLI, not just use kubectl |
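The "require git commits" row can be graded by comparing the manifests repo against a baseline commit recorded at setup time. A minimal sketch, assuming a local clone at `repo_dir` and a hypothetical `manifests/` directory (the real repo layout may differ):

```python
import subprocess

def manifests_changed_since(repo_dir: str, baseline_sha: str) -> bool:
    """True if any commit after the setup-time baseline touched manifests/.

    A live `kubectl patch` leaves no trace in git history, so this
    distinguishes real GitOps fixes from cluster-side shortcuts.
    """
    log = subprocess.run(
        ["git", "-C", repo_dir, "log", "--oneline",
         f"{baseline_sha}..HEAD", "--", "manifests/"],
        capture_output=True, text=True, check=True,
    ).stdout
    return bool(log.strip())
```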
| Technique | Example |
|---|---|
| Partial scoring | Use equal-weight subscores instead of all-or-nothing scoring. To shift the difficulty signal, group related checks into a single subscore rather than adjusting weights (equal weights are a platform requirement); see the sketch after this table |
| Anti-cheat checks | Verify agent read issues before fixing |
| Require explanation | Must create post-mortem markdown file |
| Order validation | Check agent understood cascading failures |
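A minimal sketch of the grouped-subscore idea, with hypothetical stand-ins for real grader probes:

```python
def harbor_running() -> bool:     # placeholder for a real kubectl probe
    return True

def creds_valid() -> bool:        # placeholder
    return True

def image_pull_works() -> bool:   # placeholder
    return True

def runner_stable() -> bool:      # placeholder
    return True

def grade() -> dict[str, int]:
    """Equal-weight subscores; difficulty is tuned by grouping, not weights."""
    return {
        # Grouped: the whole registry chain must work to earn this point,
        # which raises difficulty without touching the (fixed) weights.
        "registry_chain": int(
            harbor_running() and creds_valid() and image_pull_works()
        ),
        "ci_works": int(runner_stable()),
    }
```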
Create problems that require ordered fixes:

Problem A (surface)  <- Agent sees this first
  | fixing A reveals...
Problem B (hidden)   <- Only visible after A fixed
  | fixing B reveals...
Problem C (deep)     <- Final layer
Example:

Harbor scaled to 0 (can't push/pull)
  | scaling up reveals...
Harbor credentials wrong (auth fails)
  | fixing creds reveals...
Image updater config broken (auto-updates fail)
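On the setup side, a cascade like this is just ordered breakage. A sketch in Python for consistency with the other examples here (a real setup.sh would shell these commands out directly); all resource names and namespaces are placeholders:

```python
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

# Layer 1 (surface): Harbor is down, so nothing can push or pull.
run("kubectl", "-n", "harbor", "scale", "deployment/harbor-registry",
    "--replicas=0")

# Layer 2 (hidden until layer 1 is fixed): the pull secret is corrupt, so
# scaling Harbor back up still leaves auth failing. "e30=" is base64 for
# "{}", i.e. an empty dockerconfig.
run("kubectl", "-n", "default", "patch", "secret", "harbor-creds",
    "-p", '{"data":{".dockerconfigjson":"e30="}}')

# Layer 3 (deep) would corrupt the image-updater config the same way.
```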
| UNDERSPECIFIED | BALANCED | OVERSPECIFIED |
|---|---|---|
| "Fix it" | "System works E2E, investigate issues" | "Fix runner env vars, restore harbor-creds, scale harbor-registry..." |
| Unfair (agent can't know success) | Challenging (requires thinking) | Too easy (checklist execution) |
| Sign | Example | Problem |
|---|---|---|
| Bullet-point acceptance criteria | "- Runner stable, - Harbor accessible" | Becomes a checklist |
| Explicit component names | "Fix the harbor-creds secret" | Names the solution |
| Step-by-step hints | "Check Harbor, then credentials, then..." | Removes investigation |
| Issues name exact fixes | "Scale harbor-registry to 1" | No thinking required |
| Sign | Example | Problem |
|---|---|---|
| No success criteria | "Make it work" | Agent can't verify completion |
| Grader checks undiscoverable things | Checks secret name never mentioned | Unfair gotcha |
| No entry point | No issues, no error messages | Where to start? |
| Hidden requirements | Grader expects specific approach | Agent might solve differently |
For each grader check, ensure ONE of:
- Explicit in task.yaml: "ArgoCD must be Synced+Healthy"
- Hinted in Gitea issues: "Argo can't sync, manifest errors" (task.yaml points to issues)
- In referenced docs: Wiki, data/ files, or other docs task.yaml links to
- Discoverable in cluster: Agent can find via kubectl, logs, existing configs
- Implicit in E2E: If the E2E test passes, this must be working
Rule: If you can't point to where the agent would learn about a requirement, it's underspecified.
Specification chain: task.yaml -> points to issues/docs -> issues hint at problems -> agent investigates -> discovers specifics
Issues should read like reports from the same outage, not separate tickets:
Bad (disconnected):
- Issue 1: "Runner has wrong env vars"
- Issue 2: "Harbor credentials are bad"
- Issue 3: "Disk is full"
- Issue 4: "ArgoCD path is wrong"
Good (cohesive incident):
- Issue 1: "CI builds failing intermittently" (symptom of runner)
- Issue 2: "Images not deploying, registry issues" (symptom of Harbor chain)
- Issue 3: "ArgoCD broken, auto-updates stopped" (symptom of ArgoCD chain)
- Issue 4: "Sporadic failures, resource errors" (symptom of disk)
All issues should:
- Reference the same outage/incident
- Describe symptoms an engineer would actually report
- Cross-reference each other naturally ("might be related to the registry issues")
Instead of checking N separate things, frame as one outcome:
Bad: "Runner stable AND deployments available AND no pull errors AND ArgoCD synced AND..."
Good: "The GitOps pipeline works end-to-end: commit triggers successful deployment"
The E2E test becomes the primary success signal; component checks are supporting validation.
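A sketch of what the E2E signal can look like in grader.py, assuming an ArgoCD Application named `bleater` in the `argocd` namespace (both placeholders), after the grader has pushed a trigger commit (omitted here):

```python
import json
import subprocess
import time

def argocd_app_status(name: str, namespace: str = "argocd") -> dict:
    """Read the Application CR's status via kubectl."""
    out = subprocess.run(
        ["kubectl", "get", "application", name, "-n", namespace,
         "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out).get("status", {})

def e2e_pipeline_works(app: str = "bleater", timeout_s: int = 300) -> bool:
    """Primary check: the app eventually converges to Synced + Healthy."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = argocd_app_status(app)
        if (status.get("sync", {}).get("status") == "Synced"
                and status.get("health", {}).get("status") == "Healthy"):
            return True
        time.sleep(10)
    return False
```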
| Technique | Example |
|---|---|
| Symptom overlap | "stuck pending" could be Harbor OR disk |
| Vague cross-reference | "might be related to the registry issues from earlier" |
| Shared timeline | "started happening after yesterday's deploy" |
| Uncertainty | "not sure if this is the same problem or different" |
- Describe symptoms, not causes
- Mention errors vaguely ("auth errors in some logs")
- Add realistic developer uncertainty ("idk, didn't have time to check")
- Hint at multiple problems without separating them
- Never name the fix ("scale", "credentials", "env vars")
| Don't Say | Why |
|---|---|
| "credentials", "secret" | Names the solution |
| "scale", "replicas" | Names the kubectl command |
| "env vars", "DOCKER_HOST" | Names exact problem |
| "path", "source path" | Names ArgoCD fix |
| "disk-filler", "pod" | Names what to delete |
| Pattern | Example |
|---|---|
| Symptom + uncertainty | "Builds failing, might be resource related?" |
| Vague error reference | "I think I saw auth errors in some logs" |
| Multiple hints combined | "Registry unreachable sometimes, stuck pending" |
| Developer casualness | "idk, didn't have time to check" |
Replace bullet-point acceptance criteria with a single outcome statement.
Instead of:
Restore the system so that:
- Gitea Runner is stable
- Bleater workloads are deployed
- Kubernetes can pull images
- ArgoCD becomes Synced
Suggested pattern:
Investigate and restore the system to full working order.
The system is considered restored when [single E2E outcome statement].
When changes require English content, provide:
### Issue N: [What it covers] (line XXX)
**File:** `path/to/file.sh:XXX`
**Samples:**
| Title | Body |
|-------|------|
| "Sample title 1" | "Sample body text 1" |
| "Sample title 2" | "Sample body text 2" |
| "Sample title 3" | "Sample body text 3" |Include in the code itself:
# Issue N: Covers [problem X] and [problem Y]
# GUIDANCE: Mention [symptoms]. Hint at [vague cause].
# Don't mention [explicit solution terms].
create_issue "[USER: WRITE TITLE]" \
"[USER: WRITE BODY]"## Quick Navigation
task.yaml:XX - [What to change]
setup.sh:XXX - Issue 1 (covers X)
setup.sh:XXX - Issue 2 (covers Y)After user writes content:
| Issue | Covers | Verdict | Notes |
|---|---|---|---|
| #1 | [problem] | pass/fail | [feedback] |
For each grader check:
Check: [check_name]
Tests: [what it validates]
Specified by: [task.yaml line / issue # / docs / discoverable via X]
Verdict: Fair / Unfair - [reason]
Flag any of the following:
- Typos or formatting errors
- Content too explicit (gives away solution)
- Content too vague (unfair to agent)
- Grader checks something not hinted
The grader must actually verify the agent solved the problem correctly, not just that symptoms disappeared.
| Red Flag | Problem | Fix |
|---|---|---|
| Only checks pod status | Agent could delete broken pods | Check deployments have available replicas |
| Only checks "no errors" | Doesn't verify functionality | Add E2E functional test |
| No timing window | Flaky services might pass momentarily | Add stability window (e.g., 60s observation) |
| Checks can be cheated | Agent could fake the outcome | Add anti-cheat validation |
- Functional E2E test: Does the system actually work? (e.g., commit -> ArgoCD converges)
- Stability: Is the fix stable over time? (e.g., restart count not increasing)
- Root cause fixed: Not just symptoms masked (e.g., check config is correct, not just pod running)
- All components: Each broken thing has a corresponding check
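The stability check can be as simple as sampling restart counts twice, one observation window apart. A sketch; namespace and label selector are placeholders:

```python
import json
import subprocess
import time

def restart_counts(namespace: str, selector: str) -> dict:
    """Map pod name -> total container restart count."""
    pods = json.loads(subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector,
         "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout)
    return {
        p["metadata"]["name"]: sum(
            c.get("restartCount", 0)
            for c in p["status"].get("containerStatuses", [])
        )
        for p in pods["items"]
    }

def stable_for(namespace: str, selector: str, window_s: int = 60) -> bool:
    before = restart_counts(namespace, selector)
    time.sleep(window_s)
    after = restart_counts(namespace, selector)
    # Same pods must still exist and none may have restarted in the window.
    return set(after) == set(before) and all(
        after[name] <= before[name] for name in before
    )
```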
At least one subscore must show variance across 10 rollouts. The Nebula reviewer bot enforces this automatically.
Variance means the subscore fluctuates between runs — some rollouts score 0 and others score 1 for the same check. This proves the task has a learnable dimension where agent behavior matters.
If every subscore is the same across all rollouts, one of two things is true:
- All 1s: The check is trivially solvable — every agent gets it regardless of strategy
- All 0s: The check is impossibly hard or broken — no agent can pass it
Neither case provides a learning signal. A task where the agent can't improve through different strategies is not useful for training.
Rollout 1: subscores = {'ci_works': 0, 'cd_works': 1, 'end-to-end-test': 1}
Rollout 2: subscores = {'ci_works': 1, 'cd_works': 1, 'end-to-end-test': 1}
...
Rollout 10: subscores = {'ci_works': 0, 'cd_works': 1, 'end-to-end-test': 1}
Here ci_works has variance (fluctuates between 0 and 1) — this is the learnable dimension. The agent's approach to fixing CI matters and produces different outcomes.
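Given rollout results in the shape above, the variance check is a set comprehension per subscore. A minimal sketch:

```python
rollouts = [
    {"ci_works": 0, "cd_works": 1, "end-to-end-test": 1},
    {"ci_works": 1, "cd_works": 1, "end-to-end-test": 1},
    {"ci_works": 0, "cd_works": 1, "end-to-end-test": 1},
]

def subscores_with_variance(rollouts: list[dict]) -> list[str]:
    """Subscores that took more than one value across rollouts."""
    return [k for k in rollouts[0] if len({r[k] for r in rollouts}) > 1]

print(subscores_with_variance(rollouts))  # ['ci_works']
```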
| All Subscores | Diagnosis | Fix |
|---|---|---|
| All 1s every run | Task too easy | Add difficulty per Tier 1-3 above |
| All 0s every run | Task too hard or grader broken | Check solution.sh passes; add grader waits |
| Mixed but identical across runs | Checks are deterministic on setup, not agent behavior | Redesign so at least one check tests something the agent must actively solve |
- The subscore most likely to vary should test the core skill the task measures
- Break complex grading into granular subscores — more subscores = more chances for variance to emerge naturally
- Avoid subscores that only test environment setup (these will be the same every run)
- Good variance candidates: E2E functional tests, multi-step fixes, integration checks
A deterministic grader produces consistent pass/fail for equivalent solutions. Non-determinism causes artificial failures where valid solutions fail due to undiscoverable assumptions.
Core principle: If grader checks something specific, either:
- It's specified somewhere the agent can discover (task.yaml, Gitea issues, docs, wiki), OR
- It's discoverable through normal investigation (kubectl, logs, existing configs), OR
- Grader detects it dynamically (not assume it)
| Source | Problem | Fix |
|---|---|---|
| Pattern-based pod detection | Agent uses valid but non-matching names | Specify naming OR use label selectors |
| Hardcoded ports/protocols | Agent configures different valid protocol | Specify protocol OR detect dynamically |
| Hardcoded resource names | Agent uses different valid names | Specify names OR discover via labels |
| Resource type assumptions | Grader expects Deployment, agent uses StatefulSet | Specify type OR check pods regardless of controller |
| Label selector assumptions | Agent uses different but valid labels | Specify labels OR detect by other means |
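The first row's fix in practice: detect pods by label selector instead of name pattern, so any valid naming (and any controller type) passes. Namespace and label are placeholders:

```python
import json
import subprocess

def pods_by_label(namespace: str, selector: str) -> list:
    """Label-based detection: works for Deployments, StatefulSets, etc."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector,
         "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)["items"]

# Fragile: breaks if the agent deploys Harbor under a different valid name.
# harbor_pods = [p for p in all_pods
#                if p["metadata"]["name"].startswith("harbor-registry-")]

# Robust: anything carrying the documented label is accepted.
harbor_pods = pods_by_label("harbor", "app.kubernetes.io/name=harbor")
```

The audit checklist below asks the same question for each category of hardcoded value.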
| Grader Check | Question | If No -> Action |
|---|---|---|
| Pod name pattern | Is this naming specified/discoverable? | Add to task.yaml/issues OR fix grader |
| Specific port/protocol | Is this protocol specified/discoverable? | Add hint OR detect dynamically |
| Label selector | Are these labels specified/discoverable? | Add hint OR use alternative detection |
| Resource type | Is this type specified/discoverable? | Add hint OR check pods regardless |
| Specific config values | Are these values specified/discoverable? | Add hint OR test behavior instead |
Check grader.py for hardcoded:
- Deployment names -> Must be discoverable via kubectl
- Secret names -> Must be referenced somewhere agent can find
- Namespace names -> Must be in .allowed_namespaces or documented
- App names -> Must be discoverable
Some cases require judgment. Flag explicitly:
DETERMINISM: Grader assumes [X].
- Specified where: [task.yaml / issue #N / discoverable via Y / NOWHERE]
- Alternatives: [what agent could legitimately do differently]
- Eval impact: [did this cause failures?]
- Recommend: [add hint to issues OR make grader flexible]
Claude CAN directly write:
- setup.sh breakage commands
- grader.py check functions
- solution.sh fix commands
- Code comments and guidance markers
| Pitfall | Solution |
|---|---|
| Grader checks something not hinted | Add vague hint to issues or task.yaml |
| Issues too explicit | User rewrites focusing on symptoms |
| Agent can pattern-match fix | Add investigation layer |
| Solution requires internet | Ensure all resources available in cluster |
| Timing issues in grader | Add appropriate waits/retries |
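For the timing row, a generic polling helper keeps waits and retries out of individual checks. A sketch, not a platform API; the wrapped check in the usage comment is hypothetical:

```python
import time
from typing import Callable

def wait_until(check: Callable[[], bool],
               timeout_s: int = 180, interval_s: int = 5) -> bool:
    """Poll `check` until it returns True or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False

# Usage: wrap a point-in-time assertion instead of evaluating it once.
# ok = wait_until(lambda: deployment_available("bleater", "default"))
```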