@arubis
Created April 1, 2026 16:40
Task Difficulty Tuning

A systematic process for adjusting AI agent task difficulty when scores are too high (task too easy) or too low (task too hard).

For acceptance criteria and pass rate thresholds, see Task Review Guide. For failure analysis methodology, see Task Eval Analysis.

When to Use

  • Task scoring above target threshold (>70% pass rate - too easy)
  • Task scoring below target threshold (0% pass rate even though the solution works - artificial failures)
  • New task needs calibration before deployment

Core Principle: Cohesive End-to-End Task

The task should feel like one realistic incident with interconnected problems, NOT a punchlist of unrelated jobs.

Good: "The GitOps pipeline is broken" -> agent investigates -> finds cascading failures

Bad: "Fix these 5 things: runner, credentials, disk, ArgoCD path, image updater"

All problems should trace back to a single narrative:

  • "Deployment outage" that requires investigating multiple symptoms
  • Problems that reveal each other (fixing A exposes B)
  • Unified success criteria: "the system works end-to-end"

Critical Rule: Human-Written Content

The user MUST write all English-language content that will be fed to the model:

  • task.yaml prompt text
  • Gitea issue titles and bodies
  • Any documentation the agent will read
  • Contents of data/ directory

Claude's role:

  1. Provide samples with guidance
  2. Indicate file path and line number for each change
  3. Validate afterward that user's text fits requirements
  4. Verify grader remains fair to user's wording

Difficulty Adjustment Tiers

Tier 1: Obscure Breadcrumbs (Easiest to Implement)

| Technique | Example |
|-----------|---------|
| Vague issue titles | "Harbor push failures" -> "Image deployment pipeline stuck" |
| Remove backup secrets | Force credential discovery from password manager |
| Add red herrings | Issues about unrelated problems |
| Remove acceptance criteria | Single outcome statement vs bullet checklist |

Tier 2: Add Investigation Layers

| Technique | Example |
|-----------|---------|
| Break infrastructure itself | Not just credentials, but Harbor deployment scaled to 0 |
| Corrupt automation | Break ArgoCD image updater config, not just app |
| Multi-layer credential chain | Creds -> vault -> token -> Keycloak |
| Break networking/DNS | Corrupt resolution for *.devops.local |

Tier 3: Temporal/Order-Sensitive Problems

| Technique | Example |
|-----------|---------|
| CronJob re-corruption | Periodically re-breaks something the agent fixed |
| Trap fixes | Obvious fix triggers a webhook that breaks something else |
| Require git commits | Must edit manifests in the repo, not just kubectl patch |
| Authentication barriers | Need to auth to the ArgoCD CLI, not just use kubectl |

Tier 4: Grading Structure

| Technique | Example |
|-----------|---------|
| Partial scoring | Use equal-weight subscores rather than all-or-nothing. Group related checks into one subscore to shift the difficulty signal (never adjust weights; equal weights are a platform requirement) |
| Anti-cheat checks | Verify the agent read issues before fixing |
| Require explanation | Must create a post-mortem markdown file |
| Order validation | Check the agent understood the cascading failures |
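The partial-scoring row above can be sketched in grader terms. This is a minimal illustration, not a real grader: the subscore names and the credential-chain grouping are hypothetical.

```python
# Minimal sketch of equal-weight partial scoring. Each subscore is
# binary; the total is their unweighted mean. Grouping related checks
# into one subscore shifts the difficulty signal without touching
# weights (weights must stay equal).

def grade(subscores: dict[str, int]) -> float:
    """Return the equal-weight mean of binary subscores."""
    return sum(subscores.values()) / len(subscores)

# Hypothetical grouping: the whole credential chain (creds -> vault ->
# token -> Keycloak) earns a single subscore, so a partial fix scores 0.
subscores = {
    "ci_works": 1,
    "credential_chain": 0,
    "end_to_end": 1,
}
print(grade(subscores))  # mean of the three subscores: 2 of 3 pass
```

Collapsing three related checks into `credential_chain` makes the task harder without unequal weights: the agent earns that point only by fixing the entire chain.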

Cascading Dependencies Pattern

Create problems that require ordered fixes:

Problem A (surface) <- Agent sees this first
    | fixing A reveals...
Problem B (hidden) <- Only visible after A fixed
    | fixing B reveals...
Problem C (deep) <- Final layer

Example:

Harbor scaled to 0 (can't push/pull)
    | scale up reveals...
Harbor credentials wrong (auth fails)
    | fix creds reveals...
Image updater config broken (auto-updates fail)

Specification Balance: Over vs Under

The Spectrum

UNDERSPECIFIED          BALANCED              OVERSPECIFIED
     |                     |                      |
"Fix it"          "System works E2E,      "Fix runner env vars,
                   investigate issues"     restore harbor-creds,
                                          scale harbor-registry..."
     |                     |                      |
  Unfair            Challenging             Too Easy
(agent can't      (requires thinking)    (checklist execution)
 know success)

Signs of Overspecification

| Sign | Example | Problem |
|------|---------|---------|
| Bullet-point acceptance criteria | "- Runner stable, - Harbor accessible" | Becomes a checklist |
| Explicit component names | "Fix the harbor-creds secret" | Names the solution |
| Step-by-step hints | "Check Harbor, then credentials, then..." | Removes investigation |
| Issues name exact fixes | "Scale harbor-registry to 1" | No thinking required |

Signs of Underspecification

| Sign | Example | Problem |
|------|---------|---------|
| No success criteria | "Make it work" | Agent can't verify completion |
| Grader checks undiscoverable things | Checks a secret name that is never mentioned | Unfair gotcha |
| No entry point | No issues, no error messages | Where to start? |
| Hidden requirements | Grader expects a specific approach | Agent might solve differently |

Finding Balance

For each grader check, ensure ONE of:

  1. Explicit in task.yaml: "ArgoCD must be Synced+Healthy"
  2. Hinted in Gitea issues: "Argo can't sync, manifest errors" (task.yaml points to issues)
  3. In referenced docs: Wiki, data/ files, or other docs task.yaml links to
  4. Discoverable in cluster: Agent can find via kubectl, logs, existing configs
  5. Implicit in E2E: If the E2E test passes, this must be working

Rule: If you can't point to where the agent would learn about a requirement, it's underspecified.

Specification chain: task.yaml -> points to issues/docs -> issues hint at problems -> agent investigates -> discovers specifics
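The rule can be kept honest with a small audit that maps each grader check to its specification source. A sketch follows; the check names and sources are hypothetical examples, not from any real task.

```python
# Sketch: audit that every grader check maps to a source the agent can
# discover. Check names and their sources below are hypothetical.

SPEC_SOURCES = {"task.yaml", "issues", "docs", "cluster", "e2e"}

checks = {
    "argocd_synced": "task.yaml",  # stated explicitly in the prompt
    "harbor_creds": "issues",      # hinted in a Gitea issue
    "runner_stable": "cluster",    # discoverable via kubectl/logs
}

def underspecified(checks: dict[str, str]) -> list[str]:
    """Return checks whose source is missing or unrecognized."""
    return [name for name, src in checks.items() if src not in SPEC_SOURCES]

print(underspecified(checks))  # -> [] (every check has a source)
```

Any check that lands in the returned list has no discoverable specification and is, by the rule above, underspecified.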

Task Cohesion: One Incident, Not a Punchlist

Connecting Issues Narratively

Issues should read like reports from the same outage, not separate tickets:

Bad (disconnected):

  • Issue 1: "Runner has wrong env vars"
  • Issue 2: "Harbor credentials are bad"
  • Issue 3: "Disk is full"
  • Issue 4: "ArgoCD path is wrong"

Good (cohesive incident):

  • Issue 1: "CI builds failing intermittently" (symptom of runner)
  • Issue 2: "Images not deploying, registry issues" (symptom of Harbor chain)
  • Issue 3: "ArgoCD broken, auto-updates stopped" (symptom of ArgoCD chain)
  • Issue 4: "Sporadic failures, resource errors" (symptom of disk)

All issues should:

  • Reference the same outage/incident
  • Describe symptoms an engineer would actually report
  • Cross-reference each other naturally ("might be related to the registry issues")

Unified Success Criteria

Instead of checking N separate things, frame as one outcome:

Bad: "Runner stable AND deployments available AND no pull errors AND ArgoCD synced AND..."

Good: "The GitOps pipeline works end-to-end: commit triggers successful deployment"

The E2E test becomes the primary success signal; component checks are supporting validation.

Issue Interconnection Techniques

| Technique | Example |
|-----------|---------|
| Symptom overlap | "stuck pending" could be Harbor OR disk |
| Vague cross-reference | "might be related to the registry issues from earlier" |
| Shared timeline | "started happening after yesterday's deploy" |
| Uncertainty | "not sure if this is the same problem or different" |

Writing Guidance for Vague Issues

Principles

  1. Describe symptoms, not causes
  2. Mention errors vaguely ("auth errors in some logs")
  3. Add realistic developer uncertainty ("idk, didn't have time to check")
  4. Hint at multiple problems without separating them
  5. Never name the fix ("scale", "credentials", "env vars")

What to Avoid

| Don't Say | Why |
|-----------|-----|
| "credentials", "secret" | Names the solution |
| "scale", "replicas" | Names the kubectl command |
| "env vars", "DOCKER_HOST" | Names the exact problem |
| "path", "source path" | Names the ArgoCD fix |
| "disk-filler", "pod" | Names what to delete |

Good Patterns

| Pattern | Example |
|---------|---------|
| Symptom + uncertainty | "Builds failing, might be resource related?" |
| Vague error reference | "I think I saw auth errors in some logs" |
| Multiple hints combined | "Registry unreachable sometimes, stuck pending" |
| Developer casualness | "idk, didn't have time to check" |

Task.yaml Structure Guidance

Remove Checklist, Add Outcome

Replace bullet-point acceptance criteria with single outcome statement.

Instead of:

Restore the system so that:
- Gitea Runner is stable
- Bleater workloads are deployed
- Kubernetes can pull images
- ArgoCD becomes Synced

Suggest pattern:

Investigate and restore the system to full working order.
The system is considered restored when [single E2E outcome statement].

Workflow: Providing Samples for User

When changes require English content, provide:

1. Sample Table Format

### Issue N: [What it covers] (line XXX)
**File:** `path/to/file.sh:XXX`

**Samples:**
| Title | Body |
|-------|------|
| "Sample title 1" | "Sample body text 1" |
| "Sample title 2" | "Sample body text 2" |
| "Sample title 3" | "Sample body text 3" |

2. Guidance Comments

Include in the code itself:

# Issue N: Covers [problem X] and [problem Y]
# GUIDANCE: Mention [symptoms]. Hint at [vague cause].
#           Don't mention [explicit solution terms].
create_issue "[USER: WRITE TITLE]" \
"[USER: WRITE BODY]"

3. Quick Navigation Summary

## Quick Navigation

task.yaml:XX     - [What to change]
setup.sh:XXX     - Issue 1 (covers X)
setup.sh:XXX     - Issue 2 (covers Y)

Workflow: Validating User's Text

After user writes content:

1. Check Each Issue Against Requirements

| Issue | Covers | Verdict | Notes |
|-------|--------|---------|-------|
| #1 | [problem] | pass/fail | [feedback] |

2. Verify Grader Fairness

For each grader check:

Check: [check_name]
Tests: [what it validates]
Specified by: [task.yaml line / issue # / docs / discoverable via X]
Verdict: Fair / Unfair - [reason]

3. Flag Issues

  • Typos or formatting errors
  • Content too explicit (gives away solution)
  • Content too vague (unfair to agent)
  • Grader checks something not hinted

Grader Validation: Not Too Relaxed

The grader must actually verify the agent solved the problem correctly, not just that symptoms disappeared.

Check for Relaxed Grading

| Red Flag | Problem | Fix |
|----------|---------|-----|
| Only checks pod status | Agent could delete broken pods | Check deployments have available replicas |
| Only checks "no errors" | Doesn't verify functionality | Add an E2E functional test |
| No timing window | Flaky services might pass momentarily | Add a stability window (e.g., 60s observation) |
| Checks can be cheated | Agent could fake the outcome | Add anti-cheat validation |

Grader Should Verify

  • Functional E2E test: Does the system actually work? (e.g., commit -> ArgoCD converges)
  • Stability: Is the fix stable over time? (e.g., restart count not increasing)
  • Root cause fixed: Not just symptoms masked (e.g., check config is correct, not just pod running)
  • All components: Each broken thing has a corresponding check
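The stability check mentioned above can be sketched as a polling loop that requires the condition to hold for the whole observation window. The `check` callable and the timings are placeholders; a real grader would probe the cluster (e.g., restart counts via kubectl) inside it.

```python
import time

# Sketch of a stability window: the condition must hold continuously
# for `window` seconds, polled every `interval` seconds. The `check`
# callable is a placeholder for a real probe. A single failed poll
# inside the window means the fix is not stable.

def stable_for(check, window: float = 60.0, interval: float = 5.0) -> bool:
    deadline = time.monotonic() + window
    while time.monotonic() < deadline:
        if not check():
            return False  # flaked inside the window -> not stable
        time.sleep(interval)
    return True
```

This fails fast on the first flake but only passes after the full window elapses, which is what distinguishes a real fix from a pod that momentarily looks healthy.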

Subscore Variance Requirement

At least one subscore must show variance across 10 rollouts. The Nebula reviewer bot enforces this automatically.

Variance means the subscore fluctuates between runs — some rollouts score 0 and others score 1 for the same check. This proves the task has a learnable dimension where agent behavior matters.

Why Variance Matters

If every subscore is the same across all rollouts, one of two things is true:

  • All 1s: The check is trivially solvable — every agent gets it regardless of strategy
  • All 0s: The check is impossibly hard or broken — no agent can pass it

Neither case provides a learning signal. A task where the agent can't improve through different strategies is not useful for training.

Example

Rollout 1:  subscores = {'ci_works': 0, 'cd_works': 1, 'end-to-end-test': 1}
Rollout 2:  subscores = {'ci_works': 1, 'cd_works': 1, 'end-to-end-test': 1}
...
Rollout 10: subscores = {'ci_works': 0, 'cd_works': 1, 'end-to-end-test': 1}

Here ci_works has variance (fluctuates between 0 and 1) — this is the learnable dimension. The agent's approach to fixing CI matters and produces different outcomes.
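Which subscores carry variance can be computed mechanically across rollouts. A small sketch, reusing the illustrative rollout data above (not real eval output):

```python
# Sketch: find subscores that vary across rollouts. A subscore with
# more than one distinct value is a learnable dimension; a constant
# one (all 0s or all 1s) carries no training signal.

def varying_subscores(rollouts: list[dict[str, int]]) -> set[str]:
    names = rollouts[0].keys()
    return {n for n in names if len({r[n] for r in rollouts}) > 1}

rollouts = [
    {"ci_works": 0, "cd_works": 1, "end_to_end": 1},
    {"ci_works": 1, "cd_works": 1, "end_to_end": 1},
    {"ci_works": 0, "cd_works": 1, "end_to_end": 1},
]
print(varying_subscores(rollouts))  # -> {'ci_works'}
```

An empty result here is exactly the no-variance condition the reviewer bot rejects.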

Diagnosing No Variance

| All Subscores | Diagnosis | Fix |
|---------------|-----------|-----|
| All 1s every run | Task too easy | Add difficulty per Tiers 1-3 above |
| All 0s every run | Task too hard or grader broken | Check solution.sh passes; add grader waits |
| Mixed but identical across runs | Checks are deterministic on setup, not agent behavior | Redesign so at least one check tests something the agent must actively solve |

Designing for Variance

  • The subscore most likely to vary should test the core skill the task measures
  • Break complex grading into granular subscores — more subscores = more chances for variance to emerge naturally
  • Avoid subscores that only test environment setup (these will be the same every run)
  • Good variance candidates: E2E functional tests, multi-step fixes, integration checks

Grader Determinism

A deterministic grader produces consistent pass/fail for equivalent solutions. Non-determinism causes artificial failures where valid solutions fail due to undiscoverable assumptions.

Core principle: If grader checks something specific, either:

  1. It's specified somewhere the agent can discover (task.yaml, Gitea issues, docs, wiki), OR
  2. It's discoverable through normal investigation (kubectl, logs, existing configs), OR
  3. The grader detects it dynamically (rather than assuming it)

Common Non-Determinism Sources

| Source | Problem | Fix |
|--------|---------|-----|
| Pattern-based pod detection | Agent uses valid but non-matching names | Specify naming OR use label selectors |
| Hardcoded ports/protocols | Agent configures a different valid protocol | Specify protocol OR detect dynamically |
| Hardcoded resource names | Agent uses different valid names | Specify names OR discover via labels |
| Resource type assumptions | Grader expects Deployment, agent uses StatefulSet | Specify type OR check pods regardless of controller |
| Label selector assumptions | Agent uses different but valid labels | Specify labels OR detect by other means |
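The label-selector fix can be illustrated with plain dictionaries standing in for pod objects. This is a sketch under that assumption; a real grader would list pods through the Kubernetes API, and the pod names and labels here are invented.

```python
# Sketch: select pods by labels rather than by name pattern, so an
# agent's valid-but-different naming still passes. Pods are plain
# dicts standing in for Kubernetes API objects.

def pods_matching(pods, selector: dict[str, str]):
    """Return pods whose labels include every key/value in selector."""
    return [p for p in pods if all(p.get("labels", {}).get(k) == v
                                   for k, v in selector.items())]

pods = [
    {"name": "registry-7f9c", "labels": {"app": "harbor"}},
    {"name": "my-harbor-0",   "labels": {"app": "harbor"}},  # agent's rename
    {"name": "runner-1",      "labels": {"app": "runner"}},
]
# A name check like name.startswith("registry") misses "my-harbor-0";
# the label selector finds both Harbor pods.
print([p["name"] for p in pods_matching(pods, {"app": "harbor"})])
```

The design choice: match on what the workload *is* (its labels) rather than what it happens to be called, which is exactly the determinism fix the table recommends.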

Determinism Checklist

| Grader Check | Question | If No -> Action |
|--------------|----------|-----------------|
| Pod name pattern | Is this naming specified/discoverable? | Add to task.yaml/issues OR fix grader |
| Specific port/protocol | Is this protocol specified/discoverable? | Add hint OR detect dynamically |
| Label selector | Are these labels specified/discoverable? | Add hint OR use alternative detection |
| Resource type | Is this type specified/discoverable? | Add hint OR check pods regardless |
| Specific config values | Are these values specified/discoverable? | Add hint OR test behavior instead |

Hardcoded Values Review

Check grader.py for hardcoded:

  • Deployment names -> Must be discoverable via kubectl
  • Secret names -> Must be referenced somewhere agent can find
  • Namespace names -> Must be in .allowed_namespaces or documented
  • App names -> Must be discoverable

Gray Areas

Some cases require judgment. Flag explicitly:

DETERMINISM: Grader assumes [X].
   - Specified where: [task.yaml / issue #N / discoverable via Y / NOWHERE]
   - Alternatives: [what agent could legitimately do differently]
   - Eval impact: [did this cause failures?]
   - Recommend: [add hint to issues OR make grader flexible]

Technical Changes (Claude Writes)

Claude CAN directly write:

  • setup.sh breakage commands
  • grader.py check functions
  • solution.sh fix commands
  • Code comments and guidance markers

Common Pitfalls

| Pitfall | Solution |
|---------|----------|
| Grader checks something not hinted | Add a vague hint to issues or task.yaml |
| Issues too explicit | User rewrites focusing on symptoms |
| Agent can pattern-match the fix | Add an investigation layer |
| Solution requires internet | Ensure all resources are available in the cluster |
| Timing issues in grader | Add appropriate waits/retries |
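Timing-related grader flakiness is usually fixed with a retry wrapper around the check rather than a single sample. A minimal sketch; the timeout and interval values are placeholder defaults:

```python
import time

# Sketch of a retry helper for timing-sensitive grader checks:
# re-run the probe until it passes or the timeout expires, instead
# of sampling once and failing on a transient state.

def eventually(check, timeout: float = 120.0, interval: float = 5.0) -> bool:
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Wrapping checks like "ArgoCD is Synced" in `eventually(...)` removes false failures from slow convergence, while the stability-window pattern earlier guards the opposite failure mode (passing on a momentary healthy sample).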