UUID: 72edf9a5-6e74-4b21-ad8d-6dc1fce79813
Author: yashpatil2000
Category: platform | Difficulty: medium
The task design is strong — a realistic "abandoned canary migration" scenario across ArgoCD, Argo Rollouts, Istio, and Gitea. Agents demonstrate substantial competence, but the mean score (0.15) is depressed by grader bugs, specification gaps, and environment instability. Only one of the four subscores (sync_lifecycle, 60%) provides meaningful signal.
| Metric | Value |
|---|---|
| Scored Runs | 10 |
| Scores | 0, 0.25, 0, 0.25, 0.25, 0, 0, 0.25, 0.25, 0.25 |
| Mean Score | 0.15 |
| Max Achieved | 0.25 |
| test-solution (local) | 1.0 |
| test-solution (hosted) | Fails |
| Subscore | Passes | Rate | Category |
|---|---|---|---|
| canary_rollout | 0/10 | 0% | Grader bug (must fix) |
| gitops_and_cluster | 0/10 | 0% | Specification gap (must fix) |
| istio_and_networking | 0/10 | 0% | Mixed — spec gap + debatable design |
| sync_lifecycle | 6/10 | 60% | Genuine difficulty |
These are issues where the agent has no control over the outcome. They must be fixed before the next review.
Location: grader.py:899-920
The loadgenerator is grader infrastructure that the agent never touches. It deploys to a loadgenerator namespace that isn't created, and even if it were, it lacks Istio injection — so under STRICT mTLS, its plain-HTTP traffic won't generate istio_requests_total metrics. There is no scenario where the Istio metrics gate passes. This blocks a check agents have no control over.
Fix: Create the loadgenerator namespace and label it istio-injection=enabled before deploying the load generator pod.
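The fix could be planted in the grader's setup as a manifest along these lines (a sketch; the namespace name comes from the review, the rest is standard Kubernetes/Istio):

```yaml
# Sketch of the grader-side fix: create the namespace before the load
# generator deploys, with the label that tells Istio's injection webhook
# to add a sidecar -- so the pod's traffic is mTLS-capable and produces
# istio_requests_total metrics under STRICT PeerAuthentication.
apiVersion: v1
kind: Namespace
metadata:
  name: loadgenerator
  labels:
    istio-injection: enabled
```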
Difficulty note: The existing multi-rollout analysis check (all rollouts with analysis must pass, not just one) already provides substantial difficulty — no additional compensation needed.
This task adds significant workload on top of the base Nebula stack: Argo Rollouts (2 controller replicas + dashboard) and Istio IngressGateway. Combined with imagePullPolicy: Always on these components, the added load can destabilize ArgoCD's repo-server on hosted infrastructure.
Root cause: ArgoCD's repo-server has a tight liveness probe (10s initial delay, 1s timeout, 3 failures to kill). On startup it generates a GPG key, which takes ~10s on an idle machine but ~19s+ under CPU contention. When the probe fires before the server is ready, it kills and restarts the pod — creating a crash-loop that prevents ApplicationSet from generating apps, syncs from completing, and hooks from firing.
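The probe described above would look roughly like this in the repo-server Deployment (the endpoint and port are illustrative assumptions; the timings are the ones cited):

```yaml
# Illustrative repo-server livenessProbe; endpoint/port are assumptions,
# the timings match the description above.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8084
  initialDelaySeconds: 10   # probing starts 10s after container start
  timeoutSeconds: 1         # each probe must answer within 1s
  failureThreshold: 3       # 3 consecutive failures -> kill the pod
# With GPG key generation taking ~19s under CPU contention, the pod is
# killed before it ever becomes ready, producing the crash-loop.
```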
The trigger: setup.sh installs argo-rollouts and istio-ingressgateway with imagePullPolicy: Always. In Nebula's air-gapped environment, k3s attempts a registry pull that hangs until timeout before falling back to the cached image. Under resource pressure, the crash-looping pods from these pull timeouts consume enough CPU to push the repo-server's startup past its liveness probe window.
What it affects:
- Hosted validation: solution.sh completes all its steps correctly but scores 0.0 because ArgoCD never generates Applications.
- Agent runs: ApplicationSet generation stalls, app syncs time out, hook jobs fail to trigger. The 4 runs scoring 0.0 (where even `sync_lifecycle` fails) could be partly explained by this.
- Grader: `check_sync_lifecycle` triggers its own ArgoCD sync with a 180s timeout — if the repo-server is unstable at grading time, this fails regardless of the agent's work.
See: "Air-Gap Cascade: `imagePullPolicy: Always` Destabilizes ArgoCD"
Fixes (address both):
- Change the break mechanism — Use a wrong image tag (`ErrImagePull`), bad args (fast crash), or missing permissions instead of `imagePullPolicy: Always`. All create a similar "fix the controller" experience without the resource cascade. The "fix the broken controller" step is valuable task content — just change how it's broken.
- Reduce the additional resource footprint — Consider installing argo-rollouts with `--set controller.replicas=1` (one controller instead of two) and validate that the task still works under hosted infra constraints. The fewer crash-looping pods during setup, the less likely the cascade triggers.
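A hedged sketch of the first fix: planting the break as a nonexistent image tag rather than `imagePullPolicy: Always` (the image name and tag here are hypothetical):

```yaml
# Hypothetical broken-controller fragment: the bogus tag yields a
# deterministic ErrImagePull without the repeated registry-timeout churn,
# so the agent still has a controller to diagnose and fix, but the host
# avoids the CPU cascade.
spec:
  template:
    spec:
      containers:
      - name: argo-rollouts
        image: quay.io/argoproj/argo-rollouts:v0.0.0-broken  # assumed bogus tag
        imagePullPolicy: IfNotPresent   # never re-pull a cached image
```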
These checks test legitimate requirements, but the task.yaml doesn't specify them clearly enough for agents to know what's expected. The fixes are about making the spec match the grader, not about changing what's tested.
Location: grader.py:263-264
The grader requires argocd.argoproj.io/sync-wave annotations on Rollout objects. Task.yaml says "sync-wave values are set accordingly" and the setup plants sync-wave: "-5" on existing rollout templates as a discoverable clue.
However: 0/10 agents add sync-wave to Rollouts. ArgoCD's PreSync/PostSync phases already guarantee ordering without annotations on the main resources. Primary reviewer Greg assessed this as "not standard and not discoverable."
Fix (option A — recommended): Remove the Rollout wave check. The gitops_and_cluster subscore bundles 6 sub-checks where one failure zeros the entire 0.25 — that compound gating already provides substantial difficulty.
Fix (option B): Make the requirement unmissable: "Each Rollout manifest must carry an argocd.argoproj.io/sync-wave annotation with a numeric value between the PreSync hook wave and the PostSync hook wave." This shifts the challenge from "discover the hidden requirement" to "implement it correctly across 8 services."
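Under option B, a compliant Rollout manifest would carry something like the following (the wave value is illustrative, assuming PreSync hooks at a negative wave such as the planted `-5` and PostSync hooks at a positive one):

```yaml
# Illustrative Rollout annotation for option B; "0" is a placeholder
# value sitting between an assumed PreSync hook wave (e.g. -5) and an
# assumed PostSync hook wave.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-service          # hypothetical service name
  annotations:
    argocd.argoproj.io/sync-wave: "0"
```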
Location: grader.py:614-621
The grader requires ALL VirtualServices to reference the Istio Gateway. The task says "ensure traffic binds correctly for both internal and external communication" — 2/10 agents got this right, so it IS discoverable, but most agents use standard mesh-only patterns for internal services.
Fix: Add a clearer task.yaml note: "Each microservice's VirtualService should be accessible via both the mesh gateway and the external ingress Gateway."
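Concretely, the clarified requirement corresponds to a VirtualService shaped roughly like this (gateway and service names are assumptions):

```yaml
# Sketch of a VirtualService bound to both the reserved in-mesh "mesh"
# gateway and the external ingress Gateway; names are hypothetical.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: example-service
spec:
  hosts:
  - example-service
  gateways:
  - mesh                            # sidecar-to-sidecar (internal) traffic
  - istio-system/bleater-gateway    # external ingress Gateway (name assumed)
  http:
  - route:
    - destination:
        host: example-service
        subset: stable
```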
Difficulty note: This is a good difficulty lever. The 2/10 pass rate shows it's hard even when discoverable. With clearer specification, expect ~4-6/10 — still meaningfully challenging.
These are design decisions where the check is testing something legitimate but the implementation has trade-offs. They don't need to block the next iteration — address them if scores are too high or too low after fixing the above.
Location: grader.py:390-400, _test_external_communication()
The task says "bleater services are not reachable from outside the namespace" — describing a broken state the agent should fix. STRICT PeerAuthentication blocking cross-namespace traffic IS the problem. Agents who leave STRICT in place (runs 1, 5) genuinely haven't completed the task. The check is fair to agents who fail it.
The concern is an inverted incentive: runs 2 and 4 pass this check by never enabling Istio at all — no injection means no STRICT enforcement, so plaintext traffic flows freely. The check rewards skipping Istio entirely over configuring it correctly but leaving STRICT in place.
If addressing: Gate the cross-namespace check on sidecar presence — only test cross-namespace traffic if sidecars are confirmed present, otherwise fail with "sidecars not configured." This prevents agents from passing by omitting Istio while keeping the PA removal as a genuine challenge.
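The gate could look roughly like this (a hypothetical sketch; the pod dicts and field names are assumptions, not grader.py's actual data model):

```python
# Hypothetical sketch of gating the cross-namespace check on sidecar
# presence, so skipping Istio can no longer pass the check.
def has_istio_proxy(pod: dict) -> bool:
    """True if the pod carries an istio-proxy container, checking
    initContainers (Istio 1.27 native sidecar mode) as well as regular
    containers (classic injection)."""
    containers = pod.get("initContainers", []) + pod.get("containers", [])
    return any(c.get("name") == "istio-proxy" for c in containers)

def check_cross_namespace(pods: list, traffic_blocked: bool) -> tuple:
    # Gate: with no sidecars there is no mTLS enforcement to exercise,
    # so fail outright instead of rewarding a skipped Istio setup.
    if not all(has_istio_proxy(p) for p in pods):
        return (False, "sidecars not configured")
    # Sidecars present: passing now genuinely requires the STRICT
    # PeerAuthentication to have been removed.
    return (not traffic_blocked, "cross-namespace traffic check")
```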
If deferring: The check works correctly in the common case (agents who enable Istio must also remove STRICT PA). The inverted incentive only affects agents who skip Istio, which also fails other sub-checks in istio_and_networking. The composite scoring may mask this issue in practice.
gitops_and_cluster gates 6 sub-checks (rollouts, gitops_applications, gitops_sync_waves, cluster_prerequisites, analysis_template, services_and_health) under a single AND — one failure anywhere zeros the entire 0.25. This is severe compound gating: an agent who completes 5/6 sub-requirements gets the same score as one who completes 0/6.
This currently works in the task's favor (adds difficulty), and the compound structure prevents gaming. But if post-fix scores are too low, consider splitting into 2 subscores (e.g., gitops_applications and cluster_config) with equal weights across more granular subscores. Conversely, if scores are too high, the bundling is already the strongest difficulty lever available.
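The arithmetic of the two structures can be sketched directly (sub-check names come from the review; the 5/6 run, the grouping, and the weights are illustrative):

```python
# Compound AND gate vs. a two-subscore split, using the six sub-checks
# named above, for a hypothetical run that misses exactly one sub-check.
subchecks = {
    "rollouts": True,
    "gitops_applications": True,
    "gitops_sync_waves": False,   # the one miss in a 5/6 run
    "cluster_prerequisites": True,
    "analysis_template": True,
    "services_and_health": True,
}

# Current structure: one failure anywhere zeros the whole 0.25.
compound = 0.25 if all(subchecks.values()) else 0.0

# Possible split: two bundles of three at 0.125 each (grouping assumed).
bundle_a = ["rollouts", "gitops_applications", "gitops_sync_waves"]
bundle_b = ["cluster_prerequisites", "analysis_template", "services_and_health"]
split = sum(0.125 for b in (bundle_a, bundle_b)
            if all(subchecks[k] for k in b))

print(compound, split)  # the 5/6 run scores 0.0 compound, 0.125 split
```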
This review aligns with Greg's earlier feedback on the key issues:
- Sync-wave on Rollouts — Greg flagged this in v14 and v16 as "not standard" and "a path restriction." Our recommendation matches.
- VS gateway binding — Greg's v10 concern was that Nebula didn't have istio-ingressgateway, making the Gateway a no-op. The v25 setup.sh now installs istio-ingressgateway, resolving that concern — our recommendation to keep the check with clearer spec reflects this change.
- Issue comment requirement — Greg flagged in v14 that the grader required comments without task.yaml specifying them. Task.yaml v25 now says "must be closed upon deployment completion with a comment" — resolved.
- Istio metrics gate and imagePullPolicy cascade are new findings not previously reviewed.
The sidecar check (grader.py:369-375) queries initContainers for istio-proxy. This was initially flagged as a potential bug, but it is correct for Istio 1.27's native sidecar mode, where istio-proxy is placed as an init container with restartPolicy: Always. The v25 feedback confirms the check works: runs with injection enabled show "All microservice pods have Istio sidecars" (passes), while runs without injection correctly report missing sidecars.
Despite the 0.15 mean score, agents do substantial correct work. In the better runs (2, 5, 8, 9, 10):
- All 8 Rollouts exist with canary strategy and Istio traffic routing
- All 9 Applications generated, Healthy, and Synced
- Old bleater-platform Application removed
- AnalysisTemplate correctly configured
- PreSync/PostSync hooks creating and closing Gitea issues
- VirtualServices with stable/canary subsets
- DestinationRules correctly configured
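The stable/canary plumbing those runs got right can be pictured as follows (service name and labels are hypothetical; with Argo Rollouts' Istio traffic routing, the controller itself manages the subsets' pod-template-hash selectors at runtime):

```yaml
# Illustrative DestinationRule carrying the stable/canary subsets the
# VirtualServices reference; names and labels are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: example-service
spec:
  host: example-service
  subsets:
  - name: stable
    labels:
      app: example-service
  - name: canary
    labels:
      app: example-service
```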
After the must-fix items are addressed, we estimate the mean would rise to roughly 0.55–0.70. The compound gating within mega-subscores provides natural difficulty, and the difficulty suggestions above are designed to keep the score in range without a separate tuning iteration.
- Realistic "abandoned migration" scenario with ~10 interrelated failures
- Agents demonstrate genuine investigation and problem-solving skills
- `sync_lifecycle` provides real signal (60% pass rate with varied failure modes)
- Substantial scope appropriate for medium difficulty
- solution.sh achieves 1.0 locally
- Sidecar check correctly handles Istio 1.27 native sidecar mode
| Run | Score | gitops_and_cluster | istio_and_networking | canary_rollout | sync_lifecycle |
|---|---|---|---|---|---|
| 1 | 0.0 | 0 | 0 | 0 | 0 |
| 2 | 0.25 | 0 | 0 | 0 | 1 |
| 3 | 0.0 | 0 | 0 | 0 | 0 |
| 4 | 0.25 | 0 | 0 | 0 | 1 |
| 5 | 0.25 | 0 | 0 | 0 | 1 |
| 6 | 0.0 | 0 | 0 | 0 | 0 |
| 7 | 0.0 | 0 | 0 | 0 | 0 |
| 8 | 0.25 | 0 | 0 | 0 | 1 |
| 9 | 0.25 | 0 | 0 | 0 | 1 |
| 10 | 0.25 | 0 | 0 | 0 | 1 |