UUID: 72edf9a5-6e74-4b21-ad8d-6dc1fce79813
Author: yashpatil2000
Category: platform | Difficulty: medium
The task design is strong — a realistic "abandoned canary migration" scenario across ArgoCD, Argo Rollouts, Istio, and Gitea. Agents demonstrate substantial competence, but the mean score (0.15) is depressed by grader bugs, specification gaps, and environment instability. Only one of the four subscores (sync_lifecycle, 60%) provides meaningful signal.
| Metric | Value |
|---|---|
| Scored Runs | 10 |
| Scores | 0, 0.25, 0, 0.25, 0.25, 0, 0, 0.25, 0.25, 0.25 |
| Mean Score | 0.15 |
| Max Achieved | 0.25 |
| test-solution (local) | 1.0 |
| test-solution (hosted) | Fails |
| Subscore | Passes | Rate | Category |
|---|---|---|---|
| canary_rollout | 0/10 | 0% | Grader bug (must fix) |
| gitops_and_cluster | 0/10 | 0% | Specification gap (must fix) |
| istio_and_networking | 0/10 | 0% | Mixed — spec gap + debatable design |
| sync_lifecycle | 6/10 | 60% | Genuine difficulty |
These are issues where the agent has no control over the outcome. They must be fixed before the next review.
Location: grader.py:899-920
The loadgenerator is grader infrastructure that the agent never touches. It deploys to a loadgenerator namespace that isn't created, and even if it were, it lacks Istio injection — so under STRICT mTLS, its plain-HTTP traffic won't generate istio_requests_total metrics. There is no scenario where the Istio metrics gate passes. This blocks a check agents have no control over.
Fix: Create the loadgenerator namespace and label it istio-injection=enabled before deploying the load generator pod.
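The fix could be planted in the grader's setup as a manifest along these lines (a sketch; the namespace name comes from the review, the rest is standard Kubernetes/Istio):

```yaml
# Sketch of the grader-side fix: create the namespace before the load
# generator deploys, with the label that tells Istio's injection webhook
# to add a sidecar -- so the pod's traffic is mTLS-capable and produces
# istio_requests_total metrics under STRICT PeerAuthentication.
apiVersion: v1
kind: Namespace
metadata:
  name: loadgenerator
  labels:
    istio-injection: enabled
```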
Difficulty note: The existing multi-rollout analysis check (all rollouts with analysis must pass, not just one) already provides substantial difficulty — no additional compensation needed.
This task adds significant workload on top of the base Nebula stack: Argo Rollouts (2 controller replicas + dashboard) and Istio IngressGateway. Combined with imagePullPolicy: Always on these components, the added load can destabilize ArgoCD's repo-server on hosted infrastructure.
Root cause: ArgoCD's repo-server has a tight liveness probe (10s initial delay, 1s timeout, 3 failures to kill). On startup it generates a GPG key, which takes ~10s on an idle machine but ~19s+ under CPU contention. When the probe fires before the server is ready, it kills and restarts the pod — creating a crash-loop that prevents ApplicationSet from generating apps, syncs from completing, and hooks from firing.
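The probe described above would look roughly like this in the repo-server Deployment (the endpoint and port are illustrative assumptions; the timings are the ones cited):

```yaml
# Illustrative repo-server livenessProbe; endpoint/port are assumptions,
# the timings match the description above.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8084
  initialDelaySeconds: 10   # probing starts 10s after container start
  timeoutSeconds: 1         # each probe must answer within 1s
  failureThreshold: 3       # 3 consecutive failures -> kill the pod
# With GPG key generation taking ~19s under CPU contention, the pod is
# killed before it ever becomes ready, producing the crash-loop.
```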
The trigger: setup.sh installs argo-rollouts and istio-ingressgateway with imagePullPolicy: Always. In Nebula's air-gapped environment, k3s attempts a registry pull that hangs until timeout before falling back to the cached image. Under resource pressure, the crash-looping pods from these pull timeouts consume enough CPU to push the repo-server's startup past its liveness probe window.
What it affects:
- Hosted validation: solution.sh completes all its steps correctly but scores 0.0 because ArgoCD never generates Applications.
- Agent runs: ApplicationSet generation stalls, app syncs time out, hook jobs fail to trigger. The 4 runs scoring 0.0 (where even `sync_lifecycle` fails) could be partly explained by this.
- Grader: `check_sync_lifecycle` triggers its own ArgoCD sync with a 180s timeout — if the repo-server is unstable at grading time, this fails regardless of the agent's work.
See: "Air-Gap Cascade: `imagePullPolicy: Always` Destabilizes ArgoCD"
Fixes (address both):
- Change the break mechanism — Use a wrong image tag (`ErrImagePull`), bad args (fast crash), or missing permissions instead of `imagePullPolicy: Always`. All create a similar "fix the controller" experience without the resource cascade. The "fix the broken controller" step is valuable task content — just change how it's broken.
- Reduce the additional resource footprint — Consider installing argo-rollouts with `--set controller.replicas=1` (one controller instead of two) and validate that the task still works under hosted infra constraints. The fewer crash-looping pods during setup, the less likely the cascade triggers.
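A hedged sketch of the first fix: planting the break as a nonexistent image tag rather than `imagePullPolicy: Always` (the image name and tag here are hypothetical):

```yaml
# Hypothetical broken-controller fragment: the bogus tag yields a
# deterministic ErrImagePull without the repeated registry-timeout churn,
# so the agent still has a controller to diagnose and fix, but the host
# avoids the CPU cascade.
spec:
  template:
    spec:
      containers:
      - name: argo-rollouts
        image: quay.io/argoproj/argo-rollouts:v0.0.0-broken  # assumed bogus tag
        imagePullPolicy: IfNotPresent   # never re-pull a cached image
```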
These checks test legitimate requirements, but the task.yaml doesn't specify them clearly enough for agents to know what's expected. The fixes are about making the spec match the grader, not about changing what's tested.
Location: grader.py:263-264
The grader requires argocd.argoproj.io/sync-wave annotations on Rollout objects. Task.yaml says "sync-wave values are set accordingly" and the setup plants sync-wave: "-5" on existing rollout templates as a discoverable clue.
However: 0/10 agents add sync-wave to Rollouts. ArgoCD's PreSync/PostSync phases already guarantee ordering without annotations on the main resources. Primary reviewer Greg assessed this as "not standard and not discoverable."
Fix (option A — recommended): Remove the Rollout wave check. The gitops_and_cluster subscore bundles 6 sub-checks where one failure zeros the entire 0.25 — that compound gating already provides substantial difficulty.
Fix (option B): Make the requirement unmissable: "Each Rollout manifest must carry an argocd.argoproj.io/sync-wave annotation with a numeric value between the PreSync hook wave and the PostSync hook wave." This shifts the challenge from "discover the hidden requirement" to "implement it correctly across 8 services."
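Under option B, a compliant Rollout manifest would carry something like the following (the wave value is illustrative, assuming PreSync hooks at a negative wave such as the planted `-5` and PostSync hooks at a positive one):

```yaml
# Illustrative Rollout annotation for option B; "0" is a placeholder
# value sitting between an assumed PreSync hook wave (e.g. -5) and an
# assumed PostSync hook wave.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-service          # hypothetical service name
  annotations:
    argocd.argoproj.io/sync-wave: "0"
```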
Location: grader.py:614-621
The grader requires ALL VirtualServices to reference the Istio Gateway. The task says "ensure traffic binds correctly for both internal and external communication" — 2/10 agents got this right, so it IS discoverable, but most agents use standard mesh-only patterns for internal services.
Fix: Add a clearer task.yaml note: "Each microservice's VirtualService should be accessible via both the mesh gateway and the external ingress Gateway."
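Concretely, the clarified requirement corresponds to a VirtualService shaped roughly like this (gateway and service names are assumptions):

```yaml
# Sketch of a VirtualService bound to both the reserved in-mesh "mesh"
# gateway and the external ingress Gateway; names are hypothetical.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: example-service
spec:
  hosts:
  - example-service
  gateways:
  - mesh                            # sidecar-to-sidecar (internal) traffic
  - istio-system/bleater-gateway    # external ingress Gateway (name assumed)
  http:
  - route:
    - destination:
        host: example-service
        subset: stable
```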
Difficulty note: This is a good difficulty lever. The 2/10 pass rate shows it's hard even when discoverable. With clearer specification, expect ~4-6/10 — still meaningfully challenging.
These are design decisions where the check is testing something legitimate but the implementation has trade-offs. They don't need to block the next iteration — address them if scores are too high or too low after fixing the above.
Location: grader.py:390-400, _test_external_communication()
The task says "bleater services are not reachable from outside the namespace" — describing a broken state the agent should fix. STRICT PeerAuthentication blocking cross-namespace traffic IS the problem. Agents who leave STRICT in place (runs 1, 5) genuinely haven't completed the task. The check is fair to agents who fail it.
The concern is an inverted incentive: runs 2 and 4 pass this check by never enabling Istio at all — no injection means no STRICT enforcement, so plaintext traffic flows freely. The check rewards skipping Istio entirely over configuring it correctly but leaving STRICT in place.
If addressing: Gate the cross-namespace check on sidecar presence — only test cross-namespace traffic if sidecars are confirmed present, otherwise fail with "sidecars not configured." This prevents agents from passing by omitting Istio while keeping the PA removal as a genuine challenge.
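The gate could look roughly like this (a hypothetical sketch; the pod dicts and field names are assumptions, not grader.py's actual data model):

```python
# Hypothetical sketch of gating the cross-namespace check on sidecar
# presence, so skipping Istio can no longer pass the check.
def has_istio_proxy(pod: dict) -> bool:
    """True if the pod carries an istio-proxy container, checking
    initContainers (Istio 1.27 native sidecar mode) as well as regular
    containers (classic injection)."""
    containers = pod.get("initContainers", []) + pod.get("containers", [])
    return any(c.get("name") == "istio-proxy" for c in containers)

def check_cross_namespace(pods: list, traffic_blocked: bool) -> tuple:
    # Gate: with no sidecars there is no mTLS enforcement to exercise,
    # so fail outright instead of rewarding a skipped Istio setup.
    if not all(has_istio_proxy(p) for p in pods):
        return (False, "sidecars not configured")
    # Sidecars present: passing now genuinely requires the STRICT
    # PeerAuthentication to have been removed.
    return (not traffic_blocked, "cross-namespace traffic check")
```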
If deferring: The check works correctly in the common case (agents who enable Istio must also remove STRICT PA). The inverted incentive only affects agents who skip Istio, which also fails other sub-checks in istio_and_networking. The composite scoring may mask this issue in practice.
gitops_and_cluster gates 6 sub-checks (rollouts, gitops_applications, gitops_sync_waves, cluster_prerequisites, analysis_template, services_and_health) under a single AND — one failure anywhere zeros the entire 0.25. This is severe compound gating: an agent who completes 5/6 sub-requirements gets the same score as one who completes 0/6.
This currently works in the task's favor (adds difficulty), and the compound structure prevents gaming. But if post-fix scores are too low, consider splitting into 2 subscores (e.g., gitops_applications and cluster_config) with equal weights across more granular subscores. Conversely, if scores are too high, the bundling is already the strongest difficulty lever available.
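The arithmetic of the two structures can be sketched directly (sub-check names come from the review; the 5/6 run, the grouping, and the weights are illustrative):

```python
# Compound AND gate vs. a two-subscore split, using the six sub-checks
# named above, for a hypothetical run that misses exactly one sub-check.
subchecks = {
    "rollouts": True,
    "gitops_applications": True,
    "gitops_sync_waves": False,   # the one miss in a 5/6 run
    "cluster_prerequisites": True,
    "analysis_template": True,
    "services_and_health": True,
}

# Current structure: one failure anywhere zeros the whole 0.25.
compound = 0.25 if all(subchecks.values()) else 0.0

# Possible split: two bundles of three at 0.125 each (grouping assumed).
bundle_a = ["rollouts", "gitops_applications", "gitops_sync_waves"]
bundle_b = ["cluster_prerequisites", "analysis_template", "services_and_health"]
split = sum(0.125 for b in (bundle_a, bundle_b)
            if all(subchecks[k] for k in b))

print(compound, split)  # the 5/6 run scores 0.0 compound, 0.125 split
```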
This review aligns with Greg's earlier feedback on the key issues:
- Sync-wave on Rollouts — Greg flagged this in v14 and v16 as "not standard" and "a path restriction." Our recommendation matches.
- VS gateway binding — Greg's v10 concern was that Nebula didn't have istio-ingressgateway, making the Gateway a no-op. The v25 setup.sh now installs istio-ingressgateway, resolving that concern — our recommendation to keep the check with clearer spec reflects this change.
- Issue comment requirement — Greg flagged in v14 that the grader required comments without task.yaml specifying them. Task.yaml v25 now says "must be closed upon deployment completion with a comment" — resolved.
- Istio metrics gate and imagePullPolicy cascade are new findings not previously reviewed.
The sidecar check (grader.py:369-375) queries initContainers for istio-proxy. This was initially flagged as a potential bug, but it is correct for Istio 1.27's native sidecar mode, where istio-proxy is placed as an init container with restartPolicy: Always. The v25 feedback confirms the check works: runs with injection enabled show "All microservice pods have Istio sidecars" (passes), while runs without injection correctly report missing sidecars.
Despite the 0.15 mean score, agents do substantial correct work. In the better runs (2, 5, 8, 9, 10):
- All 8 Rollouts exist with canary strategy and Istio traffic routing
- All 9 Applications generated, Healthy, and Synced
- Old bleater-platform Application removed
- AnalysisTemplate correctly configured
- PreSync/PostSync hooks creating and closing Gitea issues
- VirtualServices with stable/canary subsets
- DestinationRules correctly configured
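The stable/canary plumbing those runs got right can be pictured as follows (service name and labels are hypothetical; with Argo Rollouts' Istio traffic routing, the controller itself manages the subsets' pod-template-hash selectors at runtime):

```yaml
# Illustrative DestinationRule carrying the stable/canary subsets the
# VirtualServices reference; names and labels are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: example-service
spec:
  host: example-service
  subsets:
  - name: stable
    labels:
      app: example-service
  - name: canary
    labels:
      app: example-service
```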
After the must-fix items are addressed, we estimate the mean would rise to roughly 0.55–0.70. The compound gating within mega-subscores provides natural difficulty, and the difficulty suggestions above are designed to keep the score in range without a separate tuning iteration.
- Realistic "abandoned migration" scenario with ~10 interrelated failures
- Agents demonstrate genuine investigation and problem-solving skills
- `sync_lifecycle` provides real signal (60% pass rate with varied failure modes)
- Substantial scope appropriate for medium difficulty
- solution.sh achieves 1.0 locally
- Sidecar check correctly handles Istio 1.27 native sidecar mode
| Run | Score | gitops_and_cluster | istio_and_networking | canary_rollout | sync_lifecycle |
|---|---|---|---|---|---|
| 1 | 0.0 | 0 | 0 | 0 | 0 |
| 2 | 0.25 | 0 | 0 | 0 | 1 |
| 3 | 0.0 | 0 | 0 | 0 | 0 |
| 4 | 0.25 | 0 | 0 | 0 | 1 |
| 5 | 0.25 | 0 | 0 | 0 | 1 |
| 6 | 0.0 | 0 | 0 | 0 | 0 |
| 7 | 0.0 | 0 | 0 | 0 | 0 |
| 8 | 0.25 | 0 | 0 | 0 | 1 |
| 9 | 0.25 | 0 | 0 | 0 | 1 |
| 10 | 0.25 | 0 | 0 | 0 | 1 |