When a task's setup.sh installs Helm releases with imagePullPolicy: Always in Nebula's air-gapped environment, the pods start successfully on idle machines — but under resource pressure (hosted infra, CI runners, concurrent containers), the registry pull timeout can trigger a crash-loop cascade that takes down ArgoCD's repo-server.
Discovered during review of broken-canary-gitops-migration-recovery (v22), where test-solution scored 0.0 despite the solution completing correctly.
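For reference, the failure pattern comes from an install along these lines. This is a minimal sketch, not the task's exact setup.sh; the chart repo and value path are assumptions based on the argo-helm chart layout:

```bash
# Sketch of the problematic pattern (chart name and value path assumed,
# following the argo-helm chart layout; not the task's exact setup.sh).
helm upgrade --install argo-rollouts argo/argo-rollouts \
  --namespace argo-rollouts --create-namespace \
  --set controller.image.pullPolicy=Always  # forces a registry pull in an air-gapped cluster
```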
Does NOT reproduce on an idle host. With ample CPU headroom, k3s fails the registry check quickly and falls back to the cached image. The pods start normally.
Reproduces under resource pressure. When the host has reduced CPU headroom (other Docker containers, hosted infra with shared resources), the registry pull timeout takes longer. This can push argo-rollouts pods into crash-loop, which consumes CPU (~568m per pod observed), which slows ArgoCD's repo-server startup past its liveness probe deadline.
| Host condition | argo-rollouts | ArgoCD repo-server | Task solvable? |
|---|---|---|---|
| Idle (~1% CPU) | Starts cleanly, 0 restarts | Stable | Yes |
| Loaded (~20%+ CPU) | Crash-loops, 6-7 restarts | 17+ restarts, never Ready | No |
| Loaded, argo-rollouts scaled to 0 | N/A | Stable after pod delete | Yes |
This explains the author's report: "works on my local but hosted validation fails."
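To reproduce the loaded case and watch the cascade develop, something like the following works. This is a sketch; stress-ng, metrics-server, and the standard namespace/label names are assumptions about the environment:

```bash
# Simulate resource pressure on the host (assumes stress-ng is available).
stress-ng --cpu "$(nproc)" --timeout 300s &

# Watch restart counts climb on the rollouts controller.
kubectl get pods -n argo-rollouts -w

# Confirm the crash-loop CPU burn (requires metrics-server).
kubectl top pods -n argo-rollouts

# Check that the repo-server died from the liveness probe (SIGTERM, exit 143).
kubectl get pod -n argocd -l app.kubernetes.io/name=argocd-repo-server \
  -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.exitCode}'
```

The full cascade, from pull policy to score 0.0: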
```mermaid
flowchart TD
    A["setup.sh installs argo-rollouts\nwith imagePullPolicy: Always"] --> B["k3s attempts registry pull\nin air-gapped environment"]
    B --> C{"Host under\nresource pressure?"}
    C -->|No| D["Pull fails fast → cached image used\nPods start normally ✅"]
    C -->|Yes| E["Pull timeout is slow\nPods restart before image resolves"]
    E --> F["Crash-loop burns ~568m CPU\nper argo-rollouts pod"]
    F --> G["CPU contention slows\nrepo-server GPG key generation to ~19s"]
    G --> H["Liveness probe fires at t=10s\n(1s timeout, 3 failures to kill)"]
    H --> I["Repo-server killed → crash-loop\n→ ApplicationSet can't generate apps"]
    I --> J["0 Applications, 0 Rollouts\nTask unsolvable, score 0.0"]

    style C fill:#ffd,stroke:#aa0
    style D fill:#dfd,stroke:#0a0
    style J fill:#fcc,stroke:#c33
```
The tight probe defaults are the proximate cause of the repo-server crash-loop:
```yaml
livenessProbe:
  initialDelaySeconds: 10   # probe starts at t=10
  periodSeconds: 10         # checks every 10s
  timeoutSeconds: 1         # must respond within 1s
  failureThreshold: 3       # killed after 3 consecutive failures
```

The repo-server's first action on startup is GPG key generation, which takes ~19s under contention (vs. <10s normally). The probe starts at t=10, but the server isn't listening until t≈20. Three failures → SIGTERM (exit 143) → restart → repeat.
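One way to sanity-check this timing analysis is to loosen the probe and confirm the repo-server stabilizes once the crash-loop CPU burn is removed. A sketch via JSON patch; the container index and the standard argocd install names are assumptions:

```bash
# Sketch: relax the repo-server liveness probe so GPG key generation
# (~19s under contention) finishes before the first kill.
# Assumes the standard argocd namespace and container index 0.
kubectl -n argocd patch deployment argocd-repo-server --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 30},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5}
]'
```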
Avoid imagePullPolicy: Always as a break mechanism in air-gapped tasks. It creates environment-dependent behavior that works locally but fails on constrained infrastructure.
Alternatives that produce a similar "fix the broken controller" experience:
| Approach | Behavior | Predictable? |
|---|---|---|
| Wrong image tag (--set image.tag=v0.0.0-broken) | ErrImagePull / ImagePullBackOff; no CPU burn | Yes |
| Bad args (--set extraArgs='{--invalid-flag}') | Fast CrashLoopBackOff with immediate exit | Yes |
| Missing CRD permissions | Controller runs but can't reconcile | Yes |
| imagePullPolicy: Always | Works on idle hosts, cascades on loaded hosts | No |
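For example, the wrong-tag approach could look like this in setup.sh. A sketch; the value path is taken from the table above and may need adjusting to the chart's actual layout:

```bash
# Sketch of the predictable break: a nonexistent image tag fails the same
# way on idle and loaded hosts alike, with no restart-driven CPU burn.
# (Value path from the table above; adjust if the chart nests it differently.)
helm upgrade --install argo-rollouts argo/argo-rollouts \
  --namespace argo-rollouts --create-namespace \
  --set image.tag=v0.0.0-broken

# The pod lands in ErrImagePull / ImagePullBackOff immediately.
kubectl get pods -n argo-rollouts
```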