When a task's setup.sh installs Helm releases with imagePullPolicy: Always in Nebula's air-gapped environment, the pods start successfully on idle machines — but under resource pressure (hosted infra, CI runners, concurrent containers), the registry pull timeout can trigger a crash-loop cascade that takes down ArgoCD's repo-server.
Discovered during review of broken-canary-gitops-migration-recovery (v22), where test-solution scored 0.0 despite the solution completing correctly.
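For reference, the failure pattern comes from an install along these lines. This is a minimal sketch, not the task's exact setup.sh; the chart repo and value path are assumptions based on the argo-helm chart layout:

```bash
# Sketch of the problematic pattern (chart name and value path assumed,
# following the argo-helm chart layout; not the task's exact setup.sh).
helm upgrade --install argo-rollouts argo/argo-rollouts \
  --namespace argo-rollouts --create-namespace \
  --set controller.image.pullPolicy=Always  # forces a registry pull in an air-gapped cluster
```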
Does NOT reproduce on an idle host. With ample CPU headroom, k3s fails the registry check quickly and falls back to the cached image. The pods start normally.
Reproduces under resource pressure. When the host has reduced CPU headroom (other Docker containers, hosted infra with shared resources), the registry pull timeout takes longer. This can push argo-rollouts pods into crash-loop, which consumes CPU (~568m per pod observed), which slows ArgoCD's repo-server startup past its liveness probe deadline.
| Host condition | argo-rollouts | ArgoCD repo-server | Task solvable? |
|---|---|---|---|
| Idle (~1% CPU) | Starts cleanly, 0 restarts | Stable | Yes |
| Loaded (~20%+ CPU) | Crash-loops, 6-7 restarts | 17+ restarts, never Ready | No |
| Loaded, argo-rollouts scaled to 0 | N/A | Stable after pod delete | Yes |
This explains the author's report: "works on my local but hosted validation fails."
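To reproduce the loaded case and watch the cascade develop, something like the following works. This is a sketch; stress-ng, metrics-server, and the standard namespace/label names are assumptions about the environment:

```bash
# Simulate resource pressure on the host (assumes stress-ng is available).
stress-ng --cpu "$(nproc)" --timeout 300s &

# Watch restart counts climb on the rollouts controller.
kubectl get pods -n argo-rollouts -w

# Confirm the crash-loop CPU burn (requires metrics-server).
kubectl top pods -n argo-rollouts

# Check that the repo-server died from the liveness probe (SIGTERM, exit 143).
kubectl get pod -n argocd -l app.kubernetes.io/name=argocd-repo-server \
  -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.exitCode}'
```

The full cascade, from pull policy to score 0.0: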
```mermaid
flowchart TD
    A["setup.sh installs argo-rollouts\nwith imagePullPolicy: Always"] --> B["k3s attempts registry pull\nin air-gapped environment"]
    B --> C{"Host under\nresource pressure?"}
    C -->|No| D["Pull fails fast → cached image used\nPods start normally ✅"]
    C -->|Yes| E["Pull timeout is slow\nPods restart before image resolves"]
    E --> F["Crash-loop burns ~568m CPU\nper argo-rollouts pod"]
    F --> G["CPU contention slows\nrepo-server GPG key generation to ~19s"]
    G --> H["Liveness probe fires at t=10s\n(1s timeout, 3 failures to kill)"]
    H --> I["Repo-server killed → crash-loop\n→ ApplicationSet can't generate apps"]
    I --> J["0 Applications, 0 Rollouts\nTask unsolvable, score 0.0"]

    style C fill:#ffd,stroke:#aa0
    style D fill:#dfd,stroke:#0a0
    style J fill:#fcc,stroke:#c33
```
The tight probe defaults are the proximate cause of the repo-server crash-loop:
```yaml
livenessProbe:
  initialDelaySeconds: 10   # probe starts at t=10
  periodSeconds: 10         # checks every 10s
  timeoutSeconds: 1         # must respond within 1s
  failureThreshold: 3       # killed after 3 consecutive failures
```

The repo-server's first action on startup is GPG key generation, which takes ~19s under contention (vs. <10s normally). The probe starts at t=10, but the server isn't listening until t≈20. Three failures → SIGTERM (exit 143) → restart → repeat.
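One way to sanity-check this timing analysis is to loosen the probe and confirm the repo-server stabilizes once the crash-loop CPU burn is removed. A sketch via JSON patch; the container index and the standard argocd install names are assumptions:

```bash
# Sketch: relax the repo-server liveness probe so GPG key generation
# (~19s under contention) finishes before the first kill.
# Assumes the standard argocd namespace and container index 0.
kubectl -n argocd patch deployment argocd-repo-server --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 30},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5}
]'
```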
Avoid imagePullPolicy: Always as a break mechanism in air-gapped tasks. It creates environment-dependent behavior that works locally but fails on constrained infrastructure.
Alternatives that produce a similar "fix the broken controller" experience:
| Approach | Behavior | Predictable? |
|---|---|---|
| Wrong image tag (--set image.tag=v0.0.0-broken) | ErrImagePull / ImagePullBackOff; no CPU burn | Yes |
| Bad args (--set extraArgs='{--invalid-flag}') | Fast CrashLoopBackOff with immediate exit | Yes |
| Missing CRD permissions | Controller runs but can't reconcile | Yes |
| imagePullPolicy: Always | Works on idle hosts, cascades on loaded hosts | No |
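For example, the wrong-tag approach could look like this in setup.sh. A sketch; the value path is taken from the table above and may need adjusting to the chart's actual layout:

```bash
# Sketch of the predictable break: a nonexistent image tag fails the same
# way on idle and loaded hosts alike, with no restart-driven CPU burn.
# (Value path from the table above; adjust if the chart nests it differently.)
helm upgrade --install argo-rollouts argo/argo-rollouts \
  --namespace argo-rollouts --create-namespace \
  --set image.tag=v0.0.0-broken

# The pod lands in ErrImagePull / ImagePullBackOff immediately.
kubectl get pods -n argo-rollouts
```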