
@arubis
Last active March 24, 2026 21:25

# Air-Gap Hazard: `imagePullPolicy: Always` and ArgoCD Repo-Server Stability

When a task's `setup.sh` installs Helm releases with `imagePullPolicy: Always` in Nebula's air-gapped environment, the pods start successfully on idle machines — but under resource pressure (hosted infra, CI runners, concurrent containers), the registry pull timeout can trigger a crash-loop cascade that takes down ArgoCD's repo-server.
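A minimal sketch of the failure-inducing pattern. The chart repo alias, release name, and value path are assumptions modeled on the upstream argo-rollouts Helm chart; the actual `setup.sh` may differ:

```shell
# Hypothetical setup.sh excerpt (names and value path are assumptions).
# pullPolicy=Always forces a registry round-trip on every pod start —
# in an air-gapped cluster that pull can only time out.
helm install argo-rollouts argo/argo-rollouts \
  --namespace argo-rollouts --create-namespace \
  --set controller.image.pullPolicy=Always
```

On an idle host the timeout resolves quickly and k3s falls back to the cached image; the hazard only surfaces under load, as described below.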

Discovered during review of broken-canary-gitops-migration-recovery (v22), where test-solution scored 0.0 despite the solution completing correctly.


## Reproduction

Does NOT reproduce on an idle host. With ample CPU headroom, k3s fails the registry check quickly and falls back to the cached image. The pods start normally.

Reproduces under resource pressure. When the host has reduced CPU headroom (other Docker containers, hosted infra with shared resources), the registry pull timeout takes longer. This can push argo-rollouts pods into crash-loop, which consumes CPU (~568m per pod observed), which slows ArgoCD's repo-server startup past its liveness probe deadline.

| Host condition | argo-rollouts | ArgoCD repo-server | Task solvable? |
|---|---|---|---|
| Idle (~1% CPU) | Starts cleanly, 0 restarts | Stable | Yes |
| Loaded (~20%+ CPU) | Crash-loops, 6-7 restarts | 17+ restarts, never Ready | No |
| Loaded, argo-rollouts scaled to 0 | N/A | Stable after pod delete | Yes |

This explains the author's report: "works on my local but hosted validation fails."
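On a loaded host, the two halves of the cascade can be confirmed directly. A diagnostic sketch — namespaces and the repo-server label selector are assumptions based on standard argo-rollouts and argo-cd installs:

```shell
# Restart counts reveal both halves of the cascade.
kubectl get pods -n argo-rollouts          # rollouts controller crash-looping
kubectl get pods -n argocd                 # repo-server restart count climbing

# CPU burned by the crash-loop (requires metrics-server).
kubectl top pods -n argo-rollouts

# Probe failures surface as Unhealthy events and exit code 143 (SIGTERM).
kubectl describe pod -n argocd -l app.kubernetes.io/name=argocd-repo-server
```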

## The Cascade Under Load

```mermaid
flowchart TD
    A["setup.sh installs argo-rollouts\nwith imagePullPolicy: Always"] --> B["k3s attempts registry pull\nin air-gapped environment"]
    B --> C{"Host under\nresource pressure?"}
    C -->|No| D["Pull fails fast → cached image used\nPods start normally ✅"]
    C -->|Yes| E["Pull timeout is slow\nPods restart before image resolves"]
    E --> F["Crash-loop burns ~568m CPU\nper argo-rollouts pod"]
    F --> G["CPU contention slows\nrepo-server GPG key generation to ~19s"]
    G --> H["Liveness probe fires at t=10s\n(1s timeout, 3 failures to kill)"]
    H --> I["Repo-server killed → crash-loop\n→ ApplicationSet can't generate apps"]
    I --> J["0 Applications, 0 Rollouts\nTask unsolvable, score 0.0"]

    style C fill:#ffd,stroke:#aa0
    style D fill:#dfd,stroke:#0a0
    style J fill:#fcc,stroke:#c33
```

## ArgoCD Repo-Server Liveness Probe

The tight probe defaults are the proximate cause of the repo-server crash-loop:

```yaml
livenessProbe:
  initialDelaySeconds: 10   # probe starts at t=10
  periodSeconds: 10         # checks every 10s
  timeoutSeconds: 1         # must respond within 1s
  failureThreshold: 3       # killed after 3 consecutive failures
```

The repo-server's first action on startup is GPG key generation, which takes ~19s under contention (vs <10s normally). The probe starts at t=10 but the server isn't listening until t≈20. Three failures → SIGTERM (exit 143) → restart → repeat.
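Since the tight defaults are the proximate cause, one mitigation sketch is to loosen them at install time. The value paths below are assumptions based on the upstream argo-cd Helm chart's `repoServer` block; verify against the chart version actually in use:

```yaml
# values.yaml fragment for the argo-cd Helm chart (paths assumed)
repoServer:
  livenessProbe:
    initialDelaySeconds: 30   # outlast GPG key generation (~19s under contention)
    timeoutSeconds: 5         # tolerate slow responses under CPU pressure
    failureThreshold: 5       # require sustained failure before a kill
```

This only treats the symptom on the repo-server side; the recommendation below removes the trigger instead.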

## Recommendation for Task Authors

Avoid `imagePullPolicy: Always` as a break mechanism in air-gapped tasks. It creates environment-dependent behavior that works locally but fails on constrained infrastructure.

Alternatives that produce a similar "fix the broken controller" experience:

| Approach | Behavior | Predictable? |
|---|---|---|
| Wrong image tag (`--set image.tag=v0.0.0-broken`) | ErrImagePull / ImagePullBackOff — no CPU burn | Yes |
| Bad args (`--set extraArgs='{--invalid-flag}'`) | Fast CrashLoopBackOff with immediate exit | Yes |
| Missing CRD permissions | Controller runs but can't reconcile | Yes |
| `imagePullPolicy: Always` | Works on idle hosts, cascades on loaded hosts | No |
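Of these, the wrong-tag approach is the closest drop-in replacement. A hedged sketch — chart repo alias and value path are assumptions, mirroring the table above:

```shell
# Deterministic break: a nonexistent tag yields ImagePullBackOff on every
# host, idle or loaded, with no CPU-burning crash-loop to cascade.
helm install argo-rollouts argo/argo-rollouts \
  --namespace argo-rollouts --create-namespace \
  --set controller.image.tag=v0.0.0-broken
```

The solver sees the same "fix the broken controller" symptom, but the failure mode is identical regardless of host load.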