@potiuk
Created May 25, 2026 23:11
Apache Airflow security audit — joined summary of two waves (May 2026)

This summary covers two waves of Tooling Agents ASVS-L3 audits against apache/airflow, triaged in May 2026. It compares the impact of fixes, false-positive counts and shape, and the overall trajectory.

The audits triaged in this session all ran against the same 0920c77 / b1aec75 commit pair, and for the first time the scanner had AGENTS.md in its context.

Earlier rounds (run without AGENTS.md in context) are referred to as the "previous wave"; the PRs from those rounds were already merged before this session started.


Impact: what landed, what didn't

This wave (3 audits, single commit pair):

| Audit | Findings | Real PRs | Fixed-finding count | False positives | By-design / deprecated / N/A | Doc / marginal |
|---|---|---|---|---|---|---|
| airflow-core L3 0920c77 | 23 | 6 | 7 | 4 | 3 | 9 |
| task-sdk L3 0920c77 | 20 | 4 | 4 | 3 | 8 | 5 |
| providers/google L3 b1aec75 | 20 | 5 | 8 | 0 | 4 (deprecated module) | 8 |
| Wave totals | 63 | 15 | 19 | 7 | 15 | 22 |
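
The table above can be sanity-checked with a few lines of Python. This is a minimal sketch that assumes the four disposition columns (fixed, false positive, by-design, doc/marginal) partition each audit's findings, and that "Real PRs" sits outside that partition, apparently because one PR can close more than one finding. The names `WAVE` and `partition_gaps` are illustrative only.

```python
# Counts copied from the table above:
# (findings, fixed, false_positives, by_design, doc_marginal)
WAVE = {
    "airflow-core L3": (23, 7, 4, 3, 9),
    "task-sdk L3": (20, 4, 3, 8, 5),
    "providers/google L3": (20, 8, 0, 4, 8),
}


def partition_gaps(rows):
    """Return findings minus the sum of dispositions per audit (0 = consistent)."""
    return {name: findings - sum(rest) for name, (findings, *rest) in rows.items()}


gaps = partition_gaps(WAVE)
total_findings = sum(row[0] for row in WAVE.values())
total_fixed = sum(row[1] for row in WAVE.values())
print(gaps, total_findings, total_fixed)
```

Under that assumption every row balances, and the column sums reproduce the "Wave totals" line.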

Previous wave (multiple iterations as the auditor calibrated):

| Run | Findings | Real PRs | Fixed-finding count | False positives | By-design / dropped | Doc / marginal |
|---|---|---|---|---|---|---|
| airflow-core L1 (original) | 16 | 1 | 1 | 2 (factually wrong: TLS premise; revocation claim) | 12 | 1 |
| airflow-core L3 5872e6c (no AGENTS.md context) | 200 | 5 | 6 | ~5 acknowledged after re-triage | ~125 | ~64 |
| airflow-core L3 41a6436 | 10 | 7 | 7 | 3 | 0 | 0 |
| task-sdk L1 (original) | 16 | 1 | 1 | ~1 (TLS premise) | 13 | 1 |
| task-sdk L3 95bbf6a (no AGENTS.md context) | 122 | 5 | 5 | ~few | ~95 | ~17 |
| Wave totals (approx) | ~364 | 19 | 20 | ~11 | ~245 | ~85 |

Counts for the L3 runs without AGENTS.md context are approximate because many "false positive" candidates collapsed into the documented by-design buckets rather than being individually refuted.


What's actually CVE-worthy

Zero in either wave. Re-reading both waves end-to-end:

| Wave / finding | Headline severity | Reality |
|---|---|---|
| Previous, task-sdk old-F-005 (httpx.ConnectError) | "supervisor IPC permanently broken" | Reliability bug, not a security boundary |
| Previous, task-sdk old-F-006 (failed task → SUCCESS) | "downstream pipeline corruption" | Correctness bug, not a security boundary |
| Previous, task-sdk old-F-007 (stuck RUNNING) | "stuck task forever" | Reliability bug |
| Previous, task-sdk old-F-017 (secrets-backend fallthrough) | "deny-then-fallback to env backend" | Defense-in-depth; DAG authors already trusted in security model |
| Previous, airflow-core old-F-003 (random.choices) | "weak password gen" | Real crypto correctness, but SimpleAuthManager is dev-only |
| Previous, airflow-core old-F-008 (JSONDecode fail-open) | "fail-open on _collect_teams_to_check" | Unreachable today: Pydantic body validation 422s before authz dependency runs |
| This wave, airflow-core F-002 (Medium, bulk OVERWRITE) | "bulk authz bypass" | Only matters in experimental [core] multi_team, which security_model.rst documents as not providing task-level isolation |
| This wave, providers/google F-001 (Medium, google_openid) | "email-claim takeover" | Module is @deprecated(planned_removal_release="apache-airflow-providers-google==15.0.0"); auth backends don't even run on Airflow 3 |

Both waves: zero exploitable findings; zero CVE-worthy.
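
For the random.choices row above, the usual shape of the fix is to draw characters from the operating system's CSPRNG through the standard-library `secrets` module. The sketch below is illustrative only: `generate_password` and `ALPHABET` are hypothetical names, and this is not the actual SimpleAuthManager patch.

```python
import secrets
import string

# Hypothetical alphabet; the real character set is a policy decision.
ALPHABET = string.ascii_letters + string.digits


def generate_password(length: int = 16) -> str:
    """Build a password from the OS CSPRNG rather than the Mersenne Twister."""
    if length < 1:
        raise ValueError("length must be positive")
    # secrets.choice uses SystemRandom under the hood; random.choices does not.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


sample = generate_password(24)
print(len(sample))
```

The `random` module is documented as unsuitable for security purposes, which is why the finding counts as real crypto correctness even though the affected component is dev-only.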


False-positive comparison

| Metric | Previous wave | This wave |
|---|---|---|
| Strict false positives (factually-wrong claims) | ~11 across all runs | 7 |
| As % of findings (excluding noisy L3 without AGENTS.md context) | ~14% on the smaller runs | 11% |
| Refutation cost | One line from security_model.rst or a config schema | Cross-module grep: finding the upstream merge_contextvars / is_safe_url / BaseAuthManager call |
| Auditor self-hedging | Rare; confident misclassifications | Frequent: "may perform this check separately", "cannot be verified from provided code", "speculative — no cookies actually set" |

The category mix shifted. Previous-wave false positives were "wrong frame" — treating admin config as untrusted, treating dev-only modules as production, treating documented limitations as new findings. This-wave false positives are "missed the upstream guard" — auditor looked at the right module but didn't follow the delegation chain to where the safety check lives (shared logging module, server-side route, alternate auth-manager method).

Specific false positives this wave

| # | Finding | What the audit saw | What it missed |
|---|---|---|---|
| airflow-core F-001 | "JWT revocation never checked during validation" | JWTValidator.avalidated_claims() has no revocation check | BaseAuthManager.get_user_from_token() checks at a different layer; the auditor's own description hedges "resolve_user_from_token may perform this check separately" |
| airflow-core F-009 | "__Host- cookie prefix unverifiable" | The audited file's imports and constants only | The __Host- prefix requires Path=/, which is incompatible with Airflow's configurable get_cookie_path(); the auditor explicitly says "cannot be verified from provided code", i.e., a coverage gap surfaced as a finding |
| airflow-core F-010 | "Login next parameter lacks server-side allowlist validation" | UI-side getNextHref construction | Server-side is_safe_url(next, request=request) exists at core_api/routes/public/auth.py:45 and 400s on unsafe input |
| airflow-core F-011 | "Connection login field returned unredacted" | ConnectionResponse model emits login verbatim | login is documented as a username across the provider ecosystem; masking would break expected API behaviour |
| task-sdk F-002 | "No auto-injected task_id/dag_id identity in logs" | SDK-level structlog processor chain | structlog.contextvars.merge_contextvars is in the shared shared/logging/.../structlog.py:351; identity bound at task start propagates automatically |
| task-sdk F-003 | "No UTC enforcement in timestamp configuration" | SDK-level configure_logging doesn't pass utc=True | Shared module uses MaybeTimeStamper(fmt="iso"); structlog's TimeStamper defaults to utc=True, and its ISO 8601 output carries a Z suffix |
| task-sdk F-007 | "No HTTPS scheme enforcement before bearer token transmission" | Client.__init__ accepts any base_url | Same admin-config rationale as the previous wave: plaintext HTTP behind a TLS-terminating ingress is a documented supported deployment per security_model.rst |

providers/google produced zero outright false positives — all 20 findings were either real, deprecated-module, doc-improvement, or marginal-hygiene.
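
To make the F-010 row concrete, here is a minimal sketch of the kind of server-side check that a UI-only review would miss. It is not Airflow's is_safe_url implementation: `is_same_origin_redirect` and its rules are hypothetical, written with the standard library only, to show what "allowlist validation of next" tends to look like.

```python
from urllib.parse import urlsplit


def is_same_origin_redirect(target: str, request_host: str) -> bool:
    """Hypothetical stand-in for a server-side check on a login `next` value."""
    # Control characters and backslashes are a common way to confuse URL parsers.
    if not target or any(ch in target for ch in "\r\n\t\\"):
        return False
    try:
        parts = urlsplit(target)
    except ValueError:  # e.g. a malformed IPv6 literal
        return False
    if parts.scheme and parts.scheme not in ("http", "https"):
        return False
    if not parts.netloc:
        # Relative target: accept only a single-slash absolute path.
        return target.startswith("/") and not target.startswith("//")
    # Absolute target: the host must be the one serving the request.
    return parts.netloc == request_host
```

A route using a check like this would reject the request (the table notes that Airflow responds with a 400) instead of redirecting, which is why the missing UI-side validation does not amount to an open redirect.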


Shape of the actual fixes — across both waves

Previous wave: reliability + correctness. The bugs that mattered would have caused user-observable failures: broken IPC channels, wrong terminal task states, stuck RUNNING tasks, downstream pipelines corrupted by a falsely-SUCCESS-marked upstream.

This wave: observability + defense-in-depth. The bugs that matter cause silent failures — operators don't see the warning they should see:

  • suppress(Exception) and except X: pass swallowing security-relevant signals (F-001 task-sdk, F-008 google, F-019 airflow-core).
  • Exception strings carrying internal identifiers (project IDs, SA emails, IAM permission names) into user-visible error responses (F-010 google).
  • Unguarded finally blocks where instrumentation can mask the original failure (F-005, F-006 task-sdk; F-013 google).
  • Silent log-history truncation on transient read failures (F-012 google).
  • Misconfigurations that should be loud but aren't (F-005 airflow-core jwt_audience section mismatch; F-008 google missing remote_log_conn_id).
  • Input-validation hardening on known footgun shapes — none currently exploitable but worth closing: LIKE wildcards (F-007/8 airflow-core), GCS path traversal (F-005/6 google), CR/LF in log args (F-018 airflow-core), query-string secret redaction (F-015 airflow-core).
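
The swallowed-exception and unguarded-finally bullets above share one shape: cleanup or instrumentation code that either hides a failure or replaces it. A minimal sketch of the guarded form follows; `run_with_guarded_cleanup`, `work`, and `emit_metric` are hypothetical names, not code from any of the PRs.

```python
import logging

log = logging.getLogger("audit_sketch")


def run_with_guarded_cleanup(work, emit_metric):
    """Run work(); instrumentation in `finally` must not replace work()'s exception."""
    try:
        return work()
    finally:
        try:
            emit_metric()
        except Exception:
            # Log loudly instead of `pass`, and do not re-raise: an exception
            # escaping a finally block would replace the in-flight one.
            log.warning("metric emission failed", exc_info=True)
```

If `work()` raises ValueError and `emit_metric()` raises RuntimeError, the caller still sees the ValueError and the operator still gets a warning about the metric. Without the inner guard the caller would see only the RuntimeError.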

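Two of the input-validation shapes in the last bullet above, LIKE wildcards and CR/LF in log arguments, can be sketched with the standard library. The helper names are hypothetical and sqlite3 is used only to demonstrate the ESCAPE clause; the actual Airflow fixes may look different.

```python
import sqlite3


def escape_like(term: str, escape: str = "\\") -> str:
    """Escape LIKE metacharacters so user input matches literally."""
    return (
        term.replace(escape, escape + escape)  # escape the escape character first
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def neutralize_crlf(value: str) -> str:
    """Render CR/LF visibly so a log argument cannot forge a new log line."""
    return value.replace("\r", "\\r").replace("\n", "\\n")


def like_matches(value: str, pattern: str) -> bool:
    """Evaluate `value LIKE pattern ESCAPE '\\'` in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        row = conn.execute("SELECT ? LIKE ? ESCAPE '\\'", (value, pattern)).fetchone()
        return bool(row[0])
    finally:
        conn.close()
```

With the search term `100%_done`, the unescaped pattern `%100%_done%` also matches `100xydone`, while the escaped pattern matches only rows containing the literal text.
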
Trajectory

Two things changed between waves:

  1. Volume collapsed. L3 task-sdk went from 122 findings → 20. L3 airflow-core went from 200 → 23. Adding AGENTS.md to the scanner's context did most of the work — multi-team, DFP/triggerer DB access, Jinja sandbox, dependency-CVE policy, deployment-manager responsibilities all stopped re-surfacing under fresh IDs.

  2. Actionability rate jumped. Previous wave averaged ~5% of findings → actual PR. This wave averaged ~30% (with providers/google at 40%, where the audit was cleanest-shape). The remaining ~70% splits between false positives, by-design, deprecated modules, and doc improvements.
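
The actionability figures above can be reproduced from the wave totals. This sketch assumes the rate is fixed findings divided by total findings, which is the reading that reproduces the quoted ~30% and 40%; the previous-wave totals are the approximate ones from its table.

```python
# (fixed findings, total findings), taken from the two wave tables.
PREVIOUS_WAVE = (20, 364)  # totals are approximate in the source
THIS_WAVE = (19, 63)
PROVIDERS_GOOGLE = (8, 20)


def actionability(fixed: int, findings: int) -> float:
    """Percentage of findings that ended up fixed, rounded to one decimal."""
    return round(100 * fixed / findings, 1)


rates = {
    "previous": actionability(*PREVIOUS_WAVE),
    "this": actionability(*THIS_WAVE),
    "google": actionability(*PROVIDERS_GOOGLE),
}
print(rates)  # {'previous': 5.5, 'this': 30.2, 'google': 40.0}
```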


Bottom line

~35 PRs across two waves, 0 CVEs, 0 currently-exploitable bugs. The first wave caught the user-observable reliability bugs; this wave caught the silent-failure observability gaps and a few defense-in-depth shapes. The audit is now finding what's left after the obvious bugs have been cleaned up, which is a healthy signal about the codebase's maturity at L3.
