Apache Airflow security audit: combined summary of two waves

This summary covers two waves of Tooling Agents ASVS-L3 audits against apache/airflow, triaged in May 2026. It compares the impact of fixes, false-positive counts and shape, and the overall trajectory.

Audits triaged in this session (all at the same 0920c77 / b1aec75 commit pair, with AGENTS.md finally in scanner context):

airflow-core L3 — tooling-agents#23
task-sdk L3 — tooling-agents#24
providers/google L3 — tooling-agents#34

Earlier rounds (without AGENTS.md in context) are referred to as the "previous wave"; their PRs were already merged before this session started.

Impact: what landed, what didn't

This wave (3 audits, single commit pair):

Audit	Findings	Real PRs	Fixed-finding count	False positives	By-design / deprecated / N/A	Doc / marginal
airflow-core L3 `0920c77`	23	6	7	4	3	9
task-sdk L3 `0920c77`	20	4	4	3	8	5
providers/google L3 `b1aec75`	20	5	8	0	4 (deprecated module)	8
Wave totals	63	15	19	7	15	22

Previous wave (multiple iterations as the auditor calibrated):

Run	Findings	Real PRs	Fixed-finding count	False positives	By-design / dropped	Doc / marginal
airflow-core L1 (original)	16	1	1	2 (factually wrong: TLS premise; revocation claim)	12	1
airflow-core L3 `5872e6c` (no AGENTS.md context)	200	5	6	~5 acknowledged after re-triage	~125	~64
airflow-core L3 `41a6436`	10	7	7	3	0	0
task-sdk L1 (original)	16	1	1	~1 (TLS premise)	13	1
task-sdk L3 `95bbf6a` (no AGENTS.md context)	122	5	5	~few	~95	~17
Wave totals (approx)	~364	19	20	~11	~245	~85

Counts for the L3 runs without AGENTS.md context are approximate because many "false positive" candidates collapsed into the documented by-design buckets rather than being individually refuted.

What's actually CVE-worthy

Zero in either wave. Re-reading both waves end-to-end:

Wave, finding	Headline claim	Reality
Previous, task-sdk old-F-005 (`httpx.ConnectError`)	"supervisor IPC permanently broken"	Reliability bug, not a security boundary
Previous, task-sdk old-F-006 (failed task → SUCCESS)	"downstream pipeline corruption"	Correctness bug, not a security boundary
Previous, task-sdk old-F-007 (stuck RUNNING)	"stuck task forever"	Reliability bug
Previous, task-sdk old-F-017 (secrets-backend fallthrough)	"deny-then-fallback to env backend"	Defense-in-depth; DAG authors already trusted in security model
Previous, airflow-core old-F-003 (`random.choices`)	"weak password gen"	Real crypto correctness — but `SimpleAuthManager` is dev-only
Previous, airflow-core old-F-008 (JSONDecode fail-open)	"fail-open on `_collect_teams_to_check`"	Unreachable today — Pydantic body validation 422s before authz dependency runs
This wave, airflow-core F-002 (Medium, bulk OVERWRITE)	"bulk authz bypass"	Only matters in experimental `[core] multi_team`, which `security_model.rst` documents as not providing task-level isolation
This wave, providers/google F-001 (Medium, `google_openid`)	"email-claim takeover"	Module is `@deprecated(planned_removal_release="apache-airflow-providers-google==15.0.0")` — auth backends don't even run on Airflow 3

Both waves: zero exploitable findings; zero CVE-worthy.
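
For airflow-core old-F-003, the difference between `random.choices` and a cryptographically secure source fits in a few lines. The following is a minimal sketch, not the code from the merged PR; `generate_password` and `ALPHABET` are hypothetical names chosen for illustration.

```python
import secrets
import string

# Hypothetical alphabet; a real deployment may want a different character set.
ALPHABET = string.ascii_letters + string.digits


def generate_password(length: int = 16) -> str:
    """Return a random password drawn from the OS CSPRNG."""
    if length < 1:
        raise ValueError("length must be positive")
    # secrets.choice uses the operating system's secure random source.
    # random.choices uses the Mersenne Twister, which is not intended
    # for security purposes.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

The change is a correctness fix for generated credentials; it does not alter the triage above, since `SimpleAuthManager` is dev-only.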

False-positive comparison

	Previous wave	This wave
Strict false positives (factually-wrong claims)	~11 across all runs	7
As % of findings (excluding noisy L3 without AGENTS.md context)	~14% on the smaller runs	11%
Refutation cost	One line from `security_model.rst` or a config schema	Cross-module grep — finding the upstream `merge_contextvars` / `is_safe_url` / `BaseAuthManager` call
Auditor self-hedging	Rare; confident misclassifications	Frequent: "may perform this check separately", "cannot be verified from provided code", "speculative — no cookies actually set"

The category mix shifted. Previous-wave false positives were "wrong frame" — treating admin config as untrusted, treating dev-only modules as production, treating documented limitations as new findings. This-wave false positives are "missed the upstream guard" — auditor looked at the right module but didn't follow the delegation chain to where the safety check lives (shared logging module, server-side route, alternate auth-manager method).
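
The "missed the upstream guard" pattern is easiest to see with the login `next` parameter: the UI code only builds the link, while the decision to accept or reject the target belongs in the server route. The sketch below shows what such a server-side check can look like using only the standard library. It is not Airflow's `is_safe_url`; `is_safe_redirect` and its `allowed_host` parameter are hypothetical.

```python
from urllib.parse import urlsplit


def is_safe_redirect(target: str, allowed_host: str) -> bool:
    """Hypothetical check: accept absolute paths or same-host http(s) URLs."""
    if not target:
        return False
    # Reject backslashes and control characters, which some browsers
    # normalise in ways that change where the URL points.
    if "\\" in target or any(ord(c) < 0x20 for c in target):
        return False
    try:
        parts = urlsplit(target)
    except ValueError:
        return False
    if not parts.scheme and not parts.netloc:
        # Plain path: must be absolute, and must not start with "//".
        return target.startswith("/") and not target.startswith("//")
    # Absolute URL: only http(s) to the expected host is accepted.
    return parts.scheme in ("http", "https") and parts.netloc == allowed_host
```

An auditor reading only the client-side link construction would not see a guard of this kind, which is the shape of airflow-core F-010.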

Specific false positives this wave

#	Finding	What the audit saw	What it missed
airflow-core F-001	"JWT revocation never checked during validation"	`JWTValidator.avalidated_claims()` — no revocation check	`BaseAuthManager.get_user_from_token()` checks at a different layer; auditor's own description hedges "resolve_user_from_token may perform this check separately"
airflow-core F-009	"`__Host-` cookie prefix unverifiable"	The audited file's imports and constants only	The `__Host-` prefix requires `Path=/`, which conflicts with Airflow's configurable `get_cookie_path()`; the auditor explicitly says "cannot be verified from provided code", i.e., a coverage gap surfaced as a finding
airflow-core F-010	"Login `next` parameter lacks server-side allowlist validation"	UI-side `getNextHref` construction	Server-side `is_safe_url(next, request=request)` exists at `core_api/routes/public/auth.py:45` and 400s on unsafe input
airflow-core F-011	"Connection `login` field returned unredacted"	`ConnectionResponse` model emits `login` verbatim	`login` is documented as a username across the provider ecosystem; masking would break expected API behaviour
task-sdk F-002	"No auto-injected `task_id`/`dag_id` identity in logs"	SDK-level structlog processor chain	`structlog.contextvars.merge_contextvars` is in the shared `shared/logging/.../structlog.py:351`; identity bound at task start propagates automatically
task-sdk F-003	"No UTC enforcement in timestamp configuration"	SDK-level `configure_logging` doesn't pass `utc=True`	Shared module uses `MaybeTimeStamper(fmt="iso")`; structlog's `TimeStamper` defaults to `utc=True` and ISO 8601 carries `Z`
task-sdk F-007	"No HTTPS scheme enforcement before bearer token transmission"	`Client.__init__` accepts any `base_url`	Same admin-config rationale as the previous wave — plaintext HTTP behind a TLS-terminating ingress is a documented supported deployment per `security_model.rst`

providers/google produced zero outright false positives — all 20 findings were either real, deprecated-module, doc-improvement, or marginal-hygiene.
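
The task-sdk F-002 refutation depends on context bound once at task start being merged into every later log event. The standard-library sketch below mimics that mechanism under stated assumptions: `bind_context`, `merge_context`, and `emit` are hypothetical stand-ins, not structlog's or Airflow's actual functions.

```python
import contextvars

# Hypothetical stand-in for the context a logging layer merges into events.
_log_context = contextvars.ContextVar("log_context", default={})


def bind_context(**fields) -> None:
    """Bind identity fields once, for example at task start."""
    # Build a new dict rather than mutating the shared default.
    _log_context.set({**_log_context.get(), **fields})


def merge_context(event: dict) -> dict:
    """Processor step: copy bound fields into an event; explicit keys win."""
    return {**_log_context.get(), **event}


def emit(message: str, **fields) -> dict:
    """Build a log event the way a processor chain would, without any I/O."""
    return merge_context({"event": message, **fields})
```

After `bind_context(dag_id="example_dag", task_id="extract")`, every `emit(...)` call carries both fields even though the call site never passes them, which is why inspecting only the SDK-level processor chain suggested the identity was missing.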

Shape of the actual fixes — across both waves

Previous wave: reliability + correctness. The bugs that mattered would have caused user-observable failures: broken IPC channels, wrong terminal task states, stuck RUNNING tasks, downstream pipelines corrupted by a falsely-SUCCESS-marked upstream.

This wave: observability + defense-in-depth. The bugs that matter cause silent failures — operators don't see the warning they should see:

`suppress(Exception)` and `except X: pass` swallowing security-relevant signals (F-001 task-sdk, F-008 google, F-019 airflow-core).
Exception strings carrying internal identifiers (project IDs, service-account emails, IAM permission names) into user-visible error responses (F-010 google).
Unguarded `finally` blocks where instrumentation can mask the original failure (F-005, F-006 task-sdk; F-013 google).
Silent log-history truncation on transient read failures (F-012 google).
Misconfigurations that should be loud but aren't (F-005 airflow-core `jwt_audience` section mismatch; F-008 google missing `remote_log_conn_id`).
Input-validation hardening on known footgun shapes, none currently exploitable but worth closing: `LIKE` wildcards (F-007/8 airflow-core), GCS path traversal (F-005/6 google), CR/LF in log args (F-018 airflow-core), query-string secret redaction (F-015 airflow-core).
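
Two of the silent-failure shapes above, an unguarded `finally` block and a blanket suppress, can be sketched as follows. This is an illustrative pattern rather than the code from any of the PRs; `run_with_guarded_cleanup` and `load_optional` are hypothetical helper names.

```python
import logging

log = logging.getLogger("audit_sketch")


def run_with_guarded_cleanup(work, cleanup):
    """Run work(), then cleanup(), without letting cleanup mask work()'s failure."""
    try:
        return work()
    finally:
        try:
            cleanup()
        except Exception:
            # Log instead of raising: an exception raised in a finally block
            # would replace any exception already propagating from work().
            log.warning("cleanup failed", exc_info=True)


def load_optional(loader, default=None):
    """Fall back to a default, but make the failure visible to operators."""
    try:
        return loader()
    except Exception:
        # contextlib.suppress(Exception) would hide this failure entirely.
        log.warning("optional load failed; using default", exc_info=True)
        return default
```

In both helpers the behaviour seen by the caller is unchanged on the happy path; the difference is that a failure now leaves a warning in the logs instead of disappearing.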

Trajectory

Two things changed between waves:

Volume collapsed. L3 task-sdk went from 122 findings → 20. L3 airflow-core went from 200 → 23. Adding AGENTS.md to the scanner's context did most of the work: multi-team, DAG file processor (DFP) and triggerer DB access, Jinja sandbox, dependency-CVE policy, and deployment-manager responsibilities all stopped re-surfacing under fresh IDs.
Actionability rate jumped. In the previous wave, ~5% of findings led to a fix (20 of ~364). In this wave, ~30% did (19 of 63), with providers/google at 40% (8 of 20), where the audit output was cleanest. The remaining ~70% splits between false positives, by-design, deprecated modules, and doc improvements.
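
The rates above can be reproduced from the totals rows, assuming "actionable" means fixed-finding count divided by findings (the reading under which providers/google comes out at 40%). The names `WAVES` and `rate` are illustrative, and the previous-wave findings count is itself approximate.

```python
# Counts copied from the two "Wave totals" rows of the impact tables.
WAVES = {
    "previous": {"findings": 364, "fixed": 20},
    "this": {"findings": 63, "fixed": 19},
}


def rate(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


fix_rates = {name: rate(w["fixed"], w["findings"]) for name, w in WAVES.items()}
google_fix_rate = rate(8, 20)  # providers/google: 8 fixed findings out of 20

print(fix_rates)        # {'previous': 5.5, 'this': 30.2}
print(google_fix_rate)  # 40.0
```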

Bottom line

34 PRs across two waves (19 + 15), 0 CVEs, 0 currently-exploitable bugs. The previous wave caught the user-observable reliability bugs; this wave caught the silent-failure observability gaps and a few defense-in-depth shapes. The audit is now finding what's left after the obvious bugs have been cleaned up, which is a healthy signal about the codebase's maturity at L3.

potiuk/airflow-security-audit-summary.md
