This summary covers two waves of Tooling Agents ASVS-L3 audits against apache/airflow, triaged in May 2026. It compares the impact of fixes, false-positive counts and shape, and the overall trajectory.
Audits triaged in this session (all at the same 0920c77 / b1aec75 commit pair, with AGENTS.md finally in scanner context):
- airflow-core L3 — tooling-agents#23
- task-sdk L3 — tooling-agents#24
- providers/google L3 — tooling-agents#34
Earlier rounds (without AGENTS.md in context) are referenced as "previous wave" — those PRs were already merged before this session started.
This wave (3 audits, single commit pair):
| Audit | Findings | Real PRs | Fixed-finding count | False positives | By-design / deprecated / N/A | Doc / marginal |
|---|---|---|---|---|---|---|
| airflow-core L3 0920c77 | 23 | 6 | 7 | 4 | 3 | 9 |
| task-sdk L3 0920c77 | 20 | 4 | 4 | 3 | 8 | 5 |
| providers/google L3 b1aec75 | 20 | 5 | 8 | 0 | 4 (deprecated module) | 8 |
| Wave totals | 63 | 15 | 19 | 7 | 15 | 22 |
Previous wave (multiple iterations as the auditor calibrated):
| Run | Findings | Real PRs | Fixed-finding count | False positives | By-design / dropped | Doc / marginal |
|---|---|---|---|---|---|---|
| airflow-core L1 (original) | 16 | 1 | 1 | 2 (factually wrong: TLS premise; revocation claim) | 12 | 1 |
| airflow-core L3 5872e6c (no AGENTS.md context) | 200 | 5 | 6 | ~5 acknowledged after re-triage | ~125 | ~64 |
| airflow-core L3 41a6436 | 10 | 7 | 7 | 3 | 0 | 0 |
| task-sdk L1 (original) | 16 | 1 | 1 | ~1 (TLS premise) | 13 | 1 |
| task-sdk L3 95bbf6a (no AGENTS.md context) | 122 | 5 | 5 | ~few | ~95 | ~17 |
| Wave totals (approx) | ~364 | 19 | 20 | ~11 | ~245 | ~85 |
Counts for the L3 runs without AGENTS.md context are approximate because many "false positive" candidates collapsed into the documented by-design buckets rather than being individually refuted.
Exploitable findings: zero in either wave. Re-reading both waves end-to-end:
| Wave | Headline severity | Reality |
|---|---|---|
| Previous, task-sdk old-F-005 (httpx.ConnectError) | "supervisor IPC permanently broken" | Reliability bug, not a security boundary |
| Previous, task-sdk old-F-006 (failed task → SUCCESS) | "downstream pipeline corruption" | Correctness bug, not a security boundary |
| Previous, task-sdk old-F-007 (stuck RUNNING) | "stuck task forever" | Reliability bug |
| Previous, task-sdk old-F-017 (secrets-backend fallthrough) | "deny-then-fallback to env backend" | Defense-in-depth; DAG authors already trusted in security model |
| Previous, airflow-core old-F-003 (random.choices) | "weak password gen" | Real crypto correctness, but SimpleAuthManager is dev-only |
| Previous, airflow-core old-F-008 (JSONDecode fail-open) | "fail-open on _collect_teams_to_check" | Unreachable today: Pydantic body validation 422s before authz dependency runs |
| This wave, airflow-core F-002 (Medium, bulk OVERWRITE) | "bulk authz bypass" | Only matters in experimental [core] multi_team, which security_model.rst documents as not providing task-level isolation |
| This wave, providers/google F-001 (Medium, google_openid) | "email-claim takeover" | Module is @deprecated(planned_removal_release="apache-airflow-providers-google==15.0.0"); auth backends don't even run on Airflow 3 |
Both waves: zero exploitable findings; zero CVE-worthy.
| | Previous wave | This wave |
|---|---|---|
| Strict false positives (factually-wrong claims) | ~11 across all runs | 7 |
| As % of findings (excluding noisy L3 without AGENTS.md context) | ~14% on the smaller runs | 11% |
| Refutation cost | One line from security_model.rst or a config schema | Cross-module grep to find the upstream merge_contextvars / is_safe_url / BaseAuthManager call |
| Auditor self-hedging | Rare; confident misclassifications | Frequent: "may perform this check separately", "cannot be verified from provided code", "speculative: no cookies actually set" |
The category mix shifted. Previous-wave false positives were "wrong frame" — treating admin config as untrusted, treating dev-only modules as production, treating documented limitations as new findings. This-wave false positives are "missed the upstream guard" — auditor looked at the right module but didn't follow the delegation chain to where the safety check lives (shared logging module, server-side route, alternate auth-manager method).
| # | Finding | What the audit saw | What it missed |
|---|---|---|---|
| airflow-core F-001 | "JWT revocation never checked during validation" | JWTValidator.avalidated_claims(): no revocation check | BaseAuthManager.get_user_from_token() checks at a different layer; the auditor's own description hedges with "resolve_user_from_token may perform this check separately" |
| airflow-core F-009 | "__Host- cookie prefix unverifiable" | The audited file's imports and constants only | The cookie attribute is incompatible with Airflow's configurable get_cookie_path(); the auditor explicitly says "cannot be verified from provided code", i.e. a coverage gap surfaced as a finding |
| airflow-core F-010 | "Login next parameter lacks server-side allowlist validation" | UI-side getNextHref construction | Server-side is_safe_url(next, request=request) exists at core_api/routes/public/auth.py:45 and 400s on unsafe input |
| airflow-core F-011 | "Connection login field returned unredacted" | ConnectionResponse model emits login verbatim | login is documented as a username across the provider ecosystem; masking would break expected API behaviour |
| task-sdk F-002 | "No auto-injected task_id/dag_id identity in logs" | SDK-level structlog processor chain | structlog.contextvars.merge_contextvars is in the shared module at shared/logging/.../structlog.py:351; identity bound at task start propagates automatically |
| task-sdk F-003 | "No UTC enforcement in timestamp configuration" | SDK-level configure_logging doesn't pass utc=True | Shared module uses MaybeTimeStamper(fmt="iso"); structlog's TimeStamper defaults to utc=True and ISO 8601 carries Z |
| task-sdk F-007 | "No HTTPS scheme enforcement before bearer token transmission" | Client.__init__ accepts any base_url | Same admin-config rationale as the previous wave: plaintext HTTP behind a TLS-terminating ingress is a documented supported deployment per security_model.rst |
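The airflow-core F-010 row turns on whether a server-side check of the `next` parameter exists. As a rough illustration of that kind of guard, here is a minimal stdlib-only sketch. It is not Airflow's `is_safe_url` implementation; the function name `is_same_origin_redirect` and the example hostnames are hypothetical, and the only rule assumed is "accept a redirect target only if it resolves to the same scheme and host as the request".

```python
from urllib.parse import urljoin, urlparse


def is_same_origin_redirect(target: str, request_base_url: str) -> bool:
    """Hypothetical helper: True only if `target` stays on the request's origin."""
    # Reject empty values and characters commonly used to confuse URL parsers.
    if not target or any(ch in target for ch in "\\\r\n\t"):
        return False
    base = urlparse(request_base_url)
    # Resolve relative and protocol-relative targets against the request URL.
    resolved = urlparse(urljoin(request_base_url, target))
    if resolved.scheme not in ("http", "https"):
        return False
    return (resolved.scheme, resolved.netloc) == (base.scheme, base.netloc)


base = "https://airflow.example.com/"  # illustrative host, not from the audit
print(is_same_origin_redirect("/dags", base))
print(is_same_origin_redirect("//evil.example.net/x", base))
```

A route using a check like this would reject the request (the table above says the real route answers with a 400) rather than silently rewriting the target.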
providers/google produced zero outright false positives — all 20 findings were either real, deprecated-module, doc-improvement, or marginal-hygiene.
Previous wave: reliability + correctness. The bugs that mattered would have caused user-observable failures: broken IPC channels, wrong terminal task states, stuck RUNNING tasks, downstream pipelines corrupted by a falsely-SUCCESS-marked upstream.
This wave: observability + defense-in-depth. The bugs that matter cause silent failures — operators don't see the warning they should see:
- `suppress(Exception)` and `except X: pass` swallowing security-relevant signals (F-001 task-sdk, F-008 google, F-019 airflow-core).
- Exception strings carrying internal identifiers (project IDs, SA emails, IAM permission names) into user-visible error responses (F-010 google).
- Unguarded `finally` blocks where instrumentation can mask the original failure (F-005, F-006 task-sdk; F-013 google).
- Silent log-history truncation on transient read failures (F-012 google).
- Misconfigurations that should be loud but aren't (F-005 airflow-core `jwt_audience` section mismatch; F-008 google missing `remote_log_conn_id`).
- Input-validation hardening on known footgun shapes, none currently exploitable but worth closing: LIKE wildcards (F-007/8 airflow-core), GCS path traversal (F-005/6 google), CR/LF in log args (F-018 airflow-core), query-string secret redaction (F-015 airflow-core).
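Two of the shapes listed above, silent suppression and an unguarded `finally`, can be sketched generically. The code below is illustrative only and is not taken from any of the audited modules; all function names (`load_optional_setting`, `run_with_metrics`, and so on) are hypothetical.

```python
import logging
from contextlib import suppress

log = logging.getLogger("audit_sketch")  # hypothetical logger name


def load_optional_setting(loader):
    """Silent shape: every failure disappears and the operator sees nothing."""
    with suppress(Exception):
        return loader()
    return None


def load_optional_setting_loud(loader):
    """Same fallback, but the failure is logged with its traceback."""
    try:
        return loader()
    except Exception:
        log.warning("optional setting failed to load; using fallback", exc_info=True)
        return None


def run_with_metrics(work, emit_metric):
    """Guarded finally: a failing metric call cannot replace the original error."""
    try:
        return work()
    finally:
        try:
            emit_metric()
        except Exception:
            log.warning("metric emission failed", exc_info=True)
```

With the guard in place, an exception raised by `work()` still reaches the caller even if `emit_metric()` also fails; without it, the instrumentation error would propagate in its place and the original failure would survive only as chained context.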
Two things changed between waves:
- Volume collapsed. L3 task-sdk went from 122 findings → 20. L3 airflow-core went from 200 → 23. Adding `AGENTS.md` to the scanner's context did most of the work: multi-team, DFP/triggerer DB access, Jinja sandbox, dependency-CVE policy, and deployment-manager responsibilities all stopped re-surfacing under fresh IDs.
- Actionability rate jumped. In the previous wave, ~5% of findings led to a fix. In this wave, ~30% did, measured by fixed-finding count (providers/google, the cleanest-shaped audit, reached 40%). The remaining ~70% splits between false positives, by-design, deprecated modules, and doc improvements.
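The percentages quoted here can be reproduced from the wave-total rows of the two summary tables. The sketch below assumes the rate is the fixed-finding count divided by total findings (the text does not spell out the denominator, so that reading is an assumption), and it treats the previous-wave total of ~364 as exact for the arithmetic.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Counts copied from the "Wave totals" rows and the providers/google row.
previous_fixed, previous_findings = 20, 364  # findings total is approximate
this_fixed, this_findings = 19, 63
google_fixed, google_findings = 8, 20
this_false_positives = 7

previous_rate = pct(previous_fixed, previous_findings)
this_rate = pct(this_fixed, this_findings)
google_rate = pct(google_fixed, google_findings)
false_positive_rate = pct(this_false_positives, this_findings)

print(previous_rate, this_rate, google_rate, false_positive_rate)
```

Counting merged PRs instead of fixed findings gives a lower figure for this wave (15 of 63), which is why the choice of numerator matters when comparing the two waves.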
~35 PRs across two waves, 0 CVEs, 0 currently-exploitable bugs. The first wave caught the user-observable reliability bugs; this wave caught the silent-failure observability gaps and a few defense-in-depth shapes. The audit is now finding what's left after the obvious bugs have been cleaned up, which is a healthy signal about the codebase's maturity at L3.