Fuses: CC2 (Incident & Resilience) × CC3 (Trustworthy Pipeline)
Companion doc: cross-cutting-hackathons.md §10.1
Recommended slot: the first flagship blended event to run (per the sequencing in §12).
| Attribute | Value |
|---|---|
| Codename | Ship It or Sink It |
| Format | Brownfield break-fix + pipeline hardening |
| Duration | 1 day (≈7 working hours) |
| Team size | 2–4 |
| Difficulty | ●●●○○ |
| Native platform | New brownfield harness (primary) + Gittery checkers (pipeline half) |
| Pass bar | Total ≥ 70 and the surprise failure drill is survived |
A service has been thrown over the wall to your team in a bad state. It is buckling under load and nobody can trust how it gets to production: the CI is decorative, commits are unsigned, and there's a vulnerable dependency lurking in the tree. You have one day to make it survive and make it shippable, in that order. At the end of the day we will try to break it again, live, and you'll have to recover using the runbook you wrote.
The fiction matters: frame it as a real handover from a team that has left. Participants are the new on-call.
| Treasury topic | Original category | Cross-cut |
|---|---|---|
| Diagnose injected latency / bad config / broken migration / high CPU | Mixed | CC2 |
| 20-minute failure drill | Recovery | CC2 |
| Backup & recovery plan — automatic and manual | Recovery | CC2 |
| Load / stress tests with realistic scenarios | Benchmarking | CC2 |
| Migration rollback instructions | Databases | CC2 |
| Harden CI workflow + evidence-producing checks + readiness record | CI | CC3 |
| Triage simulated scan report | CI | CC3 |
| SBOM / dependency scanning | CI | CC3 |
| Signed commits | Git | CC3 |
| Build cache / incremental builds | Caching | CC3 |
By the end, participants should be able to:
- Diagnose a degrading system under time pressure using a hypothesis log rather than guesswork.
- Write and execute a recovery runbook with both automatic and manual fallbacks.
- Prove a fix holds with a realistic load/stress test, not an assertion.
- Harden a CI pipeline so it produces evidence — a repository-readiness record a stranger could trust.
- Triage security findings correctly (critical/high/medium/false-positive) and read an SBOM.
- Enforce provenance so an unsigned or unscanned change cannot reach main.
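
To make the "hypothesis log rather than guesswork" objective concrete, here is a minimal sketch of what such a log could look like as a data structure. The class and field names (`HypothesisLog`, `claim`, `test`, `evidence`) are illustrative assumptions, not a template from the companion doc; the one rule it encodes is that a hypothesis cannot be closed without recorded evidence.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"
    RULED_OUT = "ruled out"
    CONFIRMED = "confirmed"


@dataclass
class Hypothesis:
    claim: str                    # what the team suspects
    test: str                     # how the team plans to check it
    status: Status = Status.OPEN
    evidence: str = ""            # what was actually observed


@dataclass
class HypothesisLog:
    entries: list = field(default_factory=list)

    def add(self, claim: str, test: str) -> Hypothesis:
        entry = Hypothesis(claim, test)
        self.entries.append(entry)
        return entry

    def resolve(self, entry: Hypothesis, status: Status, evidence: str) -> None:
        # Closing a hypothesis without evidence is exactly the guesswork
        # the log is meant to prevent.
        if not evidence:
            raise ValueError("a resolution needs recorded evidence")
        entry.status = status
        entry.evidence = evidence

    def ruled_out(self) -> list:
        return [e for e in self.entries if e.status is Status.RULED_OUT]

    def to_markdown(self) -> str:
        rows = ["| Hypothesis | Test | Status | Evidence |", "|---|---|---|---|"]
        rows += [
            f"| {e.claim} | {e.test} | {e.status.value} | {e.evidence} |"
            for e in self.entries
        ]
        return "\n".join(rows)
```

A team would call `add` during Recon, `resolve` as each check comes back, and paste `to_markdown()` into their readout.
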
A small but realistic service — e.g. a TypeScript or Python REST API backed by a relational DB — in a repo that contains, by design:
- A weak CI workflow (runs tests, produces no evidence, no gating).
- Unsigned commits permitted on the default branch.
- A vulnerable transitive dependency and no SBOM.
- No build cache (every run rebuilds from cold).
- A fault-injection control panel (the harness) that the facilitators drive; it is not discoverable from the service's obvious code paths.
- A seeded but thin test suite that passes green at start (so the breakage is environmental, not test-visible).
Provide: repo access, the running service URL, a metrics/logs endpoint, and a one-page "handover note" written in-character that is deliberately incomplete.
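
One of the planted weaknesses above is that unsigned commits are permitted on the default branch. As a rough sketch of how a CI step might surface that, the snippet below parses the output of `git log --format="%H %G?"`, where git's `%G?` placeholder prints a one-letter signature status (`G` for a good signature, `N` for none, `B` for bad, and so on). The revision range `origin/main..HEAD` and the choice to trust only `G` are assumptions for illustration; branch-protection policy on the hosting platform would still be needed to actually reject a push.

```python
import subprocess

# Signature statuses from git's %G? placeholder that this sketch accepts.
# Assumption: only "G" (good, valid signature) counts as signed.
TRUSTED = {"G"}


def parse_signature_log(output: str) -> list:
    """Return the hashes of commits whose signature status is not trusted."""
    untrusted = []
    for line in output.splitlines():
        if not line.strip():
            continue
        commit, _, status = line.partition(" ")
        if status.strip() not in TRUSTED:
            untrusted.append(commit)
    return untrusted


def audit(rev_range: str = "origin/main..HEAD") -> list:
    """Run git and list untrusted commits in rev_range (hypothetical default)."""
    out = subprocess.run(
        ["git", "log", "--format=%H %G?", rev_range],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_signature_log(out)
```

A CI job could fail the build whenever `audit()` returns a non-empty list.
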
| Time | Phase | What happens |
|---|---|---|
| 0:00–0:30 | Briefing | Premise, rules, rubric walkthrough, harness orientation. |
| 0:30–1:00 | Recon | Teams read the code, metrics and handover note. No fixing yet — produce a hypothesis log. |
| 1:00–3:00 | Act I — Survive (CC2) | A fault is injected at 1:00. Diagnose → fix → write the recovery runbook → prove with a load test. |
| 3:00–3:45 | Lunch / buffer | Harness stays up; no scoring. |
| 3:45–6:00 | Act II — Trust (CC3) | Teams may not "ship" until the hardened CI produces an evidence record, the SBOM is generated and the vulnerable dependency is triaged, signed commits are enforced, and the build is incremental. |
| 6:00–6:30 | The Drill | A new, unseen fault is injected. Teams have 20 minutes to restore service using their runbook. |
| 6:30–7:00 | Readout | Each team presents their readiness record + runbook; facilitators score live. |
- A passing service with the Act I fault resolved.
- A recovery runbook (one page): detection signal → diagnosis → containment → recovery (automatic and manual fallback) → verification.
- A load/stress test + its result, demonstrating the fix holds under realistic traffic.
- A hardened CI workflow that emits a repository-readiness record (the evidence artifact).
- An SBOM + a short scan-triage table classifying each finding and naming the action taken.
- Signed-commit enforcement proven (an unsigned commit is rejected).
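
The repository-readiness record is the central evidence artifact in the list above, so a small sketch may help show what "evidence-producing" and "gated" could mean in practice. The field names (`commit`, `checks`, `ready`) and the check names in the usage note are assumptions; the real template is whatever the facilitators share with the rubric.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Check:
    name: str       # e.g. "sbom-generated" (hypothetical check name)
    passed: bool
    evidence: str   # pointer to the artifact or log that proves the result


def build_record(commit: str, checks: list) -> dict:
    """Assemble a readiness record; it is ready only if every check passed."""
    return {
        "commit": commit,
        "checks": [asdict(c) for c in checks],
        # An empty list of checks must never count as "ready".
        "ready": bool(checks) and all(c.passed for c in checks),
    }


def gate(record: dict) -> int:
    """Exit code for the CI job: 0 lets the change through, 1 blocks it."""
    return 0 if record["ready"] else 1


def to_json(record: dict) -> str:
    """Serialise deterministically so records can be diffed between runs."""
    return json.dumps(record, indent=2, sort_keys=True)
```

A pipeline step might build the record from checks such as tests, SBOM generation, scan triage and signature audit, write `to_json(record)` as a build artifact, and exit with `gate(record)`.
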
| Dimension | Points | What earns full marks |
|---|---|---|
| Diagnosis speed & method (Act I) | 20 | Fault correctly identified; hypothesis log shows what was ruled out and why. |
| Recovery runbook quality | 15 | Both automatic and manual paths; a stranger could execute it cold. |
| Load test proves the fix | 10 | Realistic scenario; fix demonstrably holds; honest methodology. |
| CI produces the evidence record | 20 | Readiness record is complete, gated, and regenerated on every run. |
| SBOM + scan triage | 15 | SBOM generated; vulnerable dep found; findings correctly classified incl. the false positive. |
| Signed-commit enforcement | 10 | Unsigned commit rejected by policy + CI. |
| The Drill | 10 | Service restored within the 20-minute window by following the team's own runbook. |
Pass = total ≥ 70 AND the Drill is survived (service restored within the window). Surviving the Drill is a gate, not just points — a team can score well on paper and still fail if their runbook was fiction.
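
The pass rule can be written down as a small function, which makes the gate explicit for a live-scoring sheet. The dimension names and maximum points are taken from the rubric table; the function name and the input validation are illustrative additions.

```python
# Maximum points per rubric dimension, as listed in the rubric table.
RUBRIC = {
    "Diagnosis speed & method (Act I)": 20,
    "Recovery runbook quality": 15,
    "Load test proves the fix": 10,
    "CI produces the evidence record": 20,
    "SBOM + scan triage": 15,
    "Signed-commit enforcement": 10,
    "The Drill": 10,
}
PASS_TOTAL = 70


def result(scores: dict, drill_survived: bool) -> tuple:
    """Return (total, passed). The Drill is a gate as well as a score."""
    for dimension, points in scores.items():
        if dimension not in RUBRIC:
            raise KeyError(f"unknown rubric dimension: {dimension}")
        if not 0 <= points <= RUBRIC[dimension]:
            raise ValueError(f"{dimension}: {points} is out of range")
    total = sum(scores.values())
    return total, total >= PASS_TOTAL and drill_survived
```

For example, a team with full marks everywhere except the Drill totals 90 and still fails, because `drill_survived` is false.
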
Pick one for Act I and a different one for the Drill, ideally from a fault class the team did not meet in Act I:
| Fault | Symptom | Honest diagnosis path |
|---|---|---|
| Injected latency on a downstream call | p95 climbs, throughput craters | trace → find the slow dependency → add timeout + cache/circuit-break |
| Pinned CPU (hot loop / N+1) | CPU at 100%, requests queue | profile → find the hot path → fix the query/loop |
| Corrupted config value | intermittent 5xx on one route | diff config → spot the bad value → restore from known-good |
| Half-applied migration | schema mismatch errors | inspect migration state → roll back → re-apply cleanly |
The half-applied migration is the natural CC2↔CC5 bridge if you want to foreshadow "Mid-Flight Engine Swap."
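
The selection rule for the fault bank (one fault for Act I, a different one for the Drill) is simple enough to sketch in code. The short keys (`latency`, `cpu`, `config`, `migration`) are hypothetical identifiers for the four rows of the table; the harness may name its toggles differently.

```python
import random
from typing import Optional

# Hypothetical short keys mapped to the fault names in the table.
FAULT_BANK = {
    "latency": "Injected latency on a downstream call",
    "cpu": "Pinned CPU (hot loop / N+1)",
    "config": "Corrupted config value",
    "migration": "Half-applied migration",
}


def pick_drill_fault(act_one: str, rng: Optional[random.Random] = None) -> str:
    """Pick a Drill fault from the bank that differs from the Act I fault."""
    if act_one not in FAULT_BANK:
        raise KeyError(f"unknown fault: {act_one}")
    rng = rng or random.Random()
    # Sort so that a seeded rng gives the same pick on every machine.
    candidates = sorted(key for key in FAULT_BANK if key != act_one)
    return rng.choice(candidates)
```

Passing a seeded `random.Random` lets facilitators fix the Drill fault ahead of time and reproduce the choice later.
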
- Brownfield service repo with the six planted weaknesses (§4).
- Fault-injection control panel that can toggle each fault in the bank without code changes.
- A pinned, known-vulnerable transitive dependency + a chosen SBOM format.
- A simulated scan report containing one true-positive-critical and at least one false-positive.
- CI scaffolding the teams can harden (the "before" state) + a reference "after" for grading.
- (Optional reuse) Gittery checkers for the pipeline deliverables, so Act II auto-grades.
- A load-test target profile (realistic request mix) so results are comparable across teams.
- A reset button per team (restore repo + service to a clean checkpoint).
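
For the load-test target profile in the list above, one way to keep results comparable across teams is to fix both the request mix and the random seed, and to agree on how the headline latency figure is computed. The sketch below assumes a made-up mix of three routes and uses the nearest-rank definition of a percentile; neither comes from the brief.

```python
import math
import random

# Hypothetical request mix: route -> relative share of traffic.
REQUEST_MIX = {
    "GET /items": 70,
    "GET /items/{id}": 20,
    "POST /items": 10,
}


def build_schedule(n: int, seed: int) -> list:
    """Draw n requests from the mix; the same seed gives the same schedule."""
    rng = random.Random(seed)
    routes = list(REQUEST_MIX)
    weights = [REQUEST_MIX[route] for route in routes]
    return rng.choices(routes, weights=weights, k=n)


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile of the samples, e.g. pct=95 for p95."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

Each team would replay `build_schedule(n, seed)` against its service, collect per-request latencies, and report `percentile(latencies, 95)` before and after the fix.
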
- Stuck on diagnosis > 30 min: release Hint 1 (which subsystem the metric points at), then Hint 2 (the fault class) at 45 min. Never reveal the fix.
- One team hard-blocked: offer a one-time reset to a clean checkpoint; they keep any pipeline work already committed.
- Act II tooling sprawl: pin the exact CI system, scanner and SBOM tool in the brief so grading stays uniform.
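
The laddered hints can be encoded so that every facilitator releases them at the same moments. This is only a sketch of the timing rule stated above (Hint 1 after more than 30 minutes stuck, Hint 2 at 45 minutes); the function name and the idea of tracking "minutes stuck" per team are assumptions.

```python
HINT_1 = "Hint 1: which subsystem the metric points at"
HINT_2 = "Hint 2: the fault class"


def hints_due(minutes_stuck: float) -> list:
    """Return the hints a team is entitled to. The fix itself is never a hint."""
    due = []
    if minutes_stuck > 30:
        due.append(HINT_1)
    if minutes_stuck >= 45:
        due.append(HINT_2)
    return due
```
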
| Risk | Mitigation |
|---|---|
| Time pressure rewards luck over method | Rubric rewards the hypothesis log, not just the fix. |
| A single blocker stalls a team all day | Reset checkpoints + laddered hints. |
| "Pipeline plumbing" feels joyless | Anchor it in the threat: the forged commit, the compromised dependency. |
| Drill becomes chaos | The Drill injects exactly one fault, drawn from the fault bank, and teams respond from the runbook they have already written. |
- Half-day cut: drop the load test and the build-cache requirement; keep diagnosis + runbook + evidence + drill.
- Hard mode: inject the Drill fault during Act II so resilience and trust work overlap.
- Async pre-work: run the Gittery CC3 drills (CI hardening, signed commits, scan triage) as a warm-up the week before, so Act II goes faster and deeper.
- Harness healthy; all faults toggle cleanly; reset verified.
- Each team has repo access, service URL, metrics endpoint, handover note.
- Rubric + readiness-record template shared.
- Drill fault chosen (different class from Act I).
- Graders briefed; live-scoring sheet ready for the readout.