@decagondev
Created June 24, 2026 18:11
Hackathon Design Document — "Ship It or Sink It"

Fuses: CC2 (Incident & Resilience) × CC3 (Trustworthy Pipeline)
Companion doc: cross-cutting-hackathons.md §10.1
Recommended slot: the first flagship blended event to run (per the sequencing in §12).

| Attribute | Value |
| --- | --- |
| Codename | Ship It or Sink It |
| Format | Brownfield break-fix + pipeline hardening |
| Duration | 1 day (≈7 working hours) |
| Team size | 2–4 |
| Difficulty | ●●●○○ |
| Native platform | New brownfield harness (primary) + Gittery checkers (pipeline half) |
| Pass bar | Total ≥ 70 and the surprise failure drill is survived |

1. Premise

A service has been thrown over the wall to your team in a bad state. It is buckling under load, and nobody can trust how it gets to production: the CI is decorative, commits are unsigned, and there's a vulnerable dependency lurking in the tree. You have one day to make it survive and make it shippable, in that order. At the end of the day we will try to break it again, live, and you'll have to recover using the runbook you wrote.

The fiction matters: frame it as a real handover from a team that has left. Participants are the new on-call.


2. Treasury topic coverage

| Treasury topic | Original category | Cross-cut |
| --- | --- | --- |
| Diagnose injected latency / bad config / broken migration / high CPU | Mixed | CC2 |
| 20-minute failure drill | Recovery | CC2 |
| Backup & recovery plan (automatic and manual) | Recovery | CC2 |
| Load / stress tests with realistic scenarios | Benchmarking | CC2 |
| Migration rollback instructions | Databases | CC2 |
| Harden CI workflow + evidence-producing checks + readiness record | CI | CC3 |
| Triage simulated scan report | CI | CC3 |
| SBOM / dependency scanning | CI | CC3 |
| Signed commits | Git | CC3 |
| Build cache / incremental builds | Caching | CC3 |

3. Learning objectives

By the end, participants should be able to:

  1. Diagnose a degrading system under time pressure using a hypothesis log rather than guesswork.
  2. Write and execute a recovery runbook with both automatic and manual fallbacks.
  3. Prove a fix holds with a realistic load/stress test, not an assertion.
  4. Harden a CI pipeline so it produces evidence — a repository-readiness record a stranger could trust.
  5. Triage security findings correctly (critical/high/medium/false-positive) and read an SBOM.
  6. Enforce provenance so an unsigned or unscanned change cannot reach main.
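
Objective 3 asks for proof rather than assertion. A minimal sketch of how a load-test result could be turned into a verdict is shown below. The 300 ms p95 budget, the 1% error ceiling, and the function names are assumptions for illustration, not values from the brief; the nearest-rank percentile is one of several common definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def fix_holds(latencies_ms, error_count, p95_budget_ms=300.0, max_error_rate=0.01):
    """Summarise one load-test run.

    latencies_ms: latencies of the successful requests
    error_count:  number of failed requests in the same run
    The two thresholds are illustrative assumptions.
    """
    total = len(latencies_ms) + error_count
    if total == 0:
        raise ValueError("empty run")
    # A run with no successful requests gets an infinite p95, so it fails.
    p95 = percentile(latencies_ms, 95) if latencies_ms else float("inf")
    error_rate = error_count / total
    return {
        "p95_ms": p95,
        "error_rate": error_rate,
        "holds": p95 <= p95_budget_ms and error_rate <= max_error_rate,
    }
```

Reporting the raw p95 and error rate alongside the verdict keeps the methodology honest: a grader can disagree with the thresholds without having to rerun the test.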

4. What participants are handed

A small but realistic service — e.g. a TypeScript or Python REST API backed by a relational DB — in a repo that contains, by design:

  • A weak CI workflow (runs tests, produces no evidence, no gating).
  • Unsigned commits permitted on the default branch.
  • A vulnerable transitive dependency and no SBOM.
  • No build cache (every run rebuilds from cold).
  • A fault-injection control panel (the harness) that the facilitators drive; it is not visible in the obvious code path.
  • A seeded but thin test suite that passes green at start (so the breakage is environmental, not test-visible).

Provide: repo access, the running service URL, a metrics/logs endpoint, and a one-page "handover note" written in-character that is deliberately incomplete.
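
One of the planted weaknesses, unsigned commits on the default branch, can be made visible with git's `%G?` format placeholder, which prints a one-letter signature status per commit (`G` for a good signature, `N` for none, `B` for bad, and so on). The sketch below only parses that output; the function name and the choice to accept `G` alone are assumptions, and a real policy might also accept other statuses.

```python
def unsigned_commits(log_output, accepted=frozenset({"G"})):
    """Return the commit hashes whose signature status is not accepted.

    Expects the text produced by:  git log --format='%H %G?' <range>
    Each line is '<hash> <status>'.
    """
    offenders = []
    for line in log_output.splitlines():
        line = line.strip()
        if not line:
            continue
        commit, _, status = line.partition(" ")
        if status not in accepted:
            offenders.append(commit)
    return offenders
```

Keeping the parser separate from the `git` invocation means the check can be exercised with canned output before it is wired into a CI step.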


5. Run of show

| Time | Phase | What happens |
| --- | --- | --- |
| 0:00–0:30 | Briefing | Premise, rules, rubric walkthrough, harness orientation. |
| 0:30–1:00 | Recon | Teams read the code, metrics and handover note. No fixing yet; produce a hypothesis log. |
| 1:00–3:00 | Act I: Survive (CC2) | A fault is injected at 1:00. Diagnose → fix → write the recovery runbook → prove with a load test. |
| 3:00–3:45 | Lunch / buffer | Harness stays up; no scoring. |
| 3:45–6:00 | Act II: Trust (CC3) | You may not "ship" until: hardened CI produces an evidence record, SBOM is generated and the vulnerable dep is triaged, signed commits are enforced, build is incremental. |
| 6:00–6:30 | The Drill | A new, unseen fault is injected. Teams have 20 minutes to restore service using their runbook. |
| 6:30–7:00 | Readout | Each team presents their readiness record + runbook; facilitators score live. |

6. Deliverables

  1. A passing service with the Act I fault resolved.
  2. A recovery runbook (one page): detection signal → diagnosis → containment → recovery (automatic and manual fallback) → verification.
  3. A load/stress test + its result, demonstrating the fix holds under realistic traffic.
  4. A hardened CI workflow that emits a repository-readiness record (the evidence artifact).
  5. An SBOM + a short scan-triage table classifying each finding and naming the action taken.
  6. Signed-commit enforcement proven (an unsigned commit is rejected).
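
As a rough illustration of deliverable 4, the readiness record could be a small JSON document that lists each check with a pointer to its evidence. The field names, check names and file paths below are hypothetical; the brief's own readiness-record template takes precedence.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Check:
    name: str      # hypothetical check name, e.g. "sbom_generated"
    passed: bool
    evidence: str  # path or URL of the artifact that backs the claim


def build_readiness_record(commit, checks):
    """Assemble the record; 'ready' is true only if every check passed."""
    return {
        "commit": commit,
        "checks": [asdict(c) for c in checks],
        # An empty check list counts as not ready, so a workflow that
        # silently skips its checks cannot produce a green record.
        "ready": bool(checks) and all(c.passed for c in checks),
    }


def to_json(record):
    """Serialise the record deterministically so runs can be diffed."""
    return json.dumps(record, indent=2, sort_keys=True)
```

A CI step could write `to_json(record)` to an artifact on every run and fail the job when `ready` is false, which is one way to make the record both regenerated and gating.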

7. Scoring rubric (100 points)

| Dimension | Points | What earns full marks |
| --- | --- | --- |
| Diagnosis speed & method (Act I) | 20 | Fault correctly identified; hypothesis log shows what was ruled out and why. |
| Recovery runbook quality | 15 | Both automatic and manual paths; a stranger could execute it cold. |
| Load test proves the fix | 10 | Realistic scenario; fix demonstrably holds; honest methodology. |
| CI produces the evidence record | 20 | Readiness record is complete, gated, and regenerated on every run. |
| SBOM + scan triage | 15 | SBOM generated; vulnerable dep found; findings correctly classified incl. the false positive. |
| Signed-commit enforcement | 10 | Unsigned commit rejected by policy + CI. |
| The Drill | 10 | Service restored within the 20-minute window by following the team's own runbook. |

Pass = total ≥ 70 AND the Drill is survived (service restored within the window). Surviving the Drill is a gate, not just points — a team can score well on paper and still fail if their runbook was fiction.
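
The pass rule can be written down as a small function, which may help when building the live-scoring sheet. The point caps and the threshold come from the rubric above; the dictionary keys are shorthand invented for this sketch.

```python
# Maximum points per rubric dimension (keys are illustrative shorthand).
RUBRIC_MAX = {
    "diagnosis": 20,
    "runbook": 15,
    "load_test": 10,
    "ci_evidence": 20,
    "sbom_triage": 15,
    "signed_commits": 10,
    "drill": 10,
}
PASS_THRESHOLD = 70


def evaluate(scores, drill_survived):
    """Return (total, passed). The Drill is a gate: a team that does not
    survive it fails regardless of its total."""
    if set(scores) != set(RUBRIC_MAX):
        raise ValueError("scores must cover exactly the rubric dimensions")
    for dimension, points in scores.items():
        if not 0 <= points <= RUBRIC_MAX[dimension]:
            raise ValueError(f"{dimension}: {points} is out of range")
    total = sum(scores.values())
    return total, total >= PASS_THRESHOLD and drill_survived
```

For example, a team with full marks everywhere except the Drill totals 90 but still fails, because `drill_survived` is false.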


8. Fault bank (Act I + the Drill)

Pick one for Act I and a different one for the Drill, ideally a class the team didn't see:

| Fault | Symptom | Honest diagnosis path |
| --- | --- | --- |
| Injected latency on a downstream call | p95 climbs, throughput craters | trace → find the slow dependency → add timeout + cache/circuit-break |
| Pinned CPU (hot loop / N+1) | CPU at 100%, requests queue | profile → find the hot path → fix the query/loop |
| Corrupted config value | intermittent 5xx on one route | diff config → spot the bad value → restore from known-good |
| Half-applied migration | schema mismatch errors | inspect migration state → roll back → re-apply cleanly |

The half-applied migration is the natural CC2↔CC5 bridge if you want to foreshadow "Mid-Flight Engine Swap."
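
For the injected-latency fault, the "timeout + circuit-break" end state might look roughly like the sketch below. It is a generic, simplified circuit breaker written for illustration, not the reference solution; the class name, thresholds and injectable clock are assumptions. The wrapped function is expected to enforce its own timeout and raise when it is exceeded.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock          # injectable so tests need not sleep
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("downstream call skipped: circuit is open")
            # Cooling-off period elapsed: let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call re-opens the circuit immediately.
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the circuit is open is what stops a slow dependency from tying up every request; the caller still has to decide what to serve instead, for example a cached value.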


9. Facilitator build checklist

  • Brownfield service repo with the six planted weaknesses (§4).
  • Fault-injection control panel that can toggle each fault in the bank without code changes.
  • A pinned, known-vulnerable transitive dependency + a chosen SBOM format.
  • A simulated scan report containing one true-positive-critical and at least one false-positive.
  • CI scaffolding the teams can harden (the "before" state) + a reference "after" for grading.
  • (Optional reuse) Gittery checkers for the pipeline deliverables, so Act II auto-grades.
  • A load-test target profile (realistic request mix) so results are comparable across teams.
  • A reset button per team (restore repo + service to a clean checkpoint).
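
If the scan-triage deliverable is graded against a reference, the comparison could be as simple as the sketch below. The labels follow the classification named in the learning objectives; the finding IDs and the function name are hypothetical.

```python
LABELS = {"critical", "high", "medium", "false-positive"}


def grade_triage(reference, submitted):
    """Compare a team's triage table with the facilitators' reference.

    Both arguments map a finding ID to one of LABELS.
    """
    bad = sorted({label for label in reference.values() if label not in LABELS})
    if bad:
        raise ValueError(f"reference uses unknown labels: {bad}")
    correct = sorted(f for f, label in reference.items() if submitted.get(f) == label)
    missing = sorted(f for f in reference if f not in submitted)
    wrong = sorted(
        f for f in reference if f in submitted and submitted[f] != reference[f]
    )
    return {
        "correct": correct,
        "wrong": wrong,
        "missing": missing,
        "all_correct": len(correct) == len(reference),
    }
```

Separating wrong from missing findings lets graders distinguish a team that misjudged the false positive from one that never looked at it.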

10. Hint laddering & safety nets

  • Stuck on diagnosis > 30 min: release Hint 1 (which subsystem the metric points at), then Hint 2 (the fault class) at 45 min. Never reveal the fix.
  • One team hard-blocked: offer a one-time reset to a clean checkpoint; they keep any pipeline work already committed.
  • Act II tooling sprawl: pin the exact CI system, scanner and SBOM tool in the brief so grading stays uniform.
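
The hint ladder above is easy to encode so that every facilitator releases hints at the same moment. In this sketch the thresholds come from the first bullet; treating them as "at or after" the stated minute is an assumption, as are the names.

```python
# (minutes stuck, hint) pairs in release order.
HINT_LADDER = [
    (30, "Hint 1: which subsystem the metric points at"),
    (45, "Hint 2: the fault class"),
]


def released_hints(minutes_stuck):
    """Return every hint a team is entitled to after being stuck this long.
    The fix itself is never part of the ladder."""
    return [hint for threshold, hint in HINT_LADDER if minutes_stuck >= threshold]
```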

11. Run-time risks & mitigations

| Risk | Mitigation |
| --- | --- |
| Time pressure rewards luck over method | Rubric rewards the hypothesis log, not just the fix. |
| A single blocker stalls a team all day | Reset checkpoints + laddered hints. |
| "Pipeline plumbing" feels joyless | Anchor it in the threat: the forged commit, the compromised dependency. |
| Drill becomes chaos | The Drill only injects a fault whose class the team has already practised mitigating. |

12. Variants & stretch

  • Half-day cut: drop the load test and the build-cache requirement; keep diagnosis + runbook + evidence + drill.
  • Hard mode: inject the Drill fault during Act II so resilience and trust work overlap.
  • Async pre-work: run the Gittery CC3 drills (CI hardening, signed commits, scan triage) as a warm-up the week before, so Act II goes faster and deeper.

13. Day-of pre-flight checklist

  • Harness healthy; all faults toggle cleanly; reset verified.
  • Each team has repo access, service URL, metrics endpoint, handover note.
  • Rubric + readiness-record template shared.
  • Drill fault chosen (different class from Act I).
  • Graders briefed; live-scoring sheet ready for the readout.