@decagondev
Created June 3, 2026 14:06

Module 2 Challenger — Take-Home

Uncertainty, Calibration & Knowing the Limits

In Module 1 you designed a layered system. This time you interrogate a single layer's honesty — because the whole architecture only works if its confidence estimates can be trusted. The grade is in the reasoning, not in finding a "perfect" system.

Time: ~45–60 min · Due: before the Module 3 session · Submit: post your write-up in the cohort channel.


The assignment

Part A — Audit a system for honest uncertainty (core)

Pick one model or automated decision system that produces some kind of score or confidence and acts on it — from work, from a case study, or one you invent in enough detail to reason about.

⚠️ The email spam filter audited below is off-limits as your submission — it's the worked demo. Pick something else.

Write a ~300–400 word audit answering four things:

  1. Calibration vs accuracy. What does "accuracy" mean for this system, and separately, is its confidence calibrated — when it says "90% sure," is it right ~90% of the time? How would you check?
  2. Raw score vs honest confidence. Where does its confidence number come from — a raw model output, or something calibrated against real outcomes? If raw, why might it be overconfident?
  3. The confidently-wrong zone. Where's the auto-act threshold, and which errors land above it — shipping unreviewed? Describe the confidently-wrong rate as the metric that matters more than accuracy.
  4. The edge of competence. What does an out-of-distribution input look like here, does anything detect it, and what should happen (widen uncertainty and escalate, or extrapolate confidently)?

Part B — Four short reflections (a paragraph each)

  1. You're told a system is "99% accurate." What's the single follow-up question you'd ask before trusting it to auto-act — and why that one?
  2. Describe a real setting where you'd rather deploy a less-accurate-but-calibrated model than a more-accurate-but-overconfident one. What about the setting drives that?
  3. A model returns 99% confidence on an input unlike anything it has seen. How much should you trust it, and what should the system do?
  4. Pick a decision with lopsided error costs. Where would you set the confidence threshold — and why not just 0.9?

Optional — Peer response (this is our discussion, async)

Reply to one classmate's audit. Find the single place their system would be confidently wrong — wrong and above the auto-act bar, shipping unreviewed — and say how you'd make that error surface as low confidence instead.


Before you start: the five ideas you're applying

Everything below is just these, applied to one system. Keep them in front of you:

  1. Calibration ≠ accuracy. Accuracy = how often it's right. Calibration = whether its confidence is honest. Always ask: "right about what, and is the confidence honest?"
  2. A raw score is an opinion, not a truth. The model's 0.95 is its opinion of itself. Honest confidence is the audited version — calibrated against real outcomes, per context.
  3. The confidently-wrong rate beats accuracy. The dangerous error is being wrong while sure, because high confidence is exactly what ships a case past review.
  4. Confidence has fine print. It's only valid inside the world it was measured on. Outside that world (OOD), a confident number is a trap.
  5. Err toward the cheaper, recoverable mistake. When errors are lopsided, the threshold should be too — set it by cost, not by a round number like 0.9.
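
Idea 5 can be made concrete with a break-even calculation. The sketch below is illustrative only: the function name `break_even_threshold` and the cost figures are assumptions, not part of the module, and the result is only meaningful if the confidence compared against it is already calibrated.

```python
def break_even_threshold(cost_wrong_act: float, cost_missed_act: float) -> float:
    """Confidence above which auto-acting has lower expected cost than holding back.

    cost_wrong_act:  cost of acting when the model is wrong (hypothetical units)
    cost_missed_act: cost of holding back when the model was right
    """
    if cost_wrong_act <= 0 or cost_missed_act <= 0:
        raise ValueError("costs must be positive")
    # Act when (1 - p) * cost_wrong_act < p * cost_missed_act,
    # which rearranges to p > cost_wrong_act / (cost_wrong_act + cost_missed_act).
    return cost_wrong_act / (cost_wrong_act + cost_missed_act)


# Made-up costs: a wrong auto-action is 50x worse than a missed one.
print(round(break_even_threshold(50, 1), 3))  # 0.98
print(break_even_threshold(1, 1))             # 0.5
```

Under these assumptions, 0.9 is the right bar only when a wrong action costs exactly nine times a missed one. Any other cost ratio puts the threshold somewhere else.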

A worked starter — auditing a different system

Let's run the four audit moves on something that is not your assignment: a consumer email spam filter that auto-moves messages to the Spam folder above a confidence threshold. Watch the moves, then make them on your own system.

1 · Calibration vs accuracy. Accuracy here is how often the filter labels spam-vs-not correctly overall. Calibration is a different question: of all the messages it scored "97% spam," were ~97% actually spam? How I'd check: bucket the high-scoring messages by stated spam score (90–95%, 95–99%, 99%+), then compare each bucket against ground truth (user "not spam" rescues and reported missed spam). If messages scored around 97% turn out to be spam only 85% of the time, the filter can still be accurate overall, but it is overconfident: a liar about its own confidence.
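
A minimal sketch of that bucket check, assuming you already have each message's stated score and a ground-truth label. The function name, the data, and the bucket edges are illustrative assumptions, not taken from any real filter.

```python
def calibration_table(scores, is_spam, buckets):
    """Compare stated spam scores with observed spam rates, bucket by bucket.

    scores:  stated spam probabilities in [0, 1]
    is_spam: ground-truth labels (True/False), same order as scores
    buckets: list of (low, high) ranges; low is inclusive, high is exclusive
    """
    rows = []
    for low, high in buckets:
        in_bucket = [(s, y) for s, y in zip(scores, is_spam) if low <= s < high]
        if not in_bucket:
            continue  # no messages scored in this range
        n = len(in_bucket)
        stated = sum(s for s, _ in in_bucket) / n
        observed = sum(1 for _, y in in_bucket if y) / n
        rows.append({"bucket": (low, high), "n": n,
                     "stated": stated, "observed": observed})
    return rows


# Toy data: 20 messages scored 0.97, of which only 17 were really spam.
scores = [0.97] * 20
is_spam = [True] * 17 + [False] * 3
buckets = [(0.90, 0.95), (0.95, 0.99), (0.99, 1.01)]  # 1.01 so a score of 1.0 is included

for row in calibration_table(scores, is_spam, buckets):
    print(row["bucket"], row["n"], round(row["stated"], 2), round(row["observed"], 2))
# (0.95, 0.99) 20 0.97 0.85
```

A gap between the `stated` and `observed` columns is the overconfidence. Small buckets make that gap noisy, so the count `n` is worth reporting alongside it.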

2 · Raw score vs honest confidence. The filter's output probability is a raw score: its opinion of itself, straight from the classifier. It becomes honest confidence only after a calibration step maps those scores to observed correctness, ideally per context (sender domain, language, message type), because a filter calibrated on English marketing mail is not calibrated on a foreign-language invoice. Raw scores can also drift overconfident after a model update, so the calibration should be re-checked every time the model changes.
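
One simple way to build such a mapping is histogram binning: replace each raw score with the hit rate observed for similar scores on held-out, labelled data. This is a sketch of that one technique (others, such as Platt scaling or isotonic regression, exist), and it is not a claim about how any real filter calibrates. The function name, context names, and numbers are all hypothetical.

```python
def fit_binned_calibrator(raw_scores, outcomes, n_bins=10):
    """Return a function mapping a raw score to the observed hit rate of its bin.

    raw_scores: raw model outputs in [0, 1] from held-out, labelled messages
    outcomes:   1 if the message really was spam, else 0
    """
    totals = [0] * n_bins
    hits = [0] * n_bins

    def bin_index(score):
        return min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin

    for score, outcome in zip(raw_scores, outcomes):
        i = bin_index(score)
        totals[i] += 1
        hits[i] += outcome

    def calibrate(score):
        i = bin_index(score)
        if totals[i] == 0:
            return None  # no evidence for this score range: do not pretend to know
        return hits[i] / totals[i]

    return calibrate


# One calibrator per context, each fitted on that context's own held-out data (toy numbers).
held_out = {
    "english_marketing": ([0.97] * 20, [1] * 19 + [0] * 1),
    "foreign_invoice":   ([0.97] * 20, [1] * 14 + [0] * 6),
}
calibrators = {ctx: fit_binned_calibrator(s, y) for ctx, (s, y) in held_out.items()}

print(calibrators["english_marketing"](0.97))  # 0.95
print(calibrators["foreign_invoice"](0.97))    # 0.7
print(calibrators["foreign_invoice"](0.30))    # None
```

The same raw 0.97 becomes a different honest confidence in each context, which is the point of calibrating per context rather than once overall.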

3 · The confidently-wrong zone. Say the auto-move threshold is 95%. Errors below it stay in the inbox — annoying, but the user sees and corrects them. The dangerous errors are above 95%: a legitimate, important email (a job offer, an invoice) scored 98% spam and silently filed away unseen. That's confidently wrong — shipped past the only reviewer, the user. The metric that matters isn't overall accuracy; it's the confidently-wrong rate: of everything auto-filed, what fraction was actually legit? That number predicts the real harm.
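
To see how the two metrics can diverge, here is a small sketch on made-up data. The function names and the numbers are illustrative assumptions; the 0.95 threshold is the one used in this example.

```python
def confidently_wrong_rate(scores, is_spam, threshold=0.95):
    """Of the messages auto-filed as spam, return the fraction that were legitimate."""
    auto_filed = [y for s, y in zip(scores, is_spam) if s >= threshold]
    if not auto_filed:
        return None  # nothing was auto-filed, so the rate is undefined
    return sum(1 for y in auto_filed if not y) / len(auto_filed)


def accuracy(scores, is_spam, threshold=0.95):
    """Fraction of messages where 'score >= threshold' matches the true label."""
    correct = sum(1 for s, y in zip(scores, is_spam) if (s >= threshold) == bool(y))
    return correct / len(scores)


# Toy data: 100 messages. 10 are auto-filed, and 2 of those are legitimate.
scores = [0.98] * 10 + [0.10] * 90
is_spam = [True] * 8 + [False] * 2 + [False] * 90

print(accuracy(scores, is_spam))                # 0.98
print(confidently_wrong_rate(scores, is_spam))  # 0.2
```

In this toy case the filter is 98% accurate, yet one in five of the messages it hides from the user is legitimate. The first number looks reassuring; the second describes the harm.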

4 · The edge of competence (OOD). A brand-new phishing style, unlike anything in training, arrives looking like ordinary mail. The filter confidently scores it 3% spam and delivers it — high confidence, zero basis. Does anything flag "this is unlike what I was trained on"? If not, that's the gap. The right response to a novel input isn't a confident verdict — it's to widen uncertainty: quarantine it, or route it to a review queue, rather than deliver on a confidence the model has no right to.

Notice what just happened: I separated accuracy from calibration and said how I'd check, traced confidence back to a raw score, found the confidently-wrong zone above the auto-act bar, and named a real OOD failure with the right response. That's the whole of Part A. Your job is to make those four moves on your system.


The shape you're auditing

This is what an honest-uncertainty system looks like — the spam filter as a flowchart. (Most real systems are missing one of the amber boxes. Part of your audit is spotting which.)

flowchart TD
    IN([Incoming email])
    M["Spam model<br/>raw score = its opinion of itself"]
    CAL["Calibration<br/>raw score to honest confidence<br/>checked vs user corrections"]
    OOD["OOD / novelty check<br/>a style it was trained on?"]
    GATE["Decision gate<br/>honest confidence + novelty + consequence"]
    AUTO([Auto-move to Spam<br/>confident and familiar])
    REVIEW([Quarantine / review<br/>uncertain, novel, or consequential])

    IN --> M --> CAL --> GATE
    IN --> OOD
    OOD -. unfamiliar .-> GATE
    GATE -- confident and familiar --> AUTO
    GATE -- uncertain or novel --> REVIEW

    style CAL stroke:#F5A623,stroke-width:2px
    style OOD stroke:#C97E12,stroke-width:1.5px
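
The decision gate in the flowchart can be sketched as a small function. This is a hedged illustration, not a reference implementation: the name `decision_gate`, the three inputs, and the default threshold are assumptions, and the novelty check itself (the hard part) is treated as a flag supplied by something upstream.

```python
def decision_gate(confidence, is_novel, is_consequential, threshold=0.95):
    """Return "auto" only when the verdict is confident, familiar, and low-stakes.

    confidence:       calibrated confidence in the model's verdict, in [0, 1]
    is_novel:         True if a novelty / OOD check flagged the input
    is_consequential: True if a wrong action would be costly (hypothetical flag)
    """
    if is_novel:
        return "review"  # outside the world the confidence was measured on
    if is_consequential:
        return "review"  # too costly to act on unreviewed
    if confidence < threshold:
        return "review"  # not confident enough to skip review
    return "auto"


print(decision_gate(0.98, is_novel=False, is_consequential=False))  # auto
print(decision_gate(0.98, is_novel=True, is_consequential=False))   # review
print(decision_gate(0.80, is_novel=False, is_consequential=False))  # review
```

Note the ordering: the novelty flag is checked before the confidence number, so a high score on an unfamiliar input can never reach the auto-act path.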

Your turn — a template to fill in

Copy this into your doc and replace each blank. Keep it tight — most of the marks are in points 1 and 3.

SYSTEM (one line): __________________________________________

1. CALIBRATION vs ACCURACY
   "Accuracy" here means: _____________________________
   Is its confidence calibrated? How I'd check: ________
   (e.g. bucket cases by stated confidence, compare to observed correctness)

2. RAW SCORE vs HONEST CONFIDENCE
   The confidence number comes from: __________________
   Raw model output, or calibrated vs real outcomes? __
   If raw, why it might be overconfident: _____________

3. THE CONFIDENTLY-WRONG ZONE
   Auto-act threshold: ________________________________
   Errors that land ABOVE it (ship unreviewed): _______
   Confidently-wrong rate, in words: __________________

4. THE EDGE OF COMPETENCE (OOD)
   What an out-of-distribution input looks like here: _
   Does anything detect it? ___________________________
   What SHOULD happen (widen + escalate?): ____________

If you want a diagram, copy the Mermaid block above and relabel it for your system — it renders as a flowchart in your Gist.

Part B — starter prompts (answer in your own words; don't just restate these):

  1. Ask yourself what "99% accurate" leaves out. Two systems can share that number and be wildly different to trust — what's the property that differs, and where would you look for it?
  2. Picture a setting that auto-acts on confidence. The question becomes: do confident errors get reviewed, or shipped? Let that decide which model you'd rather have.
  3. Ask one thing: was that 99% ever measured on inputs like this one? If the answer is no, the number isn't confidence; it's a guess dressed up as confidence.
  4. Name your two errors. Which one is recoverable, and which is catastrophic? Let the gap between them — not a tidy 0.9 — set the bar.

Common traps (self-check before you submit)

  • Auditing accuracy and calling it done. If you never checked calibration, you skipped the module.
  • Trusting the raw score. A softmax / model probability is the model's opinion of itself, not an audited truth.
  • Reporting average accuracy instead of the confidently-wrong rate — the errors that ship unreviewed are the ones that matter.
  • Assuming confidence means something on unfamiliar inputs. Calibration is only valid in-distribution.
  • Round-number thresholds. 0.9 because it looks tidy is not a justification. Cost asymmetry is.

What a strong submission shows

  • Reasons from calibration and the confidently-wrong rate, not average accuracy.
  • Separates raw score from honest confidence, and says how it would check calibration.
  • Identifies a real out-of-distribution condition and the right response (widen + escalate).
  • Justifies any threshold by cost asymmetry, not a round number.

Good luck — bring the audit you'd defend to someone whose job depends on the system being honest, not the one that's easiest to write.
