Uncertainty, Calibration & Knowing the Limits
In Module 1 you designed a layered system. This time you interrogate a single layer's honesty — because the whole architecture only works if its confidence estimates can be trusted. The grade is in the reasoning, not in finding a "perfect" system.
Time: ~45–60 min · Due: before the Module 3 session · Submit: post your write-up in the cohort channel.
Pick one model or automated decision system that produces some kind of score or confidence and acts on it — from work, from a case study, or one you invent in enough detail to reason about.
⚠️ The email spam filter audited below is off-limits as your submission — it's the worked demo. Pick something else.
Part A: write a ~300–400 word audit answering four things:
- Calibration vs accuracy. What does "accuracy" mean for this system, and separately, is its confidence calibrated — when it says "90% sure," is it right ~90% of the time? How would you check?
- Raw score vs honest confidence. Where does its confidence number come from — a raw model output, or something calibrated against real outcomes? If raw, why might it be overconfident?
- The confidently-wrong zone. Where's the auto-act threshold, and which errors land above it — shipping unreviewed? Describe the confidently-wrong rate as the metric that matters more than accuracy.
- The edge of competence. What does an out-of-distribution input look like here, does anything detect it, and what should happen (widen uncertainty and escalate, or extrapolate confidently)?
Part B questions:
- You're told a system is "99% accurate." What's the single follow-up question you'd ask before trusting it to auto-act, and why that one?
- Describe a real setting where you'd rather deploy a less-accurate-but-calibrated model than a more-accurate-but-overconfident one. What about the setting drives that?
- A model returns 99% confidence on an input unlike anything it has seen. How much should you trust it, and what should the system do?
- Pick a decision with lopsided error costs. Where would you set the confidence threshold — and why not just 0.9?
Reply to one classmate's audit. Find the single place their system would be confidently wrong — wrong and above the auto-act bar, shipping unreviewed — and say how you'd make that error surface as low confidence instead.
Everything below is just these five ideas, applied to one system. Keep them in front of you:
- Calibration ≠ accuracy. Accuracy = how often it's right. Calibration = whether its confidence is honest. Always ask: "right about what, and is the confidence honest?"
- A raw score is an opinion, not a truth. The model's 0.95 is its opinion of itself. Honest confidence is the audited version: calibrated against real outcomes, per context.
- The confidently-wrong rate beats accuracy. The dangerous error is being wrong while sure, because high confidence is exactly what ships a case past review.
- Confidence has fine print. It's only valid inside the world it was measured on. Outside that world (OOD), a confident number is a trap.
- Err toward the cheaper, recoverable mistake. When errors are lopsided, the threshold should be too — set it by cost, not by a round number like 0.9.
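The last idea can be turned into arithmetic. Here is a minimal sketch, assuming the confidence is already calibrated and that you can put rough relative costs on the two errors. The function name and the 50:1 cost ratio are hypothetical, chosen only for illustration. Auto-acting has lower expected cost when (1 − p) × cost_wrong_action < p × cost_missed_action, which rearranges to p > cost_wrong_action / (cost_wrong_action + cost_missed_action).

```python
def cost_based_threshold(cost_wrong_action: float, cost_missed_action: float) -> float:
    """Calibrated probability above which auto-acting has the lower expected cost.

    cost_wrong_action:  cost of acting when the case was actually negative
    cost_missed_action: cost of not acting when the case was actually positive
    Both are illustrative inputs; only their ratio matters.
    """
    if cost_wrong_action <= 0 or cost_missed_action <= 0:
        raise ValueError("costs must be positive")
    return cost_wrong_action / (cost_wrong_action + cost_missed_action)


# Hypothetical: wrongly hiding a legitimate email is 50x worse than
# leaving one spam message in the inbox.
threshold = cost_based_threshold(50.0, 1.0)
print(round(threshold, 3))  # 0.98
```

With equal costs the same formula gives 0.5, so a bar like 0.9 is only right if one error happens to be about nine times worse than the other. The sketch ignores the cost of review itself, which a real system would also need to weigh.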
Let's run the four audit moves on something that is not your assignment: a consumer email spam filter that auto-moves messages to the Spam folder above a confidence threshold. Watch the moves, then make them on your own system.
1 · Calibration vs accuracy. Accuracy here is how often the filter labels spam-vs-not correctly overall. Calibration is a different question: of all the messages it scored "97% spam," were ~97% actually spam? How I'd check: bucket every message by its stated spam score (90–95%, 95–99%, 99%+), then compare each bucket against ground truth — user "not spam" rescues and reported-missed-spam. If the 97% bucket is right only 85% of the time, the filter is accurate but overconfident: a liar about its own confidence.
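The bucket check described here might look something like the sketch below. It assumes you have logged each message's stated spam score alongside a ground-truth label. The function name, the bucket edges, and the sample data are all made up for illustration.

```python
from collections import defaultdict


def reliability_table(scores, labels, edges=(0.90, 0.95, 0.99)):
    """Bucket predictions by stated score and compare with observed outcomes.

    scores: stated spam probability per message
    labels: True if the message really was spam (ground truth)
    Returns {bucket_lower_edge: (count, mean_stated_score, observed_spam_rate)}.
    """
    buckets = defaultdict(list)
    for score, is_spam in zip(scores, labels):
        # Highest edge the score reaches; scores below every edge are skipped.
        lower = max((e for e in edges if score >= e), default=None)
        if lower is not None:
            buckets[lower].append((score, is_spam))

    table = {}
    for lower, items in sorted(buckets.items()):
        n = len(items)
        mean_stated = sum(s for s, _ in items) / n
        observed = sum(1 for _, y in items if y) / n
        table[lower] = (n, mean_stated, observed)
    return table


# Made-up data: 20 messages scored 0.97, of which 17 were really spam.
scores = [0.97] * 20
labels = [True] * 17 + [False] * 3
for lower, (n, stated, observed) in reliability_table(scores, labels).items():
    print(f"bucket >= {lower}: n={n}, stated={stated:.2f}, observed={observed:.2f}")
```

A gap between the stated and observed columns is the overconfidence. Buckets with few messages deserve less trust, so the count matters as much as the rate.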
2 · Raw score vs honest confidence. The filter's output probability is a raw score: its opinion of itself, straight from the classifier. It becomes honest confidence only after a calibration step maps those scores to observed correctness, ideally per context (sender domain, language, message type), because a filter calibrated on English marketing mail is not calibrated on a foreign-language invoice. Raw scores can also turn overconfident after a model update, so the calibration has to be re-checked whenever the model changes.
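One simple way to build that calibration step is histogram binning: on held-out data, record the observed spam rate in each score bin, then report that rate instead of the raw score. This is a sketch of that one technique, not a claim about how any real filter does it. Platt scaling and isotonic regression are common alternatives. The function names and data are illustrative.

```python
def fit_histogram_binning(scores, labels, n_bins=10):
    """Learn the observed positive rate in each equal-width score bin."""
    hits = [0] * n_bins
    totals = [0] * n_bins
    for score, is_spam in zip(scores, labels):
        b = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        totals[b] += 1
        hits[b] += 1 if is_spam else 0
    # Bins with no held-out data get None: no evidence, so no calibrated value.
    return [hits[b] / totals[b] if totals[b] else None for b in range(n_bins)]


def calibrate(raw_score, bin_rates):
    """Map a raw score to the observed rate of its bin (None if the bin was empty)."""
    n_bins = len(bin_rates)
    b = min(int(raw_score * n_bins), n_bins - 1)
    return bin_rates[b]


# Made-up held-out set: 20 messages scored 0.97, of which 17 were spam.
bin_rates = fit_histogram_binning([0.97] * 20, [True] * 17 + [False] * 3)
print(calibrate(0.96, bin_rates))  # observed rate for the 0.9-1.0 bin
print(calibrate(0.50, bin_rates))  # None: nothing was measured in this bin
```

To calibrate per context, you could fit one table per sender domain or language. The None result is deliberate: it marks a score range where nothing was measured, rather than passing the raw number through.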
3 · The confidently-wrong zone. Say the auto-move threshold is 95%. Errors below it stay in the inbox — annoying, but the user sees and corrects them. The dangerous errors are above 95%: a legitimate, important email (a job offer, an invoice) scored 98% spam and silently filed away unseen. That's confidently wrong — shipped past the only reviewer, the user. The metric that matters isn't overall accuracy; it's the confidently-wrong rate: of everything auto-filed, what fraction was actually legit? That number predicts the real harm.
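As a rough sketch, the confidently-wrong rate is a few lines once you have scores and ground truth. The function name and the six sample messages are made up for illustration.

```python
def confidently_wrong_rate(scores, labels, threshold=0.95):
    """Of everything auto-filed (score >= threshold), the fraction that was legitimate.

    scores: stated spam probability per message
    labels: True if the message really was spam
    Returns None when nothing crossed the threshold.
    """
    auto_filed = [is_spam for s, is_spam in zip(scores, labels) if s >= threshold]
    if not auto_filed:
        return None
    return sum(1 for is_spam in auto_filed if not is_spam) / len(auto_filed)


# Made-up sample: four messages cross the 0.95 bar, and one of them was legitimate.
scores = [0.99, 0.98, 0.97, 0.96, 0.80, 0.40]
labels = [True, False, True, True, False, False]
print(confidently_wrong_rate(scores, labels))  # 0.25
```

In this toy sample the filter gets 5 of 6 messages right overall, yet a quarter of what it silently filed away was legitimate. The denominator is only the auto-acted cases, which is why this metric can look bad when accuracy looks fine.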
4 · The edge of competence (OOD). A brand-new phishing style, unlike anything in training, arrives looking like ordinary mail. The filter confidently scores it 3% spam and delivers it — high confidence, zero basis. Does anything flag "this is unlike what I was trained on"? If not, that's the gap. The right response to a novel input isn't a confident verdict — it's to widen uncertainty: quarantine it, or route it to a review queue, rather than deliver on a confidence the model has no right to.
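A gate that widens uncertainty on novel input could be sketched as below. The novelty signal here (share of tokens never seen in training) is a deliberately crude, hypothetical stand-in for a real OOD detector. The function names, vocabulary, and both thresholds are assumptions for illustration.

```python
def novelty(tokens, known_vocab):
    """Fraction of tokens never seen in training (a crude, hypothetical OOD signal)."""
    if not tokens:
        return 1.0  # nothing recognizable: treat as maximally unfamiliar
    unseen = sum(1 for t in tokens if t not in known_vocab)
    return unseen / len(tokens)


def route(confidence_spam, novelty_score, act_threshold=0.95, novelty_limit=0.5):
    """Decide what to do with a message given calibrated confidence and novelty."""
    if novelty_score > novelty_limit:
        return "review"      # unfamiliar input: the confidence has no basis here
    if confidence_spam >= act_threshold:
        return "auto_spam"   # confident and familiar
    if confidence_spam <= 1 - act_threshold:
        return "deliver"     # confidently not spam, and familiar
    return "review"          # uncertain


# Hypothetical novel phishing message that the model scores as 3% spam.
known_vocab = {"invoice", "meeting", "offer", "free", "prize"}
tokens = ["zxq", "wallet", "verify", "invoice"]
print(route(0.03, novelty(tokens, known_vocab)))  # review
```

The novelty check runs before the confidence check on purpose: a confident verdict on an unfamiliar input is exactly the case the gate exists to catch.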
Notice what just happened: I separated accuracy from calibration and said how I'd check, traced confidence back to a raw score, found the confidently-wrong zone above the auto-act bar, and named a real OOD failure with the right response. That's the whole of Part A. Your job is to make those four moves on your system.
This is what an honest-uncertainty system looks like — the spam filter as a flowchart. (Most real systems are missing one of the amber boxes. Part of your audit is spotting which.)
flowchart TD
IN([Incoming email])
M["Spam model<br/>raw score = its opinion of itself"]
CAL["Calibration<br/>raw score to honest confidence<br/>checked vs user corrections"]
OOD["OOD / novelty check<br/>a style it was trained on?"]
GATE["Decision gate<br/>honest confidence + novelty + consequence"]
AUTO([Auto-move to Spam<br/>confident and familiar])
REVIEW([Quarantine / review<br/>uncertain, novel, or consequential])
IN --> M --> CAL --> GATE
IN --> OOD
OOD -. unfamiliar .-> GATE
GATE -- confident and familiar --> AUTO
GATE -- uncertain or novel --> REVIEW
style CAL stroke:#F5A623,stroke-width:2px
style OOD stroke:#C97E12,stroke-width:1.5px
Copy this into your doc and replace each blank. Keep it tight — most of the marks are in points 1 and 3.
SYSTEM (one line): __________________________________________
1. CALIBRATION vs ACCURACY
"Accuracy" here means: _____________________________
Is its confidence calibrated? How I'd check: ________
(e.g. bucket cases by stated confidence, compare to observed correctness)
2. RAW SCORE vs HONEST CONFIDENCE
The confidence number comes from: __________________
Raw model output, or calibrated vs real outcomes? __
If raw, why it might be overconfident: _____________
3. THE CONFIDENTLY-WRONG ZONE
Auto-act threshold: ________________________________
Errors that land ABOVE it (ship unreviewed): _______
Confidently-wrong rate, in words: __________________
4. THE EDGE OF COMPETENCE (OOD)
What an out-of-distribution input looks like here: _
Does anything detect it? ___________________________
What SHOULD happen (widen + escalate?): ____________
If you want a diagram, copy the Mermaid block above and relabel it for your system — it renders as a flowchart in your Gist.
Part B — starter prompts (answer in your own words; don't just restate these):
- Ask yourself what "99% accurate" leaves out. Two systems can share that number and be wildly different to trust — what's the property that differs, and where would you look for it?
- Picture a setting that auto-acts on confidence. The question becomes: do confident errors get reviewed, or shipped? Let that decide which model you'd rather have.
- Ask one thing: was that 99% ever measured on inputs like this one? If the answer is no, the number isn't confidence — it's a guess wearing a confidence's clothes.
- Name your two errors. Which one is recoverable, and which is catastrophic? Let the gap between them — not a tidy 0.9 — set the bar.
Common pitfalls:
- Auditing accuracy and calling it done. If you never checked calibration, you skipped the module.
- Trusting the raw score. A softmax / model probability is the model's opinion of itself, not an audited truth.
- Reporting average accuracy instead of the confidently-wrong rate — the errors that ship unreviewed are the ones that matter.
- Assuming confidence means something on unfamiliar inputs. Calibration is only valid in-distribution.
- Round-number thresholds. 0.9 because it looks tidy is not a justification. Cost asymmetry is.
What a strong audit does:
- Reasons from calibration and the confidently-wrong rate, not average accuracy.
- Separates raw score from honest confidence, and says how it would check calibration.
- Identifies a real out-of-distribution condition and the right response (widen + escalate).
- Any threshold is justified by cost asymmetry, not a round number.
Good luck — bring the audit you'd defend to someone whose job depends on the system being honest, not the one that's easiest to write.