Feedback, Evaluation & Honest Metrics
You've designed a system, audited its honesty, and audited its provenance. This time you audit how it evaluates itself — and whether it's quietly fooling itself. The grade is in the reasoning, not in finding a flawless system.
Time: ~45–60 min · Due: before the Module 5 session · Submit: post your write-up in the cohort channel.
Pick one system you know that has a feedback loop — a recommender, a fraud or spam model, a support-ticket router, a forecasting system.
⚠️ The content-moderation system audited below is off-limits as your submission — it's the worked demo. Pick something else.
Part A. In a ~300–400 word audit, answer five things:
- The loop. What outcomes become feedback, and which cases generate it naturally (the ones a human actually looks at)?
- The blind spot. Which cases generate no feedback (confident, auto-handled)? Where would confident drift hide, meaning the cases the system gets confidently wrong?
- The sampling scheme. Design a random sampled review that would see the blind spot. Be specific about random — and say why re-checking the lowest-confidence auto cases would not work.
- Feedback quality. Is your feedback fast or slow, clean or noisy, biased or unbiased? What does that imply about trusting the dashboard — and is "we don't know yet" the honest answer anywhere?
- The metric. What number is (or would be) optimized? What's the cheapest way to game it — and what gaming-resistant metric would you replace it with?
Part B. Then answer these four questions:
- Someone rewards your team for "automation rate." How do you hit the target without doing what they actually want, and what metric would you replace it with?
- A change improves offline accuracy, but you haven't measured the blind spot and your early feedback is thin and biased. Accept, reject, or "not yet" — and what would you demand first?
- When is "we don't know yet" the honest answer about a system's performance — and why is that sometimes safer than a confident dashboard?
- A golden set is preserved ground truth (Module 3). When does trusting it become its own map-as-territory error — and what would you do about it?
Reply to one classmate's audit and poke a hole in their evaluation. Find where their sampling scheme isn't actually random (so it reimports the system's bias), or where their chosen metric could still be gamed — and propose the fix.
Everything below is just these five ideas, applied to one system. Keep them in front of you:
- The feedback a system generates is biased by its own decisions. It learns from what it sees, but it chooses what it sees.
- The blind spot. Confident, auto-handled cases generate no feedback — and the confidently-wrong errors live there by construction.
- Random is load-bearing. Only a random sample of the blind spot is unbiased. Let the system pick which cases to re-check and you've reimported its bias.
- Goodhart's law. When a measure becomes a target, it stops being a good measure. Always ask: what's the cheapest way to move this number — and is that what I want?
- Know your feedback quality. Fast or slow, clean or noisy, biased or unbiased. "We don't know yet" beats a confident dashboard built on thin, biased signal.
Let's run the audit on something that is not your assignment: an automated content-moderation system. It scores each post; above a "remove" threshold it auto-removes, below an "approve" threshold it auto-approves, and the uncertain middle band goes to human moderators. Users can appeal removals and report approved posts. Watch the moves, then make them on your own system.
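The routing rule just described can be sketched in a few lines. This is a minimal illustration, not the demo system's actual code: the names `Thresholds` and `route` and the cut-off values 0.2 and 0.8 are hypothetical, and the score is assumed to be a harm probability in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    approve_below: float = 0.2  # hypothetical "approve" threshold
    remove_above: float = 0.8   # hypothetical "remove" threshold


def route(score: float, t: Thresholds = Thresholds()) -> str:
    """Return the path a post takes, given its harm score in [0, 1]."""
    if score > t.remove_above:
        return "auto_remove"
    if score < t.approve_below:
        return "auto_approve"
    return "human_review"  # the uncertain middle band


for s in (0.05, 0.5, 0.95):
    print(s, route(s))
```

Only the `human_review` path is guaranteed a human look; the two auto paths get one only if a user appeals or reports.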
1 · The loop & natural feedback. Feedback arrives from three places, and all three involve someone who looked: the middle-band posts human moderators reviewed, user appeals of removals, and user reports of approved posts. Every one of those is skewed — toward content near the threshold, or content that provoked enough reaction for someone to act.
2 · The blind spot. The posts the system confidently auto-approved that nobody reported, and the ones it confidently auto-removed that nobody appealed, generate nothing. By construction, the confidently-wrong cases live right here: harmful content confidently approved with no report, or benign content confidently removed whose author didn't bother to appeal. The system is blind exactly where it was most sure.
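One way to make the blind spot concrete is to write down which cases produce feedback at all. The sketch below is illustrative only: `generates_feedback` and the toy `log` rows are made up for this example, and the path names are assumed labels for the three routes.

```python
def generates_feedback(path: str, appealed: bool = False, reported: bool = False) -> bool:
    """True if some human looked at the case, so an outcome flows back."""
    if path == "human_review":
        return True       # a moderator graded it
    if path == "auto_remove":
        return appealed   # feedback only if the author appealed
    if path == "auto_approve":
        return reported   # feedback only if a user reported it
    raise ValueError(f"unknown path: {path}")


# Toy decision log (made-up rows): (path, appealed, reported)
log = [
    ("auto_approve", False, False),
    ("auto_approve", False, True),
    ("auto_remove", False, False),
    ("auto_remove", True, False),
    ("human_review", False, False),
    ("auto_approve", False, False),
]

blind_spot = [row for row in log if not generates_feedback(*row)]
blind_share = len(blind_spot) / len(log)
print(f"{len(blind_spot)} of {len(log)} cases generate no feedback")
```

Every row in `blind_spot` is an auto decision nobody contested, which is exactly the population the random sampled review has to cover.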
3 · The sampling scheme. Pull a random sample of auto-approved and auto-removed posts, ignoring the score entirely, and have moderators grade them against policy. Random is the whole game: if you instead re-check the auto-decisions nearest the threshold (the least-confident ones), you've reimported the system's bias and you'll miss the confident errors, and catching those is the entire point. Most sampled posts will confirm the system was right; the small fraction that don't are the early warning, and they're the only unbiased estimate of the confidently-wrong rate on each auto path.
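A small simulation shows why the choice of sample matters. Everything here is synthetic and assumed for illustration: the threshold value, the population size, and above all the premise that the missed violations sit at very low scores (a confident blind spot). It is a sketch of the argument, not a measurement of any real system.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

APPROVE_BELOW = 0.2  # hypothetical auto-approve threshold

# Synthetic auto-approved posts: (score, actually_violating).
# Assumption for illustration only: the violations the model misses sit at
# very LOW scores, far from the threshold (it is confidently wrong about them).
population = []
for _ in range(20_000):
    score = rng.uniform(0.0, APPROVE_BELOW)
    violating = score < 0.02 and rng.random() < 0.3
    population.append((score, violating))


def error_rate(sample):
    """Fraction of sampled posts a moderator would grade as violating."""
    return sum(violating for _, violating in sample) / len(sample)


true_rate = error_rate(population)

# Scheme A: uniform random sample that ignores the score entirely.
random_sample = rng.sample(population, 1_000)

# Scheme B: re-check the least-confident auto-approvals (highest scores).
near_threshold = sorted(population, key=lambda p: p[0], reverse=True)[:1_000]

random_estimate = error_rate(random_sample)
near_threshold_estimate = error_rate(near_threshold)

print(f"true rate:               {true_rate:.3f}")
print(f"random sample estimate:  {random_estimate:.3f}")
print(f"near-threshold estimate: {near_threshold_estimate:.3f}")
```

Under these assumptions the near-threshold re-check sees none of the confident errors, while the random sample lands close to the true rate. The same scheme would be run separately on the auto-removed posts.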
4 · Feedback quality. Appeals and reports are fast but biased — only motivated users act. Moderator labels on the middle band are cleaner but still biased — only the borderline cases. The random sample is slower and costlier but unbiased. So an honest dashboard separates "what users complained about" (fast, biased) from "what the random audit found" (slow, unbiased) — and where the audit is still thin, "we don't know yet" is the honest read.
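One hedged way to encode "we don't know yet" on a dashboard is to report an interval for the audit's error rate and refuse to print a headline number while that interval is still wide. The sketch below uses the standard Wilson score interval; the function names and the `max_width` cut-off of 0.02 are assumptions chosen for illustration, not a recommended policy.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval (about 95% for z=1.96) for a proportion."""
    if n == 0:
        return (0.0, 1.0)  # no audit data: the rate could be anything
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def audit_readout(errors: int, n: int, max_width: float = 0.02) -> str:
    """Report the audited error rate, or admit the sample is too thin."""
    lo, hi = wilson_interval(errors, n)
    if hi - lo > max_width:
        return f"we don't know yet (n={n}, interval {lo:.3f} to {hi:.3f})"
    return f"confidently-wrong rate between {lo:.3f} and {hi:.3f} (n={n})"


print(audit_readout(0, 20))      # thin audit: zero errors found proves little
print(audit_readout(30, 5_000))  # larger audit: narrow enough to report
```

Zero errors in twenty graded posts still leaves a wide interval, so the honest readout is the first branch. The complaint-driven numbers (appeals, reports) would sit in a separate panel, since no interval fixes their bias.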
5 · The metric & gaming. Say the team optimizes auto-resolution rate to cut moderation cost. Cheapest way to move it: widen the auto bands (lower the "remove" threshold, raise the "approve" threshold) → more posts auto-decided at lower confidence → more mistakes shipped unreviewed. (Optimize low appeal rate instead, and you game it by making appeals hard to file.) Gaming-resistant replacements: the confidently-wrong rate from the random audit, the score's calibration, and the quality of the human-reviewed band. These are numbers you can't move without actually moderating better.
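The gaming move can be shown with arithmetic alone. This sketch makes two simplifying assumptions that are not claims about any real system: scores are spread evenly over [0, 1], and the score is perfectly calibrated, so a post scored `s` is violating with probability `s`. The function name `band_metrics` and the threshold pairs are hypothetical.

```python
def band_metrics(approve_below: float, remove_above: float, n: int = 1001):
    """Return (auto-resolution rate, expected unreviewed-error rate).

    Assumes evenly spread scores and a perfectly calibrated score.
    """
    auto = 0
    expected_errors = 0.0
    for i in range(n):
        s = i / (n - 1)
        if s < approve_below:
            auto += 1
            expected_errors += s        # chance an approved post is violating
        elif s > remove_above:
            auto += 1
            expected_errors += 1 - s    # chance a removed post is benign
    return auto / n, expected_errors / n


narrow = band_metrics(approve_below=0.1, remove_above=0.9)
wide = band_metrics(approve_below=0.4, remove_above=0.6)

print(f"narrow bands: auto rate {narrow[0]:.2f}, unreviewed errors {narrow[1]:.3f}")
print(f"wide bands:   auto rate {wide[0]:.2f}, unreviewed errors {wide[1]:.3f}")
```

Widening the bands raises the auto-resolution rate and the expected share of unreviewed mistakes together, with no change to the model at all. That is why the audited error rate, not the automation rate, is the number worth targeting.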
Notice what just happened: I located the blind spot (the confident auto-decisions, not the appealed ones), designed a random sample, classified the feedback quality, and found the gameable metric and its replacement. That's the whole of Part A. Your job is to make those five moves on your system.
This is the module in one picture. Look at what connects to Evaluate: escalated cases get there naturally, but the auto-resolved cases have no direct path. The only bridge across that gap is the amber node, the random sampled review. Your audit is one question: does that amber bridge exist in your system, or do the confident cases never come back at all?
```mermaid
flowchart TD
    IN([Incoming cases])
    SYS["Moderation system<br/>scores + thresholds"]
    ESC["Escalated / appealed / reported<br/>a human looked"]
    AUTO["Auto-resolved<br/>confidently approved or removed"]
    EVAL["Evaluate<br/>metrics + retraining"]
    SAMPLE["Random sampled review<br/>grade auto cases, ignore the score"]

    IN --> SYS
    SYS == uncertain ==> ESC
    SYS == confident ==> AUTO
    ESC -- natural but biased --> EVAL
    AUTO -- random sample --> SAMPLE
    SAMPLE -. unbiased signal .-> EVAL

    style SAMPLE stroke:#F5A623,stroke-width:2px
    style AUTO stroke:#C97E12,stroke-width:1.5px
```
Copy this into your doc and replace each blank. Keep it tight — most of the marks are in points 2 and 3.
```text
SYSTEM (one line): __________________________________________

1. THE LOOP
   What outcomes become feedback: ___________________________
   Which cases generate it naturally (a human looked): ______

2. THE BLIND SPOT
   Cases that generate NO feedback (confident, auto): _______
   Where confidently-wrong errors would hide: _______________

3. RANDOM SAMPLING SCHEME
   How I'd sample the blind spot (must be RANDOM): __________
   Why lowest-confidence re-checks don't count: _____________

4. FEEDBACK QUALITY
   Fast/slow · clean/noisy · biased/unbiased: _______________
   Where "we don't know yet" is the honest answer: __________

5. THE METRIC
   What's optimized: ________________________________________
   Cheapest way to game it: _________________________________
   Gaming-resistant replacement: ____________________________
```
If you want a diagram, copy the Mermaid block above and relabel it for your system — it renders as a flowchart in your Gist.
Part B — starter prompts (answer in your own words; don't just restate these):
- Ask what behaviour cheaply raises "automation rate" — then ask whether that behaviour is the thing you actually wanted. The gap between those two answers is your whole point.
- Before shipping on a headline number, ask: what evidence would change my mind, and do I have it yet? If the blind spot is unmeasured, the honest move is usually "not yet — show me the sampled review first."
- Name the properties of your feedback that make confidence dishonest — slow? sparse? biased? When those stack up, "we don't know yet" isn't a dodge, it's the rigorous answer.
- Ask what's in your golden set that isn't in the real world anymore — and what's in the world that never made it into the set. That gap is where trusting the benchmark quietly becomes the map-as-territory error.
Common pitfalls to avoid:
- Auditing the escalated / appealed cases and calling it done. That's the visible region; the blind spot is everything the system confidently let through.
- A "sampling scheme" that re-checks the least-confident auto cases. Still biased. It has to be random.
- Reporting a cheaply-gamed metric — automation rate, raw accuracy, low appeal rate. Ask how you'd cheat it.
- Trusting a green dashboard built on thin, biased feedback instead of saying "we don't know yet."
- Treating the golden set as the whole truth. It's a fixed, biased map — useful, not gospel.
What a strong audit does:
- Correctly locates the blind spot: the confident, auto-handled cases, not the escalated ones.
- The sampling scheme is genuinely random — it doesn't let the system choose what gets re-checked.
- Names a gaming-resistant metric and shows the cheapest way the easy proxy gets gamed.
- Reasons honestly about feedback quality, and is willing to say "not yet."
Good luck — bring the audit that distrusts the evaluation the system handed you for free, and goes looking for what's missing.