@decagondev
Created June 10, 2026 13:13

Module 4 Challenger — Take-Home

Feedback, Evaluation & Honest Metrics

You've designed a system, audited its honesty, and audited its provenance. This time you audit how it evaluates itself — and whether it's quietly fooling itself. The grade is in the reasoning, not in finding a flawless system.

Time: ~45–60 min · Due: before the Module 5 session · Submit: post your write-up in the cohort channel.


The assignment

Part A — Audit a feedback loop for its blind spot (core)

Pick one system you know that has a feedback loop — a recommender, a fraud or spam model, a support-ticket router, a forecasting system.

⚠️ The content-moderation system audited below is off-limits as your submission — it's the worked demo. Pick something else.

In a ~300–400 word audit, answer five things:

  1. The loop. What outcomes become feedback, and which cases generate it naturally (the ones a human actually looks at)?
  2. The blind spot. Which cases generate no feedback (confident, auto-handled)? Where would confident drift (the confidently-wrong cases) hide?
  3. The sampling scheme. Design a random sampled review that would see the blind spot. Be specific about random — and say why re-checking the lowest-confidence auto cases would not work.
  4. Feedback quality. Is your feedback fast or slow, clean or noisy, biased or unbiased? What does that imply about trusting the dashboard — and is "we don't know yet" the honest answer anywhere?
  5. The metric. What number is (or would be) optimized? What's the cheapest way to game it — and what gaming-resistant metric would you replace it with?

Part B — Four short reflections (a paragraph each)

  1. Someone rewards your team for "automation rate." How do you hit the target without doing what they actually want — and what metric would you replace it with?
  2. A change improves offline accuracy, but you haven't measured the blind spot and your early feedback is thin and biased. Accept, reject, or "not yet" — and what would you demand first?
  3. When is "we don't know yet" the honest answer about a system's performance — and why is that sometimes safer than a confident dashboard?
  4. A golden set is preserved ground truth (Module 3). When does trusting it become its own map-as-territory error — and what would you do about it?

Optional — Peer response (this is our discussion, async)

Reply to one classmate's audit and poke a hole in their evaluation. Find where their sampling scheme isn't actually random (so it reimports the system's bias), or where their chosen metric could still be gamed — and propose the fix.


Before you start: the five ideas you're applying

Everything below is just these, applied to one system. Keep them in front of you:

  1. The feedback a system generates is biased by its own decisions. It learns from what it sees, but it chooses what it sees.
  2. The blind spot. Confident, auto-handled cases generate no feedback — and the confidently-wrong errors live there by construction.
  3. Random is load-bearing. Only a random sample of the blind spot is unbiased. Let the system pick which cases to re-check and you've reimported its bias.
  4. Goodhart's law. When a measure becomes a target, it stops being a good measure. Always ask: what's the cheapest way to move this number — and is that what I want?
  5. Know your feedback quality. Fast or slow, clean or noisy, biased or unbiased. "We don't know yet" beats a confident dashboard built on thin, biased signal.

A worked starter — auditing a different system

Let's run the audit on something that is not your assignment: an automated content-moderation system. It scores each post; above a "remove" threshold it auto-removes, below an "approve" threshold it auto-approves, and the uncertain middle band goes to human moderators. Users can appeal removals and report approved posts. Watch the moves, then make them on your own system.
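The routing rule just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the system's actual implementation: the names `route`, `generates_feedback`, `REMOVE_AT` and `APPROVE_AT` are hypothetical, the threshold values are made up, and the score is assumed to be a harm estimate in [0, 1].

```python
# Hypothetical thresholds: the worked demo does not give real values.
REMOVE_AT = 0.90   # score above this: auto-remove
APPROVE_AT = 0.10  # score below this: auto-approve


def route(score: float) -> str:
    """Route one post by its harm score (assumed to lie in [0, 1])."""
    if score > REMOVE_AT:
        return "auto_remove"
    if score < APPROVE_AT:
        return "auto_approve"
    return "human_review"  # the uncertain middle band


def generates_feedback(decision: str, appealed: bool, reported: bool) -> bool:
    """A case produces feedback only if some human looked at it."""
    if decision == "human_review":
        return True          # a moderator graded it
    if decision == "auto_remove":
        return appealed      # only an appeal brings it back
    return reported          # auto_approve: only a report brings it back
```

Every case for which `generates_feedback` returns `False` never reaches evaluation on its own. That set is what the rest of the audit is about.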

1 · The loop & natural feedback. Feedback arrives from three places, and all three involve someone who looked: the middle-band posts human moderators reviewed, user appeals of removals, and user reports of approved posts. Every one of those is skewed — toward content near the threshold, or content that provoked enough reaction for someone to act.

2 · The blind spot. The posts the system confidently auto-approved that nobody reported, and the ones it confidently auto-removed that nobody appealed, generate nothing. By construction, the confidently-wrong cases live right here: harmful content confidently approved with no report, or benign content confidently removed whose author didn't bother to appeal. The system is blind exactly where it was most sure.

3 · The sampling scheme. Pull a random sample of auto-approved and auto-removed posts, ignoring the score entirely, and have moderators grade them against policy. Random is the whole game: if you instead re-check the auto-decisions nearest the threshold (the least-confident ones), you've reimported the system's bias and you'll miss the confident errors, which are exactly what the review exists to find. Most sampled posts will confirm the system was right; the small fraction that don't are the early warning, and they're the only unbiased estimate of the confidently-wrong rate on each auto flow (approvals and removals).
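A small sketch makes the difference between the two review schemes concrete. Everything here is an assumption for illustration: the function names, the dictionary fields, and above all the synthetic data, which is deliberately built so that the errors sit in the most confident fifth of the auto-approved cases. It is a toy, not a measurement of any real system.

```python
import random


def random_audit_sample(auto_cases, k, seed=None):
    """Uniform random sample of auto-resolved cases; the score is never consulted."""
    rng = random.Random(seed)
    return rng.sample(auto_cases, k)


def least_confident_sample(auto_cases, k, approve_at):
    """The biased alternative: re-check the k auto-approved cases nearest the threshold."""
    return sorted(auto_cases, key=lambda c: approve_at - c["score"])[:k]


def error_rate(graded):
    """Share of graded cases that a moderator marked wrong."""
    return sum(c["wrong"] for c in graded) / len(graded)


# Synthetic auto-approved cases (hypothetical). Scores run from 0.0 up to
# just under the approve threshold; lower score means more confident.
APPROVE_AT = 0.10
cases = []
for i in range(2000):
    score = APPROVE_AT * i / 2000
    wrong = i < 400 and i % 10 == 0  # errors placed where the system is most sure
    cases.append({"id": i, "score": score, "wrong": wrong})

true_rate = error_rate(cases)
random_est = error_rate(random_audit_sample(cases, 400, seed=7))
biased_est = error_rate(least_confident_sample(cases, 400, APPROVE_AT))

print(f"true rate:             {true_rate:.3f}")
print(f"random audit estimate: {random_est:.3f}")
print(f"least-confident check: {biased_est:.3f}")
```

In this toy data the least-confident re-check reports zero errors by construction, because every error was placed far from the threshold. The random sample has no such blind region, so its estimate varies around the true rate from draw to draw.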

4 · Feedback quality. Appeals and reports are fast but biased — only motivated users act. Moderator labels on the middle band are cleaner but still biased — only the borderline cases. The random sample is slower and costlier but unbiased. So an honest dashboard separates "what users complained about" (fast, biased) from "what the random audit found" (slow, unbiased) — and where the audit is still thin, "we don't know yet" is the honest read.
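One way to make "we don't know yet" operational is to report the audited error rate with an interval and refuse to show a headline number while that interval is still wide. The sketch below uses the standard Wilson score interval for a binomial proportion. The function names and the `max_width` cut-off are hypothetical choices for illustration, not part of the worked demo.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no audit data: anything is possible
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def honest_read(errors: int, n: int, max_width: float = 0.02) -> str:
    """Report the audited rate only when the interval is narrow enough to trust."""
    low, high = wilson_interval(errors, n)
    if high - low > max_width:  # max_width is an assumed, team-chosen tolerance
        return f"we don't know yet (n={n}, plausible range {low:.1%} to {high:.1%})"
    return f"confidently-wrong rate {errors / n:.1%} (range {low:.1%} to {high:.1%})"


print(honest_read(1, 50))       # thin audit
print(honest_read(400, 20000))  # same observed rate, much larger audit
```

Both calls have the same observed rate, one error in fifty, but only the larger audit narrows the range enough to pass the cut-off. The thin one is reported as unknown rather than as a reassuring point estimate.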

5 · The metric & gaming. Say the team optimizes auto-resolution rate to cut moderation cost. Cheapest way to move it: widen the auto bands (lower the "remove" threshold, raise the "approve" threshold) → more auto-decisions made with less margin → more mistakes shipped unreviewed. (Optimize low appeal rate instead, and you game it by making appeals hard to file.) Gaming-resistant replacements: the confidently-wrong rate from the random audit, the score's calibration, and the quality of the human-reviewed band. These are numbers you can't move without actually moderating better.
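The gaming move can be shown on toy data: change nothing about the model, only the thresholds, and watch the headline number rise together with the number of errors nobody reviews. The data below is synthetic and hypothetical. The labelling rule in `is_harmful` is an assumption chosen so that the score is mostly right and noisier near the middle, and `audit_thresholds` is an illustrative name.

```python
def is_harmful(i: int) -> bool:
    """Hypothetical ground truth for post i, whose score is i / 1000."""
    base = i >= 500  # the score is mostly right
    noisy = i % 25 == 0 or (300 <= i < 700 and i % 4 == 0)  # more label noise mid-range
    return base != noisy  # flip the label where noisy


posts = [(i / 1000, is_harmful(i)) for i in range(1000)]


def audit_thresholds(posts, approve_at: float, remove_at: float):
    """Return (auto-resolution rate, number of wrong auto-decisions)."""
    auto = unreviewed_errors = 0
    for score, harmful in posts:
        if score > remove_at:
            auto += 1
            unreviewed_errors += not harmful  # benign post auto-removed
        elif score < approve_at:
            auto += 1
            unreviewed_errors += harmful      # harmful post auto-approved
    return auto / len(posts), unreviewed_errors


narrow_rate, narrow_errors = audit_thresholds(posts, approve_at=0.10, remove_at=0.90)
wide_rate, wide_errors = audit_thresholds(posts, approve_at=0.40, remove_at=0.60)

print(f"narrow bands: auto rate {narrow_rate:.1%}, unreviewed errors {narrow_errors}")
print(f"wide bands:   auto rate {wide_rate:.1%}, unreviewed errors {wide_errors}")
```

The auto-resolution rate improves simply because the middle band shrank. The error count is visible here only because the toy data carries ground truth; in a real system it would have to come from the random audit.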

Notice what just happened: I located the blind spot (the confident auto-decisions, not the appealed ones), designed a random sample, classified the feedback quality, and found the gameable metric and its replacement. That's the whole of Part A. Your job is to make those five moves on your system.


The shape you're auditing

This is the module in one picture. Look at what connects to Evaluate: escalated cases get there naturally, but the auto-resolved cases have no direct path. The only bridge across that gap is the amber random sampled review. Your audit is one question: does that amber bridge exist in your system, or do the confident cases never come back at all?

flowchart TD
    IN([Incoming cases])
    SYS["Moderation system<br/>scores + thresholds"]
    ESC["Escalated / appealed / reported<br/>a human looked"]
    AUTO["Auto-resolved<br/>confidently approved or removed"]
    EVAL["Evaluate<br/>metrics + retraining"]
    SAMPLE["Random sampled review<br/>grade auto cases, ignore the score"]

    IN --> SYS
    SYS == uncertain ==> ESC
    SYS == confident ==> AUTO
    ESC -- natural but biased --> EVAL
    AUTO -- random sample --> SAMPLE
    SAMPLE -. unbiased signal .-> EVAL

    style SAMPLE stroke:#F5A623,stroke-width:2px
    style AUTO stroke:#C97E12,stroke-width:1.5px

Your turn — a template to fill in

Copy this into your doc and replace each blank. Keep it tight — most of the marks are in points 2 and 3.

SYSTEM (one line): __________________________________________

1. THE LOOP
   What outcomes become feedback: _____________________
   Which cases generate it naturally (a human looked): _

2. THE BLIND SPOT
   Cases that generate NO feedback (confident, auto): _
   Where confidently-wrong errors would hide: _________

3. RANDOM SAMPLING SCHEME
   How I'd sample the blind spot (must be RANDOM): ____
   Why lowest-confidence re-checks don't count: _______

4. FEEDBACK QUALITY
   Fast/slow · clean/noisy · biased/unbiased: _________
   Where "we don't know yet" is the honest answer: ____

5. THE METRIC
   What's optimized: __________________________________
   Cheapest way to game it: ___________________________
   Gaming-resistant replacement: ______________________

If you want a diagram, copy the Mermaid block above and relabel it for your system. Placed inside a code fence tagged mermaid in a Markdown Gist, it renders as a flowchart.

Part B — starter prompts (answer in your own words; don't just restate these):

  1. Ask what behaviour cheaply raises "automation rate" — then ask whether that behaviour is the thing you actually wanted. The gap between those two answers is your whole point.
  2. Before shipping on a headline number, ask: what evidence would change my mind, and do I have it yet? If the blind spot is unmeasured, the honest move is usually "not yet — show me the sampled review first."
  3. Name the properties of your feedback that make confidence dishonest — slow? sparse? biased? When those stack up, "we don't know yet" isn't a dodge, it's the rigorous answer.
  4. Ask what's in your golden set that isn't in the real world anymore — and what's in the world that never made it into the set. That gap is where trusting the benchmark quietly becomes the map-as-territory error.

Common traps (self-check before you submit)

  • Auditing the escalated / appealed cases and calling it done. That's the visible region — the blind spot is everything the system confidently let through.
  • A "sampling scheme" that re-checks the least-confident auto cases. Still biased. It has to be random.
  • Reporting a cheaply-gamed metric — automation rate, raw accuracy, low appeal rate. Ask how you'd cheat it.
  • Trusting a green dashboard built on thin, biased feedback instead of saying "we don't know yet."
  • Treating the golden set as the whole truth. It's a fixed, biased map — useful, not gospel.

What a strong submission shows

  • Correctly locates the blind spot — the confident, auto-handled cases, not the escalated ones.
  • The sampling scheme is genuinely random — it doesn't let the system choose what gets re-checked.
  • Names a gaming-resistant metric and shows the cheapest way the easy proxy gets gamed.
  • Reasons honestly about feedback quality, and is willing to say "not yet."

Good luck — bring the audit that distrusts the evaluation the system handed you for free, and goes looking for what's missing.
