Skip to content

Instantly share code, notes, and snippets.

@pmarreck
Last active July 1, 2026 06:21
Show Gist options
  • Select an option

  • Save pmarreck/b30aa3ca69cb70a5526f8a63ab8c8d7e to your computer and use it in GitHub Desktop.

Select an option

Save pmarreck/b30aa3ca69cb70a5526f8a63ab8c8d7e to your computer and use it in GitHub Desktop.
MFIC — Mechanically-Falsifiable Independent Control: a discipline for enforced tests and runtime checks that keep agents honest

MFIC — Mechanically-Falsifiable Independent Control

A discipline for checks that can't be fooled by the same mistake that produced the work.

Abstract. When the author of a piece of work — an LLM, or a human moving fast — can be confidently and silently wrong, the usual defenses fail quietly: a test written from the same mistaken assumption as the code passes while the code is broken. MFIC is the class of check that escapes this trap, defined by four properties that must all hold at once. It is Mechanical (cases swept by machine, not hand-picked, so nothing is silently omitted), Falsifiable (each case genuinely bites when the work is wrong, on inputs you couldn't pre-arrange to pass), Independent (its verdict comes from the contract or the data — a source the producer doesn't control — which is segregation of duties restated for software), and a real Control (it holds the authority to block the bad outcome, not merely log it). This is the COSO / Sarbanes-Oxley model of internal control repointed at code and AI output, with the model as the new untrusted party; TDD supplies only the second of the four words, and the framework names the rest. Two distinctions carry the practical weight: whether an oracle exists outside the producer — if so, who writes the check is irrelevant; if not, you need a separate, adversarial agent — and whether a control is advisory (the producer could route around it if determined) or hard (it mechanically lacks the capability to proceed, as when it cannot land a commit a separate agent hasn't cryptographically signed). The apparent paradox is that faster, more capable, more autonomous producers demand more control surface, not less — but the controls ride the same cost curve as the capability and are built by the same engine that makes them necessary, turning the whole exchange into a net gain rather than a tax.


Section 1 — The idea

The problem

You have an artifact — code, a document, a data record — produced by an author you can't fully trust to be correct: an LLM, or a human working fast. That author will sometimes be confidently wrong, and the wrong output will look exactly like the right output. The question MFIC answers is: what kind of check actually catches that, and what kind only appears to?

The trap is subtle. Most checks are written by, or reason from, the same source that produced the work — so they inherit its blind spots. A test that encodes the same wrong assumption as the code will pass while the code is wrong. The entire purpose of MFIC is to characterize the checks that escape this trap, and to name precisely what they have that ordinary checks lack.

This is not a new problem. It is the problem corporate internal-control practice — COSO, Sarbanes-Oxley — has spent decades on, under a different name. Their untrusted party was a person who might cook the books; ours is a model that might hallucinate. The machinery transfers wholesale: segregation of duties, preventive/detective/corrective controls, audit trails, reasonable assurance, test of controls. MFIC is that machinery pointed at code and AI output.

It also explains where TDD stops. TDD's red phase establishes exactly one property: that a test can fire (it failed before the code existed). That's necessary, but it's only one of four things you need. MFIC names the other three.

The definition

An MFIC control is an independent check at a handoff boundary that derives its verdict from the contract or the data itself — never from the producer's account of its own work — and holds the authority to reject, halt, or redirect.

The litmus

One question decides whether a check qualifies:

If the same agent wrote both the check and the thing being checked, could it pass with wrong work?

If yes, the check is gameable, and it is not MFIC. If no, it is. Every design decision below exists to let you answer "no."

The four words

Each word names a distinct way checks fail, and closes it.

Mechanically — the cases are swept by machine, not chosen by hand. Hand-picked cases test what the author thought to test, which is exactly the set that excludes the bug they didn't think of. Mechanical generation — exhaust the domain, mutate the input, generate properties — covers the cases nobody nominated. Without it you get omission bugs: the code is wrong in a region no test ever visits.

Falsifiable — each case is a real refutation you couldn't have pre-arranged to pass. A test that asserts almost nothing ("it didn't crash") stays green whether or not the output is correct; a weak oracle hides wrong answers behind a passing run. Falsifiable means the check bites: when the work is wrong it goes red, and you didn't get to choose the inputs that keep it quiet.

Independent — the source of truth is causally independent of the producer. This is segregation of duties restated for software: the maker and the checker must not be the same entity, or share the same blind spot. This is the axis TDD leaves wide open — writing the test first proves it can fail, but if one author wrote both test and code, a shared wrong assumption sails through both. (This word has enough depth that it gets its own subsection below.)

Control — the check is not just an observation; it has authority. A baseline plus the power to fail the build, block the commit, or steer the work back. Telemetry that notices a problem and logs it is not a control — it observes without blocking. A check that stops the bad thing from happening is. (This word also gets its own subsection.)

A check with all four is MFIC. Drop any one and you have a familiar near-miss: hand-picked cases (¬M), vacuous green tests (¬F), collusive tests that share the code's blind spot (¬I), or a warn-only monitor that logs but never blocks (¬C).

Where the checks live, and how to pick one

Statically — in CI, verifying the code before it ships — these are exhaustion over a finite domain, differential testing against a reference implementation, metamorphic testing of transform-invariants, input-mutation coverage, and property-based tests. All detective. Dynamically — in production, verifying state as it flows between stages — they are boundary contracts, schema and invariant preservation, and reject-to-previous loops. Preventive and corrective.

Reach for the strongest, cheapest check the structure allows. If the domain is finite, exhaust it. If the operation has an inverse, round-trip it (decode(encode(x)) == x). If a reference implementation exists, run both and diff. If a transform has an invariant, assert it. If you can corrupt a known-good input, mutate it and measure how much corruption your verifier catches. Otherwise, generate properties and shrink failures to a minimal case. The ranking has one principle behind it: prefer an oracle the author didn't write. An asserted "expected value" can itself be the bug; a round-trip, a reference, or an invariant cannot be, because none of them depends on the author having been right.

Input-mutation deserves a special mention because it turns a claim into a number. "My validator checks the whole format" is unfalsifiable hand-waving. Instead: take a valid input, flip each bit or byte, and assert the validator now rejects it. The fraction it rejects is a coverage figure — how much of the format the validator actually constrains — that nobody can hallucinate past. Two honesty requirements keep it from lying: pair it with a corpus of all-valid inputs that must all pass (or a checker that rejects everything scores a perfect 100%), and score only the bytes that must matter (padding and checksum-excluded regions are legitimate don't-cares).

Independence, in depth: is there an oracle the producer didn't write?

The whole independence question reduces to one fork.

If there is an oracle outside the producer — a reference implementation, an inverse to round-trip against, a transform-invariant, a checksum fixed before the producer touched anything — then authorship doesn't matter. The same agent can write both the code and the check and still cannot pass wrong work, because the judge is a separate causal thing it doesn't control. Round-tripping doesn't care who wrote the round-trip; if encode and decode disagree, they disagree.

If there is no such oracle — the ordinary example-based unit test, where the author's hand-asserted expected values are the standard of truth — then the litmus bites hard. One agent can encode the same wrong assumption in both the code and its expected values, and they will agree, in lockstep, all the way to production. Writing the test first does not rescue this: temporal order is not causal independence. Proving a test can fail is not the same as proving it fails for the right reason.

Only in this no-oracle case do you actually need a different agent. The fix is a checker that derives its cases from the contract and never reads the implementation — because a checker that reads the code gets anchored into the same blind spot (collusion through reading). With a fleet of agents this is cheap: spawn a fresh checker, hand it only the contract, keep the code hidden. Draw it from a different model family and the blind spots decorrelate — the closest thing to an independent mind short of hiring a human. SOX's "independent" always meant different people; with cheap agents, "different checker" becomes a default you can finally afford.

There's a cost to the naive version, though: if the checker must be written before the work, you serialize the two and the producer sits idle. The way out is to split the four words across two concurrent agents. The producer writes its own red-green tests and codes against them — it owns Mechanically and Falsifiable and never waits on anyone. The approver, reasoning only from the contract, owns Independence and Control: it holds acceptance authority and can interrupt. The producer runs free; its work simply isn't booked until the approver signs. Optimistic execution, gated settlement.

Critically, the approver doesn't ask "do the tests pass?" — it asks "do these tests actually pin the contract, or are they vacuous or collusive?" That is a test of controls in the audit sense: you don't redo the work, you check whether its control is effective. The veto is Control; the contract-derived verdict is Independence; the ability to bounce the work back while the context is still loaded is the corrective arm of Control. This pairing earns its keep precisely where the tests are hand-authored and no oracle exists (the collusion-prone case); where a differential or metamorphic oracle already exists, the approver is just belt-and-suspenders.

Control, in depth: advisory versus hard

Control is authority to reject — and authority comes in strengths. The single most important distinction is between an advisory control and a hard one.

An advisory control is one the producer bumps into but could still get around if it were determined to. The capability to do the wrong thing remains in its hands, and you are relying on it not exercising that capability. A hard control is one where the capability has been removed: the producer cannot mechanically do the wrong thing at all, no matter how determined, because it lacks something it cannot manufacture. That is the line. Everything else is gradation.

Three rungs make the gradation concrete:

  1. Written policy. A rule in a style guide or a prompt: "always use the safe tool," "the page must show today's date." This is the weakest advisory control — it has no power to stop anything, and it leaks the instant the producer ignores it (which, for an LLM reaching for the obvious tool, is often).
  2. Blocking gate. A hook that intercepts the action and refuses it — a pre-commit hook that rejects a stale field, a wrapper that blocks a forbidden command. This supplies real enforcement: it actually stops the action. But it is still advisory in the strict sense, because whoever controls the gate can route around it (call a different tool, disable the hook). Its value is honest and large anyway: the realistic threat is not a determined adversary, it's a hurried actor doing the obvious thing, and the gate catches exactly that. This is reasonable assurance — correctly scoped, not a failure.
  3. Cryptographic capability. Now the producer genuinely lacks the capability to proceed alone. An approver that has adversarially accepted the work signs the exact artifact — a commit's tree-hash, or a command plus its arguments — with a private key the producer does not hold; a verifier the producer cannot edit (a protected remote, an isolated CI runner) admits only a matching signature. Wrong, missing, or unsigned means no land and no run. The producer can't forge the signature, can't redirect the approval to a different artifact, and can't self-approve. Rungs 1 and 2 are advisory; rung 3 is hard. The jump from 2 to 3 is the jump from "we trust it won't route around the gate" to "it cannot."

One subtlety completes the hard control. A signature bound only to what (this artifact) still replays across when — "run command X," signed once with no notion of time, is a bearer token valid forever. So you sign when too: an approved_at timestamp with a short maximum age, or a single-use nonce, or both. Commits are naturally fresh because the tree-hash changes every time; re-runnable actions are not, so the time-gate matters most there. An approval is a fresh act, not a possession — bound to what, who, and when. Drop the last and it decays into a standing privilege.

Two honesty notes hold across every rung. First, each rung only relocates trust — to the gate's host, to the approver's key, to the approver's independence. None of this is proof; it is reasonable assurance, which is the recognized standard, not a deficiency. Second, every hard control needs a break-glass override, because sometimes the checker is wrong and a human must overrule it. That override looks like surrender — it reintroduces the advisory boundary you just worked to remove — but you can't eliminate it, and the known failure mode even has a name in the SOX literature: management override of controls. The fix is not to remove the override but to make it loud: signed, dated, reasoned, logged, reviewed. A silent override is a hole; a conspicuous one preserves the assurance.

And — the discipline applied to itself — an honest setup gates the human's own commits behind the AI checker too. The thesis refuses to exempt its author.


Section 2 — Concrete examples

1. Test-driven development, completed

This framework opened by positioning itself against TDD: the red phase buys exactly one of the four words. So TDD is the natural place to watch all four get assembled — and to see precisely what plain, single-author TDD leaves on the table.

Falsifiable comes for free, and it is the entire point of the red phase. Writing the test and watching it fail before the implementation exists is what proves the test can go red at all — that it measures something. A test that was green the moment it was written asserts nothing, and gives you no evidence it would catch the bug it's named for. The failure is the proof of bite, which is exactly the property most "tests" never establish.

Control is one wiring decision away. A red test that prints to a log is telemetry; a red test that blocks the merge is a control. Make CI a hard stop on failure and you have C: the build does not proceed, the work is not booked, and no human has to remember to look.

Mechanically is where plain TDD is weakest — and the gap tells you what to add. A single hand-written example test is, by definition, a hand-picked case: the ¬M near-miss from Section 1. Its execution is automated and repeatable, but its coverage is only as wide as the cases you happened to think of — the exact set that omits the bug you didn't. You earn real M by mechanizing the cases on top of the red-green loop: property-based generation, input-mutation, exhaustion over a finite domain. "Test first" is a discipline about when you write the test; M is a discipline about no longer hand-picking what it covers.

Independence is the word TDD cannot supply alone — and the fix is adversarial role separation. If one agent writes both the test and the code, a shared wrong assumption is baked into both and rides straight through green; writing the test first proves it can fail, but temporal order is not causal independence. You reclaim I by handing spec- and test-authorship to a different agent, placed in a deliberately adversarial role to the builder. This is the producer/approver split from Section 1 pushed to its limit on the independence axis: the builder doesn't merely get its tests audited — it doesn't write them, so there is no test it can quietly make vacuous to grant itself an easy pass. The goal is to make self-collusion structurally impossible: an agent cannot conspire with itself to be lazy when a separate, adversarial verifier defines what "done" means and never sees the implementation. It is the same logic as code review, red-teaming, and maker-checker controls everywhere, and it targets a real failure mode — a single agent grading its own work tends toward leniency; an independent one has no shared blind spot to hide in.

So TDD is not a counterexample to MFIC — it is the skeleton. Out of the box it gives you F cleanly, and C the moment you make the stop hard; the other two words are a to-do list it hands you — mechanize the cases for M, split the roles for I.

2. Finding every instance of a thing in a document — narrow first, then judge

The task: a document contains some unknown number of instances of a category — citations, claims, dates, named entities, policy violations — and you need all of them. The tempting move is to hand the whole document to an LLM and ask it to find them. Don't.

An LLM asked to "find all X" has an unbounded, invisible failure mode: the instances it misses leave no trace. A non-event cannot be audited. You will never know what it didn't return, and it will sometimes return fewer than exist — non-deterministically, with no signal that anything is missing.

The MFIC move is to invert the order of trust, and it feels backwards. Before you involve the LLM, narrow the search space with a deterministic, mechanical method: regexes for the shapes the thing is known to take, vector or semantic search for the regions where it tends to appear, a grammar, a classical parser. (This is exactly what a tool like eyecite does for legal citations — match the known textual forms first.) That deterministic pass over-produces on purpose: it surfaces a finite, enumerated set of candidates, accepting false positives in order to avoid silent false negatives. Then you bring in the LLM — not to search, but to adjudicate each candidate in the set: is this one real, and does it mean what the document implies?

Why this is a control, in the MFIC sense:

  • It is mechanical: recall is established by an exhaustive machine sweep, not by the LLM's attention or sampling. The set of things to consider is generated, not noticed.
  • It is falsifiable and auditable per item: "did it find them all?" (unanswerable) becomes "here are N candidates, each with an accept/reject verdict" (fully inspectable). The LLM can no longer silently omit — every candidate is on a worklist it must dispose of, one by one, and omission becomes an explicit rejection that leaves a trace.
  • It is independent: the deterministic prefilter and the LLM have decorrelated failure modes. The regex over-recalls but does not quietly skip a pattern it was written for; the LLM rejects false positives but can no longer hide a miss. The composite has the prefilter's recall and the LLM's precision — two different mechanisms, neither standing in for the other. The honest limitation is worth stating, because it is also the real win. The composite can only find what the prefilter surfaces — if the deterministic pass misses a candidate, the LLM never sees it, so recall is capped by the prefilter, not improved by the LLM. But the prefilter's recall is a measurable, characterizable number (test it against a labeled corpus, or with input-mutation), whereas "did the LLM find all of them" is not measurable at all. You haven't necessarily raised recall; you've traded an unbounded, invisible failure for a bounded, quantified one. MFIC always prefers the failure mode you can put a number on.

The general principle, stated plainly: do not lean on the LLM for the part it can silently get wrong. Lean first on old-fashioned deterministic machinery to make the problem finite and explicit, and only then bring the LLM in to do the part it's actually good at — judgment over an enumerated set it cannot quietly skip past. Restricting the LLM up front is what makes its output trustworthy.

3. A check with no expected answer — the complexity gate

A function is annotated with its intended cost — say // complexity: O(n). A microbenchmark (a test type most suites lack entirely) runs it at N, 2N, 4N, 8N and asserts that the time per doubling stays near 2×, failing when the ratio climbs past roughly 2.8× (a quadratic blows up to ≈4×). It sweeps inputs (Mechanically); a quadratic loop trips an exact numeric ratio, not a gut feeling (Falsifiable); the verdict comes from measured growth, not from the annotation's claim about itself (Independent); and it fails the build (Control).

What makes this a clean illustration is that it needs no expected value and no reference implementation. It is an oracle-free metamorphic check: it asserts a relationship — f(2N) ≈ 2·f(N) — and the ratio cancels out hardware speed, so it's portable across machines. The ordinary microbenchmark that only logs absolute milliseconds is machine-dependent and warn-only — it observes but never blocks, so it isn't a control at all. In practice this kind of gate has surfaced accidental-quadratic bugs that an absolute-time benchmark walked right past, including ones worth a large constant-factor speedup once fixed.

4. The commit boundary — walked from advisory to hard

The single richest place to install these controls is the commit: the boundary where a fallible author's work enters permanent, shared history. The same boundary illustrates the entire advisory-to-hard ladder, on one concrete surface.

Advisory, from policy to gate — the freshness hook. Suppose a page must display today's date, and the LLM editing it keeps forgetting to update it. As a written rule in a style guide, this is the weakest possible control, and it leaks constantly. Replace it with a pre-commit hook that recomputes the date from the system clock and rejects any commit whose date doesn't match. Now it's enforced at every commit (Mechanically), as a hard reject rather than a warning (Falsifiable), against today as the clock reports it rather than whatever the author typed (Independent), blocking the commit outright (Control). This is physics over policy — correctness enforced at the gate, not requested in prose. The machinery ships in every git repo; almost nobody wires the commit boundary up as a contract.

Advisory, the routable gate — enforcing a tool choice. "Use the safe VCS wrapper, never the raw command" as prose was advisory and got ignored whenever the raw command was the obvious reach. A hook that blocks the raw command supplies real enforcement, and the invariant holds. It's still routable-around in principle — a determined actor could find another path — but it reliably stops the actual threat, which is a hurried actor doing the obvious thing. This is the self-referential case: a control the human imposed on the LLM, the same discipline applied to the very tool that writes everything else.

Hard — signed commits and signed actions. To cross from advisory to hard, deny the producer the capability to commit alone. An approver that has adversarially reviewed the diff signs the exact artifact — the commit's tree-hash (and, for privileged commands, the command plus its arguments, stamped with an approved_at time) — with a key the producer doesn't hold. A verifier the producer can't edit (a protected remote, an isolated CI runner) accepts only a matching, unexpired signature. Wrong, missing, or stale, and the commit can't land or the command can't run. A different party reasoning from the contract (Independent); no possible self-approval and no redirect to a different artifact; the work gated on a capability the producer simply does not have (Control, at full strength). Signed commits gate history and are naturally fresh because the tree-hash moves with every change; signed actions gate re-runnable privileged commands, where the freshness timestamp does the real work. Trust hasn't vanished — it's relocated to the verifier's host and the approver's key — but the producer can no longer enter bad state into history on its own say-so.

Across all three, the progression is the one Section 1 names: a rule it can ignore, a gate it could route around but realistically won't, and finally a capability it does not possess. Only the last is hard — and even the last keeps a loud, logged break-glass override, because the discipline that won't trust the producer also won't pretend its checker is infallible.


In Conclusion

None of this is a brake on the technology — it is in fact what lets you run it at full throttle. The upside is real and large: a capable agent, or a fleet of them, already out-produces the unaided human on the work this discipline covers, and that gap only widens from here. The paradox is that capturing the gain safely takes more control surface, not less — because a faster, more autonomous, more confident producer manufactures more places to be silently wrong, and each one needs a check that doesn't share its blind spot. So the control budget has to scale with capability, exactly when intuition says a better tool should need less supervision.

The resolution is built into the same shift that creates the problem. The checker is just another agent; the gate is a few lines in a hook; the signature is one key the producer doesn't hold. Segregation of duties used to mean hiring more people — now it is a default you can spin up per commit, from a different model family, for the price of a second inference. The controls ride the same cost curve as the capability, which is precisely why you can keep the speed without keeping the risk: every increment of throughput you unlock comes with an increment of cheap, mechanical verification to match it.

And here the loop closes in your favor: the same capability that has to be governed is also what builds the governance. The LLM, prompted to do so, writes the hook, generates the property-based and mutation suites, stands up the differential harness, wires the signature check — verification gets built on the very multiplier it exists to police. When a boundary demands a control nobody has invented yet, the LLM helps design and implement that one too, so even extending the discipline rides the curve rather than fighting it. That the producer's own kind authors these controls doesn't dilute them: the oracle-bearing checks from Section 1 are independent no matter who writes them, and the rest are simply handed to a different, equally cheap agent. The extra work this discipline demands is real — and it is absorbed by the same engine that created the demand, which is what turns the whole exchange into a net gain rather than a wash.

That is the whole wager, and IMHO it is a solid one. Unverified output was never productivity in the first place — it was simply a liability you hadn't noticed yet, and the faster you produced it, the larger it invisibly grew, until it was no longer invisible. The discipline converts raw non-deterministic capability into measurable outputs you can actually stand behind, which is the only kind worth having. Trust nothing the producer says about its own work; verify and measure everything with something it didn't write; and then go fast.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment