@ianphil
Last active March 5, 2026 14:05
Lean as an Executable Spec for Testing C# Code

Executive Summary

Write a spec for your cron scheduler. Model it in Lean as a pure state machine — simple, obviously correct. Prove the key invariants: one-shot jobs fire at most once, running jobs can't be scheduled concurrently. AI generates the production C# from the spec. Then FsCheck generates thousands of random event sequences — AddJob, TimerTick, Restart, ClockSkip — and runs each one through both the Lean model and the C# code, step by step. After every event, compare the state: which jobs fired, how many times, what's their status. If the C# scheduler says a one-shot job fired twice and the Lean model says that's impossible, the test fails and tells you exactly which event sequence caused the disagreement. The Lean proofs guarantee the model is correct. The property tests guarantee the C# matches.
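The event alphabet from that description can be sketched as a Lean inductive type. This is a hypothetical sketch — `Job`, `SchedulerState`, and the body of `step` are assumptions for illustration, not the actual model:

```lean
/-- Events the scheduler reacts to (names taken from the summary above). -/
inductive Event where
  | addJob    (job : Job)      -- register a new job
  | timerTick (now : Nat)      -- the clock advanced to `now`
  | restart                    -- process restarted mid-flight
  | clockSkip (delta : Nat)    -- wall clock jumped forward

/-- One step of the pure state machine: apply a single event. -/
def step (s : SchedulerState) (e : Event) : SchedulerState :=
  sorry -- fire due jobs, mark one-shots done, clear running flags on restart

/-- Replaying an event sequence is just a fold — this is exactly what the
    property tests run through both the Lean model and the C# scheduler. -/
def replay (s : SchedulerState) (es : List Event) : SchedulerState :=
  es.foldl step s
```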

The full solution — proving C# correct directly in Lean — requires formally modeling C#'s async runtime, null semantics, exception propagation, garbage collection, and the CLR memory model. The C# language spec alone is 800 pages of English prose with known ambiguities. Nobody has done this yet; comparable efforts for simpler languages (C, subsets of Rust, WebAssembly) took dedicated research teams years. That's a problem for PhD students. This document is for engineers shipping production code who want stronger guarantees than testing alone can provide, without waiting for the research to land.


A pragmatic approach to verified software: write a Lean specification, generate C# from it with AI, then use the Lean model as an executable spec — auto-generating property-based tests that run against the C# code, checking thousands of random inputs match the Lean model's output. Not a full equivalence proof, but far stronger than hand-written tests, and you get it almost for free.

Based on ideas from Leo de Moura's "AI Is Rewriting the World's Software."


The Problem

AI is generating code at unprecedented scale. Google and Microsoft report 25–30% of new code is AI-generated. Microsoft's CTO predicts 95% by 2030. But as de Moura observes:

"AI outsources not just the writing but the thinking."

Testing provides confidence. Proof provides a guarantee. But full formal verification of production C# is research-grade work — you'd need to model C#'s async runtime, null semantics, and exception behavior in Lean. That's not practical today for most teams.

The pragmatic middle ground: use Lean as a deterministic executable spec.

The Approach

┌────────────────────┐                      ┌────────────────────┐
│   Lean Model       │      compile to      │   Executable Spec  │
│   (verified,       │ ───────────────────► │   (native binary,  │
│    pure functions) │      executable      │    JSON over stdio)│
└────────────────────┘                      └──────────┬─────────┘
                                                       │
        ┌──────────────────────────────────────────────┤
        │ for each random input:                       │
        │   1. ask executable spec for expected        │
        │   2. run C# code for actual output           │
        │   3. assert they match                       │
        │                                              │
┌───────▼────────────┐                      ┌──────────▼─────────┐
│   C# Production    │  ◄─── compared ────  │ FsCheck / Hedgehog │
│   Code             │       against        │ Property Tests     │
└────────────────────┘                      └────────────────────┘

Step 1: Write the Lean spec

Model your domain logic as pure Lean functions. Keep it simple — the goal is obviously correct, not fast. De Moura's key insight:

"An inefficient program that is obviously correct can serve as its own specification."

Example — a scheduling rule:

/-- A job that is already running must not be scheduled again. -/
def canSchedule (job : Job) : Bool :=
  job.status == .enabled && !job.isRunning

/-- One-shot jobs execute at most maxRetries + 1 times. -/
def validRunCount (job : Job) (runs : Nat) : Bool :=
  if job.kind == .oneShot then runs ≤ job.maxRetries + 1
  else true
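For these definitions to elaborate, a `Job` structure along the following lines is assumed — a sketch whose field and constructor names are inferred from the functions above, not taken from a real project:

```lean
inductive JobStatus where
  | enabled | disabled
  deriving BEq

inductive JobKind where
  | oneShot | recurring
  deriving BEq

structure Job where
  status     : JobStatus
  kind       : JobKind
  isRunning  : Bool
  maxRetries : Nat
```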

Then prove properties about these functions. The proofs are checked by Lean's kernel — a few thousand lines of trusted code. No LLM judgment, no flaky tests, no ambiguity.

theorem oneshot_bounded (job : Job) (h : job.kind = .oneShot)
    (runs : Nat) (hvalid : validRunCount job runs = true) :
    runs ≤ job.maxRetries + 1 := by
  simp [validRunCount, h] at hvalid; exact hvalid
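The other invariant from the summary — a running job is never scheduled — falls out the same way. A sketch, assuming the `canSchedule` definition above; `simp` should close it by unfolding the definition and rewriting with the hypothesis:

```lean
theorem running_not_schedulable (job : Job) (h : job.isRunning = true) :
    canSchedule job = false := by
  simp [canSchedule, h]
```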

Step 2: Generate C# with AI

Give the AI agent both the natural language spec AND the Lean model. The Lean model removes the "I interpreted the spec differently" problem — there's no ambiguity in canSchedule. The AI writes real C# with all the production concerns (async, DI, persistence, error handling).

Step 3: Build the executable spec

Add a thin IO layer to your Lean project that accepts JSON over stdin and returns the model's output:

-- Spec.lean
def handleRequest (json : String) : IO String := do
  let input ← parseInput json   -- untrusted glue code
  let result := validate input   -- THIS is the verified function
  return toJson result

The executable spec is a native binary. The JSON parsing is untrusted glue — but that's fine. If the serialization has a bug, the property tests will catch it (Lean and C# will disagree on some input). The executable spec is self-checking.
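`handleRequest` processes a single request; wrapping it in a stdin loop produces the long-running process that the C# test wrapper talks to. A minimal sketch, assuming `handleRequest` as defined above (the loop exits cleanly on EOF or a blank line):

```lean
partial def specLoop : IO Unit := do
  let stdin  ← IO.getStdin
  let stdout ← IO.getStdout
  let line   ← stdin.getLine
  if line.trim.isEmpty then return   -- EOF or blank line: shut down
  stdout.putStrLn (← handleRequest line)
  stdout.flush                       -- one JSON response per request line
  specLoop

def main : IO Unit := specLoop
```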

Build it once: lake build. Ship the binary alongside your test project. No Lean toolchain needed on CI — just the compiled executable spec.

Step 4: Property-based testing in C#

Use FsCheck or Hedgehog to generate random inputs, run them through both the C# code and the executable spec, and assert the outputs match.

// NuGet: FsCheck.Xunit
[Property]
public Property CSharpMatchesExecutableSpec()
{
    return Prop.ForAll(ArbitraryInput(), input =>
    {
        var expected = ExecutableSpec.Evaluate(input);   // verified model
        var actual = ProductionCode.Evaluate(input);     // C# implementation

        Assert.Equal(expected.IsValid, actual.IsValid);
        Assert.Equal(expected.Errors, actual.Errors);
    });
}

FsCheck generates hundreds or thousands of inputs per test run. Same seed produces the same inputs — fully deterministic and reproducible. When it finds a disagreement, it shrinks the input to the minimal failing case.

The executable spec wrapper

A small C# class that manages the executable spec process:

public class ExecutableSpec : IDisposable
{
    private readonly Process _process;

    public ExecutableSpec(string specPath)
    {
        _process = Process.Start(new ProcessStartInfo
        {
            FileName = specPath,
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            UseShellExecute = false
        });
    }

    public SpecResult Evaluate(SpecInput input)
    {
        _process.StandardInput.WriteLine(JsonSerializer.Serialize(input));
        var line = _process.StandardOutput.ReadLine();
        return JsonSerializer.Deserialize<SpecResult>(line);
    }

    public void Dispose()
    {
        // Closing stdin sends EOF, letting the spec process exit cleanly.
        _process.StandardInput.Close();
        _process.WaitForExit();
        _process.Dispose();
    }
}

Start the executable spec once per test class. Pipe JSON over stdin/stdout. No overhead per test beyond serialization.

What This Catches

Property-based testing against a verified executable spec catches classes of bugs that hand-written tests structurally cannot:

| Bug class | Hand-written tests | Executable spec property tests |
|---|---|---|
| Known edge cases | ✅ If you thought of them | ✅ Plus thousands you didn't |
| Boundary conditions | ⚠️ Off-by-one if you're lucky | ✅ FsCheck tries boundaries automatically |
| Ordering/combination effects | ❌ Combinatorial explosion | ✅ Random combinations explored |
| Spec drift (code diverges from intent) | ❌ Tests drift too | ✅ Executable spec is the verified source of truth |
| Adversarial inputs | ❌ Unless you fuzz | ✅ Generators include adversarial patterns |

De Moura puts it directly:

"For any fixed testing strategy, a sufficiently adversarial system can overfit to it. A proof cannot be gamed."

The Lean proofs guarantee the executable spec is correct. The property tests guarantee the C# matches. The combination gives you near-proof-level assurance for the properties you specified, without modeling C#'s full semantics in Lean.

When to Use This vs. Other Approaches

| Approach | Best for | Deterministic? | Cost |
|---|---|---|---|
| Lean proofs | Mathematical invariants (all inputs) | ✅ | Compile-time |
| Executable spec property tests | C# matches verified model (random inputs) | ✅ (seeded) | Seconds per run |
| LLM-as-judge (spec-tests) | Fuzzy human intent ("user-friendly") | ❌ | API call per test |
| Hand-written xUnit | Known regressions, documentation | ✅ | Milliseconds |

These are complementary layers, not alternatives:

  • Lean proofs cover the properties that matter most — mathematically, for all inputs
  • Executable spec property tests bridge the Lean/C# gap with thousands of random samples
  • LLM-as-judge handles what math can't express ("is this error message helpful?")
  • Hand-written tests document known scenarios and serve as regression guards

Practical Considerations

Executable spec binary on CI. Compile the executable spec on a machine with the Lean toolchain, commit the binary (or publish as a build artifact). CI runs the C# property tests against it without needing Lean installed. It's a single static binary — no runtime dependencies.

Generator quality matters. FsCheck's built-in generators cover basic types well. For domain-specific types, write custom generators that include adversarial cases — the weird inputs that production code sees but developers don't think to test. The Lean model handles them all uniformly because it's defined over the full input space.

Shrinking finds minimal failures. When FsCheck discovers a disagreement, it automatically shrinks the input to the smallest case that still fails. Instead of "test failed with this 50-field object," you get "test failed when field X is false and field Y is true" — the exact boundary where C# and Lean diverge.

Start small. Pick one module where correctness matters most — validation, authorization, state machines, parsers. Write the Lean model, prove a few key properties, wire up the executable spec. Expand as the pattern proves its value.

The executable spec is self-checking. If the JSON serialization layer has a bug, the property tests will catch it — the C# and Lean will disagree on some input. You don't need to verify the IO code. You only need to verify the pure function it wraps, and Lean's kernel already did that.

The Bigger Picture

De Moura's article argues that as AI generates more code, the verification gap widens:

"Engineers stop understanding what their systems do. AI outsources not just the writing but the thinking."

The executable spec approach addresses this directly. The Lean spec IS the understanding — a precise, machine-checked definition of what the system must do. When AI generates the C# implementation, the executable spec tests verify that the AI got it right. When someone changes the C# code later, the executable spec catches drift immediately. When requirements change, you update the Lean model first, the proofs tell you what broke, and the property tests tell you where the C# diverged.

The spec becomes the source of truth. The C# becomes the implementation detail. The executable spec is the bridge.


References

  • Leo de Moura — AI Is Rewriting the World's Software