Created
May 6, 2026 16:53
# Conformance suites: what they are, when they help, how to build one

A conformance suite is a third axis of testing — distinct from unit tests and integration tests. Where unit tests prove that a function does what its body says, and integration tests prove that subsystems compose, a conformance suite proves that **what the public API claims it does matches what it actually does, against an external ground truth**.

This document explains the pattern, when it earns its keep, and how to set one up. The case study is ferrotorch (a pure-Rust PyTorch reimplementation), but the pattern generalizes to any project that claims behavioral parity with a reference.
---

## The problem conformance suites solve

The most common failure mode in a reimplementation project is "tests pass, ship it" — where:

- Unit tests cover the parts of the code the author thought to test.
- Integration tests cover the workflows the author thought to exercise.
- Doctests show that the example in the docstring compiles and runs.
- The README says "PyTorch parity for op X."

And then a downstream user runs `op X` with a real input and gets a result that differs from PyTorch by 5 ULP, or silently runs on CPU when they explicitly put the tensor on GPU, or panics on an edge case PyTorch handles cleanly. The author's tests didn't catch it because the author was checking *internal correctness*, not *contract fidelity*.

Conformance suites close this gap by encoding the contract — "this op behaves the same as the reference library's op for the same inputs" — as a mechanically checkable assertion. The reference library is the ground truth; the suite proves the implementation matches.
---

## What a conformance suite *is not*

It's worth being precise, because conformance suites overlap with several things they're not.

| Pattern | What it proves | Conformance is different because |
|---|---|---|
| **Unit tests** | This function returns what its body computes | Conformance proves the function matches an *external* spec, not its own implementation |
| **Integration tests** | These subsystems compose correctly | Conformance is per-public-item; integration is workflow-level |
| **Property-based tests** | Algebraic invariants hold (`add(x, 0) == x`) | Conformance can use property tests, but the assertion is "matches reference," not "satisfies invariant" |
| **Doctests** | The documented example compiles and runs | Doctests are author-authored; conformance is reference-authored |
| **Snapshot tests** | Output matches a previous run's output | Conformance compares to an *external* reference, not a previous self-run |
| **Fuzzing** | No crashes on random inputs | Conformance uses curated inputs with known reference outputs |
| **Benchmarks** | This is fast | Conformance is correctness-only |

You can have all of these and still ship a "PyTorch parity for op X" claim that isn't true. Conformance is the assertion that closes that specific gap.
---

## The four-layer architecture

A conformance suite has four mechanical layers. They're independent enough to build in any order, but the dependencies flow downward:

```
+----------------------------------+
| Layer 4 — Strict coverage gate   |  ← CI fails if a public item
|                                  |    lacks a test reference
+----------------------------------+
                 ↑ refers to
+----------------------------------+
| Layer 3 — Conformance tests      |  ← Per-op test functions that
|                                  |    load fixtures and assert
+----------------------------------+
                 ↑ loads
+----------------------------------+
| Layer 2 — Reference fixtures     |  ← JSON / .npz files committed
|                                  |    to the repo, generated by a
|                                  |    script that calls the reference
+----------------------------------+
                 ↑ derives from
+----------------------------------+
| Layer 1 — Surface inventory      |  ← List of every `pub` item the
|                                  |    project claims is its API
+----------------------------------+
```
### Layer 1 — Surface inventory

A list of every public item your project exposes — the "denominator" for coverage. Built once, regenerated whenever the public surface changes. In Rust, this can be generated by a `syn`-based parser that walks all source files and collects every `pub fn` / `pub struct` / `pub trait` / public method with its module path and signature.

The inventory is committed to the repo. PRs that change the public surface produce a clean JSON diff that the reviewer can check against the test-coverage diff.
```jsonc
// Example: tests/conformance/_surface.json
{
  "items": [
    {
      "path": "mylib::tensor::add",
      "kind": "fn",
      "signature": "fn add<T: Float>(a: &Tensor<T>, b: &Tensor<T>) -> Result<Tensor<T>>"
    },
    {
      "path": "mylib::tensor::Tensor::reshape",
      "kind": "method",
      "signature": "fn reshape(&self, shape: &[usize]) -> Result<Tensor<T>>"
    }
    // ... thousands of items
  ]
}
```
### Layer 2 — Reference fixtures

Generated offline by a script that imports the reference library and records `(input, expected_output)` pairs for each op. Curated input sets per op cover normal cases plus edge cases (empty input, NaN/Inf boundaries, broadcast shapes, etc.).

Fixtures are committed to the repo. CI never invokes the reference library; the JSON files are the source of truth.
```python
# Example: scripts/regenerate_tensor_fixtures.py
import json

import torch

torch.manual_seed(0)  # deterministic fixtures across regeneration runs

# curated edge inputs: signed zero, NaN/Inf boundaries, near-underflow
EDGE_VALUES = [0.0, -0.0, float("inf"), float("-inf"), float("nan"), 1e-38]

fixtures = []
for op_name in ["add", "mul", "sub", "div"]:
    op = getattr(torch, op_name)
    for shape in [[3], [3, 4], [2, 3, 4]]:
        a = torch.randn(*shape)
        b = torch.randn(*shape)
        fixtures.append({
            "op": op_name,
            "input_a": a.tolist(),
            "input_b": b.tolist(),
            "shape": shape,
            "expected": op(a, b).tolist(),
        })
    a = torch.tensor(EDGE_VALUES)
    b = torch.ones(len(EDGE_VALUES))
    fixtures.append({
        "op": op_name,
        "input_a": a.tolist(),
        "input_b": b.tolist(),
        "shape": [len(EDGE_VALUES)],
        "expected": op(a, b).tolist(),
    })

# note: json.dump emits NaN/Infinity literals; the fixture loader must accept them
with open("tests/conformance/fixtures/tensor.json", "w") as f:
    json.dump({"version": torch.__version__, "fixtures": fixtures}, f)
```
The script's pinned reference-library version is recorded in the fixture metadata. When the reference library updates, you re-run the script and re-commit the fixtures.
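A loader-side guard makes the pin enforceable rather than advisory. A minimal sketch — the `EXPECTED_REFERENCE_VERSION` constant and the helper name are illustrative, not ferrotorch's actual code:

```python
import json

# Hypothetical guard: fail fast when the committed fixtures were generated
# by a different reference version than the suite is pinned to.
EXPECTED_REFERENCE_VERSION = "2.11.0+cu130"

def load_fixture_file(path: str) -> list:
    with open(path) as f:
        blob = json.load(f)
    if blob["version"] != EXPECTED_REFERENCE_VERSION:
        raise RuntimeError(
            f"fixture drift: generated with {blob['version']}, "
            f"suite pinned to {EXPECTED_REFERENCE_VERSION}; re-run the regen script"
        )
    return blob["fixtures"]
```

This turns "someone forgot to regenerate fixtures after a bump" into an immediate, explicit failure instead of a silent mismatch.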
### Layer 3 — Conformance tests

Rust integration tests that load fixtures and assert. Tolerance comes from a per-op-category table (bit-exact for indexing, 1 ULP for f32 elementwise, 1e-6 absolute for f32 reductions, etc.).
```rust
// Example: tests/conformance_tensor.rs
mod tolerance; // shared helpers, e.g. tests/tolerance.rs
use tolerance::*;

use mylib::Tensor;

#[test]
fn add_matches_reference() {
    let fixtures = load_fixtures("tensor.json");
    for fixture in fixtures.iter().filter(|f| f.op == "add") {
        let a = Tensor::from_vec(&fixture.input_a, &fixture.shape).unwrap();
        let b = Tensor::from_vec(&fixture.input_b, &fixture.shape).unwrap();
        let actual = mylib::add(&a, &b).unwrap();
        assert_close_f32_cpu(&actual.data(), &fixture.expected);
    }
}
```
For projects with GPU paths, the same test runs on GPU with looser tolerance. For projects with autograd, the test additionally runs the backward pass and compares gradients.
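For the backward comparison, a reference-free companion check that often pays off is a central finite-difference estimate of the gradient. A minimal scalar sketch in Python — illustrative only, not ferrotorch code:

```python
def numerical_grad(f, x: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

def square(x: float) -> float:
    return x * x

# The analytic gradient of x^2 at x = 3 is 2x = 6; the estimate should agree.
assert abs(numerical_grad(square, 3.0) - 6.0) < 1e-4
```

This catches gradients that are wrong against *both* the reference and calculus, independent of any fixture.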
### Layer 4 — Strict coverage gate

A test that cross-references the surface inventory against the conformance test files. It fails if any public item lacks a test reference. Items that legitimately can't be tested go into an exclusion file with a written reason and a tracking-issue number.
```rust
// Example: tests/conformance_surface_coverage.rs
#[test]
fn every_public_item_has_a_conformance_reference() {
    let surface: Vec<Item> = load_surface_json();
    let exclusions: Vec<Exclusion> = load_exclusions_toml();
    let test_text = read_all_conformance_test_files();
    let uncovered: Vec<_> = surface.iter()
        .filter(|item| !test_text.contains(&item.path))
        .filter(|item| !exclusions.iter().any(|e| e.path == item.path))
        .collect();
    assert!(uncovered.is_empty(),
        "{} public items lack a test reference: {:?}",
        uncovered.len(), uncovered);
}
```
The gate is **strict from day 1**. Every new `pub fn` added to the project must be referenced in a conformance test or added to the exclusions with a tracking issue. Without this strictness, coverage drifts.
---

## What programs benefit

Not every project earns its keep with a conformance suite. The heuristic: **does your project make a behavioral-parity claim?**

### Strong fit

- **Reimplementations of an existing library.** ferrotorch (PyTorch), candle and burn (PyTorch), polars (pandas-style DataFrames), arrow-rs (Arrow C++), babashka (Clojure). The reference library is the ground truth; conformance proves you match.
- **Cross-language ports.** A Python library reimplemented in Go; a C++ library reimplemented in Rust; a JavaScript library reimplemented in WebAssembly. The original is the spec.
- **Compatibility shims.** A library that claims "drop-in replacement for X" — every public item must behave identically.
- **Multi-backend implementations.** A library with CPU + GPU + TPU backends, where each backend must produce equivalent outputs. Conformance proves cross-backend agreement.
- **Standards implementations.** A library implementing a published spec (HTTP/3, MessagePack, JSON Schema, GraphQL, RFC 6901). The spec's test vectors or reference implementation are the ground truth.
- **Data-format converters.** A library that round-trips a format (Parquet, ORC, Arrow IPC, ONNX). Conformance: encoding then decoding produces bit-identical output.
- **Numerical / scientific libraries.** Anything claiming "matches numpy," "matches scipy," or "matches octave." Numerical drift is invisible to unit tests; conformance catches it.
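For the data-format bullet above, strict bit-identity is sometimes too strong: field order, whitespace, or optional encodings can legitimately differ across a round-trip. A common workaround is to canonicalize both sides before comparing. A minimal JSON sketch, where the `round_trip` stand-in merely re-serializes in place of a real encode-then-decode:

```python
import json

def canonicalize(doc: str) -> str:
    """Canonical JSON form: sorted keys, fixed separators, no extra whitespace."""
    return json.dumps(json.loads(doc), sort_keys=True, separators=(",", ":"))

def round_trip(doc: str) -> str:
    # stand-in for encode-then-decode through the implementation under test;
    # re-serializing already perturbs the whitespace, like a real converter might
    return json.dumps(json.loads(doc))

original = '{"b":1,"a":[1,2]}'
assert round_trip(original) != original                      # raw bytes differ
assert canonicalize(round_trip(original)) == canonicalize(original)  # content identical
```

The canonicalized comparison is still exact (bit-identical after canonicalization), so it keeps the rigor of the bit-exact check without failing on representational freedom.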
### Weak fit

- **Greenfield libraries with no reference.** There is nothing to be conformant against. Property-based testing serves the same purpose (algebraic invariants are the spec).
- **UI / interactive applications.** Behavior is human-perceived and hard to encode as a fixture-vs-output assertion.
- **Network services with stateful protocols.** Conformance can apply to the wire format but not to long-lived session state — that needs integration tests.
- **DSLs and macro-heavy code.** The "public surface" is the macros themselves; conformance applies to what the macros expand to, which is fluid.
### Mixed fit

- **General-purpose application code.** There is no external spec, but if you make a "stable API" claim, conformance against your own historical behavior (snapshot-style) catches regressions.
- **Compilers.** Conformance against a published language spec works (C, Rust); conformance against a competing compiler is fragile (their bugs are not your spec).
---

## What conformance suites produce

The visible output is "tests pass." The invisible output is the **bug-finding rate**.

In ferrotorch's conformance work, across 7 phases covering ~340 conformance tests:

- **Phase 2.0 creation** (foundational, simple ops): 0 cascade bugs
- **Phase 2.1 elementwise** (arithmetic): 3 GPU bugs surfaced (PTX JIT failure, missing kernel, broadcast misalignment)
- **Phase 2.2 reductions**: 4 GPU bugs (silent CPU detour, missing scale, uninitialized buffer, polynomial residual)
- **Phase 2.5 activations**: 8 GPU/CPU bugs (4 polynomial accuracy gaps, 1 GPU autograd save-state issue, 1 PTX JIT, 1 backward grad delta, 1 forward divergence)
- **Phase 2.4 linalg**: 2 GPU bugs (matmul f64 forward routing asymmetry, dispatch limited to 2D×2D)
- ... and so on.

Net: **~30 latent bugs caught in code that already had unit tests, doctests, and "PyTorch parity" claims in the README.** None of these would have surfaced through any other testing pattern. Each one is a real defect that a downstream user would have hit.
The pattern that emerges from running a conformance suite over a mature codebase: **bugs cluster around silent fallbacks**. The most common shapes:

1. **Silent CPU detour**: a function claims to support GPU but calls `.data()?` internally, demoting to CPU. The output is correct but performance is silently terrible. Conformance catches this when the test asserts the output tensor is still on GPU.
2. **Routing asymmetry**: the forward path is `f32`-only; the backward path correctly branches on dtype. Forward + backward together fails for `f64`. The unit test for the forward passed because it was `f32`-only; the unit test for the backward passed because it ran in isolation. Conformance catches this when the test runs `forward(f64)` and then asserts.
3. **Verification-debt clusters**: a fix applies the same patch to N sibling kernels but only the original one has a probe. Conformance catches the unverified siblings the next time their tests run.
4. **Stub residue**: a public function returns `Err(NotImplementedOnCuda)` despite the README claiming GPU support. Unit tests pass (they hit the CPU path); conformance surfaces the gap when the test runs the GPU lane.
5. **Numerical drift**: a polynomial approximation produces 5e-7 error against PyTorch's libdevice-backed call. Unit tests use self-comparison, so they don't catch it; conformance does.
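The silent-CPU-detour shape (item 1) is worth a concrete illustration. With a hypothetical `Tensor` stand-in that carries a device tag, the conformance test asserts placement as well as values:

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    """Hypothetical stand-in; real tensor types carry a device tag similarly."""
    data: list
    device: str

def add(a: Tensor, b: Tensor) -> Tensor:
    # a correct kernel keeps the result on the inputs' device; a silent CPU
    # detour would return device="cpu" here while the data stayed correct
    return Tensor([x + y for x, y in zip(a.data, b.data)], a.device)

a = Tensor([1.0, 2.0], device="cuda:0")
b = Tensor([3.0, 4.0], device="cuda:0")
out = add(a, b)
assert out.data == [4.0, 6.0]                        # value conformance
assert out.device == "cuda:0", "silent CPU detour"   # placement conformance
```

A value-only assertion would pass either way; the device assertion is what makes the detour visible.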
---

## How to set one up

### Step 0 — Pick the reference library and pin its version

Decide which library is your conformance ground truth. Pin to a specific version. Record the pin in the fixture metadata so the suite can detect drift.

For ferrotorch this was `torch==2.11.0+cu130`. For a numpy reimplementation it might be `numpy==2.4.4`. For a JSON parser it might be the published RFC test vectors.
### Step 1 — Build the surface inventory tool

Either rustdoc-based (`cargo +nightly rustdoc -- -Z unstable-options --output-format json`) or syn-based (parse `src/**/*.rs` with `syn` 2). Both work. Output a sorted, deterministic JSON file that lists every public item.

Commit the inventory file. PRs that change the surface produce a diffable JSON change.
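Determinism is what keeps the diff clean: the same surface must serialize to the same bytes regardless of traversal order. A sketch of the serialization step (the function name and item schema are illustrative):

```python
import json

def write_inventory(items: list, path: str) -> None:
    """Serialize the surface inventory byte-stably: items sorted by
    (path, kind), keys sorted, fixed indentation, one trailing newline."""
    ordered = sorted(items, key=lambda it: (it["path"], it["kind"]))
    with open(path, "w") as f:
        json.dump({"items": ordered}, f, indent=2, sort_keys=True)
        f.write("\n")  # exactly one trailing newline keeps diffs minimal
```

With this, two regeneration runs over the same codebase produce byte-identical files, so any diff in a PR is a real surface change.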
### Step 2 — Decide tolerance categories

For numerical libraries, write down the tolerance per op category in a table:

| Category | CPU tolerance | GPU tolerance | Reason |
|---|---|---|---|
| Indexing/slicing/shape | bit-exact | bit-exact | No arithmetic |
| f32 elementwise | 1 ULP | 1 ULP | IEEE 754 deterministic |
| f32 reductions | 1e-6 abs | 1e-5 abs | GPU reduces in tree order |
| f32 transcendentals | 1e-5 rel | 1e-4 rel | GPU approximation instructions are looser |
| Matmul (f32) | 1e-4 rel | 1e-3 rel | O(n³) error amplification |
| RNG | distribution moments only | distribution moments only | Different RNGs |

For non-numerical libraries (string/JSON/format converters), everything is bit-exact.

The table goes in a shared helper file (`tests/conformance/_tolerance.rs`) and gets reused across phases.
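The "1 ULP" rows can be checked at the bit level rather than with an epsilon. A Python sketch of the distance function — a Rust helper would use `f32::to_bits` the same way; NaN handling is omitted:

```python
import struct

def ulp_distance_f32(a: float, b: float) -> int:
    """Number of representable f32 values between a and b (NaNs not handled)."""
    def ordered_bits(x: float) -> int:
        # round to f32, then map the bit pattern to an integer that increases
        # monotonically over the whole ordered float range (negatives flipped)
        i = struct.unpack("<i", struct.pack("<f", x))[0]
        return i if i >= 0 else -2147483648 - i
    return abs(ordered_bits(a) - ordered_bits(b))

assert ulp_distance_f32(1.0, 1.0) == 0
assert ulp_distance_f32(1.0, 1.0 + 2 ** -23) == 1   # adjacent f32 values
assert ulp_distance_f32(0.0, 2.0 ** -149) == 1      # smallest f32 subnormal
```

Unlike an epsilon, the ULP distance is scale-free: "within 1 ULP" means the same thing at 1e-30 as at 1e+30.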
### Step 3 — Author one phase as a proof-of-pattern

Pick one module — ideally a small one — and write its full conformance suite: Layer 2 (regen script), Layer 3 (tests), and Layer 4 (gate). Get it green. Now you've proven the pattern works for your project.
### Step 4 — Sweep the rest

For a medium codebase (~50-100 public items), one phase might be enough. For a large codebase (1000+ public items), break into phases by module. Each phase is its own dispatch — author the regen script, the test file, and the exclusion changes.

If you can run subagent dispatches in parallel (Claude Code or similar tooling), 4-5 phases at once is a good batch. The constraint is the shared exclusion file — give each subagent explicit "don't edit this file" instructions and apply the consolidated changes at the end.
### Step 5 — Triage the bug cascade

You will surface bugs — often many of them. Don't try to fix them inline; that's a different work mode. File each as a tracking issue, add a per-test `cascade_skip()` referencing the issue number, and continue building the suite.

When the suite is complete, you have a structured queue of real bugs to fix. Each bug-fix dispatch flips a skip back to a live assertion — a clean unit of work.
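The skip-with-tracking-issue mechanic can be as small as a decorator. A pytest-free Python sketch — `OPEN_CASCADE_ISSUES` would be a committed file in practice, and all names here are illustrative:

```python
import unittest

# In practice this set would be loaded from a committed file of open issues.
OPEN_CASCADE_ISSUES = {1423}

def cascade_skip(issue: int):
    """Skip the decorated test while `issue` is open; it runs live once closed."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            if issue in OPEN_CASCADE_ISSUES:
                raise unittest.SkipTest(f"cascade bug, tracking issue #{issue}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@cascade_skip(issue=1423)
def test_softmax_matches_reference():
    return "ran"  # the live assertion body goes here
```

Because the skip names the issue, closing the issue and deleting one line re-arms the assertion — nothing is silently lost.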
### Step 6 — Make the gate strict in CI

The strict coverage gate is the lock-in. Once it's in CI, no public item can be added without either (a) a conformance test reference or (b) an exclusion entry with a tracking issue. The cost of keeping coverage current is paid at PR time, not at "we should write more tests" time.
---

## Common objections

**"This is just unit testing with extra steps."**
No — unit testing proves the implementation matches its own logic. Conformance proves the implementation matches an external spec. They test orthogonal claims.

**"Tolerance comparisons are fragile."**
Yes, in two ways. (1) Tolerances drift across reference-library versions. Mitigation: pin the version, record it in the fixture metadata, and re-run when bumping. (2) GPU and CPU tolerances differ — bake that into the tolerance table from the start.
**"The fixture files are huge."**
Yes. Each fixture is small, but they accumulate; ferrotorch's fixtures total ~50 MB across 7 phases. The trade-off is a large repo versus running Python in CI, and most teams prefer the large repo. Git LFS is an option if it gets unwieldy.
**"What about RNG ops? They can't be bit-exact."**
Compare distribution moments (mean, variance, range) instead of raw values. The conformance assertion is "samples have the same distribution," not "samples have the same bits." Tolerances are larger (~3-5% at n=10000) but still meaningful.
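A moment-based assertion is short. A sketch using the standard library, with illustrative thresholds:

```python
import random
import statistics

def assert_moments_close(samples, ref_mean, ref_var, tol=0.05):
    """RNG conformance: same distribution moments, not same bits."""
    assert abs(statistics.fmean(samples) - ref_mean) <= tol * max(1.0, abs(ref_mean))
    assert abs(statistics.pvariance(samples) - ref_var) <= tol * max(1.0, abs(ref_var))

rng = random.Random(0)  # seeded so the check itself is reproducible
samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
assert_moments_close(samples, ref_mean=0.0, ref_var=1.0)
```

Seeding the sampler makes the conformance test itself deterministic even though the op under test is stochastic.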
**"The reference library is buggy."**
Sometimes you'll find your implementation is *more correct* than the reference. ferrotorch caught one such case: hardsigmoid f64 backward, where ferrotorch returns the true 1/6 in f64 while PyTorch returns the f32-rounded value. File a tracking issue noting "ferrotorch diverges from reference; reference is wrong here," and either match the reference's wrong behavior (if parity is the contract) or document the divergence in the test (if correctness is the contract). Filing the bug upstream against the reference library is good stewardship, too.
**"Can't I just run the reference library in CI?"**
You can, but: (a) Python-in-Rust-CI doubles build complexity, (b) reference-library updates silently change the tests, and (c) GPU reference runs require GPU CI. Captured fixtures avoid all three.
---

## Anti-patterns to avoid

- **Generic exclusion entries.** "Tested by something somewhere" is not a reason. Each exclusion needs a specific path-and-test reference.
- **Tolerance weakening to silence a failing test.** The right response to a tolerance miss is "file a bug, skip the test with the tracking issue, move on" — not "raise the tolerance until it passes."
- **Skipping the strict gate "until coverage is up."** The gate is what makes coverage stay up; without it, every PR adds a little more drift.
- **Phantom tests for backward structs / type aliases / re-exports.** These are usually covered transitively. Use exclusion entries with "implicit coverage" reasons instead of authoring tests that don't prove anything.
- **Same-library snapshot tests dressed as conformance.** If you're comparing the implementation to its own previous output, that's a snapshot test, not conformance. Conformance requires an external reference.
- **Conformance against an unstable reference.** If the reference library changes weekly, your fixtures go stale weekly. Pin the reference version conservatively.
---

## Maintenance over time

Once the suite is in place:

1. **Reference-library updates** (e.g., torch 2.11 → 2.12): re-run the regen scripts, commit the new fixtures, and run the suite. Failures are either reference-library changes ferrotorch should match (file an issue) or ferrotorch bugs the new fixtures surfaced (also file an issue).
2. **New public items**: the strict gate catches them at PR time and requires authors to add a conformance test reference (or an exclusion) with their PR.
3. **Bug fixes**: when a tracked cascade bug is fixed, the test's `cascade_skip()` reference becomes a live assertion. Verify the fix produces the expected output, then remove the skip.
4. **Tolerance reviews**: if a tolerance is consistently passing with wide margin or consistently failing without real bugs, the tolerance is probably wrong. Audit the table periodically.
5. **Fixture audits**: fixtures grow over time as edge cases get added. Periodically prune redundant cases.

The suite is a living artifact, not a one-time investment.
---

## When NOT to use this pattern

- **Throwaway code.** The investment isn't recouped.
- **Code where "correct" is defined by humans, not a reference.** E.g., UI rendering, linting, autocompletion ranking. There is no external spec to be conformant against.
- **Code where the reference is itself the system under test.** The pattern doesn't apply (no oracle).
- **Pre-1.0 libraries with rapidly evolving APIs.** The suite invests in the public surface; if the surface is in flux, the investment evaporates each release.

For these cases, property-based testing, snapshot testing, or human-curated regression tests serve the role better.
---

## Summary

Conformance suites prove that what your library claims to do matches what an external reference does. They catch a class of bug that no other testing pattern finds: silent fallbacks, routing asymmetries, numerical drift, and stub residue.

The four-layer architecture — surface inventory → reference fixtures → conformance tests → strict coverage gate — is mechanical enough to build incrementally. Strong fit: any project making a behavioral-parity claim. Weak fit: greenfield code with no oracle.

The pattern's value scales with the size of the public API and the strength of the parity claim. For a small library with a "compatible with X" claim, a conformance suite is overkill. For a large library with a "drop-in replacement for X" claim, it's the only way to keep the claim honest.

Build it once, maintain it incrementally, file the bugs it surfaces, and fix them as separate work. The bugs are the value.
---

Some notes in case it's helpful:

**"The reference library is buggy."**
I would add a note about filing an actual bug against the reference library. This is good software stewardship and helps the ecosystem as a whole. For the agent, the instruction might be to include a note about filing a bug against the reference in the description of the issue.

Also, for a complicated project there are likely to be cases where the reference library is not exactly buggy but under-specified. A combination of inputs may be undocumented and have some reasonable but unpredicted behavior. Deciding whether to copy this or mark it with an issue is important. Basically, there are more cases than "reference is right" / "we are right."

**Data-format converters.**
"Conformance: encoding then decoding produces bit-identical output." This would be good if true, but often a round-trip conversion cannot be bit-identical yet is functionally identical. A trick that helps is to define a canonicalization function from each format to itself. Then you assert that the canonicalized outputs of encode-then-decode are bit-identical.

**Performance conformance.**
There is also a "performance conformance" type of testing. Sometimes the unit test and integration test show correct results taking the right path, but the public symbol accidentally hits a slow fallback and is slow (and otherwise undetectable). It looks like some of these were caught because of CPU/GPU placement issues, but often everything is on the CPU and just hitting different branches. Performance conformance is probably a good extra step beyond what's here.

**Lifetimes, memory.**
Another aspect I didn't see is lifetimes, ownership, and memory allocation. This is a classic problem for C API libraries. The Rust interface will typically be OK, but C assumptions about lifetimes typically live in documentation and are poorly tested in practice. This can be a tricky part of conformance testing between languages.