# Conformance suites: what they are, when they help, how to build one
A conformance suite is a third axis of testing — distinct from unit tests
and integration tests. Where unit tests prove that a function does what
its body says, and integration tests prove that subsystems compose, a
conformance suite proves that **what the public API claims it does
matches what it actually does, against an external ground truth**.
This document explains the pattern, when it earns its keep, and how to
set one up. The case study is ferrotorch (a pure-Rust PyTorch
reimplementation), but the pattern generalizes to any project that
claims behavioral parity with a reference.
---
## The problem conformance suites solve
The most common failure mode in a reimplementation project is "tests
pass, ship it" — where:
- Unit tests cover the parts of the code the author thought to test.
- Integration tests cover the workflows the author thought to exercise.
- Doctests show that the example in the docstring compiles and runs.
- The README says "PyTorch parity for op X."
And then a downstream user runs `op X` with a real input and gets a
result that differs from PyTorch by 5 ULP, or silently runs on CPU
when they explicitly put the tensor on GPU, or panics on an edge case
PyTorch handles cleanly. The author's tests didn't catch it because
the author was checking *internal correctness*, not *contract fidelity*.
Conformance suites close this gap by encoding the contract — "this op
behaves the same as the reference library's op for the same inputs" —
as a mechanically-checkable assertion. The reference library is the
ground truth; the suite proves the implementation matches.
---
## What a conformance suite *is not*
It's worth being precise, because conformance suites overlap with
several things they're not.
| Pattern | What it proves | Conformance is different because |
|---|---|---|
| **Unit tests** | This function returns what its body computes | Conformance proves the function matches an *external* spec, not its own implementation |
| **Integration tests** | These subsystems compose correctly | Conformance is per-public-item; integration is workflow-level |
| **Property-based tests** | Algebraic invariants hold (`add(x, 0) == x`) | Conformance can use property tests, but the assertion is "matches reference" not "satisfies invariant" |
| **Doctests** | The documented example compiles and runs | Doctests are author-authored; conformance is reference-authored |
| **Snapshot tests** | Output matches a previous run's output | Conformance compares to an *external* reference, not a previous self-run |
| **Fuzzing** | No crashes on random inputs | Conformance is curated inputs with known reference outputs |
| **Benchmarks** | This is fast | Conformance is correctness-only |
You can have all of these and still ship a "PyTorch parity for op X"
claim that isn't true. Conformance is the assertion that closes that
specific gap.
---
## The four-layer architecture
A conformance suite has four mechanical layers. They're independent
enough to build in any order, but the dependencies flow downward:
```
+----------------------------------+
| Layer 4 — Strict coverage gate   |   ← CI fails if a public item
|                                  |     lacks a test reference
+----------------------------------+
              ↑ refers to
+----------------------------------+
| Layer 3 — Conformance tests      |   ← Per-op test functions that
|                                  |     load fixtures and assert
+----------------------------------+
              ↑ loads
+----------------------------------+
| Layer 2 — Reference fixtures     |   ← JSON / .npz files committed
|                                  |     to the repo, generated by a
|                                  |     script that calls the reference
+----------------------------------+
              ↑ derives from
+----------------------------------+
| Layer 1 — Surface inventory      |   ← List of every `pub` item the
|                                  |     project claims is its API
+----------------------------------+
```
### Layer 1 — Surface inventory
A list of every public item your project exposes. The "denominator"
for coverage. Built once, regenerated whenever the public surface
changes. In Rust, this is generated by a `syn`-based parser that walks
all source files and collects every `pub fn` / `pub struct` /
`pub trait` / `pub method` with module path + signature.
The inventory is committed to the repo. PRs that change the public
surface produce a clean JSON diff that the reviewer can check against
the test coverage diff.
```jsonc
// Example: tests/conformance/_surface.json
{
  "items": [
    {
      "path": "mylib::tensor::add",
      "kind": "fn",
      "signature": "fn add<T: Float>(a: &Tensor<T>, b: &Tensor<T>) -> Result<Tensor<T>>"
    },
    {
      "path": "mylib::tensor::Tensor::reshape",
      "kind": "method",
      "signature": "fn reshape(&self, shape: &[usize]) -> Result<Tensor<T>>"
    }
    // ... thousands of items
  ]
}
```
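A minimal sketch of the syn-based walk (assuming `syn` with its `full` feature plus `quote` for rendering signatures; a real tool also recurses into `mod` and `impl` blocks to pick up methods):

```rust
// Sketch only: collects top-level `pub fn` items from one source file.
use std::{fs, path::Path};
use syn::{Item, Visibility};

fn collect_public_fns(file: &Path, module_path: &str, out: &mut Vec<(String, String)>) {
    let source = fs::read_to_string(file).expect("readable source file");
    let ast = syn::parse_file(&source).expect("valid Rust source");
    for item in ast.items {
        if let Item::Fn(f) = item {
            if matches!(f.vis, Visibility::Public(_)) {
                let sig = &f.sig;
                out.push((
                    format!("{module_path}::{}", sig.ident), // e.g. "mylib::tensor::add"
                    quote::quote!(#sig).to_string(),         // signature rendered as text
                ));
            }
        }
    }
}
```

Sorting the collected items before serializing keeps the committed JSON deterministic, which is what makes the PR diff reviewable.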
### Layer 2 — Reference fixtures
Generated offline by a script that imports the reference library and
records `(input, expected_output)` pairs for each op. Curated input
sets per op cover normal cases + edge cases (empty input, NaN/Inf
boundaries, broadcast shapes, etc.).
Fixtures are committed to the repo. CI never invokes the reference
library; the JSON files are the source of truth.
```python
# Example: scripts/regenerate_tensor_fixtures.py
import torch, json
torch.manual_seed(0)  # seed so regenerated fixtures are reproducible

fixtures = []
for op_name in ["add", "mul", "sub", "div"]:
    op = getattr(torch, op_name)
    for shape in [[3], [3, 4], [2, 3, 4]]:
        a = torch.randn(*shape)
        b = torch.randn(*shape)
        result = op(a, b)
        fixtures.append({
            "op": op_name,
            # store flat values + shape so the Rust side can rebuild the tensor
            "input_a": a.flatten().tolist(),
            "input_b": b.flatten().tolist(),
            "shape": shape,
            "expected": result.flatten().tolist(),
        })

with open("tests/conformance/fixtures/tensor.json", "w") as f:
    json.dump({"version": torch.__version__, "fixtures": fixtures}, f)
```
The script's pinned reference library version is recorded in the
fixture metadata. When the reference library updates, you re-run the
script and re-commit the fixtures.
### Layer 3 — Conformance tests
Rust integration tests that load fixtures and assert. Tolerance comes
from a per-op-category table (bit-exact for indexing, 1 ULP for f32
elementwise, 1e-6 abs for f32 reductions, etc.).
```rust
// Example: tests/conformance_tensor.rs
#[path = "conformance/_tolerance.rs"]
mod tolerance;              // shared tolerance + fixture-loading helpers
use mylib::Tensor;
use tolerance::*;

#[test]
fn add_matches_reference() {
    let fixtures = load_fixtures("tensor.json");
    for fixture in fixtures.iter().filter(|f| f.op == "add") {
        let a = Tensor::from_vec(&fixture.input_a, &fixture.shape).unwrap();
        let b = Tensor::from_vec(&fixture.input_b, &fixture.shape).unwrap();
        let actual = mylib::add(&a, &b).unwrap();
        assert_close_f32_cpu(&actual.data(), &fixture.expected);
    }
}
```
For projects with GPU paths, the same test runs on GPU with looser
tolerance. For projects with autograd, the test additionally runs the
backward pass and compares gradients.
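The `load_fixtures` helper can be a thin serde wrapper over the committed JSON; a minimal sketch, with field names assumed to mirror the regen script above:

```rust
// Hypothetical loader living alongside the tolerance helpers.
use serde::Deserialize;

#[derive(Deserialize)]
pub struct FixtureFile {
    pub version: String,       // pinned reference-library version the fixtures came from
    pub fixtures: Vec<Fixture>,
}

#[derive(Deserialize)]
pub struct Fixture {
    pub op: String,
    pub input_a: Vec<f32>,
    pub input_b: Vec<f32>,
    pub shape: Vec<usize>,
    pub expected: Vec<f32>,
}

pub fn load_fixtures(name: &str) -> Vec<Fixture> {
    let path = format!("tests/conformance/fixtures/{name}");
    let text = std::fs::read_to_string(&path).expect("fixture file exists");
    let file: FixtureFile = serde_json::from_str(&text).expect("valid fixture JSON");
    file.fixtures
}
```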
### Layer 4 — Strict coverage gate
A test that cross-references the surface inventory against the
conformance test files. Fails if any public item lacks a test
reference. Items that legitimately can't be tested go into an
exclusion file with a written reason and a tracking issue number.
```rust
// Example: tests/conformance_surface_coverage.rs
#[test]
fn every_public_item_has_a_conformance_reference() {
    let surface: Vec<Item> = load_surface_json();
    let exclusions: Vec<Exclusion> = load_exclusions_toml();
    let test_text = read_all_conformance_test_files();

    let uncovered: Vec<_> = surface.iter()
        .filter(|item| !test_text.contains(&item.path))
        .filter(|item| !exclusions.iter().any(|e| e.path == item.path))
        .collect();

    assert!(uncovered.is_empty(),
        "{} public items lack a test reference: {:?}",
        uncovered.len(), uncovered);
}
```
The gate is **strict from day 1**. Every new `pub fn` added to the
project must be referenced in a conformance test or added to
exclusions with a tracking issue. Without this strictness, coverage
drifts.
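An exclusion entry might look like this (the file layout and field names are illustrative; what matters is the specific reason and the tracking reference):

```toml
# tests/conformance/exclusions.toml (hypothetical layout)
[[exclusion]]
path = "mylib::tensor::TensorIter"
reason = "Iterator type; exercised transitively by every per-op test that walks outputs"
issue = "none"

[[exclusion]]
path = "mylib::cuda::raw_launch"
reason = "Requires a live CUDA context; covered only in the GPU-lane CI job"
issue = "1234"   # hypothetical tracking-issue number
```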
---
## Which projects benefit
A conformance suite doesn't earn its keep in every project. The
heuristic: **does your project make a behavioral-parity claim?**
### Strong fit
- **Reimplementations of an existing library.** ferrotorch (PyTorch),
candle and burn (PyTorch-style APIs), polars (a pandas-style DataFrame
API), arrow-rs (Arrow C++), babashka (Clojure).
The reference library is the ground truth; conformance proves you
match.
- **Cross-language ports.** A Python library reimplemented in Go; a
C++ library reimplemented in Rust; a JavaScript library
reimplemented in WebAssembly. The original is the spec.
- **Compatibility shims.** A library that claims "drop-in replacement
for X" — every public item must behave identically.
- **Multi-backend implementations.** A library with CPU + GPU + TPU
backends, where each backend must produce equivalent outputs.
Conformance proves cross-backend agreement.
- **Standards implementations.** A library implementing a published
spec (HTTP/3, MessagePack, JSON Schema, GraphQL, RFC 6901). The
spec's reference implementation is the ground truth.
- **Data-format converters.** A library that round-trips a format
(Parquet, ORC, Arrow IPC, ONNX). Conformance: encoding then decoding
produces bit-identical output.
- **Numerical / scientific libraries.** Anything claiming "matches
numpy" or "matches scipy" or "matches octave." Numerical drift is
invisible to unit tests; conformance catches it.
### Weak fit
- **Greenfield libraries with no reference.** Nothing to be conformant
against. Property-based testing serves the same purpose (algebraic
invariants are the spec).
- **UI / interactive applications.** Behavior is human-perceived;
hard to encode as fixture-vs-output assertion.
- **Network services with stateful protocols.** Conformance can apply
to the wire format but not to the long-lived session state — those
need integration tests.
- **DSLs and macro-heavy code.** The "public surface" is the macros
themselves; conformance applies to what the macros expand to, which
is fluid.
### Mixed fit
- **General-purpose application code.** No external spec, but if you
have a "stable API" claim, conformance against your own historical
behavior (snapshot-style) catches regressions.
- **Compilers.** Conformance against a published language spec works
(C, Rust); against a competing compiler is fragile (their bugs are
not your spec).
---
## What conformance suites produce
The visible output is "tests pass." The invisible output is the
**bug-finding rate**.
In ferrotorch's conformance work, across 7 phases covering
~340 conformance tests:
- **Phase 2.0 creation** (foundational, simple ops): 0 cascade bugs
- **Phase 2.1 elementwise** (arithmetic): 3 GPU bugs surfaced (PTX JIT
failure, missing kernel, broadcast misalignment)
- **Phase 2.2 reductions**: 4 GPU bugs (silent CPU-detour, missing
scale, uninitialized buffer, polynomial residual)
- **Phase 2.5 activations**: 8 GPU/CPU bugs (4 polynomial accuracy
gaps, 1 GPU autograd save-state issue, 1 PTX JIT, 1 backward grad
delta, 1 forward divergence)
- **Phase 2.4 linalg**: 2 GPU bugs (matmul f64 forward routing
asymmetry, dispatch limited to 2D×2D)
- ... and so on.
Net: **~30 latent bugs caught in code that already had unit tests, doctests, and "PyTorch parity" claims in the README.** None of these would have surfaced through any other testing pattern. Each one is a real defect that a downstream user would have hit.
The pattern that emerges from running a conformance suite over a
mature codebase: **bugs cluster around silent fallbacks**. The most
common shapes:
1. **Silent CPU detour**: a function claims to support GPU but calls
`.data()?` internally, demoting to CPU. Output is correct but
performance is silently terrible. Conformance catches this when
the test asserts the output tensor is still on GPU (see the sketch
after this list).
2. **Routing asymmetry**: the forward path is `f32`-only; the backward
path correctly branches on dtype. Forward + backward together fails
for `f64`. The unit test for the forward passed because it was
`f32`-only; the unit test for the backward passed because it was
tested in isolation. Conformance catches this when the test runs
the op in `f64` end-to-end and asserts.
3. **Verification-debt clusters**: a fix applies the same patch to N
sibling kernels but only the original one has a probe. Conformance
catches the unverified siblings the next time their test runs.
4. **Stub residue**: a public function returns
`Err(NotImplementedOnCuda)` despite the README claiming GPU
support. Unit tests pass (they hit the CPU path); conformance
surfaces the gap when the test runs the GPU lane.
5. **Numerical drift**: a polynomial approximation produces 5e-7
error against PyTorch's libdevice-backed call. Unit tests use
self-comparison so they don't catch it; conformance does.
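For shape 1, the assertion is about placement as well as values. A minimal sketch, assuming hypothetical `to_device`/`device` accessors and a GPU-lane tolerance helper:

```rust
// Hypothetical GPU-lane test: the op must not silently detour through CPU.
#[test]
fn relu_stays_on_gpu() {
    let fixtures = load_fixtures("activations.json");
    let fixture = fixtures.iter().find(|f| f.op == "relu").unwrap();
    let x = Tensor::from_vec(&fixture.input_a, &fixture.shape)
        .unwrap()
        .to_device(Device::Cuda(0)) // hypothetical device API
        .unwrap();

    let y = mylib::relu(&x).unwrap();

    // Value parity against the reference fixture...
    assert_close_f32_gpu(&y.data(), &fixture.expected);
    // ...and placement: the output must still live on the GPU.
    assert_eq!(y.device(), Device::Cuda(0));
}
```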
---
## How to set one up
### Step 0 — Pick the reference library and pin its version
Decide which library is your conformance ground truth. Pin to a
specific version. Record the pin in the fixture metadata so the suite
can detect drift.
For ferrotorch this was `torch==2.11.0+cu130`. For a numpy
reimplementation it might be `numpy==2.4.4`. For a JSON parser it
might be the published RFC test vectors.
### Step 1 — Build the surface inventory tool
Either rustdoc-based (`cargo +nightly rustdoc --output-format json`) or
syn-based (parse `src/**/*.rs` with `syn 2`). Both work. Output a
sorted, deterministic JSON file that lists every public item.
Commit the inventory file. PRs that change the surface produce a
diffable JSON change.
### Step 2 — Decide tolerance categories
For numerical libraries, write down the tolerance per op category in a
table:
| Category | CPU tolerance | GPU tolerance | Reason |
|---|---|---|---|
| Indexing/slicing/shape | bit-exact | bit-exact | No arithmetic |
| f32 elementwise | 1 ULP | 1 ULP | IEEE 754 deterministic |
| f32 reductions | 1e-6 abs | 1e-5 abs | GPU reduces in tree order |
| f32 transcendentals | 1e-5 rel | 1e-4 rel | GPU approx instructions looser |
| Matmul (f32) | 1e-4 rel | 1e-3 rel | O(n³) error amplification |
| RNG | distribution moments only | distribution moments only | Different RNGs |
For non-numerical libraries (string/JSON/format converters),
everything is bit-exact.
The table goes in a shared helper file
(`tests/conformance/_tolerance.rs`) and gets reused across phases.
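A minimal sketch of the shared helper (names match the earlier test example; the wrapper tolerances are illustrative, drawn from the table above):

```rust
// tests/conformance/_tolerance.rs (sketch): shared comparison helpers.

/// Elementwise comparison with a combined relative + absolute tolerance.
pub fn assert_close(actual: &[f32], expected: &[f32], rel: f32, abs: f32) {
    assert_eq!(actual.len(), expected.len(), "length mismatch");
    for (i, (a, e)) in actual.iter().zip(expected).enumerate() {
        let tol = abs + rel * e.abs();
        assert!(
            (a - e).abs() <= tol || (a.is_nan() && e.is_nan()),
            "index {i}: {a} vs {e} exceeds tolerance {tol}"
        );
    }
}

// Per-category wrappers keep the tolerance table in one place.
pub fn assert_close_f32_cpu(actual: &[f32], expected: &[f32]) {
    assert_close(actual, expected, 0.0, 1e-6) // CPU lane; pick values per the table
}

pub fn assert_close_f32_gpu(actual: &[f32], expected: &[f32]) {
    assert_close(actual, expected, 0.0, 1e-5) // GPU lane; pick values per the table
}
```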
### Step 3 — Author one phase as a proof-of-pattern
Pick one module — ideally a small one — and write its full
conformance suite. Layers 2 (regen script), 3 (tests), 4 (gate). Get
it green. Now you've proven the pattern works for your project.
### Step 4 — Sweep the rest
For a medium codebase (~50-100 public items), one phase might be
enough. For a large codebase (~1000+ public items), break into phases
by module. Each phase is its own dispatch — author the regen script,
the test file, the exclusion changes.
If you can run subagent dispatches in parallel (Claude Code,
similar tooling), 4-5 phases at once is a good batch. The constraint
is the shared exclusion file — give each subagent explicit
"don't-edit-this-file" instructions and apply changes consolidated at
the end.
### Step 5 — Triage the bug cascade
You will surface bugs. They will outnumber the original test count.
Don't try to fix them inline; that's a different work mode. File each
as a tracking issue, add a per-test `cascade_skip()` referencing the
issue number, and continue building the suite.
When the suite is complete, you have a structured queue of real bugs
to fix. Each bug-fix dispatch flips a skip back to a live assertion —
a clean unit of work.
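One way to implement the skip, assuming the helper is just a logging function and issue references are recorded as strings (names are illustrative):

```rust
// Hypothetical skip helper: the test stays in the suite, but exits early
// while the cascade bug it references is still open.
pub fn cascade_skip(issue: &str, reason: &str) {
    eprintln!("SKIP (cascade bug {issue}): {reason}");
}

#[test]
fn softmax_matches_reference() {
    if cfg!(feature = "cuda") {
        // Illustrative issue number; flip back to a live assertion once fixed.
        cascade_skip("1234", "GPU softmax polynomial accuracy gap");
        return;
    }
    // ...normal fixture-driven assertions here...
}
```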
### Step 6 — Make the gate strict in CI
The strict coverage gate is the lock-in. Once it's in CI, no public
item can be added without either (a) a conformance test reference or
(b) an exclusion entry with a tracking issue. The cost of keeping
coverage current is paid at PR time, not at "we should write more
tests" time.
---
## Common objections
**"This is just unit testing with extra steps."**
No — unit testing proves the implementation matches its own logic.
Conformance proves the implementation matches an external spec. They
test orthogonal claims.
**"Tolerance comparisons are fragile."**
Yes, in two ways. (1) Tolerances drift across reference-library
versions. Mitigation: pin the version, record it in fixture metadata,
re-run when bumping. (2) GPU and CPU tolerances differ — bake that
into the tolerance table from the start.
**"The fixture files are huge."**
Yes. Each fixture is small but they accumulate. ferrotorch's
fixtures total ~50 MB across 7 phases. Trade-off: large repo vs
running Python in CI. Most teams prefer the large repo. LFS is an
option if it gets unwieldy.
**"What about RNG ops? They can't be bit-exact."**
Compare distribution moments (mean, variance, range) instead of raw
values. The conformance assertion is "samples have the same
distribution," not "samples have the same bits." Tolerances are
larger (~3-5% on n=10000) but still meaningful.
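A sketch of what a moment-based check can look like, with the slack values illustrative of the ~3-5% figure above:

```rust
// Hypothetical distribution check for an RNG op (e.g. a randn-like sampler).
fn assert_moments_close(samples: &[f32], expected_mean: f32, expected_var: f32) {
    let n = samples.len() as f32;
    let mean = samples.iter().sum::<f32>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;

    // Roughly 5% slack on n = 10_000 samples; tune per op.
    assert!((mean - expected_mean).abs() < 0.05,
        "mean {mean} vs expected {expected_mean}");
    assert!((var - expected_var).abs() < 0.05 * expected_var.max(1.0),
        "variance {var} vs expected {expected_var}");
}
```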
**"The reference library is buggy."**
Sometimes you'll find your implementation is *more correct* than the
reference. ferrotorch caught one such case: hardsigmoid f64 backward,
where ferrotorch returns the true 1/6 in f64 while PyTorch returns
the f32-rounded value. Document the divergence in the test, file a
tracking issue noting "ferrotorch diverges from reference; reference
is wrong here," and either match the reference's wrong behavior (if
parity is the contract) or document the divergence (if correctness
is the contract).
**"Can't I just run the reference library in CI?"**
You can, but: (a) Python-in-Rust-CI doubles build complexity,
(b) reference-library updates silently change tests, (c) GPU
reference runs require GPU CI. Captured fixtures avoid all three.
---
## Anti-patterns to avoid
- **Generic exclusion entries.** "Tested by something somewhere" is
not a reason. Each exclusion needs a specific path-and-test
reference.
- **Tolerance weakening to silence a failing test.** The right
response to a tolerance miss is "file a bug, skip the test with
the tracking issue, move on." Not "raise the tolerance until it
passes."
- **Skipping the strict gate "until coverage is up."** The gate is
what makes coverage stay up; without it, every PR adds a tiny bit
of drift.
- **Phantom tests for backward structs / type aliases / re-exports.**
These are usually covered transitively. Use exclusion entries with
"implicit coverage" reasons instead of authoring tests that don't
prove anything.
- **Same-library snapshot tests dressed as conformance.** If you're
comparing the implementation to its own previous output, that's a
snapshot test, not conformance. Conformance requires an external
reference.
- **Conformance against an unstable reference.** If the reference
library changes weekly, your fixtures go stale weekly. Pin the
reference version conservatively.
---
## Maintenance over time
Once the suite is in place:
1. **Reference-library updates** (e.g., torch 2.11 → 2.12): re-run
regen scripts, commit new fixtures, run the suite. Failures are
either reference-library changes ferrotorch should match (file
issue) or ferrotorch bugs the new fixtures surfaced (also file
issue).
2. **New public items**: the strict gate catches them at PR time.
The gate requires authors to add a conformance test reference (or
an exclusion) with their PR.
3. **Bug fixes**: when a tracked cascade bug is fixed, the test's
`cascade_skip()` reference becomes a live assertion. Verify the
fix produces the expected output, remove the skip.
4. **Tolerance reviews**: if a tolerance is consistently passing with
margin or consistently failing without bugs, the tolerance is
probably wrong. Audit the table periodically.
5. **Fixture audits**: fixtures grow over time as edge cases get
added. Periodically prune redundant cases.
The suite is a living artifact — not a one-time investment.
---
## When NOT to use this pattern
- **Throwaway code.** The investment isn't recouped.
- **Code where "correct" is defined by humans, not a reference.**
E.g., UI rendering, linting, autocompletion ranking. No external
spec to be conformant against.
- **Code where the reference is itself the system under test.**
Doesn't apply (no oracle).
- **Pre-1.0 libraries with rapidly-evolving APIs.** The suite
invests in the public surface; if the surface is in flux, the
investment evaporates each release.
For these cases, property-based testing, snapshot testing, or
human-curated regression tests serve the role better.
---
## Summary
Conformance suites prove that what your library claims to do matches
what an external reference does. They catch a class of bug that no
other testing pattern finds: silent fallbacks, routing asymmetries,
numerical drift, and stub residue.
The four-layer architecture — surface inventory → reference fixtures
→ conformance tests → strict coverage gate — is mechanical enough to
build incrementally. Strong fit: any project making a behavioral-parity
claim. Weak fit: greenfield code with no oracle.
The pattern's value scales with the size of the public API and the
strength of the parity claim. For a small library with a "compatible
with X" claim, a conformance suite is overkill. For a large library
with a "drop-in replacement for X" claim, it's the only way to keep
the claim honest.
Build it once, maintain it incrementally, file the bugs it surfaces,
fix them as separate work. The bugs are the value.

---

**@nwhitehead** commented:
Some notes in case it's helpful:


"The reference library is buggy."

I would add a note about filing an actual bug against the reference library. This is good software stewardship and helps the ecosystem as a whole. For the agent, the instruction might be to include a note about filing a bug against the reference in the description of the issue.

Also, for a complicated project there are likely to be cases where the reference library is not exactly buggy, but under-specified. A combination of inputs in the reference library may be undocumented and have some sort of reasonable but unpredicted behavior. Deciding whether to copy this or mark with an issue is important. Basically there are more cases than "reference is right" / "we are right".


**Data-format converters.**

"Conformance: encoding then decoding produces bit-identical output." This would be good if true, but often round trip conversion cannot be bit-identical but will be functionally identical. A trick to help with this is to define a canonicalization function from each format to itself. Then you say the canonicalized outputs of encoding-decoding are bit identical.


**Performance conformance**

There is also a "performance conformance" type of testing. Sometimes the unit test and integration test will show correct results, taking the right path. But the public symbol will accidentally hit a slow fallback and be slow (but undetectable otherwise). It looks like some of these were caught because of CPU/GPU location issues, but often everything is on the CPU and just hitting different branches. Performance conformance is probably a good extra step beyond what's here.


**Lifetimes, memory**

Another aspect I didn't see is lifetimes, ownership, and memory allocation. This is a classic problem for C API libraries. The Rust interface will typically be OK, but C assumptions about lifetimes live in documentation and are poorly tested in practice. This can be a tricky part of conformance testing between languages.


**@dollspace-gay** (author) replied:
True

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment