# Conformance suites: what they are, when they help, how to build one
A conformance suite is a third axis of testing — distinct from unit tests
and integration tests. Where unit tests prove that a function does what
its body says, and integration tests prove that subsystems compose, a
conformance suite proves that **what the public API claims it does
matches what it actually does, against an external ground truth**.
This document explains the pattern, when it earns its keep, and how to
set one up. The case study is ferrotorch (a pure-Rust PyTorch
reimplementation), but the pattern generalizes to any project that
claims behavioral parity with a reference.
---
## The problem conformance suites solve
The most common failure mode in a reimplementation project is "tests
pass, ship it" — where:
- Unit tests cover the parts of the code the author thought to test.
- Integration tests cover the workflows the author thought to exercise.
- Doctests show that the example in the docstring compiles and runs.
- The README says "PyTorch parity for op X."
And then a downstream user runs `op X` with a real input and gets a
result that differs from PyTorch by 5 ULP, or silently runs on CPU
when they explicitly put the tensor on GPU, or panics on an edge case
PyTorch handles cleanly. The author's tests didn't catch it because
the author was checking *internal correctness*, not *contract fidelity*.
Conformance suites close this gap by encoding the contract — "this op
behaves the same as the reference library's op for the same inputs" —
as a mechanically-checkable assertion. The reference library is the
ground truth; the suite proves the implementation matches.
---
## What a conformance suite *is not*
It's worth being precise, because conformance suites overlap with
several things they're not.
| Pattern | What it proves | Conformance is different because |
|---|---|---|
| **Unit tests** | This function returns what its body computes | Conformance proves the function matches an *external* spec, not its own implementation |
| **Integration tests** | These subsystems compose correctly | Conformance is per-public-item; integration is workflow-level |
| **Property-based tests** | Algebraic invariants hold (`add(x, 0) == x`) | Conformance can use property tests, but the assertion is "matches reference" not "satisfies invariant" |
| **Doctests** | The documented example compiles and runs | Doctests are author-authored; conformance is reference-authored |
| **Snapshot tests** | Output matches a previous run's output | Conformance compares to an *external* reference, not a previous self-run |
| **Fuzzing** | No crashes on random inputs | Conformance is curated inputs with known reference outputs |
| **Benchmarks** | This is fast | Conformance is correctness-only |
You can have all of these and still ship a "PyTorch parity for op X"
claim that isn't true. Conformance is the assertion that closes that
specific gap.
---
## The four-layer architecture
A conformance suite has four mechanical layers. They're independent
enough to build in any order, but the dependencies flow downward:
```
+----------------------------------+
| Layer 4 — Strict coverage gate   |   ← CI fails if a public item
|                                  |     lacks a test reference
+----------------------------------+
              ↑ refers to
+----------------------------------+
| Layer 3 — Conformance tests      |   ← Per-op test functions that
|                                  |     load fixtures and assert
+----------------------------------+
              ↑ loads
+----------------------------------+
| Layer 2 — Reference fixtures     |   ← JSON / .npz files committed
|                                  |     to the repo, generated by a
|                                  |     script that calls the reference
+----------------------------------+
              ↑ derives from
+----------------------------------+
| Layer 1 — Surface inventory      |   ← List of every `pub` item the
|                                  |     project claims is its API
+----------------------------------+
```
### Layer 1 — Surface inventory
A list of every public item your project exposes. The "denominator"
for coverage. Built once, regenerated whenever the public surface
changes. In Rust, this is generated by a `syn`-based parser that walks
all source files and collects every `pub fn` / `pub struct` /
`pub trait` / `pub method` with module path + signature.
The inventory is committed to the repo. PRs that change the public
surface produce a clean JSON diff that the reviewer can check against
the test coverage diff.
```jsonc
// Example: tests/conformance/_surface.json
{
  "items": [
    {
      "path": "mylib::tensor::add",
      "kind": "fn",
      "signature": "fn add<T: Float>(a: &Tensor<T>, b: &Tensor<T>) -> Result<Tensor<T>>"
    },
    {
      "path": "mylib::tensor::Tensor::reshape",
      "kind": "method",
      "signature": "fn reshape(&self, shape: &[usize]) -> Result<Tensor<T>>"
    }
    // ... thousands of items
  ]
}
```
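A minimal sketch of the syn-based walk (assuming `syn` with its `full` feature plus `quote` for rendering signatures; a real tool also recurses into `mod` and `impl` blocks to pick up methods):

```rust
// Sketch only: collects top-level `pub fn` items from one source file.
use std::{fs, path::Path};
use syn::{Item, Visibility};

fn collect_public_fns(file: &Path, module_path: &str, out: &mut Vec<(String, String)>) {
    let source = fs::read_to_string(file).expect("readable source file");
    let ast = syn::parse_file(&source).expect("valid Rust source");
    for item in ast.items {
        if let Item::Fn(f) = item {
            if matches!(f.vis, Visibility::Public(_)) {
                let sig = &f.sig;
                out.push((
                    format!("{module_path}::{}", sig.ident), // e.g. "mylib::tensor::add"
                    quote::quote!(#sig).to_string(),         // signature rendered as text
                ));
            }
        }
    }
}
```

Sorting the collected items before serializing keeps the committed JSON deterministic, which is what makes the PR diff reviewable.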
### Layer 2 — Reference fixtures
Generated offline by a script that imports the reference library and
records `(input, expected_output)` pairs for each op. Curated input
sets per op cover normal cases + edge cases (empty input, NaN/Inf
boundaries, broadcast shapes, etc.).
Fixtures are committed to the repo. CI never invokes the reference
library; the JSON files are the source of truth.
```python
# Example: scripts/regenerate_tensor_fixtures.py
import torch, json
torch.manual_seed(0)  # seed so regenerated fixtures are reproducible

fixtures = []
for op_name in ["add", "mul", "sub", "div"]:
    op = getattr(torch, op_name)
    for shape in [[3], [3, 4], [2, 3, 4]]:
        a = torch.randn(*shape)
        b = torch.randn(*shape)
        result = op(a, b)
        fixtures.append({
            "op": op_name,
            # store flat values + shape so the Rust side can rebuild the tensor
            "input_a": a.flatten().tolist(),
            "input_b": b.flatten().tolist(),
            "shape": shape,
            "expected": result.flatten().tolist(),
        })

with open("tests/conformance/fixtures/tensor.json", "w") as f:
    json.dump({"version": torch.__version__, "fixtures": fixtures}, f)
```
The script's pinned reference library version is recorded in the
fixture metadata. When the reference library updates, you re-run the
script and re-commit the fixtures.
### Layer 3 — Conformance tests
Rust integration tests that load fixtures and assert. Tolerance comes
from a per-op-category table (bit-exact for indexing, 1 ULP for f32
elementwise, 1e-6 abs for f32 reductions, etc.).
```rust
// Example: tests/conformance_tensor.rs
#[path = "conformance/_tolerance.rs"]
mod tolerance;              // shared tolerance + fixture-loading helpers
use mylib::Tensor;
use tolerance::*;

#[test]
fn add_matches_reference() {
    let fixtures = load_fixtures("tensor.json");
    for fixture in fixtures.iter().filter(|f| f.op == "add") {
        let a = Tensor::from_vec(&fixture.input_a, &fixture.shape).unwrap();
        let b = Tensor::from_vec(&fixture.input_b, &fixture.shape).unwrap();
        let actual = mylib::add(&a, &b).unwrap();
        assert_close_f32_cpu(&actual.data(), &fixture.expected);
    }
}
```
For projects with GPU paths, the same test runs on GPU with looser
tolerance. For projects with autograd, the test additionally runs the
backward pass and compares gradients.
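The `load_fixtures` helper can be a thin serde wrapper over the committed JSON; a minimal sketch, with field names assumed to mirror the regen script above:

```rust
// Hypothetical loader living alongside the tolerance helpers.
use serde::Deserialize;

#[derive(Deserialize)]
pub struct FixtureFile {
    pub version: String,       // pinned reference-library version the fixtures came from
    pub fixtures: Vec<Fixture>,
}

#[derive(Deserialize)]
pub struct Fixture {
    pub op: String,
    pub input_a: Vec<f32>,
    pub input_b: Vec<f32>,
    pub shape: Vec<usize>,
    pub expected: Vec<f32>,
}

pub fn load_fixtures(name: &str) -> Vec<Fixture> {
    let path = format!("tests/conformance/fixtures/{name}");
    let text = std::fs::read_to_string(&path).expect("fixture file exists");
    let file: FixtureFile = serde_json::from_str(&text).expect("valid fixture JSON");
    file.fixtures
}
```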
### Layer 4 — Strict coverage gate
A test that cross-references the surface inventory against the
conformance test files. Fails if any public item lacks a test
reference. Items that legitimately can't be tested go into an
exclusion file with a written reason and a tracking issue number.
```rust
// Example: tests/conformance_surface_coverage.rs
#[test]
fn every_public_item_has_a_conformance_reference() {
    let surface: Vec<Item> = load_surface_json();
    let exclusions: Vec<Exclusion> = load_exclusions_toml();
    let test_text = read_all_conformance_test_files();

    let uncovered: Vec<_> = surface.iter()
        .filter(|item| !test_text.contains(&item.path))
        .filter(|item| !exclusions.iter().any(|e| e.path == item.path))
        .collect();

    assert!(uncovered.is_empty(),
        "{} public items lack a test reference: {:?}",
        uncovered.len(), uncovered);
}
```
The gate is **strict from day 1**. Every new `pub fn` added to the
project must be referenced in a conformance test or added to
exclusions with a tracking issue. Without this strictness, coverage
drifts.
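An exclusion entry might look like this (the file layout and field names are illustrative; what matters is the specific reason and the tracking reference):

```toml
# tests/conformance/exclusions.toml (hypothetical layout)
[[exclusion]]
path = "mylib::tensor::TensorIter"
reason = "Iterator type; exercised transitively by every per-op test that walks outputs"
issue = "none"

[[exclusion]]
path = "mylib::cuda::raw_launch"
reason = "Requires a live CUDA context; covered only in the GPU-lane CI job"
issue = "1234"   # hypothetical tracking-issue number
```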
---
## Which projects benefit
A conformance suite doesn't earn its keep in every project. The
heuristic: **does your project make a behavioral-parity claim?**
### Strong fit
- **Reimplementations of an existing library.** ferrotorch (PyTorch),
candle and burn (PyTorch-style APIs), polars (a pandas-style DataFrame
API), arrow-rs (Arrow C++), babashka (Clojure).
The reference library is the ground truth; conformance proves you
match.
- **Cross-language ports.** A Python library reimplemented in Go; a
C++ library reimplemented in Rust; a JavaScript library
reimplemented in WebAssembly. The original is the spec.
- **Compatibility shims.** A library that claims "drop-in replacement
for X" — every public item must behave identically.
- **Multi-backend implementations.** A library with CPU + GPU + TPU
backends, where each backend must produce equivalent outputs.
Conformance proves cross-backend agreement.
- **Standards implementations.** A library implementing a published
spec (HTTP/3, MessagePack, JSON Schema, GraphQL, RFC 6901). The
spec's reference implementation is the ground truth.
- **Data-format converters.** A library that round-trips a format
(Parquet, ORC, Arrow IPC, ONNX). Conformance: encoding then decoding
produces bit-identical output.
- **Numerical / scientific libraries.** Anything claiming "matches
numpy" or "matches scipy" or "matches octave." Numerical drift is
invisible to unit tests; conformance catches it.
### Weak fit
- **Greenfield libraries with no reference.** Nothing to be conformant
against. Property-based testing serves the same purpose (algebraic
invariants are the spec).
- **UI / interactive applications.** Behavior is human-perceived;
hard to encode as fixture-vs-output assertion.
- **Network services with stateful protocols.** Conformance can apply
to the wire format but not to the long-lived session state — those
need integration tests.
- **DSLs and macro-heavy code.** The "public surface" is the macros
themselves; conformance applies to what the macros expand to, which
is fluid.
### Mixed fit
- **General-purpose application code.** No external spec, but if you
have a "stable API" claim, conformance against your own historical
behavior (snapshot-style) catches regressions.
- **Compilers.** Conformance against a published language spec works
(C, Rust); against a competing compiler is fragile (their bugs are
not your spec).
---
## What conformance suites produce
The visible output is "tests pass." The invisible output is the
**bug-finding rate**.
In ferrotorch's conformance work, across 7 phases covering
~340 conformance tests:
- **Phase 2.0 creation** (foundational, simple ops): 0 cascade bugs
- **Phase 2.1 elementwise** (arithmetic): 3 GPU bugs surfaced (PTX JIT
failure, missing kernel, broadcast misalignment)
- **Phase 2.2 reductions**: 4 GPU bugs (silent CPU-detour, missing
scale, uninitialized buffer, polynomial residual)
- **Phase 2.5 activations**: 8 GPU/CPU bugs (4 polynomial accuracy
gaps, 1 GPU autograd save-state issue, 1 PTX JIT, 1 backward grad
delta, 1 forward divergence)
- **Phase 2.4 linalg**: 2 GPU bugs (matmul f64 forward routing
asymmetry, dispatch limited to 2D×2D)
- ... and so on.
Net: **~30 latent bugs caught in code that already had unit tests, doctests, and "PyTorch parity" claims in the README.** None of these would have surfaced through any other testing pattern. Each one is a real defect that a downstream user would have hit.
The pattern that emerges from running a conformance suite over a
mature codebase: **bugs cluster around silent fallbacks**. The most
common shapes:
1. **Silent CPU detour**: a function claims to support GPU but calls
`.data()?` internally, demoting to CPU. Output is correct but
performance is silently terrible. Conformance catches this when
the test asserts the output tensor is still on GPU (see the sketch
after this list).
2. **Routing asymmetry**: the forward path is `f32`-only; the backward
path correctly branches on dtype. Forward + backward together fails
for `f64`. The unit test for the forward passed because it was
`f32`-only; the unit test for the backward passed because it was
tested in isolation. Conformance catches this when the test runs
the op in `f64` end-to-end and asserts.
3. **Verification-debt clusters**: a fix applies the same patch to N
sibling kernels but only the original one has a probe. Conformance
catches the unverified siblings the next time their test runs.
4. **Stub residue**: a public function returns
`Err(NotImplementedOnCuda)` despite the README claiming GPU
support. Unit tests pass (they hit the CPU path); conformance
surfaces the gap when the test runs the GPU lane.
5. **Numerical drift**: a polynomial approximation produces 5e-7
error against PyTorch's libdevice-backed call. Unit tests use
self-comparison so they don't catch it; conformance does.
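For shape 1, the assertion is about placement as well as values. A minimal sketch, assuming hypothetical `to_device`/`device` accessors and a GPU-lane tolerance helper:

```rust
// Hypothetical GPU-lane test: the op must not silently detour through CPU.
#[test]
fn relu_stays_on_gpu() {
    let fixtures = load_fixtures("activations.json");
    let fixture = fixtures.iter().find(|f| f.op == "relu").unwrap();
    let x = Tensor::from_vec(&fixture.input_a, &fixture.shape)
        .unwrap()
        .to_device(Device::Cuda(0)) // hypothetical device API
        .unwrap();

    let y = mylib::relu(&x).unwrap();

    // Value parity against the reference fixture...
    assert_close_f32_gpu(&y.data(), &fixture.expected);
    // ...and placement: the output must still live on the GPU.
    assert_eq!(y.device(), Device::Cuda(0));
}
```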
---
## How to set one up
### Step 0 — Pick the reference library and pin its version
Decide which library is your conformance ground truth. Pin to a
specific version. Record the pin in the fixture metadata so the suite
can detect drift.
For ferrotorch this was `torch==2.11.0+cu130`. For a numpy
reimplementation it might be `numpy==2.4.4`. For a JSON parser it
might be the published RFC test vectors.
### Step 1 — Build the surface inventory tool
Either rustdoc-based (`cargo +nightly rustdoc --output-format json`) or
syn-based (parse `src/**/*.rs` with `syn 2`). Both work. Output a
sorted, deterministic JSON file that lists every public item.
Commit the inventory file. PRs that change the surface produce a
diffable JSON change.
### Step 2 — Decide tolerance categories
For numerical libraries, write down the tolerance per op category in a
table:
| Category | CPU tolerance | GPU tolerance | Reason |
|---|---|---|---|
| Indexing/slicing/shape | bit-exact | bit-exact | No arithmetic |
| f32 elementwise | 1 ULP | 1 ULP | IEEE 754 deterministic |
| f32 reductions | 1e-6 abs | 1e-5 abs | GPU reduces in tree order |
| f32 transcendentals | 1e-5 rel | 1e-4 rel | GPU approx instructions looser |
| Matmul (f32) | 1e-4 rel | 1e-3 rel | O(n³) error amplification |
| RNG | distribution moments only | distribution moments only | Different RNGs |
For non-numerical libraries (string/JSON/format converters),
everything is bit-exact.
The table goes in a shared helper file
(`tests/conformance/_tolerance.rs`) and gets reused across phases.
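A minimal sketch of the shared helper (names match the earlier test example; the wrapper tolerances are illustrative, drawn from the table above):

```rust
// tests/conformance/_tolerance.rs (sketch): shared comparison helpers.

/// Elementwise comparison with a combined relative + absolute tolerance.
pub fn assert_close(actual: &[f32], expected: &[f32], rel: f32, abs: f32) {
    assert_eq!(actual.len(), expected.len(), "length mismatch");
    for (i, (a, e)) in actual.iter().zip(expected).enumerate() {
        let tol = abs + rel * e.abs();
        assert!(
            (a - e).abs() <= tol || (a.is_nan() && e.is_nan()),
            "index {i}: {a} vs {e} exceeds tolerance {tol}"
        );
    }
}

// Per-category wrappers keep the tolerance table in one place.
pub fn assert_close_f32_cpu(actual: &[f32], expected: &[f32]) {
    assert_close(actual, expected, 0.0, 1e-6) // CPU lane; pick values per the table
}

pub fn assert_close_f32_gpu(actual: &[f32], expected: &[f32]) {
    assert_close(actual, expected, 0.0, 1e-5) // GPU lane; pick values per the table
}
```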
### Step 3 — Author one phase as a proof-of-pattern
Pick one module — ideally a small one — and write its full
conformance suite. Layers 2 (regen script), 3 (tests), 4 (gate). Get
it green. Now you've proven the pattern works for your project.
### Step 4 — Sweep the rest
For a medium codebase (~50-100 public items), one phase might be
enough. For a large codebase (~1000+ public items), break into phases
by module. Each phase is its own dispatch — author the regen script,
the test file, the exclusion changes.
If you can run subagent dispatches in parallel (Claude Code,
similar tooling), 4-5 phases at once is a good batch. The constraint
is the shared exclusion file — give each subagent explicit
"don't-edit-this-file" instructions and apply changes consolidated at
the end.
### Step 5 — Triage the bug cascade
You will surface bugs. They will outnumber the original test count.
Don't try to fix them inline; that's a different work mode. File each
as a tracking issue, add a per-test `cascade_skip()` referencing the
issue number, and continue building the suite.
When the suite is complete, you have a structured queue of real bugs
to fix. Each bug-fix dispatch flips a skip back to a live assertion —
a clean unit of work.
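One way to implement the skip, assuming the helper is just a logging function and issue references are recorded as strings (names are illustrative):

```rust
// Hypothetical skip helper: the test stays in the suite, but exits early
// while the cascade bug it references is still open.
pub fn cascade_skip(issue: &str, reason: &str) {
    eprintln!("SKIP (cascade bug {issue}): {reason}");
}

#[test]
fn softmax_matches_reference() {
    if cfg!(feature = "cuda") {
        // Illustrative issue number; flip back to a live assertion once fixed.
        cascade_skip("1234", "GPU softmax polynomial accuracy gap");
        return;
    }
    // ...normal fixture-driven assertions here...
}
```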
### Step 6 — Make the gate strict in CI
The strict coverage gate is the lock-in. Once it's in CI, no public
item can be added without either (a) a conformance test reference or
(b) an exclusion entry with a tracking issue. The cost of keeping
coverage current is paid at PR time, not at "we should write more
tests" time.
---
## Common objections
**"This is just unit testing with extra steps."**
No — unit testing proves the implementation matches its own logic.
Conformance proves the implementation matches an external spec. They
test orthogonal claims.
**"Tolerance comparisons are fragile."**
Yes, in two ways. (1) Tolerances drift across reference-library
versions. Mitigation: pin the version, record it in fixture metadata,
re-run when bumping. (2) GPU and CPU tolerances differ — bake that
into the tolerance table from the start.
**"The fixture files are huge."**
Yes. Each fixture is small but they accumulate. ferrotorch's
fixtures total ~50 MB across 7 phases. Trade-off: large repo vs
running Python in CI. Most teams prefer the large repo. LFS is an
option if it gets unwieldy.
**"What about RNG ops? They can't be bit-exact."**
Compare distribution moments (mean, variance, range) instead of raw
values. The conformance assertion is "samples have the same
distribution," not "samples have the same bits." Tolerances are
larger (~3-5% on n=10000) but still meaningful.
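A sketch of what a moment-based check can look like, with the slack values illustrative of the ~3-5% figure above:

```rust
// Hypothetical distribution check for an RNG op (e.g. a randn-like sampler).
fn assert_moments_close(samples: &[f32], expected_mean: f32, expected_var: f32) {
    let n = samples.len() as f32;
    let mean = samples.iter().sum::<f32>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;

    // Roughly 5% slack on n = 10_000 samples; tune per op.
    assert!((mean - expected_mean).abs() < 0.05,
        "mean {mean} vs expected {expected_mean}");
    assert!((var - expected_var).abs() < 0.05 * expected_var.max(1.0),
        "variance {var} vs expected {expected_var}");
}
```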
**"The reference library is buggy."**
Sometimes you'll find your implementation is *more correct* than the
reference. ferrotorch caught one such case: hardsigmoid f64 backward,
where ferrotorch returns the true 1/6 in f64 while PyTorch returns
the f32-rounded value. Document the divergence in the test, file a
tracking issue noting "ferrotorch diverges from reference; reference
is wrong here," and either match the reference's wrong behavior (if
parity is the contract) or document the divergence (if correctness
is the contract).
**"Can't I just run the reference library in CI?"**
You can, but: (a) Python-in-Rust-CI doubles build complexity,
(b) reference-library updates silently change tests, (c) GPU
reference runs require GPU CI. Captured fixtures avoid all three.
---
## Anti-patterns to avoid
- **Generic exclusion entries.** "Tested by something somewhere" is
not a reason. Each exclusion needs a specific path-and-test
reference.
- **Tolerance weakening to silence a failing test.** The right
response to a tolerance miss is "file a bug, skip the test with
the tracking issue, move on." Not "raise the tolerance until it
passes."
- **Skipping the strict gate "until coverage is up."** The gate is
what makes coverage stay up; without it, every PR adds a tiny bit
of drift.
- **Phantom tests for backward structs / type aliases / re-exports.**
These are usually covered transitively. Use exclusion entries with
"implicit coverage" reasons instead of authoring tests that don't
prove anything.
- **Same-library snapshot tests dressed as conformance.** If you're
comparing the implementation to its own previous output, that's a
snapshot test, not conformance. Conformance requires an external
reference.
- **Conformance against an unstable reference.** If the reference
library changes weekly, your fixtures go stale weekly. Pin the
reference version conservatively.
---
## Maintenance over time
Once the suite is in place:
1. **Reference-library updates** (e.g., torch 2.11 → 2.12): re-run
regen scripts, commit new fixtures, run the suite. Failures are
either reference-library changes ferrotorch should match (file
issue) or ferrotorch bugs the new fixtures surfaced (also file
issue).
2. **New public items**: the strict gate catches them at PR time.
The gate requires authors to add a conformance test reference (or
an exclusion) with their PR.
3. **Bug fixes**: when a tracked cascade bug is fixed, the test's
`cascade_skip()` reference becomes a live assertion. Verify the
fix produces the expected output, remove the skip.
4. **Tolerance reviews**: if a tolerance is consistently passing with
margin or consistently failing without bugs, the tolerance is
probably wrong. Audit the table periodically.
5. **Fixture audits**: fixtures grow over time as edge cases get
added. Periodically prune redundant cases.
The suite is a living artifact — not a one-time investment.
---
## When NOT to use this pattern
- **Throwaway code.** The investment isn't recouped.
- **Code where "correct" is defined by humans, not a reference.**
E.g., UI rendering, linting, autocompletion ranking. No external
spec to be conformant against.
- **Code where the reference is itself the system under test.**
Doesn't apply (no oracle).
- **Pre-1.0 libraries with rapidly-evolving APIs.** The suite
invests in the public surface; if the surface is in flux, the
investment evaporates each release.
For these cases, property-based testing, snapshot testing, or
human-curated regression tests serve the role better.
---
## Summary
Conformance suites prove that what your library claims to do matches
what an external reference does. They catch a class of bug that no
other testing pattern finds: silent fallbacks, routing asymmetries,
numerical drift, and stub residue.
The four-layer architecture — surface inventory → reference fixtures
→ conformance tests → strict coverage gate — is mechanical enough to
build incrementally. Strong fit: any project making a behavioral-parity
claim. Weak fit: greenfield code with no oracle.
The pattern's value scales with the size of the public API and the
strength of the parity claim. For a small library with a "compatible
with X" claim, a conformance suite is overkill. For a large library
with a "drop-in replacement for X" claim, it's the only way to keep
the claim honest.
Build it once, maintain it incrementally, file the bugs it surfaces,
fix them as separate work. The bugs are the value.

---

**@nwhitehead** commented:
Some notes in case it's helpful:


"The reference library is buggy."

I would add a note about filing an actual bug against the reference library. This is good software stewardship and helps the ecosystem as a whole. For the agent, the instruction might be to include a note about filing a bug against the reference in the description of the issue.

Also, for a complicated project there are likely to be cases where the reference library is not exactly buggy, but under-specified. A combination of inputs in the reference library may be undocumented and have some sort of reasonable but unpredicted behavior. Deciding whether to copy this or mark with an issue is important. Basically there are more cases than "reference is right" / "we are right".


**Data-format converters.**

"Conformance: encoding then decoding produces bit-identical output." This would be good if true, but often round trip conversion cannot be bit-identical but will be functionally identical. A trick to help with this is to define a canonicalization function from each format to itself. Then you say the canonicalized outputs of encoding-decoding are bit identical.


**Performance conformance**

There is also a "performance conformance" type of testing. Sometimes the unit test and integration test will show correct results, taking the right path. But the public symbol will accidentally hit a slow fallback and be slow (but undetectable otherwise). It looks like some of these were caught because of CPU/GPU location issues, but often everything is on the CPU and just hitting different branches. Performance conformance is probably a good extra step beyond what's here.


**Lifetimes, memory**

Another aspect I didn't see is lifetimes, ownership, and memory allocation. This is a classic problem for C API libraries. The Rust interface will typically be OK, but C assumptions about lifetimes live in documentation and are poorly tested in practice. This can be a tricky part of conformance testing between languages.


**@dollspace-gay** (author) replied:
True

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment