When people talk about automated testing, 98% of the time they're thinking of "example-based testing". This is a testing methodology that goes something like this:
- Think of the way a component should behave
- Specify a set of inputs to the component
- Specify the expected output of the component, given these inputs
- Repeat these steps for every major "category" of inputs. (e.g. if an input is a number, you'd probably use a large number, a negative number, a small number, 1, and 0, and consider these to cover all of your cases.)
Each input/output pair forms an example of the way the program should behave. A test suite written using this style forms an argument like this: "All inputs in the same category should act the same. I've specified examples of every category, and the categories I've specified cover all possible categories, so therefore my system is correct." I won't try to get into the upsides and downsides of this here.
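In code, the steps above might look something like this, testing a hypothetical `abs()` function (the function and the input categories here are illustrative, not from any particular test suite):

```typescript
// Example-based tests: one hand-picked input/output pair per "category".
const abs = (n: number): number => (n < 0 ? -n : n);

const cases: Array<[number, number]> = [
  [1000000, 1000000], // a large number
  [-5, 5],            // a negative number
  [3, 3],             // a small number
  [1, 1],
  [0, 0],
];

for (const [input, expected] of cases) {
  if (abs(input) !== expected) {
    throw new Error(`abs(${input}) returned ${abs(input)}, expected ${expected}`);
  }
}
console.log("all examples pass");
```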
There are several alternatives to example-based testing, and the most common one is "property-based testing", sometimes known as "random testing". Property-based testing goes something like this:
- Think of a way a component should behave. Specify it as "for all ____, ____ should be true".
- Specify this in code. The first ____ is a data type, T, which holds the input to your component. The second ____ is a function which takes data of type T and asserts something about it.
- Specify a generator for type T, which produces values of this type.
- Plug the two together and run them a whole bunch of times.
A `sort()` function on a list is a common example. We define the input as a list of numbers, and specify these properties:
- When an array is sorted, each item in the array should be >= the item before it.
- When an array is sorted, each array member in the input should appear in the output, exactly the same number of times that it was included in the input.
- (Optional, for a stable sort) Sorting an array twice gives the same result as sorting it once.
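Checked by hand in TypeScript, those properties might look like this. A PBT library would supply the generator, the loop, and the failure reporting; this sketch inlines all three:

```typescript
// A naive generator: arrays of up to 19 numbers in [-100, 100].
const randomArray = (): number[] =>
  Array.from({ length: Math.floor(Math.random() * 20) }, () =>
    Math.floor(Math.random() * 201) - 100);

// Count occurrences of each value, for the permutation check.
const counts = (xs: number[]): Map<number, number> => {
  const m = new Map<number, number>();
  for (const x of xs) m.set(x, (m.get(x) ?? 0) + 1);
  return m;
};

for (let run = 0; run < 100; run++) {
  const input = randomArray();
  const output = [...input].sort((a, b) => a - b);

  // Property 1: each item is >= the item before it.
  for (let i = 1; i < output.length; i++) {
    if (output[i] < output[i - 1]) throw new Error("not ordered");
  }

  // Property 2: the output is a permutation of the input.
  const inCounts = counts(input);
  const outCounts = counts(output);
  for (const [value, count] of inCounts) {
    if (outCounts.get(value) !== count) throw new Error("not a permutation");
  }
  if (inCounts.size !== outCounts.size) throw new Error("not a permutation");

  // Property 3: sorting twice gives the same result as sorting once.
  const twice = [...output].sort((a, b) => a - b);
  if (twice.some((x, i) => x !== output[i])) throw new Error("not idempotent");
}
console.log("all properties held for 100 random inputs");
```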
The argument made by a property-based test suite goes something like this: "I've explicitly stated the rules my program should follow. I've tested each rule for 100 different inputs from my generator, and my generator can produce every possible input, so the program probably follows my rules."
The tradeoffs offered by property-based testing are something like this:
- PBTs cover more cases than EBTs
- PBTs make the system invariants more explicit
- PBTs make the actual inputs and outputs of the system more implicit
- PBTs are better at finding bugs in edge cases that the developer did not consider
- PBTs are often harder to write than EBTs, because they require thinking about your system more formally, in terms of the invariants that it follows.
- PBTs require more advanced library support to use than EBTs
There are 3 main families of property-based testing library: those based on quickcheck, hedgehog, and smallcheck. Quickcheck was the original, and is a Haskell library. The vast majority of property-based testing libraries in languages other than Haskell are based on quickcheck, but most of them are missing certain features that I think are important. I'll get to these later.
Hedgehog was developed to address some problems with Quickcheck, and its creator has written implementations in other languages (Scala, F#, and R). To my knowledge there are no published papers on Hedgehog the way there are on quickcheck, but the creator has given talks on its implementation. Hedgehog is the approach that I prefer by far, and is the approach that I took in my (bad, abandoned, unreleased) property-based testing library.
Smallcheck, to my knowledge, only exists in Haskell. I consider it more theoretical than the other two alternatives, though it does have users and proponents. I would love to see more implementations of it but I am not sure that I would actually use them if a hedgehog implementation were available in my language of choice.
I'll reference these three families in the following sections.
jsverify is a quickcheck implementation in JS. I found it challenging to adopt, but it has been years since I tried to use it. fast-check is a quickcheck implementation in TS. Its documentation leaves some things implicit, especially around writing custom generators, but I found it pleasant to use overall. It also has some unique features which give it some of Hedgehog's benefits as well. ClojureScript has property-based testing built in, and I assume there is a property-based testing framework for PureScript, but I am not aware of other property-based testing libraries for JS or TS.
There are 3 main pieces of a property-based testing library: tools for generating data, a test runner that runs properties, and a component which shrinks down inputs which find an error to their minimal form.
Data generation is best thought of as a combinator library. Generators exist for primitive types, and then generators for more complex types are created by composing simple generators. In Haskell, for quickcheck, type classes are used to provide generators for types automatically, so users do not need to specify generators for their types explicitly. This has obvious upsides and more subtle downsides.
Hedgehog and non-haskell implementations of quickcheck use combinators like you would expect. For TypeScript, these would have a similar structure to the APIs of zod or runtypes, with some additions.
Generation for quickcheck looks like this:
type Generator<T> = (randomSeed: number, size: number) => T
Generation for hedgehog looks like this:
type Generator<T> = (randomSeed: number, size: number) => RoseTree<T>
type RoseTree<T> = [T, Iterable<RoseTree<T>>]
Many frameworks refer to generators as "arbitraries", a reference to the typeclass used in the original quickcheck paper.
The `size` parameter is used by the framework to control how large an example the generator will produce. There's no official rule for how "big" a thing with size n should be; what matters is that a generator produces smaller, simpler values at smaller sizes than at larger ones.
For example, when I wrote a library, I made the number generator produce values up to 1 at size=0 and up to maxint at size >= 100, on an exponential scale. When generating an array, I generated arrays of random length, with length <= size.
The reason for using the size parameter is threefold:
- Tests with a smaller size run faster
- Data with a smaller size will be easier to shrink and debug
- When generating recursive data, like trees, the generator can call itself recursively with a smaller size parameter at each step, ensuring that generation eventually terminates.
Not all implementations use a size parameter (I believe that fast-check does not; I don't know about jsverify), but I consider it important. Implementations that use it start with a small size and gradually increase it over the course of a test run. So if I run 100 tests for a property, I might start with size 10 and scale linearly up to 100.
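A minimal sketch of what sized generation could look like, following the quickcheck `(seed, size) => T` shape from above. The exponential number scale and the array-length rule mirror the choices described in the text; the PRNG and the names `number` and `arrayOf` are illustrative:

```typescript
type Generator<T> = (seed: number, size: number) => T;

// A tiny deterministic PRNG so the sketch is self-contained.
const next = (seed: number): number => (seed * 1103515245 + 12345) % 2147483648;
const unit = (seed: number): number => next(seed) / 2147483648; // in [0, 1)

// Numbers capped at ~1 for size 0, scaling exponentially toward
// Number.MAX_SAFE_INTEGER as size approaches 100.
const number: Generator<number> = (seed, size) => {
  const cap = Math.min(
    Number.MAX_SAFE_INTEGER,
    Math.pow(Number.MAX_SAFE_INTEGER, Math.min(size, 100) / 100),
  );
  return Math.floor(unit(seed) * cap);
};

// Arrays of random length, where length <= size.
const arrayOf = <T>(gen: Generator<T>): Generator<T[]> => (seed, size) => {
  const length = Math.floor(unit(seed) * (size + 1));
  const out: T[] = [];
  let s = seed;
  for (let i = 0; i < length; i++) {
    s = next(s); // advance the seed for each element
    out.push(gen(s, size));
  }
  return out;
};
```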
Frameworks also provide combinators for working with generators beyond those that mirror the shape of the data being generated. As a motivating example, say we want to run a property on values of a class like this:
class Foo {
  constructor(a: number, b: string) {
    // ...
  }
  // ...
}
We could write a generator from scratch, but instead frameworks generally let you do something like this:
const genFoo = tuple(number(), string()).map(([a, b]) => new Foo(a, b));
Other common combinators might let you combine multiple generators with given probabilities (imagine a generator that produces values over 100 10% of the time and values under 0 90% of the time), or let you transform the `size` parameter for a generator. One combinator that I find essential, but which some frameworks lack, is `chain()`:
class Generator<A> {
  chain<B>(fn: (a: A) => Generator<B>): Generator<B>
}
This lets us produce a generator based on the output of another generator, and is necessary for generating certain types of data. For example, generating two arrays of equal but random length.
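Here's a minimal sketch of that equal-length-arrays example, assuming the quickcheck-style `(seed, size)` shape from above; the names `Gen`, `intUpTo`, and `arrayOfLength` are hypothetical, not from any specific library:

```typescript
// A minimal generator type with map() and chain().
class Gen<A> {
  constructor(readonly run: (seed: number, size: number) => A) {}
  map<B>(fn: (a: A) => B): Gen<B> {
    return new Gen((seed, size) => fn(this.run(seed, size)));
  }
  chain<B>(fn: (a: A) => Gen<B>): Gen<B> {
    // Run this generator, then use its output to pick the next generator.
    return new Gen((seed, size) => fn(this.run(seed, size)).run(seed + 1, size));
  }
}

const prng = (seed: number): number =>
  ((seed * 1103515245 + 12345) % 2147483648) / 2147483648;

const intUpTo = (max: number): Gen<number> =>
  new Gen((seed, _size) => Math.floor(prng(seed) * (max + 1)));

const arrayOfLength = (n: number): Gen<number[]> =>
  new Gen((seed, _size) =>
    Array.from({ length: n }, (_, i) => Math.floor(prng(seed + i) * 100)));

// chain() lets the later generators depend on the first one's output:
// pick a random length n, then generate two arrays of exactly that length.
const twoEqualLengthArrays = intUpTo(10).chain((n) =>
  arrayOfLength(n).chain((xs) =>
    arrayOfLength(n).map((ys) => [xs, ys] as [number[], number[]])));

const [xs, ys] = twoEqualLengthArrays.run(42, 50);
console.log(xs.length === ys.length); // both arrays share the same random length
```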
Smallcheck is weird. I'll briefly cover it in its own section later.
Data shrinking is one of the most important parts of a property-based testing framework, and is also where a lot of complexity hides. When talking about interfacing a data description library like zod with property-based testing, this is where I would expect most of the challenges to lie. This also varies the most between different property-based testing frameworks.
Input for which a test fails is called a "counterexample". The reason shrinking is essential is that if you are running on complex data, a counterexample could well be an object graph containing hundreds of objects. When trying to reproduce and debug a failure, such a value is not very useful. A framework should provide the user not just with some counterexample, but with as small of a counterexample as possible.
In quickcheck frameworks, shrinking is achieved by attaching a function to generators like this:
type shrink<T> = (counterexample: T) => Iterable<T>
I'll give examples using arrays instead of iterables, but it's very important that the output of shrink is lazy (as well as that the rose tree produced by hedgehog generation is lazy). This is because the iterables produced can be very very large.
The idea of a shrink function is that, given some generated value, `shrink` produces a sequence of smaller values, starting with the most aggressive shrink and moving on to less aggressive shrinks. The exact values specified don't matter, as long as they don't form a cycle.
As an example, here's one of the ways that an implementation might shrink the value 100:
[0, 1, 10, 50, 90, 99]
Here we've decided that 0 is our smallest value. If we tried to shrink the value `0`, we would get an empty iterable back. There's no reason that an integer needs to shrink towards 0; an alternative integer generator could provide a different `shrink` method.
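One possible integer shrinker, sketched as a generator function: try 0 first (the most aggressive shrink), then halve the distance back toward the counterexample. The exact sequence it produces differs from the illustrative one above, which is fine, since any cycle-free sequence of smaller values works:

```typescript
function* shrinkInt(n: number): Iterable<number> {
  if (n === 0) return; // 0 is our smallest value; nothing left to try
  yield 0;             // most aggressive shrink first
  let delta = Math.trunc(n / 2);
  while (delta !== 0) {
    yield n - delta;   // progressively less aggressive shrinks
    delta = Math.trunc(delta / 2);
  }
}

console.log([...shrinkInt(100)]); // [0, 50, 75, 88, 94, 97, 99]
console.log([...shrinkInt(0)]);   // []
```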
When shrinking a counterexample, the framework tries the choices in order. If the test passes on a candidate (the input does not cause the assertions to fail), the framework moves on to the next possibility. If the test fails, the framework recurses and tries to shrink this new, smaller counterexample.
Going back to the example of shrinking 100, say that the tests pass on inputs 0, 1, and 10, but fail on 50. Then the framework will try to shrink 50, and get something like
[0, 1, 5, 25, 45, 49]
It'll repeat this until `shrink(counterexample)` returns an empty iterable, or until it reaches the end of the sequence without finding a smaller counterexample.
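That loop can be sketched in a few lines. The `minimize` function and the failing predicate here are illustrative; the integer shrinker follows the try-0-then-halve-the-distance strategy:

```typescript
// Take the first shrink that still fails and recurse on it; if no shrink
// fails, the current counterexample is (locally) minimal.
function minimize<T>(
  counterexample: T,
  shrink: (t: T) => Iterable<T>,
  fails: (t: T) => boolean,
): T {
  for (const candidate of shrink(counterexample)) {
    if (fails(candidate)) {
      return minimize(candidate, shrink, fails); // found a smaller failure
    }
  }
  return counterexample; // nothing smaller fails
}

// Shrink integers by trying 0, then halving the distance back up.
function* shrinkInt(n: number): Iterable<number> {
  if (n === 0) return;
  yield 0;
  for (let d = Math.trunc(n / 2); d !== 0; d = Math.trunc(d / 2)) yield n - d;
}

// A hypothetical property that fails for all n >= 42: minimization walks
// 100 -> 50 -> 44 -> 42 and stops, since no shrink of 42 still fails.
const minimal = minimize(100, shrinkInt, (n) => n >= 42);
console.log(minimal); // 42
```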
Combinators will generally shrink themselves before shrinking their contents. For example, an array's `shrink` method will first try to remove elements from the array, and then try to shrink the contents of each array cell.
The quickcheck method of shrinking presents certain problems for generator combinators. For example, say that we've created our class instance generator from before:
const genFoo = tuple(number(), string()).map(([a, b]) => new Foo(a, b));
Even though the framework likely has built-in shrinking for numbers, strings, and tuples, we will have to write our own shrinker for instances of `Foo`. The framework could provide an alternative to `map` that composes the shrink methods of the underlying generators as well, but then you'd have to provide another function which maps a `Foo` back to `[number, string]`. Similar problems occur when using `chain`.
This additional overhead means that people frequently don't provide `shrink` implementations for their types when using quickcheck-like libraries. This works fine as long as the tests are passing, but leads to challenges when debugging failing tests.
Hedgehog aims to address these problems. It is based around "integrated shrinking", which functions by bundling shrinks alongside generation. Rather than having generation produce a single value, it produces a rose tree. The root node of the rose tree contains the input for the test; the forest contains shrinks of the input. If you draw out the call graph of a series of shrinks for a counterexample, a rose tree reifies this call graph and represents it in data explicitly.
This is important because `map` and `chain` are easy to define on rose trees. When we `map` a generator, we also map the shrinks of the values it produces. This way, shrinking composed generators comes for free. It should be noted that shrinking the input to a `map` doesn't necessarily shrink the output in an intuitive way, because the relationship between the shrunk inputs and the outputs depends on the function passed to `map`.
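A sketch of why this composes: mapping over a rose tree maps the root (the generated value) and every shrink in the forest at once. Real implementations keep the forest lazy, as noted above; eager arrays are used here for brevity:

```typescript
type RoseTree<T> = [T, RoseTree<T>[]];

// Map a function over the root and, recursively, over every shrink.
const mapTree = <A, B>(fn: (a: A) => B, [root, forest]: RoseTree<A>): RoseTree<B> =>
  [fn(root), forest.map((child) => mapTree(fn, child))];

// A generated value 4 whose shrinks are 0 and 2 (and 2 shrinks to 0 and 1):
const four: RoseTree<number> = [4, [[0, []], [2, [[0, []], [1, []]]]]];

// Mapping n => n * 10 shrinks the mapped values "for free":
const forty = mapTree((n) => n * 10, four);
console.log(JSON.stringify(forty)); // [40,[[0,[]],[20,[[0,[]],[10,[]]]]]]
```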
`fast-check` has a notion of "mappers" that are used in its `arbitrary` implementations. I did not see documentation on this, but as far as I can tell, it provides some support for integrated shrinking.
Running properties is more straightforward than shrinking. The user specifies a property, a generator, and the number of times that the test should run, and the framework does the work. Frameworks frequently provide additional functionality, though:
- Specifying a "filter" function, such that the properties will only run on outputs from the generator that pass the filter. This could be done through generator combinators, but it is desirable to be able to specify the maximum number of generation attempts as well as the minimum number of runs which pass the filter.
- Specifying assertions on properties of the generated data. For instance, say we want to run a test 100 times, but we want at least 30 of those times to satisfy some side-condition.
- Testing generators. It's desirable to be able to specify statistical properties of the generated data when writing nontrivial generators.
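The filter functionality from the first bullet can be sketched like this, with both a cap on generation attempts and a required number of filtered runs. The function and option names are illustrative, not from a specific library:

```typescript
function checkWithFilter<T>(
  generate: () => T,
  filter: (t: T) => boolean,
  property: (t: T) => boolean,
  { runs = 100, maxAttempts = 1000 } = {},
): void {
  let passed = 0;
  for (let attempts = 0; passed < runs; attempts++) {
    if (attempts >= maxAttempts) {
      throw new Error(`gave up after ${maxAttempts} attempts (${passed} passed the filter)`);
    }
    const input = generate();
    if (!filter(input)) continue; // discarded; doesn't count toward runs
    passed++;
    if (!property(input)) {
      throw new Error(`counterexample: ${JSON.stringify(input)}`);
    }
  }
}

// Usage: the property only ever sees even numbers.
checkWithFilter(
  () => Math.floor(Math.random() * 1000),
  (n) => n % 2 === 0,
  (n) => (n / 2) * 2 === n,
);
console.log("property held on 100 filtered inputs");
```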
These notes turned out longer than expected, so I may try to fill in this section later. The point of smallcheck is that data is represented as a tree, and generation is done via an exhaustive search from the smallest tree through trees with higher depths. No shrinking is necessary, because execution progresses from smaller examples to larger. The paper "SmallCheck and Lazy SmallCheck: automatic exhaustive testing for small values" provides a good explanation.
Quickcheck provides the ability to generate not only values of type `number` or `string`, but also functions, like one from `number -> string`. It does this with a typeclass called "coarbitrary". This is a feature which I consider essential for something to truly be an implementation of quickcheck, but which most frameworks omit. The implementation is surprisingly simple once you learn the trick, but it's beyond the scope of these notes.