Elm gets one thing amazingly right (which is unusual in most ecosystems): its "default" testing tool has built-in support for property-based testing1. This is a huge step forward, since it makes property testing easily reachable for the average engineer without needing yet another dependency, boilerplate to set it up, and so on.
However, there are a number of things we could do to make fuzz testing more effective, and thereby make testing in Elm more effective overall.
Property based testing samples a large space of input data, attempting to find an input that fails the test. The faster and more reliably it can find such an input, the more likely the test is to be useful in uncovering defects in the program.
This intuitive property is formalized as the F-metric, defined as the number of test cases the test system needs to generate before a defect is uncovered. The F-metric is a useful proxy for comparing the effectiveness of various schemes for test generation, but it doesn't stand alone. For instance, hiring a Q/A engineer to carefully read the source code and then write failing unit tests might achieve F = 1, yet be quite unhelpful for daily development. Hence I think it is worth combining with a measurement of how long test generation takes: having F = 10,000 but being able to execute 100,000 tests/sec could still be extremely helpful. The problem with that, however, is that as test library authors we don't control how long the assertion in the test actually takes to execute; in many cases the assertion may strongly dominate the test generation strategy. Hence the usefulness of F on its own.
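As a rough back-of-the-envelope framing of this trade-off (my own, not from any literature):

    expected time to first failure ≈ F / (test cases executed per second)

So F = 10,000 at 100,000 cases per second surfaces the defect in about 0.1 seconds on average, while the same F at 100 cases per second would take over a minute and a half.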
This is the main thrust of my proposals, but there are other dimensions along which effectiveness can be increased. First and foremost among these are ways to make authoring fuzz tests easier; second is better tooling for optimizing the test case search. We will look at some of these proposals later.
The source code for Fuzz.float reads:
float : Fuzzer Float
float =
    let
        generator =
            Random.frequency
                ( 3, Random.float -50 50 )
                [ ( 0.5, Random.constant 0 )
                , ( 1, Random.float -1 1 )
                , ( 1, Random.float 0 (toFloat <| Random.maxInt - Random.minInt) )
                , ( 1, Random.float (toFloat <| Random.minInt - Random.maxInt) 0 )
                ]
    in
    custom generator Shrink.float
This generator is biased to produce 0 about 8% of the time and is highly biased towards "small" values. The weights sum to 6.5, so the constant-zero branch fires 0.5 / 6.5 ≈ 8% of the time; and if a defect only occurs at values greater than 50, only about 15% of test cases (the single branch generating large positive values) will probe that area.
I believe this is intentionally biased in the belief that 0 behaves oddly in multiplication and in a mathematically undefined way in division, hence being a value more likely to provoke buggy behavior. This may well be a belief strengthened by the many PBT tutorials that like to show "hidden" bugs in code simple enough to fit an example, and zero-based bugs happen to fit the bill neatly. I haven't seen any evidence that this kind of defect is particularly common in real-world software.
Ultimately there are many mathematically interesting numbers that might cause strange behaviors: 1, π, 2π, e, Infinity, -0 and so on.
But my argument really comes down to the F-metric: if one wants to improve it (that is, find defects in fewer generated cases), then sampling the input space in an unbiased way is, on average, the better bet.
A function fuzzing four booleans has only 16 possible unique cases, but it takes on average about 54 random attempts to cover them all (this is the coupon collector problem: 16 × (1 + 1/2 + ⋯ + 1/16) ≈ 54). It can get worse with some naive fuzzer implementations, like the following for a maybe:
maybe : Fuzzer a -> Fuzzer (Maybe a)
maybe justFuzzer =
    Fuzz.oneOf [ Fuzz.constant Nothing, Fuzz.map Just justFuzzer ]
This will produce Nothing 50% of the time (the real fuzzer is a bit better, producing Nothing only 25% of the time).
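For comparison, behaviour like the real fuzzer's can be sketched with Fuzz.frequency; the 1:3 weights here are my own choice to match the 25% figure, not a quote of the actual elm-test source:

maybe : Fuzzer a -> Fuzzer (Maybe a)
maybe justFuzzer =
    Fuzz.frequency
        [ ( 1, Fuzz.constant Nothing ) -- Nothing one time in four
        , ( 3, Fuzz.map Just justFuzzer )
        ]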
What these two examples have in common is that they can be decomposed into a set of exhaustive tests (i.e. where the whole range of values can reasonably be enumerated) and simpler probabilistic tests (in the boolean case, the probabilistic part is empty).
Now of course what can be considered exhaustive depends on how many tests are being run. For instance, a test that generates a Unicode character should run probabilistically when 100 test cases are requested, but if a million test cases are requested, then running exhaustively would be much more efficient (as there are only 144,697 Unicode characters defined).
Still, I think that having a unified API for both, one that switches automatically between the modes depending on the requested number of test cases, would be helpful in increasing coverage at lower cost.
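To make this a bit more concrete, here is one possible shape such an API could take. Fuzz.exhaustible and its record argument are entirely hypothetical, purely to illustrate switching between enumeration and sampling:

-- Hypothetical sketch: a fuzzer that carries its full domain so the runner
-- can enumerate it when the requested run count covers it, and fall back to
-- random sampling otherwise.
bool : Fuzzer Bool
bool =
    Fuzz.exhaustible
        { all = [ True, False ] -- small enough to enumerate
        , random = Random.uniform True [ False ] -- probabilistic fallback
        }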
Of course one of the issues with trying to do exhaustive testing is that even if it can work for simple fuzz tests, it will likely break down with fuzz2, fuzz3 and beyond, since the test space multiplies drastically.
One could of course simply fall back to fully random testing, but there is some evidence that a more effective route is pairwise testing. The basic idea is that most software defects depend on the particular values of at most two parameters, so exploring the space can be "parallelized" by covering multiple unique pairs of parameter values in each test case.
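As a tiny illustration (my own example, not taken from any particular tool): three boolean parameters have 2³ = 8 full combinations, but the four cases below already contain every possible pair of values for every pair of parameters:

-- Each pair of positions, (a, b), (a, c) and (b, c), takes on all four
-- combinations FF, FT, TF, TT across these four rows.
pairwiseCases : List ( Bool, Bool, Bool )
pairwiseCases =
    [ ( False, False, False )
    , ( False, True, True )
    , ( True, False, True )
    , ( True, True, False )
    ]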
I would recommend reading the Wikipedia entry on this technique, but here I would note that using it as a prioritisation strategy for test case generation in some scenarios could be a quite cheap way to improve the F-metric without too much overhead.
This proposal is probably the most arcane, but perhaps also the most interesting.
Currently elm-test uses pseudorandom numbers to generate test cases. However, there are other approaches available. An exciting one is using quasirandom numbers for test generation. Quasirandom numbers (or low-discrepancy sequences) are sequences of numbers that still appear random, but tend to sample the input region in a more uniform manner. They are currently popular in Monte Carlo simulations, because they tend to make simulations converge much more quickly. The theory goes that software defects are likely to occupy a contiguous region of the input domain, so a quasirandom sequence would be more likely to fall into such a region quickly.
Quasi-random sampling on the left, random on the right. Image from Wikipedia
I think the main downside of this approach is that the mathematics are somewhat arcane and, as far as I know, this hasn't really been implemented in another major testing library. Hence this might be quite high-risk work with somewhat uncertain rewards.
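Still, to give a flavour of what is involved, here is a minimal sketch of the van der Corput sequence, the one-dimensional building block of the Halton low-discrepancy sequence; the function is my own illustration and not part of elm-test or elm/random:

-- The n-th element of the van der Corput sequence in a given base: reflect
-- the base-b digits of n around the decimal point, producing a value in
-- [0, 1) that fills the interval far more evenly than independent random draws.
vanDerCorput : Int -> Int -> Float
vanDerCorput base index =
    let
        go n denominator acc =
            if n <= 0 then
                acc

            else
                go (n // base)
                    (denominator * base)
                    (acc + toFloat (modBy base n) / toFloat (denominator * base))
    in
    go index 1 0

-- vanDerCorput 2 1 == 0.5, vanDerCorput 2 2 == 0.25, vanDerCorput 2 3 == 0.75, ...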
@janiczek has already been working on this one for a while, but the basic idea is to change the way test generation and shrinking work to enable more straightforward composition of fuzzers, the big one being Fuzz.andThen. The absence of andThen makes expressing a lot of generators tricky; one has to accept either very "clever" code that potentially shrinks pathologically, or implementing custom shrinking (which, given how tricky that can be, often leads to noShrink).
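As an example of the kind of generator that is awkward today but becomes straightforward with the proposed andThen, here is a sketch of a fuzzer producing a non-empty list together with a valid index into it (only the andThen call is the proposed part; the other Fuzz functions already exist):

listWithValidIndex : Fuzzer a -> Fuzzer ( List a, Int )
listWithValidIndex itemFuzzer =
    -- Guarantee at least one element, then pick an index that is valid
    -- for the list that was just generated.
    Fuzz.map2 (::) itemFuzzer (Fuzz.list itemFuzzer)
        |> Fuzz.andThen
            (\items ->
                Fuzz.map (\index -> ( items, index ))
                    (Fuzz.intRange 0 (List.length items - 1))
            )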
Editor tooling that could generate somewhat reasonable fuzzers for custom data structures should be reasonably straightforward to put together and could make building tests a lot easier.
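For instance, given a made-up custom type like the one below, the output of such a tool could look roughly like this sketch: one Fuzz.oneOf branch per constructor, with the appropriate argument fuzzers mapped in.

type Shape
    = Circle Float
    | Rectangle Float Float

shapeFuzzer : Fuzzer Shape
shapeFuzzer =
    Fuzz.oneOf
        [ Fuzz.map Circle Fuzz.float
        , Fuzz.map2 Rectangle Fuzz.float Fuzz.float
        ]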
Another solution that could be quite nice would be for packages to be able to expose "testing" modules that include fuzzers for the data structures they expose. For instance, elm-geometry has a fantastic fuzzer library internally, but doesn't expose it to other packages.
Right now, by default elm-test runs 100 fuzzed test cases. For finding defects this is grossly inadequate (at least in a batch-run scenario), as it will generally only find the most obvious bugs. I tend to run CI with --fuzz 10000 for a more reasonable search, but I believe that relatively few people twiddle with the defaults.
However, in my opinion a raw number of test cases is rarely the most sensible way for the user to control this. More typically, different use cases have different time budgets, and within that budget I would like to run as many cases as practical. If I run my tests in watch mode, then I generally want the results available by the time I switch screens from VSCode to my terminal, an action taking roughly 1.5 seconds. If that allows for only 10 test cases, so be it; more will be run the next time I save a file.
On the other hand, in CI I might have several jobs running in parallel, such as building Docker containers, which typically take say 3 minutes. Hence I would like my tests to take exactly that amount of time. If that allows for 62,341 tests, then great!
In general, when fuzz testing finds a failure, it would make sense to save the input that caused the failure and run it first the next time the tool is run. This has several benefits:
- Leaves no doubt whether or not the bug was resolved.
- If I run in watch mode and happen upon a very rare bug, but save again before looking at the test results, I may otherwise never see that failure again.
- Prevents regressions.
Generally this kind of testing is limited by some artificial limit on the number of runs the tool is willing to do; this is because we use CI servers on PRs that we expect to complete in a reasonable amount of time. But I think it would also make sense to run fuzzing continuously on master, restarting every time code is merged and posting reduced test cases back into the issue tracker when issues are found (and closing them when the test case no longer reproduces). Running a small server continuously is not that expensive and could provide a lot of value (especially for more complex fuzz testing like model checking a whole application).
Footnotes
- The one thing it gets amazingly wrong is calling it fuzz testing. Sigh. Fuzz testing is the practice of trying to find inputs that will crash your program or make it behave in unexpected ways. Fuzzers in Elm aren't even trying; for instance, float fuzzers don't send in NaN or -Infinity or even -0. Int fuzzers don't send in fractional numbers masquerading as ints. String fuzzers don't send in snippets that might trigger bugs in elm-parser. ↩