This tool is used to compare microbenchmarks across two versions of code. It's
paranoid about nulling out timing error, so the numbers should be meaningful.
It runs the benchmarks many times, scaling the iterations up if the benchmark
is extremely short, and it nulls out its own timing overhead while doing so. It
reports results graphically with a text interface in the terminal.

You first run it with --record, which generates a JSON dotfile with runtimes
for each of your benchmarks. Then you change the code and run again with
--compare, which re-runs and generates comparison plots between your recorded
and current times. In the example output, I did a --record on the master
branch, then switched to my scoring_redesign branch and did a --compare. In my
output, three of the benchmarks' runtimes look to be unchanged; the other three
got significantly slower. As you can see at the bottom of the output, the
entire process takes about a second from hitting enter to having full output,
so I can get very rapid performance feedback.
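In practice, the run shown in the output below was roughly this sequence (a
reconstruction from the description above, not a transcript; the file name is
taken from the timing line at the end of the output):

ruby benchmark.rb --record     # on master: write runtimes to the JSON dotfile
git checkout scoring_redesign  # switch to the branch being evaluated
ruby benchmark.rb --compare    # re-run and plot against the recorded runtimes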
It goes to great lengths to null out both constant timing offsets and timing
jitter. Specifically:

* All benchmarks are run 16 times, with all of those data points being used in
  the output (not just the average, median, or minimum).
* They're run with and without garbage collection, with both being reported.
* The cost of the benchmark runner itself is nulled out in a way that should be
  very accurate. It times the user-supplied block, then repeats the timing with
  an empty block, subtracting the latter from the former.
* It guarantees a minimum test runtime for precise timing. First, it runs the
  block directly. If it takes less than 1ms, it's deemed unsafe and the process
  is restarted, but now the block will be run twice. If it's still under 1ms,
  that will be doubled to four times, etc., until it takes at least 1ms. This
  is what the "!"s are in the output: each of those is the benchmark being
  restarted with double the number of repetitions. The 1ms limit is checked
  after correcting for test runner overhead as described above, so the
  user-provided benchmark block itself is guaranteed to use at least 1ms of
  actual CPU time. (The overhead correction and doubling loop are sketched in
  code just after this list.)
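Here is a minimal sketch of that overhead correction and doubling loop,
assuming hypothetical names (time_block, measure, MIN_RUNTIME); it measures CPU
time via Ruby's Process.clock_gettime and omits the 16 repeated samples and the
GC on/off runs that the real tool performs:

# Hypothetical sketch, not the tool's actual code: null out the runner's own
# cost, and double the repetition count until the corrected time reaches 1ms.
MIN_RUNTIME = 0.001 # seconds

def cpu_time
  Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)
end

def time_block(repetitions, &block)
  start = cpu_time
  repetitions.times(&block)
  cpu_time - start
end

def measure(&block)
  repetitions = 1
  loop do
    raw       = time_block(repetitions, &block) # user-supplied block, repeated
    overhead  = time_block(repetitions) { }     # same loop with an empty block
    corrected = raw - overhead                  # runner overhead nulled out
    return corrected if corrected >= MIN_RUNTIME
    print "!"        # each "!" in the output is a restart like this one
    repetitions *= 2
  end
end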
In the plots, "X" is the median of the 16 iterations. The horizontal lines
extend to the minimum and maximum values. This gives you a clear visualization
of timing jitter. Each plot's range goes from 0 to the maximum sampled runtime
to avoid confusing effects of tightly-zoomed plots. If the lines don't overlap,
you can have good confidence that there's a real performance difference. In the
second through fourth benchmarks below, you can see a big difference: I've made
them much slower by changing Selecta's scoring algorithm. (This was my actual
motivation for creating the tool. It's not a synthetic example.) I haven't
wanted fine detail so far, so the text plots have been sufficient. I may add
numbers eventually.
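To illustrate how a row like that can be rendered (the function name and the
sample values below are made up, not taken from the tool), each row only needs
the minimum, median, and maximum mapped onto a fixed-width axis that runs from
0 to the largest runtime sampled across all rows:

# Hypothetical rendering sketch: "X" marks the median, dashes run from the
# minimum to the maximum, and the axis spans 0 to the overall maximum so the
# plot is never tightly zoomed.
def plot_row(label, samples, axis_max, width = 40)
  scale = ->(value) { (value / axis_max * (width - 1)).round }
  low, high = scale.call(samples.min), scale.call(samples.max)
  median = scale.call(samples.sort[samples.size / 2])
  row = Array.new(width, " ")
  (low..high).each { |i| row[i] = "-" }
  row[median] = "X"
  format("%-16s |%s|", label, row.join)
end

puts plot_row("Before (GC):", [0.0031, 0.0042, 0.0049, 0.0058], 0.0105)

Sharing one axis_max across the Before and After rows is what makes the bars
directly comparable at a glance.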
This tool is Ruby-specific, but its design certainly isn't. This is the
benchmark tool that I've wanted for years, since before I even knew Ruby. It 1)
gives me the bare minimum statistics needed for confidence; 2) nulls out every
source of timing offset or jitter that I know of and can control; and 3)
automatically scales the repetitions up to get significant samples.
# I haven't thought much about the API yet.
# bench_group is like RSpec's "describe"; bench is like "it".
# Before and after blocks are supported for setup and teardown (not used here).
bench_group "filtering" do
  bench "non-matching" do
    Score.score("x" * 16, "y" * 16)
  end

  bench "matching exactly" do
    Score.score("x" * 16, "x" * 16)
  end

  bench "matching broken up" do
    Score.score("xy" * 20, "x" * 10)
  end

  bench "overlapping matches" do
    Score.score("x" * 40, "x" * 10)
  end

  bench "words, non-matching" do
    WORDS[0, 1000].each { |choice| Score.score(choice, "x" * 16) }
  end

  bench "words, matching" do
    WORDS[0, 1000].each { |choice| Score.score(choice, WORDS[500]) }
  end
end
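(For illustration only, and a guess rather than the gist's actual
implementation: a DSL like the one above could be backed by bench_group and
bench simply collecting named blocks into a registry for the runner to time
later.)

# Hypothetical plumbing sketch; the real tool's internals are not shown here.
BENCHMARKS = []

def bench_group(group_name, &definition)
  @current_group = group_name
  definition.call
ensure
  @current_group = nil
end

def bench(name, &block)
  BENCHMARKS << { name: "#{@current_group} #{name}", block: block }
end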
filtering non-matching !!!!!!!!................!!!!!!!!................
filtering matching exactly !!................!!................
filtering matching broken up ................................
filtering overlapping matches ................................
filtering words, non-matching ................................
filtering words, matching ................................

filtering non-matching
Before (GC): | -----X-------- |
Before (No GC): | ---X--------- |
After (GC): | -----X----------- |
After (No GC): | --X----------------------------|
0 0.0105

filtering matching exactly
Before (GC): | X--- |
Before (No GC): | X-- |
After (GC): | -----X---------------------- |
After (No GC): | ----X-----------------------------|
0 0.715

filtering matching broken up
Before (GC): |X- |
Before (No GC): |X- |
After (GC): | --X--- |
After (No GC): | ---X---------------------------------|
0 2.31

filtering overlapping matches
Before (GC): |X- |
Before (No GC): |X- |
After (GC): | ---X---------------------------------|
After (No GC): | ----X----------------------- |
0 6

filtering words, non-matching
Before (GC): | ---X---------------- |
Before (No GC): | ---X---- |
After (GC): | --------X--------------------------------------|
After (No GC): | ---X-------------- |
0 20

filtering words, matching
Before (GC): | -X------------------------------------------- |
Before (No GC): | --X--------------------- |
After (GC): | ---X-----------------------------------------|
After (No GC): | -X-------- |
0 20.1
ruby benchmark.rb --compare 1.03s user 0.08s system 97% cpu 1.136 total |
I can think of a use case for having both variants in the code next to each other -- sharing the code with other people in an easier way than multiple branches. For example, in a gist. ;)
I thought about units more after we exchanged those toots and had the same realization: there's an ambiguity between time and IPS. I'll probably switch this tool to IPS, which seems more intuitive (higher numbers are better). Adding " IPS" to the values is trivial, of course.
But for a gist, you can just show both forms of the code followed by the output of the comparison. That seems to give all of the information present with a multi-variant syntax, but without needing an extra feature.
I see your point about the gists. But that's a bit of work for the person reading the gist to make a comparison. And even more if there are a bunch of variants. (For example, various sorting algorithms.) Before/after (record/compare) doesn't handle more than 2 choices well. But like I said, this is a different use case, and it might not be one that you care about right now.
Anyway, I'm glad I was able to get you to think more about how it works. I can't believe more people aren't excited about this enough to be commenting. I guess you weren't controversial enough in the Tweet announcing it. Recommendation for the 1.0 announcement: "Benchmarking is Dead".
Interesting thought from @garybernhardt (via Twitter) on not needing units on the graphs: "Units don't really matter for relative comparisons." Makes a lot of sense. And seeing as how we don't show how many runs it has gone through, the units wouldn't be that helpful anyway.
But it might be useful to get a ballpark on how fast my algorithms are running, so I don't end up picking the best of 2 really slow algorithms.
Another issue with not having units is that it's kind of unclear whether these are time intervals, or number of runs per time interval.