This tool is used to compare microbenchmarks across two versions of code. It's paranoid about nulling out timing error, so the numbers should be meaningful. It runs the benchmarks many times, scaling the iterations up if the benchmark is extremely short, and it nulls out its own timing overhead while doing so. It reports results graphically with a text interface in the terminal.
You first run it with --record, which generates a JSON dotfile with runtimes for each of your benchmarks. Then you change the code and run again with --compare, which re-runs the benchmarks and generates comparison plots between your recorded and current times. In the example output, I did a --record on the master branch, then switched to my scoring_redesign branch and did a --compare. In my output, three of the benchmarks' runtimes look to be unchanged; the other three got significantly slower. As you can see at the bottom of the output, the entire process takes about a second from hitting enter to having full output, so I can get very rapid performance feedback.
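To make the workflow concrete, here is a minimal sketch of what the record/compare cycle could look like. The dotfile name, JSON layout, and helper names are my own illustrative assumptions, not the tool's actual implementation:

# Hypothetical sketch of the --record / --compare plumbing described above.
require "json"

DOTFILE = ".benchmark_times.json"  # assumed filename, not the tool's real one

# `results` maps each benchmark's name to its array of sampled runtimes.
def record(results)
  File.write(DOTFILE, JSON.pretty_generate(results))
end

def compare(current_results)
  recorded = JSON.parse(File.read(DOTFILE))
  current_results.each do |name, times|
    before = recorded.fetch(name)
    puts "#{name}: median #{median(before).round(4)}s -> #{median(times).round(4)}s"
  end
end

def median(values)
  values.sort[values.length / 2]
end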
It goes to great lengths to null out both constant timing offsets and timing jitter. Specifically:
* All benchmarks are run 16 times, with all of those data points being used in the output (not just the average, median, or minimum).
* They're run both with and without garbage collection, and both results are reported.
* The cost of the benchmark runner itself is nulled out in a way that should be very accurate. It times the user-supplied block, then repeats the timing with an empty block, subtracting the latter from the former.
* It guarantees a minimum test runtime for precise timing. First, it runs the block directly. If it takes less than 1ms, it's deemed unsafe and the process is restarted, but now the block will be run twice. If it's still under 1ms, that will be doubled to four times, etc., until it takes at least 1ms. This is what the "!"s are in the output: each of those is the benchmark being restarted with double the number of repetitions. The 1ms limit is checked after correcting for test runner overhead as described above, so the user-provided benchmark block itself is guaranteed to use at least 1ms of actual CPU time. (A sketch of this timing loop follows the list.)
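Here is a sketch of the timing loop the list above describes: time the block, subtract the cost of timing an empty block, and keep doubling the repetition count until the corrected runtime reaches 1ms. The method names and structure are illustrative guesses, not the tool's internals:

# Time `repetitions` runs of the given block using a monotonic clock.
def time_block(repetitions, &block)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  repetitions.times(&block)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Null out the runner's own overhead by timing the same loop with an empty block.
def corrected_runtime(repetitions, &block)
  time_block(repetitions, &block) - time_block(repetitions) { }
end

# Find a repetition count that gives at least 1ms of corrected runtime,
# printing a "!" each time we restart with double the repetitions.
def calibrate(&block)
  repetitions = 1
  loop do
    return repetitions if corrected_runtime(repetitions, &block) >= 0.001
    print "!"
    repetitions *= 2
  end
end

# Take 16 samples with GC enabled and 16 with it disabled.
def run_benchmark(&block)
  repetitions = calibrate(&block)
  with_gc    = Array.new(16) { GC.enable;  corrected_runtime(repetitions, &block) }
  without_gc = Array.new(16) { GC.disable; corrected_runtime(repetitions, &block) }
  GC.enable
  { "GC" => with_gc, "No GC" => without_gc }
end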
In the plots, "X" is the median of the 16 iterations. The horizontal lines extend to the minimum and maximum values. This gives you a clear visualization of timing jitter. Each plot's range goes from 0 to the maximum sampled runtime to avoid confusing effects of tightly-zoomed plots. If the lines don't overlap, you can have good confidence that there's a real performance difference. In the second through fourth benchmarks below, you can see a big difference: I've made them much slower by changing Selecta's scoring algorithm. (This was my actual motivation for creating the tool. It's not a synthetic example.) I haven't wanted fine detail so far, so the text plots have been sufficient. I may add numbers eventually.
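The plots themselves are simple to draw. Here is a rough sketch of how a row like the ones below could be rendered, with "X" at the median, dashes out to the min and max, and a shared scale from 0 to the largest sampled runtime. This is illustrative only, and the sample numbers in the usage example are made up:

# Render one plot row: dashes span min..max, "X" marks the median.
def plot_row(label, times, scale_max, width = 50)
  col = ->(t) { (t / scale_max * (width - 1)).round }
  row = Array.new(width, " ")
  (col[times.min]..col[times.max]).each { |i| row[i] = "-" }
  row[col[times.sort[times.length / 2]]] = "X"
  format("%18s: |%s|", label, row.join)
end

# Usage example with made-up sample data.
samples = {
  "Before (No GC)" => [0.0031, 0.0035, 0.0042],
  "After (No GC)"  => [0.0090, 0.0101, 0.0105],
}
scale_max = samples.values.flatten.max
samples.each { |label, times| puts plot_row(label, times, scale_max) }
puts " " * 21 + "0" + scale_max.to_s.rjust(50)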
This tool is Ruby-specific, but its design certainly isn't. This is the benchmark tool that I've wanted for years, since before I even knew Ruby. It 1) gives me the bare minimum statistics needed for confidence; 2) nulls out every source of timing offset or jitter that I know of and can control; and 3) automatically scales the repetitions up to get significant samples.
# I haven't thought much about the API yet.
# bench_group is like RSpec's "describe"; bench is like "it".
# Before and after blocks are supported for setup and teardown (not used here).
# Score is Selecta's scoring class and WORDS is a list of test words; both are
# loaded elsewhere.
bench_group "filtering" do
  bench "non-matching" do
    Score.score("x" * 16, "y" * 16)
  end

  bench "matching exactly" do
    Score.score("x" * 16, "x" * 16)
  end

  bench "matching broken up" do
    Score.score("xy" * 20, "x" * 10)
  end

  bench "overlapping matches" do
    Score.score("x" * 40, "x" * 10)
  end

  bench "words, non-matching" do
    WORDS[0, 1000].each { |choice| Score.score(choice, "x" * 16) }
  end

  bench "words, matching" do
    WORDS[0, 1000].each { |choice| Score.score(choice, WORDS[500]) }
  end
end
filtering non-matching          !!!!!!!!................!!!!!!!!................
filtering matching exactly      !!................!!................
filtering matching broken up    ................................
filtering overlapping matches   ................................
filtering words, non-matching   ................................
filtering words, matching       ................................

filtering non-matching
     Before (GC): | -----X--------
  Before (No GC): | ---X---------
      After (GC): | -----X-----------
   After (No GC): | --X----------------------------|
                  0                           0.0105

filtering matching exactly
     Before (GC): | X---
  Before (No GC): | X--
      After (GC): | -----X----------------------
   After (No GC): | ----X-----------------------------|
                  0                               0.715

filtering matching broken up
     Before (GC): |X-
  Before (No GC): |X-
      After (GC): | --X---
   After (No GC): | ---X---------------------------------|
                  0                                   2.31

filtering overlapping matches
     Before (GC): |X-
  Before (No GC): |X-
      After (GC): | ---X---------------------------------|
   After (No GC): | ----X-----------------------
                  0                                      6

filtering words, non-matching
     Before (GC): | ---X----------------
  Before (No GC): | ---X----
      After (GC): | --------X--------------------------------------|
   After (No GC): | ---X--------------
                  0                                               20

filtering words, matching
     Before (GC): | -X-------------------------------------------
  Before (No GC): | --X---------------------
      After (GC): | ---X-----------------------------------------|
   After (No GC): | -X--------
                  0                                          20.1

ruby benchmark.rb --compare 1.03s user 0.08s system 97% cpu 1.136 total
I thought about units more after we exchanged those toots and had the same realization: there's an ambiguity between time and IPS. I'll probably switch this tool to IPS, which seems more intuitive (higher numbers are better). Adding " IPS" to the values is trivial, of course.
But for a gist, you can just show both forms of the code followed by the output of the comparison. That seems to give all of the information present with a multi-variant syntax, but without needing an extra feature.
I see your point about the gists. But that's a bit of work for the person reading the gist to make a comparison. And even more if there are a bunch of variants. (For example, various sorting algorithms.) Before/after (record/compare) doesn't handle more than 2 choices well. But like I said, this is a different use case, and it might not be one that you care about right now.
Anyway, I'm glad I was able to get you to think more about how it works. I can't believe more people aren't excited about this enough to be commenting. I guess you weren't controversial enough in the Tweet announcing it. Recommendation for the 1.0 announcement: "Benchmarking is Dead".
I can think of a use case for having both variants in the code next to each other -- sharing the code with other people in an easier way than multiple branches. For example, in a gist. ;)