This tool is used to compare microbenchmarks across two versions of code. It's paranoid about nulling out timing error, so the numbers should be meaningful. It runs the benchmarks many times, scaling the iterations up if the benchmark is extremely short, and it nulls out its own timing overhead while doing so. It reports results graphically with a text interface in the terminal.
You first run it with --record, which generates a JSON dotfile with runtimes for each of your benchmarks. Then you change the code and run again with --compare, which re-runs the benchmarks and generates comparison plots between your recorded and current times. In the example output, I did a --record on the master branch, then switched to my scoring_redesign branch and did a --compare. In my output, three of the benchmarks' runtimes look to be unchanged; the other three got significantly slower. As you can see at the bottom of the output, the entire process takes about a second from hitting enter to having full output, so I can get very rapid performance feedback.
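To make the workflow concrete, here is a minimal sketch of what the record/compare cycle could look like. The dotfile name, JSON layout, and helper names are my own illustrative assumptions, not the tool's actual implementation:

# Hypothetical sketch of the --record / --compare plumbing described above.
require "json"

DOTFILE = ".benchmark_times.json"  # assumed filename, not the tool's real one

# `results` maps each benchmark's name to its array of sampled runtimes.
def record(results)
  File.write(DOTFILE, JSON.pretty_generate(results))
end

def compare(current_results)
  recorded = JSON.parse(File.read(DOTFILE))
  current_results.each do |name, times|
    before = recorded.fetch(name)
    puts "#{name}: median #{median(before).round(4)}s -> #{median(times).round(4)}s"
  end
end

def median(values)
  values.sort[values.length / 2]
end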
It goes to great lengths to null out both constant timing offsets and timing jitter. Specifically:
* All benchmarks are run 16 times, with all of those data points being used in the output (not just the average, median, or minimum).
* They're run both with and without garbage collection, and both results are reported.
* The cost of the benchmark runner itself is nulled out in a way that should be very accurate. It times the user-supplied block, then repeats the timing with an empty block, subtracting the latter from the former.
* It guarantees a minimum test runtime for precise timing. First, it runs the block directly. If it takes less than 1ms, it's deemed unsafe and the process is restarted, but now the block will be run twice. If it's still under 1ms, that will be doubled to four times, etc., until it takes at least 1ms. This is what the "!"s are in the output: each of those is the benchmark being restarted with double the number of repetitions. The 1ms limit is checked after correcting for test runner overhead as described above, so the user-provided benchmark block itself is guaranteed to use at least 1ms of actual CPU time. (A sketch of this timing loop follows the list.)
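Here is a sketch of the timing loop the list above describes: time the block, subtract the cost of timing an empty block, and keep doubling the repetition count until the corrected runtime reaches 1ms. The method names and structure are illustrative guesses, not the tool's internals:

# Time `repetitions` runs of the given block using a monotonic clock.
def time_block(repetitions, &block)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  repetitions.times(&block)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Null out the runner's own overhead by timing the same loop with an empty block.
def corrected_runtime(repetitions, &block)
  time_block(repetitions, &block) - time_block(repetitions) { }
end

# Find a repetition count that gives at least 1ms of corrected runtime,
# printing a "!" each time we restart with double the repetitions.
def calibrate(&block)
  repetitions = 1
  loop do
    return repetitions if corrected_runtime(repetitions, &block) >= 0.001
    print "!"
    repetitions *= 2
  end
end

# Take 16 samples with GC enabled and 16 with it disabled.
def run_benchmark(&block)
  repetitions = calibrate(&block)
  with_gc    = Array.new(16) { GC.enable;  corrected_runtime(repetitions, &block) }
  without_gc = Array.new(16) { GC.disable; corrected_runtime(repetitions, &block) }
  GC.enable
  { "GC" => with_gc, "No GC" => without_gc }
end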
In the plots, "X" is the median of the 16 iterations. The horizontal lines extend to the minimum and maximum values. This gives you a clear visualization of timing jitter. Each plot's range goes from 0 to the maximum sampled runtime to avoid confusing effects of tightly-zoomed plots. If the lines don't overlap, you can have good confidence that there's a real performance difference. In the second through fourth benchmarks below, you can see a big difference: I've made them much slower by changing Selecta's scoring algorithm. (This was my actual motivation for creating the tool. It's not a synthetic example.) I haven't wanted fine detail so far, so the text plots have been sufficient. I may add numbers eventually.
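The plots themselves are simple to draw. Here is a rough sketch of how a row like the ones below could be rendered, with "X" at the median, dashes out to the min and max, and a shared scale from 0 to the largest sampled runtime. This is illustrative only, and the sample numbers in the usage example are made up:

# Render one plot row: dashes span min..max, "X" marks the median.
def plot_row(label, times, scale_max, width = 50)
  col = ->(t) { (t / scale_max * (width - 1)).round }
  row = Array.new(width, " ")
  (col[times.min]..col[times.max]).each { |i| row[i] = "-" }
  row[col[times.sort[times.length / 2]]] = "X"
  format("%18s: |%s|", label, row.join)
end

# Usage example with made-up sample data.
samples = {
  "Before (No GC)" => [0.0031, 0.0035, 0.0042],
  "After (No GC)"  => [0.0090, 0.0101, 0.0105],
}
scale_max = samples.values.flatten.max
samples.each { |label, times| puts plot_row(label, times, scale_max) }
puts " " * 21 + "0" + scale_max.to_s.rjust(50)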
This tool is Ruby-specific, but its design certainly isn't. This is the benchmark tool that I've wanted for years, since before I even knew Ruby. It 1) gives me the bare minimum statistics needed for confidence; 2) nulls out every source of timing offset or jitter that I know of and can control; and 3) automatically scales the repetitions up to get significant samples.
# I haven't thought much about the API yet.
# bench_group is like RSpec's "describe"; bench is like "it".
# Before and after blocks are supported for setup and teardown (not used here).
# Score is Selecta's scoring class and WORDS is a list of test words; both are
# loaded elsewhere.
bench_group "filtering" do
  bench "non-matching" do
    Score.score("x" * 16, "y" * 16)
  end

  bench "matching exactly" do
    Score.score("x" * 16, "x" * 16)
  end

  bench "matching broken up" do
    Score.score("xy" * 20, "x" * 10)
  end

  bench "overlapping matches" do
    Score.score("x" * 40, "x" * 10)
  end

  bench "words, non-matching" do
    WORDS[0, 1000].each { |choice| Score.score(choice, "x" * 16) }
  end

  bench "words, matching" do
    WORDS[0, 1000].each { |choice| Score.score(choice, WORDS[500]) }
  end
end
filtering non-matching          !!!!!!!!................!!!!!!!!................
filtering matching exactly      !!................!!................
filtering matching broken up    ................................
filtering overlapping matches   ................................
filtering words, non-matching   ................................
filtering words, matching       ................................

filtering non-matching
     Before (GC): | -----X--------
  Before (No GC): | ---X---------
      After (GC): | -----X-----------
   After (No GC): | --X----------------------------|
                  0                           0.0105

filtering matching exactly
     Before (GC): | X---
  Before (No GC): | X--
      After (GC): | -----X----------------------
   After (No GC): | ----X-----------------------------|
                  0                               0.715

filtering matching broken up
     Before (GC): |X-
  Before (No GC): |X-
      After (GC): | --X---
   After (No GC): | ---X---------------------------------|
                  0                                   2.31

filtering overlapping matches
     Before (GC): |X-
  Before (No GC): |X-
      After (GC): | ---X---------------------------------|
   After (No GC): | ----X-----------------------
                  0                                      6

filtering words, non-matching
     Before (GC): | ---X----------------
  Before (No GC): | ---X----
      After (GC): | --------X--------------------------------------|
   After (No GC): | ---X--------------
                  0                                               20

filtering words, matching
     Before (GC): | -X-------------------------------------------
  Before (No GC): | --X---------------------
      After (GC): | ---X-----------------------------------------|
   After (No GC): | -X--------
                  0                                          20.1

ruby benchmark.rb --compare 1.03s user 0.08s system 97% cpu 1.136 total
I thought about units more after we exchanged those toots and had the same realization: there's an ambiguity between time and IPS. I'll probably switch this tool to IPS, which seems more intuitive (higher numbers are better). Adding " IPS" to the values is trivial, of course.
But for a gist, you can just show both forms of the code followed by the output of the comparison. That seems to give all of the information present with a multi-variant syntax, but without needing an extra feature.
I see your point about the gists. But that's a bit of work for the person reading the gist to make a comparison. And even more if there are a bunch of variants. (For example, various sorting algorithms.) Before/after (record/compare) doesn't handle more than 2 choices well. But like I said, this is a different use case, and it might not be one that you care about right now.
Anyway, I'm glad I was able to get you to think more about how it works. I can't believe more people aren't excited about this enough to be commenting. I guess you weren't controversial enough in the Tweet announcing it. Recommendation for the 1.0 announcement: "Benchmarking is Dead".
I can think of a use case for having both variants in the code next to each other -- sharing the code with other people in an easier way than multiple branches. For example, in a gist. ;)