Benchmarking does not seem to be the primary focus of any one academic field, although the problem has been addressed by many different groups within computer science.
Some papers I found interesting:
- http://research.microsoft.com/en-us/um/people/nswamy/papers/bottlenecks-ecoop04.pdf
- http://www.sigmod.org/publications/sigmod-record/0806/p45.dewitt.pdf
- http://daniel-wilkerson.appspot.com/trend-prof.pdf
- http://sape.inf.usi.ch/sites/default/files/publication/EvaluateCollaboratoryTR1.pdf
- http://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf
- http://buytaert.net/files/oopsla07-georges.pdf
- http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=49E8694C39A1E8B45581B1B3A57F55BA?doi=10.1.1.43.7647&rep=rep1&type=pd
By far the most basic (and in my mind the most interesting) of these papers is:
Lots of people (including me) have written libraries for benchmarking functions. By far the most interesting I've seen is Bryan O'Sullivan's criterion library.
Other libraries I came across include:
- Codespeed: https://github.com/tobami/codespeed/wiki/Overview
- VBench: http://wesmckinney.com/blog/?p=373
- Benchmark.js: http://benchmarkjs.com/docs
- Trend Prof: http://trend-prof.tigris.org
- Criterion.rs: https://github.com/japaric/criterion.rs
- Rust official benchmarks: http://web.mit.edu/rust-lang_v0.9/doc/guide-testing.html
- Go official benchmarks: http://golang.org/pkg/testing/ (a minimal example follows this list)
- Airspeed Velocity: http://spacetelescope.github.io/asv/using.html
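For concreteness, here is a rough sketch of what the Go testing package's built-in benchmarks look like (the naive fib workload and the names are my own illustration, not something taken from the Go docs). A benchmark is a function whose name starts with Benchmark, lives in a file ending in _test.go, and takes a *testing.B; the testing package chooses b.N so the loop runs long enough to give a stable per-iteration time, and `go test -bench=.` runs it and prints a ns/op figure.

```go
package fib

import "testing"

// Naive Fibonacci as a stand-in workload.
func fib(n int) int {
	if n < 2 {
		return n
	}
	return fib(n-1) + fib(n-2)
}

// Assigning to a package-level sink keeps the call from being optimized away.
var result int

// BenchmarkFib20 times fib(20); the framework picks b.N.
func BenchmarkFib20(b *testing.B) {
	for i := 0; i < b.N; i++ {
		result = fib(20)
	}
}
```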
Questions I ask when evaluating these tools:
- Does the benchmarking tool account for uncertainty in its measurements?
- Does the benchmarking tool extrapolate across inputs?
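To make those two questions concrete, here is a hand-rolled sketch (deliberately not tied to any of the tools above; the fib workload, the repetition count, and the input sizes are arbitrary choices of mine). It reruns a workload several times per input size and reports a mean with a standard error. A tool that accounts for uncertainty reports an interval like that instead of a single number; a tool that extrapolates across inputs goes one step further and fits a model to the per-size estimates rather than leaving that step to the reader.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Naive Fibonacci as a stand-in workload.
func fib(n int) int {
	if n < 2 {
		return n
	}
	return fib(n-1) + fib(n-2)
}

// sink keeps the compiler from discarding the benchmarked call.
var sink int

func main() {
	const reps = 20
	for _, n := range []int{20, 25, 30} {
		// Repeat the measurement so its spread can be quantified.
		samples := make([]float64, reps)
		for i := range samples {
			start := time.Now()
			sink = fib(n)
			samples[i] = time.Since(start).Seconds()
		}

		// Mean and standard error of the repeated timings: the kind of
		// interval a tool should report if it accounts for uncertainty.
		mean := 0.0
		for _, s := range samples {
			mean += s
		}
		mean /= float64(reps)
		sumSq := 0.0
		for _, s := range samples {
			sumSq += (s - mean) * (s - mean)
		}
		stdErr := math.Sqrt(sumSq/float64(reps-1)) / math.Sqrt(float64(reps))

		// One line per input size; extrapolating across inputs would mean
		// fitting a curve to these per-size means.
		fmt.Printf("fib(%d): %.6fs +/- %.6fs (std. error over %d runs)\n",
			n, mean, stdErr, reps)
	}
}
```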