valgrind --tool=cachegrind --cache-sim=yes ./a.out
cg_annotate cachegrind.out.* --auto=yes
Event counters shown in the cg_annotate output:
- Ir: number of instructions executed
- I1mr: I1 cache read misses
- ILmr: LL (last level) cache instruction read misses
- Dr: number of memory reads
- D1mr: D1 cache read misses
- DLmr: LL (last level) cache data read misses
- Dw: number of memory writes
- D1mw: D1 cache write misses
- DLmw: LL (last level) cache data write misses
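The counters above combine into miss rates. A minimal sketch, assuming made-up counter values (they do not come from a real run): the D1 miss rate is all D1 misses (reads plus writes) divided by all data accesses.

```shell
# Hypothetical counter values, standing in for a real cachegrind summary.
Dr=400000 Dw=100000 D1mr=18000 D1mw=2000

# D1 miss rate = (D1mr + D1mw) / (Dr + Dw), as a percentage.
# LC_ALL=C keeps the decimal separator a dot regardless of locale.
d1_miss_rate=$(LC_ALL=C awk -v m="$((D1mr + D1mw))" -v a="$((Dr + Dw))" \
    'BEGIN { printf "%.1f", 100 * m / a }')
echo "D1 miss rate: ${d1_miss_rate}%"
```

The same ratio works for the LL counters (DLmr, DLmw) and for instruction fetches (I1mr / Ir).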
On a modern machine, an L1 miss typically costs around 10 cycles, an LL miss can cost as much as 200 cycles, and a mispredicted branch costs roughly 10 to 30 cycles. Cachegrind only counts branches and mispredictions when run with --branch-sim=yes. Detailed cache and branch profiling shows how your program interacts with the machine, and therefore where it can be made faster.
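As a rough worked example of those figures: a back-of-the-envelope cycle estimate from the counters. Everything here is an assumption for illustration: the counts are invented, one cycle per instruction is a simplification, 20 is just a midpoint of the 10 to 30 range, and 200 is the worst case for an LL miss. This is not a number cachegrind itself reports.

```shell
# Hypothetical totals (not from a real profile).
Ir=1000000          # instructions executed
L1_misses=20000     # I1mr + D1mr + D1mw
LL_misses=500       # ILmr + DLmr + DLmw
Bm=3000             # mispredicted branches (needs --branch-sim=yes)

# Assumed weights: 1 cycle/instruction, 10 per L1 miss,
# 200 per LL miss (worst case), 20 per mispredicted branch.
est_cycles=$(( Ir + 10 * L1_misses + 200 * LL_misses + 20 * Bm ))
echo "estimated cycles: $est_cycles"
```

With these numbers, misses and mispredictions add 360000 cycles to the 1000000 instruction cycles, so the estimate is mainly useful for comparing two versions of the same code, not as an absolute timing.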
brew install graphviz qcachegrind
valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes --cache-sim=yes ./a.out
Profile with --dump-instr=yes to record per-instruction costs, which KCachegrind/QCachegrind needs for its assembly view.
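For a quick text view without the GUI, callgrind_annotate (shipped with Valgrind) can read the same output file. A minimal sketch, assuming the profile was written to the current directory under the default callgrind.out.<pid> name:

```shell
# Pick the most recently modified callgrind output file, if there is one.
latest=$(ls -t callgrind.out.* 2>/dev/null | head -n 1 || true)

if [ -n "$latest" ] && command -v callgrind_annotate >/dev/null 2>&1; then
    # --auto=yes also prints annotated source for the profiled files.
    callgrind_annotate --auto=yes "$latest"
else
    echo "no callgrind.out.* file here, or callgrind_annotate is not installed"
fi
```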
https://baptiste-wicht.com/posts/2011/09/profile-c-application-with-callgrind-kcachegrind.html