
@karimnaaji
Last active March 12, 2024 09:28
Valgrind steps to get cache misses
valgrind --tool=cachegrind --cache-sim=yes ./a.out
cg_annotate cachegrind.out.* --auto=yes

Instruction cache misses

  • Ir: number of instructions executed
  • I1mr: I1 cache read misses
  • ILmr: LL (last level) cache instruction read misses

Data cache misses

  • Dr: number of memory reads
  • D1mr: D1 cache read misses
  • DLmr: LL (last level) cache data read misses
  • Dw: number of memory writes
  • D1mw: D1 cache write misses
  • DLmw: LL (last level) cache data write misses
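The counters above are easier to compare once they are turned into miss rates. The following is a minimal sketch, assuming the cachegrind output file has an `events:` line naming the counters and a `summary:` line holding the totals in the same order (check this against your own file). The helper names `parse_totals` and `miss_rates` are hypothetical, and the sample numbers are made up for illustration.

```python
def parse_totals(text):
    """Map each event name to its program-wide total.

    Assumes one 'events:' line and one 'summary:' line, with the values
    listed in the same order as the event names.
    """
    events, totals = [], []
    for line in text.splitlines():
        if line.startswith("events:"):
            events = line.split()[1:]
        elif line.startswith("summary:"):
            totals = [int(v) for v in line.split()[1:]]
    return dict(zip(events, totals))


def miss_rates(c):
    """Return I1, D1 and LL miss rates as fractions of accesses.

    Raises KeyError if the miss counters are absent, for example when
    the profile was recorded without cache simulation.
    """
    data_refs = c["Dr"] + c["Dw"]
    all_refs = c["Ir"] + data_refs
    return {
        "I1": c["I1mr"] / c["Ir"],
        "D1": (c["D1mr"] + c["D1mw"]) / data_refs,
        "LL": (c["ILmr"] + c["DLmr"] + c["DLmw"]) / all_refs,
    }


# Made-up totals for illustration, not real profiler output.
sample = """events: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
summary: 1000000 500 100 300000 15000 3000 100000 5000 1000
"""
rates = miss_rates(parse_totals(sample))
print({name: f"{rate:.2%}" for name, rate in rates.items()})
```

To try it on a real run, read the file with `open(path).read()` and pass the text to `parse_totals`.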

On a modern machine, an L1 miss typically costs around 10 cycles, an LL miss can cost as much as 200 cycles, and a mispredicted branch costs roughly 10 to 30 cycles. Branch counts are only collected when cachegrind is run with --branch-sim=yes. Detailed cache and branch profiling shows how your program interacts with the machine, and therefore where to look to make it faster.
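The ballpark figures above can be combined with the miss counters into a rough stall estimate. This is a back-of-the-envelope sketch, not a model of any real CPU: it assumes each L1 miss costs 10 cycles and each LL miss a further 200 cycles, and it ignores branches and overlapping misses. The function name `estimated_stall_cycles` and the sample counts are hypothetical.

```python
# Per-miss penalties in cycles, taken from the rough figures quoted above.
# Real hardware varies widely, so treat both constants as assumptions.
L1_MISS_CYCLES = 10
LL_MISS_CYCLES = 200


def estimated_stall_cycles(c):
    """Estimate the cycles lost to cache misses from cachegrind totals."""
    l1_misses = c["I1mr"] + c["D1mr"] + c["D1mw"]
    ll_misses = c["ILmr"] + c["DLmr"] + c["DLmw"]
    return L1_MISS_CYCLES * l1_misses + LL_MISS_CYCLES * ll_misses


# Made-up totals for illustration, not real profiler output.
counts = {
    "I1mr": 500, "D1mr": 15000, "D1mw": 5000,
    "ILmr": 100, "DLmr": 3000, "DLmw": 1000,
}
print(estimated_stall_cycles(counts))
```

With these sample counts the LL misses dominate the estimate even though they are far fewer than the L1 misses, which is why they are usually the first thing to look at.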

Visualizing data

brew install qcachegrind graphviz
valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes --cache-sim=yes ./a.out

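Profile with --dump-instr=yes to record costs per instruction, which lets qcachegrind annotate the assembly. Open the resulting callgrind.out.* file in qcachegrind to browse the profile.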

https://baptiste-wicht.com/posts/2011/09/profile-c-application-with-callgrind-kcachegrind.html
