@MattPD
Last active May 1, 2016 22:12
Related to discussion of
http://larshagencpp.github.io/blog/2016/05/01/a-cache-miss-is-not-a-cache-miss
at https://twitter.com/gregerlars/status/726781584481878017
Three variants:
- Simple: cacheS
- Pointer: cacheP
- LinkedList: cacheL
Context -- looking at average stall durations, we can confirm that cacheL wastes considerably more instruction slots than cacheP, which wastes only somewhat more instruction slots than cacheS:
likwid-perfctr -C S0:0 -g UOPS_EXEC -f ./cacheS 16000000
| Avg stall duration [cycles] | 7.2557 |
likwid-perfctr -C S0:0 -g UOPS_EXEC -f ./cacheP 16000000
| Avg stall duration [cycles] | 8.1357 |
likwid-perfctr -C S0:0 -g UOPS_EXEC -f ./cacheL 16000000
| Avg stall duration [cycles] | 19.9838 |
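To make "considerably more" versus "somewhat more" concrete, here is a minimal sketch that turns the three stall durations above into ratios. The dictionary and helper names are illustrative only; the numbers are copied from the UOPS_EXEC output.

```python
# Average stall durations [cycles], copied from the UOPS_EXEC runs above.
stall_cycles = {"cacheS": 7.2557, "cacheP": 8.1357, "cacheL": 19.9838}

def ratio(slower: str, faster: str) -> float:
    """How many times longer the average stall of `slower` is."""
    return stall_cycles[slower] / stall_cycles[faster]

p_vs_s = ratio("cacheP", "cacheS")
l_vs_p = ratio("cacheL", "cacheP")

print(f"cacheP / cacheS: {p_vs_s:.2f}x")
print(f"cacheL / cacheP: {l_vs_p:.2f}x")
```

With these inputs, cacheP stalls are roughly 1.12x as long as cacheS stalls, while cacheL stalls are roughly 2.46x as long as cacheP stalls.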
The question is how to quantitatively characterize the reasons.
Considering L3 and comparing faster cacheP to slower cacheL:
cacheP: | L3 evict data volume [GBytes] | 2.1258 |
cacheL: | L3 evict data volume [GBytes] | 2.3658 |
Attributable to trailing-edge effects?
(Hypothesis: Linked-list -- less useful prefetched data, more BW wasted?)
We have: Miss Penalty = Leading Edge + Effects(Trailing Edge)
// p. 24 of http://ewh.ieee.org/r5/denver/sscs/Presentations/2008_09_Bernstein1.pdf
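As a rough sanity check on the hypothesis, the sketch below sets the evict-volume gap next to the runtime gap. It uses the Runtime (RDTSC) values from the L3 runs listed further down; the variable names are illustrative, and the comparison by itself neither confirms nor rules out trailing-edge effects.

```python
# L3 evict data volume [GBytes] and Runtime (RDTSC) [s] from the L3 group runs.
evict_gb = {"cacheP": 2.1258, "cacheL": 2.3658}
runtime_s = {"cacheP": 1.2202, "cacheL": 2.3319}

evict_ratio = evict_gb["cacheL"] / evict_gb["cacheP"]
runtime_ratio = runtime_s["cacheL"] / runtime_s["cacheP"]

print(f"evict volume, cacheL vs cacheP: {evict_ratio:.3f}x")
print(f"runtime,      cacheL vs cacheP: {runtime_ratio:.3f}x")
```

The evict volume differs by about 11%, whereas the runtime differs by about 91%, so the extra evicted data alone looks too small to explain the slowdown.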
$ likwid-perfctr -C S0:0 -g L3 -f ./cacheS 16000000
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type: Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 16000000
--------------------------------------------------------------------------------
Group 1: L3
+------------------------+---------+------------+
| Event | Counter | Core 0 |
+------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 2039351567 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 2959500127 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 2778611238 |
| L2_LINES_IN_ALL | PMC0 | 42205726 |
| L2_LINES_OUT_DIRTY_ALL | PMC1 | 37046289 |
+------------------------+---------+------------+
+-------------------------------+-----------+
| Metric | Core 0 |
+-------------------------------+-----------+
| Runtime (RDTSC) [s] | 1.1034 |
| Runtime unhalted [s] | 1.1420 |
| Clock [MHz] | 2760.2860 |
| CPI | 1.4512 |
| L3 load bandwidth [MBytes/s] | 2447.9439 |
| L3 load data volume [GBytes] | 2.7012 |
| L3 evict bandwidth [MBytes/s] | 2148.6951 |
| L3 evict data volume [GBytes] | 2.3710 |
| L3 bandwidth [MBytes/s] | 4596.6389 |
| L3 data volume [GBytes] | 5.0721 |
+-------------------------------+-----------+
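The metric table can be reproduced from the raw counters. The sketch below does this for the cacheS run, assuming a 64-byte cache line (an assumption, though the reported volumes are consistent with it) and treating L2_LINES_IN_ALL as L3 loads and L2_LINES_OUT_DIRTY_ALL as L3 evicts.

```python
LINE_BYTES = 64  # assumed cache line size

# Raw counters from the cacheS run above.
instr_retired = 2039351567    # INSTR_RETIRED_ANY
clk_unhalted = 2959500127     # CPU_CLK_UNHALTED_CORE
l2_lines_in = 42205726        # L2_LINES_IN_ALL
l2_lines_out_dirty = 37046289 # L2_LINES_OUT_DIRTY_ALL

cpi = clk_unhalted / instr_retired
load_gb = l2_lines_in * LINE_BYTES / 1e9
evict_gb = l2_lines_out_dirty * LINE_BYTES / 1e9
total_gb = load_gb + evict_gb

print(f"CPI                  {cpi:.4f}")
print(f"L3 load volume [GB]  {load_gb:.4f}")
print(f"L3 evict volume [GB] {evict_gb:.4f}")
print(f"L3 data volume [GB]  {total_gb:.4f}")
```

The results match the CPI, load, evict, and total data volume rows in the table to within rounding.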
$ likwid-perfctr -C S0:0 -g L3 -f ./cacheP 16000000
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type: Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 16000000
--------------------------------------------------------------------------------
Group 1: L3
+------------------------+---------+------------+
| Event | Counter | Core 0 |
+------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 2091352295 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 3364007869 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 3088564076 |
| L2_LINES_IN_ALL | PMC0 | 58558135 |
| L2_LINES_OUT_DIRTY_ALL | PMC1 | 33214991 |
+------------------------+---------+------------+
+-------------------------------+-----------+
| Metric | Core 0 |
+-------------------------------+-----------+
| Runtime (RDTSC) [s] | 1.2202 |
| Runtime unhalted [s] | 1.2981 |
| Clock [MHz] | 2822.6985 |
| CPI | 1.6085 |
| L3 load bandwidth [MBytes/s] | 3071.4319 |
| L3 load data volume [GBytes] | 3.7477 |
| L3 evict bandwidth [MBytes/s] | 1742.1590 |
| L3 evict data volume [GBytes] | 2.1258 |
| L3 bandwidth [MBytes/s] | 4813.5909 |
| L3 data volume [GBytes] | 5.8735 |
+-------------------------------+-----------+
$ likwid-perfctr -C S0:0 -g L3 -f ./cacheL 16000000
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type: Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 16000000
--------------------------------------------------------------------------------
Group 1: L3
+------------------------+---------+------------+
| Event | Counter | Core 0 |
+------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 2091352515 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 6149176384 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 5960055218 |
| L2_LINES_IN_ALL | PMC0 | 56525204 |
| L2_LINES_OUT_DIRTY_ALL | PMC1 | 36965921 |
+------------------------+---------+------------+
+-------------------------------+-----------+
| Metric | Core 0 |
+-------------------------------+-----------+
| Runtime (RDTSC) [s] | 2.3319 |
| Runtime unhalted [s] | 2.3728 |
| Clock [MHz] | 2673.8115 |
| CPI | 2.9403 |
| L3 load bandwidth [MBytes/s] | 1551.3503 |
| L3 load data volume [GBytes] | 3.6176 |
| L3 evict bandwidth [MBytes/s] | 1014.5402 |
| L3 evict data volume [GBytes] | 2.3658 |
| L3 bandwidth [MBytes/s] | 2565.8905 |
| L3 data volume [GBytes] | 5.9834 |
+-------------------------------+-----------+
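One way to normalize the three L3 runs is core cycles spent per line brought into L2. This is a derived figure, not a likwid metric, and L2_LINES_IN_ALL presumably counts prefetch fills as well as demand fills, so treat it as a rough indicator only.

```python
# (CPU_CLK_UNHALTED_CORE, L2_LINES_IN_ALL) from the three L3 group runs above.
runs = {
    "cacheS": (2959500127, 42205726),
    "cacheP": (3364007869, 58558135),
    "cacheL": (6149176384, 56525204),
}

# Core cycles per cache line filled into L2, per variant.
cycles_per_line = {name: clk / lines for name, (clk, lines) in runs.items()}

for name, value in cycles_per_line.items():
    print(f"{name}: {value:.1f} cycles per L2 line in")
```

cacheL fills slightly fewer lines than cacheP yet spends close to twice as many cycles per line, which suggests the difference lies in how long each miss is exposed rather than in how many lines are moved.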
Prefetchers are definitely helpful for the simple access (cacheS) -- but they don't seem to account for the significant timing disparity between pointer (cacheP) and linked-list (cacheL).
Disabled:
cacheS: | L3 miss ratio | 0.7905 |
cacheP: | L3 miss ratio | 0.8537 |
cacheL: | L3 miss ratio | 0.8445 |
Enabled:
cacheS: | L3 miss ratio | 0.6954 |
cacheP: | L3 miss ratio | 0.8373 |
cacheL: | L3 miss ratio | 0.8338 |
Note the significant increase in the L3 miss ratio for the simple access workload (cacheS) when prefetchers are disabled -- from 69.5% to 79% -- but only very small miss ratio increases for the remaining workloads.
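The size of each change is easier to compare in percentage points. A minimal sketch using the six miss ratios listed above (names are illustrative):

```python
# L3 miss ratios with hardware prefetchers disabled and enabled.
disabled = {"cacheS": 0.7905, "cacheP": 0.8537, "cacheL": 0.8445}
enabled = {"cacheS": 0.6954, "cacheP": 0.8373, "cacheL": 0.8338}

# Increase in miss ratio when prefetchers are turned off, in percentage points.
delta_pp = {name: (disabled[name] - enabled[name]) * 100 for name in disabled}

for name, delta in delta_pp.items():
    print(f"{name}: +{delta:.2f} percentage points")
```

cacheS moves by about 9.5 points, while cacheP and cacheL move by about 1.6 and 1.1 points respectively.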
#
#
#
Prefetchers: Detailed results:
We can disable/enable prefetchers using likwid-features:
https://github.com/RRZE-HPC/likwid/wiki/likwid-features
#
# Disabled Prefetchers
#
$ likwid-features -c 0 -l
Feature CPU 0
HW_PREFETCHER on
CL_PREFETCHER on
DCU_PREFETCHER on
IP_PREFETCHER on
FAST_STRINGS on
THERMAL_CONTROL on
PERF_MON on
FERR_MULTIPLEX off
BRANCH_TRACE_STORAGE on
XTPR_MESSAGE off
PEBS on
SPEEDSTEP on
MONITOR on
SPEEDSTEP_LOCK off
CPUID_MAX_VAL off
XD_BIT on
DYN_ACCEL off
TURBO_MODE on
TM2 off
$ likwid-features -c 0 -d HW_PREFETCHER,CL_PREFETCHER,DCU_PREFETCHER,IP_PREFETCHER
HW_PREFETCHER: disabled
Disabled HW_PREFETCHER for CPU 0
CL_PREFETCHER: disabled
Disabled CL_PREFETCHER for CPU 0
DCU_PREFETCHER: disabled
Disabled DCU_PREFETCHER for CPU 0
IP_PREFETCHER: disabled
Disabled IP_PREFETCHER for CPU 0
$ likwid-features -c 0 -l
Feature CPU 0
HW_PREFETCHER off
CL_PREFETCHER off
DCU_PREFETCHER off
IP_PREFETCHER off
FAST_STRINGS on
THERMAL_CONTROL on
PERF_MON on
FERR_MULTIPLEX off
BRANCH_TRACE_STORAGE on
XTPR_MESSAGE off
PEBS on
SPEEDSTEP on
MONITOR on
SPEEDSTEP_LOCK off
CPUID_MAX_VAL off
XD_BIT on
DYN_ACCEL off
TURBO_MODE on
TM2 off
$ likwid-perfctr -C S0:0 -g L3CACHE -f ./cacheS 16000000
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type: Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 16000000
--------------------------------------------------------------------------------
Group 1: L3CACHE
+-------------------------------+---------+------------+
| Event | Counter | Core 0 |
+-------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 2039352297 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 3090290060 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 3039766886 |
| MEM_LOAD_UOPS_RETIRED_L3_ALL | PMC0 | 21723938 |
| MEM_LOAD_UOPS_RETIRED_L3_MISS | PMC1 | 17172188 |
| UOPS_RETIRED_ALL | PMC2 | 3482181080 |
+-------------------------------+---------+------------+
+----------------------+-----------+
| Metric | Core 0 |
+----------------------+-----------+
| Runtime (RDTSC) [s] | 1.2076 |
| Runtime unhalted [s] | 1.1924 |
| Clock [MHz] | 2634.6467 |
| CPI | 1.5153 |
| L3 request rate | 0.0062 |
| L3 miss rate | 0.0049 |
| L3 miss ratio | 0.7905 |
+----------------------+-----------+
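The three L3CACHE metrics follow directly from the raw events. The sketch below recomputes them for this cacheS run (prefetchers disabled), assuming the request and miss rates are normalized by UOPS_RETIRED_ALL, which is consistent with the reported values.

```python
# Raw counters from the cacheS run with prefetchers disabled.
l3_all = 21723938         # MEM_LOAD_UOPS_RETIRED_L3_ALL
l3_miss = 17172188        # MEM_LOAD_UOPS_RETIRED_L3_MISS
uops_retired = 3482181080 # UOPS_RETIRED_ALL

request_rate = l3_all / uops_retired  # L3 accesses per retired uop
miss_rate = l3_miss / uops_retired    # L3 misses per retired uop
miss_ratio = l3_miss / l3_all         # fraction of L3 accesses that miss

print(f"L3 request rate {request_rate:.4f}")
print(f"L3 miss rate    {miss_rate:.4f}")
print(f"L3 miss ratio   {miss_ratio:.4f}")
```

Note that the miss ratio is relative to loads that reach L3, so a high ratio can coexist with a very low miss rate per retired uop.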
$ likwid-perfctr -C S0:0 -g L3CACHE -f ./cacheP 16000000
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type: Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 16000000
--------------------------------------------------------------------------------
Group 1: L3CACHE
+-------------------------------+---------+------------+
| Event | Counter | Core 0 |
+-------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 2091351388 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 3541973052 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 3300479884 |
| MEM_LOAD_UOPS_RETIRED_L3_ALL | PMC0 | 37460119 |
| MEM_LOAD_UOPS_RETIRED_L3_MISS | PMC1 | 31978122 |
| UOPS_RETIRED_ALL | PMC2 | 3536570374 |
+-------------------------------+---------+------------+
+----------------------+-----------+
| Metric | Core 0 |
+----------------------+-----------+
| Runtime (RDTSC) [s] | 1.3089 |
| Runtime unhalted [s] | 1.3667 |
| Clock [MHz] | 2781.1666 |
| CPI | 1.6936 |
| L3 request rate | 0.0106 |
| L3 miss rate | 0.0090 |
| L3 miss ratio | 0.8537 |
+----------------------+-----------+
$ likwid-perfctr -C S0:0 -g L3CACHE -f ./cacheL 16000000
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type: Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 16000000
--------------------------------------------------------------------------------
Group 1: L3CACHE
+-------------------------------+---------+------------+
| Event | Counter | Core 0 |
+-------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 2091352659 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 6285614502 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 6203681042 |
| MEM_LOAD_UOPS_RETIRED_L3_ALL | PMC0 | 33333780 |
| MEM_LOAD_UOPS_RETIRED_L3_MISS | PMC1 | 28151375 |
| UOPS_RETIRED_ALL | PMC2 | 3534662943 |
+-------------------------------+---------+------------+
+----------------------+-----------+
| Metric | Core 0 |
+----------------------+-----------+
| Runtime (RDTSC) [s] | 2.4317 |
| Runtime unhalted [s] | 2.4256 |
| Clock [MHz] | 2625.5607 |
| CPI | 3.0055 |
| L3 request rate | 0.0094 |
| L3 miss rate | 0.0080 |
| L3 miss ratio | 0.8445 |
+----------------------+-----------+
#
# Enabled Prefetchers
#
$ likwid-features -c 0 -e HW_PREFETCHER,CL_PREFETCHER,DCU_PREFETCHER,IP_PREFETCHER
HW_PREFETCHER: enabled
Enabled HW_PREFETCHER for CPU 0
CL_PREFETCHER: enabled
Enabled CL_PREFETCHER for CPU 0
DCU_PREFETCHER: enabled
Enabled DCU_PREFETCHER for CPU 0
IP_PREFETCHER: enabled
Enabled IP_PREFETCHER for CPU 0
$ likwid-features -c 0 -l
Feature CPU 0
HW_PREFETCHER on
CL_PREFETCHER on
DCU_PREFETCHER on
IP_PREFETCHER on
FAST_STRINGS on
THERMAL_CONTROL on
PERF_MON on
FERR_MULTIPLEX off
BRANCH_TRACE_STORAGE on
XTPR_MESSAGE off
PEBS on
SPEEDSTEP on
MONITOR on
SPEEDSTEP_LOCK off
CPUID_MAX_VAL off
XD_BIT on
DYN_ACCEL off
TURBO_MODE on
TM2 off
$ likwid-perfctr -C S0:0 -g L3CACHE -f ./cacheS 16000000
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type: Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 16000000
--------------------------------------------------------------------------------
Group 1: L3CACHE
+-------------------------------+---------+------------+
| Event | Counter | Core 0 |
+-------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 2039352165 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 2984023348 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 2859151802 |
| MEM_LOAD_UOPS_RETIRED_L3_ALL | PMC0 | 17807189 |
| MEM_LOAD_UOPS_RETIRED_L3_MISS | PMC1 | 12382320 |
| UOPS_RETIRED_ALL | PMC2 | 3482582482 |
+-------------------------------+---------+------------+
+----------------------+-----------+
| Metric | Core 0 |
+----------------------+-----------+
| Runtime (RDTSC) [s] | 1.1351 |
| Runtime unhalted [s] | 1.1514 |
| Clock [MHz] | 2704.7605 |
| CPI | 1.4632 |
| L3 request rate | 0.0051 |
| L3 miss rate | 0.0036 |
| L3 miss ratio | 0.6954 |
+----------------------+-----------+
$ likwid-perfctr -C S0:0 -g L3CACHE -f ./cacheP 16000000
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type: Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 16000000
--------------------------------------------------------------------------------
Group 1: L3CACHE
+-------------------------------+---------+------------+
| Event | Counter | Core 0 |
+-------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 2091351601 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 3830019764 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 3293118868 |
| MEM_LOAD_UOPS_RETIRED_L3_ALL | PMC0 | 32824200 |
| MEM_LOAD_UOPS_RETIRED_L3_MISS | PMC1 | 27482476 |
| UOPS_RETIRED_ALL | PMC2 | 3536707492 |
+-------------------------------+---------+------------+
+----------------------+-----------+
| Metric | Core 0 |
+----------------------+-----------+
| Runtime (RDTSC) [s] | 1.2999 |
| Runtime unhalted [s] | 1.4779 |
| Clock [MHz] | 3013.9721 |
| CPI | 1.8314 |
| L3 request rate | 0.0093 |
| L3 miss rate | 0.0078 |
| L3 miss ratio | 0.8373 |
+----------------------+-----------+
$ likwid-perfctr -C S0:0 -g L3CACHE -f ./cacheL 16000000
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CPU type: Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
n = 16000000
--------------------------------------------------------------------------------
Group 1: L3CACHE
+-------------------------------+---------+------------+
| Event | Counter | Core 0 |
+-------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 2091352731 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 6163456542 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 5950193860 |
| MEM_LOAD_UOPS_RETIRED_L3_ALL | PMC0 | 31687709 |
| MEM_LOAD_UOPS_RETIRED_L3_MISS | PMC1 | 26422113 |
| UOPS_RETIRED_ALL | PMC2 | 3535194571 |
+-------------------------------+---------+------------+
+----------------------+-----------+
| Metric | Core 0 |
+----------------------+-----------+
| Runtime (RDTSC) [s] | 2.3300 |
| Runtime unhalted [s] | 2.3783 |
| Clock [MHz] | 2684.4621 |
| CPI | 2.9471 |
| L3 request rate | 0.0090 |
| L3 miss rate | 0.0075 |
| L3 miss ratio | 0.8338 |
+----------------------+-----------+
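To summarize the two sets of L3CACHE runs, this sketch compares Runtime (RDTSC) with prefetchers disabled and enabled. These are single runs, so small differences may be within run-to-run noise.

```python
# Runtime (RDTSC) [s] from the L3CACHE runs above.
runtime_off = {"cacheS": 1.2076, "cacheP": 1.3089, "cacheL": 2.4317}
runtime_on = {"cacheS": 1.1351, "cacheP": 1.2999, "cacheL": 2.3300}

# Speedup from enabling the prefetchers, per variant.
speedup = {name: runtime_off[name] / runtime_on[name] for name in runtime_off}

# The cacheL-to-cacheP gap with prefetchers on and off.
gap_on = runtime_on["cacheL"] / runtime_on["cacheP"]
gap_off = runtime_off["cacheL"] / runtime_off["cacheP"]

for name, value in speedup.items():
    print(f"{name}: {value:.3f}x faster with prefetchers enabled")
print(f"cacheL / cacheP runtime: {gap_on:.2f}x (on), {gap_off:.2f}x (off)")
```

Enabling the prefetchers changes each runtime by only a few percent, and cacheL stays roughly 1.8x to 1.9x slower than cacheP in both configurations.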