Last active
May 1, 2016 22:12
Related to the discussion of
http://larshagencpp.github.io/blog/2016/05/01/a-cache-miss-is-not-a-cache-miss
at https://twitter.com/gregerlars/status/726781584481878017
Three variants:
- Simple: cacheS
- Pointer: cacheP
- LinkedList: cacheL
Context -- average stall durations confirm that cacheL wastes considerably more instruction slots than cacheP (about 20.0 vs. 8.1 cycles per stall), which in turn wastes only somewhat more than cacheS (about 7.3 cycles):
likwid-perfctr -C S0:0 -g UOPS_EXEC -f ./cacheS 16000000 | |
| Avg stall duration [cycles] | 7.2557 | | |
likwid-perfctr -C S0:0 -g UOPS_EXEC -f ./cacheP 16000000 | |
| Avg stall duration [cycles] | 8.1357 | | |
likwid-perfctr -C S0:0 -g UOPS_EXEC -f ./cacheL 16000000 | |
| Avg stall duration [cycles] | 19.9838 | | |
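To put the gap in relative terms, the three reported averages can be divided by one another. The following is a minimal sketch; the dictionary and variable names are illustrative, and the values are copied from the likwid output above.

```python
# Average stall durations [cycles] as reported by the UOPS_EXEC group above.
avg_stall_cycles = {"cacheS": 7.2557, "cacheP": 8.1357, "cacheL": 19.9838}

# Express every variant relative to the simple (cacheS) baseline.
baseline = avg_stall_cycles["cacheS"]
relative = {name: cycles / baseline for name, cycles in avg_stall_cycles.items()}

for name, factor in relative.items():
    print(f"{name}: {avg_stall_cycles[name]:7.4f} cycles, {factor:.2f}x cacheS")

# The linked-list variant relative to the pointer variant.
list_vs_pointer = avg_stall_cycles["cacheL"] / avg_stall_cycles["cacheP"]
print(f"cacheL / cacheP = {list_vs_pointer:.2f}")
```

With these inputs cacheP comes out at roughly 1.12x cacheS, while cacheL is roughly 2.75x cacheS and 2.46x cacheP.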
The question is how to characterize the reasons quantitatively.
Considering L3 and comparing the faster cacheP to the slower cacheL:
cacheP: | L3 evict data volume [GBytes] | 2.1258 |
cacheL: | L3 evict data volume [GBytes] | 2.3658 |
Attributable to trailing-edge effects?
(Hypothesis: linked list -- less useful prefetched data, more bandwidth wasted?)
We have: Miss Penalty = Leading Edge + Effects(Trailing Edge)
// p. 24 of http://ewh.ieee.org/r5/denver/sscs/Presentations/2008_09_Bernstein1.pdf
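One way to sanity-check the bandwidth hypothesis is to compare the difference in data volume with the difference in runtime. The sketch below only divides numbers copied from the L3 group output for cacheP and cacheL in this file; the names `l3` and `ratio` are illustrative.

```python
# Values for cacheP and cacheL copied from the likwid L3 group output
# in this file (n = 16000000).
l3 = {
    "cacheP": {"runtime_s": 1.2202, "load_gb": 3.7477, "evict_gb": 2.1258},
    "cacheL": {"runtime_s": 2.3319, "load_gb": 3.6176, "evict_gb": 2.3658},
}


def ratio(metric):
    """Return the cacheL / cacheP ratio for one metric."""
    return l3["cacheL"][metric] / l3["cacheP"][metric]


for metric in ("runtime_s", "load_gb", "evict_gb"):
    print(f"{metric:10s} cacheL/cacheP = {ratio(metric):.3f}")
```

On these numbers the evicted volume is about 11% higher for cacheL and the loaded volume is slightly lower, while the runtime is about 1.9x, so the data volume by itself does not look large enough to explain the gap.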
$ likwid-perfctr -C S0:0 -g L3 -f ./cacheS 16000000 | |
-------------------------------------------------------------------------------- | |
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz | |
CPU type: Intel Core IvyBridge processor | |
CPU clock: 2.59 GHz | |
-------------------------------------------------------------------------------- | |
n = 16000000 | |
-------------------------------------------------------------------------------- | |
Group 1: L3 | |
+------------------------+---------+------------+ | |
| Event | Counter | Core 0 | | |
+------------------------+---------+------------+ | |
| INSTR_RETIRED_ANY | FIXC0 | 2039351567 | | |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 2959500127 | | |
| CPU_CLK_UNHALTED_REF | FIXC2 | 2778611238 | | |
| L2_LINES_IN_ALL | PMC0 | 42205726 | | |
| L2_LINES_OUT_DIRTY_ALL | PMC1 | 37046289 | | |
+------------------------+---------+------------+ | |
+-------------------------------+-----------+ | |
| Metric | Core 0 | | |
+-------------------------------+-----------+ | |
| Runtime (RDTSC) [s] | 1.1034 | | |
| Runtime unhalted [s] | 1.1420 | | |
| Clock [MHz] | 2760.2860 | | |
| CPI | 1.4512 | | |
| L3 load bandwidth [MBytes/s] | 2447.9439 | | |
| L3 load data volume [GBytes] | 2.7012 | | |
| L3 evict bandwidth [MBytes/s] | 2148.6951 | | |
| L3 evict data volume [GBytes] | 2.3710 | | |
| L3 bandwidth [MBytes/s] | 4596.6389 | | |
| L3 data volume [GBytes] | 5.0721 | | |
+-------------------------------+-----------+ | |
$ likwid-perfctr -C S0:0 -g L3 -f ./cacheP 16000000 | |
-------------------------------------------------------------------------------- | |
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz | |
CPU type: Intel Core IvyBridge processor | |
CPU clock: 2.59 GHz | |
-------------------------------------------------------------------------------- | |
n = 16000000 | |
-------------------------------------------------------------------------------- | |
Group 1: L3 | |
+------------------------+---------+------------+ | |
| Event | Counter | Core 0 | | |
+------------------------+---------+------------+ | |
| INSTR_RETIRED_ANY | FIXC0 | 2091352295 | | |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 3364007869 | | |
| CPU_CLK_UNHALTED_REF | FIXC2 | 3088564076 | | |
| L2_LINES_IN_ALL | PMC0 | 58558135 | | |
| L2_LINES_OUT_DIRTY_ALL | PMC1 | 33214991 | | |
+------------------------+---------+------------+ | |
+-------------------------------+-----------+ | |
| Metric | Core 0 | | |
+-------------------------------+-----------+ | |
| Runtime (RDTSC) [s] | 1.2202 | | |
| Runtime unhalted [s] | 1.2981 | | |
| Clock [MHz] | 2822.6985 | | |
| CPI | 1.6085 | | |
| L3 load bandwidth [MBytes/s] | 3071.4319 | | |
| L3 load data volume [GBytes] | 3.7477 | | |
| L3 evict bandwidth [MBytes/s] | 1742.1590 | | |
| L3 evict data volume [GBytes] | 2.1258 | | |
| L3 bandwidth [MBytes/s] | 4813.5909 | | |
| L3 data volume [GBytes] | 5.8735 | | |
+-------------------------------+-----------+ | |
$ likwid-perfctr -C S0:0 -g L3 -f ./cacheL 16000000 | |
-------------------------------------------------------------------------------- | |
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz | |
CPU type: Intel Core IvyBridge processor | |
CPU clock: 2.59 GHz | |
-------------------------------------------------------------------------------- | |
n = 16000000 | |
-------------------------------------------------------------------------------- | |
Group 1: L3 | |
+------------------------+---------+------------+ | |
| Event | Counter | Core 0 | | |
+------------------------+---------+------------+ | |
| INSTR_RETIRED_ANY | FIXC0 | 2091352515 | | |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 6149176384 | | |
| CPU_CLK_UNHALTED_REF | FIXC2 | 5960055218 | | |
| L2_LINES_IN_ALL | PMC0 | 56525204 | | |
| L2_LINES_OUT_DIRTY_ALL | PMC1 | 36965921 | | |
+------------------------+---------+------------+ | |
+-------------------------------+-----------+ | |
| Metric | Core 0 | | |
+-------------------------------+-----------+ | |
| Runtime (RDTSC) [s] | 2.3319 | | |
| Runtime unhalted [s] | 2.3728 | | |
| Clock [MHz] | 2673.8115 | | |
| CPI | 2.9403 | | |
| L3 load bandwidth [MBytes/s] | 1551.3503 | | |
| L3 load data volume [GBytes] | 3.6176 | | |
| L3 evict bandwidth [MBytes/s] | 1014.5402 | | |
| L3 evict data volume [GBytes] | 2.3658 | | |
| L3 bandwidth [MBytes/s] | 2565.8905 | | |
| L3 data volume [GBytes] | 5.9834 | | |
+-------------------------------+-----------+ | |
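The derived metrics in the three tables above appear to follow directly from the raw counters: CPI as core cycles per retired instruction, and the data volumes as L2 line counts times a 64-byte cache line. The formulas below are inferred from the numbers in this file rather than quoted from the likwid group definition, and the dictionary keys are illustrative.

```python
CACHE_LINE_BYTES = 64  # assumed line size; it reproduces the reported volumes

# Raw counters copied from the three L3 group runs above.
counters = {
    "cacheS": {"instr": 2039351567, "core_clk": 2959500127,
               "l2_lines_in": 42205726, "l2_lines_out_dirty": 37046289},
    "cacheP": {"instr": 2091352295, "core_clk": 3364007869,
               "l2_lines_in": 58558135, "l2_lines_out_dirty": 33214991},
    "cacheL": {"instr": 2091352515, "core_clk": 6149176384,
               "l2_lines_in": 56525204, "l2_lines_out_dirty": 36965921},
}


def derive(c):
    """Recompute CPI and L3 load/evict volume [GBytes] from raw counters."""
    return {
        "cpi": c["core_clk"] / c["instr"],
        "load_gb": c["l2_lines_in"] * CACHE_LINE_BYTES * 1e-9,
        "evict_gb": c["l2_lines_out_dirty"] * CACHE_LINE_BYTES * 1e-9,
    }


derived = {name: derive(c) for name, c in counters.items()}
for name, m in derived.items():
    print(f"{name}: CPI {m['cpi']:.4f}, "
          f"load {m['load_gb']:.4f} GB, evict {m['evict_gb']:.4f} GB")
```

The recomputed values agree with the reported metrics to four decimal places, which also shows that the three variants retire almost the same number of instructions and differ mainly in cycles.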
Prefetchers are clearly helpful for the simple access pattern (cacheS), but they do not seem to account for the significant timing disparity between pointer (cacheP) and linked-list (cacheL) access.
Prefetchers disabled:
cacheS: | L3 miss ratio | 0.7905 |
cacheP: | L3 miss ratio | 0.8537 |
cacheL: | L3 miss ratio | 0.8445 |
Prefetchers enabled:
cacheS: | L3 miss ratio | 0.6954 |
cacheP: | L3 miss ratio | 0.8373 |
cacheL: | L3 miss ratio | 0.8338 |
Note the significant increase in the L3 miss ratio for the simple access workload (cacheS) when the prefetchers are disabled -- from 69.5% to 79.0% -- versus only very small increases for the remaining workloads (about 1.6 percentage points for cacheP and 1.1 for cacheL).
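The miss ratios can be reproduced from the raw counters in the detailed results as MEM_LOAD_UOPS_RETIRED_L3_MISS divided by MEM_LOAD_UOPS_RETIRED_L3_ALL. That formula is inferred from the numbers in this file, and the names in the sketch are illustrative.

```python
# (L3_ALL, L3_MISS) retired load uops, copied from the L3CACHE runs in this file.
loads = {
    "disabled": {"cacheS": (21723938, 17172188),
                 "cacheP": (37460119, 31978122),
                 "cacheL": (33333780, 28151375)},
    "enabled":  {"cacheS": (17807189, 12382320),
                 "cacheP": (32824200, 27482476),
                 "cacheL": (31687709, 26422113)},
}


def miss_ratio(all_loads, misses):
    """Fraction of retired L3 loads that missed in L3."""
    return misses / all_loads


ratios = {state: {name: miss_ratio(*pair) for name, pair in runs.items()}
          for state, runs in loads.items()}

for name in ("cacheS", "cacheP", "cacheL"):
    on = ratios["enabled"][name]
    off = ratios["disabled"][name]
    # Change in percentage points when the prefetchers are switched off.
    print(f"{name}: enabled {on:.4f}, disabled {off:.4f}, "
          f"change {100 * (off - on):+.1f} points")
```

Expressing the change in percentage points keeps the comparison between workloads direct: cacheS moves by roughly 9.5 points, the other two by under 2.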
# | |
# | |
# | |
Prefetchers: detailed results
We can disable/enable prefetchers using likwid-features:
https://github.com/RRZE-HPC/likwid/wiki/likwid-features
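The measurement sequence used below (disable the four prefetchers, run the three binaries, re-enable, run again) can be scripted. This is a sketch only: the helper names are hypothetical, the command lines mirror the ones shown in this file, and by default it just prints the commands instead of executing them.

```python
import subprocess

PREFETCHERS = ["HW_PREFETCHER", "CL_PREFETCHER", "DCU_PREFETCHER", "IP_PREFETCHER"]
BINARIES = ["./cacheS", "./cacheP", "./cacheL"]


def features_cmd(flag, cpu=0):
    """Build a likwid-features call; flag is "-d" (disable) or "-e" (enable)."""
    return ["likwid-features", "-c", str(cpu), flag, ",".join(PREFETCHERS)]


def perfctr_cmd(binary, n=16000000, group="L3CACHE"):
    """Build the likwid-perfctr call used for one benchmark binary."""
    return ["likwid-perfctr", "-C", "S0:0", "-g", group, "-f", binary, str(n)]


def plan():
    """Disabled runs first, then enabled, so the prefetchers end up on."""
    steps = []
    for flag in ("-d", "-e"):
        steps.append(features_cmd(flag))
        steps.extend(perfctr_cmd(b) for b in BINARIES)
    return steps


def run(steps, dry_run=True):
    for argv in steps:
        print("$", " ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


run(plan())  # dry run: only prints the commands
```

Running the enabled pass last leaves the machine in its original state, matching the order of the logs that follow.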
# | |
# Disabled Prefetchers | |
# | |
$ likwid-features -c 0 -l | |
Feature CPU 0 | |
HW_PREFETCHER on | |
CL_PREFETCHER on | |
DCU_PREFETCHER on | |
IP_PREFETCHER on | |
FAST_STRINGS on | |
THERMAL_CONTROL on | |
PERF_MON on | |
FERR_MULTIPLEX off | |
BRANCH_TRACE_STORAGE on | |
XTPR_MESSAGE off | |
PEBS on | |
SPEEDSTEP on | |
MONITOR on | |
SPEEDSTEP_LOCK off | |
CPUID_MAX_VAL off | |
XD_BIT on | |
DYN_ACCEL off | |
TURBO_MODE on | |
TM2 off | |
$ likwid-features -c 0 -d HW_PREFETCHER,CL_PREFETCHER,DCU_PREFETCHER,IP_PREFETCHER | |
HW_PREFETCHER: disabled | |
Disabled HW_PREFETCHER for CPU 0 | |
CL_PREFETCHER: disabled | |
Disabled CL_PREFETCHER for CPU 0 | |
DCU_PREFETCHER: disabled | |
Disabled DCU_PREFETCHER for CPU 0 | |
IP_PREFETCHER: disabled | |
Disabled IP_PREFETCHER for CPU 0 | |
$ likwid-features -c 0 -l | |
Feature CPU 0 | |
HW_PREFETCHER off | |
CL_PREFETCHER off | |
DCU_PREFETCHER off | |
IP_PREFETCHER off | |
FAST_STRINGS on | |
THERMAL_CONTROL on | |
PERF_MON on | |
FERR_MULTIPLEX off | |
BRANCH_TRACE_STORAGE on | |
XTPR_MESSAGE off | |
PEBS on | |
SPEEDSTEP on | |
MONITOR on | |
SPEEDSTEP_LOCK off | |
CPUID_MAX_VAL off | |
XD_BIT on | |
DYN_ACCEL off | |
TURBO_MODE on | |
TM2 off | |
$ likwid-perfctr -C S0:0 -g L3CACHE -f ./cacheS 16000000 | |
-------------------------------------------------------------------------------- | |
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz | |
CPU type: Intel Core IvyBridge processor | |
CPU clock: 2.59 GHz | |
-------------------------------------------------------------------------------- | |
n = 16000000 | |
-------------------------------------------------------------------------------- | |
Group 1: L3CACHE | |
+-------------------------------+---------+------------+ | |
| Event | Counter | Core 0 | | |
+-------------------------------+---------+------------+ | |
| INSTR_RETIRED_ANY | FIXC0 | 2039352297 | | |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 3090290060 | | |
| CPU_CLK_UNHALTED_REF | FIXC2 | 3039766886 | | |
| MEM_LOAD_UOPS_RETIRED_L3_ALL | PMC0 | 21723938 | | |
| MEM_LOAD_UOPS_RETIRED_L3_MISS | PMC1 | 17172188 | | |
| UOPS_RETIRED_ALL | PMC2 | 3482181080 | | |
+-------------------------------+---------+------------+ | |
+----------------------+-----------+ | |
| Metric | Core 0 | | |
+----------------------+-----------+ | |
| Runtime (RDTSC) [s] | 1.2076 | | |
| Runtime unhalted [s] | 1.1924 | | |
| Clock [MHz] | 2634.6467 | | |
| CPI | 1.5153 | | |
| L3 request rate | 0.0062 | | |
| L3 miss rate | 0.0049 | | |
| L3 miss ratio | 0.7905 | | |
+----------------------+-----------+ | |
$ likwid-perfctr -C S0:0 -g L3CACHE -f ./cacheP 16000000 | |
-------------------------------------------------------------------------------- | |
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz | |
CPU type: Intel Core IvyBridge processor | |
CPU clock: 2.59 GHz | |
-------------------------------------------------------------------------------- | |
n = 16000000 | |
-------------------------------------------------------------------------------- | |
Group 1: L3CACHE | |
+-------------------------------+---------+------------+ | |
| Event | Counter | Core 0 | | |
+-------------------------------+---------+------------+ | |
| INSTR_RETIRED_ANY | FIXC0 | 2091351388 | | |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 3541973052 | | |
| CPU_CLK_UNHALTED_REF | FIXC2 | 3300479884 | | |
| MEM_LOAD_UOPS_RETIRED_L3_ALL | PMC0 | 37460119 | | |
| MEM_LOAD_UOPS_RETIRED_L3_MISS | PMC1 | 31978122 | | |
| UOPS_RETIRED_ALL | PMC2 | 3536570374 | | |
+-------------------------------+---------+------------+ | |
+----------------------+-----------+ | |
| Metric | Core 0 | | |
+----------------------+-----------+ | |
| Runtime (RDTSC) [s] | 1.3089 | | |
| Runtime unhalted [s] | 1.3667 | | |
| Clock [MHz] | 2781.1666 | | |
| CPI | 1.6936 | | |
| L3 request rate | 0.0106 | | |
| L3 miss rate | 0.0090 | | |
| L3 miss ratio | 0.8537 | | |
+----------------------+-----------+ | |
$ likwid-perfctr -C S0:0 -g L3CACHE -f ./cacheL 16000000 | |
-------------------------------------------------------------------------------- | |
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz | |
CPU type: Intel Core IvyBridge processor | |
CPU clock: 2.59 GHz | |
-------------------------------------------------------------------------------- | |
n = 16000000 | |
-------------------------------------------------------------------------------- | |
Group 1: L3CACHE | |
+-------------------------------+---------+------------+ | |
| Event | Counter | Core 0 | | |
+-------------------------------+---------+------------+ | |
| INSTR_RETIRED_ANY | FIXC0 | 2091352659 | | |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 6285614502 | | |
| CPU_CLK_UNHALTED_REF | FIXC2 | 6203681042 | | |
| MEM_LOAD_UOPS_RETIRED_L3_ALL | PMC0 | 33333780 | | |
| MEM_LOAD_UOPS_RETIRED_L3_MISS | PMC1 | 28151375 | | |
| UOPS_RETIRED_ALL | PMC2 | 3534662943 | | |
+-------------------------------+---------+------------+ | |
+----------------------+-----------+ | |
| Metric | Core 0 | | |
+----------------------+-----------+ | |
| Runtime (RDTSC) [s] | 2.4317 | | |
| Runtime unhalted [s] | 2.4256 | | |
| Clock [MHz] | 2625.5607 | | |
| CPI | 3.0055 | | |
| L3 request rate | 0.0094 | | |
| L3 miss rate | 0.0080 | | |
| L3 miss ratio | 0.8445 | | |
+----------------------+-----------+ | |
# | |
# Enabled Prefetchers | |
# | |
$ likwid-features -c 0 -e HW_PREFETCHER,CL_PREFETCHER,DCU_PREFETCHER,IP_PREFETCHER | |
HW_PREFETCHER: enabled | |
Enabled HW_PREFETCHER for CPU 0 | |
CL_PREFETCHER: enabled | |
Enabled CL_PREFETCHER for CPU 0 | |
DCU_PREFETCHER: enabled | |
Enabled DCU_PREFETCHER for CPU 0 | |
IP_PREFETCHER: enabled | |
Enabled IP_PREFETCHER for CPU 0 | |
$ likwid-features -c 0 -l | |
Feature CPU 0 | |
HW_PREFETCHER on | |
CL_PREFETCHER on | |
DCU_PREFETCHER on | |
IP_PREFETCHER on | |
FAST_STRINGS on | |
THERMAL_CONTROL on | |
PERF_MON on | |
FERR_MULTIPLEX off | |
BRANCH_TRACE_STORAGE on | |
XTPR_MESSAGE off | |
PEBS on | |
SPEEDSTEP on | |
MONITOR on | |
SPEEDSTEP_LOCK off | |
CPUID_MAX_VAL off | |
XD_BIT on | |
DYN_ACCEL off | |
TURBO_MODE on | |
TM2 off | |
$ likwid-perfctr -C S0:0 -g L3CACHE -f ./cacheS 16000000 | |
-------------------------------------------------------------------------------- | |
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz | |
CPU type: Intel Core IvyBridge processor | |
CPU clock: 2.59 GHz | |
-------------------------------------------------------------------------------- | |
n = 16000000 | |
-------------------------------------------------------------------------------- | |
Group 1: L3CACHE | |
+-------------------------------+---------+------------+ | |
| Event | Counter | Core 0 | | |
+-------------------------------+---------+------------+ | |
| INSTR_RETIRED_ANY | FIXC0 | 2039352165 | | |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 2984023348 | | |
| CPU_CLK_UNHALTED_REF | FIXC2 | 2859151802 | | |
| MEM_LOAD_UOPS_RETIRED_L3_ALL | PMC0 | 17807189 | | |
| MEM_LOAD_UOPS_RETIRED_L3_MISS | PMC1 | 12382320 | | |
| UOPS_RETIRED_ALL | PMC2 | 3482582482 | | |
+-------------------------------+---------+------------+ | |
+----------------------+-----------+ | |
| Metric | Core 0 | | |
+----------------------+-----------+ | |
| Runtime (RDTSC) [s] | 1.1351 | | |
| Runtime unhalted [s] | 1.1514 | | |
| Clock [MHz] | 2704.7605 | | |
| CPI | 1.4632 | | |
| L3 request rate | 0.0051 | | |
| L3 miss rate | 0.0036 | | |
| L3 miss ratio | 0.6954 | | |
+----------------------+-----------+ | |
$ likwid-perfctr -C S0:0 -g L3CACHE -f ./cacheP 16000000 | |
-------------------------------------------------------------------------------- | |
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz | |
CPU type: Intel Core IvyBridge processor | |
CPU clock: 2.59 GHz | |
-------------------------------------------------------------------------------- | |
n = 16000000 | |
-------------------------------------------------------------------------------- | |
Group 1: L3CACHE | |
+-------------------------------+---------+------------+ | |
| Event | Counter | Core 0 | | |
+-------------------------------+---------+------------+ | |
| INSTR_RETIRED_ANY | FIXC0 | 2091351601 | | |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 3830019764 | | |
| CPU_CLK_UNHALTED_REF | FIXC2 | 3293118868 | | |
| MEM_LOAD_UOPS_RETIRED_L3_ALL | PMC0 | 32824200 | | |
| MEM_LOAD_UOPS_RETIRED_L3_MISS | PMC1 | 27482476 | | |
| UOPS_RETIRED_ALL | PMC2 | 3536707492 | | |
+-------------------------------+---------+------------+ | |
+----------------------+-----------+ | |
| Metric | Core 0 | | |
+----------------------+-----------+ | |
| Runtime (RDTSC) [s] | 1.2999 | | |
| Runtime unhalted [s] | 1.4779 | | |
| Clock [MHz] | 3013.9721 | | |
| CPI | 1.8314 | | |
| L3 request rate | 0.0093 | | |
| L3 miss rate | 0.0078 | | |
| L3 miss ratio | 0.8373 | | |
+----------------------+-----------+ | |
$ likwid-perfctr -C S0:0 -g L3CACHE -f ./cacheL 16000000 | |
-------------------------------------------------------------------------------- | |
CPU name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz | |
CPU type: Intel Core IvyBridge processor | |
CPU clock: 2.59 GHz | |
-------------------------------------------------------------------------------- | |
n = 16000000 | |
-------------------------------------------------------------------------------- | |
Group 1: L3CACHE | |
+-------------------------------+---------+------------+ | |
| Event | Counter | Core 0 | | |
+-------------------------------+---------+------------+ | |
| INSTR_RETIRED_ANY | FIXC0 | 2091352731 | | |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 6163456542 | | |
| CPU_CLK_UNHALTED_REF | FIXC2 | 5950193860 | | |
| MEM_LOAD_UOPS_RETIRED_L3_ALL | PMC0 | 31687709 | | |
| MEM_LOAD_UOPS_RETIRED_L3_MISS | PMC1 | 26422113 | | |
| UOPS_RETIRED_ALL | PMC2 | 3535194571 | | |
+-------------------------------+---------+------------+ | |
+----------------------+-----------+ | |
| Metric | Core 0 | | |
+----------------------+-----------+ | |
| Runtime (RDTSC) [s] | 2.3300 | | |
| Runtime unhalted [s] | 2.3783 | | |
| Clock [MHz] | 2684.4621 | | |
| CPI | 2.9471 | | |
| L3 request rate | 0.0090 | | |
| L3 miss rate | 0.0075 | | |
| L3 miss ratio | 0.8338 | | |
+----------------------+-----------+ | |