I'm running some SSE and AVX instructions on Harpertown and Sandy Bridge systems, and I noticed that the Sandy Bridge system scales to more cores before performance flatlines. The Harpertown system did not improve at all when going from 1 thread to 2 threads, so I started to look into why.
Running with 1 thread:
likwid-perfctr -C S0:0@S1:0 -g MEM ./example 1
+-----------------------------+-------------+---------+
| Metric                      | core 0      | core 1  |
+-----------------------------+-------------+---------+
| Runtime (RDTSC) [s]         | 7.795       | 7.795   |
| Runtime unhalted [s]        | 0.00113065  | 5.25528 |
| CPI                         | 1.38358     | 2.23844 |
| Memory bandwidth [MBytes/s] | 0.0809299   | 2162.31 | <--- Look here
| Memory data volume [GBytes] | 0.000630848 | 16.8552 |
+-----------------------------+-------------+---------+
Now I tell likwid to pin the pthreads to socket 0 core 0 and socket 1 core 0 (the same -C S0:0@S1:0 list as in the 1-thread run), so that the two threads do not share any in-socket resources.
Running with 2 threads:
likwid-perfctr -C S0:0@S1:0 -g MEM ./example 2
+-----------------------------+---------+---------+
| Metric                      | core 0  | core 1  |
+-----------------------------+---------+---------+
| Runtime (RDTSC) [s]         | 12.0259 | 12.0259 |
| Runtime unhalted [s]        | 8.20732 | 8.25249 |
| CPI                         | 3.49461 | 3.51511 |
| Memory bandwidth [MBytes/s] | 1398.74 | 1399.11 | <--- Look here, bandwidth looks shared!
| Memory data volume [GBytes] | 16.8211 | 16.8256 |
+-----------------------------+---------+---------+
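To put numbers on "looks shared", here is a small sanity check of the two tables. It is only a sketch: the values are copied from the likwid output, the helper names are mine, and it assumes that 1 GByte = 1000 MBytes and that the per-core figures can be summed without double counting. That last assumption may not hold, depending on how the MEM group counters attribute bus traffic on this architecture.

```python
# Values copied from the two likwid-perfctr MEM tables.
# In the 1-thread run only the busy core is listed; the other core is idle (~0.08 MBytes/s).
runs = {
    "1 thread": {
        "runtime_s": 7.795,
        "volume_gb": [16.8552],
        "reported_mbs": [2162.31],
    },
    "2 threads": {
        "runtime_s": 12.0259,
        "volume_gb": [16.8211, 16.8256],
        "reported_mbs": [1398.74, 1399.11],
    },
}


def derived_bandwidth_mbs(volume_gb, runtime_s):
    """Bandwidth in MBytes/s from data volume and RDTSC runtime.

    Assumes 1 GByte = 1000 MBytes.
    """
    return volume_gb * 1000.0 / runtime_s


def aggregate_mbs(run):
    """Sum of the per-core bandwidths (assumes no double counting)."""
    return sum(run["reported_mbs"])


for name, run in runs.items():
    for vol, reported in zip(run["volume_gb"], run["reported_mbs"]):
        derived = derived_bandwidth_mbs(vol, run["runtime_s"])
        print(f"{name}: reported {reported:.2f}, volume/runtime {derived:.2f} MBytes/s")

# How much the summed bandwidth and the wall-clock time change from 1 to 2 threads.
bandwidth_ratio = aggregate_mbs(runs["2 threads"]) / aggregate_mbs(runs["1 thread"])
runtime_ratio = runs["2 threads"]["runtime_s"] / runs["1 thread"]["runtime_s"]
print(f"aggregate bandwidth ratio (2 threads / 1 thread): {bandwidth_ratio:.2f}")
print(f"runtime ratio (2 threads / 1 thread): {runtime_ratio:.2f}")
```

Under those assumptions the reported bandwidth is just data volume divided by RDTSC runtime, and the summed bandwidth of the two pinned threads comes out well below twice the single-thread figure, even though the threads sit on different sockets.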