Here are some quick benchmarks for the "Decade of Wasted Cores" patches on Linux 4.1. I had to add "extern int sched_max_numa_distance;" to arch/x86/kernel/smpboot.c for Linux 4.1 to compile. I did brief analysis during the benchmarks, using time(1) and mpstat(1) to check runtimes, usr/sys time, and per-CPU balance, and iostat(1) to check for disk bottlenecks.
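The basic approach, sketched (illustrative commands, not the exact invocations I used):

make clean
time make -j32        # build benchmark: wall clock plus total usr/sys CPU time
mpstat -P ALL 10      # in another terminal: per-CPU balance during the run
iostat -xz 10         # in another terminal: check for disk bottlenecks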
Summary: no significant difference seen in these tests.
The patch shouldn't make a difference to this 1-node system, but I felt it worth checking, especially since most of our systems are 1-node.
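As an aside, the node count is easy to confirm; either of these shows the NUMA topology (just an illustration):

numactl --hardware
lscpu | grep -i numa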
With "make clean" beforehand, then "make -j32"
real 6m3.778s
user 147m43.717s
sys 34m12.343s
some mpstat:
root@bgregg-build-i-b5469632:/mnt/src/linux-4.1# mpstat 10
Linux 4.1.0-virtual (bgregg-build-i-b5469632) 04/15/2016 _x86_64_ (32 CPU)
05:02:05 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
05:02:15 PM all 84.45 0.00 14.99 0.01 0.00 0.00 0.01 0.00 0.00 0.54
And on the other instance:
real 6m5.734s
user 144m31.960s
sys 33m13.331s
This is a little slower, but roughly the same (as one would expect).
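For reference, that's a delta of about 2 seconds in roughly 364, or about 0.5%. As a cross-check, (user + sys) / real works out to about 29-30 for both runs, so the build is keeping roughly 30 of the 32 CPUs busy.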
some mpstat:
root@bgregg-build-i-774595f0:/mnt/src/linux-4.1# mpstat 10
Linux 4.1.0-virtual (bgregg-build-i-774595f0) 04/15/2016 _x86_64_ (32 CPU)
05:02:08 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
05:02:18 PM all 80.75 0.00 16.31 0.04 0.00 0.00 0.01 0.00 0.00 2.88
05:02:28 PM all 80.86 0.00 16.29 0.04 0.00 0.00 0.01 0.00 0.00 2.80
With "make clean" beforehand, then "make -j32"
2 runs:
real 4m59.373s
user 132m40.200s
sys 13m12.346s
real 4m59.708s
user 132m38.854s
sys 13m12.634s
some mpstat, system wide:
root@bgregg-build-i-5f37e7d8:/mnt/src/scratch/linux-4.1# mpstat 10
Linux 4.1.0-virtual (bgregg-build-i-5f37e7d8) 04/15/2016 _x86_64_ (32 CPU)
04:25:50 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
04:26:00 PM all 88.58 0.00 7.64 0.00 0.00 0.00 0.01 0.00 0.00 3.78
04:26:10 PM all 88.76 0.00 7.49 0.00 0.00 0.00 0.01 0.00 0.00 3.74
After the patch, 2 runs:
real 5m0.805s
user 134m43.971s
sys 13m28.598s
real 4m59.442s
user 134m35.188s
sys 13m31.877s
Run times are roughly the same.
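The spread across all four runs is about 1.4 seconds in roughly 300, under 0.5%.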
root@bgregg-build-i-6636e6e1:/mnt/src/scratch/linux-4.1# mpstat 10
Linux 4.1.0-virtual (bgregg-build-i-6636e6e1) 04/15/2016 _x86_64_ (32 CPU)
04:25:51 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
04:26:01 PM all 90.42 0.00 7.73 0.00 0.00 0.00 0.01 0.00 0.00 1.84
04:26:11 PM all 90.88 0.00 7.33 0.00 0.00 0.00 0.01 0.00 0.00 1.77
A higher %usr and lower %idle looks promising, and suggests a roughly 1.5% performance win. However, since the runtimes are equivalent, I would assume this is misleading, and that the higher %usr time may be the result of slightly worse memory placement, causing slightly more CPU cycles to accomplish the same number of instructions. PMC testing could confirm this. And this slightly worse memory placement could be accidental, and unrelated to the patch.
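A sketch of what that PMC check might look like (assuming perf is installed and the hypervisor exposes PMCs to the guest; event availability varies by instance type):

perf stat -a -e cycles,instructions -- sleep 10    # system-wide 10 second sample while the build runs
# fewer instructions per cycle on the patched system would support the memory placement theory;
# equal IPC would point the finger elsewhere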
Not pictured is "mpstat -P ALL 1" output, which showed both systems were driving all CPUs to over 90%, with occasional 10% idle CPUs. This is at a one second granularity. I can drill down further using a CPU subsecond offset heat map, something I did recently at Netflix to solve an issue ( http://www.slideshare.net/brendangregg/srecon-2016-performance-checklists-for-sres/54 ). I mention this as the paper suggests that systems engineering should be using visualizations more for this kind of analysis -- well, I already am, and have done for many years.
This sysbench CPU test should have the same performance on both systems, and does. I'm running it more as a sanity test.
root@bgregg-build-i-5f37e7d8:/mnt/src/scratch/linux-4.1# sysbench --max-requests=10000000 --max-time=10 --num-threads=64 --test=cpu --cpu-max-prime=10000 run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 64
Doing CPU performance benchmark
Threads started!
Time limit exceeded, exiting...
(last message repeated 63 times)
Done.
Maximum prime number checked in CPU test: 10000
Test execution summary:
total time: 10.0021s
total number of events: 260639
total time taken by event execution: 636.8024
per-request statistics:
min: 1.05ms
avg: 2.44ms
max: 53.22ms
approx. 95 percentile: 17.23ms
Threads fairness:
events (avg/stddev): 4072.4844/420.30
execution time (avg/stddev): 9.9500/0.04
root@bgregg-build-i-6636e6e1:/mnt/src/scratch/linux-4.1# sysbench --max-requests=10000000 --max-time=10 --num-threads=64 --test=cpu --cpu-max-prime=10000 run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 64
Doing CPU performance benchmark
Threads started!
Time limit exceeded, exiting...
(last message repeated 63 times)
Done.
Maximum prime number checked in CPU test: 10000
Test execution summary:
total time: 10.0018s
total number of events: 260817
total time taken by event execution: 639.2294
per-request statistics:
min: 1.19ms
avg: 2.45ms
max: 53.26ms
approx. 95 percentile: 17.23ms
Threads fairness:
events (avg/stddev): 4075.2656/169.34
execution time (avg/stddev): 9.9880/0.01
Compare the "total number of events" line.
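That's 260,639 vs 260,817 events, a difference of 178, or about 0.07% -- noise.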
Next, a sysbench mutex test:
root@bgregg-build-i-5f37e7d8:/mnt/src/scratch/linux-4.1# sysbench --test=mutex --num-threads=64 --mutex-num=16 --mutex-locks=1000000 run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 64
Doing mutex performance test
Threads started!
Done.
Test execution summary:
total time: 20.2562s
total number of events: 64
total time taken by event execution: 1288.6230
per-request statistics:
min: 19831.06ms
avg: 20134.74ms
max: 20254.82ms
approx. 95 percentile: 20247.58ms
Threads fairness:
events (avg/stddev): 1.0000/0.00
execution time (avg/stddev): 20.1347/0.11
root@bgregg-build-i-6636e6e1:/mnt/src/scratch/linux-4.1# sysbench --test=mutex --num-threads=64 --mutex-num=16 --mutex-locks=1000000 run
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 64
Doing mutex performance test
Threads started!
Done.
Test execution summary:
total time: 20.4716s
total number of events: 64
total time taken by event execution: 1303.3781
per-request statistics:
min: 20148.96ms
avg: 20365.28ms
max: 20470.28ms
approx. 95 percentile: 20460.82ms
Threads fairness:
events (avg/stddev): 1.0000/0.00
execution time (avg/stddev): 20.3653/0.08
Not pictured: this test has too much variance between runs (±5%); I should really ditch it. I just had the command line on hand from an earlier investigation.
Running it several times doesn't really show one system as faster than the other.
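For the runs shown, total time was 20.2562s vs 20.4716s, a difference of about 1%, well inside that variance.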
I haven't tested a larger NUMA system. If we had an 8-node NUMA system, that would certainly be a priority. :)
These are some quick tests, and I may well have made mistakes or presented data that later turns out to be misleading. Getting accurate and trustworthy benchmark results can take weeks of effort, and requires careful analysis with multiple tools to identify the limiting factor of each benchmark.