@brendangregg
Last active April 15, 2022 04:08
decade quick benchmarks

These are some quick benchmarks for the "Decade of Wasted Cores" patches on Linux 4.1. To get Linux 4.1 to compile, I had to add "extern int sched_max_numa_distance;" to arch/x86/kernel/smpboot.c. Brief analysis during the benchmarks used time(1) and mpstat(1) to check runtimes, usr/sys time, and per-CPU balance, and iostat(1) to check for disk bottlenecks.
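The measurement harness can be sketched as follows (a sketch only, assuming a configured kernel source tree; the mpstat.out/iostat.out file names are my own):

```shell
# run system-wide monitors in the background while timing the build
make clean
mpstat 10 > mpstat.out &      # usr/sys time and per-CPU balance
iostat -x 10 > iostat.out &   # check for disk bottlenecks
time make -j32                # reports real/user/sys runtimes
kill %1 %2                    # stop the monitors when the build is done
```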

Summary: no significant difference seen in these tests.

c3.8xlarge (32 CPU) PV 1-node NUMA

The patch shouldn't make a difference to this 1-node system, but I felt it worth checking, especially since most of our systems are 1-node.

kernel build

before

With "make clean" beforehand, then "make -j32"

real	6m3.778s
user	147m43.717s
sys	34m12.343s

some mpstat:

root@bgregg-build-i-b5469632:/mnt/src/linux-4.1# mpstat 10
Linux 4.1.0-virtual (bgregg-build-i-b5469632) 	04/15/2016 	_x86_64_	(32 CPU)

05:02:05 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
05:02:15 PM  all   84.45    0.00   14.99    0.01    0.00    0.00    0.01    0.00    0.00    0.54

after

real	6m5.734s
user	144m31.960s
sys	33m13.331s

This is a little slower, but roughly the same (as one would expect).
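The before/after difference can be quantified with a quick calculation (a sketch; the "real" times are taken from the runs above):

```python
def to_seconds(m, s):
    """Convert minutes + seconds, as printed by time(1), to seconds."""
    return m * 60 + s

before = to_seconds(6, 3.778)   # real 6m3.778s  (before the patch)
after  = to_seconds(6, 5.734)   # real 6m5.734s  (after the patch)

delta_pct = (after - before) / before * 100
print(f"after is {delta_pct:.2f}% slower")  # ~0.54%: within run-to-run noise
```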

some mpstat:

root@bgregg-build-i-774595f0:/mnt/src/linux-4.1# mpstat 10
Linux 4.1.0-virtual (bgregg-build-i-774595f0) 	04/15/2016 	_x86_64_	(32 CPU)

05:02:08 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
05:02:18 PM  all   80.75    0.00   16.31    0.04    0.00    0.00    0.01    0.00    0.00    2.88
05:02:28 PM  all   80.86    0.00   16.29    0.04    0.00    0.00    0.01    0.00    0.00    2.80

c3.8xlarge (32 CPU) PVHVM 2-node NUMA

kernel build

With "make clean" beforehand, then "make -j32"

before

2 runs:

real	4m59.373s
user	132m40.200s
sys	13m12.346s

real	4m59.708s
user	132m38.854s
sys	13m12.634s

some mpstat, system wide:

root@bgregg-build-i-5f37e7d8:/mnt/src/scratch/linux-4.1# mpstat 10              
Linux 4.1.0-virtual (bgregg-build-i-5f37e7d8)   04/15/2016      _x86_64_        (32 CPU)

04:25:50 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:26:00 PM  all   88.58    0.00    7.64    0.00    0.00    0.00    0.01    0.00    0.00    3.78
04:26:10 PM  all   88.76    0.00    7.49    0.00    0.00    0.00    0.01    0.00    0.00    3.74

after

real	5m0.805s
user	134m43.971s
sys	13m28.598s

real	4m59.442s
user	134m35.188s
sys	13m31.877s

Run times are roughly the same.

root@bgregg-build-i-6636e6e1:/mnt/src/scratch/linux-4.1# mpstat 10
Linux 4.1.0-virtual (bgregg-build-i-6636e6e1)   04/15/2016      _x86_64_        (32 CPU)

04:25:51 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:26:01 PM  all   90.42    0.00    7.73    0.00    0.00    0.00    0.01    0.00    0.00    1.84
04:26:11 PM  all   90.88    0.00    7.33    0.00    0.00    0.00    0.01    0.00    0.00    1.77

A higher %usr and lower %idle looks promising, and suggests a roughly 1.5% performance win. However, since the runtimes are equivalent, I would assume this is misleading: the higher %usr time may be a result of slightly worse memory placement, causing slightly more CPU cycles to accomplish the same number of instructions. PMC testing can confirm. And this slightly worse memory placement could be accidental, and unrelated to the patch.
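For reference, the PMC check could look like this (a sketch; it assumes PMCs are exposed to the guest, which was often not the case on EC2 instances of this era):

```shell
# compare instructions-per-cycle (IPC) across the two kernels, system-wide;
# a lower IPC after the patch would support the worse-memory-placement theory
perf stat -e cycles,instructions -a -- make -j32
```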

Not pictured is "mpstat -P ALL 1" output, which showed both systems were driving all CPUs to over 90%, with occasional 10% idle CPUs. This is at a one second granularity. I can drill down further using a CPU subsecond offset heat map, something I did recently at Netflix to solve an issue ( http://www.slideshare.net/brendangregg/srecon-2016-performance-checklists-for-sres/54 ). I mention this as the paper suggests that systems engineering should be using visualizations more for this kind of analysis -- well, I already am, and have done for many years.

sysbench CPU test

This should have the same performance, and does. I'm running this more as a sanity test.

before

root@bgregg-build-i-5f37e7d8:/mnt/src/scratch/linux-4.1# sysbench --max-requests=10000000 --max-time=10 --num-threads=64 --test=cpu --cpu-max-prime=10000 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 64

Doing CPU performance benchmark

Threads started!
Time limit exceeded, exiting...
(last message repeated 63 times)
Done.

Maximum prime number checked in CPU test: 10000


Test execution summary:
    total time:                          10.0021s
    total number of events:              260639
    total time taken by event execution: 636.8024
    per-request statistics:
         min:                                  1.05ms
         avg:                                  2.44ms
         max:                                 53.22ms
         approx.  95 percentile:              17.23ms

Threads fairness:
    events (avg/stddev):           4072.4844/420.30
    execution time (avg/stddev):   9.9500/0.04

after

root@bgregg-build-i-6636e6e1:/mnt/src/scratch/linux-4.1# sysbench --max-requests=10000000 --max-time=10 --num-threads=64 --test=cpu --cpu-max-prime=10000 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 64

Doing CPU performance benchmark

Threads started!
Time limit exceeded, exiting...
(last message repeated 63 times)
Done.

Maximum prime number checked in CPU test: 10000


Test execution summary:
    total time:                          10.0018s
    total number of events:              260817
    total time taken by event execution: 639.2294
    per-request statistics:
         min:                                  1.19ms
         avg:                                  2.45ms
         max:                                 53.26ms
         approx.  95 percentile:              17.23ms

Threads fairness:
    events (avg/stddev):           4075.2656/169.34
    execution time (avg/stddev):   9.9880/0.01

Compare the "total number of events" line.
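The throughput difference works out like this (a quick calculation from the "total number of events" lines above):

```python
before = 260639   # events completed in 10s, before the patch
after  = 260817   # events completed in 10s, after the patch

delta_pct = (after - before) / before * 100
print(f"{delta_pct:.3f}% more events after")  # ~0.068%: equivalent
```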

sysbench lock test

before

root@bgregg-build-i-5f37e7d8:/mnt/src/scratch/linux-4.1# sysbench --test=mutex --num-threads=64 --mutex-num=16 --mutex-locks=1000000 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 64

Doing mutex performance test
Threads started!
Done.


Test execution summary:
    total time:                          20.2562s
    total number of events:              64
    total time taken by event execution: 1288.6230
    per-request statistics:
         min:                              19831.06ms
         avg:                              20134.74ms
         max:                              20254.82ms
         approx.  95 percentile:           20247.58ms

Threads fairness:
    events (avg/stddev):           1.0000/0.00
    execution time (avg/stddev):   20.1347/0.11

after

root@bgregg-build-i-6636e6e1:/mnt/src/scratch/linux-4.1# sysbench --test=mutex --num-threads=64 --mutex-num=16 --mutex-locks=1000000 run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 64

Doing mutex performance test
Threads started!
Done.


Test execution summary:
    total time:                          20.4716s
    total number of events:              64
    total time taken by event execution: 1303.3781
    per-request statistics:
         min:                              20148.96ms
         avg:                              20365.28ms
         max:                              20470.28ms
         approx.  95 percentile:           20460.82ms

Threads fairness:
    events (avg/stddev):           1.0000/0.00
    execution time (avg/stddev):   20.3653/0.08

Not pictured: this test has too much variance between runs (±5%), so I should really ditch it. I just had the command line on hand from an earlier investigation.

Running it several times doesn't really show one system faster than another.
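Comparing the "execution time" averages above shows why (a quick calculation; the 5% figure is the run-to-run variance I observed for this test):

```python
before_avg = 20.1347   # execution time avg, before (stddev 0.11)
after_avg  = 20.3653   # execution time avg, after  (stddev 0.08)

delta_pct = (after_avg - before_avg) / before_avg * 100
noise_pct = 5.0        # observed run-to-run variance for this test
print(f"after is {delta_pct:.1f}% slower, vs +/-{noise_pct:.0f}% noise")
```

The ~1.1% difference is well inside the noise, so no conclusion can be drawn.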

real world app test

Haven't done it. If we had an 8 node NUMA system, this would certainly be a priority. :)

further work

These are some quick tests, and I may certainly have made mistakes or presented data that is later found to be misleading. Getting accurate and trustworthy benchmarks can take weeks of effort, and requires careful analysis using multiple tools to identify the limiting factor of each benchmark.
