Benchmark huge pages
====================

Host
====
Physical machine
2 * Intel X5670 @ 2.93 GHz, HT activated, 24 threads, cpufreq deactivated
48 GB RAM
OS: SLES10 SP3
Linux ncegcolnx243 2.6.16.60-0.54.5-smp #1 SMP Fri Sep 4 01:28:03 UTC 2009 x86_64 x86_64 x86_64 GNU/Linux

/proc/meminfo:
MemTotal: 49452408 kB
MemFree: 9745596 kB
Buffers: 721068 kB
Cached: 4744648 kB
SwapCached: 1188 kB
Active: 3621452 kB
Inactive: 1953992 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 49452408 kB
LowFree: 9745596 kB
SwapTotal: 8393920 kB
SwapFree: 8391420 kB
Dirty: 12 kB
Writeback: 0 kB
AnonPages: 106088 kB
Mapped: 30396 kB
Slab: 527300 kB
CommitLimit: 16342908 kB
Committed_AS: 580044 kB
PageTables: 3312 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 266784 kB
VmallocChunk: 34359471279 kB
HugePages_Total: 16384
HugePages_Free: 16384
HugePages_Rsvd: 0
Hugepagesize: 2048 kB

/proc/cpuinfo (last of the 24 logical CPUs):
processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
stepping : 2
cpu MHz : 2926.082
cache size : 12288 KB
physical id : 0
siblings : 12
core id : 10
cpu cores : 6
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc nonstop_tsc pni monitor ds_cpl vmx smx est tm2 cx16 xtpr dca popcnt lahf_lm ida arat
bogomips : 5852.17
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

numactl --hardware:
available: 2 nodes (0-1)
node 0 size: 24240 MB
node 0 free: 2773 MB
node 1 size: 24215 MB
node 1 free: 6730 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

numactl --show:
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0 1
nodebind: 0 1
membind: 0 1

/sys/devices/system/node/node0:
total 0
drwxr-xr-x 2 root root 0 Oct 8 13:31 .
drwxr-xr-x 4 root root 0 Aug 25 17:13 ..
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu0 -> ../../../../devices/system/cpu/cpu0
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu10 -> ../../../../devices/system/cpu/cpu10
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu12 -> ../../../../devices/system/cpu/cpu12
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu14 -> ../../../../devices/system/cpu/cpu14
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu16 -> ../../../../devices/system/cpu/cpu16
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu18 -> ../../../../devices/system/cpu/cpu18
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu2 -> ../../../../devices/system/cpu/cpu2
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu20 -> ../../../../devices/system/cpu/cpu20
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu22 -> ../../../../devices/system/cpu/cpu22
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu4 -> ../../../../devices/system/cpu/cpu4
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu6 -> ../../../../devices/system/cpu/cpu6
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu8 -> ../../../../devices/system/cpu/cpu8
-r--r--r-- 1 root root 4096 Oct 8 13:31 cpumap
-r--r--r-- 1 root root 4096 Oct 8 13:30 distance
-r--r--r-- 1 root root 4096 Oct 8 13:30 meminfo
-r--r--r-- 1 root root 4096 Oct 8 13:29 numastat

/sys/devices/system/node/node1:
total 0
drwxr-xr-x 2 root root 0 Oct 8 13:31 .
drwxr-xr-x 4 root root 0 Aug 25 17:13 ..
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu1 -> ../../../../devices/system/cpu/cpu1
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu11 -> ../../../../devices/system/cpu/cpu11
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu13 -> ../../../../devices/system/cpu/cpu13
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu15 -> ../../../../devices/system/cpu/cpu15
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu17 -> ../../../../devices/system/cpu/cpu17
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu19 -> ../../../../devices/system/cpu/cpu19
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu21 -> ../../../../devices/system/cpu/cpu21
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu23 -> ../../../../devices/system/cpu/cpu23
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu3 -> ../../../../devices/system/cpu/cpu3
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu5 -> ../../../../devices/system/cpu/cpu5
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu7 -> ../../../../devices/system/cpu/cpu7
lrwxrwxrwx 1 root root 0 Oct 8 13:41 cpu9 -> ../../../../devices/system/cpu/cpu9
-r--r--r-- 1 root root 4096 Oct 8 13:31 cpumap
-r--r--r-- 1 root root 4096 Oct 8 13:30 distance
-r--r--r-- 1 root root 4096 Oct 8 13:30 meminfo
-r--r--r-- 1 root root 4096 Oct 8 13:29 numastat

Result of redis-benchmark on this machine:
------------------------------------------
Using loopback
PING (inline): 135501.36 requests per second
PING: 136798.91 requests per second
MSET (10 keys): 78864.35 requests per second
SET: 134770.89 requests per second
GET: 135685.22 requests per second
INCR: 133868.81 requests per second
LPUSH: 134952.77 requests per second
LPOP: 134952.77 requests per second
SADD: 134048.27 requests per second
SPOP: 134048.27 requests per second
LPUSH (again, in order to bench LRANGE): 134228.19 requests per second
LRANGE (first 100 elements): 78988.94 requests per second
LRANGE (first 300 elements): 41614.64 requests per second
LRANGE (first 450 elements): 29994.00 requests per second
LRANGE (first 600 elements): 24195.50 requests per second

Using Unix domain socket
PING (inline): 194552.53 requests per second
PING: 194931.77 requests per second
MSET (10 keys): 96805.42 requests per second
SET: 194931.77 requests per second
GET: 193423.59 requests per second
INCR: 194931.77 requests per second
LPUSH: 196463.66 requests per second
LPOP: 194174.77 requests per second
SADD: 194174.77 requests per second
SPOP: 192307.70 requests per second
LPUSH (again, in order to bench LRANGE): 196078.44 requests per second
LRANGE (first 100 elements): 94966.77 requests per second
LRANGE (first 300 elements): 46339.20 requests per second
LRANGE (first 450 elements): 33333.33 requests per second
LRANGE (first 600 elements): 25866.53 requests per second

Redis
=====
Version 2.2.12 + patch to fix LRANGE issue + CPU affinity patch + COW ratio patch
Size of the dump file: 3 GB
Peak memory consumption: 24 GB (working set after filling or loading)
Compiled with huge page support (see the illustrative sketch after the links below)
https://gist.github.com/1240452
Benchmark program
https://gist.github.com/1272522
COW ratio patch
https://gist.github.com/1240427
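
For readers unfamiliar with the mechanism, here is a minimal, purely illustrative sketch
(not the actual huge page patch linked above) of backing a large allocation with 2 MB huge
pages on Linux via mmap() and MAP_HUGETLB. That flag only exists on kernels newer than the
2.6.16 used here (older setups typically go through hugetlbfs or libhugetlbfs), and the
pages must have been reserved beforehand through /proc/sys/vm/nr_hugepages (16384 pages on
this host, see /proc/meminfo above).

/* Illustrative sketch only: allocate memory backed by 2 MB huge pages. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000   /* x86_64 value; old glibc headers do not define it */
#endif

static void *huge_alloc(size_t size) {
    /* size must be a multiple of the huge page size (2 MB here) */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

int main(void) {
    size_t sz = 64 * 2048 * 1024UL;        /* 64 huge pages = 128 MB */
    void *p = huge_alloc(sz);
    if (!p) { perror("mmap(MAP_HUGETLB)"); return 1; }
    memset(p, 0, sz);                      /* touch the memory */
    printf("allocated %zu bytes backed by huge pages\n", sz);
    munmap(p, sz);
    return 0;
}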

Note: the COW ratio patch returns meaningless results when huge page support is activated,
because huge pages are not tracked correctly in the page map of the process.
With huge pages, the COW ratio has therefore been evaluated manually from /proc/meminfo,
by comparing HugePages_Total and HugePages_Free.
The huge page area was hard limited to 32 GB (the working set being about 24 GB),
so about one third of the Redis memory was still available to support COW at bgsave time.
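
This manual check can be scripted. Below is a minimal sketch (assuming the standard
/proc/meminfo field names shown above) that prints the current huge page usage; sample it
once before the bgsave and once during/after it, the COW ratio being roughly
(used during bgsave - used before) / (used before).

/* Sketch of the manual COW check: report huge page usage from /proc/meminfo. */
#include <stdio.h>
#include <string.h>

static long meminfo_field(const char *name) {
    char line[256];
    long value = -1;
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) return -1;
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, name, strlen(name))) {
            sscanf(line + strlen(name), " %ld", &value);
            break;
        }
    }
    fclose(f);
    return value;
}

int main(void) {
    long total = meminfo_field("HugePages_Total:");
    long freep = meminfo_field("HugePages_Free:");
    if (total < 0 || freep < 0) return 1;
    printf("huge pages used: %ld / %ld\n", total - freep, total);
    return 0;
}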

Benchmark
=========
The fill use case generates about 50M write queries per client (times 3 clients).
It is used to build the working set.
The read use case generates about 50M random read queries per client (times 3 clients).
It is used to evaluate the read throughput.
The update use case generates random write queries, with a tunable throttle to limit the throughput.
It is used to test COW efficiency (by running it concurrently with a bgsave).
All tests are done using Unix domain sockets.
Clients are bound to the same physical CPU as the Redis server, but on different cores
(enforced using CPU affinity). This is the most efficient configuration.
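
The actual CPU affinity patch and the genload clients are linked above; as an illustration
only, pinning a process to a given logical CPU can be done with sched_setaffinity(). On this
host the node0/node1 listings above show even-numbered logical CPUs on NUMA node 0 and
odd-numbered ones on node 1, so the server and the three clients can be kept on distinct
cores of the same node.

/* Illustrative only: pin the calling process to one logical CPU.
 * The CPU affinity patch linked above may proceed differently. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int cpu = (argc > 1) ? atoi(argv[1]) : 0;   /* e.g. 0, 2, 4, 6 are on node 0 */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to logical CPU %d\n", cpu);
    /* ... run the server or a benchmark client from here ... */
    return 0;
}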

Fill, 3 clients, with HP
------------------------
ncegcolnx243:genload> x 0 0 68.20s user 2.98s system 15% cpu 7:31.85 total
ncegcolnx243:genload> x 0 1 75.28s user 3.76s system 17% cpu 7:31.90 total
ncegcolnx243:genload> x 0 2 77.30s user 3.48s system 17% cpu 7:31.94 total
=> throughput = 331931 q/s (i.e. 3 x 50M queries over ~451.9 s elapsed; same calculation for the runs below)
used_cpu_sys:376.68
used_cpu_user:18.35
used_memory:18810796424
used_memory_human:17.52G
used_memory_rss:25773010944
mem_fragmentation_ratio:1.37

Read queries, 3 clients, with HP
--------------------------------
ncegcolnx243:genload> x 1 0 85.68s user 3.04s system 23% cpu 6:09.86 total
ncegcolnx243:genload> x 1 1 83.82s user 2.98s system 23% cpu 6:09.86 total
ncegcolnx243:genload> x 1 2 90.71s user 1.62s system 24% cpu 6:09.86 total
=> throughput = 405558 q/s
used_cpu_sys:736.58
used_cpu_user:28.31

Bgsave with HP
--------------
[6150] 08 Oct 15:51:51 * Fork: 3893
[6150] 08 Oct 15:51:51 * Background saving started by pid 7487
[7487] 08 Oct 15:52:57 * DB saved on disk
=> fork latency = 4 ms
=> duration = 66 secs
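
The "Fork: 3893" line is consistent with a fork duration logged in microseconds
(3893 µs ≈ 4 ms here, 286990 µs ≈ 287 ms in the run without huge pages below). A minimal
sketch of how such a latency can be measured in the parent around fork() follows; the actual
instrumentation in the patched server may differ.

/* Sketch: time fork() from the parent's point of view, in microseconds. */
#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    struct timeval start, end;
    gettimeofday(&start, NULL);
    pid_t pid = fork();
    if (pid == 0) _exit(0);               /* child: bgsave would serialize the dataset here */
    gettimeofday(&end, NULL);
    if (pid < 0) { perror("fork"); return 1; }
    long usec = (end.tv_sec - start.tv_sec) * 1000000L
              + (end.tv_usec - start.tv_usec);
    printf("Fork: %ld us\n", usec);
    return 0;
}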

Restart with HP
---------------
[7630] 08 Oct 15:57:22 * Server started, Redis version 2.2.12
[7630] 08 Oct 15:58:22 * DB loaded from disk: 60 seconds

Fill, 3 clients, without HP
---------------------------
ncegcolnx243:genload> x 0 0 68.89s user 2.04s system 15% cpu 7:38.15 total
ncegcolnx243:genload> x 0 2 67.00s user 3.56s system 15% cpu 7:38.22 total
ncegcolnx243:genload> x 0 1 74.39s user 2.95s system 16% cpu 7:38.24 total
=> throughput = 327367 q/s
used_cpu_sys:381.86
used_cpu_user:24.17
used_memory:18810795776
used_memory_human:17.52G
used_memory_rss:25715482624
mem_fragmentation_ratio:1.37

Read queries, 3 clients, without HP
-----------------------------------
ncegcolnx243:genload> x 1 0 81.52s user 2.60s system 22% cpu 6:21.30 total
ncegcolnx243:genload> x 1 1 84.88s user 1.30s system 22% cpu 6:21.30 total
ncegcolnx243:genload> x 1 2 87.23s user 2.46s system 23% cpu 6:21.30 total
=> throughput = 393391 q/s
used_cpu_sys:752.62
used_cpu_user:34.69

Bgsave without HP
-----------------
[17603] 08 Oct 18:34:06 * Fork: 286990
[17603] 08 Oct 18:34:06 * Background saving started by pid 18600
[18600] 08 Oct 18:35:13 * DB saved on disk
=> fork latency = 287 ms
=> duration = 67 secs

Restart without HP
------------------
[19396] 08 Oct 18:48:38 * Server started, Redis version 2.2.12
[19396] 08 Oct 18:49:41 * DB loaded from disk: 63 seconds

COW efficiency evaluation
-------------------------
With huge pages, at only 60 w/s, the ratio is about 25%.
With a higher throughput (for instance 120 w/s), the 32 GB limit is quickly reached and Redis is killed,
which means too many pages (more than one third) get copied.
Without huge pages, at 60 w/s, the ratio is negligible.
Without huge pages, at 5000 w/s, the ratio is about 5%.

Final results
=============
Here are all the results in a single table
(CPU figures for the read phase are deltas of the cumulative INFO counters):

                                  With HP    Without HP    Ratio %
Throughput fill (q/s)              331931        327367     101.39
    CPU user                       376.68        381.86      98.64
    CPU sys                         18.35         24.17      75.92
    CPU total                      395.03        406.03      97.29
Throughput read queries (q/s)      405558        393391     103.09
    CPU user                        359.9        370.76      97.07
    CPU sys                          9.96         10.52      94.68
    CPU total                      369.86        381.28      97.00
Fork latency (ms)                       4           287       1.39
BGSAVE duration (s)                    66            67      98.51
Load duration (s)                      60            63      95.24

We can see the gain in throughput due to huge pages is
between 1 and 4% (i.e. a few percent only). The gain
in system CPU is about 25% at object creation time
(but system CPU accounts for a tiny fraction of the
total CPU consumption). There are also small gains of
about 1.5% and 5% at save and load time respectively.

It is clear that activating huge pages to boost the general
performance of Redis is not really worth it.

The most noticeable benefit is of course the fork latency,
which drops dramatically from 287 ms to only 4 ms (for a 24 GB
instance).

COW efficiency is abysmal with huge pages. Even with a very low
update rate, most of the pages end up duplicated very quickly.
It is almost mandatory to provision twice the memory to support
background save, unless some strong locality of the traffic can
be exploited.