Creating SYN flood and web workload:
%hping3 -I p1p1 -p 8701 -S -V --flood 192.168.50.1
%wrk -t8 -c1000 -d10s http://192.168.50.1:8701
DRIVER VERSION
%route add default gw a.b.31.254 em1
%ethtool -i p1p1
CPU
Set cpupower to the appropriate governor profile.
The cpupowerutils package is required:
%yum install cpupowerutils
List available governors:
%cpupower frequency-info --governors
Set the governor:
%cpupower frequency-set --governor performance
analyzing CPU 0:
performance powersave
Make sure the service is running:
%systemctl status cpupower
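The governor can also be set per CPU through sysfs, without cpupower. A minimal sketch; the function name `set_governor` and the overridable sysfs root are my additions (the root defaults to /sys, and is parameterized only so the loop can be exercised against a mock tree without root privileges):

```shell
# Set the given cpufreq governor on every CPU found under a sysfs root.
# On the real system, call: set_governor performance
set_governor() {
    local gov=$1 root=${2:-/sys} f
    for f in "$root"/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
        if [ -f "$f" ]; then
            echo "$gov" > "$f"
        fi
    done
}
```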
Load the ioatdma module:
%modprobe ioatdma
%lsmod | grep dma
Inspect and configure for the desired NUMA configuration:
View the NUMA architecture:
%numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14
node 1 cpus: 1 3 5 7 9 11 13 15
Locate the adapter's node:
%cat /sys/class/net/p1p1/device/numa_node
1
List CPU masks:
%cat /sys/class/net/p1p1/device/local_cpus
0000,00000000,00000000,00000000,0000aaaa
[... 1010 1010 1010 1010]
[... fedc ba98 7654 3210]
%cat /sys/class/net/p1p1/device/local_cpulist
1,3,5,7,9,11,13,15
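The hex mask and the CPU list are two encodings of the same information: bit N set in the mask means CPU N is local to the adapter. A small helper (the name `mask_to_cpulist` is mine) that converts one to the other, pure arithmetic with no hardware needed:

```shell
# Convert a hex CPU mask (as printed by local_cpus) into a CPU list
# (as printed by local_cpulist). Bit N of the mask selects CPU N.
mask_to_cpulist() {
    local mask=$((16#${1})) cpu=0 out=""
    while [ "$mask" -ne 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            out="${out:+$out,}$cpu"
        fi
        mask=$((mask >> 1))
        cpu=$((cpu + 1))
    done
    echo "$out"
}

mask_to_cpulist aaaa   # 0xaaaa -> 1,3,5,7,9,11,13,15
```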
Bind Onload data structures to the local NUMA node:
%numactl --cpunodebind=1 onload_tool reload
Configure offload engines:
List offload engines:
%ethtool -k p1p1
rx, tx, tso, lro
Turn on/off:
%ethtool -K p1p1 rx|tx|tso|lro on|off
SET FIRMWARE VARIANT
%sfboot -h
firmware-variant=full-feature|ultra-low-latency|capture-packed-stream|auto
%sfboot -i p1p1
%sfboot -i p1p1 firmware-variant=ultra-low-latency
IRQ Coalescing
%systemctl status irqbalance
%ethtool -c p1p1
%ethtool -C p1p1 rx-usecs 60 adaptive-rx off
/etc/modprobe.d/sfc.conf:
options sfc rx_irq_mod_usec=60
options sfc tx_irq_mod_usec=150
IRQ Affinity (script)
IRQs local, application separate
%sfcaffinity_config auto
%cat /proc/interrupts | grep p1p1
A,B,C,D ...
%cat /proc/irq/A/smp_affinity
%cat /proc/irq/B/smp_affinity
%cat /proc/irq/C/smp_affinity
%cat /proc/irq/D/smp_affinity
Note: smp_affinity is parsed as hex, bit N = CPU N:
1,2,4,8,10,20,40,80 [CPUs 0,1,2,3,4,5,6,7]
%echo 2 > /proc/irq/A/smp_affinity
%echo 4 > /proc/irq/B/smp_affinity
%echo 10 > /proc/irq/C/smp_affinity
%echo 40 > /proc/irq/D/smp_affinity
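Since smp_affinity is a hex bitmask where bit N selects CPU N, the value to write for a single CPU is just 1 << N printed in hex. A throwaway helper (the name `cpu_to_mask` is mine):

```shell
# Print the hex smp_affinity mask that pins an IRQ to a single CPU.
# The kernel parses /proc/irq/*/smp_affinity as hex; bit N = CPU N.
cpu_to_mask() { printf '%x\n' $((1 << $1)); }

cpu_to_mask 1   # -> 2  (CPU 1)
cpu_to_mask 4   # -> 10 (CPU 4)
cpu_to_mask 6   # -> 40 (CPU 6)
```

Usage on a real system would be along the lines of `echo $(cpu_to_mask 4) > /proc/irq/C/smp_affinity`.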
%taskset -c 0 haproxy -db -f haproxy.conf &
%taskset -c 2 haproxy -db -f haproxy.conf &
%taskset -c 4 haproxy -db -f haproxy.conf &
%taskset -c 6 haproxy -db -f haproxy.conf
RSS
options sfc rss_cpus=8
options sfc rss_numa_local=1
options sfc rx_recycle_ring_size=2048
%modprobe sfc
RFS
%echo 8192 > /proc/sys/net/core/rps_sock_flow_entries
%echo 1024 > /sys/class/net/p1p1/queues/rx-0/rps_flow_cnt
%echo 1024 > /sys/class/net/p1p1/queues/rx-1/rps_flow_cnt
%echo 1024 > /sys/class/net/p1p1/queues/rx-2/rps_flow_cnt
%echo 1024 > /sys/class/net/p1p1/queues/rx-N/rps_flow_cnt
%ethtool -K p1p1 ntuple on
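The per-queue echoes above can be looped over whatever rx queues the interface actually has. A sketch; the function name is mine, and the net sysfs root is parameterized (defaulting to /sys/class/net) only so the loop can be exercised against a mock tree without root:

```shell
# Write rps_flow_cnt for every rx queue of an interface.
# On the real system, call: set_rps_flow_cnt p1p1 1024
set_rps_flow_cnt() {
    local iface=$1 cnt=$2 root=${3:-/sys/class/net} q
    for q in "$root/$iface"/queues/rx-*; do
        if [ -d "$q" ]; then
            echo "$cnt" > "$q/rps_flow_cnt"
        fi
    done
}
```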
XPS
options sfc sxps_enabled=0
options sfc rx_copybreak=192
Kernel
/etc/sysctl.conf:
sysctl net.core.tcp_ #optional
sysctl net.core.busy_poll #optional
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.tcp_synack_retries = 3
OPF
%ulimit -n 1000000
/usr/libexec/onload/profiles/haproxy.opf
%onload -p haproxy taskset -c N haproxy -db -f haproxy.conf
# OpenOnload haproxy Profile
onload_set EF_POLL_USEC 100000 #Enable for spin
onload_set EF_TCP_FASTSTART_INIT 0 #latency
onload_set EF_TCP_FASTSTART_IDLE 0 #latency
onload_set EF_CLUSTER_IGNORE 0 #0 enables clustering
onload_set EF_CLUSTER_SIZE N # No. of HAProxy instances
onload_set EF_TCP_SYNCOOKIES 1 #CPU utilization
onload_set EF_TCP_BACKLOG_MAX N #Set equal to kernel per instance
onload_set EF_TCP_SYNRECV_MAX N #Set equal to kernel per instance
onload_set EF_SOCKET_CACHE_MAX N #Test 90000
onload_set EF_PER_SOCKET_CACHE_MAX N #Test unset
onload_set EF_MAX_ENDPOINTS N #Test 100000
#onload_set EF_SCALABLE_FILTERS_ENABLE 0
#onload_set EF_SCALABLE_FILTERS p1p1=passive
#onload_set EF_SOCKET_CACHE_PORTS 8701
#onload_set EF_TCP_FORCE_REUSEPORT 8701
#onload_set EF_HIGH_THROUGHPUT_MODE 0
HAProxy
/etc/haproxy/haproxy.cfg:
mode tcp
retries 2
timeout http-request 3s
timeout queue 10s
timeout connect 3s
timeout client 10s
timeout server 10s
timeout http-keep-alive 4s
timeout check 3s
maxconn 8000
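For context, a minimal haproxy.cfg sketch showing where those settings sit. The bind address and port are taken from the workload commands at the top of these notes; the backend server address is an assumption. The http-request and http-keep-alive timeouts only take effect in mode http, so they are omitted from this tcp-mode sketch:

```
defaults
    mode tcp
    retries 2
    timeout queue 10s
    timeout connect 3s
    timeout client 10s
    timeout server 10s
    timeout check 3s
    maxconn 8000

frontend ft_web
    bind 192.168.50.1:8701
    default_backend bk_web

backend bk_web
    server web1 192.168.50.2:80 check
```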
As David mentioned in his email, it's beneficial with Onload to turn on application clustering.
Clustering should be enabled by default, but to set it explicitly add the following to the OPF file we created:
onload_set EF_CLUSTER_IGNORE 0
onload_set EF_CLUSTER_SIZE N
where N is the number of haproxy workers.
It is also recommended that you do NOT run haproxy in daemon mode; there may be an issue with fork as it relates to epoll.
This may have been solved in later kernels and the latest Onload, but it's worth testing to see whether you get any performance improvement.
So try launching N workers explicitly, for example:
onload haproxy -db -f haproxy.conf &
onload haproxy -db -f haproxy.conf &
… and so on, N times.
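The N explicit launches can be scripted. A sketch assuming worker i goes on even core 2*i, matching the even-numbered local-node CPU list shown earlier on this box (an assumption; adjust for your topology). The core computation is split into its own function so it can be checked on its own, and N should match EF_CLUSTER_SIZE:

```shell
# Core assignment for worker i: even cores 0, 2, 4, ...
# (assumed local to the NIC's NUMA node here; adjust as needed).
worker_core() { echo $(( $1 * 2 )); }

# Launch N onload-accelerated haproxy workers, one per core.
launch_workers() {
    local n=$1 i
    for i in $(seq 0 $((n - 1))); do
        taskset -c "$(worker_core "$i")" onload -p haproxy haproxy -db -f haproxy.conf &
    done
}
# On the real system: launch_workers 4
```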
Per our conversation regarding pinning of the application and interrupts:
on our machine, which was a quad-core dual-socket with hyper-threading enabled,
we noticed that for up to 5 haproxy threads it is better to keep the interrupts on the local NUMA node,
but with 6 or more haproxy threads it is better to leave the interrupts on CPUs 0 to 7.
You can use the set_irq_affinity script or do it explicitly with "ethtool -L <interfacename> combined 7".
I think the best we've seen is 6x connections per second compared to the kernel.
A plot is attached of haproxy performance, ND (net driver) vs Onload (O).
The testing for Transparent Mode (aka upstream keepalive) is labeled with 0.
The testing for Non-Transparent Mode (aka upstream non-keepalive) is labeled with 1.
The X-axis is 1000s of 1-Kbyte HTTP requests per second.
Since we did this testing we've added a feature called "scalable wild filters" that could have additional benefits,
but we have not tested it with haproxy.
ethtool -N|-U|--config-nfc|--config-ntuple DEVNAME    Configure Rx network flow classification options or rules
    rx-flow-hash tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6 m|v|t|s|d|f|n|r... |
    flow-type ether|ip4|tcp4|udp4|sctp4|ah4|esp4
        [ src %x:%x:%x:%x:%x:%x [m %x:%x:%x:%x:%x:%x] ]
        [ dst %x:%x:%x:%x:%x:%x [m %x:%x:%x:%x:%x:%x] ]
        [ proto %d [m %x] ]
        [ src-ip %d.%d.%d.%d [m %d.%d.%d.%d] ]
        [ dst-ip %d.%d.%d.%d [m %d.%d.%d.%d] ]
        [ tos %d [m %x] ]
        [ l4proto %d [m %x] ]
        [ src-port %d [m %x] ]
        [ dst-port %d [m %x] ]
        [ spi %d [m %x] ]
        [ vlan-etype %x [m %x] ]
        [ vlan %x [m %x] ]
        [ user-def %x [m %x] ]
        [ dst-mac %x:%x:%x:%x:%x:%x [m %x:%x:%x:%x:%x:%x] ]
        [ action %d ]
        [ loc %d]]
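As a concrete instance of the flow-spec grammar above, a rule steering the test port 8701 (from the workload commands at the top) to rx queue 1; queue 1 is an arbitrary choice for illustration:

```shell
# Steer TCP/IPv4 traffic for dst-port 8701 to rx queue 1.
ethtool -N p1p1 flow-type tcp4 dst-port 8701 action 1
# List the installed classification rules.
ethtool -n p1p1
```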