Blog 2023/8/13
To learn more about implementing high-performance HTTP APIs, I've implemented a few trivial 'Hello, world!' servers and run some basic benchmarks.
These servers simply respond to any HTTP request with a 200 OK text/plain response of "Hello World!".
In this post I'll cover three server implementations:
- Single-threaded
- Thread-per-connection
- Fixed-size pool of threads
We start with a bare-bones single-threaded HTTP socket server: only one connection is handled at a time.
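For illustration, a bare-bones single-threaded accept loop might look something like the sketch below. This is not the exact code used for the benchmarks; the port number, buffer size, and lack of error handling are my own simplifying assumptions.

```c
/* Hypothetical single-threaded sketch: accept and handle one connection at a time. */
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static const char RESPONSE[] =
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/plain\r\n"
    "Content-Length: 12\r\n"
    "\r\n"
    "Hello World!";

int main(void) {
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    int opt = 1;
    setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);               /* assumed port, as used in the wrk runs */
    bind(server_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(server_fd, 128);

    for (;;) {
        int client_fd = accept(server_fd, NULL, NULL);
        if (client_fd < 0)
            continue;
        char buf[4096];
        read(client_fd, buf, sizeof(buf));      /* drain (part of) the request */
        write(client_fd, RESPONSE, sizeof(RESPONSE) - 1);
        close(client_fd);                       /* only now can the next client be served */
    }
}
```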
The thread-per-connection implementation spawns a new thread for each connection, and each thread is discarded after handling the connection.
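A rough sketch of that accept loop is shown below; again, this is an illustration rather than the actual benchmarked code, and it reuses the same hypothetical canned response as above.

```c
/* Hypothetical thread-per-connection sketch: one detached pthread per accepted socket. */
#include <netinet/in.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

static const char RESPONSE[] =
    "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n"
    "Content-Length: 12\r\n\r\nHello World!";

static void *handle(void *arg) {
    int client_fd = (int)(intptr_t)arg;
    char buf[4096];
    read(client_fd, buf, sizeof(buf));
    write(client_fd, RESPONSE, sizeof(RESPONSE) - 1);
    close(client_fd);
    return NULL;
}

int main(void) {
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(server_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(server_fd, 128);

    for (;;) {
        int client_fd = accept(server_fd, NULL, NULL);
        if (client_fd < 0)
            continue;
        pthread_t tid;
        /* detach so each short-lived thread's resources are reclaimed when it exits */
        pthread_create(&tid, NULL, handle, (void *)(intptr_t)client_fd);
        pthread_detach(tid);
    }
}
```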
The thread-pool implementation spawns a fixed quantity of threads at startup and uses them to handle all connections.
A stack is used to hand off the connection sockets from the main thread to the worker threads, protected by a pthread condition and mutex.
The main thread accepts connections and pushes the resulting file descriptors onto a stack, then signals the worker threads to wake up, consume the connections from the stack and reply to the incoming HTTP requests.
(Note: in retrospect, a stack is perhaps not a good choice, as the connections towards the bottom of the stack could suffer from starvation / excess latency).
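A rough sketch of that handoff is shown below, assuming a fixed-size array used as the stack and a single mutex/condition-variable pair. The worker count, stack capacity, and port are my assumptions for illustration; the real code may differ.

```c
/* Hypothetical thread-pool sketch: the main thread pushes accepted sockets onto a
   stack; worker threads pop them under a mutex guarded by a condition variable. */
#include <netinet/in.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define NUM_WORKERS 8
#define STACK_CAP   256

static const char RESPONSE[] =
    "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n"
    "Content-Length: 12\r\n\r\nHello World!";

static int stack[STACK_CAP];
static int stack_top = 0;                      /* number of pending sockets */
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (stack_top == 0)                 /* sleep until a socket is pushed */
            pthread_cond_wait(&ready, &lock);
        int client_fd = stack[--stack_top];
        pthread_mutex_unlock(&lock);

        char buf[4096];
        read(client_fd, buf, sizeof(buf));
        write(client_fd, RESPONSE, sizeof(RESPONSE) - 1);
        close(client_fd);
    }
    return NULL;
}

int main(void) {
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);
    }

    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(server_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(server_fd, 128);

    for (;;) {
        int client_fd = accept(server_fd, NULL, NULL);
        if (client_fd < 0)
            continue;
        pthread_mutex_lock(&lock);
        if (stack_top < STACK_CAP)
            stack[stack_top++] = client_fd;    /* hand off to a worker */
        else
            close(client_fd);                  /* stack full: drop the connection */
        pthread_cond_signal(&ready);           /* wake one sleeping worker */
        pthread_mutex_unlock(&lock);
    }
}
```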
The players:
127.0.0.1 (indium)
- Mac Mini (M1, gigabit Ethernet)
- macOS Monterey
plasma
- Macbook Pro (M1, gigabit Ethernet)
- macOS Ventura
flouride
- 2012 Macbook Pro (i7-3820QM @2.7GHz, gigabit Ethernet)
- macOS Ventura (thanks to OpenCore Legacy Patcher)
opti7050
- Dell Optiplex 7050 (i5-7500 @3.8GHz, gigabit Ethernet)
- Ubuntu 22.10
thinkpad
- Thinkpad T500 (Core 2 Duo P8600 @2.4GHz, gigabit Ethernet)
- Ubuntu 23.04
pi2b-1
- Raspberry Pi 2 Model B (ARMv7 @900MHz, 100Mbit Ethernet)
- Raspbian Bullseye
nslu2
- Linksys NSLU2 (ARMv5 @266MHz, 100Mbit Ethernet)
- Debian Jessie
pmacg5
- PowerMac G5 (PowerPC G5 @2.3GHz x2, gigabit Ethernet)
- OS X Leopard
emac3
- eMac (PowerPC G4 @1.25GHz, 100Mbit Ethernet)
- OS X Leopard
graphite
- PowerMac G4 (PowerPC G4 @500MHz x2, gigabit Ethernet)
- OS X Tiger
pmacg3
- PowerMac G3 (PowerPC G3 @400MHz, 100Mbit Ethernet)
- OS X Tiger
Before we test the HTTP server implementations, let's first gauge the network bandwidth capabilities of these hosts using iperf.
Typical output from iperf:
cell@indium(master)$ iperf -c plasma
------------------------------------------------------------
Client connecting to plasma, TCP port 5001
TCP window size: 129 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.1.98 port 63271 connected with 192.168.1.176 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.00-10.04 sec 1.09 GBytes 935 Mbits/sec
See the full output: iperf.txt
Summary of results for all hosts:
- 127.0.0.1: 68.6 Gbits/sec 🤯
- plasma: 935 Mbits/sec
- flouride: 937 Mbits/sec
- opti7050: 939 Mbits/sec
- thinkpad: 939 Mbits/sec
- pi2b-1: 94.0 Mbits/sec
- nslu2: 72.9 Mbits/sec ⚠️
- pmacg5: 939 Mbits/sec
- emac3: 94.0 Mbits/sec
- graphite: 496 Mbits/sec ⚠️
- pmacg3: 93.7 Mbits/sec
Of note are:
- the crazy-high loopback bandwidth to localhost!
- despite having gigabit, the dual G4 500MHz (graphite) can't quite saturate it
- the NSLU2 can't quite saturate 100Mbit with its 266MHz ARM processor
I used wrk to load-test these HTTP servers.
Typical wrk output:
cell@indium(master)$ wrk -t8 -c32 -d5s http://flouride:8080
Running 5s test @ http://flouride:8080
8 threads and 32 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 542.74us 18.81us 1.41ms 89.78%
Req/Sec 1.84k 27.67 1.99k 90.20%
9316 requests in 5.10s, 700.52KB read
Requests/sec: 1826.70
Transfer/sec: 137.36KB
I tested each host using -t8 -c32 (8 threads pumping 32 connections) as well as -t8 -c64 (8 threads pumping 64 connections).
First, let's look at the "fast" machines:
Across the board, we see that the threaded implementations far surpass the performance of the single-threaded implementation. No surprise there.
We also see that testing over loopback to the same host (127.0.0.1) is by far the winner. No surprise there either.
What was a bit surprising is that we see virtually no performance difference at all between the thread-per-connection and thread-pool implementations. I would guess this is due to my naivete with pthread condition variables and mutexes.
The results for plasma (my work laptop) appear to be anomalous. I have no idea why its network performance is so poor; iperf proved it could saturate gigabit Ethernet, so I'm not sure what the issue is.
For the three x86_64 machines (flouride, opti7050, thinkpad), we see that bumping wrk from 32 to 64 connections increases performance by anywhere from 28% to 82%.
Now let's take a look at the "slower" machines:
Interestingly, here we see a slight to significant decrease in performance when jumping from 32 to 64 connections, especially with the G5 (pmacg5).
Coincidentally, the performance of an (older) Raspberry Pi (pi2b-1) very closely matches that of the two G4 machines (emac3, graphite).
It's also interesting to note that a dual-processor 500MHz G4 (graphite) matches the performance of a single-processor 1.25GHz G4 (emac3), and that the dual 500MHz G4 is more than twice as fast as the 400MHz G3 (pmacg3).
Surprisingly, threading does not seem to help at all on the 266MHz NSLU2.