Blog 2023/8/13
To learn more about implementing high-performance HTTP APIs, I wrote a few trivial 'Hello, world!' servers and ran some basic benchmarks.
These servers simply respond to any HTTP request with a 200 OK text/plain response of "Hello World!".
In this post I'll cover three server implementations:
- Single-threaded
- Thread-per-connection
- Fixed-size pool of threads
Single-threaded: we start with a bare-bones HTTP socket server in which only one connection is handled at a time.
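As a rough sketch (illustrative only, not the actual code benchmarked in this post; error handling is omitted and names like RESPONSE are my own), the core of such a single-threaded server looks something like this:

```c
/* Minimal single-threaded sketch: accept one connection at a time,
   send a canned response, close, repeat. */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static const char RESPONSE[] =
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/plain\r\n"
    "Content-Length: 12\r\n"
    "Connection: close\r\n"
    "\r\n"
    "Hello World!";

int main(void) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    int opt = 1;
    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 128);

    for (;;) {
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0) continue;
        char buf[4096];
        read(conn_fd, buf, sizeof(buf));              /* drain (part of) the request */
        write(conn_fd, RESPONSE, sizeof(RESPONSE) - 1);
        close(conn_fd);                               /* only now can we accept the next one */
    }
}
```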
Thread-per-connection: this implementation spawns a new thread for each connection, and each thread is discarded after handling its connection.
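The accept loop for that version might look roughly like this (again a sketch, not the benchmarked code; handle_connection and accept_loop are hypothetical names, and the listening socket is assumed to be set up as in the previous sketch):

```c
/* Thread-per-connection sketch: each accepted socket is handed to a new
   detached pthread, which replies and then exits. */
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

static const char RESPONSE[] =
    "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n"
    "Content-Length: 12\r\nConnection: close\r\n\r\nHello World!";

static void *handle_connection(void *arg) {
    int conn_fd = (int)(intptr_t)arg;          /* fd smuggled through the void* */
    char buf[4096];
    read(conn_fd, buf, sizeof(buf));           /* drain (part of) the request */
    write(conn_fd, RESPONSE, sizeof(RESPONSE) - 1);
    close(conn_fd);
    return NULL;                               /* detached thread: no join needed */
}

void accept_loop(int listen_fd) {
    for (;;) {
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0) continue;
        pthread_t tid;
        pthread_create(&tid, NULL, handle_connection, (void *)(intptr_t)conn_fd);
        pthread_detach(tid);                   /* thread is discarded when it finishes */
    }
}
```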
Thread pool: this implementation spawns a fixed number of threads at startup and uses them to handle all connections.
A stack is used to hand off the connection sockets from the main thread to the worker threads, protected by a pthread condition variable and mutex.
The main thread accepts connections and pushes the resulting file descriptors onto the stack, then signals the worker threads, which wake up, pop connections off the stack, and reply to the incoming HTTP requests.
(Note: in retrospect, a stack is perhaps not a good choice, as the connections towards the bottom of the stack could suffer from starvation / excess latency).
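Here's a rough sketch of that mutex/condition-variable hand-off (illustrative only; NUM_WORKERS, STACK_CAP, and the function names are my own, and the real implementation may differ):

```c
/* Thread-pool sketch: a fixed set of workers pop connection fds off a shared
   stack, guarded by a pthread mutex + condition variable. */
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define NUM_WORKERS 8
#define STACK_CAP   1024

static int fd_stack[STACK_CAP];
static int fd_top = 0;                         /* number of fds currently on the stack */
static pthread_mutex_t stack_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  stack_nonempty = PTHREAD_COND_INITIALIZER;

static const char RESPONSE[] =
    "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n"
    "Content-Length: 12\r\nConnection: close\r\n\r\nHello World!";

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&stack_lock);
        while (fd_top == 0)                    /* sleep until a connection arrives */
            pthread_cond_wait(&stack_nonempty, &stack_lock);
        int conn_fd = fd_stack[--fd_top];      /* pop the most recent connection */
        pthread_mutex_unlock(&stack_lock);

        char buf[4096];
        read(conn_fd, buf, sizeof(buf));
        write(conn_fd, RESPONSE, sizeof(RESPONSE) - 1);
        close(conn_fd);
    }
    return NULL;
}

void serve(int listen_fd) {
    pthread_t tids[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)      /* fixed-size pool, created up front */
        pthread_create(&tids[i], NULL, worker, NULL);

    for (;;) {
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0) continue;
        pthread_mutex_lock(&stack_lock);
        if (fd_top < STACK_CAP)
            fd_stack[fd_top++] = conn_fd;      /* hand the fd to the pool */
        else
            close(conn_fd);                    /* stack full: drop the connection */
        pthread_mutex_unlock(&stack_lock);
        pthread_cond_signal(&stack_nonempty);  /* wake one waiting worker */
    }
}
```

Note that the most recently accepted connection is served first, which is what makes the starvation concern above possible.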
The players:
- 127.0.0.1 (indium): Mac Mini (M1, gigabit Ethernet), macOS Monterey
- plasma: MacBook Pro (M1, gigabit Ethernet), macOS Ventura
- flouride: 2012 MacBook Pro (i7-3820QM @2.7GHz, gigabit Ethernet), macOS Ventura (thanks to OpenCore Legacy Patcher)
- opti7050: Dell Optiplex 7050 (i5-7500 @3.8GHz, gigabit Ethernet), Ubuntu 22.10
- thinkpad: Thinkpad T500 (Core 2 Duo P8600 @2.4GHz, gigabit Ethernet), Ubuntu 23.04
- pi2b-1: Raspberry Pi 2 Model B (ARMv7 @900MHz, 100Mbit Ethernet), Raspbian Bullseye
- nslu2: Linksys NSLU2 (ARMv5 @266MHz, 100Mbit Ethernet), Debian Jessie
- pmacg5: PowerMac G5 (PowerPC G5 @2.3GHz x2, gigabit Ethernet), OS X Leopard
- emac3: eMac (PowerPC G4 @1.25GHz, 100Mbit Ethernet), OS X Leopard
- graphite: PowerMac G4 (PowerPC G4 @500MHz x2, gigabit Ethernet), OS X Tiger
- pmacg3: PowerMac G3 (PowerPC G3 @400MHz, 100Mbit Ethernet), OS X Tiger
Before we test the HTTP server implementations, let's first gauge the network bandwidth capabilities of these hosts using iperf.
Typical output from iperf:
cell@indium(master)$ iperf -c plasma
------------------------------------------------------------
Client connecting to plasma, TCP port 5001
TCP window size: 129 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.1.98 port 63271 connected with 192.168.1.176 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.00-10.04 sec 1.09 GBytes 935 Mbits/sec
See the full output: iperf.txt
Summary of results for all hosts:
- 127.0.0.1: 68.6 Gbits/sec 🤯
- plasma: 935 Mbits/sec
- flouride: 937 Mbits/sec
- opti7050: 939 Mbits/sec
- thinkpad: 939 Mbits/sec
- pi2b-1: 94.0 Mbits/sec
- nslu2: 72.9 Mbits/sec ⚠️
- pmacg5: 939 Mbits/sec
- emac3: 94.0 Mbits/sec
- graphite: 496 Mbits/sec ⚠️
- pmacg3: 93.7 Mbits/sec
Of note:
- the crazy-high loopback bandwidth to localhost!
- despite having gigabit Ethernet, the dual 500MHz G4 (graphite) can't quite saturate it
- the NSLU2 can't quite saturate 100Mbit with its 266MHz ARM processor
I used wrk to load-test these HTTP servers.
Typical wrk output:
cell@indium(master)$ wrk -t8 -c32 -d5s http://flouride:8080
Running 5s test @ http://flouride:8080
8 threads and 32 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 542.74us 18.81us 1.41ms 89.78%
Req/Sec 1.84k 27.67 1.99k 90.20%
9316 requests in 5.10s, 700.52KB read
Requests/sec: 1826.70
Transfer/sec: 137.36KB
I tested each host using -t8 -c32 (8 threads pumping 32 connections) as well as -t8 -c64 (8 threads pumping 64 connections).
First, let's look at the "fast" machines:
Across the board, we see that the threaded implementations far surpass the performance of the single-threaded implementation. No surprise there.
We also see that testing over loopback to the same host (127.0.0.1) is by far the winner. No surprise there either.
What was a bit surprising is that we see virtually no performance difference between the thread-per-connection and thread-pool implementations. I would guess this is due to my naive use of pthread condition variables / mutexes.
The results for plasma (my work laptop) appear to be anomalous: iperf proved it could saturate gigabit Ethernet, yet its HTTP performance is surprisingly poor, and I'm not sure why.
For the three x86_64 machines (flouride, opti7050, thinkpad), we see that bumping wrk from 32 to 64 connections increases performance by anywhere from 28% to 82%.
Now let's take a look at the "slower" machines:
Interestingly, here we see a slight to significant decrease in performance when jumping from 32 to 64 connections, especially with the G5 (pmacg5).
Coincidentally, the performance of an (older) Raspberry Pi (pi2b-1) very closely matches that of the two G4 machines (emac3, graphite).
It's also interesting that the dual-processor 500MHz G4 (graphite) matches the performance of the single-processor 1.25GHz G4 (emac3), and that it is more than twice as fast as the 400MHz G3 (pmacg3).
Surprisingly, threading does not seem to help at all on the 266MHz NSLU2.


