| Metric | Pipeline | Pipeline io_uring | Non-pipelined | Non-pipelined io_uring |
| --- | --- | --- | --- | --- |
| CPU (%) | 99 | 50 (-50%) | 97 | 48 (-50%) |
| RPS | 2,592,670 | 2,878,222 (+11%) | 497,429 | 631,976 (+26%) |
| Working set (MB) | 79 | 81 | 79 | 81 |
| Latency, mean (ms) | 1.28 | 0.98 | 1.07 | 1.47 |
| Latency, 99th (ms) | n/a | 7.57 | 14.8 | 14.67 |
Thank you so much, @sebastienros, for taking the time to run those benchmarks!
I had a look at your sebros/kernel branch, and it looks good to me.
I'm glad you were able to confirm my claims regarding throughput in the non-pipelined case. It is not a big surprise that the relative gain is smaller in the pipelined case (as predicted by @tmds here). My setup is apparently unable to give an accurate representation of the latency. I'll look into options to improve that.
The numbers do raise some questions, however:
- The decrease in CPU usage is significant. May I ask what the CPU usage of the load generators looked like during that run? I'm curious whether there was simply headroom for more connections/load on the server, or whether something is holding the server back from using its full potential. Remember that we've previously had a "usage error" leading to suboptimal CPU utilization.
- The increase in average latency in the non-pipelined case is also significant. The transport is currently optimized to maximize the number of I/O requests per syscall (`io_uring_enter`), which can come at the cost of latency. It could be that submitting to the kernel more often (i.e., earlier) would improve latency at the cost of throughput; see the sketch below this list. Let me know if you or someone else with access to Citrine is interested in exploring this.
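
To make the trade-off concrete, here is a minimal sketch of such a batching event loop. All names (`EventLoopSketch`, `RingSketch`, `SubmitAndWait`) are hypothetical stand-ins, not the actual IoUring.Transport internals:

```csharp
using System;
using System.Collections.Concurrent;

// Sketch of an event loop that maximizes I/O requests per io_uring_enter
// syscall. Names are illustrative; the real transport builds on tkp1n's
// IoUring library.
class EventLoopSketch
{
    private readonly ConcurrentQueue<Action<RingSketch>> _work = new ConcurrentQueue<Action<RingSketch>>();
    private readonly RingSketch _ring = new RingSketch();

    public void Run()
    {
        while (true)
        {
            // Phase 1: drain all pending work into submission queue entries
            // (SQEs) without touching the kernel.
            while (_work.TryDequeue(out var prepare))
                prepare(_ring);

            // Phase 2: a single io_uring_enter submits the whole batch and
            // waits for at least one completion. Submitting earlier (smaller
            // batches) would reduce latency at the cost of more syscalls.
            _ring.SubmitAndWait(minComplete: 1);

            // Phase 3: reap and dispatch completions (omitted here).
        }
    }
}

// Stand-in for a ring wrapper around io_uring_enter(2).
class RingSketch
{
    public void SubmitAndWait(int minComplete) { /* syscall happens here */ }
}
```

Submitting inside the drain loop (after each prepared entry) rather than once afterwards would be the "earlier submission" variant: fewer operations per `io_uring_enter`, but requests reach the kernel sooner.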
Would you be able to re-run the benchmarks with traces enabled (and an eye on htop on the load generators) to maybe answer some of those questions and to spot potential low-hanging fruit regarding perf improvements?
Client scenarios:
IoUring.Transport also supports handling client (outbound) connections (via `IConnectionFactory`). This would allow microservice scenarios or reverse-proxies (such as YARP) to increase the number of I/O requests per syscall even further. For reverse-proxies, kernel v5.7 even added support for `splice(2)` via io_uring (although I'm not sure how we would best expose that via the existing connection abstractions). Are there any workloads in your set of benchmarks that include client connections via `IConnectionFactory`? It would be interesting to see how IoUring.Transport does for this kind of workload.
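
To sketch what that looks like from the application side: `IConnectionFactory.ConnectAsync` returning a `ConnectionContext` is the real abstraction, while the endpoint and request below are made up for illustration:

```csharp
using System.Net;
using System.Text;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Connections;

static class ClientConnectionSketch
{
    // With IoUring.Transport registered as the IConnectionFactory, the
    // connect and all subsequent pipe reads/writes go through the ring,
    // sharing its batching of I/O requests per syscall.
    public static async Task TalkToBackendAsync(IConnectionFactory factory)
    {
        ConnectionContext upstream = await factory.ConnectAsync(
            new IPEndPoint(IPAddress.Loopback, 8080)); // hypothetical backend

        // ConnectionContext exposes a duplex pipe; a reverse proxy like YARP
        // would copy bytes between the inbound and outbound pipes here.
        await upstream.Transport.Output.WriteAsync(
            Encoding.ASCII.GetBytes("GET / HTTP/1.1\r\nHost: localhost\r\n\r\n"));

        var result = await upstream.Transport.Input.ReadAsync();
        upstream.Transport.Input.AdvanceTo(result.Buffer.End);

        await upstream.DisposeAsync();
    }
}
```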
These benchmarks were run on a 12-core machine, because the Citrine setup doesn't have the required kernel (yet).
> The decrease in CPU usage is significant.
The transport defaults to half the number of processors: https://github.com/tkp1n/IoUring.Transport/blob/bed647373487aac25a58de34598e2bc9251c903b/src/IoUring.Transport/IoUringOptions.cs#L9.
And `ApplicationSchedulingMode` is set to `Inline`: https://github.com/tkp1n/IoUring.Transport/blob/13e571a5d6d0e63937da2e8a0e18a9a589648bb8/tests/PlatformBenchmarks/BenchmarkConfigurationHelpers.cs#L59.
This means the code runs on half of the processors, so ~50% is the expected CPU load.
If you increase the `ThreadCount` option, CPU usage will go up, but RPS will probably go down (cf. the benchmarks run in tmds/Tmds.LinuxAsync#39 (comment)).
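
For anyone who wants to reproduce the ~100% CPU variant, here is a sketch of overriding those defaults via the standard options pattern. `IoUringOptions`, `ThreadCount`, and `ApplicationSchedulingMode` come from the files linked above; the namespace and the registration path are assumptions on my part:

```csharp
using System;
using System.IO.Pipelines;
using IoUring.Transport; // assumed namespace for IoUringOptions
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.DependencyInjection;

static class IoUringOptionsSketch
{
    public static IWebHostBuilder UseAllCores(this IWebHostBuilder builder) =>
        builder.ConfigureServices(services =>
            services.Configure<IoUringOptions>(options =>
            {
                // Default is ProcessorCount / 2 (~ one transport thread per
                // physical core); raising it pushes CPU toward 100% but, per
                // the linked benchmarks, tends to lower RPS.
                options.ThreadCount = Environment.ProcessorCount;

                // Inline runs application code on the transport threads
                // themselves, which is why CPU load tracks ThreadCount.
                options.ApplicationSchedulingMode = PipeScheduler.Inline;
            }));
}
```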
I assumed that the `ThreadCount` would be controlled by this config in the Benchmarks repo. The results make more sense now, thanks!
In fact, I set the default `ThreadCount` to half the logical threads (~ the number of physical cores) based on the findings in the comment you've linked.
When comparing the results from tmds/Tmds.LinuxAsync#39 (comment) with the results above, we notice an increase in RPS from 518,186 to 631,976 with the update to kernel v5.7 and the code changes necessary to leverage `IORING_FEAT_FAST_POLL`. Assuming, of course, the infrastructure hasn't changed since then, that would be as close to a "free lunch" as it gets!
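
As a side note on what `IORING_FEAT_FAST_POLL` actually buys: the kernel advertises the feature in `io_uring_params.features` at ring-setup time. A rough sketch of the probe (the flag value matches `linux/io_uring.h` as of v5.7; the surrounding C# is purely illustrative):

```csharp
static class FastPollSketch
{
    // IORING_FEAT_FAST_POLL, per linux/io_uring.h as of kernel v5.7.
    private const uint IORING_FEAT_FAST_POLL = 1u << 5;

    // 'features' is what io_uring_setup(2) reported back in
    // io_uring_params.features for this ring.
    public static bool SupportsFastPoll(uint features)
    {
        // With fast poll, a recv on a socket with no data ready is parked
        // and retried by the kernel internally, so the transport can drop
        // its explicit POLL_ADD-before-read round trip per operation.
        return (features & IORING_FEAT_FAST_POLL) != 0;
    }
}
```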
The non-pipelined io_uring result looks unusual: average latency rose, yet the worst-case (99th percentile) latency fell compared to non-io_uring? And throughput increased even though average latency was up?