`DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS=1`
noticeably improves simple TE benchmarks, such as the ones below, on all Unix architectures. As I understand it, it avoids dispatching from the event thread to the thread pool and instead does the work on the same thread that received the request.
TE Benchmark | Baseline, RPS | MyTest, RPS | diff, % |
---|---|---|---|
ARM64 Platform-JSON PGO | 661,663 | 778,925 | +17.72% |
ARM64 Platform-Caching PGO | 186,188 | 218,004 | +17.09% |
ARM64 Platform-Plaintext PGO | 6,933,964 | 7,563,428 | +9.08% |
x64 Platform-JSON PGO | 1,299,388 | 1,432,200 | +10.22% |
x64 Platform-Caching PGO | 413,123 | 445,144 | +7.75% |
x64 Platform-Plaintext PGO | 12,529,587 | 13,137,836 | +4.85% |
(The +17% on ARM64 suggests that something could be tuned there, e.g. the threads-per-engine heuristic or the SpinWait parameters?)
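To make the inline-versus-dispatch distinction concrete, here is a minimal Python sketch of the idea as I understand it. It is a language-neutral model, not the .NET socket engine: `handle_request`, `event_loop`, and `run` are hypothetical names, and the two-worker pool is an arbitrary choice for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle_request(payload):
    # Stand-in for request processing; also reports which thread ran it.
    return threading.current_thread().name, payload.upper()


def event_loop(requests, inline, pool):
    """Runs on the event thread; returns (thread_name, response) per request."""
    results = []
    for payload in requests:
        if inline:
            # Inline completion: the event thread does the work itself.
            results.append(handle_request(payload))
        else:
            # Default path: hand the work to a pool thread and wait for it.
            results.append(pool.submit(handle_request, payload).result())
    return results


def run(inline):
    out = []
    with ThreadPoolExecutor(max_workers=2, thread_name_prefix="pool") as pool:
        event_thread = threading.Thread(
            target=lambda: out.extend(event_loop(["a", "b"], inline, pool)),
            name="event-thread",
        )
        event_thread.start()
        event_thread.join()
    return out
```

With `run(True)` every handler reports `event-thread`; with `run(False)` every handler reports a `pool_*` thread, which is the hop that inline completions skip.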
However, it most likely regresses pretty much anything more complicated than "receive a tiny request and immediately send something back":
TE Benchmark | Baseline, RPS | MyTest, RPS | diff, % |
---|---|---|---|
ARM64 Platform-Fortunes PGO | 88,765 | 51,648 | -41.81% |
x64 Platform-Fortunes PGO | 494,777 | 410,766 | -16.98% |
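One plausible explanation for the two result sets (my assumption, not something measured here) is that an inline handler occupies the event thread for its whole duration, so longer handlers serialize behind each other. The toy model below uses arbitrary time units, a single event thread, and a hypothetical fixed dispatch cost; it is a sketch of that reasoning, not a model of the real runtime.

```python
import heapq


def inline_makespan(n, read_cost, handler_cost):
    # The event thread reads and handles each request back to back.
    return n * (read_cost + handler_cost)


def dispatched_makespan(n, read_cost, dispatch_cost, handler_cost, workers):
    # Min-heap of the times at which each pool worker becomes free.
    free_at = [0] * workers
    heapq.heapify(free_at)
    now = 0      # clock of the event thread
    finish = 0   # completion time of the last handler
    for _ in range(n):
        # The event thread reads the request and pays the hand-off cost.
        now += read_cost + dispatch_cost
        # The handler starts once the request is queued and a worker is free.
        start = max(now, heapq.heappop(free_at))
        end = start + handler_cost
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# Tiny handlers: skipping the hand-off wins.
tiny_inline = inline_makespan(100, 1, 1)
tiny_dispatched = dispatched_makespan(100, 1, 1, 1, 4)

# Heavier handlers: the pool's parallelism outweighs the hand-off cost.
heavy_inline = inline_makespan(100, 1, 20)
heavy_dispatched = dispatched_makespan(100, 1, 1, 20, 4)
```

In this model inline is slightly ahead for the tiny handler and far behind for the heavy one, which matches the direction of the tables above but says nothing about the actual magnitudes.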
Can we do a sort of PGO (static or dynamic), but at the managed level, to adapt to users' workloads dynamically?
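As a rough illustration of what such an adaptive heuristic could look like, the sketch below tracks a moving average of handler duration and only chooses inline completion while handlers stay cheap. Everything here is hypothetical: the class name, the 50 µs threshold, and the smoothing factor are made up for the example and do not correspond to any existing .NET API or setting.

```python
class AdaptiveCompletionPolicy:
    """Hypothetical policy: run inline only while handlers are observed to be cheap."""

    def __init__(self, threshold_us=50.0, alpha=0.2):
        self.threshold_us = threshold_us  # assumed cut-off, not a measured value
        self.alpha = alpha                # weight given to the newest sample
        self.avg_us = None                # no observations yet

    def record(self, handler_us):
        # Exponentially weighted moving average of handler duration.
        if self.avg_us is None:
            self.avg_us = float(handler_us)
        else:
            self.avg_us = self.alpha * handler_us + (1 - self.alpha) * self.avg_us

    def run_inline(self):
        # Dispatch to the pool until there is evidence that handlers are cheap.
        return self.avg_us is not None and self.avg_us < self.threshold_us


policy = AdaptiveCompletionPolicy()
for _ in range(10):
    policy.record(5)      # JSON/Plaintext-like handlers: very short
cheap_decision = policy.run_inline()

policy.record(500)        # a Fortunes-like handler shows up
expensive_decision = policy.run_inline()
```

Starting in dispatch mode is a deliberately conservative choice: a wrong "inline" guess stalls the event thread, while a wrong "dispatch" guess only costs the hand-off.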