Comparison of the performance of FFI vs XS zeromq bindings. For FFI the ZMQ::FFI bindings are used, first with FFI::Raw on the backend and then with FFI::Platypus. For XS, ZMQ::LibZMQ3 is used.
Comparison is done using the zeromq weather station example: first by timing wuclient.pl using the various implementations, and then by profiling wuserver.pl using Devel::NYTProf. For profiling, the server is changed to simply publish 1 million messages and exit.
Weather station example code was lightly optimized (e.g. variables are not declared inside loops) and modified to be more consistent across implementations.
Additionally, a more direct benchmark comparing FFI::Platypus calls with XS xsubs is also done.
C and Python implementation results are provided as a baseline for performance.
All the code that was created or modified for these benchmarks is listed at the end (C/Python wuclient/wuserver code can be found in the zmq guide).
CPU: Intel Core Quad i7-2600K CPU @ 3.40GHz
Mem: 4GB
OS: Arch Linux
ZMQ: 4.0.5
Perl: 5.20.1
ZMQ::FFI = 0.19 (FFI::Raw backend), dev (FFI::Platypus backend)
FFI::Raw = 0.32
FFI::Platypus = 0.31
ZMQ::LibZMQ3 = 1.19
Platypus doesn't actually generate an xsub at run time for each function you attach (that would make attach() really slow!). Instead, it uses a generic C function for all of them, which looks at a function descriptor structure and handles the arguments individually, using either a big switch statement (in the main branch) or indirect function pointer calls (on my all-tests-pass branch; well, that's obviously a branch which I only commit to when, er, all tests pass). Switch statements are slow. Function pointer calls used to be really, really slow, but I think they've been downgraded to only really slow by now. The return value is then handled in another big switch statement/function pointer call.
Furthermore, libffi copies all arguments one more time, though that's probably in the cache and should be relatively fast.
So in a way, the holy grail is unachievable: anything we can do in our generic function (if you want to have a look, it's at https://github.com/pipcet/FFI-Platypus/blob/lazy/include/ffi_platypus_rtypes_call.h and https://github.com/pipcet/FFI-Platypus/blob/lazy/include/ffi_platypus_call.h), XS can do. In theory. In practice, each XSUB has to be optimized by hand for things we can do automatically. For example, assume there is a string argument to our native function. It's likely, then, that the native function will actually look at the string; we can exploit that by prefetching the string contents into the CPU cache before we even look at the other arguments. That kind of optimization is really hard to do for a hundred, or even a dozen, XSUBs, but we only have to do it once. As cache misses are a major source of latency in program execution, it's quite possible we can beat XS just by avoiding one of them.
The only problem with that example is that it's realistic: it describes actual programs, which have cache misses, rather than simple benchmark programs, which already have everything in cache and for which prefetch instructions are unnecessary overhead.
The Platypus user isn't affected by any of that. All they see is the ->attach call with a harmless 'string' in it, and they never know we're prefetching that string for them.

There is another way in which we're faster than XS, though I'm not entirely happy with it: XS and the rest of Perl use blessed references to integer scalars for pointers, while we use plain integers. That's a pointer dereference we don't have to do, but because only references can be blessed, it means you cannot free an opaque pointer in its DESTROY method.
Sorry this got a bit long. In summary, Platypus gives up some low-level tweak-the-assembler-code optimization but gains the ability to automate exotic optimization techniques. Those paltry 200 CPU cycles you might be losing today are nothing compared to what you gain in maintainability, and potentially in future performance.