As tested on Linux:
- An SCM_RIGHTS ancillary message is "attached" to the range of data bytes sent in the same sendmsg() call.
- However, as always, recvmsg() calls on the receiving end don't necessarily map 1:1 to sendmsg() calls. Messages can be coalesced or split.
- The recvmsg() call that receives the first byte of the ancillary message's byte range also receives the ancillary message itself.
- To prevent multiple ancillary messages being delivered at once, the recvmsg() call that receives the ancillary data will be artificially limited to read no further than the last byte in the range, even if more data is available in the buffer after that byte, and even if that later data is not actually associated with any ancillary message.
- However, if the recvmsg() that received the first byte does not provide enough buffer space to read the whole message, the next recvmsg() will be allowed to read past the end of the message range and even into a new ancillary message's range, returning the ancillary data for the later message.
- Regular read()s show the same pattern of potentially ending early, even though they cannot receive ancillary messages at all. This can break edge-triggered I/O if you assume that a short read() means no more data is available.
- A single SCM_RIGHTS message may contain up to SCM_MAX_FD (253) file descriptors.
- If the recvmsg() does not provide enough ancillary buffer space to fit the whole descriptor array, it will be truncated to fit, with the remaining descriptors being discarded and closed. You cannot split the list over multiple calls.
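
To make the list above concrete, here is a minimal Python sketch of how I'd receive defensively given those findings. The helper names (send_with_fds, recv_with_fds, drain, demo) are my own and purely illustrative. It assumes a non-blocking AF_UNIX stream socket, sizes the ancillary buffer for a full SCM_MAX_FD descriptors so nothing gets truncated and closed, checks MSG_CTRUNC, and keeps reading until EAGAIN instead of treating a short read as "no more data".

```python
import array
import os
import socket

SCM_MAX_FD = 253  # per-message descriptor limit on Linux, as noted above


def send_with_fds(sock, data, fds):
    """Send data with fds attached as a single SCM_RIGHTS message."""
    payload = array.array("i", fds).tobytes()
    return sock.sendmsg([data], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, payload)])


def recv_with_fds(sock, bufsize, max_fds=SCM_MAX_FD):
    """One recvmsg() call. Returns (data, fds, truncated)."""
    fds = array.array("i")
    ancbufsize = socket.CMSG_SPACE(max_fds * fds.itemsize)
    data, ancdata, flags, _addr = sock.recvmsg(bufsize, ancbufsize)
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore a trailing partial int, if any.
            usable = len(cdata) - (len(cdata) % fds.itemsize)
            fds.frombytes(cdata[:usable])
    # MSG_CTRUNC means the ancillary buffer was too small and fds were dropped.
    return data, list(fds), bool(flags & socket.MSG_CTRUNC)


def drain(sock, bufsize=4096):
    """Read from a non-blocking socket until EAGAIN or EOF.

    A short read is NOT taken as a sign that the socket is empty.
    """
    chunks, fds = [], []
    while True:
        try:
            data, new_fds, truncated = recv_with_fds(sock, bufsize)
        except BlockingIOError:
            break  # EAGAIN: nothing left for now
        if truncated:
            raise RuntimeError("ancillary data truncated; descriptors were lost")
        if not data and not new_fds:
            break  # EOF
        chunks.append(data)
        fds.extend(new_fds)
    return b"".join(chunks), fds


def demo():
    """Send one message carrying a pipe fd, then plain data, and drain both."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    r, w = os.pipe()
    try:
        send_with_fds(a, b"hello", [r])
        a.sendall(b"world")
        b.setblocking(False)
        data, fds = drain(b)
        os.write(w, b"ping")
        got = os.read(fds[0], 4)  # the received fd is a duplicate of r
        for fd in fds:
            os.close(fd)
        return data, len(fds), got
    finally:
        a.close()
        b.close()
        os.close(r)
        os.close(w)
```

Calling demo() should give back all ten data bytes and exactly one descriptor, regardless of whether the kernel stops the first recvmsg() at the end of the "hello" range. The loop in drain() is the part that matters for edge-triggered setups.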
Perhaps I should write something of my own out of my findings, but I'm far from there.
Yes.
It's hard to really mitigate this type of DoS in a reactor-style API. I'd just suggest migrating to proactor APIs (e.g. io_uring) if possible. File IO, for one, will never play nice with reactors (file IO is always reported as ready by reactor-style APIs, but will block the thread nevertheless).
Right now I'm curious about SCM_CREDENTIALS. What happens if you send a message to a socket running inside a Linux user namespace? I'll have to test for that.
SCM_RIGHTS can be used to build a capabilities-based IPC while Linux namespaces can provide the sandboxing, but then one starts to wonder: what am I leaking by the time I send a new fd to the guest? The Linux port of the capsicum project made changes to *at() functions so the “guest” wouldn't be able to use a fd to inspect the fd's directory: https://github.com/google/capsicum-linux/blob/e85a99a937ee0eb0b4b9fe19f4055ffc5857eb91/README.md#topic-branches. However I'm not seeing many other changes on this code, so maybe there's not really a lot to worry about and you can in fact exhaust the list.