@kentonv
Last active October 17, 2024 19:11
SCM_RIGHTS API quirks

As tested on Linux:

  • An SCM_RIGHTS ancillary message is "attached" to the range of data bytes sent in the same sendmsg() call.
  • However, as always, recvmsg() calls on the receiving end don't necessarily map 1:1 to sendmsg() calls. Messages can be coalesced or split.
  • The recvmsg() call that receives the first byte of the ancillary message's byte range also receives the ancillary message itself.
  • To prevent multiple ancillary messages being delivered at once, the recvmsg() call that receives the ancillary data will be artificially limited to read no further than the last byte in the range, even if more data is available in the buffer after that byte, and even if that later data is not actually associated with any ancillary message.
  • However, if the recvmsg() that received the first byte does not provide enough buffer space to read the whole message, the next recvmsg() will be allowed to read past the end of the message range and even into a new ancillary message's range, returning the ancillary data for the later message.
  • Regular read()s will show the same pattern of potentially ending early even though they cannot receive ancillary messages at all. This can mess things up when using edge-triggered I/O if you assumed that a short read() indicates no more data is available.
  • A single SCM_RIGHTS message may contain up to SCM_MAX_FD (253) file descriptors.
  • If the recvmsg() does not provide enough ancillary buffer space to fit the whole descriptor array, it will be truncated to fit, with the remaining descriptors being discarded and closed. You cannot split the list over multiple calls.
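
The boundary behavior in the list above can be explored with a small sketch. It uses Python's `socket` module as a stand-in for the C calls; the helper names `send_with_fd` and `drain` are hypothetical, and the buffer sizes are arbitrary. The point it illustrates is the one about edge-triggered I/O: keep calling recvmsg() until it fails with EAGAIN rather than treating a short read as "buffer empty".

```python
import array
import os
import socket

INT_SIZE = array.array("i").itemsize  # sizeof(int)

def send_with_fd(sock, data, fd):
    """Send `data` with one descriptor attached as an SCM_RIGHTS message."""
    fds = array.array("i", [fd])
    return sock.sendmsg([data], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])

def drain(sock, bufsize=4096, max_fds=8):
    """Call recvmsg() until EAGAIN or EOF; return (chunks, received_fds)."""
    chunks, fds = [], []
    while True:
        try:
            data, ancdata, _flags, _addr = sock.recvmsg(
                bufsize, socket.CMSG_LEN(max_fds * INT_SIZE))
        except BlockingIOError:
            break  # EAGAIN: only now is the receive buffer known to be empty
        for level, ctype, cdata in ancdata:
            if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
                usable = len(cdata) - (len(cdata) % INT_SIZE)
                received = array.array("i")
                received.frombytes(cdata[:usable])
                fds.extend(received)
        if not data:
            break  # EOF: the peer closed its end
        chunks.append(data)
    return chunks, fds

left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
right.setblocking(False)
pipe_r, pipe_w = os.pipe()

send_with_fd(left, b"hello", pipe_r)  # these five bytes carry the descriptor
left.sendall(b"world")                # plain data queued right behind them

chunks, fds = drain(right)
print(chunks, fds)
```

If the list above holds on the kernel you run this on, the first recvmsg() stops after `hello` even though the buffer is far larger, so `chunks` shows where the kernel split the stream. A reader that stopped after the first short read would leave `world` sitting in the socket buffer.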
@kentonv
Author

kentonv commented May 16, 2022

@vinipsmaker Good question, but sorry, I don't have such a list. I suspect you could come up with a different list of concerns for every type of FD.

I would argue the SIGPIPE thing specifically is not unique to FD passing. That issue comes up for plain old network connections formed using connect() without any FD passing. So hopefully apps are prepared for that.

What's scarier is that if you received one end of a network connection from another process that you don't trust, then that other process could mess with the settings on that FD at any time, for example turning off O_NONBLOCK so that your process unexpectedly locks up on a read(). You can maybe avoid this by always using recv() with MSG_DONTWAIT? I bet there are a lot of other issues like this, though.
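
A minimal sketch of the MSG_DONTWAIT idea, written in Python for brevity. The helper name `recv_nowait` is hypothetical. The socket is deliberately left in blocking mode to stand in for a peer that has cleared O_NONBLOCK on a shared descriptor.

```python
import socket

def recv_nowait(sock, bufsize=4096):
    """Receive without blocking, whatever the O_NONBLOCK flag on the fd says.

    Returns None when no data is available and b"" at end of stream.
    """
    try:
        return sock.recv(bufsize, socket.MSG_DONTWAIT)
    except BlockingIOError:
        return None

a, b = socket.socketpair()
# `b` stays in blocking mode on purpose: a plain b.recv() here would hang.
empty = recv_nowait(b)
a.sendall(b"ping")
got = recv_nowait(b)
print(empty, got)
```

The per-call flag only covers recv()-family calls on sockets, so this is a partial mitigation at best.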

@vinipsmaker

Perhaps I should write something of my own out of my findings, but I'm far from there.

> You can maybe avoid this by always using recv() with MSG_DONTWAIT?

Yes.

It's hard to really mitigate this type of DoS in a reactor-style API. I'd just suggest migrating to proactor APIs (e.g. io_uring) if possible. File IO, for one, will never play nice with reactors (file IO is always reported as ready by reactor-style APIs, but will block the thread nevertheless).

Right now I'm curious about SCM_CREDENTIALS. What happens if you send a message to a socket running inside a Linux user namespace? I'll have to test for that.

SCM_RIGHTS can be used to build a capabilities-based IPC while Linux namespaces can provide the sandboxing, but then one starts to wonder: what am I leaking by the time I send a new fd to the guest? The Linux port of the capsicum project made changes to *at() functions so the “guest” wouldn't be able to use a fd to inspect the fd's directory: https://github.com/google/capsicum-linux/blob/e85a99a937ee0eb0b4b9fe19f4055ffc5857eb91/README.md#topic-branches. However I'm not seeing many other changes on this code, so maybe there's not really a lot to worry about and you can in fact exhaust the list.
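
As a starting point for the SCM_CREDENTIALS experiment mentioned above, here is a hedged sketch of receiving credentials on Linux within a single process. It does not cross a user namespace boundary, so it says nothing about how IDs are mapped. The helper name `recv_with_creds` is hypothetical, and the `struct ucred` layout (pid, uid, gid as 32-bit values) is assumed from unix(7).

```python
import os
import socket
import struct

# Assumed Linux layout of struct ucred: pid_t pid; uid_t uid; gid_t gid.
UCRED = struct.Struct("iII")

def recv_with_creds(sock, bufsize=4096):
    """Return (data, creds), where creds is (pid, uid, gid) or None."""
    data, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_SPACE(UCRED.size))
    creds = None
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            creds = UCRED.unpack(cdata[:UCRED.size])
    return data, creds

sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
# Credentials are only delivered for messages sent after SO_PASSCRED is set.
receiver.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
sender.sendall(b"hi")
data, creds = recv_with_creds(receiver)
print(data, creds)
```

To test the namespace question, the sender and receiver would have to live in different user namespaces, which this sketch does not set up.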

@kentonv
Author

kentonv commented May 17, 2022

I'd expect SCM_CREDENTIALS correctly maps identifiers when crossing namespaces, otherwise that would be a serious security flaw.

I'm not sure to what extent io_uring actually defends you against hanging files. Aren't the calls handled by kernel threads? So you still get a hanging kernel thread. Not sure if that's much better than a hanging userspace thread? Can the operation actually be canceled?

@vinipsmaker

vinipsmaker commented May 18, 2022

> I'd expect SCM_CREDENTIALS correctly maps identifiers when crossing namespaces, otherwise that would be a serious security flaw.

Agreed. But I still want to test it. After all, bugs happen. For instance:

https://lwn.net/Articles/641275/
> Then there is an interesting little problem in the intersection of capabilities and user namespaces. If a process connects to D-Bus, then moves into its own user namespace, it will appear to have all available capabilities.

I really really doubt we have this bug anyway.

> Aren't the calls handled by kernel threads? So you still get a hanging kernel thread.

Yes.

> Not sure if that's much better than a hanging userspace thread?

It's not really a kernel-managed thread. It's more like a kernel-managed thread-pool.

I do believe it's better to hang a kernel thread than to hang a userspace thread. It means the user program can be made single-threaded. How many threads do we need? If you're using threads purely to exploit IO concurrency, your application knows no better than the kernel how many threads it should spawn.

Also we had very specific cases where kernel AIO would work for certain combinations of kernel drivers and filesystems. A state machine would be another valid approach to implement the IO operation within the kernel. The fact it currently uses threads is just an implementation detail.

A thread blocking on a single IO operation isn't the same as a system under full load that can't accept new IO requests. The correct error condition should be propagated, and io_uring has that (submission queue full).

@vinipsmaker

> What's scarier is that if you received one end of a network connection from another process that you don't trust, then that other process could mess with the settings on that FD at any time, for example turning off O_NONBLOCK so that your process unexpectedly locks up on a read().

Check this out: https://www.mankier.com/2/fcntl#Description-Mandatory_locking

It may be true that we can avoid a blocking operation with MSG_DONTWAIT, but the untrusted process holding a dup() of our fd could set O_NONBLOCK off and then set a traditional (process-associated) file lock to DoS our process. I think io_uring would dodge this attack as well. And mandatory locks are not easy to enable anyway (and there are plans to remove them entirely).

@ClosetGeek-Git

The more I read about SCM_RIGHTS the more confused I become. It leaves me feeling like we're missing something. These aren't trivial bugs, and they almost make the whole concept unusable unless there's something undocumented that is somehow being missed.

@ClosetGeek-Git

There seems to be no way to be sure of its boundaries, and no way to verify after the fact. How is this ever ok?

@egmontkob

@ClosetMonkey The more I read about SCM_RIGHTS the more confused I become.

Me too.

@kentonv you MUST check if you received an SCM_RIGHTS message and, if so, close the file descriptors [...] You MUST check whether you received two and close the second one

Thanks for creating this gist! What I'm puzzled about: How do I know how many file descriptors I received?

The formula would essentially be the inverse of CMSG_LEN(), divided by sizeof(int). But I can't find a macro for this.

One possibility is to open up CMSG_LEN()'s definition, e.g. with the glibc header files the inverse would be len - CMSG_ALIGN (sizeof (struct cmsghdr)), whoops, there goes portability I'm afraid.

Another possibility is to guess the value in a loop (or binary search), passing the guesses to CMSG_LEN() and comparing them to the actual received ancillary data length.

Am I missing something?
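
For the "check what you received and close the extras" rule quoted above, a hedged Python sketch follows. The helper `recv_fds_checked` and its `expected`/`room` parameters are hypothetical names. The ancillary buffer is sized with CMSG_LEN() rather than CMSG_SPACE() so that alignment padding does not leave room for an extra descriptor. The truncation case relies on the Linux behavior described in the gist.

```python
import array
import os
import socket

INT_SIZE = array.array("i").itemsize  # sizeof(int)

def recv_fds_checked(sock, bufsize, expected, room):
    """Receive data and fds; keep at most `expected` fds, close any extras.

    `room` is how many descriptors the ancillary buffer can hold.
    Returns (data, kept_fds, truncated).
    """
    data, ancdata, flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_LEN(room * INT_SIZE))
    fds = array.array("i")
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            usable = len(cdata) - (len(cdata) % INT_SIZE)
            fds.frombytes(cdata[:usable])
    kept, extras = list(fds[:expected]), list(fds[expected:])
    for fd in extras:
        os.close(fd)  # never leak descriptors the protocol did not ask for
    truncated = bool(flags & socket.MSG_CTRUNC)
    return data, kept, truncated

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
r1, w1 = os.pipe()
r2, w2 = os.pipe()
three = array.array("i", [r1, w1, r2])

# Case 1: the buffer has room for all three, but the protocol expects one.
a.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, three)])
data1, kept1, truncated1 = recv_fds_checked(b, 16, expected=1, room=3)

# Case 2: the buffer only has room for one descriptor.
a.sendmsg([b"y"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, three)])
data2, kept2, truncated2 = recv_fds_checked(b, 16, expected=1, room=1)
print(kept1, truncated1, kept2, truncated2)
```

A caller would typically treat `truncated` as a protocol violation, since the sender attached more descriptors than the receiver was prepared to accept.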

@kentonv
Author

kentonv commented Sep 9, 2022

@egmontkob Unfortunately, you do indeed need to divide the number of bytes that the kernel indicates you actually received, by the size of an int. Here's my code.

https://github.com/capnproto/capnproto/blob/7c8802fb9bec8818f289a44b0ec22419a845b249/c++/src/kj/async-io-unix.c++#L665-L680

@ClosetMonkey This is a very old interface and clearly it wouldn't pass muster by modern standards, but a lot of things are built on it now so it can't really go away. Instead we say, this is an old weird API and if you want to use it you'd better be careful.

I am a bit surprised that macOS gets away with not closing the file descriptors for you. That feels like it might be a vulnerability of some sort, though I guess on single-user desktops, resource exhaustion vulnerabilities aren't considered to be a huge deal.

@egmontkob

@kentonv Thanks for your response!

It's not the division I was worried about, but rather having to subtract CMSG_ALIGN (sizeof (struct cmsghdr)), which looked potentially non-portable to me. I mean, is it guaranteed that that's what CMSG_LEN() adds to the payload length?

Your approach of writing CMSG_LEN(0) instead is definitely nicer, the nicest so far. You still rely on the (pretty reasonable) assumption that the function is linear with a slope of sizeof(int), but don't make an assumption on the overhead's size. Nice!

I really think there should be a macro doing this. I'm not sure if glibc / freebsd / etc. developers would be open to this idea; or if it should be brought up with POSIX / Austin Group or who else exactly.
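
The CMSG_LEN(0) arithmetic discussed in this exchange can be sketched as follows. This is an illustration of the formula only, not a copy of the linked code, and the function name `fd_count_from_cmsg_len` is hypothetical. Python's own recvmsg() already strips the header, so the calculation matters mainly when working with a raw `cmsghdr` in C.

```python
import array
import socket

INT_SIZE = array.array("i").itemsize  # sizeof(int) on this platform

def fd_count_from_cmsg_len(cmsg_len):
    """Number of descriptors in an SCM_RIGHTS cmsg whose header says cmsg_len."""
    # CMSG_LEN(0) is the header overhead, whatever its size on this platform.
    payload = cmsg_len - socket.CMSG_LEN(0)
    if payload < 0:
        raise ValueError("cmsg_len is smaller than a bare cmsghdr")
    return payload // INT_SIZE

# Round trip: build the length for n descriptors, then recover n from it.
counts = [fd_count_from_cmsg_len(socket.CMSG_LEN(n * INT_SIZE))
          for n in (0, 1, 3, 253)]
print(counts)
```

This relies on the same assumption noted above: CMSG_LEN() adds a fixed overhead to the payload length, so the overhead never needs to be spelled out.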

@o11c

o11c commented Sep 11, 2023

recvmmsg (note the extra m) should ease the issue regarding recvmsg stopping at the end of the byte range. It's usually needed for performance anyway.

@tgmatos

tgmatos commented Oct 17, 2024

@vinipsmaker told me to share this tweet by Andreas Kling (Ladybird browser lead developer):

> We are hitting this kernel bug with file descriptor passing on macOS/XNU and it's pretty terrible and it's known since 2011!?
> The obvious workaround is to use mach_msg and port rights, which is what WebKit does, and what we will do as well. But.. uhh..
> https://openradar.me/9477351

https://x.com/awesomekling/status/1846424613951099317
https://xcancel.com/awesomekling/status/1846424613951099317
