Onion messages rate-limiting

During the recent Oakland Dev Summit, some lightning engineers got together to discuss DoS protection for onion messages. Rusty proposed a very simple rate-limiting scheme that statistically propagates back to the correct sender, which we describe in details below.

Nodes apply per-peer rate limits on incoming onion messages that should be relayed (e.g. N/seconds with some burst tolerance). It is recommended to allow more onion messages from peers with whom you have channels, for example 10/seconds when you have a channel and 1/second when you don't.

When relaying an onion message, nodes keep track of where it came from (by using the node_id of the peer who sent that message). Nodes only need the last such node_id per outgoing connection, which ensures the memory footprint is very small. Also, this data doesn't need to be persisted.

Let's walk through an example to illustrate this mechanism:

Bob receives an onion message from Alice that should be relayed to Carol
After relaying that message, Bob stores Alice's node_id in its per-connection state with Carol
Bob receives an onion message from Eve that should be relayed to Carol
After relaying that message, Bob replaces Alice's node_id with Eve's node_id in its per-connection state with Carol
Bob receives an onion message from Alice that should be relayed to Dave
After relaying that message, Bob stores Alice's node_id in its per-connection state with Dave
...

We introduce a new message that will be sent when dropping an incoming onion message because it reached rate limits:

type: 515 (onion_message_drop)
data:
- [rate_limited:u8]
- [shared_secret_hash:32*byte]

Whenever an incoming onion message reaches the rate limit, the receiver sends onion_message_drop to the sender. The sender looks at its per-connection state to find where the message was coming from and relays onion_message_drop to the last sender, halving their rate limits with that peer.

If the sender doesn't overflow the rate limit again, the receiver should double the rate limit after 30 seconds, until it reaches the default rate limit again.

The flow will look like:

Alice                      Bob                      Carol
  |                         |                         |
  |      onion_message      |                         |
  |------------------------>|                         |
  |                         |      onion_message      |
  |                         |------------------------>|
  |                         |    onion_message_drop   |
  |                         |<------------------------|
  |    onion_message_drop   |                         |
  |<------------------------|                         |

The shared_secret_hash field contains a BIP 340 tagged hash of the Sphinx shared secret of the rate limiting peer (in the example above, Carol):

shared_secret_hash = SHA256(SHA256("onion_message_drop") || SHA256("onion_message_drop") || sphinx_shared_secret)

This value is known by the node that created the onion message: if onion_message_drop propagates all the way back to them, it lets them know which part of the route is congested, allowing them to retry through a different path.

Whenever there is some latency between nodes and many onion messages, onion_message_drop may be relayed to the incorrect incoming peer (since we only store the node_id of the last incoming peer in our outgoing connection state). The following example highlights this:

 Eve                       Bob                      Carol
  |      onion_message      |                         |
  |------------------------>|      onion_message      |
  |      onion_message      |------------------------>|
  |------------------------>|      onion_message      |
  |      onion_message      |------------------------>|
  |------------------------>|      onion_message      |
                            |------------------------>|
Alice                       |    onion_message_drop   |
  |      onion_message      |                    +----|
  |------------------------>|      onion_message |    |
  |                         |--------------------|--->|
  |                         |                    |    |
  |                         |                    |    |
  |                         |                    |    |
  |    onion_message_drop   |<-------------------+    |
  |<------------------------|                         |

In this example, Eve is spamming but onion_message_drop is propagated back to Alice instead. However, this scheme will statistically penalize the right incoming peer (with a probability depending on the volume of onion messages that the spamming peer is generating compared to the volume of legitimate onion messages).

It is an interesting research problem to find formulas for those probabilities to evaluate how efficient this will be against various types of spam. We hope researchers on this list will be interested in looking into it and will come up with a good model to evaluate that scheme.

To increase the accuracy of attributing onion_message_drop, more data could be stored in the future if it becomes necessary. We need more research to quantify how much accuracy would be gained by storing more data and making the protocol more complex.

Cheers, Bastien

t-bast/Onion messages rate-limiting.md

Onion messages rate-limiting

bshramin commented May 16, 2023

Uh oh!