
@jdarcy
Created November 9, 2022 16:10
Some thoughts about ActivityPub

I've commented a few times about some issues I see with the scalability of ActivityPub - the protocol behind the Fediverse and its best-known implementation Mastodon. A couple of folks have asked for more elaboration, so ... here it is.

First, let me add some disclaimers and warnings. I haven't devoted a lot of time to looking at ActivityPub, so there might be some things I've misunderstood about it. On the other hand, I've brought bigger systems - similar node counts and orders of magnitude more activity per node - from broken to working well based on less study of the protocols involved. So if you want to correct particular misconceptions, that's great. Thank you in advance. If you want to turn this into an appeal to authority and say that I'm wrong only because I haven't developed a full ActivityPub implementation or worked on it for X years ... GTFO.

What

What is ActivityPub? It's an HTTP- and JSON-based protocol for exchanging information about "activities". An activity could be many things. The only real common element is that activities are published by authors, creating a stream of activities from that author, and then read from that stream by others. As examples, there are ActivityPub-based servers for (somewhat) Twitter-like microblogging, for regular blogging (arguably what I should use to distribute this note), forums, images, music, and even online chess. Cool idea, huh? A younger me might even have been impressed by the generality and breadth of applications. An older me frankly sees those as red flags. You might see hints of why in the parts that follow, but that's at core just a gut feeling I can't substantiate, so I won't harp on it more right now.
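For concreteness, here's roughly what's on the wire. This is a hand-written sketch, not output from any real server: the domain, user, and IDs are made up, though the field names come from the ActivityStreams vocabulary and the Accept header is one ActivityPub servers conventionally honor.

```python
import json
import urllib.request

# A minimal "Create" activity as it might appear in an author's outbox.
# All URLs here are placeholders, not real instances.
activity = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Create",
    "id": "https://example.social/users/alice/statuses/1",
    "actor": "https://example.social/users/alice",
    "object": {
        "type": "Note",
        "content": "Hello, Fediverse!",
    },
}

# Reading someone's stream is just an HTTP GET with the right media type.
req = urllib.request.Request(
    "https://example.social/users/alice/outbox",
    headers={"Accept": "application/activity+json"},
)
with urllib.request.urlopen(req) as resp:
    outbox = json.load(resp)
```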

Why

Why do I think scaling with ActivityPub is going to be problematic? Because it already seems to be. As I write this, there seem to be about 7M Fediverse users with an average of 800 per node, or about 9K nodes. In terms of users, that's around 3% of Twitter scale ... and there already seem to be some cracks appearing. Sidekiq is the part of Mastodon that handles inter-node communications. I've seen many accounts of Sidekiq queues in the hundreds of thousands, even millions, resulting in delays of hours before posts become available. This is even with proprietors spending lots of money to turn up dozen-machine clusters and tune the hell out of them, suspend user registrations, turn off features, etc. What this tells us is that inter-node communications has become a drag on the whole system. Even if I'm wrong about the specific causes or solutions, that's just empirical fact.

Is this just a Mastodon problem? I've certainly seen many people focusing on that, but I don't think so. I think the Mastodon implementation has just hit the wall first, due to some combination of implementation (in)efficiency and popularity. I strongly suspect that Pleroma and Misskey and others will hit it pretty soon too, because the wall is that ActivityPub in its basic form creates an all-to-all communication pattern. Anyone who has worked in distributed systems knows that's going to be a quadratically more painful problem as the system grows. (Note BTW that I'm not saying that the communication is literally every node ("instance" in Fediverse speak) talking to every other. That's just the asymptotic O(n²) worst case. For purposes of this discussion that's what matters.)
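To make the shape of that growth concrete, here's a toy calculation. Every number in it is invented for illustration; the point is only that total deliveries scale with the square of the node count when per-node activity and reach stay constant.

```python
# Back-of-the-envelope: if each post must be delivered to some fraction of
# the other nodes, total inter-node deliveries grow quadratically with the
# node count. These numbers are illustrative, not measurements.
posts_per_node_per_day = 1000
reach_fraction = 0.10  # fraction of other nodes each post must reach

for nodes in (1_000, 9_000, 90_000):
    deliveries = nodes * posts_per_node_per_day * (nodes - 1) * reach_fraction
    print(f"{nodes:>7} nodes -> {deliveries:,.0f} deliveries/day")
```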

Some people have suggested that HTTP caching can or will solve this problem. I don't think so. For one thing, that would already be solving it and it manifestly isn't. Caching and CDNs don't seem widely deployed. Why not? I think there are three reasons.

  • First, any caching works best for traffic that matches its design assumptions. The first of those assumptions for HTTP caching is stable URLs. "Stable" in this sense means that the cache or CDN can distinguish between essential and non-essential parts of a URL (e.g. all those tracking arguments) so that it can match a request to content it already has. This works well for URLs that represent actual static content, but not for URLs that represent RPC endpoints - including ActivityPub. Caching those would require peeking beyond the HTTP headers, which third parties don't do and often can't do e.g. because of encryption.

  • The second set of design assumptions for HTTP caching has to do with traffic patterns. HTTP caching works super well when many clients near a particular cache are requesting the same content at the same time, because that's what it was designed for. Again, because it's important: many clients near the same cache, same content at the same time. Does Fediverse traffic match those characteristics? Hell no. In fact it's a lot closer to the exact opposite. Caching is literally worthless - even of negative value - when the time between two requests for the same content exceeds the cache expiration/turnover time, because the second request has to go to the origin anyway (see the sketch after this list). That's likely to be a common case in the Fediverse, leading to those long Sidekiq queues or equivalent on other implementations. In other cases there's likely to be some benefit from caching, but quite likely not enough to solve the fundamental problem.

  • Lastly, if the caching is done by not-so-trustworthy third parties (whether or not to trust CDN operators like CloudFlare is a topic for another day), then you have to deal with identity spoofing and content modification/leakage. That means encryption, signing, etc. It's entirely possible that such information could be put into ActivityPub requests and responses (the JSON-based data format is pleasantly extensible that way), but AFAICT nobody has even begun that work and I know it's a long road. One reason I know is that my first few patents relate to a project to do exactly this kind of thing - secure distributed caching between non-mutually-trusting parties. I spent years on this stuff, albeit long ago. Anybody who tries to brush it aside as an easy problem is in for a very rude awakening.
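Here's the sketch promised above: a toy model of hit rate when requests for the same object arrive, on average, slower than the cache turns over. The TTL and inter-arrival numbers are invented, and real traffic isn't exponentially distributed, but the qualitative result holds.

```python
import random

# Toy model: a cached entry helps only if the next request for the same
# object arrives before the entry expires. All numbers are invented.
random.seed(42)
ttl = 300        # cache lifetime, seconds
mean_gap = 1800  # mean time between requests for the same object, seconds

trials = 100_000
hits = sum(random.expovariate(1 / mean_gap) < ttl for _ in range(trials))
print(f"hit rate: {hits / trials:.1%}")  # about 15% = 1 - exp(-ttl/mean_gap)
```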

In conclusion, I believe that inter-node ("between instance") communication is going to be the key problem as the Fediverse gets bigger, and that there are no easy band-aids that will just make it go away.

How

Lastly, in the interests of making a positive as well as a negative contribution: how do we solve this? For the reasons outlined above, I believe that trying to make ActivityPub work with CDNs is somewhere between Not Worth It and Impossible. What's needed is an ActivityPub-specific caching system. Something that is connected in the right ways so that the cache hits more than it misses, that can look inside the ActivityPub data to recognize matches, and that fully supports the necessary security mechanisms. Yes, I am aware of relays. In fact, I believe they're a step in the right direction, but only the first step in what I know to be a long road.

What kind of cache connection pattern would be (a) implementable and (b) sufficient to make caching efficacious? Well, first you need many relays - might as well keep calling them that - to exist. Then you need some sort of system for registering and discovering the "right" relays to use. That system can be either centralized or distributed, in either a logistical or organizational sense. It could also be embedded right into the relays themselves, or separate. There are a lot of design choices to be made, and I'm going to hand-wave over all of them. I hope that people will start thinking seriously about the alternatives and tradeoffs, and I might even enjoy being part of those discussions, but for now I'll just defer on that part. What's important is how this could change the ActivityPub traffic pattern from a very cache-hostile one to a cache-friendly one. Instead of clients connecting directly to servers to fetch data, here's what would happen (a code sketch follows the list).

  1. Client queries the registry system to find the relay nearest either to themselves or to the desired server. Note that "nearest" would probably mean in terms of network distance, but not necessarily. In the old P2P world many technologies were developed that actually worked better with "nearest" being in terms of some mathematical relationship instead.

  2. The relay first checks its cache, and delivers the matching content if it finds some. Voila! The client will probably get its result faster, but more importantly that load is kept off the server (which can then spend more of its time dealing with its own clients/users instead of getting bogged down by excessive between-instance traffic).

  3. If the relay doesn't find a match, it can query either another relay or the actual server.

  4. When the relay gets a response back, it stores it using some "tag" that's easy to look up later, along with expiration/refresh information. This is taking advantage of the fact that (at least in most uses) it's OK for ActivityPub data to be eventually consistent. Faster expirations or shorter refresh times mean more current data, but also more server load. It's a knob that can be turned, based on empirical data on what provides the best user experience when server queue lengths and response times are considered.

  5. The relay also delivers the result back to the originally requesting client.

  6. When the relay gets a second request for the same activity stream, it should with reasonably high probability be able to deliver a result from cache with low local overhead and no need to bother the original server. That is, after all, the point of the whole exercise.
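Here's a minimal sketch of that flow in code. Everything about it is illustrative: the Relay class, the tag scheme, and fetch_upstream are names I've invented for this note, and a real relay would deal with HTTP, eviction, and the security material discussed below.

```python
import time
from typing import Callable


class Relay:
    """Illustrative relay cache implementing steps 2-5 above.
    fetch_upstream stands in for either another relay or the origin server."""

    def __init__(self, fetch_upstream: Callable[[str], bytes], ttl: float = 300.0):
        self.fetch_upstream = fetch_upstream
        self.ttl = ttl  # the freshness-vs-server-load knob from step 4
        self.cache: dict[str, tuple[bytes, float]] = {}

    def get(self, tag: str) -> bytes:
        # Step 2: serve from cache if we hold a fresh entry for this tag.
        entry = self.cache.get(tag)
        if entry is not None and time.monotonic() < entry[1]:
            return entry[0]
        # Step 3: miss (or stale entry) -> ask another relay or the origin.
        data = self.fetch_upstream(tag)
        # Step 4: store under the tag with an expiration time.
        self.cache[tag] = (data, time.monotonic() + self.ttl)
        # Step 5: deliver the result to the requesting client.
        return data


# The two-tier arrangement from the next paragraph is just relays chained
# together: an edge relay that misses falls through to a server-facing one.
# inner = Relay(fetch_upstream=fetch_from_origin)
# edge = Relay(fetch_upstream=inner.get)
```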

In a very simple implementation, there might just be a single tier of relays as there is today (but more of them and more organized). The next step would be to have client-facing and server-facing tiers talking to each other, creating two opportunities for each piece of data to be cached. The most elaborate would be a full mesh, looking either like a CDN if relay assignments are geographically based or like an old-school P2P network if relay assignments are otherwise. In this case there would be multiple opportunities for data to be cached between client and server, hopefully increasing the cache hit rate more than it increases the cache-miss latency penalty.

Lastly, we need to talk about security. If the Fediverse is to remain distributed, both logistically and organizationally, that means the relays can't be trusted by either clients or servers. To avoid the aforementioned problems with spoofing etc., that means relays have to do their part in maintaining a "chain of custody" between clients and servers. In practice, this means that every item in a relay's cache has to have not only the tag and expiration time I mentioned before, but also various kinds of information that will allow a client - eventually - to verify and decrypt the data it's getting. Or not, if the client isn't authorized to have that particular piece of data. That means such information also needs to be carried within the ActivityPub protocol. I won't get into all the permutations of signatures, checksums, HMACs, key IDs, initialization vectors, and so on that can be involved. What's important is that there's a huge community and body of work that's readily applicable here, and which should be applied to make the Fediverse secure even at the necessary scale and complexity.
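As one tiny illustration of the chain-of-custody idea (and nothing more): the origin signs what it publishes, relays store payload and signature opaquely, and clients verify against the origin's published key before trusting anything a relay handed them. This sketch uses Ed25519 via the third-party Python cryptography package purely as an example; it says nothing about key distribution, encryption, or authorization, and it is not how any existing Fediverse software signs things.

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Origin server: sign the serialized activity before it ever reaches a relay.
origin_key = Ed25519PrivateKey.generate()
payload = b'{"type": "Create", "id": "https://example.social/users/alice/statuses/1"}'
signature = origin_key.sign(payload)

# Relay: stores (payload, signature) opaquely; it can't forge either one.
cached = (payload, signature)

# Client: verifies against the origin's published public key before trusting
# anything from a relay. Raises InvalidSignature if the data was tampered with.
public_key = origin_key.public_key()
public_key.verify(cached[1], cached[0])  # no exception -> authentic
```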

Coda

In case you can't tell, there's a lot of technical "meat on the bone" here. There's a reason I spent years of my life working on exactly these kinds of problems. None of these problems are unsolvable by any means, but they are not solved already and AFAICT there's not a lot of motion yet on solving them. Primarily I think that's because a very important group of stakeholders - instance operators - is too busy right now keeping their heads above water in the system as it is. Another group - developers - seems a bit focused on local-implementation issues. That's quite reasonable given where things are at now, but I think people also need to look at the thing that I believe will be "the wall" within the next year or so - between-instance communication complexity, relays, caching, etc. My hope in writing this is to help get people looking at it and talking about it. If you don't think I've provided good starting points, please do provide your own. If you think I've misunderstood something, or you disagree with my diagnoses and proposals, please do say so. At least then we'll be talking about the elephant in the room.

@mariusor

Hi Jeff, allow me to offer a perspective from someone who has worked with the ActivityPub specification for a number of years.

I will add some thoughts about caching:

The main reason why caching is not very useful on the Fediverse (including the Mastodon flavoured Fediverse) is that the majority of ActivityPub traffic is formed of POST requests between instances, which is uncacheable by definition.

On retrieval there are some parts where caching would work, and some where it'll kind of work:

The most cacheable resources are the Activities, which can be cached indefinitely because they're mostly immutable (outside of being the object of an Undo).

Collections can be cached easily, but the freshness of the first CollectionPage will probably be poor because there's a lot of activity there (they are ordered lists, with the newest items at the top). If the service uses a keyset pagination format, though, it provides a stable and cacheable resource for the subsequent pages.
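For example (hypothetical URLs; max_id mirrors the convention Mastodon's client API uses):

```python
# Offset pagination: "page 2" shifts every time a new activity lands at the
# top of the collection, so a cached copy goes stale almost immediately.
unstable = "https://example.social/users/alice/outbox?page=2"

# Keyset pagination: "everything older than item 1234" names a fixed slice
# of an append-only list, so the response stays valid and cacheable.
stable = "https://example.social/users/alice/outbox?max_id=1234"
```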

Objects can also be cached (probably better than collections), but they can go stale because there are multiple Activity types that can modify them (Update, Delete, Move, etc). I imagine that a proper caching mechanism using ETags and a HEAD/GET pair of requests on the client side could work well here.
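A rough sketch of that client-side pattern, assuming the server emits ETags for objects (the URL here is hypothetical):

```python
import urllib.error
import urllib.request

url = "https://example.social/users/alice/statuses/1"  # hypothetical object URL
headers = {"Accept": "application/activity+json"}

# First fetch: keep the validator the server sent alongside the body.
with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as resp:
    etag = resp.headers.get("ETag")
    body = resp.read()

# Later revalidation: a conditional GET lets an unchanged object come back
# as a bodyless 304 instead of being re-serialized and re-sent.
try:
    req = urllib.request.Request(url, headers={**headers, "If-None-Match": etag})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()  # 200: the object changed, take the new version
except urllib.error.HTTPError as err:
    if err.code != 304:  # 304: the cached copy is still good
        raise
```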

Overall I think the solution to ActivityPub being fast(er) is not caching but minimizing the times we have to build collections/activities/objects for every request from component parts - ie, store them as close as possible to the original format instead of hitting multiple database tables and aggregating the information.

Replacing the inefficient queuing tools that Mastodon uses will probably decrease the load on the individual machines that host the instances, and to me that looks like the most self-contained change that can be done reasonably fast by the Mastodon devs. The setup cost of an instance is already pretty high, so replacing Sidekiq with something lighter and faster won't have a big impact on it.

@alexanderankin

Overall I think the solution to ActivityPub being fast(er) is not caching but minimizing the times we have to build collections/activities/objects for every request from component parts - ie, store them as close as possible to the original format instead of hitting multiple database tables and aggregating the information.

+1

@robconery commented Nov 22, 2022

Why do I think scaling with ActivityPub is going to be problematic? Because it already seems to be. As I write this, there seem to be about 7M Fediverse users with an average of 800 per node, or about 9K nodes. In terms of users, that's around 3% of Twitter scale ... and there already seem to be some cracks appearing. Sidekiq is the part of Mastodon that handles inter-node communications. I've seen many accounts of Sidekiq queues in the hundreds of thousands, even millions, resulting in delays of hours before posts become available. This is even with proprietors spending lots of money to turn up dozen-machine clusters and tune the hell out of them, suspend user registrations, turn off features, etc. What this tells us is that inter-node communications has become a drag on the whole system. Even if I'm wrong about the specific causes or solutions, that's just empirical fact.

I'm digging into the Mastodon streaming code and it's fascinating, to say the least. Before I go over that, I would suggest that basing your critique of ActivityPub on Mastodon's architecture could be problematic. I do see your point - particularly the O(n) push issue - but for some platforms (namely Erlang) that's not much of an issue. And I'm also not convinced that every toot needs to be pushed to every other instance, but I could be wrong about that - I'm still digging in.

Anyway - it appears that Mastodon uses Node to spawn n workers to handle subscription "stuff" which, from what I can tell, is subscribing to remote servers and handling the realtime aspect. I'm still trying to understand the code - but from what I see this is where the heavy lifting is and has nothing to do with Sidekiq.

EDIT
While still O(n), the constant factor is something entirely different based on this comment:

The federated timeline is a firehose view of the Mastodon network. But it only includes people your neighbours are subscribed to, so it's not complete.

Seems like an interesting optimization for smaller servers....


In fact I would go so far as to say it suggests Redis is the bottleneck as well as HTTP traffic in general, but I need to read more.

Anyway: my point is that I don't think ActivityPub is the problem in and of itself - I think Mastodon's streaming is the problem you're pointing out (which I think I agree with). Will be interesting to see how this plays out over time!
