@MichaelDrogalis
Created January 17, 2015 23:13

Onyx 0.5.0: The Cluster as a Value

Onyx 0.5.0 hit Clojars this morning, and I couldn't be more proud. More than 4 months of hard work have gone into this release, featuring some of my most creative work to date. At the top of the list of things to discuss is a profound design change: Onyx has become a masterless compute system. Following directly from this new capability are a few distinctive features - including realtime notification of cluster events via core.async, tunable, fine-grained cluster allocation algorithms, and a host of new APIs to enhance usability.

In this post, I'm going to gently take you through how I was able to achieve a compute system with no centralized coordinating entity. For a very in-depth explanation of how everything works, you can check out the Onyx Internal Documentation, which details algorithm specifics, step-by-step examples, and edge cases. Next, I'll describe how to use some of the new features, and how they're a direct result of the redesign effort.

If you'd like to skip the design-talk, you can go right for the release notes, examples, or starter repository. Otherwise, put your distributed systems hat on! (Mine is permanently glued on.)

We don't need no stinking Nimbus

I'm not sure if I made it look easy to anyone who was watching, but getting the first public release of Onyx ready for StrangeLoop was difficult, to say the least. Cranking out features and documentation each night for months on end amplified the best and worst parts of Onyx's design. We made it to the finish line, though, with a respectably powerful platform.

Following that engagement, I had some time to reflect on which pieces of Onyx were the most error-prone and cognitively straining. Distributed systems veterans won't be surprised that I determined coordination bit the hardest. Since that realization, I've had a mesmeric focus on eliminating this heap of complexity from the project. But first, a bit of background on Onyx 0.4.1 and prior.

The pros of a centralized coordination node

Onyx previously featured a number of "peers" that performed work on the cluster, and a single coordination node that directed the workers on which tasks to perform. The coordinator detected downed machines, reassigned tasks when the workload changed, and monitored timeout situations. Coupled with a hot stand-by to make the coordinator highly available, this approach has some good characteristics:

  • Most znodes in ZooKeeper will only have one watch that triggers on change - the coordinator.
  • Znode contention is very low. Usually only the coordinator and one peer are accessing a given znode.
  • Coordination decisions happen in a single place, and can be easily traced down to a single location in the code executing on a single box.

Both Storm and Hadoop take this approach, with Nimbus and the NameNode, respectively.

The cons of a centralized coordination node

In my case, I decided to take a different approach. Using a centralized coordinator just wasn't working out for me, for the following reasons:

  • More bugs have come up inside the coordinator than any other part of Onyx
  • Multiple implementations of the coordinator need to be supported in parallel (HTTP, in-memory)
  • Some state in ZooKeeper is mutable, and the coordinator needs an atomic read of some znodes - but not all
  • Most writes and reads to ZooKeeper need to have (application/semantic level) serializable consistency - burden on me to do it right
  • Burden on the user to stand up a coordinator and its failover instances
  • It's hard to support good scheduling algorithms when a subset of the information in ZooKeeper is mutable
  • Task rotation to support peer failure replacement is really hard with the 0.4.0 design

Towards a masterless design

My goal is to support Onyx long-term, and I needed a design that was going to survive. And so, I made the leap and aggressively tore up more than half of the code base. In this section, we'll step through the new components and ideas, finally arriving at a masterless architecture.

The Log

This design centers around a totally ordered sequence of commands using a log structure. The log acts as an immutable history and arbiter. It's maintained through ZooKeeper using a directory of persistent, sequential znodes. Virtual peers act as processes that consume from the log one entry at a time. Each peer maintains a local value in memory called a replica - an immutable value that reflects the application of the log entries.

Log entries represent deterministic, idempotent functions to be applied to the local replica. When a peer starts up, it initializes its local replica to the "empty state". It then plays the log from the beginning, applying each log entry to its local replica and transforming it from one value to the next.

As a reaction to each replica transformation, the peer may send more commands to the tail of the log. Peers may play the log at any rate. After each peer has played k log entries, all peers have exactly the same local replica at entry k. Peers store everything in memory - so if a peer fails, it simply reboots from scratch with a new identifier and plays the log forward from the beginning. Since it has a new identifier, it has no association with the commands it previously issued - preventing livelock issues.
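
To make the idea concrete, here's a minimal sketch of what playing the log looks like. This is not Onyx's actual implementation - the entry shape and function names are assumptions for illustration - but it captures the mechanic: a log entry names a deterministic, idempotent transition, and replaying the log is just a fold over the entries.

```clojure
;; Hypothetical sketch of log replay - not Onyx's real internals.
;; Each log entry names a deterministic, idempotent transition function.

(defmulti apply-log-entry
  "Applies a single log entry to a replica, returning the new replica."
  (fn [entry replica] (:fn entry)))

(defmethod apply-log-entry :add-peer
  [{:keys [args]} replica]
  (update-in replica [:peers] conj (:id args)))

(defmethod apply-log-entry :leave-cluster
  [{:keys [args]} replica]
  (update-in replica [:peers] disj (:id args)))

(def empty-replica
  ;; The "empty state" every peer starts from.
  {:peers #{} :jobs [] :allocations {}})

(defn play-log
  "Folds a sequence of log entries into a replica, starting from the empty state."
  [entries]
  (reduce (fn [replica entry] (apply-log-entry entry replica))
          empty-replica
          entries))
```

Because every transition is a pure function of the entry and the current replica, any two peers that have played the same prefix of the log arrive at exactly the same value - no coordinator is required to keep them in agreement.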

  • Choosing between performance and coordination to fix first after StrangeLoop

  • No more in-memory mode. In-memory services instead

    • This has the advantage that the code you use to test locally is the same code you deploy to production. No dependency injection, no alternate interfaces - just a few changes to an EDN configuration map to locate ZooKeeper and HornetQ. Same spirit as Erlang, and much easier to work with than Cascading/Cascalog's local mode. (See the configuration sketch after this outline.)
  • Log structure

    • Logs make great transaction arbiters
    • Tail of the log acts as a monotonic clock
    • The log doesn't need to be super highly performant
  • Replicas

    • Stress that replicas update on independent timelines
  • Realtime event notifications

    • You can now do an open set of things
  • Schedulers

    • Capped number of peers
  • Removing dead peers

  • Adding new peers

  • Garbage collection

  • Winning the War

    • Towards 0.6.0
    • HornetQ, Storm, and Performance
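
To illustrate the note about in-memory services above, here's a rough sketch of how switching between a local environment and production can come down to a few values in a plain data map. The key names here are illustrative assumptions, not Onyx's verbatim configuration schema.

```clojure
;; Illustrative EDN configuration maps - key names are assumptions, not
;; Onyx's exact schema. The point is that local testing and production
;; differ only by a few values in plain data.

;; Local development: embedded ZooKeeper and in-JVM HornetQ.
{:zookeeper/address "127.0.0.1:2181"
 :zookeeper/server? true
 :hornetq/mode :vm}

;; Production: the same map, pointed at real, external services.
{:zookeeper/address "zk1.example.com:2181,zk2.example.com:2181"
 :zookeeper/server? false
 :hornetq/mode :standalone
 :hornetq/host "hq.example.com"
 :hornetq/port 5445}
```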

I'm elated to announce the next release of Onyx, version 0.5.0. This release is packed with new services, enhanced API functionality, and core design changes. Most notably, Onyx is now a fully masterless system. Additionally, Onyx now exposes advanced schedulers for expressing cluster allocation across jobs and tasks at a fine-grained level. In this post, I'd like to walk you through how we were able to achieve a masterless compute system, and how to exercise the new functionality.

Once things had calmed down after StrangeLoop, I knew that coordination was in the worst state out of any part of the code base. I needed a design that was going to survive. Distributed coordination was by far the hardest piece to get right with Onyx out of the gate. I needed ZooKeeper to expose different semantic consistency models for different pieces of data (e.g. all reads and writes serialized for one data set, reads at any point but writes serialized for another, etc.).

Much of Onyx's internals can now be tested as pure functions. Percentage allocators are a big deal - fixed numeric allocations are easy to set, but they force a redeploy or rebalance whenever the cluster grows or shrinks.
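
Because replica transitions are pure functions, they can be exercised with no running cluster at all. A hypothetical test - reusing the play-log sketch from earlier, with no ZooKeeper and no network involved - might look like this:

```clojure
;; Hypothetical test of replica transitions as pure functions.
;; Reuses empty-replica, apply-log-entry, and play-log from the earlier sketch.
(require '[clojure.test :refer [deftest is run-tests]])

(deftest peers-join-and-leave
  (let [entries [{:fn :add-peer      :args {:id :p1}}
                 {:fn :add-peer      :args {:id :p2}}
                 {:fn :leave-cluster :args {:id :p1}}]]
    ;; Data in, data out: the surviving peer set is just the result of the fold.
    (is (= #{:p2} (:peers (play-log entries))))))

(run-tests)
```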

Talk about drawbacks!

And so, dear friends (I believe we are friends at this point, if you're still reading)

Talk about the local state on each peer. Be clear that the peers actually do something with it.

  • You can see every single log command in the docs. Want to send an email after each job completes?
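
For instance, reacting to job completion can be as simple as watching log entries arrive on a core.async channel. In the sketch below, `subscribe-to-log!` and `send-email!` are stand-ins passed in by the caller, and the entry shape (and the `:complete-job` command name) matches the earlier hypothetical sketches rather than Onyx's exact log format - the shape of the loop is the point.

```clojure
(require '[clojure.core.async :refer [chan go-loop <!]])

;; Hypothetical sketch: `subscribe-to-log!` stands in for however you wire a
;; core.async channel up to the log; `send-email!` is your own side effect.
(defn watch-for-completed-jobs!
  [subscribe-to-log! send-email!]
  (let [ch (chan 100)]
    (subscribe-to-log! ch)
    (go-loop []
      (when-let [entry (<! ch)]
        ;; React only to the entries we care about; ignore everything else.
        (when (= :complete-job (:fn entry))
          (send-email! (str "Job " (get-in entry [:args :job]) " completed")))
        (recur)))
    ch))
```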

The information model remains intact!
