## Our project:
#### Decentralized Membership System for Health Check and Service Discovery
(For clarification, we are proposing an engineering project, not a research topic. We want to open source this project, and we hope it can be used in a real production environment in the future.)
Problem we want to solve: relieve the load on the central coordinator, and thus improve scalability.
Why we think this might be a problem:
### 0. Brief
In a cluster containing 50k+ nodes, the average delay for a client to detect a service node failure is:
* Using a central coordinator system: ~10s
* Using a decentralized system: ~3s
Note that the central coordinator system will already be saturated. The decentralized system is NOT supposed to replace the central coordinator system, but to help it. Below is our analysis and how we computed these numbers.
### 1. Scenario
##### How membership (service discovery) works in practice:
In a typical production cluster, a "service" consists of several nodes, including some working nodes and some backup nodes. A working node registers its identity, address, and some other information in a central configuration coordinator (e.g. ZooKeeper, etcd, an epaxos quorum). The entry is ephemeral, so once a node dies the entry will soon disappear; to keep the entry, the node has to issue a write operation that extends the entry's TTL before it expires.
The backup nodes keep watching the entry in the configuration coordinator. Once they are notified that the entry has disappeared, they compete to register themselves, and whichever registers successfully begins serving requests.
For clients, there are two approaches: (a) the client keeps watching the entry and gets notified when a change happens; (b) the client reads the entry, caches it, and re-reads it when it thinks that is necessary (polling).
Assume we use the first approach here; we will discuss what happens with the second approach later. A sketch of this registration/watch pattern follows.
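To make the registration/watch flow above concrete, here is a minimal sketch using the etcd v3 Go client. The endpoints, key names, and TTL are placeholders of our own, and the exact client API may differ slightly between etcd versions, so treat this as an illustration of the pattern rather than production code:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Working node: register an ephemeral entry bound to a 5s lease (the TTL).
	lease, err := cli.Grant(ctx, 5)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "/services/foo/node-1", "10.0.0.1:8080",
		clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}

	// Keep-alive writes extend the lease; if the node dies, the lease
	// expires and the entry disappears automatically.
	ka, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	go func() {
		for range ka { // drain keep-alive acknowledgements
		}
	}()

	// Client (or backup node): watch the service prefix and react when an
	// entry is deleted, i.e. the working node is considered dead.
	for resp := range cli.Watch(ctx, "/services/foo/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			if ev.Type == clientv3.EventTypeDelete {
				log.Printf("entry %s gone: fail over or re-resolve", ev.Kv.Key)
			}
		}
	}
}
```

A backup node would run the same watch and, on a delete event, try to register itself under the same key.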
Workload characteristics:
Imagine that we have a cluster containing 50k service nodes (they could be 50k machines or 50k containers) and 50k clients.
Then we will have a constant, heavy write load (nodes keep refreshing their entries), plus a burst of notifications to the clients whenever a failure happens.
### 2. The power (or limits) of the central coordinators:
##### The typical throughput of the central coordinator system:
* ZooKeeper: according to https://zookeeper.apache.org/doc/r3.2.2/zookeeperOver.html#Performance, and assuming a half-read/half-write workload (in fact, we think the workload for maintaining membership will be write heavy), ZooKeeper can handle around 40k req/s.
* etcd: no official benchmark found, but according to the authors it can handle around 10k req/s pretty well.
* epaxos: with a write-heavy, short-key workload, epaxos can get more than 40k req/s. But when disk I/O is included, the throughput decreases.
##### The notification sending rate on one server:
Assume a 10Gbps Ethernet NIC and that each notification is around 100 bytes (including headers); then the raw line rate alone would allow roughly 12.5M pkt/s (see the calculation below). However, since the CPU needs to do some encoding and memory copying during the process, in practice we will probably hit the CPU bound before hitting the NIC's limit. A reasonable sending rate is around 40k pkt/s. (Note that this rate will probably saturate the CPU.)
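For reference, the raw line-rate bound implied by these assumptions (a 10 Gbit/s link and ~100 B per notification) works out as:

```latex
\frac{10 \times 10^{9}\ \text{bit/s}}{100\ \text{B/pkt} \times 8\ \text{bit/B}} = 12.5 \times 10^{6}\ \text{pkt/s}
```

so the ~40k pkt/s figure above is really a CPU-bound estimate of ours, not a NIC limit.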
### 3. Applying the scenario:
Assume we have a cluster with 50k service nodes, and for a single service there are 50k clients using it (watching the service's entry).
Since we also need some CPU time to notify the clients when a node failure is detected, we are not going to spend all the CPU time handling the "keep-alive" write requests. Let's say we spend 90% of the CPU time on those writes; then the time between two consecutive writes from a single service node should be at least 1.5s. That leaves only 10% of the CPU time for sending notifications, and we have 50k notifications to send, so even if all machines in the quorum help send those notifications together, it will still take about 4 to 5 seconds after the server finds the entry has disappeared to cover all the clients. (A sketch of both calculations follows.)
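Here is how we arrive at those two numbers, using the ~40k req/s coordinator capacity from Section 2; the 3-member quorum in the second line is an illustrative assumption of ours:

```latex
\text{min keep-alive interval} \approx \frac{50{,}000\ \text{nodes}}{0.9 \times 40{,}000\ \text{req/s}} \approx 1.4\ \text{s} \quad (\text{rounded up to } 1.5\ \text{s})

\text{broadcast time} \approx \frac{50{,}000\ \text{clients}}{3\ \text{servers} \times 0.1 \times 40{,}000\ \text{pkt/s}} \approx 4.2\ \text{s}
```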
Besides, in order to reduce false positives in failure detection, the TTL of an entry is usually several times larger than the interval between two consecutive writes. In many cases, people choose 5 as the factor.
So the average time from when a node failure happens to when a client gets the notification will be about 1.5 * 5 + 0.5 * 5 = 10s.
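Spelling out that sum (this is our reading of the two terms): the first term is the full TTL that has to elapse before the entry expires, and the second is the average broadcast delay, i.e. half of the ~5s worst case above:

```latex
\underbrace{5 \times 1.5\ \text{s}}_{\text{entry TTL expires}} + \underbrace{0.5 \times 5\ \text{s}}_{\text{avg. broadcast delay}} = 7.5\ \text{s} + 2.5\ \text{s} = 10\ \text{s}
```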
10s might not sound bad in some cases, but why not improve it if we can?
### 4. What can a decentralized membership system do?
We ran a simulation to test the characteristics of a gossip-based membership system (https://github.com/go-distributed/gossip_simulator).
In our simulation, we set the parameters as follows:
* Total number of nodes = 50k
* Network delay = 10ms - 20ms (very conservative)
* Packet loss rate = 10%
Our experimental results show that it takes about 700ms to reach 99.9% coverage when broadcasting a message, so the average delay is less than 500ms. A stripped-down sketch of the gossip round we simulated follows.
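The repository above contains the actual simulator; the following is only a simplified, round-based sketch of the push-gossip dissemination it models. The fan-out of 5 and the synchronous rounds are assumptions of ours (the real simulator works with per-message network delays of 10-20ms), so this conveys the structure rather than reproducing the 700ms number:

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const (
		n        = 50000 // total nodes
		fanout   = 5     // peers each informed node gossips to per round (assumption)
		lossRate = 0.10  // 10% packet loss
		target   = 0.999 // stop at 99.9% coverage
	)

	infected := make([]bool, n)
	infected[0] = true // the node that observed the event starts the broadcast
	covered := 1

	for round := 1; float64(covered)/n < target; round++ {
		var newly []int
		for i := 0; i < n; i++ {
			if !infected[i] {
				continue
			}
			for j := 0; j < fanout; j++ {
				peer := rand.Intn(n)
				if rand.Float64() < lossRate {
					continue // message dropped by the network
				}
				if !infected[peer] {
					newly = append(newly, peer)
				}
			}
		}
		for _, p := range newly {
			if !infected[p] {
				infected[p] = true
				covered++
			}
		}
		fmt.Printf("round %d: %.2f%% covered\n", round, 100*float64(covered)/n)
	}
}
```

In each round every node that has the message pushes it to a few random peers, so the number of informed nodes grows roughly geometrically; that is why 50k nodes can be covered within a handful of network delays.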
Besides, since the load is fully distributed over the cluster (200k messages in total, with each node handling at most 5 sends and receives), it is trivial for a node to increase its heartbeat frequency. Let's set it to 2 beats/s, and mark a node as failed when we have missed 5 consecutive heartbeats from it.
Thus the average delay from when the failure happens to when a client gets notified is around 3s.
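The 3s figure decomposes as follows (again our reading, using the parameters above and the sub-500ms average propagation delay from the simulation):

```latex
\underbrace{\frac{5\ \text{missed heartbeats}}{2\ \text{beats/s}}}_{2.5\ \text{s to declare the node failed}} + \underbrace{\approx 0.5\ \text{s}}_{\text{gossip propagation}} \approx 3\ \text{s}
```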
Actually, in a datacenter environment the network delay is much smaller, so the average delay should definitely be less than 3s.
Another problem that a decentralized system can help with: when some nodes get partitioned from the central coordinators, they will be marked as failed. However, we know this is not the truth; those partitioned nodes are still able to work. Thus a peer-to-peer decentralized membership system can do a better job of reducing such false positives.
OK, now let's talk about what happens if the clients just poll the central coordinators instead of watching the entry all the time. We think that can be an alternative approach to relieve the load on the central coordinator, but the question is how often to poll. If we poll too often, it's even worse than watching; but if the polling cycle is too long, the latency of failure detection becomes much larger.
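To make that trade-off concrete with our own illustrative numbers (the 10k req/s read budget below is an assumption, chosen to stay well under the coordinator capacities from Section 2): with 50k clients each polling every T seconds,

```latex
\text{read load} = \frac{50{,}000\ \text{clients}}{T} \le 10{,}000\ \text{req/s} \;\Longrightarrow\; T \ge 5\ \text{s},
```

so the polling interval alone would already exceed the ~3s end-to-end detection delay of the gossip approach.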
After all, we are NOT proposing something to REPLACE a central coordinator; we want to provide a system that can work with the central coordinator and help it, so that we can achieve even better scalability.
We have discussed and thought about your questions, and we think you are right that we should improve the central coordinator when it is in trouble. But we also believe a decentralized system can help the central coordinator. According to the paper and our simulation, the decentralized system we want to implement is highly robust and converges quickly, which makes it very suitable for broadcasting. Even though it is not a "reliable broadcast", it is pretty reliable; that means the penalty is very small and it is worth the load we save on the central coordinator.
Again, we are not proposing something to replace a centralized coordinator system; we just want to take a very small amount of resources from every node in the cluster and use them to improve scalability. We think this is a good trade-off.
And that's it. Thanks for taking the time to read this!
Please feel free to point out any assumptions or measurements we made here that you find hard to agree with, or anything we missed. Any questions or advice are welcome. We are all ears!
Regards,
-Yifan