@prakhar1989
Last active September 17, 2015 23:14
Distributed Systems

Jepsen Summaries

  • First post in the series, explaining the motivation and goals of the project.
  • Since computers have finite memory and latency bounds, we introduce timeouts, which close the connection when expected messages fail to arrive within a given time frame.
  • Detecting network failures is hard. Since our only knowledge of the other nodes passes through the network, delays are indistinguishable from failure. This is the fundamental problem of the network partition: latency high enough to be treated as a failure.
  • The CAP theorem tells us that we can either have consistency (technically, linearizability for a read-write register), or availability (all nodes can continue to handle requests), but not both.
  • In this series, I'm going to demonstrate how some real distributed systems behave when the network fails.
  • To simulate slow networks, or networks which drop packets, we can use tc to adjust the Ethernet interface; Jepsen assumes the inter-node interface is eth0. To drop traffic entirely (simulating a partition), iptables is used.
  • Methodology: For each write we record whether the database acknowledged the write successfully or whether there was an error. At the end of the run, we ask the database for the full set. If acknowledged writes are missing, or unacknowledged writes are present, we know that the system was inconsistent in some way: that the client application and the database disagreed about the state of the system.
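The timeout mechanism described above can be sketched in a few lines of Python. This is a hypothetical illustration, not Jepsen's code: a local socketpair stands in for a remote node that never replies, and the helper name `recv_with_timeout` is my own.

```python
import socket

def recv_with_timeout(sock, timeout_s=1.0, bufsize=1024):
    """Wait up to timeout_s for a message; treat silence as failure."""
    sock.settimeout(timeout_s)
    try:
        return sock.recv(bufsize)
    except socket.timeout:
        # Expected message failed to arrive in time: close the connection,
        # as described above. Note we cannot tell whether the peer is dead,
        # slow, or partitioned away -- only that we heard nothing.
        sock.close()
        return None

# A local socket pair stands in for a network peer that never responds.
local, remote = socket.socketpair()
print(recv_with_timeout(local, timeout_s=0.1))
```

Note that the timeout tells us nothing about *why* the message never arrived, which is exactly the delay-vs-failure ambiguity the next bullet describes.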
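The tc and iptables fault injection above can be sketched as a small Python helper. The commands mirror standard `tc netem` and `iptables` usage; the helper names, delay values, and dry-run structure are my own (building the commands requires no privileges, running them requires root on a test node):

```python
import subprocess

IFACE = "eth0"  # Jepsen assumes the inter-node interface is eth0

def slow_network_cmd(delay_ms=50, jitter_ms=20):
    """tc netem adds latency (with jitter) to every outbound packet."""
    return ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
            "delay", f"{delay_ms}ms", f"{jitter_ms}ms"]

def drop_traffic_cmd(peer_ip):
    """iptables silently drops all packets from a peer, simulating a partition."""
    return ["iptables", "-A", "INPUT", "-s", peer_ip, "-j", "DROP"]

def run(cmd, execute=False):
    """Dry-run by default; pass execute=True on a disposable test node only."""
    if execute:
        subprocess.run(cmd, check=True)
    return " ".join(cmd)

print(run(slow_network_cmd()))
print(run(drop_traffic_cmd("10.0.0.2")))
```

Dropping with iptables rather than tc gives a clean partition: packets vanish with no ICMP error, so the remote node sees only silence, which is what makes partitions look like slowness.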
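The consistency check in the methodology above reduces to two set differences at the end of the run. A minimal sketch, assuming a hypothetical `FlakyDB` stand-in whose acknowledgements don't always match what actually landed (Jepsen's real checker is Clojure; everything here is illustrative):

```python
import random

class FlakyDB:
    """Stand-in for a database whose acks can disagree with reality."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.data = set()

    def write(self, x):
        """Return True (acknowledged) or False (error). Independently, the
        write may or may not actually land -- as under a real partition."""
        if self.rng.random() < 0.8:
            self.data.add(x)
        return self.rng.random() < 0.9

    def read_all(self):
        return set(self.data)

db = FlakyDB()
acked, unacked = set(), set()
for i in range(100):
    (acked if db.write(i) else unacked).add(i)

final = db.read_all()
lost = acked - final         # acknowledged writes that are missing
surprises = unacked & final  # unacknowledged writes that are present
print(f"lost acknowledged writes: {len(lost)}")
print(f"present unacknowledged writes: {len(surprises)}")
```

Either non-empty set means the client and the database disagreed about the state of the system, which is exactly the inconsistency the methodology is designed to surface.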

Testing etcd and Consul (linked below). This is quite representative of the code we'll have to write in order to run Jepsen against a distributed system - https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul

A good summary - https://aphyr.com/posts/286-call-me-maybe-final-thoughts

Distributed Systems

  • RethinkDB: A scalable realtime datastore built on a push architecture. Check out the founder's talk for more.
  • Serf: A lightweight solution for cluster membership, used in Consul amongst other things. You can check out my blog post on the SWIM gossip protocol if you are interested in diving deeper.
  • CockroachDB: The design document itself is super interesting if you want to know how distributed databases work.
  • Consul: A distributed service-discovery system from Hashicorp.
  • Spark: A scalable batch + streaming computation system developed by the AMPLab at Berkeley.
  • Zookeeper: A service for maintaining configuration information, naming, distributed synchronization, etc. Used by Spark and many other open-source projects to solve some of their distributed-systems problems.
