- What do Etcd, Consul, and Zookeeper do?
  - Service Registration:
    - Host, port number, and sometimes authentication credentials, protocols, version
      numbers, and/or environment details.
  - Service Discovery:
    - Ability for client applications to query the central registry to learn of service locations.
  - Consistent and durable general-purpose K/V store across a distributed system.
    - Some solutions support this better than others.
    - Based on a consensus algorithm such as Paxos or Raft to quickly converge to a consistent state.
    - Centralized locking can be built on this K/V store.
  - Leader Election:
    - Not to be confused with leader election within the quorum of Etcd/Consul nodes. That is an
      implementation detail that is transparent to the user. What we are talking about here is leader
      election among the services that are registered against Etcd/Consul (see the sketch after this list).
    - Etcd tabled their leader election module until the API stabilizes.
  - Other non-standard use cases:
    - Distributed locking
    - Atomic broadcast
    - Sequence numbers
    - Pointers to data in eventually consistent stores.
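A minimal sketch of service-side leader election (and, by the same mechanism, distributed
locking) built on the K/V store, in Python with the requests library. This assumes etcd's
v2 HTTP API on 127.0.0.1:4001; the key path and node ID are made up:

    import time
    import requests

    ETCD_KEY = "http://127.0.0.1:4001/v2/keys/leaders/myservice"
    NODE_ID = "node-1"
    TTL = 15  # seconds; leadership lapses if the holder stops renewing

    def try_acquire_leadership():
        # prevExist=false makes the create atomic: only one contender
        # can create the key, so only one becomes leader.
        r = requests.put(ETCD_KEY,
                         data={"value": NODE_ID, "ttl": TTL, "prevExist": "false"})
        return r.status_code == 201

    def renew_leadership():
        # Refresh the TTL, but only if we still hold the key (compare-and-swap).
        r = requests.put(ETCD_KEY,
                         data={"value": NODE_ID, "ttl": TTL, "prevValue": NODE_ID})
        return r.ok

    while True:
        if try_acquire_leadership():
            while renew_leadership():
                time.sleep(TTL / 3)  # we are the leader; keep renewing
        time.sleep(TTL / 3)  # someone else leads; retry later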
- How do they behave in a distributed system?
  - All of the solutions under consideration are primarily CP systems in the CAP context.
    That is, they favor consistency over availability. This means that all nodes have a
    consistent view of written data, but at the expense of availability in the event that
    a network partition occurs (e.g. loss of a node).
  - Some of these solutions will support "stale reads" in the event of node loss (see the
    sketch after this list).
  - Each solution can run with a single node. It is generally advised that we run one etcd/
    consul agent per VM/physical host. We do not want to have an etcd/consul per container!
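For example, Consul exposes this trade-off per request: a default read is forwarded to the
leader, while a "stale" read can be answered by any server even when no leader is reachable.
A minimal sketch (Python + requests), assuming a local agent on 127.0.0.1:8500 and a made-up key:

    import requests

    URL = "http://127.0.0.1:8500/v1/kv/config/myapp/db_host"

    # Default mode: consistent, but fails if the cluster has no leader.
    consistent = requests.get(URL)

    # Stale mode: any server answers from local state; it may lag the leader.
    stale = requests.get(URL, params={"stale": ""})
    print(stale.headers.get("X-Consul-KnownLeader"))  # "false" during leader loss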
- Immediate problems that we are trying to solve:
  - Get and set dynamic configuration across a distributed system (e.g. things in moc.config.json):
    - This is perhaps the most pressing problem that we need to solve (see the sketch after this list).
    - SCM tools like Puppet/Ansible are great for managing static configuration, but
      they are too heavy for dynamic changes.
  - Service registration:
    - We need to be able to spin up a track and have services make themselves visible
      via DNS.
    - This would be useful primarily outside of production, where we would want to regularly
      spin up and destroy tracks.
    - That said, we don't have a highly-distributed and elastic architecture, so we could get
      by without this for a while.
  - Service discovery:
    - Services must be able to determine which host to talk to for a particular service.
    - This may not be as important for production if we have a load balancer. In fact, a
      load balancer would be more transparent to our existing apps, since it works at the IP level.
    - That said, we don't have a highly-distributed and elastic architecture, so we could get
      by without this for a while.
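To make the dynamic-configuration case concrete, here is a minimal sketch (Python + requests)
against Consul's K/V HTTP API, assuming a local agent on 127.0.0.1:8500; the key name is
made up, and etcd's keys API would serve the same purpose:

    import requests

    BASE = "http://127.0.0.1:8500/v1/kv"

    # An operator (or deploy script) flips a value once...
    requests.put(BASE + "/config/myapp/feature_x_enabled", data="true")

    # ...and every instance reads it at startup or on change, with no
    # Puppet/Ansible run required.
    r = requests.get(BASE + "/config/myapp/feature_x_enabled", params={"raw": ""})
    feature_x_enabled = (r.text == "true")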
- Features that we don't need for now:
  - Leader election. Many of our apps are currently not designed to scale horizontally.
    However, it should be noted that Consul has the ability to select a leader based on
    health checks.
- Problems that these tools are not designed to solve:
  - Load-balancing.
- Things that I've explored:
  - Etcd:
    - Basic info:
      - Service registration relies on using a key TTL along with heartbeating from the service
        to ensure the key remains available. If a service fails to update the key's TTL, Etcd
        will expire it. If a service becomes unavailable, clients will need to handle the
        connection failure and try another service instance. (A sketch of this pattern follows
        the Etcd section below.)
      - There would be a compelling reason to favor Etcd if we ever planned to use CoreOS,
        but I don't see this happening anytime soon.
    - Pros:
      - Service discovery involves listing the keys under a directory and then waiting for
        changes on the directory. Since the API is HTTP based, the client application keeps a
        long-polling connection open with the Etcd cluster.
      - Has been around longer than Consul. 150% more GitHub watches/stars.
      - 3 times as many contributors (i.e. more eyes) and forks on GitHub.
    - Cons:
      - There are claims that the Raft implementation used by Etcd (go-raft) is not quite right (unverified).
      - Immature, but by the time its use is under consideration in production, it should
        have reached 1.0.
      - Serving DNS records from Etcd may require a separate service/process (verify):
        - http://probablyfine.co.uk/2014/03/02/serving-dns-records-from-etcd/
        - SkyDNS is essentially DNS on top of Etcd.
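A minimal sketch of the TTL-plus-heartbeat registration pattern described above, in Python
with requests against etcd's v2 HTTP API (assumed on 127.0.0.1:4001; paths and instance
details are made up):

    import json
    import time
    import requests

    KEY = "http://127.0.0.1:4001/v2/keys/services/web/instance-1"
    INFO = json.dumps({"host": "10.0.0.5", "port": 8080})
    TTL = 30

    def heartbeat():
        # Re-PUT the key on every beat; if the service dies, the TTL
        # expires the key and the instance drops out of discovery.
        requests.put(KEY, data={"value": INFO, "ttl": TTL})

    def discover():
        # Discovery is just a directory listing of the live instances.
        r = requests.get("http://127.0.0.1:4001/v2/keys/services/web")
        return [json.loads(n["value"]) for n in r.json()["node"].get("nodes", [])]

    heartbeat()
    print(discover())  # e.g. [{"host": "10.0.0.5", "port": 8080}]
    while True:
        time.sleep(TTL / 3)
        heartbeat()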
  - Consul:
    - Pros:
      - Has more high-level features, like service monitoring.
      - There is another project out of Hashicorp that will read/set environment variables
        for processes from Consul.
        - https://github.com/hashicorp/envconsul
      - Better documentation.
      - I had an easier time installing and configuring this over Etcd, not that Etcd was
        particularly hard. Docs make all the difference.
      - Stuff like this makes me want to shed a tear. I commend the KIDS at Hashicorp.
        - http://www.consul.io/docs/internals/index.html
      - You can make DNS queries directly against the Consul agent! Nice! No need for SkyDNS or Helix.
      - We can add arbitrary health checks! Nice, if we are into that sort of thing (a
        registration sketch with a health check follows this section).
      - Understands the notion of a datacenter. Each cluster is confined to a datacenter, but the
        cluster is able to communicate with other datacenters/clusters.
        - At Skybox, we might use this feature to separate docker tracks, even if they live on the same host.
      - It has a rudimentary web UI:
        - http://demo.consul.io/ui/
      - There are claims that Consul's implementation of Raft is better than Etcd's (unverified).
    - Cons:
      - Immature. Even younger than Etcd (though there is no reason to believe that there are problems with it).
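A minimal sketch of Consul-style registration with a health check (Python + requests,
assuming a recent local agent on 127.0.0.1:8500; the service name and port are made up).
A TTL check is used here: the service must heartbeat, or Consul marks it critical and it
drops out of discovery:

    import requests

    AGENT = "http://127.0.0.1:8500"

    # Register this instance with a TTL health check.
    requests.put(AGENT + "/v1/agent/service/register", json={
        "Name": "web",
        "Port": 8080,
        "Check": {"TTL": "15s"},
    })

    # Heartbeat (a real service would do this periodically, well under 15s).
    requests.put(AGENT + "/v1/agent/check/pass/service:web")

    # Discovery: only instances whose checks are passing are returned.
    r = requests.get(AGENT + "/v1/health/service/web", params={"passing": ""})
    for entry in r.json():
        print(entry["Node"]["Address"], entry["Service"]["Port"])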
- Etcd and Consul similarities:
  - HTTP+JSON based API. Curl-able.
  - Docker containers can talk directly with Etcd/Consul over the docker0 interface (i.e. the default gateway).
  - Atomic look-before-you-set (see the sketch after this list):
    - Etcd: Compare-and-set by both value and version index.
    - Consul: Check-and-set by sequence number (ModifyIndex).
  - DNS TTLs can be set to something VERY low.
    - Etcd: Supports TTL (time-to-live) on both keys and directories, which will be honoured:
      once a key has existed beyond its TTL, it is expired and removed.
    - Consul: By default, serves all DNS results with a 0 TTL value.
  - Has been tested with Jepsen (a tool to simulate network partitions in distributed databases).
    - Results were not 100% for either, but still generally promising.
    - https://news.ycombinator.com/item?id=7884640
  - Both work with Confd by Kelsey Hightower.
    - A tool that watches Etcd/Consul and modifies config files on disk.
    - https://github.com/kelseyhightower/confd
  - Long polling for changes:
    - Etcd: Easily listen for changes to a prefix via HTTP long-polling.
    - Consul: A blocking query against some endpoints will wait for a change to potentially
      take place using long polling.
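A minimal sketch of the check-and-set and long-polling patterns above, using Consul's K/V
API (Python + requests, assuming a local agent on 127.0.0.1:8500 and a made-up existing key;
etcd's v2 API offers the same ideas via prevIndex/prevValue and ?wait=true):

    import base64
    import requests

    KV = "http://127.0.0.1:8500/v1/kv/config/myapp/workers"

    # Check-and-set: read the ModifyIndex, then write only if unchanged since.
    entry = requests.get(KV).json()[0]
    index = entry["ModifyIndex"]
    value = base64.b64decode(entry["Value"])  # Consul base64-encodes values

    r = requests.put(KV, params={"cas": index}, data=b"8")
    print("CAS won" if r.json() else "lost the race")  # body is true/false

    # Blocking query: long-poll until the key changes (or 30s elapses).
    r = requests.get(KV, params={"index": index, "wait": "30s"})
    print(r.headers["X-Consul-Index"])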
- Things that I have not explored:
  - SkyDNS: Anyone have good input on this one?
  - Zookeeper: It seems mature, but it would take a lot more work to make it work for us.
    - We would have to configure and use it without high-level features.
    - Provides only a primitive K/V store.
    - Requires that application developers build their own system to provide service discovery.
    - Java dependency (and Dan Streit hates Java).
    - All clients must maintain active connections to the ZooKeeper servers and perform keep-alives.
    - Zookeeper not recommended for virtual environments? Why? I just read this somewhere.
  - Corosync/Pacemaker (not sure if this is a viable solution, actually).
  - Redis is not viable. While it can persist data via RDB snapshots and AOF, it is not a
    consensus-based CP store, so it doesn't fit this role. Nope.
  - Smartstack (Synapse + Nerve) from Airbnb (not viable as it only does TCP through HAProxy).
    - Ruby dependencies and many moving parts.
- References:
  http://www.hashicorp.com/blog/twelve-factor-consul.html
  http://12factor.net/ (Heroku's excellent 12-factor thing)
  http://www.consul.io/intro/vs/index.html
  http://www.consul.io/docs/internals/index.html
  https://news.ycombinator.com/item?id=7604787
  https://news.ycombinator.com/item?id=7623317
  https://news.ycombinator.com/item?id=7884640
  http://www.activestate.com/blog/2014/03/brandon-philips-explains-etcd
  http://jpmens.net/2013/10/24/a-key-value-store-for-shared-configuration-etcd-confd/
  http://igor.moomers.org/smartstack-vs-consul/
  http://jasonwilder.com/blog/2014/02/04/service-discovery-in-the-cloud/
  http://nerds.airbnb.com/smartstack-service-discovery-cloud/