@ankurcha
Last active January 19, 2016 12:38
[ Project Drywall ] Because software needs more than just columns and beams

This is an aggregation of thoughts, observations, talks, blogs, code reviews and many sleepless nights. Obviously not complete or exhaustive.

Edits/Comments welcome!

Developer Production

  • Measure success of developer experience by time to setup.
  • Same scripts for CI as for developer sandbox.
  • Each app should include the build scripts that are used to compile it from scratch in a sandbox.
  • Mentality - ask the question: can we open-source/hand off this repo as it stands today?
  • Readme.md must be complete and contain 3 main sections:
    • Project description - short description, dependencies and motivation
    • Links to CI system, owner contact (hipchat room), product docs and developer docs
    • Build instructions - should be short and simple
      • Follow something like: git clone && docker build -t project_name . or vagrant up
  • Have a batteries included approach.
    • Don't ask me to download mongodb, hadoop, redis-server, gasmask and configure 15 system parameters etc.

Continuous deployment

  • New build scripts that produce runnable binary/docker image.
  • Reduce pets to cattle
  • Deploy scripts that emit a deployment spec - something like a marathon.json.
  • Must contain health checks (up/down indicators).
  • Should have up/down controls for individual pod.
  • ochopod: may include a co-process that renders the configuration for the environment if needed.
    • Helpful when we want to convert pets to cattle.
    • Very helpful when we want to do some configuration or coordination step before kicking off a process.
  • Git based configuration that is applied by way of environment variables (may be rendered using the co-process).
  • Define invariants and points of coordination, and use known good ways to handle them.
  • Think in terms of Harvest and Yield (Brewer's talk, page 8: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf).
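
A deploy script that emits a deployment spec might look roughly like the sketch below. It is only an illustration: the field names follow the general shape of a Marathon app definition as I understand it and should be checked against the Marathon version in use, and the function name, the `APP_` variable prefix, the `/health` path and the resource numbers are all assumptions made for the example.

```python
import json
import os


def build_deployment_spec(app_id, image, instances, environ=None):
    """Return a dict shaped like a Marathon app definition (illustrative only)."""
    environ = os.environ if environ is None else environ
    # Configuration is passed through environment variables.
    # The APP_ prefix is a hypothetical convention, not a Marathon requirement.
    env = {key: value for key, value in environ.items() if key.startswith("APP_")}
    return {
        "id": app_id,
        "instances": instances,
        "cpus": 0.5,   # placeholder resource numbers
        "mem": 256,
        "container": {"type": "DOCKER", "docker": {"image": image}},
        "env": env,
        # Up/down indicator: the scheduler polls this endpoint.
        "healthChecks": [
            {
                "protocol": "HTTP",
                "path": "/health",  # assumed endpoint name
                "gracePeriodSeconds": 30,
                "intervalSeconds": 10,
                "timeoutSeconds": 5,
                "maxConsecutiveFailures": 3,
            }
        ],
    }


if __name__ == "__main__":
    spec = build_deployment_spec(
        "/drywall/api", "example/api:1.0", 3,
        environ={"APP_LOG_LEVEL": "info", "HOME": "/root"},
    )
    print(json.dumps(spec, indent=2, sort_keys=True))
```

Keeping the spec as plain data makes it easy to commit the rendered output next to the git-based configuration and diff it between deploys.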

Monitoring and Alerting

  • All alerts must be actionable
  • Add a survey to each alert - Was this alert helpful? [y/n] - this is a metric on a metric
  • Alert on presence of failures rather than absence of success.
  • This takes care of cases where we are simply under low traffic or migrating things.
  • Define good outlier detection / metrics / thresholds.
  • Forget historical log files (when was the last time you actually went back and crawled through log files more than a week old?)
    • If something is going wrong and producing errors, it will continue to do so.
    • Log stream is kept for a few days and then discarded - make it easy to find/filter/pipe.
    • fluentd / over kafka / logstash for logs firehose
    • splunk / elasticsearch / sumologic for "search"
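
The difference between alerting on the absence of success and on the presence of failures can be sketched as below. The function names and thresholds are made up for the example, and the outlier check uses median absolute deviation as one possible technique, not something this list prescribes.

```python
from statistics import median


def absence_of_success_alert(success_count, min_success=100):
    """Fires whenever traffic is low, even if nothing is broken."""
    return success_count < min_success


def presence_of_failure_alert(error_count, max_errors=5):
    """Fires only when errors are actually observed."""
    return error_count > max_errors


def is_outlier(value, history, k=3.0):
    """Flag value if it lies more than k median-absolute-deviations from the median."""
    if len(history) < 5:
        return False  # too little history to judge
    center = median(history)
    mad = median([abs(x - center) for x in history])
    if mad == 0:
        return value != center
    return abs(value - center) / mad > k


# Quiet period: 3 successes, 0 errors. Only the success-based rule pages someone.
quiet = (absence_of_success_alert(3), presence_of_failure_alert(0))
# Real incident: plenty of traffic, 40 errors. Only the failure-based rule fires.
incident = (absence_of_success_alert(500), presence_of_failure_alert(40))
```

During a migration or a quiet night the success-based rule pages someone for nothing, which is exactly the non-actionable alert the list argues against.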

Automated tests

  • Must be capable of running in parallel and must not assume any pre-existing setup.
  • Total test runs should not take more than a few minutes for unit/component tests and tens of minutes for integration tests.
  • Should install all required software to run the tests (gems, java, ruby) in sandbox + never assume sudo access.
  • Run simian army
  • Regular failover tests against production to identify weakness early and figure out metrics to "predict" when things are going south.
  • At least chaos monkey.
  • Try for latency monkey or use tc (on linux).
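
One way to keep tests parallel-safe and free of pre-existing setup is to give every test its own scratch directory. The sketch below uses only the Python standard library; `write_report` is a hypothetical function standing in for whatever is under test.

```python
import tempfile
import unittest
from pathlib import Path


def write_report(directory, name, body):
    """Hypothetical function under test: writes a file and returns its path."""
    path = Path(directory) / name
    path.write_text(body)
    return path


class ReportTest(unittest.TestCase):
    def setUp(self):
        # A private temporary directory per test: no shared state, no sudo,
        # and nothing left behind, so tests can run in any order or in parallel.
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        self.workdir = self._tmp.name

    def test_writes_body(self):
        path = write_report(self.workdir, "a.txt", "hello")
        self.assertEqual(path.read_text(), "hello")

    def test_starts_empty(self):
        # Passes regardless of what other tests have written elsewhere.
        self.assertEqual(list(Path(self.workdir).iterdir()), [])
```

Saved as a module, this runs with `python -m unittest`, the same command in CI and in the developer sandbox.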

Service discovery and routing

  • You need it and an ELB is not enough.
  • service discovery via twitter-server/ServerSets, Netflix/Eureka or coreos/etcd
  • Assume that not all clients are dumb; some can be expected to make use of the service registry directly.
    • Eureka
    • ServerSets
  • Allow for ways to manually downtime some resources (optional)
  • No DNS - slow, cached, and just not adequate
  • No /etc/hosts - use DNS or use service discovery
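
To make the registry idea concrete, here is a toy in-memory registry with heartbeats, a time-to-live and manual downtime. It is a stand-in for illustration only and does not reflect the actual APIs of Eureka, ServerSets or etcd; the class and method names are invented.

```python
import time


class ServiceRegistry:
    """Toy in-memory service registry (illustration only)."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_seen = {}     # (service, address) -> time of last heartbeat
        self._downtimed = set()  # addresses manually taken out of rotation

    def heartbeat(self, service, address):
        """Instances call this periodically to stay registered."""
        self._last_seen[(service, address)] = self._clock()

    def set_downtime(self, address, down=True):
        """Manually hide or restore an address without stopping the instance."""
        if down:
            self._downtimed.add(address)
        else:
            self._downtimed.discard(address)

    def lookup(self, service):
        """Return live, non-downtimed addresses for a service, sorted."""
        now = self._clock()
        return sorted(
            address
            for (name, address), seen in self._last_seen.items()
            if name == service
            and now - seen <= self._ttl
            and address not in self._downtimed
        )
```

A smart client would call `lookup("api")` before each request or on a short refresh interval; instances that stop sending heartbeats drop out after the TTL without anyone editing a hosts file.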

Network and Traffic configuration management

  • Take time and understand VPC / network setup
  • Make tools to manage network configuration and traffic routing.
    • Have the ability to isolate resources.
    • Have the ability to divert traffic away from resources.
  • Use an existing loadbalancer
    • HAProxy is a good one - Use it.
    • Configure dynamically using service registry.
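
Configuring the load balancer from the service registry can be as simple as rendering a backend section from the current address list. The sketch below emits plain HAProxy configuration text; the function name and the naming scheme for servers are assumptions, and reloading HAProxy after writing the file is left out.

```python
def render_backend(service, addresses, drained=()):
    """Render an HAProxy `backend` section from a list of host:port addresses.

    Addresses listed in `drained` are kept in the config but marked
    `disabled`, which diverts traffic away from them.
    """
    lines = [f"backend {service}", "    balance roundrobin"]
    for index, address in enumerate(sorted(addresses)):
        line = f"    server {service}-{index} {address} check"
        if address in drained:
            line += " disabled"
        lines.append(line)
    return "\n".join(lines) + "\n"


config = render_backend(
    "api",
    ["10.0.0.2:8080", "10.0.0.1:8080"],
    drained={"10.0.0.2:8080"},
)
```

Sorting the addresses keeps the rendered file stable between runs, so a reload is only needed when the output actually changes.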

Security

  • No "soft juicy center"
  • Perimeter security is not enough.
  • If an adversary gets into an app, they can't go anywhere else.
  • Don't write your own security tools - lots of edge cases and easy to get wrong.
  • Minimize surface area
    • Use minimal docker container.
    • Assume everything in the local disk (sandbox) will go away regularly.
    • Apps must define their dependencies
      • Ports
      • Local disk space (only in sandbox)
      • Dependencies - networks, system privileges, file systems

Data platform

  • Have tiers and define them in terms of SLA, CAP theorem.
  • Blob store as S3 / google cloud storage / hdfs (CP)
  • Columnar/KV store - Cassandra (AP) / BigTable (CP) / HBase (CP)
  • Document storage - (Mongo/Toku) - ?? -
  • Replicated log (Kafka) / google pubsub / kinesis
  • RDBMS - Use RDS / google cloudsql (ACID Transactionality)
  • Co-locate compute and data as much as possible.
  • Prefer hosted to self-managed solutions unless there is a good reason not to (price, availability, support) (be SaaS-y)
  • Go wide not deep
  • Define invariants / commutativity ops at the door and stick to them
  • Always keep things replicated
    • Replication and backups
    • Test backup restores and failover regularly
  • Expose data platform semantics
    • User managed sharding - it's hard and everyone needs something peculiar.
    • Expose regions, availability zones, placement zones to users.
    • Expose operational metrics to users.
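
As a small illustration of the "commutativity ops" point above, the sketch below merges per-replica counters by taking the maximum per replica, so replicas can exchange state in any order and still converge. This grow-only counter is an example I chose to show the property, not something the list prescribes; the replica names are made up.

```python
def merge(a, b):
    """Merge two per-replica counters by taking the maximum seen for each replica."""
    return {
        replica: max(a.get(replica, 0), b.get(replica, 0))
        for replica in a.keys() | b.keys()
    }


def total(counter):
    """The counter's value is the sum over all replicas."""
    return sum(counter.values())


east = {"us-east": 3}
west = {"us-east": 1, "us-west": 5}

# Order of merging does not matter, and merging twice changes nothing.
merged = merge(east, west)
same_order_independent = merged == merge(west, east)
same_when_repeated = merge(merged, west) == merged
```

Because the merge is commutative and idempotent, no coordination is needed between replicas, which is what makes the operation safe to declare "at the door".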