This is an aggregation of thoughts, observations, talks, blogs, code reviews and many sleepless nights. Obviously not complete or exhaustive.
Edits/Comments welcome!
- Measure the success of the developer experience by time to set up.
- Use the same scripts for CI as for the developer sandbox.
- Each app should include the build scripts used to compile it from scratch in a sandbox.
- Mentality - ask the question: can we open-source or hand off this repo as it stands today?
- Readme.md must be complete and contain 3 main sections
- Project description - short description, dependencies and motivation
- Links to CI system, owner contact (hipchat room), product docs and developer docs
- Build instructions - should be short and simple
- Follow something like `git clone && docker build -t project_name .` or `vagrant up`.
- Have a batteries-included approach.
  - Don't ask me to download mongodb, hadoop, redis-server, gasmask, and configure 15 system parameters, etc.
- Build scripts should produce a runnable binary/Docker image (see the sketch below).
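
A minimal sketch of such a build script, assuming a Dockerized app; the image name and the `./run_tests` entry point are placeholders, not a prescribed layout:

```sh
#!/usr/bin/env sh
# build.sh - hypothetical sketch: build from scratch in a clean sandbox.
# The same script runs in CI and on a developer machine.
set -eu
IMAGE="project_name:$(git rev-parse --short HEAD)"
docker build -t "$IMAGE" .            # produces a runnable image
docker run --rm "$IMAGE" ./run_tests  # smoke-test the artifact we just built
echo "built $IMAGE"
```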
- Reduce pets to cattle
- Deploy scripts should emit a deployment spec - something like a `marathon.json` (see the sketch after this block).
  - Must contain health checks (up/down indicators).
  - Should have up/down controls for individual pods.
  - May include a co-process that renders the configuration for the environment if needed (e.g. ochopod).
    - Helpful when we want to convert pets to cattle.
    - Very helpful when we need configuration or coordination steps before kicking off a process.
- Git-based configuration applied by way of environment variables (may be rendered using the co-process).
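
A sketch of a deploy script emitting a Marathon-style spec with a health check; the app id, image tag, `/health` endpoint, and the `GIT_SHA`/`DB_URL` environment variables are assumptions for illustration:

```sh
#!/usr/bin/env sh
# deploy.sh - sketch: emit a Marathon-style deployment spec with a health check.
# Expects GIT_SHA and DB_URL in the environment (git-based config, above).
set -eu
cat > marathon.json <<EOF
{
  "id": "/project-name",
  "instances": 3,
  "container": {
    "type": "DOCKER",
    "docker": { "image": "project_name:${GIT_SHA}" }
  },
  "env": { "DB_URL": "${DB_URL}" },
  "healthChecks": [
    { "protocol": "HTTP", "path": "/health",
      "gracePeriodSeconds": 30, "intervalSeconds": 10 }
  ]
}
EOF
echo "wrote marathon.json"
```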
- Define invariants and points of coordination, and use known-good ways to handle them.
- Think in terms of Harvest and Yield ([Brewer's talk, page 8](http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf)).
- All alerts must be actionable
- Add a survey to each alert - Was this alert helpful? [y/n] - this is a metric on a metric
- Alert on the presence of failures rather than the absence of success (see the sketch below).
  - This avoids false alarms when traffic is low or when things are being migrated.
- Define good outlier detection / metrics / thresholds.
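
A rough sketch of the "presence of failures" idea; `metrics_query` and `page_oncall` are hypothetical hooks into your metrics store and pager, and the query is Prometheus-style:

```sh
#!/usr/bin/env sh
# sketch: page when failures are present, not when successes are absent.
errors=$(metrics_query 'sum(rate(http_requests_total{status=~"5.."}[5m]))')
# float comparison via awk; any sustained 5xx traffic is actionable
if awk -v e="$errors" 'BEGIN { exit !(e > 0) }'; then
  page_oncall "5xx errors present in the last 5m"
fi
```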
- Forget historical log files (when was the last time you actually went back and crawled through week-old log files?).
- If something is going wrong and producing errors, it will continue to do so.
- Log stream is kept for a few days and then discarded - make it easy to find/filter/pipe.
- fluentd / logstash over kafka for the logs firehose (see the sketch below)
- splunk / elasticsearch / sumologic for "search"
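
A sketch of the find/filter/pipe idea using `kcat` (formerly kafkacat) against a Kafka firehose; the broker address, topic, and log path are assumptions:

```sh
# produce: tail the app's log stream into the firehose
tail -F /var/log/app/current | kcat -P -b broker:9092 -t app-logs

# consume: grep the last few thousand events instead of crawling old files
kcat -C -b broker:9092 -t app-logs -o -5000 -e | grep ERROR
```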
- Must be capable of running in parallel and must not assume any pre-existing setup.
- Total test runs should not take more than a few minutes for unit/component tests and 10s of minutes for integration tests.
- Should install all software required to run the tests (gems, Java, Ruby) into the sandbox, and never assume sudo access (see the sketch below).
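
A sketch for a Ruby project, assuming only a base interpreter is present; everything else installs into the sandbox with no sudo:

```sh
#!/usr/bin/env sh
# test.sh - sketch: self-contained test run, no sudo, no pre-existing setup.
set -eu
export GEM_HOME="$PWD/.sandbox/gems"   # gems install locally, never system-wide
export PATH="$GEM_HOME/bin:$PATH"
gem install bundler --no-document
bundle install
bundle exec rake test                  # unit/component: minutes, not hours
```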
- Run the Simian Army.
- Regular failover tests against production to identify weakness early and figure out metrics to "predict" when things are going south.
- At least Chaos Monkey.
- Try for Latency Monkey, or use `tc` on Linux (see the sketch below).
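
A minimal `tc`/netem sketch for injecting latency on a Linux host (needs root on the target box; the interface name is an assumption):

```sh
# add ~200ms +/- 50ms of latency to all egress traffic on eth0
tc qdisc add dev eth0 root netem delay 200ms 50ms
# ...run the failover / latency test...
# remove the delay
tc qdisc del dev eth0 root netem
```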
- You need service discovery, and an ELB is not enough.
- Service discovery via twitter-server/ServerSets, Netflix/Eureka or coreos/etcd (see the etcd sketch after this block).
- Don't assume all clients are dumb; smart clients can be expected to make use of the service registry directly.
- Eureka
- ServerSets
- Allow for ways to manually mark some resources as down (optional).
- No DNS - slow, cached, just not adequate.
- No /etc/hosts - use DNS or service discovery instead.
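
A sketch of register/discover with etcd v3's `etcdctl`; the key prefix and address are assumptions, and the lease-ID parsing is deliberately naive:

```sh
# register: attach the instance to a 30s lease so dead nodes age out
lease_id=$(etcdctl lease grant 30 | awk '{print $2}')
etcdctl put --lease="$lease_id" /services/search/10.0.1.17:8080 up
etcdctl lease keep-alive "$lease_id" &   # heartbeat while the process lives

# discover: list the live instances of a service
etcdctl get --prefix --keys-only /services/search/
```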
- Take time to understand the VPC / network setup.
- Make tools to manage network configuration and traffic routing.
- Have the ability to isolate resources.
- Have the ability to divert traffic away from resources.
- Use an existing loadbalancer
- HAProxy is a good one - use it.
- Configure it dynamically using the service registry (see the sketch below).
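
A sketch of rendering HAProxy backends from the registry; `registry_list` is a hypothetical command that prints one `host port` pair per live instance:

```sh
# render a backend stanza from the current registry state
{
  echo "backend search"
  echo "  balance roundrobin"
  registry_list search |
    awk '{ printf "  server %s_%s %s:%s check\n", $1, $2, $1, $2 }'
} > /etc/haproxy/search.cfg

# graceful reload: new process takes over, old one drains in-flight requests
haproxy -f /etc/haproxy/haproxy.cfg -f /etc/haproxy/search.cfg \
        -sf "$(cat /var/run/haproxy.pid)"
```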
- No "soft juicy center"
- Perimeter security is not enough.
- If an adversary gets into one app, they shouldn't be able to go anywhere else.
- Don't write your own security tools - lots of edge cases and easy to get wrong.
- Minimize surface area
- Use a minimal Docker container (see the run-time sketch below).
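
A sketch of shrinking the runtime surface area with stock `docker run` flags; the image name is the placeholder from earlier:

```sh
# run with no capabilities, a read-only rootfs, and an unprivileged user
docker run --rm \
  --read-only --tmpfs /tmp \
  --cap-drop=ALL \
  --user 1000:1000 \
  project_name:latest
```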
- Assume everything on the local disk (sandbox) will go away regularly.
- Apps must define their dependencies (a hypothetical manifest sketch follows this list):
- Ports
- Local disk space (only in sandbox)
- Dependencies - networks, system privileges, file systems
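
A hypothetical manifest sketch; the file name and every field are invented for illustration - the point is that dependencies are declared up front, not discovered at runtime:

```sh
# sketch: declare dependencies next to the code (all field names hypothetical)
cat > app-deps.yaml <<EOF
ports: [8080, 9090]
scratch_disk_mb: 512       # sandbox-only; assume it can vanish at any time
networks: [service-mesh]
privileges: none           # no sudo, no host mounts
filesystems: []            # nothing outside the sandbox
EOF
```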
- Have tiers and define them in terms of SLA and the CAP theorem.
- Blob store - S3 / Google Cloud Storage / HDFS (CP)
- Columnar/KV store - Cassandra (AP) / BigTable (CP) / HBase (CP)
- Document storage - (Mongo/Toku) - ?? -
- Replicated log - Kafka / Google Pub/Sub / Kinesis
- RDBMS - use RDS / Google Cloud SQL (ACID transactionality)
- Co-locate compute and data as much as possible.
- Prefer hosted to self-managed solutions unless there is a good reason not to (price, availability, support) - be SaaS-y.
- Go wide not deep
- Define invariants / commutativity ops at the door and stick to them
- Always keep things replicated
- Replication and backups
- Test backup restores and failover regularly (see the drill sketch below).
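
A sketch of a scheduled restore drill for a Postgres-on-RDS setup; the bucket, dump naming, drill database, and table are assumptions - a backup that has never been restored is not a backup:

```sh
#!/usr/bin/env sh
# restore-drill.sh - sketch: prove the latest backup actually restores.
set -eu
aws s3 cp "s3://backups/db/$(date +%F).dump" /tmp/latest.dump
pg_restore --clean --if-exists --dbname=restore_drill /tmp/latest.dump
# sanity check: the restored data is non-empty
psql -d restore_drill -c 'SELECT count(*) FROM users;'
```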
- Expose data platform semantics
- User-managed sharding - it's hard and everyone needs something peculiar.
- Expose regions, availability zones, placement zones to users.
- Expose operational metrics to users.