This is an aggregation of thoughts, observations, talks, blogs, code reviews and many sleepless nights. Obviously not complete or exhaustive.
Edits/Comments welcome!
- Measure the success of the developer experience by time to set up.
- Use the same scripts for CI as for the developer sandbox.
- Each app should include the build scripts used to compile it from scratch in a sandbox.
- Mentality - ask the question: can we open-source or hand off this repo as it stands today?
- Readme.md must be complete and contain 3 main sections
- Project description - short description, dependencies and motivation
- Links to CI system, owner contact (hipchat room), product docs and developer docs
- Build instructions - should be short and simple
- Follow something like `git clone && docker build -t project_name .` or `vagrant up`.
- Have a batteries-included approach.
  - Don't ask me to download mongodb, hadoop, redis-server, gasmask, and configure 15 system parameters, etc.
- Build scripts should produce a runnable binary/Docker image (see the sketch below).
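
A minimal sketch of such a build script, assuming a Dockerized app; the image name and the `./run_tests` entry point are placeholders, not a prescribed layout:

```sh
#!/usr/bin/env sh
# build.sh - hypothetical sketch: build from scratch in a clean sandbox.
# The same script runs in CI and on a developer machine.
set -eu
IMAGE="project_name:$(git rev-parse --short HEAD)"
docker build -t "$IMAGE" .            # produces a runnable image
docker run --rm "$IMAGE" ./run_tests  # smoke-test the artifact we just built
echo "built $IMAGE"
```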
- Reduce pets to cattle
- Deploy scripts should emit a deployment spec - something like a `marathon.json` (see the sketch after this block).
  - Must contain health checks (up/down indicators).
  - Should have up/down controls for individual pods.
  - May include a co-process that renders the configuration for the environment if needed (e.g. ochopod).
    - Helpful when we want to convert pets to cattle.
    - Very helpful when we need configuration or coordination steps before kicking off a process.
- Git-based configuration applied by way of environment variables (may be rendered using the co-process).
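
A sketch of a deploy script emitting a Marathon-style spec with a health check; the app id, image tag, `/health` endpoint, and the `GIT_SHA`/`DB_URL` environment variables are assumptions for illustration:

```sh
#!/usr/bin/env sh
# deploy.sh - sketch: emit a Marathon-style deployment spec with a health check.
# Expects GIT_SHA and DB_URL in the environment (git-based config, above).
set -eu
cat > marathon.json <<EOF
{
  "id": "/project-name",
  "instances": 3,
  "container": {
    "type": "DOCKER",
    "docker": { "image": "project_name:${GIT_SHA}" }
  },
  "env": { "DB_URL": "${DB_URL}" },
  "healthChecks": [
    { "protocol": "HTTP", "path": "/health",
      "gracePeriodSeconds": 30, "intervalSeconds": 10 }
  ]
}
EOF
echo "wrote marathon.json"
```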
- Define invariants and points of coordination, and use known-good ways to handle them.
- Think in terms of Harvest and Yield ([Brewer's talk, page 8](http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf)).
- All alerts must be actionable
- Add a survey to each alert - Was this alert helpful? [y/n] - this is a metric on a metric
- Alert on the presence of failures rather than the absence of success (see the sketch below).
  - This avoids false alarms when traffic is low or when things are being migrated.
- Define good outlier detection / metrics / thresholds.
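
A rough sketch of the "presence of failures" idea; `metrics_query` and `page_oncall` are hypothetical hooks into your metrics store and pager, and the query is Prometheus-style:

```sh
#!/usr/bin/env sh
# sketch: page when failures are present, not when successes are absent.
errors=$(metrics_query 'sum(rate(http_requests_total{status=~"5.."}[5m]))')
# float comparison via awk; any sustained 5xx traffic is actionable
if awk -v e="$errors" 'BEGIN { exit !(e > 0) }'; then
  page_oncall "5xx errors present in the last 5m"
fi
```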
- Forget historical log files (when was the last time you actually went back and crawled through week-old log files?).
- If something is going wrong and producing errors, it will continue to do so.
- Log stream is kept for a few days and then discarded - make it easy to find/filter/pipe.
- fluentd / logstash over kafka for the logs firehose (see the sketch below)
- splunk / elasticsearch / sumologic for "search"
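
A sketch of the find/filter/pipe idea using `kcat` (formerly kafkacat) against a Kafka firehose; the broker address, topic, and log path are assumptions:

```sh
# produce: tail the app's log stream into the firehose
tail -F /var/log/app/current | kcat -P -b broker:9092 -t app-logs

# consume: grep the last few thousand events instead of crawling old files
kcat -C -b broker:9092 -t app-logs -o -5000 -e | grep ERROR
```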
- Must be capable of running in parallel and must not assume any pre-existing setup.
- Total test runs should not take more than a few minutes for unit/component tests and 10s of minutes for integration tests.
- Should install all software required to run the tests (gems, Java, Ruby) into the sandbox, and never assume sudo access (see the sketch below).
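
A sketch for a Ruby project, assuming only a base interpreter is present; everything else installs into the sandbox with no sudo:

```sh
#!/usr/bin/env sh
# test.sh - sketch: self-contained test run, no sudo, no pre-existing setup.
set -eu
export GEM_HOME="$PWD/.sandbox/gems"   # gems install locally, never system-wide
export PATH="$GEM_HOME/bin:$PATH"
gem install bundler --no-document
bundle install
bundle exec rake test                  # unit/component: minutes, not hours
```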
- Run the Simian Army.
- Regular failover tests against production to identify weakness early and figure out metrics to "predict" when things are going south.
- At least Chaos Monkey.
- Try for Latency Monkey, or use `tc` on Linux (see the sketch below).
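
A minimal `tc`/netem sketch for injecting latency on a Linux host (needs root on the target box; the interface name is an assumption):

```sh
# add ~200ms +/- 50ms of latency to all egress traffic on eth0
tc qdisc add dev eth0 root netem delay 200ms 50ms
# ...run the failover / latency test...
# remove the delay
tc qdisc del dev eth0 root netem
```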
- You need service discovery, and an ELB is not enough.
- Service discovery via twitter-server/ServerSets, Netflix/Eureka or coreos/etcd (see the etcd sketch after this block).
- Don't assume all clients are dumb; smart clients can be expected to make use of the service registry directly.
- Eureka
- ServerSets
- Allow for ways to manually mark some resources as down (optional).
- No DNS - slow, cached, just not adequate.
- No /etc/hosts - use DNS or service discovery instead.
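
A sketch of register/discover with etcd v3's `etcdctl`; the key prefix and address are assumptions, and the lease-ID parsing is deliberately naive:

```sh
# register: attach the instance to a 30s lease so dead nodes age out
lease_id=$(etcdctl lease grant 30 | awk '{print $2}')
etcdctl put --lease="$lease_id" /services/search/10.0.1.17:8080 up
etcdctl lease keep-alive "$lease_id" &   # heartbeat while the process lives

# discover: list the live instances of a service
etcdctl get --prefix --keys-only /services/search/
```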
- Take time to understand the VPC / network setup.
- Make tools to manage network configuration and traffic routing.
- Have the ability to isolate resources.
- Have the ability to divert traffic away from resources.
- Use an existing loadbalancer
- HAProxy is a good one - use it.
- Configure it dynamically using the service registry (see the sketch below).
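
A sketch of rendering HAProxy backends from the registry; `registry_list` is a hypothetical command that prints one `host port` pair per live instance:

```sh
# render a backend stanza from the current registry state
{
  echo "backend search"
  echo "  balance roundrobin"
  registry_list search |
    awk '{ printf "  server %s_%s %s:%s check\n", $1, $2, $1, $2 }'
} > /etc/haproxy/search.cfg

# graceful reload: new process takes over, old one drains in-flight requests
haproxy -f /etc/haproxy/haproxy.cfg -f /etc/haproxy/search.cfg \
        -sf "$(cat /var/run/haproxy.pid)"
```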
- No "soft juicy center"
- Perimeter security is not enough.
- If an adversary gets into one app, they shouldn't be able to go anywhere else.
- Don't write your own security tools - lots of edge cases and easy to get wrong.
- Minimize surface area
- Use a minimal Docker container (see the run-time sketch below).
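
A sketch of shrinking the runtime surface area with stock `docker run` flags; the image name is the placeholder from earlier:

```sh
# run with no capabilities, a read-only rootfs, and an unprivileged user
docker run --rm \
  --read-only --tmpfs /tmp \
  --cap-drop=ALL \
  --user 1000:1000 \
  project_name:latest
```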
- Assume everything on the local disk (sandbox) will go away regularly.
- Apps must define their dependencies (a hypothetical manifest sketch follows this list):
- Ports
- Local disk space (only in sandbox)
- Dependencies - networks, system privileges, file systems
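
A hypothetical manifest sketch; the file name and every field are invented for illustration - the point is that dependencies are declared up front, not discovered at runtime:

```sh
# sketch: declare dependencies next to the code (all field names hypothetical)
cat > app-deps.yaml <<EOF
ports: [8080, 9090]
scratch_disk_mb: 512       # sandbox-only; assume it can vanish at any time
networks: [service-mesh]
privileges: none           # no sudo, no host mounts
filesystems: []            # nothing outside the sandbox
EOF
```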
- Have tiers and define them in terms of SLA and the CAP theorem.
- Blob store - S3 / Google Cloud Storage / HDFS (CP)
- Columnar/KV store - Cassandra (AP) / BigTable (CP) / HBase (CP)
- Document storage - (Mongo/Toku) - ?? -
- Replicated log - Kafka / Google Pub/Sub / Kinesis
- RDBMS - use RDS / Google Cloud SQL (ACID transactionality)
- Co-locate compute and data as much as possible.
- Prefer hosted to self-managed solutions unless there is a good reason not to (price, availability, support) - be SaaS-y.
- Go wide not deep
- Define invariants / commutativity ops at the door and stick to them
- Always keep things replicated
- Replication and backups
- Test backup restores and failover regularly (see the drill sketch below).
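
A sketch of a scheduled restore drill for a Postgres-on-RDS setup; the bucket, dump naming, drill database, and table are assumptions - a backup that has never been restored is not a backup:

```sh
#!/usr/bin/env sh
# restore-drill.sh - sketch: prove the latest backup actually restores.
set -eu
aws s3 cp "s3://backups/db/$(date +%F).dump" /tmp/latest.dump
pg_restore --clean --if-exists --dbname=restore_drill /tmp/latest.dump
# sanity check: the restored data is non-empty
psql -d restore_drill -c 'SELECT count(*) FROM users;'
```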
- Expose data platform semantics
- User-managed sharding - it's hard and everyone needs something peculiar.
- Expose regions, availability zones, placement zones to users.
- Expose operational metrics to users.