Skip to content

Instantly share code, notes, and snippets.

@felixfyi
Forked from bhenerey/ideal ops.md
Created June 26, 2012 14:55
Show Gist options
  • Save felixfyi/2996257 to your computer and use it in GitHub Desktop.
Save felixfyi/2996257 to your computer and use it in GitHub Desktop.
ideal ops checklist

In a perfect world, where things are done well, not just quickly, I would expect to find the following when joining the company:

Documentation

  • Accurate / up-to-date systems architecture diagram

  • Accurate / up-to-date network diagram

  • Out-of-hours support plan

  • Incident management plan

  • Change management plan

  • Application documentation

Metric collection:

  • comprehensive system metrics (eg. cpu, load, mem, disk, network, etc)

  • application metrics instrumented in code (eg. queue length, time to post new job) [statsd]

  • business metrics instrumented in code as well (eg. registrations) [statsd]

  • include network devices (eg. firewall, loadbalancers, switches, vpns, vpc)

  • include storage (eg. netapp)

  • include database

  • include cron jobs

  • include CD pipeline systems/applications (e.g., jenkins, chef, build / test farm)

  • majority of monitoring from internal systems

  • also monitor from external systems (e.g., Nimsoft/Watchmouse)

  • retrieve external monitoring data into internal collection for correlation

Alert system:

  • alert off data collected (passive)

  • alert on checks (active)

  • call-out on important alerts

  • email, irc/chat, sms, mobile escalation

  • call-out rotation, escalation plans

Dashboards:

  • Real-time dashboards of all services

  • Real-time dashboard of what is being viewed on the site, where traffic is coming from

  • Dashboards to include event / deploy lines

  • Anyone can create/share dashboards

  • No passwords to access dashboards

  • Key dashboards visible in the office on screen

  • Dashboard of environments - what's deployed

  • Cost dashboard (IaaS, SaaS)

Correlation / Investigation

  • Graphing system which allows ad-hoc metric correlation (eg. Graphite)

  • Centralized logging with search (eg. Logstash, Greylog)

  • Record of everything that has changed, by whom, when, and what the change was

  • Access to all relevant systems

Infrastructure as Code

  • Infrastructure DB with API (Chef server)
  • All infra changes tracked, done via configuration management

Security

  • Automated view of what needs to be patched/updated
  • Regular vulnerability scans with recorded history
  • ssh-key as only authentication
  • segregated environments (dev, test, prod)
  • data anonymisation for performance testing

Performance testing

  • Prod-like environment to test in
  • Good performance test, with assumptions and approximations documented
  • Record of all previous test results
  • Automated running of test
  • Automated comparison of test results with previous tests

Communications

  • whole company using the same instant messaging / chat system
  • task/kanbansystem for giving work to systems engineers / infrastructure developers
  • ops twitter
  • ops status (eg. etsystatus.com; stashboard; amazon status)

Deployment

  • single-click deploy
  • rollback-able
  • performed by developer
  • dashboard/KPI used to validate release
  • zero-down time
  • dark-launches
  • feature flags can be turned on/off via webui

Standards

  • Published standards of web systems requirements

Process

  • Light-weight post-mortem process, blame-free
  • Daily operations review
  • Monthly/quarterly architecture summit
  • Daily stand-ups
  • Iteration planning/review
  • Regular capacity planning /cost optimization

Meta-metrics

  • MTTD
  • MTTR
  • Availability
  • Service degradation (Slow versus broken; features disabled to protect site)
  • CD Pipeline Availability
  • Release tracking (type, success/failure, success rate, length of incident)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment