Ideal ops checklist

In a perfect world, where things are done well and not just quickly, I would expect to find the following when joining a company:

Documentation

Metric collection:

comprehensive system metrics (e.g., cpu, load, mem, disk, network)
application metrics instrumented in code (e.g., queue length, time to post new job) [statsd]
business metrics instrumented in code as well (e.g., registrations) [statsd]
include network devices (e.g., firewalls, load balancers, switches, VPNs, VPCs)
include storage (e.g., NetApp)
include database
include cron jobs
include CD pipeline systems/applications (e.g., Jenkins, Chef, build / test farm)
majority of monitoring from internal systems
also monitor from external systems (e.g., Nimsoft/Watchmouse)
retrieve external monitoring data into internal collection for correlation
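
As a rough illustration of the [statsd] items above: StatsD accepts plain-text lines of the form `name:value|type` over UDP, where the type is `c` for a counter, `g` for a gauge and `ms` for a timer. The sketch below uses only the Python standard library. The metric names are made up for illustration, and the host and port assume a StatsD daemon listening locally on its default port 8125.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one metric in the StatsD text format: name:value|type."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line over UDP (fire-and-forget, no reply expected)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


# Hypothetical metric names, chosen to mirror the examples in the list above.
EXAMPLE_LINES = [
    statsd_line("app.jobs.queue_length", 42, "g"),   # application metric (gauge)
    statsd_line("app.jobs.post_time", 120, "ms"),    # application metric (timer, ms)
    statsd_line("business.registrations", 1, "c"),   # business metric (counter)
]

if __name__ == "__main__":
    for line in EXAMPLE_LINES:
        send_metric(line)
```

Because UDP is fire-and-forget, instrumentation like this adds very little overhead to the application and does not fail if the collector is down. In practice an existing StatsD client library would usually be preferable to hand-rolled sockets.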

Alert system:

Dashboards:

Real-time dashboards of all services
Real-time dashboard of what is being viewed on the site and where traffic is coming from
Dashboards to include event / deploy lines
Anyone can create/share dashboards
No passwords to access dashboards
Key dashboards visible in the office on screen
Dashboard of environments - what's deployed
Cost dashboard (IaaS, SaaS)
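
The "event / deploy lines" item above amounts to overlaying deploy timestamps on metric graphs so that a change in a metric can be tied to the deploy that preceded it. A minimal sketch of that correlation, assuming deploys and metric points are both available as Unix timestamps (the function names and the data are hypothetical, not taken from any particular dashboard tool):

```python
from bisect import bisect_right


def last_deploy_before(deploy_times, t):
    """Return the latest deploy timestamp at or before t, or None.

    deploy_times must be sorted in ascending order.
    """
    i = bisect_right(deploy_times, t)
    return deploy_times[i - 1] if i else None


def annotate(points, deploy_times):
    """Pair each (timestamp, value) point with seconds since the last deploy."""
    deploy_times = sorted(deploy_times)
    annotated = []
    for t, value in points:
        deploy = last_deploy_before(deploy_times, t)
        age = None if deploy is None else t - deploy
        annotated.append((t, value, age))
    return annotated


# Illustrative data: two deploys and three metric samples.
deploys = [100, 200]
samples = [(50, 1.0), (150, 2.0), (200, 3.0)]
print(annotate(samples, deploys))
# [(50, 1.0, None), (150, 2.0, 50), (200, 3.0, 0)]
```

A dashboard would draw the deploy timestamps as vertical lines. The "seconds since last deploy" value is the same information in a form that can also be used when investigating an alert.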

Correlation / Investigation

Infrastructure as Code

Security

Performance testing

Communications

whole company using the same instant messaging / chat system
task/kanban system for giving work to sysadmins / system engineers / infrastructure developers
ops status (e.g., etsystatus.com; stashboard; amazon status)

Deployment

Standards

Process

Meta-metrics

jespada/ideal ops.md