@ravsau
Last active December 4, 2018 03:16
A few points for achieving operational excellence, from the new AWS Well-Architected Tool released at re:Invent 2018

OPS 1. How do you determine what your priorities are?

Everyone needs to understand their part in enabling business success. Agree on shared goals so that you can set priorities for resources. This will maximize the benefits of your efforts.

  • Evaluate external customer needs

  • Evaluate internal customer needs

  • Evaluate compliance requirements

  • Evaluate threat landscape

  • Evaluate tradeoffs

  • Manage benefits and risks

OPS 2. How do you design your workload so that you can understand its state?

Design your workload so that it provides the information necessary for you to understand its internal state (for example, metrics, logs, and traces) across all components. This enables you to provide effective responses when appropriate.

  • Implement application telemetry

  • Implement and configure workload telemetry

  • Implement user activity telemetry

  • Implement dependency telemetry

  • Implement transaction traceability
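
One way to picture application telemetry and transaction traceability together is structured logging with a correlation ID. The sketch below is illustrative only: `trace_id_var`, `JsonFormatter`, `build_logger`, and `handle_request` are hypothetical names, and a real workload would more likely rely on a tracing service than on hand-rolled IDs.

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Hypothetical name: holds the trace ID of the request currently being handled.
trace_id_var = ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object that carries the trace ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
        })


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, order_id):
    """Tag every log line emitted for one request with the same trace ID."""
    trace_id = uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("order received: %s", order_id)
        logger.info("order stored: %s", order_id)
    finally:
        # Restore the previous value so IDs never leak between requests.
        trace_id_var.reset(token)
    return trace_id
```

Because every line for a request shares one `trace_id`, filtering the logs on that field reconstructs the path of a single transaction across components.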

OPS 3. How do you reduce defects, ease remediation, and improve flow into production?

Adopt approaches that improve the flow of changes into production and that enable refactoring, fast feedback on quality, and bug fixing. These accelerate beneficial changes entering production, limit issues deployed, and enable rapid identification and remediation of issues introduced through deployment activities.

  • Use version control

  • Test and validate changes

  • Use configuration management systems

  • Use build and deployment management systems

  • Perform patch management

  • Share design standards

  • Implement practices to improve code quality

  • Use multiple environments

  • Make frequent, small, reversible changes

  • Fully automate integration and deployment

OPS 4. How do you mitigate deployment risks?

Adopt approaches that provide fast feedback on quality and enable rapid recovery from changes that do not have desired outcomes. Using these practices mitigates the impact of issues introduced through the deployment of changes.

  • Plan for unsuccessful changes

  • Test and validate changes

  • Use deployment management systems

  • Test using limited deployments

  • Deploy using parallel environments

  • Deploy frequent, small, reversible changes

  • Fully automate integration and deployment

  • Automate testing and rollback
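
The ideas of planning for unsuccessful changes and automating rollback can be sketched as a small control loop. This is a minimal illustration, not a deployment tool: `deploy_with_rollback`, `activate`, and `is_healthy` are hypothetical names, and the callbacks stand in for whatever your deployment system and health checks actually provide.

```python
from typing import Callable


def deploy_with_rollback(
    current_version: str,
    new_version: str,
    activate: Callable[[str], None],
    is_healthy: Callable[[], bool],
    checks: int = 3,
) -> str:
    """Activate new_version, verify it, and revert on the first failed check.

    Returns the version that is left running.
    """
    activate(new_version)
    for _ in range(checks):
        if not is_healthy():
            # Automated rollback: restore the last known good version.
            activate(current_version)
            return current_version
    return new_version


if __name__ == "__main__":
    history = []
    # Simulate a bad release: the health check always fails.
    running = deploy_with_rollback("v1", "v2", history.append, lambda: False)
    print(running, history)
```

Keeping the previous version as an explicit argument makes the rollback path part of every deployment rather than something improvised during an incident.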

OPS 5. How do you know that you are ready to support a workload?

Evaluate the operational readiness of your workload, processes and procedures, and personnel to understand the operational risks related to your workload.

  • Ensure personnel capability

  • Ensure consistent review of operational readiness

  • Use runbooks to perform procedures

  • Use playbooks to identify issues

  • Make informed decisions to deploy systems and changes

OPS 6. How do you understand the health of your workload?

Define, capture, and analyze workload metrics to gain visibility into workload events so that you can take appropriate action.

  • Identify key performance indicators

  • Define workload metrics

  • Collect and analyze workload metrics

  • Establish workload metrics baselines

  • Learn the expected patterns of activity for the workload

  • Alert when workload outcomes are at risk

  • Alert when workload anomalies are detected

  • Validate the achievement of outcomes and the effectiveness of KPIs and metrics
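
Establishing a baseline and alerting on anomalies can be illustrated with a simple statistical check. The sketch assumes a single numeric metric and uses a standard-deviation threshold as a stand-in for whatever anomaly detection your monitoring system offers; `build_baseline`, `is_anomaly`, and the sample values are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def build_baseline(samples: Sequence[float]) -> Tuple[float, float]:
    """Summarize historical samples as (mean, standard deviation).

    Needs at least two samples, because stdev is undefined for fewer.
    """
    return mean(samples), stdev(samples)


def is_anomaly(value: float, baseline: Tuple[float, float],
               threshold: float = 3.0) -> bool:
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds.
    baseline = build_baseline([100, 102, 98, 101, 99])
    for latency in (101, 120):
        print(latency, "anomaly" if is_anomaly(latency, baseline) else "normal")
```

A fixed threshold is the simplest possible choice; metrics with daily or weekly cycles usually need a baseline per time window instead.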

OPS 7. How do you understand the health of your operations?

Define, capture, and analyze operations metrics to gain visibility into operations events so that you can take appropriate action.

  • Identify key performance indicators

  • Define operations metrics

  • Collect and analyze operations metrics

  • Establish operations metrics baselines

  • Learn the expected patterns of activity for operations

  • Alert when operations outcomes are at risk

  • Alert when operations anomalies are detected

  • Validate the achievement of outcomes and the effectiveness of KPIs and metrics

OPS 8. How do you manage workload and operations events?

Prepare and validate procedures for responding to events to minimize their disruption to your workload.

  • Use processes for event, incident, and problem management

  • Use a process for root cause analysis

  • Have a process per alert

  • Prioritize operational events based on business impact

  • Define escalation paths

  • Enable push notifications

  • Communicate status through dashboards

  • Automate responses to events
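
Two of the practices above, having a process per alert and prioritizing events by business impact, can be sketched as a small triage function. The alert names, runbook paths, and the `Event` and `triage` names are all hypothetical, chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical mapping: every known alert points at the procedure to follow.
RUNBOOKS = {
    "checkout-errors": "runbooks/checkout-errors.md",
    "disk-usage-high": "runbooks/disk-usage-high.md",
}


@dataclass(frozen=True)
class Event:
    alert: str
    business_impact: int  # 1 is the highest impact


def triage(events: Iterable[Event]) -> List[Tuple[str, str]]:
    """Order events by business impact and attach the process for each alert.

    Alerts without a defined process are marked for escalation.
    """
    ordered = sorted(events, key=lambda e: e.business_impact)
    return [
        (e.alert, RUNBOOKS.get(e.alert, "ESCALATE: no runbook defined"))
        for e in ordered
    ]


if __name__ == "__main__":
    queue = [
        Event("disk-usage-high", 3),
        Event("checkout-errors", 1),
        Event("unknown-alert", 2),
    ]
    for alert, action in triage(queue):
        print(alert, "->", action)
```

Treating a missing runbook as an escalation makes gaps in alert coverage visible instead of leaving responders to improvise.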

OPS 9. How do you evolve operations?

Dedicate time and resources for continuous incremental improvement to evolve the effectiveness and efficiency of your operations.

  • Have a process for continuous improvement

  • Implement feedback loops

  • Define drivers for improvement

  • Validate insights

  • Perform operations metrics reviews

  • Document and share lessons learned

  • Allocate time to make improvements
