@jonbrouse
Created September 6, 2018 14:40

Launch Readiness Review Survey

The purpose of this document is to gather information to be evaluated prior to the launch of a new service.

General Launch Information

  • What is the service name?
  • When is the launch date/time?
  • Is this a soft or hard launch?

Architecture

  • Describe the system architecture. Link to architecture documents if possible.
  • How does the failover work in the event of single-machine, rack, and datacenter failure? (AZ, Region, AWS)
  • How is the system designed to scale under normal conditions?

Capacity

  • What is the expected initial volume of users and QPS (queries per second)?
  • How was this number arrived at? (Link to load tests and reports.)
  • What is expected to happen if the initial volume is 2x expected? 5x? (Link to emergency capacity documents.)
  • What is the expected external (internet) bandwidth usage?
  • What are the requirements for network and storage after 1, 3, and 12 months? (Link to confirmation documents from the network and storage teams' capacity planners.)
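
As a rough aid for the 2x and 5x questions above, the sketch below compares provisioned capacity against multiples of the expected load. The function name `capacity_headroom` and all figures are hypothetical placeholders, not values from the survey; substitute the numbers from your own load tests.

```python
def capacity_headroom(provisioned_qps, expected_qps, multipliers=(1, 2, 5)):
    """Return {multiplier: spare QPS}; a negative value means a shortfall."""
    return {m: provisioned_qps - m * expected_qps for m in multipliers}


# Hypothetical figures for illustration only.
headroom = capacity_headroom(provisioned_qps=3000, expected_qps=1000)
for multiplier, spare in headroom.items():
    status = "ok" if spare >= 0 else "shortfall"
    print(f"{multiplier}x expected load: {spare:+d} QPS ({status})")
```

A shortfall at any multiplier is a prompt to link the emergency capacity plan, not necessarily a reason to block the launch.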

Dependencies

  • Which systems does this depend on? (Link to dependency/data flow diagram.)
  • Which RPC limits are in place with these dependencies? (Link to limits and confirmation from external groups they can handle the traffic.)
  • What will happen if these RPC limits are exceeded?
  • For each dependency, list the ticket number where this new service's use of the dependency (and QPS rate) was requested and positively acknowledged.
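
One way to keep the answers above checkable is to record each dependency with its projected QPS, agreed RPC limit, and acknowledgement ticket, then flag gaps automatically. This is a minimal sketch; the `Dependency` record, the service names, and the ticket IDs are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dependency:
    name: str
    projected_qps: float      # traffic this service expects to send
    rpc_limit_qps: float      # limit agreed with the owning team
    ticket: Optional[str]     # ticket where the usage was acknowledged


def dependency_issues(deps):
    """Return human-readable problems; an empty list means nothing was flagged."""
    issues = []
    for d in deps:
        if d.projected_qps > d.rpc_limit_qps:
            issues.append(
                f"{d.name}: projected {d.projected_qps:g} QPS "
                f"exceeds limit {d.rpc_limit_qps:g}"
            )
        if not d.ticket:
            issues.append(f"{d.name}: no acknowledgement ticket recorded")
    return issues


# Hypothetical dependencies and ticket IDs.
deps = [
    Dependency("auth-service", 400, 500, "TICKET-101"),
    Dependency("billing-db", 250, 200, None),
]
issues = dependency_issues(deps)
for issue in issues:
    print(issue)
```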

Monitoring

  • Are all subsystems monitored? Describe the monitoring strategy and document what is monitored.
  • Does a dashboard exist for all major subsystems?
  • Do metrics dashboards exist? Are they in business, not technical, terms?
  • Was the number of "false alarm" alerts in the last month less than x?
  • Is the number of alerts received in a typical week less than x?
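
The two alert-count questions above can be answered from an exported alert log. The sketch below assumes a hypothetical log of `(date, was_actionable)` pairs and uses the busiest ISO week as a conservative stand-in for a "typical week"; the thresholds correspond to the survey's unspecified x and are placeholders.

```python
from collections import Counter
from datetime import date


def alert_summary(alerts, max_false_alarms, max_weekly_alerts):
    """Summarize false alarms and the busiest week against the given limits."""
    false_alarms = sum(1 for _, actionable in alerts if not actionable)
    # Group alerts by ISO (year, week number).
    weekly = Counter(day.isocalendar()[:2] for day, _ in alerts)
    busiest_week = max(weekly.values(), default=0)
    return {
        "false_alarms": false_alarms,
        "busiest_week": busiest_week,
        "passes": false_alarms < max_false_alarms
        and busiest_week < max_weekly_alerts,
    }


# Hypothetical alert log: (date fired, whether the alert was actionable).
alerts = [
    (date(2018, 9, 3), True),
    (date(2018, 9, 4), False),
    (date(2018, 9, 5), True),
    (date(2018, 9, 12), False),
]
summary = alert_summary(alerts, max_false_alarms=3, max_weekly_alerts=5)
print(summary)
```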

Documentation

  • Does a playbook exist and include entries for all operational tasks and alerts?
  • Has a Launch Readiness Engineer (LRE) reviewed each entry for accuracy and completeness?
  • Is the number of open documentation-related bugs less than x?

Oncall

  • Is the oncall schedule complete for the next n months?
  • Is the oncall schedule arranged such that each shift is likely to get fewer than x alerts?

Disaster Preparedness

  • What is the plan in case first-day usage is 10 times greater than expected?
  • Do backups work and have restores been tested?

Operational Hygiene

  • Are "spammy alerts" adjusted or corrected in a timely manner?
  • Are bugs filed to raise visibility of issues, even minor annoyances or issues with commonly known workarounds?
  • Do stability-related bugs take priority over new features?
  • Is a system in place to assure that the number of open bugs is kept low?

Approvals

  • Has marketing approved all logos, verbiage, and URL formats?
  • Has the security team audited and approved the service?
  • Has a privacy audit been completed and all issues remediated?

Source: The Practice of Cloud System Administration Volume 2, pg 157

The purpose of this document is to gather information to be evaluated by a Launch Readiness Engineer (LRE) when approving the launch of a new service. Please complete the survey prior to meeting with your LRE.
