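The purpose of this document is to gather information to be evaluated by a Launch Readiness Engineer (LRE) when approving the launch of a new service. Please complete the survey prior to meeting with your LRE.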
General Launch Information
- What is the service name?
- When is the launch date/time?
- Is this a soft or hard launch?
Architecture
- Describe the system architecture. Link to architecture documents if possible.
- How does the failover work in the event of single-machine, rack, and datacenter failure? (AZ, Region, AWS)
- How is the system designed to scale under normal conditions?
Capacity
- What is the expected initial volume of users and QPS (queries per second)?
- How was this number arrived at? (Link to load tests and reports.)
- What is expected to happen if the initial volume is 2x expected? 5x? (Link to emergency capacity documents.)
- What is the expected external (internet) bandwidth usage?
- What are the requirements for network and storage after 1, 3, and 12 months? (Link to confirmation documents from the network and storage teams' capacity planners.)
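
One way to prepare answers to the 2x and 5x questions above is a small capacity calculation. The following is a minimal sketch, not a prescribed method: the function names, the per-instance throughput figure, and the 25% headroom are all assumptions chosen for illustration, and real numbers should come from the linked load tests.

```python
import math


def instances_needed(expected_qps: float, per_instance_qps: float,
                     headroom: float = 0.25) -> int:
    """Instances required to serve expected_qps while keeping a fraction
    `headroom` of each instance's measured capacity in reserve.

    per_instance_qps is assumed to come from a load test."""
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable_qps = per_instance_qps * (1.0 - headroom)
    return math.ceil(expected_qps / usable_qps)


def surge_table(expected_qps: float, per_instance_qps: float,
                multipliers=(1, 2, 5), headroom: float = 0.25) -> dict:
    """Map each traffic multiplier to the instance count it would require."""
    return {
        m: instances_needed(expected_qps * m, per_instance_qps, headroom)
        for m in multipliers
    }
```

For example, `surge_table(1000, 200)` estimates the fleet size for 1x, 2x, and 5x of a hypothetical 1,000 QPS launch on instances that each sustained 200 QPS under test. Comparing those counts with what is actually provisioned (or can be provisioned quickly) gives a concrete answer to record in the survey.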
Dependencies
- Which systems does this depend on? (Link to dependency/data flow diagram.)
- Which RPC limits are in place with these dependencies? (Link to limits and confirmation from external groups that they can handle the traffic.)
- What will happen if these RPC limits are exceeded?
- For each dependency, list the ticket number where this new service's use of the dependency (and QPS rate) was requested and positively acknowledged.
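
A service can reduce the chance of exceeding an agreed RPC limit by throttling itself on the client side. The sketch below uses a token bucket, which is one common technique and not something this survey mandates. The class name, the `rate_qps` and `burst` parameters, and the injectable `clock` are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Client-side limiter that admits at most `rate_qps` calls per second
    on average, with bursts of up to `burst` calls."""

    def __init__(self, rate_qps: float, burst: int, clock=time.monotonic):
        self.rate = float(rate_qps)
        self.capacity = float(burst)
        self.tokens = float(burst)   # start with a full bucket
        self.clock = clock           # injectable so tests can control time
        self.last = clock()

    def allow(self) -> bool:
        """Return True if a call may be sent now, False if it should be
        dropped, queued, or retried later."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

What the caller does when `allow()` returns False (shed load, serve a degraded response, or back off and retry) is the behavior the "limits are exceeded" question asks you to document.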
Monitoring
- Are all subsystems monitored? Describe the monitoring strategy and document what is monitored.
- Does a dashboard exist for all major subsystems?
- Do metrics dashboards exist? Are they in business, not technical, terms?
- Was the number of "false alarm" alerts in the last month less than x?
- Is the number of alerts received in a typical week less than x?
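
The two alert-count questions above can be answered from an export of alert history. The following is a minimal sketch under stated assumptions: the `Alert` record, its `actionable` flag (False meaning a false alarm), and the threshold parameters are hypothetical, and "typical week" is approximated here by the most recent 7 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    fired_at: datetime
    actionable: bool  # False marks a "false alarm"


def alert_hygiene(alerts, now, max_false_alarms_month, max_alerts_week):
    """Count false alarms over the last 30 days and all alerts over the
    last 7 days, and compare each count with its threshold (the "x"
    values in the survey)."""
    month_ago = now - timedelta(days=30)
    week_ago = now - timedelta(days=7)
    false_alarms = sum(
        1 for a in alerts
        if month_ago <= a.fired_at <= now and not a.actionable
    )
    weekly = sum(1 for a in alerts if week_ago <= a.fired_at <= now)
    return {
        "false_alarms_last_30d": false_alarms,
        "alerts_last_7d": weekly,
        "false_alarms_ok": false_alarms < max_false_alarms_month,
        "weekly_ok": weekly < max_alerts_week,
    }
```

Both comparisons are strict, matching the "less than x" wording of the questions.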
Documentation
- Does a playbook exist and include entries for all operational tasks and alerts?
- Has a Launch Readiness Engineer (LRE) reviewed each entry for accuracy and completeness?
- Is the number of open documentation-related bugs less than x?
Oncall
- Is the oncall schedule complete for the next n months?
- Is the oncall schedule arranged such that each shift is likely to get fewer than x alerts?
Disaster Preparedness
- What is the plan in case first-day usage is 10 times greater than expected?
- Do backups work and have restores been tested?
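
A restore is only proven when the restored data is compared with the original. The sketch below shows the shape of such a drill for a single file, using a plain file copy as a stand-in for the real backup and restore tooling. The function names and directory layout are assumptions for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_drill(source: Path, backup_dir: Path, restore_dir: Path) -> bool:
    """Back up `source`, restore it to a separate location, and report
    whether the restored copy is byte-identical to the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    restore_dir.mkdir(parents=True, exist_ok=True)
    # Stand-in for the real backup step.
    backup = Path(shutil.copy2(source, backup_dir / source.name))
    # Stand-in for the real restore step; restores from the backup copy,
    # never from the original.
    restored = Path(shutil.copy2(backup, restore_dir / source.name))
    return sha256_of(source) == sha256_of(restored)
```

In a real drill the two copy steps would be replaced by the production backup and restore procedures, and the result and date of the last successful drill would be linked from this survey.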
Operational Hygiene
- Are "spammy alerts" adjusted or corrected in a timely manner?
- Are bugs filed to raise visibility of issues, even minor annoyances or issues with commonly known workarounds?
- Do stability-related bugs take priority over new features?
- Is a system in place to ensure that the number of open bugs is kept low?
Approvals
- Has marketing approved all logos, verbiage, and URL formats?
- Has the security team audited and approved the service?
- Has a privacy audit been completed and all issues remediated?
Source: The Practice of Cloud System Administration Volume 2, pg 157